2025-07-08

Title: Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models

Authors: Chaozhuo Li, Pengbo Wang, Chenxu Wang, Litian Zhang, Zheng Liu, Qiwei Ye, Yuanbo Xu, Feiran Huang, Xi Zhang, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02870
Pdf URL: https://arxiv.org/pdf/2507.02870
Copy Paste: [[2507.02870]] Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models(https://arxiv.org/abs/2507.02870)
Keywords: large language model
Abstract: Edgar Allan Poe noted, "Truth often lurks in the shadow of error," highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as 'hallucinations'. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging this http URL our research, we have methodically categorized, analyzed the causes, detection methods, and solutions related to LLM hallucinations. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.

Title: Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras

Authors: Miao Fan, Quanxin Zheng, Shengtong Xu, Linghe Kong, Haoyi Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02899
Pdf URL: https://arxiv.org/pdf/2507.02899
Copy Paste: [[2507.02899]] Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras(https://arxiv.org/abs/2507.02899)
Keywords: robust, extraction
Abstract: Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules--such as separate feature extraction and Bird's-Eye View (BEV) conversion steps--thus reducing both computational overhead and error propagation. Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.

Title: Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions

Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Subjects: cs.CV, cs.AI, cs.GR, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02900
Pdf URL: https://arxiv.org/pdf/2507.02900
Copy Paste: [[2507.02900]] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions(https://arxiv.org/abs/2507.02900)
Keywords: diffusion
Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D--based, 3D--based, Neural Radiance Fields (NeRF)--based, diffusion--based, parameter-driven techniques and many other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre--trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre--trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: this https URL.

Title: Controllable diffusion-based generation for multi-channel biological data

Authors: Haoran Zhang, Mingyuan Zhou, Wesley Tansey
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2507.02902
Pdf URL: https://arxiv.org/pdf/2507.02902
Copy Paste: [[2507.02902]] Controllable diffusion-based generation for multi-channel biological data(https://arxiv.org/abs/2507.02902)
Keywords: diffusion, generative
Abstract: Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high-dimensional, multi-channel data with strong spatial alignment and complex inter-channel relationships. Generative modeling of such data requires jointly capturing intra- and inter-channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion-based models generally assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter-channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned channels, and (2) a combination of latent-space and output-space channel-wise attention to capture inter-channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state-of-the-art performance across both spatial and non-spatial prediction tasks, including protein imputation in IMC and gene-to-protein prediction in single-cell datasets, and show strong generalization to unseen conditional configurations.

Title: Harnessing Near-Infrared Spectroscopy and Machine Learning for Traceable Classification of Hanwoo and Holstein Beef

Authors: AMM Nurul Alam, Abdul Samad, AMM Shamsul Alam, Jahan Ara Monti, Ayesha Muazzam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02903
Pdf URL: https://arxiv.org/pdf/2507.02903
Copy Paste: [[2507.02903]] Harnessing Near-Infrared Spectroscopy and Machine Learning for Traceable Classification of Hanwoo and Holstein Beef(https://arxiv.org/abs/2507.02903)
Keywords: robust
Abstract: This study evaluates the use of Near-Infrared spectroscopy (NIRS) combined with advanced machine learning (ML) techniques to differentiate Hanwoo beef (HNB) and Holstein beef (HLB) to address food authenticity, mislabeling, and adulteration. Rapid and non-invasive spectral data were attained by a portable NIRS, recording absorbance data within the wavelength range of 700 to 1100 nm. A total of 40 Longissimus lumborum samples, evenly split between HNB and HLB, were obtained from a local hypermarket. Data analysis using Principal Component Analysis (PCA) demonstrated distinct spectral patterns associated with chemical changes, clearly separating the two beef varieties and accounting for 93.72% of the total variance. ML models, including Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest, Gradient Boosting (GB), K-Nearest Neighbors, Decision Tree (DT), Naive Bayes (NB), and Neural Networks (NN), were implemented, optimized through hyperparameter tuning, and validated by 5-fold cross-validation techniques to enhance model robustness and prevent overfitting. Random Forest provided the highest predictive accuracy with a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 0.8826, closely followed by the SVM model at 0.8747. Furthermore, GB and NN algorithms exhibited satisfactory performances, with cross-validation scores of 0.752. Notably, the NN model achieved the highest recall rate of 0.7804, highlighting its suitability in scenarios requiring heightened sensitivity. DT and NB exhibited comparatively lower predictive performance. The LR and SVM models emerged as optimal choices by effectively balancing high accuracy, precision, and recall. This study confirms that integrating NIRS with ML techniques offers a powerful and reliable method for meat authenticity, significantly contributing to detecting food fraud.

Title: Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis

Authors: Charlton Teo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02904
Pdf URL: https://arxiv.org/pdf/2507.02904
Copy Paste: [[2507.02904]] Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis(https://arxiv.org/abs/2507.02904)
Keywords: large language model
Abstract: The use of Large Language Models (LLMs) in recent years has also given rise to the development of Multimodal LLMs (MLLMs). These new MLLMs allow us to process images, videos and even audio alongside textual inputs. In this project, we aim to assess the effectiveness of MLLMs in analysing sports videos, focusing mainly on tennis videos. Despite research done on tennis analysis, there remains a gap in models that are able to understand and identify the sequence of events in a tennis rally, which would be useful in other fields of sports analytics. As such, we will mainly assess the MLLMs on their ability to fill this gap - to classify tennis actions, as well as their ability to identify these actions in a sequence of tennis actions in a rally. We further looked into ways we can improve the MLLMs' performance, including different training methods and even using them together with other traditional models.

Title: Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles

Authors: Jia Wei Chen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02906
Pdf URL: https://arxiv.org/pdf/2507.02906
Copy Paste: [[2507.02906]] Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles(https://arxiv.org/abs/2507.02906)
Keywords: robust
Abstract: We present a comprehensive video-based analytics framework for tennis doubles that addresses the lack of automated analysis tools for this strategically complex sport. Our approach introduces a standardised annotation methodology encompassing player positioning, shot types, court formations, and match outcomes, coupled with a specialised annotation tool designed to meet the unique requirements of tennis video labelling. The framework integrates advanced machine learning techniques including GroundingDINO for precise player localisation through natural language grounding and YOLO-Pose for robust pose estimation. This combination significantly reduces manual annotation effort whilst improving data consistency and quality. We evaluate our approach on doubles tennis match data and demonstrate that CNN-based models with transfer learning substantially outperform pose-based methods for predicting shot types, player positioning, and formations. The CNN models effectively capture complex visual and contextual features essential for doubles tennis analysis. Our integrated system bridges advanced analytical capabilities with the strategic complexities of tennis doubles, providing a foundation for automated tactical analysis, performance evaluation, and strategic modelling in professional tennis.

Title: Scaling Transformers for Time Series Forecasting: Do Pretrained Large Models Outperform Small-Scale Alternatives?

Authors: Sanjay Chakraborty, Ibrahim Delibasoglu, Fredrik Heintz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02907
Pdf URL: https://arxiv.org/pdf/2507.02907
Copy Paste: [[2507.02907]] Scaling Transformers for Time Series Forecasting: Do Pretrained Large Models Outperform Small-Scale Alternatives?(https://arxiv.org/abs/2507.02907)
Keywords: interpretability, transformer
Abstract: Large pre-trained models have demonstrated remarkable capabilities across domains, but their effectiveness in time series forecasting remains understudied. This work empirically examines whether pre-trained large-scale time series models (LSTSMs) trained on diverse datasets can outperform traditional non-pretrained small-scale transformers in forecasting tasks. We analyze state-of-the-art (SOTA) pre-trained universal time series models (e.g., Moirai, TimeGPT) alongside conventional transformers, evaluating accuracy, computational efficiency, and interpretability across multiple benchmarks. Our findings reveal the strengths and limitations of pre-trained LSTSMs, providing insights into their suitability for time series tasks compared to task-specific small-scale architectures. The results highlight scenarios where pretraining offers advantages and where simpler models remain competitive.

Title: Hyperbolic Kernel Graph Neural Networks for Neurocognitive Decline Analysis from Multimodal Brain Imaging

Authors: Meimei Yang, Yongheng Sun, Qianqian Wang, Andrea Bozoki, Maureen Kohi, Mingxia Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02908
Pdf URL: https://arxiv.org/pdf/2507.02908
Copy Paste: [[2507.02908]] Hyperbolic Kernel Graph Neural Networks for Neurocognitive Decline Analysis from Multimodal Brain Imaging(https://arxiv.org/abs/2507.02908)
Keywords: diffusion
Abstract: Multimodal neuroimages, such as diffusion tensor imaging (DTI) and resting-state functional MRI (fMRI), offer complementary perspectives on brain activities by capturing structural or functional interactions among brain regions. While existing studies suggest that fusing these multimodal data helps detect abnormal brain activity caused by neurocognitive decline, they are generally implemented in Euclidean space and can't effectively capture intrinsic hierarchical organization of structural/functional brain networks. This paper presents a hyperbolic kernel graph fusion (HKGF) framework for neurocognitive decline analysis with multimodal neuroimages. It consists of a multimodal graph construction module, a graph representation learning module that encodes brain graphs in hyperbolic space through a family of hyperbolic kernel graph neural networks (HKGNNs), a cross-modality coupling module that enables effective multimodal data fusion, and a hyperbolic neural network for downstream predictions. Notably, HKGNNs represent graphs in hyperbolic space to capture both local and global dependencies among brain regions while preserving the hierarchical structure of brain networks. Extensive experiments involving over 4,000 subjects with DTI and/or fMRI data suggest the superiority of HKGF over state-of-the-art methods in two neurocognitive decline prediction tasks. HKGF is a general framework for multimodal data analysis, facilitating objective quantification of structural/functional brain connectivity changes associated with neurocognitive decline.

Title: Efficient Certified Reasoning for Binarized Neural Networks

Authors: Jiong Yang, Yong Kiam Tan, Mate Soos, Magnus O. Myreen, Kuldeep S. Meel
Subjects: cs.LG, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2507.02916
Pdf URL: https://arxiv.org/pdf/2507.02916
Copy Paste: [[2507.02916]] Efficient Certified Reasoning for Binarized Neural Networks(https://arxiv.org/abs/2507.02916)
Keywords: robust
Abstract: Neural networks have emerged as essential components in safety-critical applications -- these use cases demand complex, yet trustworthy computations. Binarized Neural Networks (BNNs) are a type of neural network where each neuron is constrained to a Boolean value; they are particularly well-suited for safety-critical tasks because they retain much of the computational capacities of full-scale (floating-point or quantized) deep neural networks, but remain compatible with satisfiability solvers for qualitative verification and with model counters for quantitative reasoning. However, existing methods for BNN analysis suffer from either limited scalability or susceptibility to soundness errors, which hinders their applicability in real-world scenarios. In this work, we present a scalable and trustworthy approach for both qualitative and quantitative verification of BNNs. Our approach introduces a native representation of BNN constraints in a custom-designed solver for qualitative reasoning, and in an approximate model counter for quantitative reasoning. We further develop specialized proof generation and checking pipelines with native support for BNN constraint reasoning, ensuring trustworthiness for all of our verification results. Empirical evaluations on a BNN robustness verification benchmark suite demonstrate that our certified solving approach achieves a $9\times$ speedup over prior certified CNF and PB-based approaches, and our certified counting approach achieves a $218\times$ speedup over the existing CNF-based baseline. In terms of coverage, our pipeline produces fully certified results for $99\%$ and $86\%$ of the qualitative and quantitative reasoning queries on BNNs, respectively. This is in sharp contrast to the best existing baselines which can fully certify only $62\%$ and $4\%$ of the queries, respectively.

Title: Echo State Transformer: When chaos brings memory

Authors: Yannis Bendi-Ouis (Mnemosyne), Xavier Hinaut (Mnemosyne)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02917
Pdf URL: https://arxiv.org/pdf/2507.02917
Copy Paste: [[2507.02917]] Echo State Transformer: When chaos brings memory(https://arxiv.org/abs/2507.02917)
Keywords: transformer, large language model
Abstract: While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language and working memory. Furthermore, sequential data processing with Transformers encounters a fundamental barrier: quadratic complexity growth with sequence length. Motivated by these limitations, our ambition is to create more efficient models that are less reliant on intensive computations and massive volumes of data. We introduce Echo State Transformers (EST), a hybrid architecture that elegantly resolves this challenge while demonstrating exceptional performance in low-data regimes. EST integrates the Transformer attention mechanisms with principles from Reservoir Computing to create a fixedsize window distributed memory system. Drawing inspiration from Echo State Networks, the most prominent instance of the Reservoir Computing paradigm, our architecture integrates a new module called ''Working Memory'' based on several reservoirs (i.e. random recurrent networks) working in parallel. These reservoirs work as independent memory units with distinct internal dynamics. A novelty here is that the classical reservoir hyperparameters controlling the dynamics are now trained. Thus, the EST dynamically adapts the memory/non-linearity trade-off in reservoirs. By maintaining a fixed number of memory units regardless of sequence length, EST achieves constant computational complexity at each processing step, effectively breaking the quadratic scaling problem of standard Transformers. Evaluations on the STREAM benchmark, which comprises 12 diverse sequential processing tasks, demonstrate that EST outperforms GRUs, LSTMs, and even Transformers on 8 of these tasks. These findings highlight that Echo State Transformers can be an effective replacement to GRUs and LSTMs while complementing standard Transformers at least on resource-constrained environments and low-data scenarios across diverse sequential processing tasks.

Title: ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models

Authors: Dai Li, Linzhuo Li, Huilian Sophie Qiu
Subjects: cs.CL, cs.CY, cs.ET
Abstract URL: https://arxiv.org/abs/2507.02919
Pdf URL: https://arxiv.org/pdf/2507.02919
Copy Paste: [[2507.02919]] ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models(https://arxiv.org/abs/2507.02919)
Keywords: large language model
Abstract: Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn't hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an "accuracy-optimization hypothesis," suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbots AI, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.

Title: Domain Knowledge in Artificial Intelligence: Using Conceptual Modeling to Increase Machine Learning Accuracy and Explainability

Authors: V.C. Storey, J. Parsons, A. Castellanos, M. Tremblay, R. Lukyanenko, W. Maass, A. Castillo
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2507.02922
Pdf URL: https://arxiv.org/pdf/2507.02922
Copy Paste: [[2507.02922]] Domain Knowledge in Artificial Intelligence: Using Conceptual Modeling to Increase Machine Learning Accuracy and Explainability(https://arxiv.org/abs/2507.02922)
Keywords: extraction, explainability
Abstract: Machine learning enables the extraction of useful information from large, diverse datasets. However, despite many successful applications, machine learning continues to suffer from performance and transparency issues. These challenges can be partially attributed to the limited use of domain knowledge by machine learning models. This research proposes using the domain knowledge represented in conceptual models to improve the preparation of the data used to train machine learning models. We develop and demonstrate a method, called the Conceptual Modeling for Machine Learning (CMML), which is comprised of guidelines for data preparation in machine learning and based on conceptual modeling constructs and principles. To assess the impact of CMML on machine learning outcomes, we first applied it to two real-world problems to evaluate its impact on model performance. We then solicited an assessment by data scientists on the applicability of the method. These results demonstrate the value of CMML for improving machine learning outcomes.

Title: Modeling Urban Food Insecurity with Google Street View Images

Authors: David Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02924
Pdf URL: https://arxiv.org/pdf/2507.02924
Copy Paste: [[2507.02924]] Modeling Urban Food Insecurity with Google Street View Images(https://arxiv.org/abs/2507.02924)
Keywords: security, extraction
Abstract: Food insecurity is a significant social and public health issue that plagues many urban metropolitan areas around the world. Existing approaches to identifying food insecurity rely primarily on qualitative and quantitative survey data, which is difficult to scale. This project seeks to explore the effectiveness of using street-level images in modeling food insecurity at the census tract level. To do so, we propose a two-step process of feature extraction and gated attention for image aggregation. We evaluate the effectiveness of our model by comparing against other model architectures, interpreting our learned weights, and performing a case study. While our model falls slightly short in terms of its predictive power, we believe our approach still has the potential to supplement existing methods of identifying food insecurity for urban planners and policymakers.

Title: Large Language Model Agent for Modular Task Execution in Drug Discovery

Authors: Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani
Subjects: cs.LG, cs.CL, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.02925
Pdf URL: https://arxiv.org/pdf/2507.02925
Copy Paste: [[2507.02925]] Large Language Model Agent for Modular Task Execution in Drug Discovery(https://arxiv.org/abs/2507.02925)
Keywords: large language model
Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information-including FASTA sequences, SMILES representations, and literature-and answered mechanistic questions with improved contextual accuracy over standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55, and those passing at least four out of five empirical drug-likeness rules rose from 29 to 52, within a pool of 194 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.

Title: A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations

Authors: Phurich Saengthong, Boonnithi Jiaramaneepinit, Sheng Li, Manabu Okumura, Takahiro Shinozaki
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.02927
Pdf URL: https://arxiv.org/pdf/2507.02927
Copy Paste: [[2507.02927]] A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations(https://arxiv.org/abs/2507.02927)
Keywords: large language model, segmentation
Abstract: Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87\% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.

Title: Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models

Authors: Hao Yang, Haoxuan Li, Luyu Chen, Haoxiang Wang, Xu Chen, Mingming Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02928
Pdf URL: https://arxiv.org/pdf/2507.02928
Copy Paste: [[2507.02928]] Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models(https://arxiv.org/abs/2507.02928)
Keywords: robust, large language model
Abstract: Hidden confounding remains a central challenge in estimating treatment effects from observational data, as unobserved variables can lead to biased causal estimates. While recent work has explored the use of large language models (LLMs) for causal inference, most approaches still rely on the unconfoundedness assumption. In this paper, we make the first attempt to mitigate hidden confounding using LLMs. We propose ProCI (Progressive Confounder Imputation), a framework that elicits the semantic and world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders. ProCI leverages two key capabilities of LLMs: their strong semantic reasoning ability, which enables the discovery of plausible confounders from both structured and unstructured inputs, and their embedded world knowledge, which supports counterfactual reasoning under latent confounding. To improve robustness, ProCI adopts a distributional reasoning strategy instead of direct value imputation to prevent the collapsed outputs. Extensive experiments demonstrate that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across various datasets and LLMs.

Title: MolProphecy: Bridging Medicinal Chemists' Knowledge and Molecular Pre-Trained Models via a Multi-Modal Framework

Authors: Jianping Zhao, Qiong Zhou, Tian Wang, Yusi Fan, Qian Yang, Li Jiao, Chang Liu, Zhehao Guo, Qi Lu, Fengfeng Zhou, Ruochi Zhang
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2507.02932
Pdf URL: https://arxiv.org/pdf/2507.02932
Copy Paste: [[2507.02932]] MolProphecy: Bridging Medicinal Chemists' Knowledge and Molecular Pre-Trained Models via a Multi-Modal Framework(https://arxiv.org/abs/2507.02932)
Keywords: interpretability, large language model
Abstract: MolProphecy is a human-in-the-loop (HITL) multi-modal framework designed to integrate chemists' domain knowledge into molecular property prediction models. While molecular pre-trained models have enabled significant gains in predictive accuracy, they often fail to capture the tacit, interpretive reasoning central to expert-driven molecular design. To address this, MolProphecy employs ChatGPT as a virtual chemist to simulate expert-level reasoning and decision-making. The generated chemist knowledge is embedded by the large language model (LLM) as a dedicated knowledge representation and then fused with graph-based molecular features through a gated cross-attention mechanism, enabling joint reasoning over human-derived and structural features. Evaluated on four benchmark datasets (FreeSolv, BACE, SIDER, and ClinTox), MolProphecy outperforms state-of-the-art (SOTA) models, achieving a 15.0 percent reduction in RMSE on FreeSolv and a 5.39 percent improvement in AUROC on BACE. Analysis reveals that chemist knowledge and structural features provide complementary contributions, improving both accuracy and interpretability. MolProphecy offers a practical and generalizable approach for collaborative drug discovery, with the flexibility to incorporate real chemist input in place of the current simulated proxy--without the need for model retraining. The implementation is publicly available at this https URL.

Title: Theory of Mind in Action: The Instruction Inference Task

Authors: Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2507.02935
Pdf URL: https://arxiv.org/pdf/2507.02935
Copy Paste: [[2507.02935]] Theory of Mind in Action: The Instruction Inference Task(https://arxiv.org/abs/2507.02935)
Keywords: large language model
Abstract: The Theory of Mind (ToM) refers to an agent's capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). One, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.

Title: FoGE: Fock Space inspired encoding for graph prompting

Authors: Sotirios Panagiotis Chytas, Rudrasis Chakraborty, Vikas Singh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02937
Pdf URL: https://arxiv.org/pdf/2507.02937
Copy Paste: [[2507.02937]] FoGE: Fock Space inspired encoding for graph prompting(https://arxiv.org/abs/2507.02937)
Keywords: transformer, large language model
Abstract: Recent results show that modern Large Language Models (LLM) are indeed capable of understanding and answering questions about structured data such as graphs. This new paradigm can lead to solutions that require less supervision while, at the same time, providing a model that can generalize and answer questions beyond the training labels. Existing proposals often use some description of the graph to create an ``augmented'' prompt fed to the LLM. For a chosen class of graphs, if a well-tailored graph encoder is deployed to play together with a pre-trained LLM, the model can answer graph-related questions well. Existing solutions to graph-based prompts range from graph serialization to graph transformers. In this work, we show that the use of a parameter-free graph encoder based on Fock space representations, a concept borrowed from mathematical physics, is remarkably versatile in this problem setting. The simple construction, inherited directly from the theory with a few small adjustments, can provide rich and informative graph encodings, for a wide range of different graphs. We investigate the use of this idea for prefix-tuned prompts leveraging the capabilities of a pre-trained, frozen LLM. The modifications lead to a model that can answer graph-related questions -- from simple graphs to proteins to hypergraphs -- effectively and with minimal, if any, adjustments to the architecture. Our work significantly simplifies existing solutions and generalizes well to multiple different graph-based structures effortlessly.

Title: A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis

Authors: Jiachen Liu, Ziheng Geng, Ran Cao, Lu Cheng, Paolo Bocchini, Minghui Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02938
Pdf URL: https://arxiv.org/pdf/2507.02938
Copy Paste: [[2507.02938]] A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis(https://arxiv.org/abs/2507.02938)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored. This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams. Reliability is assessed through the accuracy of correct outputs under repetitive runs of the same problems, whereas robustness is evaluated via the performance across varying load and boundary conditions. A benchmark dataset, comprising eight beam analysis problems, is created to test the Llama-3.3 70B Instruct model. Results show that, despite a qualitative understanding of structural mechanics, the LLM lacks the quantitative reliability and robustness for engineering applications. To address these limitations, a shift is proposed that reframes the structural analysis as code generation tasks. Accordingly, an LLM-empowered agent is developed that (a) integrates chain-of-thought and few-shot prompting to generate accurate OpeeSeesPy code, and (b) automatically executes the code to produce structural analysis results. Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions. Ablation studies highlight the complete example and function usage examples as the primary contributors to the agent's enhanced performance.

Title: Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

Authors: Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.02939
Pdf URL: https://arxiv.org/pdf/2507.02939
Copy Paste: [[2507.02939]] Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting(https://arxiv.org/abs/2507.02939)
Keywords: transformer
Abstract: Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher's latent space, including both high and low frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and in MAE 52.3% on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our codes are available at this https URL

Title: Towards a Comparative Framework for Compositional AI Models

Authors: Tiffany Duneau
Subjects: cs.CL, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2507.02940
Pdf URL: https://arxiv.org/pdf/2507.02940
Copy Paste: [[2507.02940]] Towards a Comparative Framework for Compositional AI Models(https://arxiv.org/abs/2507.02940)
Keywords: interpretability
Abstract: The DisCoCirc framework for natural language processing allows the construction of compositional models of text, by combining units for individual words together according to the grammatical structure of the text. The compositional nature of a model can give rise to two things: compositional generalisation -- the ability of a model to generalise outside its training distribution by learning compositional rules underpinning the entire data distribution -- and compositional interpretability -- making sense of how the model works by inspecting its modular components in isolation, as well as the processes through which these components are combined. We present these notions in a framework-agnostic way using the language of category theory, and adapt a series of tests for compositional generalisation to this setting. Applying this to the DisCoCirc framework, we consider how well a selection of models can learn to compositionally generalise. We compare both quantum circuit based models, as well as classical neural networks, on a dataset derived from one of the bAbI tasks, extended to test a series of aspects of compositionality. Both architectures score within 5% of one another on the productivity and substitutivity tasks, but differ by at least 10% for the systematicity task, and exhibit different trends on the overgeneralisation tasks. Overall, we find the neural models are more prone to overfitting the Train data. Additionally, we demonstrate how to interpret a compositional model on one of the trained models. By considering how the model components interact with one another, we explain how the model behaves.

Title: GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation

Authors: Yi-Chun Chen, Arnav Jhala
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02941
Pdf URL: https://arxiv.org/pdf/2507.02941
Copy Paste: [[2507.02941]] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation(https://arxiv.org/abs/2507.02941)
Keywords: generative, large language model
Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from this http URL under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.

Title: Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention

Authors: Haitz Sáez de Ocáriz Borde
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02944
Pdf URL: https://arxiv.org/pdf/2507.02944
Copy Paste: [[2507.02944]] Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention(https://arxiv.org/abs/2507.02944)
Keywords: transformer, large language model
Abstract: Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters, on sequence manipulation tasks and empirically verify the predicted effects.

Title: Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding

Authors: Chenglin Li, Qianglong Chen, fengtao, Yin Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02946
Pdf URL: https://arxiv.org/pdf/2507.02946
Copy Paste: [[2507.02946]] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding(https://arxiv.org/abs/2507.02946)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search, a training-free framework that enables MLLMs to explore temporal regions for improved long video understanding iteratively. TS is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of length, and feeds them into the model to produce a refined response and confidence score. TS refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.

Title: The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products

Authors: Linyan Zou
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2507.02947
Pdf URL: https://arxiv.org/pdf/2507.02947
Copy Paste: [[2507.02947]] The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products(https://arxiv.org/abs/2507.02947)
Keywords: large language model
Abstract: Major depressive disorder represents one of the most significant global health challenges of the 21st century, affecting millions of people worldwide and creating substantial economic and social burdens. While conventional antidepressant therapies have provided relief for many individuals, their limitations including delayed onset of action, significant side effects, and treatment resistance in a substantial portion of patients have prompted researchers and healthcare providers to explore alternative therapeutic approaches (Kasneci et al.). African traditional medicine, with its rich heritage of plant-based remedies developed over millennia, offers a valuable resource for developing novel antidepressant treatments that may address some of these limitations. This paper examines the integration of large language models with African natural products for depression support, combining traditional knowledge with modern artificial intelligence technology to create accessible, evidence-based mental health support systems. The research presented here encompasses a comprehensive analysis of African medicinal plants with documented antidepressant properties, their pharmacological mechanisms, and the development of an AI-powered support system that leverages DeepSeek's advanced language model capabilities. The system provides evidence-based information about African herbal medicines, their clinical applications, safety considerations, and therapeutic protocols while maintaining scientific rigor and appropriate safety standards. Our findings demonstrate the potential for large language models to serve as bridges between traditional knowledge and modern healthcare, offering personalized, culturally appropriate depression support that honors both traditional wisdom and contemporary medical understanding.

Title: RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence

Authors: Vipula Rawte, Rajarshi Roy, Gurpreet Singh, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Abhay Gupta, Argha Kamal Samanta, Aditya Shingote, Aadi Krishna Vikram, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02949
Pdf URL: https://arxiv.org/pdf/2507.02949
Copy Paste: [[2507.02949]] RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence(https://arxiv.org/abs/2507.02949)
Keywords: large language model
Abstract: As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.

Title: Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Authors: Keita Kiuchi, Yoshikazu Fujimoto, Hideyuki Goto, Tomonori Hosokawa, Makoto Nishimura, Yosuke Sato, Izumi Sezai
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.02950
Pdf URL: https://arxiv.org/pdf/2507.02950
Copy Paste: [[2507.02950]] Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria(https://arxiv.org/abs/2507.02950)
Keywords: large language model
Abstract: This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zeroshot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zeroshot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

Title: Bittensor Protocol: The Bitcoin in Decentralized Artificial Intelligence? A Critical and Empirical Analysis

Authors: Elizabeth Lui, Jiahao Sun
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02951
Pdf URL: https://arxiv.org/pdf/2507.02951
Copy Paste: [[2507.02951]] Bittensor Protocol: The Bitcoin in Decentralized Artificial Intelligence? A Critical and Empirical Analysis(https://arxiv.org/abs/2507.02951)
Keywords: security, attack, robust
Abstract: This paper investigates whether Bittensor can be considered the Bitcoin of decentralized Artificial Intelligence by directly comparing its tokenomics, decentralization properties, consensus mechanism, and incentive structure against those of Bitcoin. Leveraging on-chain data from all 64 active Bittensor subnets, we first document considerable concentration in both stake and rewards. We further show that rewards are overwhelmingly driven by stake, highlighting a clear misalignment between quality and compensation. As a remedy, we put forward a series of two-pronged protocol-level interventions. For incentive realignment, our proposed solutions include performance-weighted emission split, composite scoring, and a trust-bonus multiplier. As for mitigating security vulnerability due to stake concentration, we propose and empirically validate stake cap at the 88th percentile, which elevates the median coalition size required for a 51-percent attack and remains robust across daily, weekly, and monthly snapshots.

Title: Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Authors: Pranam Shetty, Abhisek Upadhayaya, Parth Mitesh Shah, Srikanth Jagabathula, Shilpi Nayak, Anna Joo Fee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02954
Pdf URL: https://arxiv.org/pdf/2507.02954
Copy Paste: [[2507.02954]] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III(https://arxiv.org/abs/2507.02954)
Keywords: large language model
Abstract: As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.

Title: A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks

Authors: Blake Bullwinkel, Mark Russinovich, Ahmed Salem, Santiago Zanella-Beguelin, Daniel Jones, Giorgio Severi, Eugenia Kim, Keegan Hines, Amanda Minnich, Yonatan Zunger, Ram Shankar Siva Kumar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02956
Pdf URL: https://arxiv.org/pdf/2507.02956
Copy Paste: [[2507.02956]] A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks(https://arxiv.org/abs/2507.02956)
Keywords: secure, defense, attack
Abstract: Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.

Title: CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning

Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02957
Pdf URL: https://arxiv.org/pdf/2507.02957
Copy Paste: [[2507.02957]] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning(https://arxiv.org/abs/2507.02957)
Keywords: transformer
Abstract: Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource efficient solution for next generation multimodal transformers.

Title: Real-World En Call Center Transcripts Dataset with PII Redaction

Authors: Ha Dao, Gaurav Chawla, Raghu Banda, Caleb DeLeeuw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02958
Pdf URL: https://arxiv.org/pdf/2507.02958
Copy Paste: [[2507.02958]] Real-World En Call Center Transcripts Dataset with PII Redaction(https://arxiv.org/abs/2507.02958)
Keywords: privacy, protect, biometric
Abstract: We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to-date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States. The dataset includes high-quality, PII-redacted human-readable transcriptions. All personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.

Title: A Novel Active Learning Approach to Label One Million Unknown Malware Variants

Authors: Ahmed Bensaoud, Jugal Kalita
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02959
Pdf URL: https://arxiv.org/pdf/2507.02959
Copy Paste: [[2507.02959]] A Novel Active Learning Approach to Label One Million Unknown Malware Variants(https://arxiv.org/abs/2507.02959)
Keywords: robust, transformer
Abstract: Active learning for classification seeks to reduce the cost of labeling samples by finding unlabeled examples about which the current model is least certain and sending them to an annotator/expert to label. Bayesian theory can provide a probabilistic view of deep neural network models by asserting a prior distribution over model parameters and estimating the uncertainties by posterior distribution over these parameters. This paper proposes two novel active learning approaches to label one million malware examples belonging to different unknown modern malware families. The first model is Inception-V4+PCA combined with several support vector machine (SVM) algorithms (UTSVM, PSVM, SVM-GSU, TBSVM). The second model is Vision Transformer based Bayesian Neural Networks ViT-BNN. Our proposed ViT-BNN is a state-of-the-art active learning approach that differs from current methods and can apply to any particular task. The experiments demonstrate that the ViT-BNN is more stable and robust in handling uncertainty.

Title: RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Authors: Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.02962
Pdf URL: https://arxiv.org/pdf/2507.02962
Copy Paste: [[2507.02962]] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism(https://arxiv.org/abs/2507.02962)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models' search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model's capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.

Title: VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO

Authors: Hengyi Zhu, Linye Wei, He Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.02963
Pdf URL: https://arxiv.org/pdf/2507.02963
Copy Paste: [[2507.02963]] VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO(https://arxiv.org/abs/2507.02963)
Keywords: robust
Abstract: The integration of large-scale circuits and systems emphasizes the importance of automated defect detection of electronic components. The YOLO image detection model has been used to detect PCB defects and it has become a typical AI-assisted case of traditional industrial production. However, conventional detection algorithms have stringent requirements for the angle, orientation, and clarity of target images. In this paper, we propose an enhanced PCB defect detection algorithm, named VR-YOLO, based on the YOLOv8 model. This algorithm aims to improve the model's generalization performance and enhance viewpoint robustness in practical application scenarios. We first propose a diversified scene enhancement (DSE) method by expanding the PCB defect dataset by incorporating diverse scenarios and segmenting samples to improve target diversity. A novel key object focus (KOF) scheme is then presented by considering angular loss and introducing an additional attention mechanism to enhance fine-grained learning of small target features. Experimental results demonstrate that our improved PCB defect detection approach achieves a mean average precision (mAP) of 98.9% for the original test images, and 94.7% for the test images with viewpoint shifts (horizontal and vertical shear coefficients of $\pm 0.06$ and rotation angle of $\pm 10$ degrees), showing significant improvements compared to the baseline YOLO model with negligible additional computational cost.

Title: Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Authors: Salahuddin Salahuddin, Ahmed Hussain, Jussi Löppönen, Toni Jutila, Panos Papadimitratos
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02964
Pdf URL: https://arxiv.org/pdf/2507.02964
Copy Paste: [[2507.02964]] Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens(https://arxiv.org/abs/2507.02964)
Keywords: security, large language model
Abstract: While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack specialized domain knowledge for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus from standards, academic literature, and various other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation across three cybersecurity benchmarks, namely, CTI-MCQ, CyberMetric, and SecEval, demonstrates consistent improvements post-adaptation. The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, outperforming specialized models, including Llama-Primus-Base. Notably, competitive performance was achieved using substantially smaller datasets (118.8 million versus 2.77 billion tokens), demonstrating efficient domain specialization viability. We establish that targeted continuous pretraining enables effective cybersecurity domain adaptation with computational feasibility, providing foundations for specialized AI assistants in threat analysis, vulnerability assessment, and security documentation while challenging prevailing assumptions about data requirements for LLM specialization.

Title: Concept-based Adversarial Attack: a Probabilistic Perspective

Authors: Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02965
Pdf URL: https://arxiv.org/pdf/2507.02965
Copy Paste: [[2507.02965]] Concept-based Adversarial Attack: a Probabilistic Perspective(https://arxiv.org/abs/2507.02965)
Keywords: attack, generative
Abstract: We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept -- represented by a probabilistic generative model or a set of images -- to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.

Title: PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition

Authors: Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02966
Pdf URL: https://arxiv.org/pdf/2507.02966
Copy Paste: [[2507.02966]] PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition(https://arxiv.org/abs/2507.02966)
Keywords: privacy, protect, large language model
Abstract: The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PB-LLMs). Note that the proposed PB-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.

Title: YOLO-Based Pipeline Monitoring in Challenging Visual Environments

Authors: Pragya Dhungana, Matteo Fresta, Niraj Tamrakar, Hariom Dhungana
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02967
Pdf URL: https://arxiv.org/pdf/2507.02967
Copy Paste: [[2507.02967]] YOLO-Based Pipeline Monitoring in Challenging Visual Environments(https://arxiv.org/abs/2507.02967)
Keywords: robust, segmentation
Abstract: Condition monitoring subsea pipelines in low-visibility underwater environments poses significant challenges due to turbidity, light distortion, and image degradation. Traditional visual-based inspection systems often fail to provide reliable data for mapping, object recognition, or defect detection in such conditions. This study explores the integration of advanced artificial intelligence (AI) techniques to enhance image quality, detect pipeline structures, and support autonomous fault diagnosis. This study conducts a comparative analysis of two most robust versions of YOLOv8 and Yolov11 and their three variants tailored for image segmentation tasks in complex and low-visibility subsea environments. Using pipeline inspection datasets captured beneath the seabed, it evaluates model performance in accurately delineating target structures under challenging visual conditions. The results indicated that YOLOv11 outperformed YOLOv8 in overall performance.

Title: Unveiling Privacy Policy Complexity: An Exploratory Study Using Graph Mining, Machine Learning, and Natural Language Processing

Authors: Vijayalakshmi Ramasamy, Seth Barrett, Gokila Dorai, Jessica Zumbach
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02968
Pdf URL: https://arxiv.org/pdf/2507.02968
Copy Paste: [[2507.02968]] Unveiling Privacy Policy Complexity: An Exploratory Study Using Graph Mining, Machine Learning, and Natural Language Processing(https://arxiv.org/abs/2507.02968)
Keywords: privacy, interpretability
Abstract: Privacy policy documents are often lengthy, complex, and difficult for non-expert users to interpret, leading to a lack of transparency regarding the collection, processing, and sharing of personal data. As concerns over online privacy grow, it is essential to develop automated tools capable of analyzing privacy policies and identifying potential risks. In this study, we explore the potential of interactive graph visualizations to enhance user understanding of privacy policies by representing policy terms as structured graph models. This approach makes complex relationships more accessible and enables users to make informed decisions about their personal data (RQ1). We also employ graph mining algorithms to identify key themes, such as User Activity and Device Information, using dimensionality reduction techniques like t-SNE and PCA to assess clustering effectiveness. Our findings reveal that graph-based clustering improves policy content interpretability. It highlights patterns in user tracking and data sharing, which supports forensic investigations and identifies regulatory non-compliance. This research advances AI-driven tools for auditing privacy policies by integrating interactive visualizations with graph mining. Enhanced transparency fosters accountability and trust.

Title: Reinforcement Learning for Automated Cybersecurity Penetration Testing

Authors: Daniel López-Montero, José L. Álvarez-Aldana, Alicia Morales-Martínez, Marta Gil-López, Juan M. Auñón García
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02969
Pdf URL: https://arxiv.org/pdf/2507.02969
Copy Paste: [[2507.02969]] Reinforcement Learning for Automated Cybersecurity Penetration Testing(https://arxiv.org/abs/2507.02969)
Keywords: security
Abstract: This paper aims to provide an innovative machine learning-based solution to automate security testing tasks for web applications, ensuring the correct functioning of all components while reducing project maintenance costs. Reinforcement Learning is proposed to select and prioritize tools and optimize the testing path. The presented approach utilizes a simulated webpage along with its network topology to train the agent. Additionally, the model leverages Geometric Deep Learning to create priors that reduce the search space and improve learning convergence. The validation and testing process was conducted on real-world vulnerable web pages commonly used by human hackers for learning. As a result of this study, a reinforcement learning algorithm was developed that maximizes the number of vulnerabilities found while minimizing the number of steps required

Title: Aim High, Stay Private: Differentially Private Synthetic Data Enables Public Release of Behavioral Health Information with High Utility

Authors: Mohsen Ghasemizade, Juniper Lovato, Christopher M. Danforth, Peter Sheridan Dodds, Laura S. P. Bloomfield, Matthew Price, Team LEMURS, Joseph P. Near
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2507.02971
Pdf URL: https://arxiv.org/pdf/2507.02971
Copy Paste: [[2507.02971]] Aim High, Stay Private: Differentially Private Synthetic Data Enables Public Release of Behavioral Health Information with High Utility(https://arxiv.org/abs/2507.02971)
Keywords: privacy, protect, attack
Abstract: Sharing health and behavioral data raises significant privacy concerns, as conventional de-identification methods are susceptible to privacy attacks. Differential Privacy (DP) provides formal guarantees against re-identification risks, but practical implementation necessitates balancing privacy protection and the utility of data. We demonstrate the use of DP to protect individuals in a real behavioral health study, while making the data publicly available and retaining high utility for downstream users of the data. We use the Adaptive Iterative Mechanism (AIM) to generate DP synthetic data for Phase 1 of the Lived Experiences Measured Using Rings Study (LEMURS). The LEMURS dataset comprises physiological measurements from wearable devices (Oura rings) and self-reported survey data from first-year college students. We evaluate the synthetic datasets across a range of privacy budgets, epsilon = 1 to 100, focusing on the trade-off between privacy and utility. We evaluate the utility of the synthetic data using a framework informed by actual uses of the LEMURS dataset. Our evaluation identifies the trade-off between privacy and utility across synthetic datasets generated with different privacy budgets. We find that synthetic data sets with epsilon = 5 preserve adequate predictive utility while significantly mitigating privacy risks. Our methodology establishes a reproducible framework for evaluating the practical impacts of epsilon on generating private synthetic datasets with numerous attributes and records, contributing to informed decision-making in data sharing practices.

Title: Farm-Level, In-Season Crop Identification for India

Authors: Ishan Deshpande, Amandeep Kaur Reehal, Chandan Nath, Renu Singh, Aayush Patel, Aishwarya Jayagopal, Gaurav Singh, Gaurav Aggarwal, Amit Agarwal, Prathmesh Bele, Sridhar Reddy, Tanya Warrier, Kinjal Singh, Ashish Tendulkar, Luis Pazos Outon, Nikita Saxena, Agata Dondzik, Dinesh Tewari, Shruti Garg, Avneet Singh, Harsh Dhand, Vaibhav Rajan, Alok Talekar
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02972
Pdf URL: https://arxiv.org/pdf/2507.02972
Copy Paste: [[2507.02972]] Farm-Level, In-Season Crop Identification for India(https://arxiv.org/abs/2507.02972)
Keywords: security, robust
Abstract: Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restricted crop type coverage, the complexities of mixed-pixel and heterogeneous landscapes, and crucially, the robust in-season identification essential for proactive decision-making. We present a framework designed to address the critical data gaps for targeted data driven decision making which generates farm-level, in-season, multi-crop identification at national scale (India) using deep learning. Our methodology leverages the strengths of Sentinel-1 and Sentinel-2 satellite imagery, integrated with national-scale farm boundary data. The model successfully identifies 12 major crops (which collectively account for nearly 90% of India's total cultivated area showing an agreement with national crop census 2023-24 of 94% in winter, and 75% in monsoon season). Our approach incorporates an automated season detection algorithm, which estimates crop sowing and harvest periods. This allows for reliable crop identification as early as two months into the growing season and facilitates rigorous in-season performance evaluation. Furthermore, we have engineered a highly scalable inference pipeline, culminating in what is, to our knowledge, the first pan-India, in-season, farm-level crop type data product. The system's effectiveness and scalability are demonstrated through robust validation against national agricultural statistics, showcasing its potential to deliver actionable, data-driven insights for transformative agricultural monitoring and management across India.

Title: InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy

Authors: Vishnu Vinod, Krishna Pillutla, Abhradeep Guha Thakurta
Subjects: cs.LG, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2507.02974
Pdf URL: https://arxiv.org/pdf/2507.02974
Copy Paste: [[2507.02974]] InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy(https://arxiv.org/abs/2507.02974)
Keywords: privacy
Abstract: As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM's next-token-distribution as the exponential mechanism over the LLM logits with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than $10\times$ the computation cost of non-private generation.

Title: Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence

Authors: Julian D Baldwin, Christina Dinh, Arjun Mukerji, Neil Sanghavi, Saurabh Gombar
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2507.02975
Pdf URL: https://arxiv.org/pdf/2507.02975
Copy Paste: [[2507.02975]] Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence(https://arxiv.org/abs/2507.02975)
Keywords: large language model
Abstract: The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, fka the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence or generate such evidence in real time.

Title: Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench

Authors: Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
Subjects: cs.CR, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2507.02976
Pdf URL: https://arxiv.org/pdf/2507.02976
Copy Paste: [[2507.02976]] Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench(https://arxiv.org/abs/2507.02976)
Keywords: secure, security, large language model
Abstract: Large Language Models (LLMs) and their agentic frameworks are increasingly adopted to automate software development tasks such as issue resolution and program repair. While prior work has identified security risks in LLM-generated code, most evaluations have focused on synthetic or isolated settings, leaving open questions about the security of these systems in real-world development contexts. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data. Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which LLMs and agents are most likely to generate insecure code. Our findings reveal that the standalone LLM introduces nearly 9x more new vulnerabilities than developers, with many of these exhibiting unique patterns not found in developers' code. Agentic workflows also generate a significant number of vulnerabilities, particularly when granting LLMs more autonomy, potentially increasing the likelihood of misinterpreting project context or task requirements. We find that vulnerabilities are more likely to occur in LLM patches associated with a higher number of files, more lines of generated code, and GitHub issues that lack specific code snippets or information about the expected code behavior and steps to reproduce. These results suggest that contextual factors play a critical role in the security of the generated code and point toward the need for proactive risk assessment methods that account for both code and issue-level information to complement existing vulnerability detection tools.

Title: Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification

Authors: Ruhaan Singh, Sreelekha Guggilam
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02979
Pdf URL: https://arxiv.org/pdf/2507.02979
Copy Paste: [[2507.02979]] Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification(https://arxiv.org/abs/2507.02979)
Keywords: robust
Abstract: Deep learning models have proven to be effective on medical datasets for accurate diagnostic predictions from images. However, medical datasets often contain noisy, mislabeled, or poorly generalizable images, particularly for edge cases and anomalous outcomes. Additionally, high quality datasets are often small in sample size that can result in overfitting, where models memorize noise rather than learn generalizable patterns. This in particular, could pose serious risks in medical diagnostics where the risk associated with mis-classification can impact human life. Several data-efficient training strategies have emerged to address these constraints. In particular, coreset selection identifies compact subsets of the most representative samples, enabling training that approximates full-dataset performance while reducing computational overhead. On the other hand, curriculum learning relies on gradually increasing training difficulty and accelerating convergence. However, developing a generalizable difficulty ranking mechanism that works across diverse domains, datasets, and models while reducing the computational tasks and remains challenging. In this paper, we introduce Iterative Misclassification Error Training (IMET), a novel framework inspired by curriculum learning and coreset selection. The IMET approach is aimed to identify misclassified samples in order to streamline the training process, while prioritizing the model's attention to edge case senarious and rare outcomes. The paper evaluates IMET's performance on benchmark medical image classification datasets against state-of-the-art ResNet architectures. The results demonstrating IMET's potential for enhancing model robustness and accuracy in medical image analysis are also presented in the paper.

Title: We Need Knowledge Distillation for Solving Math Word Problems

Authors: Zhenquan Shen, Xinguo Yu, Xiaotian Cheng, Rao Peng, Hao Ming
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02982
Pdf URL: https://arxiv.org/pdf/2507.02982
Copy Paste: [[2507.02982]] We Need Knowledge Distillation for Solving Math Word Problems(https://arxiv.org/abs/2507.02982)
Keywords: large language model
Abstract: The enhancement of mathematical capabilities in large language models (LLMs) fosters new developments in mathematics education within primary and secondary schools, particularly as they relate to intelligent tutoring systems. However, LLMs require substantial computational resources, resulting in significant costs in educational contexts. To mitigate this drawback, this paper investigates the feasibility of compressing LLMs for solving math word problems (MWPs). We compress the embedded vectors encoded by BERT and distill a considerably smaller student model. Our findings indicate that the student model can maintain nearly 90% of the performance of the teacher model while utilizing only 1/12 of its parameters. In addition to achieving high accuracy, the model exhibits strong generalizability, as the compressed vectors perform well across all tasks related to MWPs, and the distillation process is not task-specific. The success of this distillation demonstrates that the underlying principles are generic and not limited to a specific task. We further explore the reasons behind the compressibility of embedded vectors, revealing that part-of-speech information, rather than entity recognition, is crucial for MWPs, which may significantly contribute to their compressibility. The improvements in efficiency and cost reduction provide substantial value for intelligent tutoring systems and significantly advance the field of intelligent education.

Title: Truth, Trust, and Trouble: Medical AI on the Edge

Authors: Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02983
Pdf URL: https://arxiv.org/pdf/2507.02983
Copy Paste: [[2507.02983]] Truth, Trust, and Trouble: Medical AI on the Edge(https://arxiv.org/abs/2507.02983)
Keywords: large language model
Abstract: Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

Title: From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Authors: Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02984
Pdf URL: https://arxiv.org/pdf/2507.02984
Copy Paste: [[2507.02984]] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought(https://arxiv.org/abs/2507.02984)
Keywords: large language model
Abstract: Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: \textbf{S}elf-Aligning \textbf{M}ultimodal Reasoning with \textbf{A}nswer-O\textbf{r}iented Chain-of-\textbf{T}hought (SMART). This framework enables models to utilize AoT-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract critical visual information that links questions and answers. When provided with ground truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This encourages the use of improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.

Title: Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers

Authors: Yusuf Shihata
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.02985
Pdf URL: https://arxiv.org/pdf/2507.02985
Copy Paste: [[2507.02985]] Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers(https://arxiv.org/abs/2507.02985)
Keywords: robust, transformer
Abstract: Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU) a GRU-inspired mechanism that dynamically arbitrates information flow, enabling the model to selectively retain or discard features. This stateful, recurrent design scales linearly with the number of modalities, O(n), making it ideal for high-modality environments. Experiments on the CMU-MOSI benchmark demonstrate that GRF achieves competitive performance compared to more complex baselines. Visualizations of the embedding space further illustrate that GRF creates structured, class-separable representations through its progressive fusion mechanism. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.

Title: GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

Authors: Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02986
Pdf URL: https://arxiv.org/pdf/2507.02986
Copy Paste: [[2507.02986]] GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models(https://arxiv.org/abs/2507.02986)
Keywords: robust, large language model
Abstract: As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns like hallucination etc, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM based applications. The approach models autonomous agents that identify risks, activate risk detection tools, within specific use-cases and facilitate continuous monitoring and reporting to enhance AI safety, and user expectations. The code is available at this https URL.

Title: A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements

Authors: Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02989
Pdf URL: https://arxiv.org/pdf/2507.02989
Copy Paste: [[2507.02989]] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements(https://arxiv.org/abs/2507.02989)
Keywords: large language model
Abstract: Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state of the art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs, however these are sensitive to the model used to generate CQs and they generally require a further refinement step before they can be used to model requirements.

Title: `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Authors: Annika M Schoene, Cansu Canca
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02990
Pdf URL: https://arxiv.org/pdf/2507.02990
Copy Paste: [[2507.02990]] `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts(https://arxiv.org/abs/2507.02990)
Keywords: robust, large language model
Abstract: Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking and bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

Title: Physics Augmented Machine Learning Discovery of Composition-Dependent Constitutive Laws for 3D Printed Digital Materials

Authors: Steven Yang, Michal Levin, Govinda Anantha Padmanabha, Miriam Borshevsky, Ohad Cohen, D. Thomas Seidl, Reese E. Jones, Nikolaos Bouklas, Noy Cohen
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2507.02991
Pdf URL: https://arxiv.org/pdf/2507.02991
Copy Paste: [[2507.02991]] Physics Augmented Machine Learning Discovery of Composition-Dependent Constitutive Laws for 3D Printed Digital Materials(https://arxiv.org/abs/2507.02991)
Keywords: interpretability
Abstract: Multi-material 3D printing, particularly through polymer jetting, enables the fabrication of digital materials by mixing distinct photopolymers at the micron scale within a single build to create a composite with tunable mechanical properties. This work presents an integrated experimental and computational investigation into the composition-dependent mechanical behavior of 3D printed digital materials. We experimentally characterize five formulations, combining soft and rigid UV-cured polymers under uniaxial tension and torsion across three strain and twist rates. The results reveal nonlinear and rate-dependent responses that strongly depend on composition. To model this behavior, we develop a physics-augmented neural network (PANN) that combines a partially input convex neural network (pICNN) for learning the composition-dependent hyperelastic strain energy function with a quasi-linear viscoelastic (QLV) formulation for time-dependent response. The pICNN ensures convexity with respect to strain invariants while allowing non-convex dependence on composition. To enhance interpretability, we apply $L_0$ sparsification. For the time-dependent response, we introduce a multilayer perceptron (MLP) to predict viscoelastic relaxation parameters from composition. The proposed model accurately captures the nonlinear, rate-dependent behavior of 3D printed digital materials in both uniaxial tension and torsion, achieving high predictive accuracy for interpolated material compositions. This approach provides a scalable framework for automated, composition-aware constitutive model discovery for multi-material 3D printing.

Title: Enabling Robust, Real-Time Verification of Vision-Based Navigation through View Synthesis

Authors: Marius Neuhalfen, Jonathan Grzymisch, Manuel Sanchez-Gestido
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2507.02993
Pdf URL: https://arxiv.org/pdf/2507.02993
Copy Paste: [[2507.02993]] Enabling Robust, Real-Time Verification of Vision-Based Navigation through View Synthesis(https://arxiv.org/abs/2507.02993)
Keywords: robust
Abstract: This work introduces VISY-REVE: a novel pipeline to validate image processing algorithms for Vision-Based Navigation. Traditional validation methods such as synthetic rendering or robotic testbed acquisition suffer from difficult setup and slow runtime. Instead, we propose augmenting image datasets in real-time with synthesized views at novel poses. This approach creates continuous trajectories from sparse, pre-existing datasets in open or closed-loop. In addition, we introduce a new distance metric between camera poses, the Boresight Deviation Distance, which is better suited for view synthesis than existing metrics. Using it, a method for increasing the density of image datasets is developed.

Title: MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization

Authors: Huihui Xu, Yuanpeng Nie, Hualiang Wang, Ying Chen, Wei Li, Junzhi Ning, Lihao Liu, Hongqiu Wang, Lei Zhu, Jiyao Liu, Xiaomeng Li, Junjun He
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.02994
Pdf URL: https://arxiv.org/pdf/2507.02994
Copy Paste: [[2507.02994]] MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization(https://arxiv.org/abs/2507.02994)
Keywords: large language model
Abstract: Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at this https URL

Title: FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

Authors: Guang Yang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2507.02995
Pdf URL: https://arxiv.org/pdf/2507.02995
Copy Paste: [[2507.02995]] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images(https://arxiv.org/abs/2507.02995)
Keywords: robust, extraction, diffusion
Abstract: The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8\% accuracy, outperforming state-of-the-art baselines by 5.2\%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1--0.4 normalised frequency range, providing theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.

Title: Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis

Authors: Haiqing Li, Yuzhi Guo, Feng Jiang, Thao M. Dang, Hehuan Ma, Qifeng Zhou, Jean Gao, Junzhou Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02996
Pdf URL: https://arxiv.org/pdf/2507.02996
Copy Paste: [[2507.02996]] Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis(https://arxiv.org/abs/2507.02996)
Keywords: interpretability, large language model
Abstract: Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at this https URL

Title: What to Do Next? Memorizing skills from Egocentric Instructional Video

Authors: Jing Bi, Chenliang Xu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.02997
Pdf URL: https://arxiv.org/pdf/2507.02997
Copy Paste: [[2507.02997]] What to Do Next? Memorizing skills from Egocentric Instructional Video(https://arxiv.org/abs/2507.02997)
Keywords: robust, transformer
Abstract: Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with transformer architecture. The process of memorizing the environment's structure through extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method's versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations, resulting in improved performance and robust when action deviations occur.

Title: A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease

Authors: Kimberly F. Greco, Zongxin Yang, Mengyan Li, Han Tong, Sara Morini Sweet, Alon Geva, Kenneth D. Mandl, Benjamin A. Raby, Tianxi Cai
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2507.02998
Pdf URL: https://arxiv.org/pdf/2507.02998
Copy Paste: [[2507.02998]] A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease(https://arxiv.org/abs/2507.02998)
Keywords: transformer
Abstract: Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children's Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.

Title: Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03001
Pdf URL: https://arxiv.org/pdf/2507.03001
Copy Paste: [[2507.03001]] Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs(https://arxiv.org/abs/2507.03001)
Keywords: large language model
Abstract: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.

Title: Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages

Authors: Wanru Zhao, Yihong Chen, Royson Lee, Xinchi Qiu, Yan Gao, Hongxiang Fan, Nicholas D. Lane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03003
Pdf URL: https://arxiv.org/pdf/2507.03003
Copy Paste: [[2507.03003]] Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages(https://arxiv.org/abs/2507.03003)
Keywords: federate, large language model
Abstract: Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9\% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.

Title: CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics

Authors: Wanru Zhao, Hongxiang Fan, Shell Xu Hu, Wangchunshu Zhou, Bofan Chen, Nicholas D. Lane
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2507.03004
Pdf URL: https://arxiv.org/pdf/2507.03004
Copy Paste: [[2507.03004]] CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics(https://arxiv.org/abs/2507.03004)
Keywords: federate, large language model
Abstract: Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where sharing is not allowed directly between data silos. To tackle this issue, this paper proposes a novel data quality control technique based on the notion of data influence on the training dynamics of LLMs, that high quality data are more likely to have similar training dynamics to the anchor dataset. We then leverage the influence of the training dynamics to select high-quality data from different private domains, with centralized model updates on the server side in a collaborative training fashion by either model merging or federated learning. As for the data quality indicator, we compute the per-sample gradients with respect to the private data and the anchor dataset, and use the trace of the accumulated inner products as a measurement of data quality. In addition, we develop a quality control evaluation tailored for collaborative settings with heterogeneous domain data. Experiments show that training on the high-quality data selected by our method can often outperform other data selection methods for collaborative fine-tuning of LLMs, across diverse private domain datasets, in medical, multilingual and financial settings. Our code is released at this http URL.

Title: Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification

Authors: Faisal Ahmed, Mohammad Alfrad Nobel Bhuiyan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03006
Pdf URL: https://arxiv.org/pdf/2507.03006
Copy Paste: [[2507.03006]] Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification(https://arxiv.org/abs/2507.03006)
Keywords: extraction
Abstract: We present the first comparative study of two fundamentally distinct feature extraction techniques: Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), for medical image classification using retinal fundus images. HOG captures local texture and edge patterns through gradient orientation histograms, while TDA, using cubical persistent homology, extracts high-level topological signatures that reflect the global structure of pixel intensities. We evaluate both methods on the large APTOS dataset for two classification tasks: binary detection (normal versus diabetic retinopathy) and five-class diabetic retinopathy severity grading. From each image, we extract 26244 HOG features and 800 TDA features, using them independently to train seven classical machine learning models with 10-fold cross-validation. XGBoost achieved the best performance in both cases: 94.29 percent accuracy (HOG) and 94.18 percent (TDA) on the binary task; 74.41 percent (HOG) and 74.69 percent (TDA) on the multi-class task. Our results show that both methods offer competitive performance but encode different structural aspects of the images. This is the first work to benchmark gradient-based and topological features on retinal imagery. The techniques are interpretable, applicable to other medical imaging domains, and suitable for integration into deep learning pipelines.

Title: PDFMathTranslate: Scientific Document Translation Preserving Layouts

Authors: Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03009
Pdf URL: https://arxiv.org/pdf/2507.03009
Copy Paste: [[2507.03009]] PDFMathTranslate: Scientific Document Translation Preserving Layouts(https://arxiv.org/abs/2507.03009)
Keywords: diffusion, large language model
Abstract: Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at this https URL with more than 22k downloads.

Title: Intrinsic Fingerprint of LLMs: Continue Training is NOT All You Need to Steal A Model!

Authors: Do-hyeon Yoon, Minsoo Chun, Thomas Allen, Hans Müller, Min Wang, Rajesh Sharma
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03014
Pdf URL: https://arxiv.org/pdf/2507.03014
Copy Paste: [[2507.03014]] Intrinsic Fingerprint of LLMs: Continue Training is NOT All You Need to Steal A Model!(https://arxiv.org/abs/2507.03014)
Keywords: protect, robust, steal, watermark, large language model
Abstract: Large language models (LLMs) face significant copyright and intellectual property challenges as the cost of training increases and model reuse becomes prevalent. While watermarking techniques have been proposed to protect model ownership, they may not be robust to continue training and development, posing serious threats to model attribution and copyright protection. This work introduces a simple yet effective approach for robust LLM fingerprinting based on intrinsic model characteristics. We discover that the standard deviation distributions of attention parameter matrices across different layers exhibit distinctive patterns that remain stable even after extensive continued training. These parameter distribution signatures serve as robust fingerprints that can reliably identify model lineage and detect potential copyright infringement. Our experimental validation across multiple model families demonstrates the effectiveness of our method for model authentication. Notably, our investigation uncovers evidence that a recently Pangu Pro MoE model released by Huawei is derived from Qwen-2.5 14B model through upcycling techniques rather than training from scratch, highlighting potential cases of model plagiarism, copyright violation, and information fabrication. These findings underscore the critical importance of developing robust fingerprinting methods for protecting intellectual property in large-scale model development and emphasize that deliberate continued training alone is insufficient to completely obscure model origins.

Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH

Authors: Felix Friedrich, Thiemo Ganesha Welsch, Patrick Schramowski, Kristian Kersting
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03015
Pdf URL: https://arxiv.org/pdf/2507.03015
Copy Paste: [[2507.03015]] Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH(https://arxiv.org/abs/2507.03015)
Keywords: fair, diffusion
Abstract: Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.

Title: OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering

Authors: Zipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03018
Pdf URL: https://arxiv.org/pdf/2507.03018
Copy Paste: [[2507.03018]] OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering(https://arxiv.org/abs/2507.03018)
Keywords: large language model
Abstract: Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls-using a BM25+-based search API and a SQLite SQL executor-directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at this https URL.

Title: Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

Authors: Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, Li Yuan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03019
Pdf URL: https://arxiv.org/pdf/2507.03019
Copy Paste: [[2507.03019]] Look-Back: Implicit Visual Re-focusing in MLLM Reasoning(https://arxiv.org/abs/2507.03019)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to ``look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.

Title: A Multi-Resolution Dynamic Game Framework for Cross-Echelon Decision-Making in Cyber Warfare

Authors: Ya-Ting Yang, Quanyan Zhu
Subjects: cs.CR, cs.GT
Abstract URL: https://arxiv.org/abs/2507.03021
Pdf URL: https://arxiv.org/pdf/2507.03021
Copy Paste: [[2507.03021]] A Multi-Resolution Dynamic Game Framework for Cross-Echelon Decision-Making in Cyber Warfare(https://arxiv.org/abs/2507.03021)
Keywords: defense
Abstract: Cyber warfare has become a critical dimension of modern conflict, driven by society's increasing dependence on interconnected digital and physical infrastructure. Effective cyber defense often requires decision-making at different echelons, where the tactical layer focuses on detailed actions such as techniques, tactics, and procedures, while the strategic layer addresses long-term objectives and coordinated planning. Modeling these interactions at different echelons remains challenging due to the dynamic, large-scale, and interdependent nature of cyber environments. To address this, we propose a multi-resolution dynamic game framework in which the tactical layer captures fine-grained interactions using high-resolution extensive-form game trees, while the strategic layer is modeled as a Markov game defined over lower-resolution states abstracted from those game trees. This framework supports scalable reasoning and planning across different levels of abstraction through zoom-in and zoom-out operations that adjust the granularity of the modeling based on operational needs. A case study demonstrates how the framework works and its effectiveness in improving the defender's strategic advantage.

Title: Generalized Adaptive Transfer Network: Enhancing Transfer Learning in Reinforcement Learning Across Domains

Authors: Abhishek Verma, Nallarasan V, Balaraman Ravindran
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03026
Pdf URL: https://arxiv.org/pdf/2507.03026
Copy Paste: [[2507.03026]] Generalized Adaptive Transfer Network: Enhancing Transfer Learning in Reinforcement Learning Across Domains(https://arxiv.org/abs/2507.03026)
Keywords: robust
Abstract: Transfer learning in Reinforcement Learning (RL) enables agents to leverage knowledge from source tasks to accelerate learning in target tasks. While prior work, such as the Attend, Adapt, and Transfer (A2T) framework, addresses negative transfer and selective transfer, other critical challenges remain underexplored. This paper introduces the Generalized Adaptive Transfer Network (GATN), a deep RL architecture designed to tackle task generalization across domains, robustness to environmental changes, and computational efficiency in transfer. GATN employs a domain-agnostic representation module, a robustness-aware policy adapter, and an efficient transfer scheduler to achieve these goals. We evaluate GATN on diverse benchmarks, including Atari 2600, MuJoCo, and a custom chatbot dialogue environment, demonstrating superior performance in cross-domain generalization, resilience to dynamic environments, and reduced computational overhead compared to baselines. Our findings suggest GATN is a versatile framework for real-world RL applications, such as adaptive chatbots and robotic control.

Title: The Book of Life approach: Enabling richness and scale for life course research

Authors: Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03027
Pdf URL: https://arxiv.org/pdf/2507.03027
Copy Paste: [[2507.03027]] The Book of Life approach: Enabling richness and scale for life course research(https://arxiv.org/abs/2507.03027)
Keywords: large language model
Abstract: For over a century, life course researchers have faced a choice between two dominant methodological approaches: qualitative methods that analyze rich data but are constrained to small samples, and quantitative survey-based methods that study larger populations but sacrifice data richness for scale. Two recent technological developments now enable us to imagine a hybrid approach that combines some of the depth of the qualitative approach with the scale of quantitative methods. The first development is the steady rise of ''complex log data,'' behavioral data that is logged for purposes other than research but that can be repurposed to construct rich accounts of people's lives. The second is the emergence of large language models (LLMs) with exceptional pattern recognition capabilities on plain text. In this paper, we take a necessary step toward creating this hybrid approach by developing a flexible procedure to transform complex log data into a textual representation of an individual's life trajectory across multiple domains, over time, and in context. We call this data representation a ''book of life.'' We illustrate the feasibility of our approach by writing over 100 million books of life covering many different facets of life, over time and placed in social context using Dutch population-scale registry data. We open source the book of life toolkit (BOLT), and invite the research community to explore the many potential applications of this approach.

Title: Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Authors: Johnson Thomas, Ayush Mudgal, Wendao Liu, Nisten Tahiraj, Zeeshaan Mohammed, Dhruv Diddi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03033
Pdf URL: https://arxiv.org/pdf/2507.03033
Copy Paste: [[2507.03033]] Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation(https://arxiv.org/abs/2507.03033)
Keywords: privacy, large language model
Abstract: Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.

Title: Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Authors: Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren
Subjects: cs.LG, cs.AI, cs.CR, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2507.03034
Pdf URL: https://arxiv.org/pdf/2507.03034
Copy Paste: [[2507.03034]] Rethinking Data Protection in the (Generative) Artificial Intelligence Era(https://arxiv.org/abs/2507.03034)
Keywords: privacy, protect, generative
Abstract: The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Title: Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

Authors: Shirley Wu, Parth Sarthi, Shiyu Zhao, Aaron Lee, Herumb Shandilya, Adrian Mladenic Grobelnik, Nurendra Choudhary, Eddie Huang, Karthik Subbian, Linjun Zhang, Diyi Yang, James Zou, Jure Leskovec
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03041
Pdf URL: https://arxiv.org/pdf/2507.03041
Copy Paste: [[2507.03041]] Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards(https://arxiv.org/abs/2507.03041)
Keywords: large language model
Abstract: Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component's local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component's local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at this https URL.

Title: Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Authors: Yuyang Lou, Charles Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03042
Pdf URL: https://arxiv.org/pdf/2507.03042
Copy Paste: [[2507.03042]] Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction(https://arxiv.org/abs/2507.03042)
Keywords: large language model
Abstract: Memory storage for Large Language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with a LSTM memory module that generates memory embedding which then is soft-prompt injected into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principals as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.

Title: Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Authors: Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.03047
Pdf URL: https://arxiv.org/pdf/2507.03047
Copy Paste: [[2507.03047]] Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation(https://arxiv.org/abs/2507.03047)
Keywords: large language model
Abstract: Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning--comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model's temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).

Title: Monitoring of Static Fairness

Authors: Thomas A. Henzinger, Mahyar Karimi, Konstantin Kueffner, Kaushik Mallik
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03048
Pdf URL: https://arxiv.org/pdf/2507.03048
Copy Paste: [[2507.03048]] Monitoring of Static Fairness(https://arxiv.org/abs/2507.03048)
Keywords: fair
Abstract: Machine-learned systems are in widespread use for making decisions about humans, and it is important that they are fair, i.e., not biased against individuals based on sensitive attributes. We present a general framework of runtime verification of algorithmic fairness for systems whose models are unknown, but are assumed to have a Markov chain structure, with or without full observation of the state space. We introduce a specification language that can model many common algorithmic fairness properties, such as demographic parity, equal opportunity, and social burden. We build monitors that observe a long sequence of events as generated by a given system, and output, after each observation, a quantitative estimate of how fair or biased the system was on that run until that point in time. The estimate is proven to be correct modulo a variable error bound and a given confidence level, where the error bound gets tighter as the observed sequence gets longer. We present two categories of monitoring algorithms, namely ones with a uniform error bound across all time points, and ones with weaker non-uniform, pointwise error bounds at different time points. Our monitoring algorithms use statistical tools that are adapted to suit the dynamic requirements of monitoring and the special needs of the fairness specifications. Using a prototype implementation, we show how we can monitor if a bank is fair in giving loans to applicants from different social backgrounds, and if a college is fair in admitting students while maintaining a reasonable financial burden on the society. In these experiments, our monitors took less than a millisecond to update their verdicts after each observation.

Title: Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization

Authors: Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.03051
Pdf URL: https://arxiv.org/pdf/2507.03051
Copy Paste: [[2507.03051]] Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization(https://arxiv.org/abs/2507.03051)
Keywords: security, large language model
Abstract: Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.

Title: From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction

Authors: Egor Maximov, Yulia Kuzkina, Azamat Kanametov, Alexander Prutko, Aleksei Goncharov, Maxim Zhelnin, Egor Shvetsov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03052
Pdf URL: https://arxiv.org/pdf/2507.03052
Copy Paste: [[2507.03052]] From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction(https://arxiv.org/abs/2507.03052)
Keywords: large language model
Abstract: As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods, such as N:M sparsification, often fall short due to limited flexibility, and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold-where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant like weight equalization improve sparse models performance.

Title: LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection

Authors: Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03054
Pdf URL: https://arxiv.org/pdf/2507.03054
Copy Paste: [[2507.03054]] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection(https://arxiv.org/abs/2507.03054)
Keywords: diffusion
Abstract: The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-datasets settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available on the following link: this https URL.

Title: Automated Grading of Students' Handwritten Graphs: A Comparison of Meta-Learning and Vision-Large Language Models

Authors: Behnam Parsaeifard, Martin Hlosta, Per Bergamin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03056
Pdf URL: https://arxiv.org/pdf/2507.03056
Copy Paste: [[2507.03056]] Automated Grading of Students' Handwritten Graphs: A Comparison of Meta-Learning and Vision-Large Language Models(https://arxiv.org/abs/2507.03056)
Keywords: large language model
Abstract: With the rise of online learning, the demand for efficient and consistent assessment in mathematics has significantly increased over the past decade. Machine Learning (ML), particularly Natural Language Processing (NLP), has been widely used for autograding student responses, particularly those involving text and/or mathematical expressions. However, there has been limited research on autograding responses involving students' handwritten graphs, despite their prevalence in Science, Technology, Engineering, and Mathematics (STEM) curricula. In this study, we implement multimodal meta-learning models for autograding images containing students' handwritten graphs and text. We further compare the performance of Vision Large Language Models (VLLMs) with these specially trained metalearning models. Our results, evaluated on a real-world dataset collected from our institution, show that the best-performing meta-learning models outperform VLLMs in 2-way classification tasks. In contrast, in more complex 3-way classification tasks, the best-performing VLLMs slightly outperform the meta-learning models. While VLLMs show promising results, their reliability and practical applicability remain uncertain and require further investigation.

Title: BERT4Traj: Transformer Based Trajectory Reconstruction for Sparse Mobility Data

Authors: Hao Yang, Angela Yao, Christopher Whalen, Gengchen Mai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03062
Pdf URL: https://arxiv.org/pdf/2507.03062
Copy Paste: [[2507.03062]] BERT4Traj: Transformer Based Trajectory Reconstruction for Sparse Mobility Data(https://arxiv.org/abs/2507.03062)
Keywords: transformer
Abstract: Understanding human mobility is essential for applications in public health, transportation, and urban planning. However, mobility data often suffers from sparsity due to limitations in data collection methods, such as infrequent GPS sampling or call detail record (CDR) data that only capture locations during communication events. To address this challenge, we propose BERT4Traj, a transformer based model that reconstructs complete mobility trajectories by predicting hidden visits in sparse movement sequences. Inspired by BERT's masked language modeling objective and self_attention mechanisms, BERT4Traj leverages spatial embeddings, temporal embeddings, and contextual background features such as demographics and anchor points. We evaluate BERT4Traj on real world CDR and GPS datasets collected in Kampala, Uganda, demonstrating that our approach significantly outperforms traditional models such as Markov Chains, KNN, RNNs, and LSTMs. Our results show that BERT4Traj effectively reconstructs detailed and continuous mobility trajectories, enhancing insights into human movement patterns.

Title: LLM-Driven Auto Configuration for Transient IoT Device Collaboration

Authors: Hetvi Shastri, Walid A. Hanafy, Li Wu, David Irwin, Mani Srivastava, Prashant Shenoy
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03064
Pdf URL: https://arxiv.org/pdf/2507.03064
Copy Paste: [[2507.03064]] LLM-Driven Auto Configuration for Transient IoT Device Collaboration(https://arxiv.org/abs/2507.03064)
Keywords: secure, large language model
Abstract: Today's Internet of Things (IoT) has evolved from simple sensing and actuation devices to those with embedded processing and intelligent services, enabling rich collaborations between users and their devices. However, enabling such collaboration becomes challenging when transient devices need to interact with host devices in temporarily visited environments. In such cases, fine-grained access control policies are necessary to ensure secure interactions; however, manually implementing them is often impractical for non-expert users. Moreover, at run-time, the system must automatically configure the devices and enforce such fine-grained access control rules. Additionally, the system must address the heterogeneity of devices. In this paper, we present CollabIoT, a system that enables secure and seamless device collaboration in transient IoT environments. CollabIoT employs a Large language Model (LLM)-driven approach to convert users' high-level intents to fine-grained access control policies. To support secure and seamless device collaboration, CollabIoT adopts capability-based access control for authorization and uses lightweight proxies for policy enforcement, providing hardware-independent abstractions. We implement a prototype of CollabIoT's policy generation and auto configuration pipelines and evaluate its efficacy on an IoT testbed and in large-scale emulated environments. We show that our LLM-based policy generation pipeline is able to generate functional and correct policies with 100% accuracy. At runtime, our evaluation shows that our system configures new devices in ~150 ms, and our proxy-based data plane incurs network overheads of up to 2 ms and access control overheads up to 0.3 ms.

Title: Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference

Authors: Xin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03065
Pdf URL: https://arxiv.org/pdf/2507.03065
Copy Paste: [[2507.03065]] Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference(https://arxiv.org/abs/2507.03065)
Keywords: generative
Abstract: The Helmholtz Machine (HM) is a foundational architecture for unsupervised learning, coupling a bottom-up recognition model with a top-down generative model through alternating inference. However, its reliance on symmetric, data-driven updates constrains its ability to perform goal-directed reasoning or simulate temporally extended processes. In this work, we introduce the \emph{Cycle-Consistent Helmholtz Machine} (C$^2$HM), a novel extension that reframes inference as a \emph{goal-seeded}, \emph{asymmetric} process grounded in structured internal priors. Rather than inferring latent causes solely from sensory data, C$^2$HM simulates plausible latent trajectories conditioned on abstract goals, aligning them with observed outcomes through a recursive cycle of forward generation and inverse refinement. This cycle-consistent formulation integrates top-down structure with bottom-up evidence via a variational loop, enforcing mutual alignment between goal-conditioned latent predictions and recognition-based reconstructions. We formalize this mechanism within the framework of the \emph{Context-Content Uncertainty Principle} (CCUP), which posits that inference proceeds by aligning structured, low-entropy content with high-entropy, ambiguous context. C$^2$HM improves representational efficiency, supports memory chaining via path-dependent inference, and enables spatial compositional imagination. By offering a biologically inspired alternative to classical amortized inference, $C^2$HM reconceives generative modeling as intentional simulation, bridging memory-based planning and unsupervised learning in a unified probabilistic framework.

Title: Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Authors: Alvaro Riquelme, Pedro Costa, Catalina Martinez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03067
Pdf URL: https://arxiv.org/pdf/2507.03067
Copy Paste: [[2507.03067]] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case(https://arxiv.org/abs/2507.03067)
Keywords: security, robust, large language model
Abstract: For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models specifically GPT-4o and Llama 3.2 405b to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94 %, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.

Title: Mitigating Goal Misgeneralization with Minimax Regret

Authors: Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03068
Pdf URL: https://arxiv.org/pdf/2507.03068
Copy Paste: [[2507.03068]] Mitigating Goal Misgeneralization with Minimax Regret(https://arxiv.org/abs/2507.03068)
Keywords: robust
Abstract: Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer's intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment -- a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don't find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.

Title: ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

Authors: YuXuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03069
Pdf URL: https://arxiv.org/pdf/2507.03069
Copy Paste: [[2507.03069]] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization(https://arxiv.org/abs/2507.03069)
Keywords: transformer
Abstract: With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.

Title: RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.03112
Pdf URL: https://arxiv.org/pdf/2507.03112
Copy Paste: [[2507.03112]] RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents(https://arxiv.org/abs/2507.03112)
Keywords: large language model
Abstract: Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue-especially for emotional intelligence-remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends--thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better-moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.

Title: BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers

Authors: Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2507.03117
Pdf URL: https://arxiv.org/pdf/2507.03117
Copy Paste: [[2507.03117]] BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers(https://arxiv.org/abs/2507.03117)
Keywords: robust, transformer
Abstract: The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.

Title: How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models

Authors: Dharshan Kumaran, Stephen M Fleming, Larisa Markeeva, Joe Heyward, Andrea Banino, Mrinal Mathur, Razvan Pascanu, Simon Osindero, Benedetto de Martino, Petar Velickovic, Viorica Patraucean
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03120
Pdf URL: https://arxiv.org/pdf/2507.03120
Copy Paste: [[2507.03120]] How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models(https://arxiv.org/abs/2507.03120)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers whilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments -- something impossible in human participants. We show that LLMs -- Gemma 3, GPT4o and o1-preview -- exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms -- a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback -- parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.

Title: ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Authors: Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03133
Pdf URL: https://arxiv.org/pdf/2507.03133
Copy Paste: [[2507.03133]] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models(https://arxiv.org/abs/2507.03133)
Keywords: large language model
Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.

Title: Holographic Projection and Cyber Attack Surface: A Physical Analogy for Digital Security

Authors: Ricardo Queiroz de Araujo Fernandes, Anderson Santos, Daniel Maier de Carvalho, André Luiz Bandeira Molina
Subjects: cs.CR, cs.LO
Abstract URL: https://arxiv.org/abs/2507.03136
Pdf URL: https://arxiv.org/pdf/2507.03136
Copy Paste: [[2507.03136]] Holographic Projection and Cyber Attack Surface: A Physical Analogy for Digital Security(https://arxiv.org/abs/2507.03136)
Keywords: security, protect, defense, attack
Abstract: This article presents an in-depth exploration of the analogy between the Holographic Principle in theoretical physics and cyber attack surfaces in digital security. Building on concepts such as black hole entropy and AdS/CFT duality, it highlights how complex infrastructures project their vulnerabilities onto their external interfaces. The paper draws a parallel between a black hole's event horizon, which encodes all internal information, and the attack surface, which reflects the internal architecture's security posture. Additionally, the article outlines how this conceptual framework can guide cybersecurity practices, emphasizing strategies such as attack surface reduction, continuous scanning with tools like OWASP ZAP and Greenbone OpenVAS, and the implementation of Zero Trust Architecture. This analogy not only provides a unique perspective on digital security but also underscores the critical importance of boundary-level defenses in protecting vast internal infrastructures.

Title: From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Authors: Melanie Galea, Claudia Borg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03142
Pdf URL: https://arxiv.org/pdf/2507.03142
Copy Paste: [[2507.03142]] From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models(https://arxiv.org/abs/2507.03142)
Keywords: large language model
Abstract: The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.

Title: Set Valued Predictions For Robust Domain Generalization

Authors: Ron Tsibulsky, Daniel Nevo, Uri Shalit
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03146
Pdf URL: https://arxiv.org/pdf/2507.03146
Copy Paste: [[2507.03146]] Set Valued Predictions For Robust Domain Generalization(https://arxiv.org/abs/2507.03146)
Keywords: robust
Abstract: Despite the impressive advancements in modern machine learning, achieving robustness in Domain Generalization (DG) tasks remains a significant challenge. In DG, models are expected to perform well on samples from unseen test distributions (also called domains), by learning from multiple related training distributions. Most existing approaches to this problem rely on single-valued predictions, which inherently limit their robustness. We argue that set-valued predictors could be leveraged to enhance robustness across unseen domains, while also taking into account that these sets should be as small as possible. We introduce a theoretical framework defining successful set prediction in the DG setting, focusing on meeting a predefined performance criterion across as many domains as possible, and provide theoretical insights into the conditions under which such domain generalization is achievable. We further propose a practical optimization method compatible with modern learning architectures, that balances robust performance on unseen domains with small prediction set sizes. We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization.

Title: HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference

Authors: Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, Jia Rao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03153
Pdf URL: https://arxiv.org/pdf/2507.03153
Copy Paste: [[2507.03153]] HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference(https://arxiv.org/abs/2507.03153)
Keywords: large language model
Abstract: Scaling inference for large language models (LLMs) is increasingly constrained by limited GPU memory, especially due to growing key-value (KV) caches required for long-context generation. While existing approaches offload KV caches to CPU memory or apply sparse attention to reduce GPU load, they often underutilize CPU compute resources and compromise accuracy. We present HGCA, a hybrid CPU-GPU attention mechanism that enables scalable, high-throughput LLM inference with near-full attention quality. HGCA performs dense attention on recently generated KV entries retained in GPU memory and parallel sparse attention on selected, salient KV entries in CPU memory. The attention outputs are efficiently merged using log-sum-exp fusion, minimizing PCIe transfer overhead. HGCA also introduces a finegrained, per-head sparsification strategy optimized for CPU execution, preserving contextual relevance while reducing computation. Our implementation seamlessly integrates into existing LLM frameworks without requiring model retraining. Experiments across diverse models and workloads show that HGCA achieves superior scalability, supports longer sequences and larger batch sizes, and outperforms existing sparse attention baselines in both performance and accuracy -- all on commodity GPU hardware.

Title: PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset

Authors: Michal Golovanevsky, Pranav Mahableshwarkar, Carsten Eickhoff, Ritambhara Singh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03165
Pdf URL: https://arxiv.org/pdf/2507.03165
Copy Paste: [[2507.03165]] PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset(https://arxiv.org/abs/2507.03165)
Keywords: fair, interpretability
Abstract: Multimodal deep learning holds promise for improving clinical prediction by integrating diverse patient data, including text, imaging, time-series, and structured demographics. Contrastive learning facilitates this integration by producing a unified representation that can be reused across tasks, reducing the need for separate models or encoders. Although contrastive learning has seen success in vision-language domains, its use in clinical settings remains largely limited to image and text pairs. We propose the Pipeline for Contrastive Modality Evaluation and Encoding (PiCME), which systematically assesses five clinical data types from MIMIC: discharge summaries, radiology reports, chest X-rays, demographics, and time-series. We pre-train contrastive models on all 26 combinations of two to five modalities and evaluate their utility on in-hospital mortality and phenotype prediction. To address performance plateaus with more modalities, we introduce a Modality-Gated LSTM that weights each modality according to its contrastively learned importance. Our results show that contrastive models remain competitive with supervised baselines, particularly in three-modality settings. Performance declines beyond three modalities, which supervised models fail to recover. The Modality-Gated LSTM mitigates this drop, improving AUROC from 73.19% to 76.93% and AUPRC from 51.27% to 62.26% in the five-modality setting. We also compare contrastively learned modality importance scores with attribution scores and evaluate generalization across demographic subgroups, highlighting strengths in interpretability and fairness. PiCME is the first to scale contrastive learning across all modality combinations in MIMIC, offering guidance for modality selection, training strategies, and equitable clinical prediction.

Title: Adversarial Manipulation of Reasoning Models using Internal Representations

Authors: Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03167
Pdf URL: https://arxiv.org/pdf/2507.03167
Copy Paste: [[2507.03167]] Adversarial Manipulation of Reasoning Models using Internal Representations(https://arxiv.org/abs/2507.03167)
Keywords: attack
Abstract: Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at this https URL

Title: Adopting a human developmental visual diet yields robust, shape-based AI vision

Authors: Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03168
Pdf URL: https://arxiv.org/pdf/2507.03168
Copy Paste: [[2507.03168]] Adopting a human developmental visual diet yields robust, shape-based AI vision(https://arxiv.org/abs/2507.03168)
Keywords: attack, robust
Abstract: Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Contrary to humans, AI heavily relies on texture-features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, we here introduce a solution that arises from a previously underexplored direction: rather than scaling up, we take inspiration from how human vision develops from early infancy into adulthood. We quantified the visual maturation by synthesising decades of psychophysical and neurophysiological research into a novel developmental visual diet (DVD) for AI vision. We show that guiding AI systems through this human-inspired curriculum produces models that closely align with human behaviour on every hallmark of robust vision tested yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, higher robustness to image corruptions, and stronger resilience to adversarial attacks. By outperforming high parameter AI foundation models trained on orders of magnitude more data, we provide evidence that robust AI vision can be achieved by guiding the way how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.

Title: Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data

Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary
Subjects: cs.LG, cond-mat.stat-mech, physics.bio-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2507.03174
Pdf URL: https://arxiv.org/pdf/2507.03174
Copy Paste: [[2507.03174]] Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data(https://arxiv.org/abs/2507.03174)
Keywords: interpretability, generative
Abstract: Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low-dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end-to-end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low-dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system's slow, important degrees of freedom while the generative model accurately reproduces the system's equilibrium behavior. We demonstrate LaTF's effectiveness across diverse systems, including a model potential, the Chignolin protein, and cluster of Lennard Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to a RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature-dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.

Title: How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Authors: Abeer Alessa, Akshaya Lakshminarasimhan, Param Somane, Julian Skirzynski, Julian McAuley, Jessica Echterhoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03194
Pdf URL: https://arxiv.org/pdf/2507.03194
Copy Paste: [[2507.03194]] How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?(https://arxiv.org/abs/2507.03194)
Keywords: robust, large language model
Abstract: Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinates on post-knowledge-cutoff data questions in 57.33% of the cases, and primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.

Title: DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing

Authors: Liangyu Wang, Huanyi Xie, Di Wang
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2507.03211
Pdf URL: https://arxiv.org/pdf/2507.03211
Copy Paste: [[2507.03211]] DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing(https://arxiv.org/abs/2507.03211)
Keywords: transformer, large language model
Abstract: Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CPU memory and overlapping transformer block transfer with dual forward computation on a single GPU. However, ZO2 remains limited by its single-device execution and achieves modest throughput. In this work, we present DistZO2, a high-throughput, memory-efficient framework for distributed zeroth-order fine-tuning of LLMs. DistZO2 introduces three parallel strategies: (1) Perturbation Parallelism (PertP), which parallelizes the two perturbed forward passes across devices; (2) Distributed Data Parallelism (DDP), adapted to the scalar-gradient nature of ZO training; and (3) a unified 2D Parallelism design that combines PertP and DDP. To further mitigate communication bottlenecks introduced by parameter offloading, we propose a hardware-aware communication strategy that slices parameter blocks and redistributes them across GPUs via high-speed interconnects such as NVLink. DistZO2 scales zeroth-order fine-tuning to modern multi-GPU systems, preserving ZO2's memory efficiency while substantially improving training throughput. In our experiments on OPT-175B, DistZO2 achieves a 3x speedup over ZO2 with distributed computing. DistZO2's code has been open-sourced in this https URL.

Title: Development of an Improved Capsule-Yolo Network for Automatic Tomato Plant Disease Early Detection and Diagnosis

Authors: Idris Ochijenu, Monday Abutu Idakwo, Sani Felix
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03219
Pdf URL: https://arxiv.org/pdf/2507.03219
Copy Paste: [[2507.03219]] Development of an Improved Capsule-Yolo Network for Automatic Tomato Plant Disease Early Detection and Diagnosis(https://arxiv.org/abs/2507.03219)
Keywords: security
Abstract: Like many countries, Nigeria is naturally endowed with fertile agricultural soil that supports large-scale tomato production. However, the prevalence of disease causing pathogens poses a significant threat to tomato health, often leading to reduced yields and, in severe cases, the extinction of certain species. These diseases jeopardise both the quality and quantity of tomato harvests, contributing to food insecurity. Fortunately, tomato diseases can often be visually identified through distinct forms, appearances, or textures, typically first visible on leaves and fruits. This study presents an enhanced Capsule-YOLO network architecture designed to automatically segment overlapping and occluded tomato leaf images from complex backgrounds using the YOLO framework. It identifies disease symptoms with impressive performance metrics: 99.31% accuracy, 98.78% recall, and 99.09% precision, and a 98.93% F1-score representing improvements of 2.91%, 1.84%, 5.64%, and 4.12% over existing state-of-the-art methods. Additionally, a user-friendly interface was developed to allow farmers and users to upload images of affected tomato plants and detect early disease symptoms. The system also provides recommendations for appropriate diagnosis and treatment. The effectiveness of this approach promises significant benefits for the agricultural sector by enhancing crop yields and strengthening food security.

Title: Neural Inhibition Improves Dynamic Routing and Mixture of Experts

Authors: Will Y. Zou, Jennifer Y. Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03221
Pdf URL: https://arxiv.org/pdf/2507.03221
Copy Paste: [[2507.03221]] Neural Inhibition Improves Dynamic Routing and Mixture of Experts(https://arxiv.org/abs/2507.03221)
Keywords: transformer
Abstract: To be effective, efficient, and diverse, deep learning models need to dynamically choose its architecture based on signals from a population of neurons. We hypothesize dynamic routing models can be improved with neural inhibition in those neural populations. This means signals commonly shared among the various modes of data statistics can be inhibited so that the routing model can choose a specialized expert path for each data sample. Only through inhibition is the routing mechanism able to effectively select neural pathways. We believe this is an under-studied and under-verified implementation methodology for Mixture-of-Experts, dynamic routing, and transformer language models. We provide experimental evidence that the neural inhibition algorithm significantly boosts the performance of general tasks and motivates more effort to be invested in this research direction.

Title: On Jailbreaking Quantized Language Models Through Fault Injection Attacks

Authors: Noureldin Zahran, Ahmad Tahmasivand, Ihsen Alouani, Khaled Khasawneh, Mohammed E. Fouda
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03236
Pdf URL: https://arxiv.org/pdf/2507.03236
Copy Paste: [[2507.03236]] On Jailbreaking Quantized Language Models Through Fault Injection Attacks(https://arxiv.org/abs/2507.03236)
Keywords: attack
Abstract: The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline), and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. While attacks readily achieve high success (>80\% Attack Success Rate, ASR) on FP16 models, within an attack budget of 25 perturbations, FP8 and INT8 models exhibit ASRs below 20\% and 50\%, respectively. Increasing the perturbation budget up to 150 bit-flips, FP8 models maintained ASR below 65\%, demonstrating some resilience compared to INT8 and INT4 models that have high ASR. In addition, analysis of perturbation locations revealed differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Besides, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (<5\% ASR difference), though INT4 significantly reduced transferred ASR (avg. 35\% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.

Title: A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point

Authors: Daniel Raviv, Juan D. Yepes, Eiki M. Martinson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03237
Pdf URL: https://arxiv.org/pdf/2507.03237
Copy Paste: [[2507.03237]] A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point(https://arxiv.org/abs/2507.03237)
Keywords: segmentation
Abstract: We demonstrate that, under orthographic projection and with a camera fixated on a point located on a rigid body, the rotation of that body can be analytically obtained by tracking only one other feature in the image. With some exceptions, any tracked point, regardless of its location on the body, yields the same value of the instantaneous rotation rate. The proposed method is independent of the shape of the 3D object and does not require a priori knowledge about the scene. This algorithm is suited for parallel processing and can achieve segmentation of the scene by distinguishing points that do not belong to the same rigid body, simply because they do not produce the same value of the rotation. This paper presents an analytical derivation, simulation results, and results from real video data.

Title: KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

Authors: Antoine Nzeyimana, Andre Niyongabo Rubungo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03241
Pdf URL: https://arxiv.org/pdf/2507.03241
Copy Paste: [[2507.03241]] KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation(https://arxiv.org/abs/2507.03241)
Keywords: transformer, large language model
Abstract: The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.

Title: RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03253
Pdf URL: https://arxiv.org/pdf/2507.03253
Copy Paste: [[2507.03253]] RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs(https://arxiv.org/abs/2507.03253)
Keywords: large language model
Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose $\textbf{RefineX}$, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.

Title: LACONIC: A 3D Layout Adapter for Controllable Image Creation

Authors: Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03257
Pdf URL: https://arxiv.org/pdf/2507.03257
Copy Paste: [[2507.03257]] LACONIC: A 3D Layout Adapter for Controllable Image Creation(https://arxiv.org/abs/2507.03257)
Keywords: diffusion, generative
Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect consistent three-dimensional geometric structure, underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable number of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.

Title: Novel Blockchain-based Protocols for Electronic Voting and Auctions

Authors: Zhaorun Lin
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2507.03258
Pdf URL: https://arxiv.org/pdf/2507.03258
Copy Paste: [[2507.03258]] Novel Blockchain-based Protocols for Electronic Voting and Auctions(https://arxiv.org/abs/2507.03258)
Keywords: secure, security, privacy, protect
Abstract: Programmable blockchains have long been a hot research topic given their tremendous use in decentralized applications. Smart contracts, using blockchains as their underlying technology, inherit the desired properties such as verifiability, immutability, and transparency, which make it a great suit in trustless environments. In this thesis, we consider several decentralized protocols to be built on blockchains, specifically using smart contracts on Ethereum. We used algorithmic and cryptographic tools in our implementations to further improve the level of security and efficiency beyond the state-of-the-art works. We proposed a new approach called Blind Vote, which is an untraceable, secure, efficient, secrecy-preserving, and fully on-chain electronic voting protocol based on the well-known concept of Chaum's blind signatures. We illustrate that our approach achieves the same security guarantees as previous methods such as Tornado Vote [1], while consuming significantly less gas. Thus, we provide a cheaper and considerably more gas-efficient alternative for anonymous blockchain-based voting. On the other hand, we propose a new family of algorithms for private, trustless auctions that protect bidder identities and bid values while remaining practical for smart contract execution. We ensure trustlessness by running the auction logic in a smart contract, thereby eliminating reliance on any single trusted party. This approach prevents bid tampering, front-running, and collusion by enforcing immutability and decentralized verification of bids. The resulting protocol uniquely combines efficiency, trustlessness, and enduring bid privacy, offering a scalable and secure solution for blockchain-based marketplaces and other decentralized applications.

Title: Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Authors: Song Mao, Yang Chen, Pinglong Cai, Ding Wang, Guohang Yan, Zhi Yu, Botian Shi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03262
Pdf URL: https://arxiv.org/pdf/2507.03262
Copy Paste: [[2507.03262]] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders(https://arxiv.org/abs/2507.03262)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state of the art multi encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder's unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within a this http URL experiments reveal that certain vision encoders contribute little, or even negatively, to overall performance, confirming substantial redundancy. Our experiments reveal that certain vision encoders contribute minimally, or even negatively, to the model's performance, confirming the prevalence of redundancy. These findings highlight critical inefficiencies in current multi encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.

Title: Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification

Authors: Xinyue Xin, Ming Li, Yan Wu, Xiang Li, Peng Zhang, Dazhi Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03268
Pdf URL: https://arxiv.org/pdf/2507.03268
Copy Paste: [[2507.03268]] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification(https://arxiv.org/abs/2507.03268)
Keywords: extraction
Abstract: The collaborative classification of dual-frequency PolSAR images is a meaningful but also challenging research. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR) is proposed in this article. First, in addition to applying CNN and ViT as local and global feature extractors, a statistical-based dynamic sample rectification (SDSR) module is designed to avoid the impact of poor regional consistency on spatial information learning process. Specifically, based on the fact that the PolSAR covariance matrix conforms to the complex Wishart distribution, SDSR first dynamically evaluates the sample purity, and then performs pixel selection and pixel generation to remove noisy pixels, thereby avoiding the feature interaction between informative pixels and noisy pixels and improving the classification feature extraction process. Next, a dual-frequency gate-selected distillation (DGSD) module is constructed to emphasize the advantages of different frequency bands and perform complementary learning on dual-frequency data. It uses the dominant single-frequency branch on each sample as teacher model to train the dual-frequency student model, enabling the student model to learn the optimal results and realizing complementary utilization of dual-frequency data on different terrain objects. Comprehensive experiments on four measured dual-frequency PolSAR data demonstrate that the proposed SKDNet-SSR outperforms other related methods.

Title: ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization

Authors: Haosheng Gan, Berk Tinaz, Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03275
Pdf URL: https://arxiv.org/pdf/2507.03275
Copy Paste: [[2507.03275]] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization(https://arxiv.org/abs/2507.03275)
Keywords: fair, diffusion, generative
Abstract: Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts -- such as spatial relationships and shapes -- benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.

Title: Securing Transformer-based AI Execution via Unified TEE and Crypto-protected Accelerators

Authors: Jiaqi Xue, Yifei Zhao, Mengxin Zheng, Xun Chen, Fan Yao, Yan Solihin, Qian Lou
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03278
Pdf URL: https://arxiv.org/pdf/2507.03278
Copy Paste: [[2507.03278]] Securing Transformer-based AI Execution via Unified TEE and Crypto-protected Accelerators(https://arxiv.org/abs/2507.03278)
Keywords: secure, security, protect, transformer, large language model
Abstract: Recent advances in Transformer models, e.g., large language models (LLMs), have brought tremendous breakthroughs in various artificial intelligence (AI) tasks, leading to their wide applications in many security-critical domains. Due to their unprecedented scale and prohibitively high development cost, these models have become highly valuable intellectual property for AI stakeholders and are increasingly deployed via machine learning as a service (MLaaS). However, MLaaS often runs on untrusted cloud infrastructure, exposing data and models to potential breaches. Mainstream protection mechanisms leverage trusted execution environments (TEEs) where confidentiality and integrity for secretive data are shielded using hardware-based encryption and integrity checking. Unfortunately, running model inference entirely within TEEs is subject to non-trivial slowdown, which is further exacerbated in LLMs due to the substantial computation and memory footprint involved. Recent studies reveal that the hybrid TEE-based scheme offloading partial model inference operations to the untrusted accelerators (e.g., GPU) is a promising solution. However, prior offloading schemes fail to ensure dual protection of data and model in Transformer inference, as they cannot securely offload critical operations, i.e., Attention and SoftMax, forcing these computations to remain confined within TEEs. To address these challenges, we propose TwinShield, a framework enabling secure Transformer inference in heterogeneous TEE and accelerator systems with dual protection for both model and data. TwinShield offloads ~87% of computation to GPUs and delivers 4.0x - 6.1x speedups over previous approaches across various Transformer models.

Title: Conformal Information Pursuit for Interactively Guiding Large Language Models

Authors: Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03279
Pdf URL: https://arxiv.org/pdf/2507.03279
Copy Paste: [[2507.03279]] Conformal Information Pursuit for Interactively Guiding Large Language Models(https://arxiv.org/abs/2507.03279)
Keywords: robust, interpretability, large language model
Abstract: A significant use case of instruction-finetuned Large Language Models (LLMs) is to solve question-answering tasks interactively. In this setting, an LLM agent is tasked with making a prediction by sequentially querying relevant information from the user, as opposed to a single-turn conversation. This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. However, obtaining accurate estimates of mutual information or conditional entropy for LLMs is very difficult in practice due to over- or under-confident LLM probabilities, which leads to suboptimal query selection and predictive performance. To better estimate the uncertainty at each iteration, we propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets. More specifically, C-IP leverages a relationship between prediction sets and conditional entropy at each iteration to estimate uncertainty based on the average size of conformal prediction sets. In contrast to conditional entropy, we find that conformal prediction sets are a distribution-free and robust method of measuring uncertainty. Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.

Title: NOVO: Unlearning-Compliant Vision Transformers

Authors: Soumya Roy, Soumya Banerjee, Vinay Verma, Soumik Dasgupta, Deepak Gupta, Piyush Rai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03281
Pdf URL: https://arxiv.org/pdf/2507.03281
Copy Paste: [[2507.03281]] NOVO: Unlearning-Compliant Vision Transformers(https://arxiv.org/abs/2507.03281)
Keywords: attack, membership infer, transformer
Abstract: Machine unlearning (MUL) refers to the problem of making a pre-trained model selectively forget some training instances or class(es) while retaining performance on the remaining dataset. Existing MUL research involves fine-tuning using a forget and/or retain set, making it expensive and/or impractical, and often causing performance degradation in the unlearned model. We introduce {\pname}, an unlearning-aware vision transformer-based architecture that can directly perform unlearning for future unlearning requests without any fine-tuning over the requested set. The proposed model is trained by simulating unlearning during the training process itself. It involves randomly separating class(es)/sub-class(es) present in each mini-batch into two disjoint sets: a proxy forget-set and a retain-set, and the model is optimized so that it is unable to predict the forget-set. Forgetting is achieved by withdrawing keys, making unlearning on-the-fly and avoiding performance degradation. The model is trained jointly with learnable keys and original weights, ensuring withholding a key irreversibly erases information, validated by membership inference attack scores. Extensive experiments on various datasets, architectures, and resolutions confirm {\pname}'s superiority over both fine-tuning-free and fine-tuning-based methods.

Title: MolVision: Molecular Property Prediction with Vision Language Models

Authors: Deepan Adak, Yogesh Singh Rawat, Shruti Vyas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03283
Pdf URL: https://arxiv.org/pdf/2507.03283
Copy Paste: [[2507.03283]] MolVision: Molecular Property Prediction with Vision Language Models(https://arxiv.org/abs/2507.03283)
Keywords: large language model
Abstract: Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adaptation of vision encoder for molecular images in conjunction with LoRA further improves the performance. The code and data is available at : $\href{this https URL}{this https URL}$.

Title: Global Variational Inference Enhanced Robust Domain Adaptation

Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03291
Pdf URL: https://arxiv.org/pdf/2507.03291
Copy Paste: [[2507.03291]] Global Variational Inference Enhanced Robust Domain Adaptation(https://arxiv.org/abs/2507.03291)
Keywords: robust, diffusion, generative
Abstract: Deep learning-based domain adaptation (DA) methods have shown strong performance by learning transferable representations. However, their reliance on mini-batch training limits global distribution modeling, leading to unstable alignment and suboptimal generalization. We propose Global Variational Inference Enhanced Domain Adaptation (GVI-DA), a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples. Extensive experiments on four benchmarks and thirty-eight DA tasks demonstrate consistent state-of-the-art performance. We also derive the model's evidence lower bound (ELBO) and analyze the effects of prior continuity, codebook size, and pseudo-label noise tolerance. In addition, we compare GVI-DA with diffusion-based generative frameworks in terms of optimization principles and efficiency, highlighting both its theoretical soundness and practical advantages.

Title: MGAA: Multi-Granular Adaptive Allocation fof Low-Rank Compression of LLMs

Authors: Guangyan Li, Yongqiang Tang, Wensheng Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03294
Pdf URL: https://arxiv.org/pdf/2507.03294
Copy Paste: [[2507.03294]] MGAA: Multi-Granular Adaptive Allocation fof Low-Rank Compression of LLMs(https://arxiv.org/abs/2507.03294)
Keywords: large language model
Abstract: The enormous parameter scale of large language models (LLMs) has made model compression a research hotspot, which aims to alleviate computational resource demands during deployment and inference. As a promising direction, low-rank approximation technique has made remarkable achievements. Nevertheless, unfortunately, the vast majority of studies to low-rank approximation compression generally apply uniform compression ratios across all weight matrices, while disregarding their inherently differentiated impacts on the model's performance. Although a few recent work attempts to employ heuristic search strategies to achieve the optimal parameter allocation, such strategies are computationally inefficient and lose the generalization ability in the era of LLMs. In this study, we propose a novel parameter Multi-Granular Adaptive Allocation (MGAA) method, which can adaptively allocate parameters between and within sublayers without task-specific evaluations in the compression process. MGAA consists of two components: 1) Among different sublayers, it assigns compression ratios based on their cosine similarity between inputs and outputs, allowing for a more tailored compression in sublayers with varying degrees of importance, and 2) Within each sublayer, it allocates different compression ratios to weight matrices based on their energy distribution characteristics, ensuring a consistent energy retention ratio while optimizing compression efficiency. Comprehensive evaluations of MGAA across multiple LLMs backbone models and benchmark datasets demonstrate its superior performance. Additionally, we apply our MGAA to multimodal model LLaVA, exhibiting remarkable performance improvements.

Title: CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Yaqi Wang, Chengfeng Zhou, Xiaobo Li, Dahong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03295
Pdf URL: https://arxiv.org/pdf/2507.03295
Copy Paste: [[2507.03295]] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection(https://arxiv.org/abs/2507.03295)
Keywords: diffusion, generative
Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

Title: LRM-1B: Towards Large Routing Model

Authors: Han Li, Fei Liu, Zhenkun Wang, Qingfu Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03300
Pdf URL: https://arxiv.org/pdf/2507.03300
Copy Paste: [[2507.03300]] LRM-1B: Towards Large Routing Model(https://arxiv.org/abs/2507.03300)
Keywords: large language model
Abstract: Vehicle routing problems (VRPs) are central to combinatorial optimization with significant practical implications. Recent advancements in neural combinatorial optimization (NCO) have demonstrated promising results by leveraging neural networks to solve VRPs, yet the exploration of model scaling within this domain remains underexplored. Inspired by the success of model scaling in large language models (LLMs), this study introduces a Large Routing Model with 1 billion parameters (LRM-1B), designed to address diverse VRP scenarios. We present a comprehensive evaluation of LRM-1B across multiple problem variants, distributions, and sizes, establishing state-of-the-art results. Our findings reveal that LRM-1B not only adapts to different VRP challenges but also showcases superior performance, outperforming existing models. Additionally, we explore the scaling behavior of neural routing models from 1M to 1B parameters. Our analysis confirms power-law between multiple model factors and performance, offering critical insights into the optimal configurations for foundation neural routing solvers.

Title: Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model

Authors: Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03302
Pdf URL: https://arxiv.org/pdf/2507.03302
Copy Paste: [[2507.03302]] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model(https://arxiv.org/abs/2507.03302)
Keywords: segmentation
Abstract: In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at this https URL

Title: Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

Authors: Hai Huang, Yan Xia, Sashuai Zhou, Hanting Wang, Shulei Wang, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03304
Pdf URL: https://arxiv.org/pdf/2507.03304
Copy Paste: [[2507.03304]] Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations(https://arxiv.org/abs/2507.03304)
Keywords: robust
Abstract: Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques, such as data manipulation, learning strategies, and representation learning, have shown significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields sub-optimal results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified representations. Extensive experiments on benchmark datasets, including EPIC-Kitchens and Human-Animal-Cartoon, demonstrate the effectiveness and superiority of our method in enhancing multi-modal domain generalization.

Title: MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion

Authors: Peilin Tao, Hainan Cui, Diantao Tu, Shuhan Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03306
Pdf URL: https://arxiv.org/pdf/2507.03306
Copy Paste: [[2507.03306]] MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion(https://arxiv.org/abs/2507.03306)
Keywords: robust
Abstract: Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at this https URL.

Title: ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series

Authors: Weihong Li, Anpeng Wu, Kun Kuang, Keting Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03310
Pdf URL: https://arxiv.org/pdf/2507.03310
Copy Paste: [[2507.03310]] ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series(https://arxiv.org/abs/2507.03310)
Keywords: interpretability
Abstract: This paper studies causal discovery in irregularly sampled time series-a pivotal challenge in high-stakes domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. Traditional methods (e.g., Granger causality, PCMCI) fail to reconcile multi-scale interactions (e.g., hourly storms vs. decadal climate shifts), while neural approaches (e.g., CUTS+) lack interpretability, stemming from a critical gap: existing frameworks either rigidly assume temporal regularity or aggregate dynamics into opaque representations, neglecting real-world granularity and auditable logic. To bridge this gap, we propose ReTimeCausal, a novel integration of Additive Noise Models (ANM) and Expectation-Maximization (EM) that unifies physics-guided data imputation with sparse causal inference. Through kernelized sparse regression and structural constraints, ReTimeCausal iteratively refines missing values (E-step) and causal graphs (M-step), resolving cross-frequency dependencies and missing data issues. Extensive experiments on synthetic and real-world datasets demonstrate that ReTimeCausal outperforms existing state-of-the-art methods under challenging irregular sampling and missing data conditions.

Title: GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Authors: Himanshu Dutta, Sunny Manchanda, Prakhar Bapat, Meva Ram Gurjar, Pushpak Bhattacharyya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03311
Pdf URL: https://arxiv.org/pdf/2507.03311
Copy Paste: [[2507.03311]] GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation(https://arxiv.org/abs/2507.03311)
Keywords: large language model, segmentation
Abstract: Document level Machine Translation (DocMT) approaches often struggle with effectively capturing discourse level phenomena. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state of the art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d BLEU for domain specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse level phenomena, yielding coherent and contextually accurate translations.

Title: MPX: Mixed Precision Training for JAX

Authors: Alexander Gräfe, Sebastian Trimpe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03312
Pdf URL: https://arxiv.org/pdf/2507.03312
Copy Paste: [[2507.03312]] MPX: Mixed Precision Training for JAX(https://arxiv.org/abs/2507.03312)
Keywords: robust
Abstract: Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training in recent years. Concurrently, JAX has grown in popularity as a versatile machine learning toolbox. However, it currently lacks robust support for mixed-precision training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks while preserving model accuracy. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions with minimal modifications. By casting both inputs and outputs to half precision, and introducing a dynamic loss-scaling mechanism, MPX alleviates issues like gradient underflow and overflow that commonly arise in half precision computations. Its design inherits critical features from JAX's type-promotion behavior, ensuring that operations take place in the correct precision and allowing for selective enforcement of full precision where needed (e.g., sums, means, or softmax). MPX further provides wrappers for automatic creation and management of mixed-precision gradients and optimizers, enabling straightforward integration into existing JAX training pipelines. MPX's source code, documentation, and usage examples are available at this http URL.

Title: Personalized Image Generation from an Author Writing Style

Authors: Sagar Gandhi, Vishal Gandhi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03313
Pdf URL: https://arxiv.org/pdf/2507.03313
Copy Paste: [[2507.03313]] Personalized Image Generation from an Author Writing Style(https://arxiv.org/abs/2507.03313)
Keywords: diffusion, generative, large language model
Abstract: Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author's literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: $4.08/5$), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline's ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.

Title: Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

Authors: Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03318
Pdf URL: https://arxiv.org/pdf/2507.03318
Copy Paste: [[2507.03318]] Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization(https://arxiv.org/abs/2507.03318)
Keywords: interpretability, explainability
Abstract: Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable machine learning models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as limited activity data per target and the sensitivity of properties to subtle molecular changes. To address this, we leveraged activity-cliff molecule pairs, i.e., compounds sharing a common scaffold but differing sharply in potency, targeting three proto-oncogene tyrosine-protein kinase Src proteins (i.e., PDB IDs 1O42, 2H8H, and 4MXO). We implemented graph neural network (GNN) methods to obtain atom-level feature information and predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). In addition, we trained GNN models with different structure-aware loss functions to adequately leverage molecular property and structure information. We also utilized group lasso and sparse group lasso to prune and highlight molecular subgraphs and enhance the structure-specific model explainability for the predicted property difference in molecular activity-cliff pairs. We improved drug property prediction by integrating common and uncommon node information and using sparse group lasso, reducing the average root mean squared error (RMSE) by 12.70%, and achieving the lowest averaged RMSE=0.2551 and the highest PCC=0.9572. Furthermore, applying regularization enhances feature attribution methods that estimate the contribution of each atom in the molecular graphs by boosting global direction scores and atom-level accuracy in atom coloring accuracy, which improves model interpretability in drug discovery pipelines, particularly in investigating important molecular substructures in lead optimization.

Title: Source-Free Domain Adaptation via Multi-view Contrastive Learning

Authors: Amirfarhad Farhadi, Naser Mozayani, Azadeh Zamanifar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03321
Pdf URL: https://arxiv.org/pdf/2507.03321
Copy Paste: [[2507.03321]] Source-Free Domain Adaptation via Multi-view Contrastive Learning(https://arxiv.org/abs/2507.03321)
Keywords: privacy
Abstract: Domain adaptation has become a widely adopted approach in machine learning due to the high costs associated with labeling data. It is typically applied when access to a labeled source domain is available. However, in real-world scenarios, privacy concerns often restrict access to sensitive information, such as fingerprints, bank account details, and facial images. A promising solution to this issue is Source-Free Unsupervised Domain Adaptation (SFUDA), which enables domain adaptation without requiring access to labeled target domain data. Recent research demonstrates that SFUDA can effectively address domain discrepancies; however, two key challenges remain: (1) the low quality of prototype samples, and (2) the incorrect assignment of pseudo-labels. To tackle these challenges, we propose a method consisting of three main phases. In the first phase, we introduce a Reliable Sample Memory (RSM) module to improve the quality of prototypes by selecting more representative samples. In the second phase, we employ a Multi-View Contrastive Learning (MVCL) approach to enhance pseudo-label quality by leveraging multiple data augmentations. In the final phase, we apply a noisy label filtering technique to further refine the pseudo-labels. Our experiments on three benchmark datasets - VisDA 2017, Office-Home, and Office-31 - demonstrate that our method achieves approximately 2 percent and 6 percent improvements in classification accuracy over the second-best method and the average of 13 well-known state-of-the-art approaches, respectively.

Title: A Note on Single-Cut Full-Open Protocols

Authors: Kazumasa Shinagawa, Koji Nuida
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03323
Pdf URL: https://arxiv.org/pdf/2507.03323
Copy Paste: [[2507.03323]] A Note on Single-Cut Full-Open Protocols(https://arxiv.org/abs/2507.03323)
Keywords: secure
Abstract: Card-based cryptography is a research area that realizes cryptographic protocols such as secure computation by applying shuffles to sequences of cards that encode input values. A single-cut full-open protocol is one that obtains an output value by applying a random cut to an input sequence of cards, after which all cards are opened. In this paper, we propose three single-cut full-open protocols: two protocols for three-variable functions and one protocol for a four-variable function.

Title: Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

Authors: Zhao Wang, Bowen Chen, Yotaro Shimose, Sota Moriyama, Heng Wang, Shingo Takamatsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03326
Pdf URL: https://arxiv.org/pdf/2507.03326
Copy Paste: [[2507.03326]] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents(https://arxiv.org/abs/2507.03326)
Keywords: diffusion, generative
Abstract: Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity -- they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.

Title: Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Authors: Yuanxin Wang, Ganesh Venkatesh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03327
Pdf URL: https://arxiv.org/pdf/2507.03327
Copy Paste: [[2507.03327]] Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs(https://arxiv.org/abs/2507.03327)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal `reading' or deliberation phase before `speaking' (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to `read silently.' Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or `reading space' before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a `reading buddy' architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can provide surprisingly strong impact on accuracy with multiple point accuracy boost.

Title: Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03331
Pdf URL: https://arxiv.org/pdf/2507.03331
Copy Paste: [[2507.03331]] Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling(https://arxiv.org/abs/2507.03331)
Keywords: generative
Abstract: To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.

Title: De-Fake: Style based Anomaly Deepfake Detection

Authors: Sudev Kumar Padhi, Harshit Kumar, Umesh Kashyap, Sk. Subidh Ali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03334
Pdf URL: https://arxiv.org/pdf/2507.03334
Copy Paste: [[2507.03334]] De-Fake: Style based Anomaly Deepfake Detection(https://arxiv.org/abs/2507.03334)
Keywords: privacy, protect
Abstract: Detecting deepfakes involving face-swaps presents a significant challenge, particularly in real-world scenarios where anyone can perform face-swapping with freely available tools and apps without any technical knowledge. Existing deepfake detection methods rely on facial landmarks or inconsistencies in pixel-level features and often struggle with face-swap deepfakes, where the source face is seamlessly blended into the target image or video. The prevalence of face-swap is evident in everyday life, where it is used to spread false information, damage reputations, manipulate political opinions, create non-consensual intimate deepfakes (NCID), and exploit children by enabling the creation of child sexual abuse material (CSAM). Even prominent public figures are not immune to its impact, with numerous deepfakes of them circulating widely across social media platforms. Another challenge faced by deepfake detection methods is the creation of datasets that encompass a wide range of variations, as training models require substantial amounts of data. This raises privacy concerns, particularly regarding the processing and storage of personal facial data, which could lead to unauthorized access or misuse. Our key idea is to identify these style discrepancies to detect face-swapped images effectively without accessing the real facial image. We perform comprehensive evaluations using multiple datasets and face-swapping methods, which showcases the effectiveness of SafeVision in detecting face-swap deepfakes across diverse scenarios. SafeVision offers a reliable and scalable solution for detecting face-swaps in a privacy preserving manner, making it particularly effective in challenging real-world applications. To the best of our knowledge, SafeVision is the first deepfake detection using style features while providing inherent privacy protection.

Title: Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

Authors: Naoki Nishikawa, Rei Higuchi, Taiji Suzuki
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03340
Pdf URL: https://arxiv.org/pdf/2507.03340
Copy Paste: [[2507.03340]] Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency(https://arxiv.org/abs/2507.03340)
Keywords: transformer
Abstract: Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking the diverse roles and complexities of them. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore, we introduce an efficient layerwise training strategy to learn nonlinear features tailored to each layer. Experiments on multiple pre-trained transformers demonstrate that our method improves the performance of distilled models compared to baselines without increasing the inference cost. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.

Title: SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

Authors: Yuxiang Mei, Yuang Zheng, Dongxing Xu, Yanhua Long
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2507.03343
Pdf URL: https://arxiv.org/pdf/2507.03343
Copy Paste: [[2507.03343]] SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge(https://arxiv.org/abs/2507.03343)
Keywords: large language model
Abstract: This paper describes SHNU multilingual conversational speech recognition system (SHNU-mASR, team name-"maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate an additional language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER, without increasing the baseline training data.

Title: Securing Mixed Rust with Hardware Capabilities

Authors: Jason Zhijingcheng Yu, Fangqi Han, Kaustab Choudhury, Trevor E. Carlson, Prateek Saxena
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2507.03344
Pdf URL: https://arxiv.org/pdf/2507.03344
Copy Paste: [[2507.03344]] Securing Mixed Rust with Hardware Capabilities(https://arxiv.org/abs/2507.03344)
Keywords: security
Abstract: The Rust programming language enforces three basic Rust principles, namely ownership, borrowing, and AXM (Aliasing Xor Mutability) to prevent security bugs such as memory safety violations and data races. However, Rust projects often have mixed code, i.e., code that also uses unsafe Rust, FFI (Foreign Function Interfaces), and inline assembly for low-level control. The Rust compiler is unable to statically enforce Rust principles in mixed Rust code which can lead to many security vulnerabilities. In this paper, we propose CapsLock, a security enforcement mechanism that can run at the level of machine code and detect Rust principle violations at run-time in mixed code. CapsLock is kept simple enough to be implemented into recent capability-based hardware abstractions that provide low-cost spatial memory safety. CapsLock introduces a novel revoke-on-use abstraction for capability-based designs, wherein accessing a memory object via a capability implicitly invalidates certain other capabilities pointing to it, thereby also providing temporal memory safety automatically, without requiring software to explicitly specify such invalidation. Thus, CapsLock is the first mechanism capable of providing cross-language enforcement of Rust principles. We implemented a prototype of CapsLock on QEMU. Evaluation results show that CapsLock is highly compatible with existing Rust code (passing 99.7% of the built-in test cases of the 100 most popular crates) and flags Rust principle violations in real-world Rust projects that use FFI or inline assembly. We discovered 8 previously unknown bugs in such crates in our experiments.

Title: Accelerating Private Heavy Hitter Detection on Continual Observation Streams

Authors: Rayne Holland
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03361
Pdf URL: https://arxiv.org/pdf/2507.03361
Copy Paste: [[2507.03361]] Accelerating Private Heavy Hitter Detection on Continual Observation Streams(https://arxiv.org/abs/2507.03361)
Keywords: privacy
Abstract: Differentially private frequency estimation and heavy hitter detection are core problems in the private analysis of data streams. Two models are typically considered: the one-pass model, which outputs results only at the end of the stream, and the continual observation model, which requires releasing private summaries at every time step. While the one-pass model allows more efficient solutions, continual observation better reflects scenarios where timely and ongoing insights are critical. In the one-pass setting, sketches have proven to be an effective tool for differentially private frequency analysis, as they can be privatized by a single injection of calibrated noise. In contrast, existing methods in the continual observation model add fresh noise to the entire sketch at every step, incurring high computational costs. This challenge is particularly acute for heavy hitter detection, where current approaches often require querying every item in the universe at each step, resulting in untenable per-update costs for large domains. To overcome these limitations, we introduce a new differentially private sketching technique based on lazy updates, which perturbs and updates only a small, rotating part of the output sketch at each time step. This significantly reduces computational overhead while maintaining strong privacy and utility guarantees. In comparison to prior art, for frequency estimation, our method improves the update time by a factor of $O(w)$ for sketches of dimension $d \times w$; for heavy hitter detection, it reduces per-update complexity from $\Omega(|U|)$ to $O(d \log w)$, where $U$ is the input domain. Experiments show a increase in throughput by a factor of~$250$, making differential privacy more practical for real-time, continual observation, applications.

Title: Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization

Authors: Buqing Nie, Yangqing Fu, Jingtian Ji, Yue Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03372
Pdf URL: https://arxiv.org/pdf/2507.03372
Copy Paste: [[2507.03372]] Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization(https://arxiv.org/abs/2507.03372)
Keywords: robust
Abstract: Reinforcement Learning (RL) has achieved remarkable success in sequential decision tasks. However, recent studies have revealed the vulnerability of RL policies to different perturbations, raising concerns about their effectiveness and safety in real-world applications. In this work, we focus on the robustness of RL policies against action perturbations and introduce a novel framework called Optimal Adversary-aware Policy Iteration (OA-PI). Our framework enhances action robustness under various perturbations by evaluating and improving policy performance against the corresponding optimal adversaries. Besides, our approach can be integrated into mainstream DRL algorithms such as Twin Delayed DDPG (TD3) and Proximal Policy Optimization (PPO), improving action robustness effectively while maintaining nominal performance and sample efficiency. Experimental results across various environments demonstrate that our method enhances robustness of DRL policies against different action adversaries effectively.

Title: WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Authors: Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03373
Pdf URL: https://arxiv.org/pdf/2507.03373
Copy Paste: [[2507.03373]] WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia(https://arxiv.org/abs/2507.03373)
Keywords: large language model
Abstract: Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

Title: MRC-DETR: An Adaptive Multi-Residual Coupled Transformer for Bare Board PCB Defect Detection

Authors: Jiangzhong Cao, Huanqi Wu, Xu Zhang, Lianghong Tan, Huan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03386
Pdf URL: https://arxiv.org/pdf/2507.03386
Copy Paste: [[2507.03386]] MRC-DETR: An Adaptive Multi-Residual Coupled Transformer for Bare Board PCB Defect Detection(https://arxiv.org/abs/2507.03386)
Keywords: transformer
Abstract: In modern electronic manufacturing, defect detection on Printed Circuit Boards (PCBs) plays a critical role in ensuring product yield and maintaining the reliability of downstream assembly processes. However, existing methods often suffer from limited feature representation, computational redundancy, and insufficient availability of high-quality training data -- challenges that hinder their ability to meet industrial demands for both accuracy and efficiency. To address these limitations, we propose MRC-DETR, a novel and efficient detection framework tailored for bare PCB defect inspection, built upon the foundation of RT-DETR. Firstly, to enhance feature representation capability, we design a Multi-Residual Directional Coupled Block (MRDCB). This module improves channel-wise feature interaction through a multi-residual structure. Moreover, a cross-spatial learning strategy is integrated to capture fine-grained pixel-level relationships, further enriching the representational power of the extracted features. Secondly, to reduce computational redundancy caused by inefficient cross-layer information fusion, we introduce an Adaptive Screening Pyramid Network (ASPN). This component dynamically filters and aggregates salient low-level features, selectively fusing them with high-level semantic features. By focusing on informative regions and suppressing redundant computations, ASPN significantly improves both efficiency and detection accuracy. Finally, to tackle the issue of insufficient training data, particularly in the context of bare PCBs, we construct a new, high-quality dataset that fills a critical gap in current public resources. Our dataset not only supports the training and evaluation of our proposed framework but also serves as a valuable benchmark for future research in this domain.

Title: Breaking the Bulkhead: Demystifying Cross-Namespace Reference Vulnerabilities in Kubernetes Operators

Authors: Andong Chen, Zhaoxuan Jin, Ziyi Guo, Yan Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03387
Pdf URL: https://arxiv.org/pdf/2507.03387
Copy Paste: [[2507.03387]] Breaking the Bulkhead: Demystifying Cross-Namespace Reference Vulnerabilities in Kubernetes Operators(https://arxiv.org/abs/2507.03387)
Keywords: security, attack
Abstract: Kubernetes Operators, automated tools designed to manage application lifecycles within Kubernetes clusters, extend the functionalities of Kubernetes, and reduce the operational burden on human engineers. While Operators significantly simplify DevOps workflows, they introduce new security risks. In particular, Kubernetes enforces namespace isolation to separate workloads and limit user access, ensuring that users can only interact with resources within their authorized namespaces. However, Kubernetes Operators often demand elevated privileges and may interact with resources across multiple namespaces. This introduces a new class of vulnerabilities, the Cross-Namespace Reference Vulnerability. The root cause lies in the mismatch between the declared scope of resources and the implemented scope of the Operator logic, resulting in Kubernetes being unable to properly isolate the namespace. Leveraging such vulnerability, an adversary with limited access to a single authorized namespace may exploit the Operator to perform operations affecting other unauthorized namespaces, causing Privilege Escalation and further impacts. To the best of our knowledge, this paper is the first to systematically investigate the security vulnerability of Kubernetes Operators. We present Cross-Namespace Reference Vulnerability with two strategies, demonstrating how an attacker can bypass namespace isolation. Through large-scale measurements, we found that over 14% of Operators in the wild are potentially vulnerable. Our findings have been reported to the relevant developers, resulting in 7 confirmations and 6 CVEs by the time of submission, affecting vendors including ****** and ******, highlighting the critical need for enhanced security practices in Kubernetes Operators. To mitigate it, we also open-source the static analysis suite to benefit the ecosystem.

Title: Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03393
Pdf URL: https://arxiv.org/pdf/2507.03393
Copy Paste: [[2507.03393]] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos(https://arxiv.org/abs/2507.03393)
Keywords: diffusion
Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at this https URL.

Title: Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Authors: Yuran Dong, Mang Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03402
Pdf URL: https://arxiv.org/pdf/2507.03402
Copy Paste: [[2507.03402]] Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images(https://arxiv.org/abs/2507.03402)
Keywords: robust, diffusion
Abstract: To advance real-world fashion image editing, we analyze existing two-stage pipelines(mask generation followed by diffusion-based editing)which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence,Stabilization,Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.

Title: Graph Repairs with Large Language Models: An Empirical Study

Authors: Hrishikesh Terdalkar, Angela Bonifati, Andrea Mauri
Subjects: cs.CL, cs.DB, cs.ET
Abstract URL: https://arxiv.org/abs/2507.03410
Pdf URL: https://arxiv.org/pdf/2507.03410
Copy Paste: [[2507.03410]] Graph Repairs with Large Language Models: An Empirical Study(https://arxiv.org/abs/2507.03410)
Keywords: interpretability, large language model
Abstract: Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost--both in terms of time and effort--of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.

Title: A Hybrid Game-Theory and Deep Learning Framework for Predicting Tourist Arrivals via Big Data Analytics and Opinion Leader Detection

Authors: Ali Nikseresht
Subjects: cs.LG, cs.ET, cs.GT, eess.SP
Abstract URL: https://arxiv.org/abs/2507.03411
Pdf URL: https://arxiv.org/pdf/2507.03411
Copy Paste: [[2507.03411]] A Hybrid Game-Theory and Deep Learning Framework for Predicting Tourist Arrivals via Big Data Analytics and Opinion Leader Detection(https://arxiv.org/abs/2507.03411)
Keywords: robust
Abstract: In the era of Industry 5.0, data-driven decision-making has become indispensable for optimizing systems across Industrial Engineering. This paper addresses the value of big data analytics by proposing a novel non-linear hybrid approach for forecasting international tourist arrivals in two different contexts: (i) arrivals to Hong Kong from five major source nations (pre-COVID-19), and (ii) arrivals to Sanya in Hainan province, China (post-COVID-19). The method integrates multiple sources of Internet big data and employs an innovative game theory-based algorithm to identify opinion leaders on social media platforms. Subsequently, nonstationary attributes in tourism demand data are managed through Empirical Wavelet Transform (EWT), ensuring refined time-frequency analysis. Finally, a memory-aware Stacked Bi-directional Long Short-Term Memory (Stacked BiLSTM) network is used to generate accurate demand forecasts. Experimental results demonstrate that this approach outperforms existing state-of-the-art techniques and remains robust under dynamic and volatile conditions, highlighting its applicability to broader Industrial Engineering domains, such as logistics, supply chain management, and production planning, where forecasting and resource allocation are key challenges. By merging advanced Deep Learning (DL), time-frequency analysis, and social media insights, the proposed framework showcases how large-scale data can elevate the quality and efficiency of decision-making processes.

Title: SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Authors: Michał Perełkiewicz, Sławomir Dadas, Rafał Poświata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03415
Pdf URL: https://arxiv.org/pdf/2507.03415
Copy Paste: [[2507.03415]] SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation(https://arxiv.org/abs/2507.03415)
Keywords: robust
Abstract: This article introduces semantically meaningful causal language modeling (SMCLM), a selfsupervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.

Title: Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense

Authors: Lina Ma, Xiaowei Fu, Fuxiang Huang, Xinbo Gao, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03427
Pdf URL: https://arxiv.org/pdf/2507.03427
Copy Paste: [[2507.03427]] Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense(https://arxiv.org/abs/2507.03427)
Keywords: defense, attack, robust
Abstract: Existing defense methods fail to defend against unknown attacks and thus raise generalization issue of adversarial robustness. To remedy this problem, we attempt to delve into some underlying common characteristics among various attacks for generality. In this work, we reveal the commonly overlooked low entropy prior (LE) implied in various adversarial samples, and shed light on the universal robustness against unseen attacks in inference phase. LE prior is elaborated as two properties across various attacks as shown in Fig. 1 and Fig. 2: 1) low entropy misclassification for adversarial samples and 2) lower entropy prediction for higher attack intensity. This phenomenon stands in stark contrast to the naturally distributed samples. The LE prior can instruct existing test-time defense methods, thus we propose a two-stage REAL approach: Rectify Adversarial sample based on LE prior for test-time adversarial rectification. Specifically, to align adversarial samples more closely with clean samples, we propose to first rectify adversarial samples misclassified with low entropy by reverse maximizing prediction entropy, thereby eliminating their adversarial nature. To ensure the rectified samples can be correctly classified with low entropy, we carry out secondary rectification by forward minimizing prediction entropy, thus creating a Max-Min entropy optimization scheme. Further, based on the second property, we propose an attack-aware weighting mechanism to adaptively adjust the strengths of Max-Min entropy objectives. Experiments on several datasets show that REAL can greatly improve the performance of existing sample rectification models.

Title: Multi-Level Fusion Graph Neural Network for Molecule Property Prediction

Authors: XiaYu Liu, Hou-biao Li, Yang Liu, Chao Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03430
Pdf URL: https://arxiv.org/pdf/2507.03430
Copy Paste: [[2507.03430]] Multi-Level Fusion Graph Neural Network for Molecule Property Prediction(https://arxiv.org/abs/2507.03430)
Keywords: interpretability, transformer
Abstract: Accurate molecular property prediction is essential in drug discovery and related fields. However, existing graph neural networks (GNNs) often struggle to simultaneously capture both local and global molecular structures. In this work, we propose a Multi-Level Fusion Graph Neural Network (MLFGNN) that integrates Graph Attention Networks and a novel Graph Transformer to jointly model local and global dependencies. In addition, we incorporate molecular fingerprints as a complementary modality and introduce a mechanism of interaction between attention to adaptively fuse information across representations. Extensive experiments on multiple benchmark datasets demonstrate that MLFGNN consistently outperforms state-of-the-art methods in both classification and regression tasks. Interpretability analysis further reveals that the model effectively captures task-relevant chemical patterns, supporting the usefulness of multi-level and multi-modal fusion in molecular representation learning.

Title: Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Authors: Adrien Bazoge, Pacôme Constant dit Beaufils, Mohammed Hmitouch, Romain Bourcier, Emmanuel Morin, Richard Dufour, Béatrice Daille, Pierre-Antoine Gourraud, Matilde Karakachoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03433
Pdf URL: https://arxiv.org/pdf/2507.03433
Copy Paste: [[2507.03433]] Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models(https://arxiv.org/abs/2507.03433)
Keywords: extraction, large language model
Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.

Title: Unlearning the Noisy Correspondence Makes CLIP More Robust

Authors: Haochen Han, Alex Jinpeng Wang, Peijun Ye, Fangming Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.03434
Pdf URL: https://arxiv.org/pdf/2507.03434
Copy Paste: [[2507.03434]] Unlearning the Noisy Correspondence Makes CLIP More Robust(https://arxiv.org/abs/2507.03434)
Keywords: robust
Abstract: The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative information, which can provide explicit unlearning direction for both false positives and false negatives. Such twin goals unlearning process can be formalized into one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer while with lower computational overhead. The code will be released upon acceptance.

Title: Radar Tracker: Moving Instance Tracking in Sparse and Noisy Radar Point Clouds

Authors: Matthias Zeller, Daniel Casado Herraez, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03441
Pdf URL: https://arxiv.org/pdf/2507.03441
Copy Paste: [[2507.03441]] Radar Tracker: Moving Instance Tracking in Sparse and Noisy Radar Point Clouds(https://arxiv.org/abs/2507.03441)
Keywords: segmentation
Abstract: Robots and autonomous vehicles should be aware of what happens in their surroundings. The segmentation and tracking of moving objects are essential for reliable path planning, including collision avoidance. We investigate this estimation task for vehicles using radar sensing. We address moving instance tracking in sparse radar point clouds to enhance scene interpretation. We propose a learning-based radar tracker incorporating temporal offset predictions to enable direct center-based association and enhance segmentation performance by including additional motion cues. We implement attention-based tracking for sparse radar scans to include appearance features and enhance performance. The final association combines geometric and appearance features to overcome the limitations of center-based tracking to associate instances reliably. Our approach shows an improved performance on the moving instance tracking benchmark of the RadarScenes dataset compared to the current state of the art.

Title: Evaluating the Evaluators: Trust in Adversarial Robustness Tests

Authors: Antonio Emanuele Cinà, Maura Pintor, Luca Demetrio, Ambra Demontis, Battista Biggio, Fabio Roli
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03450
Pdf URL: https://arxiv.org/pdf/2507.03450
Copy Paste: [[2507.03450]] Evaluating the Evaluators: Trust in Adversarial Robustness Tests(https://arxiv.org/abs/2507.03450)
Keywords: security, attack, robust
Abstract: Despite significant progress in designing powerful adversarial evasion attacks for robustness verification, the evaluation of these methods often remains inconsistent and unreliable. Many assessments rely on mismatched models, unverified implementations, and uneven computational budgets, which can lead to biased results and a false sense of security. Consequently, robustness claims built on such flawed testing protocols may be misleading and give a false sense of security. As a concrete step toward improving evaluation reliability, we present AttackBench, a benchmark framework developed to assess the effectiveness of gradient-based attacks under standardized and reproducible conditions. AttackBench serves as an evaluation tool that ranks existing attack implementations based on a novel optimality metric, which enables researchers and practitioners to identify the most reliable and effective attack for use in subsequent robustness evaluations. The framework enforces consistent testing conditions and enables continuous updates, making it a reliable foundation for robustness verification.

Title: Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach

Authors: Leyan Xue, Zongbo Han, Guangyu Wang, Qinghua Hu, Mingyue Cheng, Changqing Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03458
Pdf URL: https://arxiv.org/pdf/2507.03458
Copy Paste: [[2507.03458]] Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach(https://arxiv.org/abs/2507.03458)
Keywords: robust, large language model
Abstract: Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to ``See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that D&D achieves promising performance.

Title: Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds

Authors: Matthias Zeller, Vardeep S. Sandhu, Benedikt Mersch, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03463
Pdf URL: https://arxiv.org/pdf/2507.03463
Copy Paste: [[2507.03463]] Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds(https://arxiv.org/abs/2507.03463)
Keywords: transformer, segmentation
Abstract: The awareness about moving objects in the surroundings of a self-driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires to accumulate and process temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and, hence incorporate instantaneous motion information within a single measurement. % In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars, to enhance scene understanding. Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer-based approach to perform single-scan moving object segmentation in sparse radar scans accurately. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non-moving objects. Additionally, we propose a transformer-based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state-of-the-art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single-scan radar data.

Title: Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Authors: Heather Lent
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03473
Pdf URL: https://arxiv.org/pdf/2507.03473
Copy Paste: [[2507.03473]] Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right(https://arxiv.org/abs/2507.03473)
Keywords: secure, security, attack
Abstract: Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of "English first" collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.

Title: Molecular Machine Learning Using Euler Characteristic Transforms

Authors: Victor Toscano-Duran, Florian Rottach, Bastian Rieck
Subjects: cs.LG, math.AT, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.03474
Pdf URL: https://arxiv.org/pdf/2507.03474
Copy Paste: [[2507.03474]] Molecular Machine Learning Using Euler Characteristic Transforms(https://arxiv.org/abs/2507.03474)
Keywords: robust, extraction
Abstract: The shape of a molecule determines its physicochemical and biological properties. However, it is often underrepresented in standard molecular representation learning approaches. Here, we propose using the Euler Characteristic Transform (ECT) as a geometrical-topological descriptor. Computed directly on a molecular graph derived from handcrafted atomic features, the ECT enables the extraction of multiscale structural features, offering a novel way to represent and encode molecular shape in the feature space. We assess the predictive performance of this representation across nine benchmark regression datasets, all centered around predicting the inhibition constant $K_i$. In addition, we compare our proposed ECT-based representation against traditional molecular representations and methods, such as molecular fingerprints/descriptors and graph neural networks (GNNs). Our results show that our ECT-based representation achieves competitive performance, ranking among the best-performing methods on several datasets. More importantly, its combination with traditional representations, particularly with the AVALON fingerprint, significantly \emph{enhances predictive performance}, outperforming other methods on most datasets. These findings highlight the complementary value of multiscale topological information and its potential for being combined with established techniques. Our study suggests that hybrid approaches incorporating explicit shape information can lead to more informative and robust molecular representations, enhancing and opening new avenues in molecular machine learning tasks. To support reproducibility and foster open biomedical research, we provide open access to all experiments and code used in this work.

Title: Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Authors: Eva Seidlmayer, Lukas Galke, Konrad U. Förstner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03488
Pdf URL: https://arxiv.org/pdf/2507.03488
Copy Paste: [[2507.03488]] Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences(https://arxiv.org/abs/2507.03488)
Keywords: large language model
Abstract: Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating, and updating the dataset is available on GitHub: this https URL

Title: Reinforcement Learning-based Feature Generation Algorithm for Scientific Data

Authors: Meng Xiao, Junfeng Zhou, Yuanchun Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03498
Pdf URL: https://arxiv.org/pdf/2507.03498
Copy Paste: [[2507.03498]] Reinforcement Learning-based Feature Generation Algorithm for Scientific Data(https://arxiv.org/abs/2507.03498)
Keywords: large language model
Abstract: Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face the following two challenges when dealing with the feature generation of scientific data: First, the effective construction of high-order feature combinations in scientific data necessitates profound and extensive domain-specific expertise. Secondly, as the order of feature combinations increases, the search space expands exponentially, imposing prohibitive human labor consumption. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by that, this paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multi-agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations ex-hibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates the large language models (LLMs) to interpreta-tively evaluate the generated features of each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.

Title: Decoupled Relative Learning Rate Schedules

Authors: Jan Ludziejewski, Jan Małaśnicki, Maciej Pióro, Michał Krutul, Kamil Ciebiera, Maciej Stefaniak, Jakub Krajewski, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Sebastian Jaszczur
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03526
Pdf URL: https://arxiv.org/pdf/2507.03526
Copy Paste: [[2507.03526]] Decoupled Relative Learning Rate Schedules(https://arxiv.org/abs/2507.03526)
Keywords: transformer
Abstract: In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our introduced relative learning rates, RLRS, method accelerates the training process by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to $27\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.

Title: Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding

Authors: Namho Kim, Junhwa Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03531
Pdf URL: https://arxiv.org/pdf/2507.03531
Copy Paste: [[2507.03531]] Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding(https://arxiv.org/abs/2507.03531)
Keywords: robust
Abstract: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained using a combination of classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.

Title: CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Authors: Elena Bueno-Benito, Mariella Dimiccoli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03539
Pdf URL: https://arxiv.org/pdf/2507.03539
Copy Paste: [[2507.03539]] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation(https://arxiv.org/abs/2507.03539)
Keywords: segmentation
Abstract: Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.

Title: Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition

Authors: Redwan Sony, Parisa Farmanifard, Arun Ross, Anil K. Jain
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03541
Pdf URL: https://arxiv.org/pdf/2507.03541
Copy Paste: [[2507.03541]] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition(https://arxiv.org/abs/2507.03541)
Keywords: explainability
Abstract: In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images than tightly cropped faces thereby suggesting the importance of contextual clues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250 while the TMR of domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., ``Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...''). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., ``Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person''), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.

Title: H2HTalk: Evaluating Large Language Models as Emotional Companion

Authors: Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03543
Pdf URL: https://arxiv.org/pdf/2507.03543
Copy Paste: [[2507.03543]] H2HTalk: Evaluating Large Language Models as Emotional Companion(https://arxiv.org/abs/2507.03543)
Keywords: secure, large language model
Abstract: As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.

Title: Communication Efficient, Differentially Private Distributed Optimization using Correlation-Aware Sketching

Authors: Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03545
Pdf URL: https://arxiv.org/pdf/2507.03545
Copy Paste: [[2507.03545]] Communication Efficient, Differentially Private Distributed Optimization using Correlation-Aware Sketching(https://arxiv.org/abs/2507.03545)
Keywords: secure, privacy, federate
Abstract: Federated learning with differential privacy suffers from two major costs: each client must transmit $d$-dimensional gradients every round, and the magnitude of DP noise grows with $d$. Yet empirical studies show that gradient updates exhibit strong temporal correlations and lie in a $k$-dimensional subspace with $k \ll d$. Motivated by this, we introduce DOME, a decentralized DP optimization framework in which each client maintains a compact sketch to project gradients into $\mathbb{R}^k$ before privatization and Secure Aggregation. This reduces per-round communication from order $d$ to order $k$ and moves towards a gradient approximation mean-squared error of $\sigma^2 k$. To allow the sketch to span new directions and prevent it from collapsing onto historical gradients, we augment it with random probes orthogonal to historical directions. We prove that our overall protocol satisfies $(\epsilon,\delta)$-Differential Privacy.

Title: An Advanced Deep Learning Framework for Ischemic and Hemorrhagic Brain Stroke Diagnosis Using Computed Tomography (CT) Images

Authors: Md. Sabbir Hossen, Eshat Ahmed Shuvo, Shibbir Ahmed Arif, Pabon Shaha, Md. Saiduzzaman, Mostofa Kamal Nasir
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03558
Pdf URL: https://arxiv.org/pdf/2507.03558
Copy Paste: [[2507.03558]] An Advanced Deep Learning Framework for Ischemic and Hemorrhagic Brain Stroke Diagnosis Using Computed Tomography (CT) Images(https://arxiv.org/abs/2507.03558)
Keywords: robust, extraction
Abstract: Brain stroke is one of the leading causes of mortality and long-term disability worldwide, highlighting the need for precise and fast prediction techniques. Computed Tomography (CT) scan is considered one of the most effective methods for diagnosing brain strokes. The majority of stroke classification techniques rely on a single slice-level prediction mechanism, allowing the radiologist to manually choose the most critical CT slice from the original CT volume. Although clinical evaluations are often used in traditional diagnostic procedures, machine learning (ML) has opened up new avenues for improving stroke diagnosis. To supplement traditional diagnostic techniques, this study investigates the use of machine learning models, specifically concerning the prediction of brain stroke at an early stage utilizing CT scan images. In this research, we proposed a novel approach to brain stroke detection leveraging machine learning techniques, focusing on optimizing classification performance with pre-trained deep learning models and advanced optimization strategies. Pre-trained models, including DenseNet201, InceptionV3, MobileNetV2, ResNet50, and Xception, are utilized for feature extraction. Additionally, we employed feature engineering techniques, including BFO, PCA, and LDA, to enhance models' performance further. These features are subsequently classified using machine learning algorithms such as SVC, RF, XGB, DT, LR, KNN, and GNB. Our experiments demonstrate that the combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy of 97.93%, significantly outperforming other model-optimizer-classifier combinations. The results underline the effectiveness of integrating lightweight pre-trained models with robust optimization and classification techniques for brain stroke diagnosis.

Title: 2.5D Object Detection for Intelligent Roadside Infrastructure

Authors: Nikolai Polley, Yacin Boualili, Ferdinand Mütsch, Maximilian Zipfl, Tobias Fleck, J. Marius Zöllner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03564
Pdf URL: https://arxiv.org/pdf/2507.03564
Copy Paste: [[2507.03564]] 2.5D Object Detection for Intelligent Roadside Infrastructure(https://arxiv.org/abs/2507.03564)
Keywords: robust
Abstract: On-board sensors of autonomous vehicles can be obstructed, occluded, or limited by restricted fields of view, complicating downstream driving decisions. Intelligent roadside infrastructure perception systems, installed at elevated vantage points, can provide wide, unobstructed intersection coverage, supplying a complementary information stream to autonomous vehicles via vehicle-to-everything (V2X) communication. However, conventional 3D object-detection algorithms struggle to generalize under the domain shift introduced by top-down perspectives and steep camera angles. We introduce a 2.5D object detection framework, tailored specifically for infrastructure roadside-mounted cameras. Unlike conventional 2D or 3D object detection, we employ a prediction approach to detect ground planes of vehicles as parallelograms in the image frame. The parallelogram preserves the planar position, size, and orientation of objects while omitting their height, which is unnecessary for most downstream applications. For training, a mix of real-world and synthetically generated scenes is leveraged. We evaluate generalizability on a held-out camera viewpoint and in adverse-weather scenarios absent from the training set. Our results show high detection accuracy, strong cross-viewpoint generalization, and robustness to diverse lighting and weather conditions. Model weights and inference code are provided at: this https URL

Title: Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Authors: Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.03585
Pdf URL: https://arxiv.org/pdf/2507.03585
Copy Paste: [[2507.03585]] Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation(https://arxiv.org/abs/2507.03585)
Keywords: robust, large language model, segmentation
Abstract: The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.

Title: Kinetic Langevin Diffusion for Crystalline Materials Generation

Authors: François Cornet, Federico Bergamin, Arghya Bhowmik, Juan Maria Garcia Lastra, Jes Frellsen, Mikkel N. Schmidt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03602
Pdf URL: https://arxiv.org/pdf/2507.03602
Copy Paste: [[2507.03602]] Kinetic Langevin Diffusion for Crystalline Materials Generation(https://arxiv.org/abs/2507.03602)
Keywords: diffusion, generative
Abstract: Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Model (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus while providing a training objective that accounts for the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.

Title: VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification

Authors: Cédric Bonhomme, Alexandre Dulaunoy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03607
Pdf URL: https://arxiv.org/pdf/2507.03607
Copy Paste: [[2507.03607]] VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification(https://arxiv.org/abs/2507.03607)
Keywords: transformer
Abstract: This paper presents VLAI, a transformer-based model that predicts software vulnerability severity levels directly from text descriptions. Built on RoBERTa, VLAI is fine-tuned on over 600,000 real-world vulnerabilities and achieves over 82% accuracy in predicting severity categories, enabling faster and more consistent triage ahead of manual CVSS scoring. The model and dataset are open-source and integrated into the Vulnerability-Lookup service.

Title: EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

Authors: Klim Zaporojets, Daniel Daza, Edoardo Barba, Ira Assent, Roberto Navigli, Paul Groth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03617
Pdf URL: https://arxiv.org/pdf/2507.03617
Copy Paste: [[2507.03617]] EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge(https://arxiv.org/abs/2507.03617)
Keywords: extraction
Abstract: Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.

Title: Blackbox Dataset Inference for LLM

Authors: Ruikai Zhou, Kang Yang, Xun Chen, Wendy Hui Wang, Guanhong Tao, Jun Xu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03619
Pdf URL: https://arxiv.org/pdf/2507.03619
Copy Paste: [[2507.03619]] Blackbox Dataset Inference for LLM(https://arxiv.org/abs/2507.03619)
Keywords: attack, robust, membership infer, large language model
Abstract: Today, the training of large language models (LLMs) can involve personally identifiable information and copyrighted material, incurring dataset misuse. To mitigate the problem of dataset misuse, this paper explores \textit{dataset inference}, which aims to detect if a suspect model $\mathcal{M}$ used a victim dataset $\mathcal{D}$ in training. Previous research tackles dataset inference by aggregating results of membership inference attacks (MIAs) -- methods to determine whether individual samples are a part of the training dataset. However, restricted by the low accuracy of MIAs, previous research mandates grey-box access to $\mathcal{M}$ to get intermediate outputs (probabilities, loss, perplexity, etc.) for obtaining satisfactory results. This leads to reduced practicality, as LLMs, especially those deployed for profits, have limited incentives to return the intermediate outputs. In this paper, we propose a new method of dataset inference with only black-box access to the target model (i.e., assuming only the text-based responses of the target model are available). Our method is enabled by two sets of locally built reference models, one set involving $\mathcal{D}$ in training and the other not. By measuring which set of reference model $\mathcal{M}$ is closer to, we determine if $\mathcal{M}$ used $\mathcal{D}$ for training. Evaluations of real-world LLMs in the wild show that our method offers high accuracy in all settings and presents robustness against bypassing attempts.

Title: SecureT2I: No More Unauthorized Manipulation on AI Generated Images from Prompts

Authors: Xiaodong Wu, Xiangman Li, Qi Li, Jianbing Ni, Rongxing Lu
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03636
Pdf URL: https://arxiv.org/pdf/2507.03636
Copy Paste: [[2507.03636]] SecureT2I: No More Unauthorized Manipulation on AI Generated Images from Prompts(https://arxiv.org/abs/2507.03636)
Keywords: secure, diffusion, generative
Abstract: Text-guided image manipulation with diffusion models enables flexible and precise editing based on prompts, but raises ethical and copyright concerns due to potential unauthorized modifications. To address this, we propose SecureT2I, a secure framework designed to prevent unauthorized editing in diffusion-based generative models. SecureT2I is compatible with both general-purpose and domain-specific models and can be integrated via lightweight fine-tuning without architectural changes. We categorize images into a permit set and a forbid set based on editing permissions. For the permit set, the model learns to perform high-quality manipulations as usual. For the forbid set, we introduce training objectives that encourage vague or semantically ambiguous outputs (e.g., blurred images), thereby suppressing meaningful edits. The core challenge is to block unauthorized editing while preserving editing quality for permitted inputs. To this end, we design separate loss functions that guide selective editing behavior. Extensive experiments across multiple datasets and models show that SecureT2I effectively degrades manipulation quality on forbidden images while maintaining performance on permitted ones. We also evaluate generalization to unseen inputs and find that SecureT2I consistently outperforms baselines. Additionally, we analyze different vagueness strategies and find that resize-based degradation offers the best trade-off for secure manipulation control.

Title: When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting

Authors: Xiaodong Wu, Tianyi Tang, Xiangman Li, Jianbing Ni, Yong Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03646
Pdf URL: https://arxiv.org/pdf/2507.03646
Copy Paste: [[2507.03646]] When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting(https://arxiv.org/abs/2507.03646)
Keywords: defense, attack, robust, extraction, watermark, diffusion
Abstract: Watermarking has emerged as a promising solution to counter harmful or deceptive AI-generated content by embedding hidden identifiers that trace content origins. However, the robustness of current watermarking techniques is still largely unexplored, raising critical questions about their effectiveness against adversarial attacks. To address this gap, we examine the robustness of model-specific watermarking, where watermark embedding is integrated with text-to-image generation in models like latent diffusion models. We introduce three attack strategies: edge prediction-based, box blurring, and fine-tuning-based attacks in a no-box setting, where an attacker does not require access to the ground-truth watermark decoder. Our findings reveal that while model-specific watermarking is resilient against basic evasion attempts, such as edge prediction, it is notably vulnerable to blurring and fine-tuning-based attacks. Our best-performing attack achieves a reduction in watermark detection accuracy to approximately 47.92\%. Additionally, we perform an ablation study on factors like message length, kernel size and decoder depth, identifying critical parameters influencing the fine-tuning attack's success. Finally, we assess several advanced watermarking defenses, finding that even the most robust methods, such as multi-label smoothing, result in watermark extraction accuracy that falls below an acceptable level when subjected to our no-box attacks.

Title: When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics

Authors: Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03660
Pdf URL: https://arxiv.org/pdf/2507.03660
Copy Paste: [[2507.03660]] When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics(https://arxiv.org/abs/2507.03660)
Keywords: diffusion
Abstract: Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multiphysics problems remains poorly understood. In particular, the challenge of learning nonlinear interactions between tightly coupled physical fields has received little systematic attention. This study addresses a foundational question: should the architectural design of a neural operator reflect the strength of physical coupling it aims to model? To answer this, we present the first comprehensive, architecture-aware evaluation of DeepONet variants across three regimes: single-physics, weakly coupled, and strongly coupled multiphysics systems. We consider a reaction-diffusion equation with dual spatial inputs, a nonlinear thermo-electrical problem with bidirectional coupling through temperature-dependent conductivity, and a viscoplastic thermo-mechanical model of steel solidification governed by transient phase-driven interactions. Two operator-learning frameworks, the classical DeepONet and its sequential GRU-based extension, S-DeepONet, are benchmarked using both single-branch and multi-branch (MIONet-style) architectures. Our results demonstrate that architectural alignment with physical coupling is crucial: single-branch networks significantly outperform multi-branch counterparts in strongly coupled settings, whereas multi-branch encodings offer advantages for decoupled or single-physics problems. Once trained, these surrogates achieve full-field predictions up to 1.8e4 times faster than high-fidelity finite-element solvers, without compromising solution accuracy.

Title: Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs

Authors: Jeremiah Giordani
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.03662
Pdf URL: https://arxiv.org/pdf/2507.03662
Copy Paste: [[2507.03662]] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs(https://arxiv.org/abs/2507.03662)
Keywords: secure, security, robust, large language model
Abstract: Recent work has shown that fine-tuning large language models (LLMs) on code with security vulnerabilities can result in misaligned and unsafe behaviors across broad domains. These results prompted concerns about the emergence of harmful behaviors from narrow domain fine-tuning. In this paper, we contextualize these findings by analyzing how such narrow adaptation impacts the internal mechanisms and behavioral manifestations of LLMs. Through a series of experiments covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions, we find that behaviors attributed to "emergent misalignment" may be better interpreted as an erosion of prior alignment. We show that fine tuning on insecure code induces internal changes that oppose alignment. Further, we identify a shared latent dimension in the model's activation space that governs alignment behavior. We show that this space is activated by insecure code and by misaligned responses more generally, revealing how narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms. Our findings offer a mechanistic interpretation for previously observed misalignment phenomena, and highlights the fragility of alignment in LLMs. The results underscore the need for more robust fine-tuning strategies that preserve intended behavior across domains.

Title: TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03668
Pdf URL: https://arxiv.org/pdf/2507.03668
Copy Paste: [[2507.03668]] TRACE: Training and Inference-Time Interpretability Analysis for Language Models(https://arxiv.org/abs/2507.03668)
Keywords: interpretability, transformer
Abstract: Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.

Title: Recon, Answer, Verify: Agents in Search of Truth

Authors: Satyam Shukla, Himanshu Dutta, Pushpak Bhattacharyya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03671
Pdf URL: https://arxiv.org/pdf/2507.03671
Copy Paste: [[2507.03671]] Recon, Answer, Verify: Agents in Search of Truth(https://arxiv.org/abs/2507.03671)
Keywords: large language model
Abstract: Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging as existing benchmark datasets often include post claim analysis and annotator cues, which are absent in real world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5 class benchmark dataset of 2,982 political claims from this http URL, where all post claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro f1 compared to PFO's unfiltered version. Based on the identified challenges of the existing LLM based fact checking system, we propose RAV (Recon Answer Verify), an agentic framework with three agents: question generator, answer generator, and label generator. Our pipeline iteratively generates and answers sub questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state of the art approaches on well known baselines RAWFC (fact checking, 3 class) by 25.28%, and on HOVER (encyclopedia, 2 class) by 1.54% on 2 hop, 4.94% on 3 hop, and 1.78% on 4 hop, sub categories respectively. RAV shows the least performance drop compared to baselines of 16.3% in macro f1 when we compare PFO with its unfiltered version.

Title: TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

Authors: Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03673
Pdf URL: https://arxiv.org/pdf/2507.03673
Copy Paste: [[2507.03673]] TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection(https://arxiv.org/abs/2507.03673)
Keywords: large language model
Abstract: Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while the singleton data quality evaluation accounts for inconsistent criteria between independent samples. To address the issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-Based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.

Title: STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Authors: Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03674
Pdf URL: https://arxiv.org/pdf/2507.03674
Copy Paste: [[2507.03674]] STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking(https://arxiv.org/abs/2507.03674)
Keywords: extraction, large language model
Abstract: The ability to extract structured information from unstructured sources-such as free-text documents and scientific literature-is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.

Title: Plugging Attention into Power Grids: Towards Transparent Forecasting

Authors: Eloi Campagne, Itai Zehavi, Yvenn Amara-Ouali, Yannig Goude, Argyris Kalogeratos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03690
Pdf URL: https://arxiv.org/pdf/2507.03690
Copy Paste: [[2507.03690]] Plugging Attention into Power Grids: Towards Transparent Forecasting(https://arxiv.org/abs/2507.03690)
Keywords: robust, interpretability, transformer
Abstract: Accurate electricity consumption forecasting is crucial for ensuring grid stability and optimizing power generation, particularly in increasingly decentralized and complex systems. While classical approaches such as Generalized Additive Models (GAMs) remain widely used, they often fail to capture the spatial dependencies inherent in energy networks. Graph Neural Networks (GNNs) offer a principled framework to incorporate this structure by directly leveraging graph topologies. In this work, we evaluate a broad set of GNN architectures -- including GCN, GraphSAGE, ChebConv, TAG, APPNP, TransformerConv, and Graph Attention Networks (GAT and GATv2) -- on two real-world electricity consumption datasets from France and the UK. Our experiments show that while complex architectures like GATv2 and TransformerConv do not consistently outperform their simpler counterparts, models such as GCN and APPNP achieve strong results in low-data or highly disaggregated settings. Nonetheless, the vanilla GAT remains highly competitive across both datasets and offers an additional interpretability layer via attention mechanisms. We perform a temporal analysis of attention weights, revealing evolving patterns of regional interaction linked to seasonal and meteorological variability. These results highlight that, although attention is not universally superior, it provides valuable explanatory power when spatial dependencies are prominent. Finally, we benchmark ensemble-based expert aggregation strategies, showing that uniform or learned combinations can enhance robustness and outperform individual models under data heterogeneity.

Title: Willchain: Decentralized, Privacy-Preserving, Self-Executing, Digital Wills

Authors: Jovonni L. PHarr
Subjects: cs.CR, cs.CE, cs.ET
Abstract URL: https://arxiv.org/abs/2507.03694
Pdf URL: https://arxiv.org/pdf/2507.03694
Copy Paste: [[2507.03694]] Willchain: Decentralized, Privacy-Preserving, Self-Executing, Digital Wills(https://arxiv.org/abs/2507.03694)
Keywords: secure, security, privacy, fair
Abstract: This work presents a novel decentralized protocol for digital estate planning that integrates advances distributed computing, and cryptography. The original proof-of-concept was constructed using purely solidity contracts. Since then, we have enhanced the implementation into a layer-1 protocol that uses modern interchain communication to connect several heterogeneous chain types. A key contribution of this research is the implementation of several modern cryptographic primitives to support various forms of claims for information validation. These primitives introduce an unmatched level of privacy to the process of digital inheritance. We also demonstrate on a set of heterogeneous smart contracts, following the same spec, on each chain to serve as entry points, gateways, or bridge contracts that are invoked via a path from the will module on our protocol, to the contract. This ensures a fair and secure distribution of digital assets in accordance with the wishes of the decedent without the requirement of moving their funds. This research further extends its innovations with a user interaction model, featuring a check-in system and account abstraction process, which enhances flexibility and user-friendliness without compromising on security. By developing a dedicated permissionless blockchain that is secured by a network of validators, and interchain relayers, the proposed protocol signifies a transformation in the digital estate planning industry and illustrates the potential of blockchain technology in revolutionizing traditional legal and personal spheres. Implementing a cryptoeconomic network at the core of inheritance planning allows for unique incentive compatible economic mechanisms to be constructed.

Title: SAMed-2: Selective Memory Enhanced Medical Segment Anything Model

Authors: Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou, Weixiang Sun, Zhennong Chen, Sekeun Kim, Hui Ren, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03698
Pdf URL: https://arxiv.org/pdf/2507.03698
Copy Paste: [[2507.03698]] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model(https://arxiv.org/abs/2507.03698)
Keywords: segmentation
Abstract: Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: this https URL.

Title: Sign Spotting Disambiguation using Large Language Models

Authors: JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03703
Pdf URL: https://arxiv.org/pdf/2507.03703
Copy Paste: [[2507.03703]] Sign Spotting Disambiguation using Large Language Models(https://arxiv.org/abs/2507.03703)
Keywords: large language model
Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method's superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.

Title: Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Authors: Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Dinh-Duy Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03711
Pdf URL: https://arxiv.org/pdf/2507.03711
Copy Paste: [[2507.03711]] Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making(https://arxiv.org/abs/2507.03711)
Keywords: large language model
Abstract: In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.

Title: Predicting Business Angel Early-Stage Decision Making Using AI

Authors: Yan Katcharovski, Andrew L. Maxwell
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03721
Pdf URL: https://arxiv.org/pdf/2507.03721
Copy Paste: [[2507.03721]] Predicting Business Angel Early-Stage Decision Making Using AI(https://arxiv.org/abs/2507.03721)
Keywords: extraction, large language model
Abstract: External funding is crucial for early-stage ventures, particularly technology startups that require significant R&D investment. Business angels offer a critical source of funding, but their decision-making is often subjective and resource-intensive for both investor and entrepreneur. Much research has investigated this investment process to find the critical factors angels consider. One such tool, the Critical Factor Assessment (CFA), deployed more than 20,000 times by the Canadian Innovation Centre, has been evaluated post-decision and found to be significantly more accurate than investors' own decisions. However, a single CFA analysis requires three trained individuals and several days, limiting its adoption. This study builds on previous work validating the CFA to investigate whether the constraints inhibiting its adoption can be overcome using a trained AI model. In this research, we prompted multiple large language models (LLMs) to assign the eight CFA factors to a dataset of 600 transcribed, unstructured startup pitches seeking business angel funding with known investment outcomes. We then trained and evaluated machine learning classification models using the LLM-generated CFA scores as input features. Our best-performing model demonstrated high predictive accuracy (85.0% for predicting BA deal/no-deal outcomes) and exhibited significant correlation (Spearman's r = 0.896, p-value < 0.001) with conventional human-graded evaluations. The integration of AI-based feature extraction with a structured and validated decision-making framework yielded a scalable, reliable, and less-biased model for evaluating startup pitches, removing the constraints that previously limited adoption.

Title: MemOS: A Memory OS for AI System

Authors: Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03724
Pdf URL: https://arxiv.org/pdf/2507.03724
Copy Paste: [[2507.03724]] MemOS: A Memory OS for AI System(https://arxiv.org/abs/2507.03724)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge this http URL models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended this http URL Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent this http URL work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

Title: FAROS: Fair Graph Generation via Attribute Switching Mechanisms

Authors: Abdennacer Badaoui, Oussama Kharouiche, Hatim Mrabet, Daniele Malitesta, Fragkiskos D. Malliaros
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03728
Pdf URL: https://arxiv.org/pdf/2507.03728
Copy Paste: [[2507.03728]] FAROS: Fair Graph Generation via Attribute Switching Mechanisms(https://arxiv.org/abs/2507.03728)
Keywords: fair, diffusion
Abstract: Recent advancements in graph diffusion models (GDMs) have enabled the synthesis of realistic network structures, yet ensuring fairness in the generated data remains a critical challenge. Existing solutions attempt to mitigate bias by re-training the GDMs with ad-hoc fairness constraints. Conversely, with this work, we propose FAROS, a novel FAir graph geneRatiOn framework leveraging attribute Switching mechanisms and directly running in the generation process of the pre-trained GDM. Technically, our approach works by altering nodes' sensitive attributes during the generation. To this end, FAROS calculates the optimal fraction of switching nodes, and selects the diffusion step to perform the switch by setting tailored multi-criteria constraints to preserve the node-topology profile from the original distribution (a proxy for accuracy) while ensuring the edge independence on the sensitive attributes for the generated graph (a proxy for fairness). Our experiments on benchmark datasets for link prediction demonstrate that the proposed approach effectively reduces fairness discrepancies while maintaining comparable (or even higher) accuracy performance to other similar baselines. Noteworthy, FAROS is also able to strike a better accuracy-fairness trade-off than other competitors in some of the tested settings under the Pareto optimality concept, demonstrating the effectiveness of the imposed multi-criteria constraints.

Title: Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

Authors: Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03737
Pdf URL: https://arxiv.org/pdf/2507.03737
Copy Paste: [[2507.03737]] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps(https://arxiv.org/abs/2507.03737)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking, \textbf{lack geometric priors} in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to \textbf{scale drift}. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: this https URL.

Title: Flow-Anchored Consistency Models

Authors: Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03738
Pdf URL: https://arxiv.org/pdf/2507.03738
Copy Paste: [[2507.03738]] Flow-Anchored Consistency Models(https://arxiv.org/abs/2507.03738)
Keywords: generative
Abstract: Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: this https URL.

Title: ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Authors: Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03739
Pdf URL: https://arxiv.org/pdf/2507.03739
Copy Paste: [[2507.03739]] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays(https://arxiv.org/abs/2507.03739)
Keywords: explainability, transformer, generative, large language model
Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

Title: StreamDiT: Real-Time Streaming Text-to-Video Generation

Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2507.03745
Pdf URL: https://arxiv.org/pdf/2507.03745
Copy Paste: [[2507.03745]] StreamDiT: Real-Time Streaming Text-to-Video Generation(https://arxiv.org/abs/2507.03745)
Keywords: diffusion, transformer
Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: https://arxiv.org/abs/2507.03765
Pdf URL: https://arxiv.org/pdf/2507.03765
Copy Paste: [[2507.03765]] Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach(https://arxiv.org/abs/2507.03765)
Keywords: segmentation
Abstract: Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65\% reduction on the DSEC-Semantic dataset.

Title: Skewed Score: A statistical framework to assess autograders

Authors: Magda Dubois, Harry Coppock, Mario Giulianelli, Lennart Luettgau, Cozmin Ududec
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03772
Pdf URL: https://arxiv.org/pdf/2507.03772
Copy Paste: [[2507.03772]] Skewed Score: A statistical framework to assess autograders(https://arxiv.org/abs/2507.03772)
Keywords: robust, large language model
Abstract: The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, and other factors. In this paper we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while also addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional reliability metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying the source of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.

Title: RVISmith: Fuzzing Compilers for RVV Intrinsics

Authors: Yibo He, Cunjian Huang, Xianmiao Qu, Hongdeng Chen, Wei Yang, Tao Xie
Subjects: cs.CR, cs.DC, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2507.03773
Pdf URL: https://arxiv.org/pdf/2507.03773
Copy Paste: [[2507.03773]] RVISmith: Fuzzing Compilers for RVV Intrinsics(https://arxiv.org/abs/2507.03773)
Keywords: security
Abstract: Modern processors are equipped with single instruction multiple data (SIMD) instructions for fine-grained data parallelism. Compiler auto-vectorization techniques that target SIMD instructions face performance limitations due to insufficient information available at compile time, requiring programmers to manually manipulate SIMD instructions. SIMD intrinsics, a type of built-in function provided by modern compilers, enable programmers to manipulate SIMD instructions within high-level programming languages. Bugs in compilers for SIMD intrinsics can introduce potential threats to software security, producing unintended calculation results, data loss, program crashes, etc. To detect bugs in compilers for SIMD intrinsics, we propose RVISmith, a randomized fuzzer that generates well-defined C programs that include various invocation sequences of RVV (RISC-V Vector Extension) intrinsics. We design RVISmith to achieve the following objectives: (i) achieving high intrinsic coverage, (ii) improving sequence variety, and (iii) without known undefined behaviors. We implement RVISmith based on the ratified RVV intrinsic specification and evaluate our approach with three modern compilers: GCC, LLVM, and XuanTie. Experimental results show that RVISmith achieves 11.5 times higher intrinsic coverage than the state-of-the-art fuzzer for RVV intrinsics. By differential testing that compares results across different compilers, optimizations, and equivalent programs, we detect and report 13 previously unknown bugs of the three compilers under test to date. Of these bugs, 10 are confirmed and another 3 are fixed by the compiler developers.

Title: Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings

Authors: Bugra Kilictas, Faruk Alpay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03774
Pdf URL: https://arxiv.org/pdf/2507.03774
Copy Paste: [[2507.03774]] Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings(https://arxiv.org/abs/2507.03774)
Keywords: security
Abstract: We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI's embedding space -- a state where the AI's internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an "empathetic embedding," wherein the AI internalizes not only the meaning of the content but also the author's intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.

Title: FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Authors: Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03779
Pdf URL: https://arxiv.org/pdf/2507.03779
Copy Paste: [[2507.03779]] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed(https://arxiv.org/abs/2507.03779)
Keywords: robust
Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence--and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum--low-frequency being seen first--and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at this https URL

Title: Zero Memory Overhead Approach for Protecting Vision Transformer Parameters

Authors: Fereshteh Baradaran, Mohsen Raji, Azadeh Baradaran, Arezoo Baradaran, Reihaneh Akbarifard
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03816
Pdf URL: https://arxiv.org/pdf/2507.03816
Copy Paste: [[2507.03816]] Zero Memory Overhead Approach for Protecting Vision Transformer Parameters(https://arxiv.org/abs/2507.03816)
Keywords: protect, robust, transformer, segmentation
Abstract: Vision Transformers (ViTs) have demonstrated superior performance over Convolutional Neural Networks (CNNs) in various vision-related tasks such as classification, object detection, and segmentation due to their use of self-attention mechanisms. As ViTs become more popular in safety-critical applications like autonomous driving, ensuring their correct functionality becomes essential, especially in the presence of bit-flip faults in their parameters stored in memory. In this paper, a fault tolerance technique is introduced to protect ViT parameters against bit-flip faults with zero memory overhead. Since the least significant bits of parameters are not critical for model accuracy, replacing the LSB with a parity bit provides an error detection mechanism without imposing any overhead on the model. When faults are detected, affected parameters are masked by zeroing out, as most parameters in ViT models are near zero, effectively preventing accuracy degradation. This approach enhances reliability across ViT models, improving the robustness of parameters to bit-flips by up to three orders of magnitude, making it an effective zero-overhead solution for fault tolerance in critical applications.

Title: IMPACT: Importance-Aware Activation Space Reconstruction

Authors: Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03828
Pdf URL: https://arxiv.org/pdf/2507.03828
Copy Paste: [[2507.03828]] IMPACT: Importance-Aware Activation Space Reconstruction(https://arxiv.org/abs/2507.03828)
Keywords: large language model
Abstract: Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure-prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.

Title: Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Authors: Jiuhong Xiao, Yang Zhou, Giuseppe Loianno
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.03831
Pdf URL: https://arxiv.org/pdf/2507.03831
Copy Paste: [[2507.03831]] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition(https://arxiv.org/abs/2507.03831)
Keywords: robust
Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code will be publicly released.

Title: Regularizing Log-Linear Cost Models for Inpatient Stays by Merging ICD-10 Codes

Authors: Chi-Ken Lu, David Alonge, Nicole Richardson, Bruno Richard
Subjects: cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03843
Pdf URL: https://arxiv.org/pdf/2507.03843
Copy Paste: [[2507.03843]] Regularizing Log-Linear Cost Models for Inpatient Stays by Merging ICD-10 Codes(https://arxiv.org/abs/2507.03843)
Keywords: interpretability
Abstract: Cost models in healthcare research must balance interpretability, accuracy, and parameter consistency. However, interpretable models often struggle to achieve both accuracy and consistency. Ordinary least squares (OLS) models for high-dimensional regression can be accurate but fail to produce stable regression coefficients over time when using highly granular ICD-10 diagnostic codes as predictors. This instability arises because many ICD-10 codes are infrequent in healthcare datasets. While regularization methods such as Ridge can address this issue, they risk discarding important predictors. Here, we demonstrate that reducing the granularity of ICD-10 codes is an effective regularization strategy within OLS while preserving the representation of all diagnostic code categories. By truncating ICD-10 codes from seven characters (e.g., T67.0XXA, T67.0XXD) to six (e.g., T67.0XX) or fewer, we reduce the dimensionality of the regression problem while maintaining model interpretability and consistency. Mathematically, the merging of predictors in OLS leads to increased trace of the Hessian matrix, which reduces the variance of coefficient estimation. Our findings explain why broader diagnostic groupings like DRGs and HCC codes are favored over highly granular ICD-10 codes in real-world risk adjustment and cost models.

Title: Interpretable Diffusion Models with B-cos Networks

Authors: Nicola Bernold, Moritz Vandenhirtz, Alice Bizeul, Julia E. Vogt
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03846
Pdf URL: https://arxiv.org/pdf/2507.03846
Copy Paste: [[2507.03846]] Interpretable Diffusion Models with B-cos Networks(https://arxiv.org/abs/2507.03846)
Keywords: interpretability, diffusion
Abstract: Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.

Title: KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis

Authors: Reilly Haskins, Ben Adams
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03847
Pdf URL: https://arxiv.org/pdf/2507.03847
Copy Paste: [[2507.03847]] KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis(https://arxiv.org/abs/2507.03847)
Keywords: robust, interpretability, large language model
Abstract: Large Language Models (LLMs) frequently generate hallucinations: statements that are syntactically plausible but lack factual grounding. This research presents KEA (Kernel-Enriched AI) Explain: a neurosymbolic framework that detects and explains such hallucinations by comparing knowledge graphs constructed from LLM outputs with ground truth data from Wikidata or contextual documents. Using graph kernels and semantic clustering, the method provides explanations for detected hallucinations, ensuring both robustness and interpretability. Our framework achieves competitive accuracy in detecting hallucinations across both open- and closed-domain tasks, and is able to generate contrastive explanations, enhancing transparency. This research advances the reliability of LLMs in high-stakes domains and provides a foundation for future work on precision improvements and multi-source knowledge integration.

Title: OrbitAll: A Unified Quantum Mechanical Representation Deep Learning Framework for All Molecular Systems

Authors: Beom Seok Kang, Vignesh C. Bhethanabotla, Amin Tavakoli, Maurice D. Hanisch, William A. Goddard III, Anima Anandkumar
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2507.03853
Pdf URL: https://arxiv.org/pdf/2507.03853
Copy Paste: [[2507.03853]] OrbitAll: A Unified Quantum Mechanical Representation Deep Learning Framework for All Molecular Systems(https://arxiv.org/abs/2507.03853)
Keywords: robust
Abstract: Despite the success of deep learning methods in quantum chemistry, their representational capacity is most often confined to neutral, closed-shell molecules. However, real-world chemical systems often exhibit complex characteristics, including varying charges, spins, and environments. We introduce OrbitAll, a geometry- and physics-informed deep learning framework that can represent all molecular systems with electronic structure information. OrbitAll utilizes spin-polarized orbital features from the underlying quantum mechanical method, and combines it with graph neural networks satisfying SE(3)-equivariance. The resulting framework can represent and process any molecular system with arbitrary charges, spins, and environmental effects. OrbitAll demonstrates superior performance and generalization on predicting charged, open-shell, and solvated molecules, while also robustly extrapolating to molecules significantly larger than the training data by leveraging a physics-informed architecture. OrbitAll achieves chemical accuracy using 10 times fewer training data than competing AI models, with a speedup of approximately $10^3$ - $10^4$ compared to density functional theory.

Title: Transformer with Koopman-Enhanced Graph Convolutional Network for Spatiotemporal Dynamics Forecasting

Authors: Zekai Wang, Bing Yao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03855
Pdf URL: https://arxiv.org/pdf/2507.03855
Copy Paste: [[2507.03855]] Transformer with Koopman-Enhanced Graph Convolutional Network for Spatiotemporal Dynamics Forecasting(https://arxiv.org/abs/2507.03855)
Keywords: transformer
Abstract: Spatiotemporal dynamics forecasting is inherently challenging, particularly in systems defined over irregular geometric domains, due to the need to jointly capture complex spatial correlations and nonlinear temporal dynamics. To tackle these challenges, we propose TK-GCN, a two-stage framework that integrates geometry-aware spatial encoding with long-range temporal modeling. In the first stage, a Koopman-enhanced Graph Convolutional Network (K-GCN) is developed to embed the high-dimensional dynamics distributed on spatially irregular domains into a latent space where the evolution of system states is approximately linear. By leveraging Koopman operator theory, this stage enhances the temporal consistency during the latent learning. In the second stage, a Transformer module is employed to model the temporal progression within the Koopman-encoded latent space. Through the self-attention mechanism, the Transformer captures long-range temporal dependencies, enabling accurate forecasting over extended horizons. We evaluate TK-GCN in spatiotemporal cardiac dynamics forecasting and benchmark its performance against several state-of-the-art baselines. Experimental results and ablation studies show that TK-GCN consistently delivers superior predictive accuracy across a range of forecast horizons, demonstrating its capability to effectively model complex spatial structures and nonlinear temporal dynamics.

Title: Enhanced accuracy through ensembling of randomly initialized auto-regressive models for time-dependent PDEs

Authors: Ishan Khurjekar, Indrashish Saha, Lori Graham-Brady, Somdatta Goswami
Subjects: cs.LG, cs.AI, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2507.03863
Pdf URL: https://arxiv.org/pdf/2507.03863
Copy Paste: [[2507.03863]] Enhanced accuracy through ensembling of randomly initialized auto-regressive models for time-dependent PDEs(https://arxiv.org/abs/2507.03863)
Keywords: robust, diffusion
Abstract: Systems governed by partial differential equations (PDEs) require computationally intensive numerical solvers to predict spatiotemporal field evolution. While machine learning (ML) surrogates offer faster solutions, autoregressive inference with ML models suffer from error accumulation over successive predictions, limiting their long-term accuracy. We propose a deep ensemble framework to address this challenge, where multiple ML surrogate models with random weight initializations are trained in parallel and aggregated during inference. This approach leverages the diversity of model predictions to mitigate error propagation while retaining the autoregressive strategies ability to capture the system's time dependent relations. We validate the framework on three PDE-driven dynamical systems - stress evolution in heterogeneous microstructures, Gray-Scott reaction-diffusion, and planetary-scale shallow water system - demonstrating consistent reduction in error accumulation over time compared to individual models. Critically, the method requires only a few time steps as input, enabling full trajectory predictions with inference times significantly faster than numerical solvers. Our results highlight the robustness of ensemble methods in diverse physical systems and their potential as efficient and accurate alternatives to traditional solvers. The codes for this work are available on GitHub (this https URL).

Title: OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

Authors: Seungjun Shin, Jaehoon Oh, Dokwan Oh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03865
Pdf URL: https://arxiv.org/pdf/2507.03865
Copy Paste: [[2507.03865]] OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference(https://arxiv.org/abs/2507.03865)
Keywords: large language model
Abstract: Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite their limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These imply that other tokens consistently are directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrated that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.

Title: Enhancing Adaptive Behavioral Interventions with LLM Inference from Participant-Described States

Authors: Karine Karine, Benjamin M. Marlin
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.03871
Pdf URL: https://arxiv.org/pdf/2507.03871
Copy Paste: [[2507.03871]] Enhancing Adaptive Behavioral Interventions with LLM Inference from Participant-Described States(https://arxiv.org/abs/2507.03871)
Keywords: large language model
Abstract: The use of reinforcement learning (RL) methods to support health behavior change via personalized and just-in-time adaptive interventions is of significant interest to health and behavioral science researchers focused on problems such as smoking cessation support and physical activity promotion. However, RL methods are often applied to these domains using a small collection of context variables to mitigate the significant data scarcity issues that arise from practical limitations on the design of adaptive intervention trials. In this paper, we explore an approach to significantly expanding the state space of an adaptive intervention without impacting data efficiency. The proposed approach enables intervention participants to provide natural language descriptions of aspects of their current state. It then leverages inference with pre-trained large language models (LLMs) to better align the policy of a base RL method with these state descriptions. To evaluate our method, we develop a novel physical activity intervention simulation environment that generates text-based state descriptions conditioned on latent state variables using an auxiliary LLM. We show that this approach has the potential to significantly improve the performance of online policy learning methods.

Title: Demystifying ChatGPT: How It Masters Genre Recognition

Authors: Subham Raj, Sriparna Saha, Brijraj Singh, Niranjan Pedanekar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03875
Pdf URL: https://arxiv.org/pdf/2507.03875
Copy Paste: [[2507.03875]] Demystifying ChatGPT: How It Masters Genre Recognition(https://arxiv.org/abs/2507.03875)
Keywords: large language model
Abstract: The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT's substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT's remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT's potential for content-related applications by incorporating visual information from movie posters.

Title: Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal

Authors: Yi Li, Xiaoxiong Wang, Jiawei Wang, Yi Chang, Kai Cao, Luxin Yan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03893
Pdf URL: https://arxiv.org/pdf/2507.03893
Copy Paste: [[2507.03893]] Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal(https://arxiv.org/abs/2507.03893)
Keywords: robust
Abstract: While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.

Title: GenAI-Powered Inference

Authors: Kosuke Imai, Kentaro Nakamura
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03897
Pdf URL: https://arxiv.org/pdf/2507.03897
Copy Paste: [[2507.03897]] GenAI-Powered Inference(https://arxiv.org/abs/2507.03897)
Keywords: diffusion, generative, large language model
Abstract: We introduce GenAI-Powered Inference (GPI), a statistical framework for both causal and predictive inference using unstructured data, including text and images. GPI leverages open-source Generative Artificial Intelligence (GenAI) models - such as large language models and diffusion models - not only to generate unstructured data at scale but also to extract low-dimensional representations that capture their underlying structure. Applying machine learning to these representations, GPI enables estimation of causal and predictive effects while quantifying associated estimation uncertainty. Unlike existing approaches to representation learning, GPI does not require fine-tuning of generative models, making it computationally efficient and broadly accessible. We illustrate the versatility of the GPI framework through three applications: (1) analyzing Chinese social media censorship, (2) estimating predictive effects of candidates' facial appearance on electoral outcomes, and (3) assessing the persuasiveness of political rhetoric. An open-source software package is available for implementing GPI.

Title: Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences

Authors: Mahdi Moghaddami, Clayton Schubring, Mohammad-Reza Siadat
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03899
Pdf URL: https://arxiv.org/pdf/2507.03899
Copy Paste: [[2507.03899]] Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences(https://arxiv.org/abs/2507.03899)
Keywords: transformer, generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder with no known cure that affects tens of millions of people worldwide. Early detection of AD is critical for timely intervention to halt or slow the progression of the disease. In this study, we propose a Transformer model for predicting the stage of AD progression at a subject's next clinical visit using features from a sequence of visits extracted from the subject's visit history. We also rigorously compare our model to recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent unit (GRU), and minimalRNN and assess their performances based on factors such as the length of prior visits and data imbalance. We test the importance of different feature categories and visit history, as well as compare the model to a newer Transformer-based model optimized for time series. Our model demonstrates strong predictive performance despite missing visits and missing features in available visits, particularly in identifying converter subjects -- individuals transitioning to more severe disease stages -- an area that has posed significant challenges in longitudinal prediction. The results highlight the model's potential in enhancing early diagnosis and patient outcomes.

Title: Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

Authors: Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03908
Pdf URL: https://arxiv.org/pdf/2507.03908
Copy Paste: [[2507.03908]] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs(https://arxiv.org/abs/2507.03908)
Keywords: large language model
Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment \& Fine-Tuning, where OT utilizes results from the encoding of label features and image visual features to minimize cross-modal distances, then integrating image and text features for LLMs fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.

Title: Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces

Authors: Henry B. Moss, Sebastian W. Ober, Tom Diethe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03910
Pdf URL: https://arxiv.org/pdf/2507.03910
Copy Paste: [[2507.03910]] Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces(https://arxiv.org/abs/2507.03910)
Keywords: generative
Abstract: Bayesian optimisation in the latent space of a Variational AutoEncoder (VAE) is a powerful framework for optimisation tasks over complex structured domains, such as the space of scientifically interesting molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks, which in turn has led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a Gaussian Process (GP) surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths -- structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.

Title: Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation

Authors: Ha-Hieu Pham, Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Ulas Bagci, Min Xu, Trung-Nghia Le, Huy-Hieu Pham
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03923
Pdf URL: https://arxiv.org/pdf/2507.03923
Copy Paste: [[2507.03923]] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation(https://arxiv.org/abs/2507.03923)
Keywords: segmentation
Abstract: Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at this https URL.

Title: DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Authors: Rongjia Zheng, Qing Zhang, Chengjiang Long, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03924
Pdf URL: https://arxiv.org/pdf/2507.03924
Copy Paste: [[2507.03924]] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering(https://arxiv.org/abs/2507.03924)
Keywords: robust, diffusion, generative
Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.

Title: Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models

Authors: Eva Vanmassenhove
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03933
Pdf URL: https://arxiv.org/pdf/2507.03933
Copy Paste: [[2507.03933]] Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models(https://arxiv.org/abs/2507.03933)
Keywords: protect, large language model
Abstract: Multilingual Large Language Models (LLMs) considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to \textit{offload} the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.

Title: VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision

Authors: Kezhong Liu, Yiwen Zhou, Mozi Chen, Jianhua He, Jingao Xu, Zheng Yang, Chris Xiaoxuan Lu, Shengkai Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.03938
Pdf URL: https://arxiv.org/pdf/2507.03938
Copy Paste: [[2507.03938]] VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision(https://arxiv.org/abs/2507.03938)
Keywords: extraction
Abstract: This work proposes a mmWave radar's scene flow estimation framework supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of VI rigid transformation also degenerates the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.

Title: A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text

Authors: KMA Solaiman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03949
Pdf URL: https://arxiv.org/pdf/2507.03949
Copy Paste: [[2507.03949]] A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text(https://arxiv.org/abs/2507.03949)
Keywords: extraction
Abstract: We propose POSID, a modular, lightweight and on-demand framework for extracting structured attribute-based properties from unstructured text without task-specific fine-tuning. While the method is designed to be adaptable across domains, in this work, we evaluate it on human attribute recognition in incident reports. POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. We demonstrate its effectiveness on a missing person use case using the InciText dataset, achieving effective attribute extraction without supervised training.

Title: Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study

Authors: Kai Ye, Tianyi Chen, Zhen Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03953
Pdf URL: https://arxiv.org/pdf/2507.03953
Copy Paste: [[2507.03953]] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study(https://arxiv.org/abs/2507.03953)
Keywords: privacy, protect, diffusion
Abstract: With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation based protection methods: AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC--across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: this https URL.

Title: Robust Low-light Scene Restoration via Illumination Transition

Authors: Ze Li, Feng Zhang, Xiatian Zhu, Meng Zhang, Yanghong Zhou, P. Y. Mok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03976
Pdf URL: https://arxiv.org/pdf/2507.03976
Copy Paste: [[2507.03976]] Robust Low-light Scene Restoration via Illumination Transition(https://arxiv.org/abs/2507.03976)
Keywords: robust
Abstract: Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at this https URL.

Title: CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning

Authors: Jeonghyo Song, Kimin Yun, DaeUng Jo, Jinyoung Kim, Youngjoon Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03984
Pdf URL: https://arxiv.org/pdf/2507.03984
Copy Paste: [[2507.03984]] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning(https://arxiv.org/abs/2507.03984)
Keywords: robust, large language model, segmentation
Abstract: Effective Out-of-Distribution (OOD) detection is criti-cal for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.

Title: MalVol-25: A Diverse, Labelled and Detailed Volatile Memory Dataset for Malware Detection and Response Testing and Validation

Authors: Dipo Dunsin, Mohamed Chahine Ghanem, Eduardo Almeida Palmieri
Subjects: cs.CR, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03993
Pdf URL: https://arxiv.org/pdf/2507.03993
Copy Paste: [[2507.03993]] MalVol-25: A Diverse, Labelled and Detailed Volatile Memory Dataset for Malware Detection and Response Testing and Validation(https://arxiv.org/abs/2507.03993)
Keywords: security
Abstract: This paper addresses the critical need for high-quality malware datasets that support advanced analysis techniques, particularly machine learning and agentic AI frameworks. Existing datasets often lack diversity, comprehensive labelling, and the complexity necessary for effective machine learning and agent-based AI training. To fill this gap, we developed a systematic approach for generating a dataset that combines automated malware execution in controlled virtual environments with dynamic monitoring tools. The resulting dataset comprises clean and infected memory snapshots across multiple malware families and operating systems, capturing detailed behavioural and environmental features. Key design decisions include applying ethical and legal compliance, thorough validation using both automated and manual methods, and comprehensive documentation to ensure replicability and integrity. The dataset's distinctive features enable modelling system states and transitions, facilitating RL-based malware detection and response strategies. This resource is significant for advancing adaptive cybersecurity defences and digital forensic research. Its scope supports diverse malware scenarios and offers potential for broader applications in incident response and automated threat mitigation.

Title: NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models

Authors: Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2507.04002
Pdf URL: https://arxiv.org/pdf/2507.04002
Copy Paste: [[2507.04002]] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models(https://arxiv.org/abs/2507.04002)
Keywords: robust, segmentation
Abstract: Birds' Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, as pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at this https URL.

Title: Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation

Authors: Patryk Bartkowiak, Filip Graliński
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04003
Pdf URL: https://arxiv.org/pdf/2507.04003
Copy Paste: [[2507.04003]] Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation(https://arxiv.org/abs/2507.04003)
Keywords: transformer
Abstract: Transformer-based models have demonstrated significant success in various source code representation tasks. Nonetheless, traditional positional embeddings employed by these models inadequately capture the hierarchical structure intrinsic to source code, typically represented as Abstract Syntax Trees (ASTs). To address this, we propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from ASTs, including node depth and sibling indices. These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. We thoroughly evaluate our proposed model through masked language modeling (MLM) pretraining and clone detection fine-tuning tasks. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall, emphasizing the importance of incorporating explicit structural information into transformer-based representations of source code.

Title: Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing

Authors: Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee, Jongwon Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04006
Pdf URL: https://arxiv.org/pdf/2507.04006
Copy Paste: [[2507.04006]] Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing(https://arxiv.org/abs/2507.04006)
Keywords: extraction
Abstract: Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.

Title: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Authors: Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04009
Pdf URL: https://arxiv.org/pdf/2507.04009
Copy Paste: [[2507.04009]] Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents(https://arxiv.org/abs/2507.04009)
Keywords: extraction, large language model
Abstract: Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using public-available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at this https URL and have garnered over 9,000 GitHub stars.

Title: Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Authors: Kyuhee Kim, Sangah Lee
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.04014
Pdf URL: https://arxiv.org/pdf/2507.04014
Copy Paste: [[2507.04014]] Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition(https://arxiv.org/abs/2507.04014)
Keywords: large language model
Abstract: As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs' cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.

Title: Habitat Classification from Ground-Level Imagery Using Deep Neural Networks

Authors: Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04017
Pdf URL: https://arxiv.org/pdf/2507.04017
Copy Paste: [[2507.04017]] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks(https://arxiv.org/abs/2507.04017)
Keywords: robust, transformer
Abstract: Habitat assessment at local scales -- critical for enhancing biodiversity and guiding conservation priorities -- often relies on expert field survey that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models -- convolutional neural networks (CNNs) and vision transformers (ViTs) -- under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91\%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at the national scale.

Title: Exploring Kolmogorov-Arnold Network Expansions in Vision Transformers for Mitigating Catastrophic Forgetting in Continual Learning

Authors: Zahid Ullah, Jihie Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04020
Pdf URL: https://arxiv.org/pdf/2507.04020
Copy Paste: [[2507.04020]] Exploring Kolmogorov-Arnold Network Expansions in Vision Transformers for Mitigating Catastrophic Forgetting in Continual Learning(https://arxiv.org/abs/2507.04020)
Keywords: robust, transformer
Abstract: Continual learning (CL), the ability of a model to learn new tasks without forgetting previously acquired knowledge, remains a critical challenge in artificial intelligence, particularly for vision transformers (ViTs) utilizing Multilayer Perceptrons (MLPs) for global representation learning. Catastrophic forgetting, where new information overwrites prior knowledge, is especially problematic in these models. This research proposes replacing MLPs in ViTs with Kolmogorov-Arnold Network (KANs) to address this issue. KANs leverage local plasticity through spline-based activations, ensuring that only a subset of parameters is updated per sample, thereby preserving previously learned knowledge. The study investigates the efficacy of KAN-based ViTs in CL scenarios across benchmark datasets (MNIST, CIFAR100), focusing on their ability to retain accuracy on earlier tasks while adapting to new ones. Experimental results demonstrate that KAN-based ViTs significantly mitigate catastrophic forgetting, outperforming traditional MLP-based ViTs in knowledge retention and task adaptation. This novel integration of KANs into ViTs represents a promising step toward more robust and adaptable models for dynamic environments.

Title: LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

Authors: Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04023
Pdf URL: https://arxiv.org/pdf/2507.04023
Copy Paste: [[2507.04023]] LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models(https://arxiv.org/abs/2507.04023)
Keywords: robust, transformer, large language model
Abstract: Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or "overthinking" answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench

Title: Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks

Authors: Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček
Subjects: cs.LG, cs.CY, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2507.04033
Pdf URL: https://arxiv.org/pdf/2507.04033
Copy Paste: [[2507.04033]] Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks(https://arxiv.org/abs/2507.04033)
Keywords: fair
Abstract: The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at this https URL.

Title: PresentAgent: Multimodal Agent for Presentation Video Generation

Authors: Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04036
Pdf URL: https://arxiv.org/pdf/2507.04036
Copy Paste: [[2507.04036]] PresentAgent: Multimodal Agent for Presentation Video Generation(https://arxiv.org/abs/2507.04036)
Keywords: large language model
Abstract: We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at this https URL.

Title: T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images

Authors: Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04038
Pdf URL: https://arxiv.org/pdf/2507.04038
Copy Paste: [[2507.04038]] T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images(https://arxiv.org/abs/2507.04038)
Keywords: robust, segmentation
Abstract: One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at this https URL.

Title: Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

Authors: Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04049
Pdf URL: https://arxiv.org/pdf/2507.04049
Copy Paste: [[2507.04049]] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation(https://arxiv.org/abs/2507.04049)
Keywords: diffusion
Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode this http URL experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

Title: Predictive Modeling of Effluent Temperature in SAT Systems Using Ambient Meteorological Data: Implications for Infiltration Management

Authors: Roy Elkayam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04050
Pdf URL: https://arxiv.org/pdf/2507.04050
Copy Paste: [[2507.04050]] Predictive Modeling of Effluent Temperature in SAT Systems Using Ambient Meteorological Data: Implications for Infiltration Management(https://arxiv.org/abs/2507.04050)
Keywords: robust, interpretability
Abstract: Accurate prediction of effluent temperature in recharge basins is essential for optimizing the Soil Aquifer Treatment (SAT) process, as temperature directly influences water viscosity and infiltration rates. This study develops and evaluates predictive models for effluent temperature in the upper recharge layer of a Shafdan SAT system recharge basin using ambient meteorological data. Multiple linear regression (MLR), neural networks (NN), and random forests (RF) were tested for their predictive accuracy and interpretability. The MLR model, preferred for its operational simplicity and robust performance, achieved high predictive accuracy (R2 = 0.86-0.87) and was used to estimate effluent temperatures over a 10-year period. Results highlight pronounced seasonal temperature cycles and the importance of topsoil temperature in governing the thermal profile of the infiltrating effluent. The study provides practical equations for real-time monitoring and long-term planning of SAT operations.

Title: Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

Authors: Xiao Liu, Nan Pu, Haiyang Zheng, Wenjing Li, Nicu Sebe, Zhun Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04051
Pdf URL: https://arxiv.org/pdf/2507.04051
Copy Paste: [[2507.04051]] Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery(https://arxiv.org/abs/2507.04051)
Keywords: diffusion
Abstract: In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.

Title: Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG

Authors: Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2507.04055
Pdf URL: https://arxiv.org/pdf/2507.04055
Copy Paste: [[2507.04055]] Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG(https://arxiv.org/abs/2507.04055)
Keywords: large language model
Abstract: Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only an Yes/No. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.

Title: Attributing Data for Sharpness-Aware Minimization

Authors: Chenyang Ren, Yifan Jia, Huanyi Xie, Zhaobin Xu, Tianxing Wei, Liangyu Wang, Lijie Hu, Di Wang
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2507.04059
Pdf URL: https://arxiv.org/pdf/2507.04059
Copy Paste: [[2507.04059]] Attributing Data for Sharpness-Aware Minimization(https://arxiv.org/abs/2507.04059)
Keywords: privacy, interpretability
Abstract: Sharpness-aware Minimization (SAM) improves generalization in large-scale model training by linking loss landscape geometry to generalization. However, challenges such as mislabeled noisy data and privacy concerns have emerged as significant issues. Data attribution, which identifies the contributions of specific training samples, offers a promising solution. However, directly rendering existing data influence evaluation tools such as influence functions (IF) to SAM will be inapplicable or inaccurate as SAM utilizes an inner loop to find model perturbations that maximize loss, which the outer loop then minimizes, resulting in a doubled computational structure. Additionally, this bilevel structure complicates the modeling of data influence on the parameters. In this paper, based on the IF, we develop two innovative data valuation methods for SAM, each offering unique benefits in different scenarios: the Hessian-based IF and the Gradient Trajectory-based IF. The first one provides a comprehensive estimation of data influence using a closed-form measure that relies only on the trained model weights. In contrast, the other IF for SAM utilizes gradient trajectory information during training for more accurate and efficient data assessment. Extensive experiments demonstrate their effectiveness in data evaluation and parameter tuning, with applications in identifying mislabeled data, model editing, and enhancing interpretability.

Title: Consistent and Invariant Generalization Learning for Short-video Misinformation Detection

Authors: Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.04061
Pdf URL: https://arxiv.org/pdf/2507.04061
Copy Paste: [[2507.04061]] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection(https://arxiv.org/abs/2507.04061)
Keywords: diffusion
Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we propose deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design the diffusion model to add noise to retain core features of multi modal and enhance domain invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is public available at this https URL.

Title: Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Authors: Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04069
Pdf URL: https://arxiv.org/pdf/2507.04069
Copy Paste: [[2507.04069]] Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering(https://arxiv.org/abs/2507.04069)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context-particularly problematic - particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.

Title: Accurate and Efficient World Modeling with Masked Latent Transformers

Authors: Maxime Burchi, Radu Timofte
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04075
Pdf URL: https://arxiv.org/pdf/2507.04075
Copy Paste: [[2507.04075]] Accurate and Efficient World Modeling with Masked Latent Transformers(https://arxiv.org/abs/2507.04075)
Keywords: transformer
Abstract: The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation.

Title: S-Leak: Leakage-Abuse Attack Against Efficient Conjunctive SSE via s-term Leakage

Authors: Yue Su, Meng Shen, Cong Zuo, Yuzhi Liu, Liehuang Zhu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04077
Pdf URL: https://arxiv.org/pdf/2507.04077
Copy Paste: [[2507.04077]] S-Leak: Leakage-Abuse Attack Against Efficient Conjunctive SSE via s-term Leakage(https://arxiv.org/abs/2507.04077)
Keywords: secure, defense, attack
Abstract: Conjunctive Searchable Symmetric Encryption (CSSE) enables secure conjunctive searches over encrypted data. While leakage-abuse attacks (LAAs) against single-keyword SSE have been extensively studied, their extension to conjunctive queries faces a critical challenge: the combinatorial explosion of candidate keyword combinations, leading to enormous time and space overhead for attacks. In this paper, we reveal a fundamental vulnerability in state-of-the-art CSSE schemes: s-term leakage, where the keyword with the minimal document frequency in a query leaks distinct patterns. We propose S-Leak, the first passive attack framework that progressively recovers conjunctive queries by exploiting s-term leakage and global leakage. Our key innovation lies in a three-stage approach: identifying the s-term of queries, pruning low-probability keyword conjunctions, and reconstructing full queries. We propose novel metrics to better assess attacks in conjunctive query scenarios. Empirical evaluations on real-world datasets demonstrate that our attack is effective in diverse CSSE configurations. When considering 161,700 conjunctive keyword queries, our attack achieves a 95.15% accuracy in recovering at least one keyword, 82.57% for at least two, 58% for all three keywords, and maintains efficacy against defenses such as SEAL padding and CLRZ obfuscation. Our work exposes the underestimated risks of s-term leakage in practical SSE deployments and calls for a redesign of leakage models for multi-keyword search scenarios.

Title: Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Authors: Thomas Savage
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04099
Pdf URL: https://arxiv.org/pdf/2507.04099
Copy Paste: [[2507.04099]] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching(https://arxiv.org/abs/2507.04099)
Keywords: large language model
Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF's improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.

Title: Hierarchical Testing with Rabbit Optimization for Industrial Cyber-Physical Systems

Authors: Jinwei Hu, Zezhi Tang, Xin Jin, Benyuan Zhang, Yi Dong, Xiaowei Huang
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2507.04100
Pdf URL: https://arxiv.org/pdf/2507.04100
Copy Paste: [[2507.04100]] Hierarchical Testing with Rabbit Optimization for Industrial Cyber-Physical Systems(https://arxiv.org/abs/2507.04100)
Keywords: robust
Abstract: This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributions via global and local perspective. Its generalizability ensures applicability across diverse ICPS scenarios. This study specifically focuses on the Proton Exchange Membrane Fuel Cell system, chosen for its highly dynamic operational conditions, complex degradation mechanisms, and increasing integration into ICPS as a sustainable and efficient energy solution. Experimental results highlight HERO's ability to uncover vulnerabilities in even state-of-the-art PHM models, underscoring the critical need for enhanced robustness in real-world applications. By addressing these challenges, HERO demonstrates its potential to advance more resilient PHM systems across a wide range of ICPS domains.

Title: Human-Centered Interactive Anonymization for Privacy-Preserving Machine Learning: A Case for Human-Guided k-Anonymity

Authors: Sri Harsha Gajavalli
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04104
Pdf URL: https://arxiv.org/pdf/2507.04104
Copy Paste: [[2507.04104]] Human-Centered Interactive Anonymization for Privacy-Preserving Machine Learning: A Case for Human-Guided k-Anonymity(https://arxiv.org/abs/2507.04104)
Keywords: privacy
Abstract: Privacy-preserving machine learning (ML) seeks to balance data utility and privacy, especially as regulations like the GDPR mandate the anonymization of personal data for ML applications. Conventional anonymization approaches often reduce data utility due to indiscriminate generalization or suppression of data attributes. In this study, we propose an interactive approach that incorporates human input into the k-anonymization process, enabling domain experts to guide attribute preservation based on contextual importance. Using the UCI Adult dataset, we compare classification outcomes of interactive human-influenced anonymization with traditional, fully automated methods. Our results show that human input can enhance data utility in some cases, although results vary across tasks and settings. We discuss limitations of our approach and suggest potential areas for improved interactive frameworks in privacy-aware ML.

Title: Addressing The Devastating Effects Of Single-Task Data Poisoning In Exemplar-Free Continual Learning

Authors: Stanisław Pawlak (1), Bartłomiej Twardowski (2 and 3), Tomasz Trzciński (1 and 2), Joost van de Weijer (3) ((1) Warsaw University of Technology, Poland, (2) IDEAS Research Institute, Poland, (3) Computer Vision Center, Universitat Autonoma de Barcelona, Spain)
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04106
Pdf URL: https://arxiv.org/pdf/2507.04106
Copy Paste: [[2507.04106]] Addressing The Devastating Effects Of Single-Task Data Poisoning In Exemplar-Free Continual Learning(https://arxiv.org/abs/2507.04106)
Keywords: security, defense, attack
Abstract: Our research addresses the overlooked security concerns related to data poisoning in continual learning (CL). Data poisoning - the intentional manipulation of training data to affect the predictions of machine learning models - was recently shown to be a threat to CL training stability. While existing literature predominantly addresses scenario-dependent attacks, we propose to focus on a more simple and realistic single-task poison (STP) threats. In contrast to previously proposed poisoning settings, in STP adversaries lack knowledge and access to the model, as well as to both previous and future tasks. During an attack, they only have access to the current task within the data stream. Our study demonstrates that even within these stringent conditions, adversaries can compromise model performance using standard image corruptions. We show that STP attacks are able to strongly disrupt the whole continual training process: decreasing both the stability (its performance on past tasks) and plasticity (capacity to adapt to new tasks) of the algorithm. Finally, we propose a high-level defense framework for CL along with a poison task detection method based on task vectors. The code is available at this https URL .

Title: Integrated Gaussian Processes for Robust and Adaptive Multi-Object Tracking

Authors: Fred Lydeard, Bashar I. Ahmad, Simon Godsill
Subjects: cs.CV, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2507.04116
Pdf URL: https://arxiv.org/pdf/2507.04116
Copy Paste: [[2507.04116]] Integrated Gaussian Processes for Robust and Adaptive Multi-Object Tracking(https://arxiv.org/abs/2507.04116)
Keywords: robust
Abstract: This paper presents a computationally efficient multi-object tracking approach that can minimise track breaks (e.g., in challenging environments and against agile targets), learn the measurement model parameters on-line (e.g., in dynamically changing scenes) and infer the class of the tracked objects, if joint tracking and kinematic behaviour classification is sought. It capitalises on the flexibilities offered by the integrated Gaussian process as a motion model and the convenient statistical properties of non-homogeneous Poisson processes as a suitable observation model. This can be combined with the proposed effective track revival / stitching mechanism. We accordingly introduce the two robust and adaptive trackers, Gaussian and Poisson Process with Classification (GaPP-Class) and GaPP with Revival and Classification (GaPP-ReaCtion). They employ an appropriate particle filtering inference scheme that efficiently integrates track management and hyperparameter learning (including the object class, if relevant). GaPP-ReaCtion extends GaPP-Class with the addition of a Markov Chain Monte Carlo kernel applied to each particle permitting track revival and stitching (e.g., within a few time steps after deleting a trajectory). Performance evaluation and benchmarking using synthetic and real data show that GaPP-Class and GaPP-ReaCtion outperform other state-of-the-art tracking algorithms. For example, GaPP-ReaCtion significantly reduces track breaks (e.g., by around 30% from real radar data and markedly more from simulated data).

Title: PromptSR: Cascade Prompting for Lightweight Image Super-Resolution

Authors: Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04118
Pdf URL: https://arxiv.org/pdf/2507.04118
Copy Paste: [[2507.04118]] PromptSR: Cascade Prompting for Lightweight Image Super-Resolution(https://arxiv.org/abs/2507.04118)
Keywords: transformer
Abstract: Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at this https URL.

Title: When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need

Authors: Ziming Hong, Runnan Chen, Zengmao Wang, Bo Han, Bo Du, Tongliang Liu
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04119
Pdf URL: https://arxiv.org/pdf/2507.04119
Copy Paste: [[2507.04119]] When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need(https://arxiv.org/abs/2507.04119)
Keywords: security, robust, data-free
Abstract: Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD through divert the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers. Code is released at this https URL.

Title: Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

Authors: Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04123
Pdf URL: https://arxiv.org/pdf/2507.04123
Copy Paste: [[2507.04123]] Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge(https://arxiv.org/abs/2507.04123)
Keywords: robust
Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as a end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.

Title: Graph Neural Networks as a Substitute for Transformers in Single-Cell Transcriptomics

Authors: Jiaxin Qi, Yan Cui, Jinli Ou, Jianqiang Huang, Gaogang Xie
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2507.04125
Pdf URL: https://arxiv.org/pdf/2507.04125
Copy Paste: [[2507.04125]] Graph Neural Networks as a Substitute for Transformers in Single-Cell Transcriptomics(https://arxiv.org/abs/2507.04125)
Keywords: transformer
Abstract: Graph Neural Networks (GNNs) and Transformers share significant similarities in their encoding strategies for interacting with features from nodes of interest, where Transformers use query-key scores and GNNs use edges. Compared to GNNs, which are unable to encode relative positions, Transformers leverage dynamic attention capabilities to better represent relative relationships, thereby becoming the standard backbones in large-scale sequential pre-training. However, the subtle difference prompts us to consider: if positions are no longer crucial, could we substitute Transformers with Graph Neural Networks in some fields such as Single-Cell Transcriptomics? In this paper, we first explore the similarities and differences between GNNs and Transformers, specifically in terms of relative positions. Additionally, we design a synthetic example to illustrate their equivalence where there are no relative positions between tokens in the sample. Finally, we conduct extensive experiments on a large-scale position-agnostic dataset-single-cell transcriptomics-finding that GNNs achieve competitive performance compared to Transformers while consuming fewer computation resources. These findings provide novel insights for researchers in the field of single-cell transcriptomics, challenging the prevailing notion that the Transformer is always the optimum choice.

Title: BlowPrint: Blow-Based Multi-Factor Biometrics for Smartphone User Authentication

Authors: Howard Halim, Eyasu Getahun Chekole, Daniël Reijsbergen, Jianying Zhou
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04126
Pdf URL: https://arxiv.org/pdf/2507.04126
Copy Paste: [[2507.04126]] BlowPrint: Blow-Based Multi-Factor Biometrics for Smartphone User Authentication(https://arxiv.org/abs/2507.04126)
Keywords: security, attack, robust, biometric
Abstract: Biometric authentication is a widely used security mechanism that leverages unique physiological or behavioral characteristics to authenticate users. In multi-factor biometrics (MFB), multiple biometric modalities, e.g., physiological and behavioral, are integrated to mitigate the limitations inherent in single-factor biometrics. The main challenge in MFB lies in identifying novel behavioral techniques capable of meeting critical criteria, including high accuracy, high usability, non-invasiveness, resilience against spoofing attacks, and low use of computational resources. Despite ongoing advancements, current behavioral biometric techniques often fall short of fulfilling one or more of these requirements. In this work, we propose BlowPrint, a novel behavioral biometric technique that allows us to authenticate users based on their phone blowing behaviors. In brief, we assume that the way users blow on a phone screen can produce distinctive acoustic patterns, which can serve as a unique biometric identifier for effective user authentication. It can also be seamlessly integrated with physiological techniques, such as facial recognition, to enhance its robustness and security. To assess BlowPrint's effectiveness, we conduct an empirical study involving 50 participants from whom we collect blow-acoustic and facial feature data. Subsequently, we compute the similarity scores of the two modalities using various similarity algorithms and combine them through score-level fusion. Finally, we compute the accuracy using a machine learning-based classifier. As a result, the proposed method demonstrates an accuracy of 99.35% for blow acoustics, 99.96% for facial recognition, and 99.82% for the combined approach. The experimental results demonstrate BlowPrint's high effectiveness in terms of authentication accuracy, spoofing attack resilience, usability, non-invasiveness, and other aspects.

Title: BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Authors: Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04127
Pdf URL: https://arxiv.org/pdf/2507.04127
Copy Paste: [[2507.04127]] BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering(https://arxiv.org/abs/2507.04127)
Keywords: robust, large language model
Abstract: Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring-your-own") KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG-RAG framework is open-sourced at this https URL.

Title: Token Level Hallucination Detection via Variance in Language Models

Authors: Keshav Kumar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04137
Pdf URL: https://arxiv.org/pdf/2507.04137
Copy Paste: [[2507.04137]] Token Level Hallucination Detection via Variance in Language Models(https://arxiv.org/abs/2507.04137)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.

Title: Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Authors: Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04142
Pdf URL: https://arxiv.org/pdf/2507.04142
Copy Paste: [[2507.04142]] Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies(https://arxiv.org/abs/2507.04142)
Keywords: large language model
Abstract: Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.

Title: Large Language Models for Zero-Shot Multicultural Name Recognition

Authors: Thanakorn Phonchai, Surasakdi Siripong, Nicholas Patterson, Owen Campbell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04149
Pdf URL: https://arxiv.org/pdf/2507.04149
Copy Paste: [[2507.04149]] Large Language Models for Zero-Shot Multicultural Name Recognition(https://arxiv.org/abs/2507.04149)
Keywords: robust, large language model
Abstract: The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1\% overall accuracy and a remarkable 89.5\% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method's performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.

Title: Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Authors: Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04151
Pdf URL: https://arxiv.org/pdf/2507.04151
Copy Paste: [[2507.04151]] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation(https://arxiv.org/abs/2507.04151)
Keywords: diffusion, generative
Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.

Title: LVLM-Composer's Explicit Planning for Image Generation

Authors: Spencer Ramsey, Jeffrey Lee, Amina Grant
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04152
Pdf URL: https://arxiv.org/pdf/2507.04152
Copy Paste: [[2507.04152]] LVLM-Composer's Explicit Planning for Image Generation(https://arxiv.org/abs/2507.04152)
Keywords: robust, generative
Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.

Title: Uncertainty Quantification in the Tsetlin Machine

Authors: Runar Helin, Ole-Christoffer Granmo, Mayur Kishor Shende, Lei Jiao, Vladimir I. Zadorozhny, Kunal Ganesh Dumbre, Rishad Shafik, Alex Yakovlev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04175
Pdf URL: https://arxiv.org/pdf/2507.04175
Copy Paste: [[2507.04175]] Uncertainty Quantification in the Tsetlin Machine(https://arxiv.org/abs/2507.04175)
Keywords: explainability
Abstract: Data modeling using Tsetlin machines (TMs) is all about building logical rules from the data features. The decisions of the model are based on a combination of these logical rules. Hence, the model is fully transparent and it is possible to get explanations of its predictions. In this paper, we present a probability score for TM predictions and develop new techniques for uncertainty quantification to increase the explainability further. The probability score is an inherent property of any TM variant and is derived through an analysis of the TM learning dynamics. Simulated data is used to show a clear connection between the learned TM probability scores and the underlying probabilities of the data. A visualization of the probability scores also reveals that the TM is less confident in its predictions outside the training data domain, which contrasts the typical extrapolation phenomenon found in Artificial Neural Networks. The paper concludes with an application of the uncertainty quantification techniques on an image classification task using the CIFAR-10 dataset, where they provide new insights and suggest possible improvements to current TM image classification models.

Title: skfolio: Portfolio Optimization in Python

Authors: Carlo Nicolini, Matteo Manzi, Hugo Delatte
Subjects: cs.LG, q-fin.PM
Abstract URL: https://arxiv.org/abs/2507.04176
Pdf URL: https://arxiv.org/pdf/2507.04176
Copy Paste: [[2507.04176]] skfolio: Portfolio Optimization in Python(https://arxiv.org/abs/2507.04176)
Keywords: robust
Abstract: Portfolio optimization is a fundamental challenge in quantitative finance, requiring robust computational tools that integrate statistical rigor with practical implementation. We present skfolio, an open-source Python library for portfolio construction and risk management that seamlessly integrates with the scikit-learn ecosystem. skfolio provides a unified framework for diverse allocation strategies, from classical mean-variance optimization to modern clustering-based methods, state-of-the-art financial estimators with native interfaces, and advanced cross-validation techniques tailored for financial time series. By adhering to scikit-learn's fit-predict-transform paradigm, the library enables researchers and practitioners to leverage machine learning workflows for portfolio optimization, promoting reproducibility and transparency in quantitative finance.

Title: SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

Authors: Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Yulan He, Lin Gui
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.04189
Pdf URL: https://arxiv.org/pdf/2507.04189
Copy Paste: [[2507.04189]] SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding(https://arxiv.org/abs/2507.04189)
Keywords: extraction, large language model
Abstract: Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.

Title: ML-Enhanced AES Anomaly Detection for Real-Time Embedded Security

Authors: Nishant Chinnasami, Rye Stahle-Smith, Rasha Karakchi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04197
Pdf URL: https://arxiv.org/pdf/2507.04197
Copy Paste: [[2507.04197]] ML-Enhanced AES Anomaly Detection for Real-Time Embedded Security(https://arxiv.org/abs/2507.04197)
Keywords: security, attack
Abstract: Advanced Encryption Standard (AES) is a widely adopted cryptographic algorithm, yet its practical implementations remain susceptible to side-channel and fault injection attacks. In this work, we propose a comprehensive framework that enhances AES-128 encryption security through controlled anomaly injection and real-time anomaly detection using both statistical and machine learning (ML) methods. We simulate timing and fault-based anomalies by injecting execution delays and ciphertext perturbations during encryption, generating labeled datasets for detection model training. Two complementary detection mechanisms are developed: a threshold-based timing anomaly detector and a supervised Random Forest classifier trained on combined timing and ciphertext features. We implement and evaluate the framework on both CPU and FPGA-based SoC hardware (PYNQ-Z1), measuring performance across varying block sizes, injection rates, and core counts. Our results show that ML-based detection significantly outperforms threshold-based methods in precision and recall while maintaining real-time performance on embedded hardware. Compared to existing AES anomaly detection methods, our solution offers a low-cost, real-time, and accurate detection approach deployable on lightweight FPGA platforms.

Title: An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)

Authors: KiHyun Yun
Subjects: cs.LG, math.AP
Abstract URL: https://arxiv.org/abs/2507.04203
Pdf URL: https://arxiv.org/pdf/2507.04203
Copy Paste: [[2507.04203]] An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)(https://arxiv.org/abs/2507.04203)
Keywords: diffusion, generative
Abstract: In denoising diffusion probabilistic models (DDPMs), the learned noise predictor $ \epsilon_{\theta} ( {\bf x}_t , t)$ is trained to approximate the forward-process noise $\epsilon_t$. The equality $\nabla_{{\bf x}_t} \log q({\bf x}_t) = -\frac 1 {\sqrt {1- {\bar \alpha}_t} } \epsilon_{\theta} ( {\bf x}_t , t)$ plays a fundamental role in both theoretical analyses and algorithmic design, and thus is frequently employed across diffusion-based generative models. In this paper, an explicit formulation of $ \epsilon_{\theta} ( {\bf x}_t , t)$ in terms of the forward-process noise $\epsilon_t$ is derived. This result show how the forward-process noise $\epsilon_t$ contributes to the learned predictor $ \epsilon_{\theta} ( {\bf x}_t , t)$. Furthermore, based on this formulation, we present a novel and mathematically rigorous proof of the fundamental equality above, clarifying its origin and providing new theoretical insight into the structure of diffusion models.

Title: Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration

Authors: Yu-Shan Tai, An-Yeu (Andy)Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04207
Pdf URL: https://arxiv.org/pdf/2507.04207
Copy Paste: [[2507.04207]] Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration(https://arxiv.org/abs/2507.04207)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance the stochasticity and mitigate potential disharmony. We validate proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining original performance.

Title: Can Large Language Models Automate the Refinement of Cellular Network Specifications?

Authors: Jianshuo Dong, Tianyi Zhang, Feng Yan, Yuanjie Li, Hewu Li, Han Qiu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04214
Pdf URL: https://arxiv.org/pdf/2507.04214
Copy Paste: [[2507.04214]] Can Large Language Models Automate the Refinement of Cellular Network Specifications?(https://arxiv.org/abs/2507.04214)
Keywords: security, attack, large language model
Abstract: Cellular networks serve billions of users globally, yet concerns about reliability and security persist due to weaknesses in 3GPP standards. However, traditional analysis methods, including manual inspection and automated tools, struggle with increasingly expanding cellular network specifications. This paper investigates the feasibility of Large Language Models (LLMs) for automated cellular network specification refinement. To advance it, we leverage 200,000+ approved 3GPP Change Requests (CRs) that document specification revisions, constructing a valuable dataset for domain tasks. We introduce CR-eval, a principled evaluation framework, and benchmark 16 state-of-the-art LLMs, demonstrating that top models can discover security-related weaknesses in over 127 out of 200 test cases within five trials. To bridge potential gaps, we explore LLM specialization techniques, including fine-tuning an 8B model to match or surpass advanced LLMs like GPT-4o and DeepSeek-R1. Evaluations on 30 cellular attacks identify open challenges for achieving full automation. These findings confirm that LLMs can automate the refinement of cellular network specifications and provide valuable insights to guide future research in this direction.

Title: DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design

Authors: Xiwei Hu, Haokun Chen, Zhongqi Qi, Hui Zhang, Dexiang Hong, Jie Shao, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04218
Pdf URL: https://arxiv.org/pdf/2507.04218
Copy Paste: [[2507.04218]] DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design(https://arxiv.org/abs/2507.04218)
Keywords: generative
Abstract: We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0 to uniformly process different poster generating types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster's superiority over existing methods, achieving a high usability rate of 88.55\%, compared to GPT-4o (47.56\%) and SeedEdit3.0 (25.96\%). DreamPoster will be online in Jimeng and other Bytedance Apps.

Title: Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Authors: Yan Scholten, Sophie Xhonneux, Stephan Günnemann, Leo Schwinn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04219
Pdf URL: https://arxiv.org/pdf/2507.04219
Copy Paste: [[2507.04219]] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs(https://arxiv.org/abs/2507.04219)
Keywords: privacy, generative
Abstract: Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at this https URL.

Title: Context Tuning for In-Context Optimization

Authors: Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04221
Pdf URL: https://arxiv.org/pdf/2507.04221
Copy Paste: [[2507.04221]] Context Tuning for In-Context Optimization(https://arxiv.org/abs/2507.04221)
Keywords: large language model
Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for large language models (LLMs), they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.

Title: Fairness Evaluation of Large Language Models in Academic Library Reference Services

Authors: Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2507.04224
Pdf URL: https://arxiv.org/pdf/2507.04224
Copy Paste: [[2507.04224]] Fairness Evaluation of Large Language Models in Academic Library Reference Services(https://arxiv.org/abs/2507.04224)
Keywords: fair, large language model
Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

Title: Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions

Authors: Dapeng Jiang, Xiangzhe Kong, Jiaqi Han, Mingyu Li, Rui Jiao, Wenbing Huang, Stefano Ermon, Jianzhu Ma, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04225
Pdf URL: https://arxiv.org/pdf/2507.04225
Copy Paste: [[2507.04225]] Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions(https://arxiv.org/abs/2507.04225)
Keywords: diffusion, generative
Abstract: Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite trained with linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38% to 84% on different cyclization strategies.

Title: Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties

Authors: Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04227
Pdf URL: https://arxiv.org/pdf/2507.04227
Copy Paste: [[2507.04227]] Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties(https://arxiv.org/abs/2507.04227)
Keywords: attack, robust
Abstract: Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties remains largely unexplored. Owing to their black-box and autonomous nature, these agents are vulnerable to manipulations that could compromise user devices. In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce a scalable attack simulation framework AgentHazard, which enables flexible and targeted modifications of screen content within existing applications. Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment and a static dataset of vision-language-action tuples, totaling over 3,000 attack scenarios. The dynamic environment encompasses 58 reproducible tasks in an emulator with various types of hazardous UI content, while the static dataset is constructed from 210 screenshots collected from 14 popular commercial apps. Importantly, our content modifications are designed to be feasible for unprivileged third parties. We evaluate 7 widely-used mobile GUI agents and 5 common backbone models using our benchmark. Our findings reveal that all examined agents are significantly influenced by misleading third-party content (with an average misleading rate of 28.8% in human-crafted attack scenarios) and that their vulnerabilities are closely linked to the employed perception modalities and backbone LLMs. Furthermore, we assess training-based mitigation strategies, highlighting both the challenges and opportunities for enhancing the robustness of mobile GUI agents. Our code and data will be released at this https URL.

Title: Scaling Context Requires Rethinking Attention

Authors: Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04239
Pdf URL: https://arxiv.org/pdf/2507.04239
Copy Paste: [[2507.04239]] Scaling Context Requires Rethinking Attention(https://arxiv.org/abs/2507.04239)
Keywords: transformer
Abstract: We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention shows that these models dominate both exponential attention and linear attention at long-context training.

Title: Domain Generalizable Portrait Style Transfer

Authors: Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04243
Pdf URL: https://arxiv.org/pdf/2507.04243
Copy Paste: [[2507.04243]] Domain Generalizable Portrait Style Transfer(https://arxiv.org/abs/2507.04243)
Keywords: diffusion
Abstract: This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at this https URL.

Title: Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

Authors: Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, Ruoxi Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04250
Pdf URL: https://arxiv.org/pdf/2507.04250
Copy Paste: [[2507.04250]] Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning(https://arxiv.org/abs/2507.04250)
Keywords: robust, large language model
Abstract: Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserve overall utility.

Title: MoReMouse: Monocular Reconstruction of Laboratory Mouse

Authors: Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04258
Pdf URL: https://arxiv.org/pdf/2507.04258
Copy Paste: [[2507.04258]] MoReMouse: Monocular Reconstruction of Laboratory Mouse(https://arxiv.org/abs/2507.04258)
Keywords: robust, transformer
Abstract: Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice, by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at this https URL.

Title: An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging

Authors: Saeed Jamshidiha, Alireza Rezaee, Farshid Hajati, Mojtaba Golzan, Raymond Chiong
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04259
Pdf URL: https://arxiv.org/pdf/2507.04259
Copy Paste: [[2507.04259]] An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging(https://arxiv.org/abs/2507.04259)
Keywords: transformer, generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that affects millions worldwide. In the absence of effective treatment options, early diagnosis is crucial for initiating management strategies to delay disease onset and slow down its progression. In this study, we propose Retformer, a novel transformer-based architecture for detecting AD using retinal imaging modalities, leveraging the power of transformers and explainable artificial intelligence. The Retformer model is trained on datasets of different modalities of retinal images from patients with AD and age-matched healthy controls, enabling it to learn complex patterns and relationships between image features and disease diagnosis. To provide insights into the decision-making process of our model, we employ the Gradient-weighted Class Activation Mapping algorithm to visualize the feature importance maps, highlighting the regions of the retinal images that contribute most significantly to the classification outcome. These findings are compared to existing clinical studies on detecting AD using retinal biomarkers, allowing us to identify the most important features for AD detection in each imaging modality. The Retformer model outperforms a variety of benchmark algorithms across different performance metrics by margins of up to 11\.

Title: ZERO: Multi-modal Prompt-based Visual Grounding

Authors: Sangbum Choi, Kyeongryeol Go
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04270
Pdf URL: https://arxiv.org/pdf/2507.04270
Copy Paste: [[2507.04270]] ZERO: Multi-modal Prompt-based Visual Grounding(https://arxiv.org/abs/2507.04270)
Keywords: robust
Abstract: Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.

Title: VOLTRON: Detecting Unknown Malware Using Graph-Based Zero-Shot Learning

Authors: M. Tahir Akdeniz, Zeynep Yeşilkaya, İ. Enes Köse, İ. Ulaş Ünal, Sevil Şen
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04275
Pdf URL: https://arxiv.org/pdf/2507.04275
Copy Paste: [[2507.04275]] VOLTRON: Detecting Unknown Malware Using Graph-Based Zero-Shot Learning(https://arxiv.org/abs/2507.04275)
Keywords: security, robust
Abstract: The persistent threat of Android malware presents a serious challenge to the security of millions of users globally. While many machine learning-based methods have been developed to detect these threats, their reliance on large labeled datasets limits their effectiveness against emerging, previously unseen malware families, for which labeled data is scarce or nonexistent. To address this challenge, we introduce a novel zero-shot learning framework that combines Variational Graph Auto-Encoders (VGAE) with Siamese Neural Networks (SNN) to identify malware without needing prior examples of specific malware families. Our approach leverages graph-based representations of Android applications, enabling the model to detect subtle structural differences between benign and malicious software, even in the absence of labeled data for new threats. Experimental results show that our method outperforms the state-of-the-art MaMaDroid, especially in zero-day malware detection. Our model achieves 96.24% accuracy and 95.20% recall for unknown malware families, highlighting its robustness against evolving Android threats.

Title: SeqTex: Generate Mesh Textures in Video Sequence

Authors: Ze Yuan (1), Xin Yu (1), Yangtian Sun (1), Yuan-Chen Guo (2), Yan-Pei Cao (2), Ding Liang (2), Xiaojuan Qi (1) ((1) HKU, (2) VAST)
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2507.04285
Pdf URL: https://arxiv.org/pdf/2507.04285
Copy Paste: [[2507.04285]] SeqTex: Generate Mesh Textures in Video Sequence(https://arxiv.org/abs/2507.04285)
Keywords: generative
Abstract: Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.

Title: M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding

Authors: Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Bin Li, Shoujun Zhou, Hongliang Li, Fuxia Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04289
Pdf URL: https://arxiv.org/pdf/2507.04289
Copy Paste: [[2507.04289]] M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding(https://arxiv.org/abs/2507.04289)
Keywords: large language model
Abstract: With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design moves beyond simple text matching and poses a substantial challenge to a model's deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions where model performance drops sharply. M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.

Title: MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Boyu Diao, Fuzhen Zhuang, Michele Magno, Yongjun Xu, Yingli Tian, Tingwen Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04290
Pdf URL: https://arxiv.org/pdf/2507.04290
Copy Paste: [[2507.04290]] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation(https://arxiv.org/abs/2507.04290)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable performance on vision generation tasks. However, the high computational complexity hinders its wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from the outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved \textbf{M}ixed \textbf{P}recision \textbf{Q}uantization framework for extremely low-bit \textbf{D}iffusion \textbf{M}odels. For the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for uniform quantizer. We propose \textit{Flexible Z-Order Residual Mixed Quantization} that utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyzed the convergence and optimality of the LoRA module and propose \textit{Object-Oriented Low-Rank Initialization} to use prior quantization error for informative initialization. We then propose \textit{Memory-based Temporal Relation Distillation} to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision model. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.

Title: QF: Quick Feedforward AI Model Training without Gradient Back Propagation

Authors: Feng Qi
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2507.04300
Pdf URL: https://arxiv.org/pdf/2507.04300
Copy Paste: [[2507.04300]] QF: Quick Feedforward AI Model Training without Gradient Back Propagation(https://arxiv.org/abs/2507.04300)
Keywords: transformer
Abstract: We propose Quick Feedforward (QF) Learning, a novel knowledge consolidation framework for transformer-based models that enables efficient transfer of instruction derived knowledge into model weights through feedforward activations without any gradient back propagation. Unlike traditional finetuning, QF updates are computed in closed form, require minimal parameter modification, and preserve prior knowledge. Importantly, QF allows models to train and infer within the same runtime environment, making the process more resource efficient and closely aligned with how the human brain operates. Code and models are open sourced on GitHub. I hope QF Learning inspires a more efficient and brain-like paradigm for AI systems.

Title: Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis

Authors: Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou, Jiaxin Jiang, Tianqi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04306
Pdf URL: https://arxiv.org/pdf/2507.04306
Copy Paste: [[2507.04306]] Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis(https://arxiv.org/abs/2507.04306)
Keywords: robust
Abstract: Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at this https URL.

Title: Heterogeneous Federated Learning with Prototype Alignment and Upscaling

Authors: Gyuejeong Lee, Jihwan Shin, Daeyoung Choi
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2507.04310
Pdf URL: https://arxiv.org/pdf/2507.04310
Copy Paste: [[2507.04310]] Heterogeneous Federated Learning with Prototype Alignment and Upscaling(https://arxiv.org/abs/2507.04310)
Keywords: federate
Abstract: Heterogeneity in data distributions and model architectures remains a significant challenge in federated learning (FL). Various heterogeneous FL (HtFL) approaches have recently been proposed to address this challenge. Among them, prototype-based FL (PBFL) has emerged as a practical framework that only shares per-class mean activations from the penultimate layer. However, PBFL approaches often suffer from suboptimal prototype separation, limiting their discriminative power. We propose Prototype Normalization (ProtoNorm), a novel PBFL framework that addresses this limitation through two key components: Prototype Alignment (PA) and Prototype Upscaling (PU). The PA method draws inspiration from the Thomson problem in classical physics, optimizing global prototype configurations on a unit sphere to maximize angular separation; subsequently, the PU method increases prototype magnitudes to enhance separation in Euclidean space. Extensive evaluations on benchmark datasets show that our approach better separates prototypes and thus consistently outperforms existing HtFL approaches. Notably, since ProtoNorm inherits the communication efficiency of PBFL and the PA is performed server-side, it is particularly suitable for resource-constrained environments.

Title: TinyProto: Communication-Efficient Federated Learning with Sparse Prototypes in Resource-Constrained Environments

Authors: Gyuejeong Lee, Daeyoung Choi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04327
Pdf URL: https://arxiv.org/pdf/2507.04327
Copy Paste: [[2507.04327]] TinyProto: Communication-Efficient Federated Learning with Sparse Prototypes in Resource-Constrained Environments(https://arxiv.org/abs/2507.04327)
Keywords: federate
Abstract: Communication efficiency in federated learning (FL) remains a critical challenge for resource-constrained environments. While prototype-based FL reduces communication overhead by sharing class prototypes-mean activations in the penultimate layer-instead of model parameters, its efficiency decreases with larger feature dimensions and class counts. We propose TinyProto, which addresses these limitations through Class-wise Prototype Sparsification (CPS) and adaptive prototype scaling. CPS enables structured sparsity by allocating specific dimensions to class prototypes and transmitting only non-zero elements, while adaptive scaling adjusts prototypes based on class distributions. Our experiments show TinyProto reduces communication costs by up to 4x compared to existing methods while maintaining performance. Beyond its communication efficiency, TinyProto offers crucial advantages: achieving compression without client-side computational overhead and supporting heterogeneous architectures, making it ideal for resource-constrained heterogeneous FL.

Title: No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Authors: Dasol Choi, Woomyoung Park, Youngsook Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04329
Pdf URL: https://arxiv.org/pdf/2507.04329
Copy Paste: [[2507.04329]] No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem(https://arxiv.org/abs/2507.04329)
Keywords: large language model
Abstract: Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

Title: Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Authors: Yuanhe Tian, Chen Su, Junwen Duan, Yan Song
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04333
Pdf URL: https://arxiv.org/pdf/2507.04333
Copy Paste: [[2507.04333]] Computed Tomography Visual Question Answering with Cross-modal Feature Graphing(https://arxiv.org/abs/2507.04333)
Keywords: robust, large language model
Abstract: Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.

Title: Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

Authors: Soojong Kim, Kwanho Kim, Hye Min Kim
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2507.04364
Pdf URL: https://arxiv.org/pdf/2507.04364
Copy Paste: [[2507.04364]] Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products(https://arxiv.org/abs/2507.04364)
Keywords: large language model
Abstract: Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.

Title: Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04365
Pdf URL: https://arxiv.org/pdf/2507.04365
Copy Paste: [[2507.04365]] Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs(https://arxiv.org/abs/2507.04365)
Keywords: defense, attack, large language model
Abstract: As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.

Title: Adaptive Malware Detection using Sequential Feature Selection: A Dueling Double Deep Q-Network (D3QN) Framework for Intelligent Classification

Authors: Naseem Khan, Aref Y. Al-Tamimi, Amine Bermak, Issa M. Khalil
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2507.04372
Pdf URL: https://arxiv.org/pdf/2507.04372
Copy Paste: [[2507.04372]] Adaptive Malware Detection using Sequential Feature Selection: A Dueling Double Deep Q-Network (D3QN) Framework for Intelligent Classification(https://arxiv.org/abs/2507.04372)
Keywords: extraction
Abstract: Traditional malware detection methods exhibit computational inefficiency due to exhaustive feature extraction requirements, creating accuracy-efficiency trade-offs that limit real-time deployment. We formulate malware classification as a Markov Decision Process with episodic feature acquisition and propose a Dueling Double Deep Q-Network (D3QN) framework for adaptive sequential feature selection. The agent learns to dynamically select informative features per sample before terminating with classification decisions, optimizing both detection accuracy and computational cost through reinforcement learning. We evaluate our approach on Microsoft Big2015 (9-class, 1,795 features) and BODMAS (binary, 2,381 features) datasets. D3QN achieves 99.22% and 98.83% accuracy while utilizing only 61 and 56 features on average, representing 96.6% and 97.6% dimensionality reduction. This yields computational efficiency improvements of 30.1x and 42.5x over traditional ensemble methods. Comprehensive ablation studies demonstrate consistent superiority over Random Forest, XGBoost, and static feature selection approaches. Quantitative analysis demonstrates that D3QN learns non-random feature selection policies with 62.5% deviation from uniform baseline distributions. The learned policies exhibit structured hierarchical preferences, utilizing high-level metadata features for initial assessment while selectively incorporating detailed behavioral features based on classification uncertainty. Feature specialization analysis reveals 57.7% of examined features demonstrate significant class-specific discrimination patterns. Our results validate reinforcement learning-based sequential feature selection for malware classification, achieving superior accuracy with substantial computational reduction through learned adaptive policies.

Title: Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Authors: Xiao Zhang, Johan Bos
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.04377
Pdf URL: https://arxiv.org/pdf/2507.04377
Copy Paste: [[2507.04377]] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions(https://arxiv.org/abs/2507.04377)
Keywords: robust
Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstones digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) for integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.

Title: Transferring Visual Explainability of Self-Explaining Models through Task Arithmetic

Authors: Yuya Yoshikawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04380
Pdf URL: https://arxiv.org/pdf/2507.04380
Copy Paste: [[2507.04380]] Transferring Visual Explainability of Self-Explaining Models through Task Arithmetic(https://arxiv.org/abs/2507.04380)
Keywords: robust, explainability
Abstract: In scenarios requiring both prediction and explanation efficiency for image classification, self-explaining models that perform both tasks in a single inference are effective. However, their training incurs substantial labeling and computational costs. This study aims to tackle the issue by proposing a method to transfer the visual explainability of self-explaining models, learned in a source domain, to a target domain based on a task arithmetic framework. Specifically, we construct a self-explaining model by extending image classifiers based on a vision-language pretrained model. We then define an \emph{explainability vector} as the difference between model parameters trained on the source domain with and without explanation supervision. Based on the task arithmetic framework, we impart explainability to a model trained only on the prediction task in the target domain by applying the explainability vector. Experimental results on various image classification datasets demonstrate that, except for transfers between some less-related domains, visual explainability can be successfully transferred from source to target domains, improving explanation quality in the target domain without sacrificing classification accuracy. Furthermore, we show that the explainability vector learned on a large and diverse dataset like ImageNet, extended with explanation supervision, exhibits universality and robustness, improving explanation quality on nine out of ten different target datasets. We also find that the explanation quality achieved with a single model inference is comparable to that of Kernel SHAP, which requires 150 model inferences.

Title: Tractable Representation Learning with Probabilistic Circuits

Authors: Steven Braun, Sahil Sidheekh, Antonio Vergari, Martin Mundt, Sriraam Natarajan, Kristian Kersting
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04385
Pdf URL: https://arxiv.org/pdf/2507.04385
Copy Paste: [[2507.04385]] Tractable Representation Learning with Probabilistic Circuits(https://arxiv.org/abs/2507.04385)
Keywords: robust
Abstract: Probabilistic circuits (PCs) are powerful probabilistic models that enable exact and tractable inference, making them highly suitable for probabilistic reasoning and inference tasks. While dominant in neural networks, representation learning with PCs remains underexplored, with prior approaches relying on external neural embeddings or activation-based encodings. To address this gap, we introduce autoencoding probabilistic circuits (APCs), a novel framework leveraging the tractability of PCs to model probabilistic embeddings explicitly. APCs extend PCs by jointly modeling data and embeddings, obtaining embedding representations through tractable probabilistic inference. The PC encoder allows the framework to natively handle arbitrary missing data and is seamlessly integrated with a neural decoder in a hybrid, end-to-end trainable architecture enabled by differentiable sampling. Our empirical evaluation demonstrates that APCs outperform existing PC-based autoencoding methods in reconstruction quality, generate embeddings competitive with, and exhibit superior robustness in handling missing data compared to neural autoencoders. These results highlight APCs as a powerful and flexible representation learning method that exploits the probabilistic inference capabilities of PCs, showing promising directions for robust inference, out-of-distribution detection, and knowledge distillation.

Title: Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers

Authors: Jung-Ho Hong, Ho-Joong Kim, Kyu-Sung Jeon, Seong-Whan Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04388
Pdf URL: https://arxiv.org/pdf/2507.04388
Copy Paste: [[2507.04388]] Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers(https://arxiv.org/abs/2507.04388)
Keywords: fair, transformer
Abstract: The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.

Title: Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Authors: Ruochen Zhou, Minrui Xu, Shiqi Chen, Junteng Liu, Yunqi Li, Xinxin Lin, Zhengyu Chen, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04391
Pdf URL: https://arxiv.org/pdf/2507.04391
Copy Paste: [[2507.04391]] Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?(https://arxiv.org/abs/2507.04391)
Keywords: robust, large language model
Abstract: There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks show that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In constrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model's reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.

Title: RegistrationMamba: A Mamba-based Registration Framework Integrating Multi-Expert Feature Learning for Cross-Modal Remote Sensing Images

Authors: Wei Wang, Dou Quan, Chonghua Lv, Shuang Wang, Ning Huyan, Yunan Li, Licheng Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04397
Pdf URL: https://arxiv.org/pdf/2507.04397
Copy Paste: [[2507.04397]] RegistrationMamba: A Mamba-based Registration Framework Integrating Multi-Expert Feature Learning for Cross-Modal Remote Sensing Images(https://arxiv.org/abs/2507.04397)
Keywords: robust, extraction, transformer
Abstract: Cross-modal remote sensing image (CRSI) registration is critical for multi-modal image applications. However, CRSI mainly faces two challenges: significant nonlinear radiometric variations between cross-modal images and limited textures hindering the discriminative information extraction. Existing methods mainly adopt convolutional neural networks (CNNs) or Transformer architectures to extract discriminative features for registration. However, CNNs with the local receptive field fail to capture global contextual features, and Transformers have high computational complexity and restrict their application to high-resolution CRSI. To solve these issues, this paper proposes RegistrationMamba, a novel Mamba architecture based on state space models (SSMs) integrating multi-expert feature learning for improving the accuracy of CRSI registration. Specifically, RegistrationMamba employs a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity. To enhance the performance of RegistrationMamba under texture-limited scenarios, we propose a multi-expert feature learning (MEFL) strategy to capture features from various augmented image variants through multiple feature experts. MEFL leverages a learnable soft router to dynamically fuse the features from multiple experts, thereby enriching feature representations and improving registration performance. Notably, MEFL can be seamlessly integrated into various frameworks, substantially boosting registration performance. Additionally, RegistrationMamba integrates a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction between global and local features. Extensive experiments on CRSI with varying image resolutions have demonstrated that RegistrationMamba has superior performance and robustness compared to state-of-the-art methods.

Title: Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Authors: Tongyan Hua, Lutao Jiang, Ying-Cong Chen, Wufan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04403
Pdf URL: https://arxiv.org/pdf/2507.04403
Copy Paste: [[2507.04403]] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion(https://arxiv.org/abs/2507.04403)
Keywords: diffusion, generative
Abstract: Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance this http URL overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

Title: MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture

Authors: Guandong Li, Mengxia Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04409
Pdf URL: https://arxiv.org/pdf/2507.04409
Copy Paste: [[2507.04409]] MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture(https://arxiv.org/abs/2507.04409)
Keywords: robust, extraction, transformer
Abstract: Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.

Title: Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models

Authors: Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, Hung Cao
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.04410
Pdf URL: https://arxiv.org/pdf/2507.04410
Copy Paste: [[2507.04410]] Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models(https://arxiv.org/abs/2507.04410)
Keywords: extraction, large language model
Abstract: This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.

Title: THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Authors: Nico Hofmann, Julian Dauenhauer, Nils Ole Dietzler, Idehen Daniel Idahor, Christin Katharina Kreutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04414
Pdf URL: https://arxiv.org/pdf/2507.04414
Copy Paste: [[2507.04414]] THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts(https://arxiv.org/abs/2507.04414)
Keywords: large language model
Abstract: Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, so very short texts with little context. To tackle this task we investigate the identification of complex terms in sentences which are rephrased using small Gemini and OpenAI large language models for non-expert readers.

Title: MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Authors: Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04415
Pdf URL: https://arxiv.org/pdf/2507.04415
Copy Paste: [[2507.04415]] MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind(https://arxiv.org/abs/2507.04415)
Keywords: large language model
Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI's multimodal understanding of human behavior.

Title: RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

Authors: Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04416
Pdf URL: https://arxiv.org/pdf/2507.04416
Copy Paste: [[2507.04416]] RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling(https://arxiv.org/abs/2507.04416)
Keywords: transformer
Abstract: Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called \rat between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, \rat enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the \rat layer achieves a $7\times$ improvement in training speed with 100K token sequences and $9\times$ in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning~(SFT). We further propose a hybrid architecture that interleaves \rat with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at this https URL

Title: Enhancing Phishing Detection in Financial Systems through NLP

Authors: Novruz Amirov, Leminur Celik, Egemen Ali Caner, Emre Yurdakul, Fahri Anil Yerlikaya, Serif Bahtiyar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04426
Pdf URL: https://arxiv.org/pdf/2507.04426
Copy Paste: [[2507.04426]] Enhancing Phishing Detection in Financial Systems through NLP(https://arxiv.org/abs/2507.04426)
Keywords: security, protect, attack, robust
Abstract: The threat of phishing attacks in financial systems is continuously growing. Therefore, protecting sensitive information from unauthorized access is paramount. This paper discusses the critical need for robust email phishing detection. Several existing methods, including blacklists and whitelists, play a crucial role in detecting phishing attempts. Nevertheless, these methods possess inherent limitations, emphasizing the need for the development of a more advanced solution. Our proposed solution presents a pioneering Natural Language Processing (NLP) approach for phishing email detection. Leveraging semantic similarity and TFIDF (Term Frequency-Inverse Document Frequency) analysis, our solution identifies keywords in phishing emails, subsequently evaluating the semantic similarities with a dedicated phishing dataset, ultimately contributing to the enhancement of cybersecurity and NLP domains through a robust solution for detecting phishing threats in financial systems. Experimental results show the accuracy of our phishing detection method can reach 79.8 percent according to TF-IDF analysis, while it can reach 67.2 percent according to semantic analysis.

Title: Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking

Authors: Tim Beyer, Yan Scholten, Stephan Günnemann, Leo Schwinn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04446
Pdf URL: https://arxiv.org/pdf/2507.04446
Copy Paste: [[2507.04446]] Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking(https://arxiv.org/abs/2507.04446)
Keywords: attack, robust, data-free, large language model
Abstract: To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail-risks, providing better estimates for model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts ASR by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy-maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.

Title: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04447
Pdf URL: https://arxiv.org/pdf/2507.04447
Copy Paste: [[2507.04447]] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge(https://arxiv.org/abs/2507.04447)
Keywords: diffusion, transformer
Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.

Title: CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04451
Pdf URL: https://arxiv.org/pdf/2507.04451
Copy Paste: [[2507.04451]] CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step(https://arxiv.org/abs/2507.04451)
Keywords: diffusion, large language model
Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

Title: ESSA: Evolutionary Strategies for Scalable Alignment

Authors: Daria Korotyshova, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, George Bredis, Alexey Gorbatovski, Viacheslav Sinii, Daniil Gavrilov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04453
Pdf URL: https://arxiv.org/pdf/2507.04453
Copy Paste: [[2507.04453]] ESSA: Evolutionary Strategies for Scalable Alignment(https://arxiv.org/abs/2507.04453)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are increasingly relying on alignment techniques to ensure that their outputs match human preferences. Although reinforcement learning from human feedback (RLHF) is the dominant approach, it has high computational costs, memory requirements, and training instability, particularly when scaling to larger models. This paper introduces ESSA (Evolutionary Strategies for Scalable Alignment), a new framework that uses Evolutionary Strategies (ES) to efficiently align LLMs without the need for gradient computation. ES is well-suited for LLM alignment due to its favorable properties, such as high parallelizability, memory efficiency, robustness to sparse rewards, and fewer data samples required for convergence, especially when starting from a strong pre-trained policy. Moreover, ES eliminates the need for extensive hyperparameter tuning, making the alignment process simpler and more stable. Although ES excels in low-dimensional optimization, it poses a challenge when applied to high-dimensional LLMs. To address this challenge, we propose a parameter-efficient architectural modification that reduces the dimensionality of optimization through low-rank adaptation. We evaluated our approach on mathematical reasoning tasks with verifiable accuracy-based metrics, demonstrating that ESSA converges faster and is more data efficient than gradient-based methods like Group Relative Policy Optimization (GRPO). Our findings establish ES as a promising and scalable alternative to gradient-based alignment, paving the way for efficient post-training of large language models.

Title: GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

Authors: Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04455
Pdf URL: https://arxiv.org/pdf/2507.04455
Copy Paste: [[2507.04455]] GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models(https://arxiv.org/abs/2507.04455)
Keywords: privacy, protect, large language model
Abstract: The rapid growth of large language models (LLMs) with traditional centralized fine-tuning emerges as a key technique for adapting these models to domain-specific challenges, yielding privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), is proposed to address these challenges, where a weaker emulator is compressed from the original model and further fine-tuned with adapter to enhance privacy. However, the existing OT-based methods require high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.

Title: UniAud: A Unified Auditing Framework for High Auditing Power and Utility with One Training Run

Authors: Ruixuan Liu, Li Xiong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04457
Pdf URL: https://arxiv.org/pdf/2507.04457
Copy Paste: [[2507.04457]] UniAud: A Unified Auditing Framework for High Auditing Power and Utility with One Training Run(https://arxiv.org/abs/2507.04457)
Keywords: privacy
Abstract: Differentially private (DP) optimization has been widely adopted as a standard approach to provide rigorous privacy guarantees for training datasets. DP auditing verifies whether a model trained with DP optimization satisfies its claimed privacy level by estimating empirical privacy lower bounds through hypothesis testing. Recent O(1) frameworks improve auditing efficiency by checking the membership status of multiple audit samples in a single run, rather than checking individual samples across multiple runs. However, we reveal that there is no free lunch for this improved efficiency: data dependency and an implicit conflict between auditing and utility impair the tightness of the auditing results. Addressing these challenges, our key insights include reducing data dependency through uncorrelated data and resolving the auditing-utility conflict by decoupling the criteria for effective auditing and separating objectives for utility and auditing. We first propose a unified framework, UniAud, for data-independent auditing that maximizes auditing power through a novel uncorrelated canary construction and a self-comparison framework. We then extend this framework as UniAud++ for data-dependent auditing, optimizing the auditing and utility trade-off through multi-task learning with separate objectives for auditing and training. Experimental results validate that our black-box O(1) framework matches the state-of-the-art auditing results of O(T) auditing with thousands of runs, demonstrating the best efficiency-auditing trade-off across vision and language tasks. Additionally, our framework provides meaningful auditing with only slight utility degradation compared to standard DP training, showing the optimal utility-auditing trade-off and the benefit of requiring no extra training for auditing.

Title: Arbiter PUF: Uniqueness and Reliability Analysis Using Hybrid CMOS-Stanford Memristor Model

Authors: Tanvir Rahman, A.B.M. Harun-ur Rashid
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04461
Pdf URL: https://arxiv.org/pdf/2507.04461
Copy Paste: [[2507.04461]] Arbiter PUF: Uniqueness and Reliability Analysis Using Hybrid CMOS-Stanford Memristor Model(https://arxiv.org/abs/2507.04461)
Keywords: secure, security, protect, attack, extraction
Abstract: In an increasingly interconnected world, protecting electronic devices has grown more crucial because of the dangers of data extraction, reverse engineering, and hardware tampering. Producing chips in a third-party manufacturing company can let hackers change the design. As the Internet of Things (IoT) proliferates, physical attacks happen more, and conventional cryptography techniques do not function well. In this paper, we investigate the design and assessment of PUFs using the Stanford Memristor Model, utilizing its random filament evolution to improve security. The system was built using 45nm CMOS technology. A comparison is made between CMOS-based and memristor-based Arbiter PUFs, evaluating their performance under temperature, voltage, and process variations. Intra- and inter-hamming distances are employed by Monte Carlo simulations to estimate uniqueness and reliability. The results show that memristor-based PUFs offer better reliability than CMOS-based designs, though uniqueness needs further improvement. Furthermore, this study sheds light on the reasonableness of memristor-based PUFs for secure applications in hardware security.

Title: Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models

Authors: Sathesh P.Sivashanmugam
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.04478
Pdf URL: https://arxiv.org/pdf/2507.04478
Copy Paste: [[2507.04478]] Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models(https://arxiv.org/abs/2507.04478)
Keywords: privacy, defense, attack, robust, extraction, large language model
Abstract: Large language models (LLMs) have transformed natural language processing, but their ability to memorize training data poses significant privacy risks. This paper investigates model inversion attacks on the Llama 3.2 model, a multilingual LLM developed by Meta. By querying the model with carefully crafted prompts, we demonstrate the extraction of personally identifiable information (PII) such as passwords, email addresses, and account numbers. Our findings highlight the vulnerability of even smaller LLMs to privacy attacks and underscore the need for robust defenses. We discuss potential mitigation strategies, including differential privacy and data sanitization, and call for further research into privacy-preserving machine learning techniques.

Title: Source Attribution in Retrieval-Augmented Generation

Authors: Ikhtiyor Nematov, Tarik Kalai, Elizaveta Kuzmenko, Gabriele Fugagnoli, Dimitris Sacharidis, Katja Hose, Tomer Sagi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04480
Pdf URL: https://arxiv.org/pdf/2507.04480
Copy Paste: [[2507.04480]] Source Attribution in Retrieval-Augmented Generation(https://arxiv.org/abs/2507.04480)
Keywords: explainability, large language model
Abstract: While attribution methods, such as Shapley values, are widely used to explain the importance of features or training data in traditional machine learning, their application to Large Language Models (LLMs), particularly within Retrieval-Augmented Generation (RAG) systems, is nascent and challenging. The primary obstacle is the substantial computational cost, where each utility function evaluation involves an expensive LLM call, resulting in direct monetary and time expenses. This paper investigates the feasibility and effectiveness of adapting Shapley-based attribution to identify influential retrieved documents in RAG. We compare Shapley with more computationally tractable approximations and some existing attribution methods for LLM. Our work aims to: (1) systematically apply established attribution principles to the RAG document-level setting; (2) quantify how well SHAP approximations can mirror exact attributions while minimizing costly LLM interactions; and (3) evaluate their practical explainability in identifying critical documents, especially under complex inter-document relationships such as redundancy, complementarity, and synergy. This study seeks to bridge the gap between powerful attribution techniques and the practical constraints of LLM-based RAG systems, offering insights into achieving reliable and affordable RAG explainability.

Title: Dealing with Uncertainty in Contextual Anomaly Detection

Authors: Luca Bindini, Lorenzo Perini, Stefano Nistri, Jesse Davis, Paolo Frasconi
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.04490
Pdf URL: https://arxiv.org/pdf/2507.04490
Copy Paste: [[2507.04490]] Dealing with Uncertainty in Contextual Anomaly Detection(https://arxiv.org/abs/2507.04490)
Keywords: interpretability
Abstract: Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In many anomaly detection tasks, there exist contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.

Title: Machine Learning-Based Prediction of Metal-Organic Framework Materials: A Comparative Analysis of Multiple Models

Authors: Zhuo Zheng, Keyan Liu, Xiyuan Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04493
Pdf URL: https://arxiv.org/pdf/2507.04493
Copy Paste: [[2507.04493]] Machine Learning-Based Prediction of Metal-Organic Framework Materials: A Comparative Analysis of Multiple Models(https://arxiv.org/abs/2507.04493)
Keywords: robust
Abstract: Metal-organic frameworks (MOFs) have emerged as promising materials for various applications due to their unique structural properties and versatile functionalities. This study presents a comprehensive investigation of machine learning approaches for predicting MOF material properties. We employed five different machine learning models: Random Forest, XGBoost, LightGBM, Support Vector Machine, and Neural Network, to analyze and predict MOF characteristics using a dataset from the Kaggle platform. The models were evaluated using multiple performance metrics, including RMSE, R^2, MAE, and cross-validation scores. Results demonstrated that the Random Forest model achieved superior performance with an R^2 value of 0.891 and RMSE of 0.152, significantly outperforming other models. LightGBM showed remarkable computational efficiency, completing training in 25.7 seconds while maintaining high accuracy. Our comparative analysis revealed that ensemble learning methods generally exhibited better performance than traditional single models in MOF property prediction. This research provides valuable insights into the application of machine learning in materials science and establishes a robust framework for future MOF material design and property prediction.

Title: README: Robust Error-Aware Digital Signature Framework via Deep Watermarking Model

Authors: Hyunwook Choi, Sangyun Won, Daeyeon Hwang, Junhyeok Choi
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04495
Pdf URL: https://arxiv.org/pdf/2507.04495
Copy Paste: [[2507.04495]] README: Robust Error-Aware Digital Signature Framework via Deep Watermarking Model(https://arxiv.org/abs/2507.04495)
Keywords: security, protect, robust, watermark
Abstract: Deep learning-based watermarking has emerged as a promising solution for robust image authentication and protection. However, existing models are limited by low embedding capacity and vulnerability to bit-level errors, making them unsuitable for cryptographic applications such as digital signatures, which require over 2048 bits of error-free data. In this paper, we propose README (Robust Error-Aware Digital Signature via Deep WaterMarking ModEl), a novel framework that enables robust, verifiable, and error-tolerant digital signatures within images. Our method combines a simple yet effective cropping-based capacity scaling mechanism with ERPA (ERror PAinting Module), a lightweight error correction module designed to localize and correct bit errors using Distinct Circular Subsum Sequences (DCSS). Without requiring any fine-tuning of existing pretrained watermarking models, README significantly boosts the zero-bit-error image rate (Z.B.I.R) from 1.2% to 86.3% when embedding 2048-bit digital signatures into a single image, even under real-world distortions. Moreover, our use of perceptual hash-based signature verification ensures public verifiability and robustness against tampering. The proposed framework unlocks a new class of high-assurance applications for deep watermarking, bridging the gap between signal-level watermarking and cryptographic security.

Title: LINE: Public-key encryption

Authors: Gennady Khalimov, Yevgen Kotukh
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2507.04501
Pdf URL: https://arxiv.org/pdf/2507.04501
Copy Paste: [[2507.04501]] LINE: Public-key encryption(https://arxiv.org/abs/2507.04501)
Keywords: security
Abstract: We propose a public key encryption cryptosystem based on solutions of linear equation systems with predefinition of input parameters through shared secret computation for factorizable substitutions. The existence of multiple equivalent solutions for an underdetermined system of linear equations determines the impossibility of its resolution by a cryptanalyst in polynomial time. The completion of input parameters of the equation system is implemented through secret homomorphic matrix transformation for substitutions factorized over the basis of a vector space of dimension m over the field F2. Encryption is implemented through computation of substitutions that are one-way functions on an elementary abelian 2-group of order 2"m. Decryption is implemented through completion of input parameters of the equation system. Homomorphic transformations are constructed based on matrix computations. Matrix computations enable the implementation of high security and low computational overhead for homomorphic transformations.

Title: U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Authors: Xiaofan Li, Zhihao Xu, Chenming Wu, Zhao Yang, Yumeng Zhang, Jiang-Jiang Liu, Haibao Yu, Fan Duan, Xiaoqing Ye, Yuan Wang, Shirui Li, Xun Sun, Ji Wan, Jun Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04503
Pdf URL: https://arxiv.org/pdf/2507.04503
Copy Paste: [[2507.04503]] U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration(https://arxiv.org/abs/2507.04503)
Keywords: robust
Abstract: Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.

Title: Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04504
Pdf URL: https://arxiv.org/pdf/2507.04504
Copy Paste: [[2507.04504]] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation(https://arxiv.org/abs/2507.04504)
Keywords: diffusion, large language model
Abstract: Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65\% increase in structural adherence, 48\% enhancement in content fidelity, and 17\% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.

Title: MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

Authors: Zhendong Xiao, Wu Wei, Shujie Ji, Shan Yang, Changhao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04509
Pdf URL: https://arxiv.org/pdf/2507.04509
Copy Paste: [[2507.04509]] MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization(https://arxiv.org/abs/2507.04509)
Keywords: robust
Abstract: Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.

Title: DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging

Authors: Neha Verma, Kenton Murray, Kevin Duh
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04517
Pdf URL: https://arxiv.org/pdf/2507.04517
Copy Paste: [[2507.04517]] DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging(https://arxiv.org/abs/2507.04517)
Keywords: transformer, large language model
Abstract: Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.

Title: A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording

Authors: Philipp Reis, Joshua Ransiek, David Petri, Jacob Langner, Eric Sax
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04529
Pdf URL: https://arxiv.org/pdf/2507.04529
Copy Paste: [[2507.04529]] A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording(https://arxiv.org/abs/2507.04529)
Keywords: robust
Abstract: High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. The core issue is the curse of rarity. Over time, novel events occur infrequently, and standard logging methods fail to capture them effectively. As a result, large volumes of redundant data are stored, while critical novel cases are diluted, leading to biased datasets. This work presents a real-time data selection method focused on object-level novelty detection to build more balanced and diverse datasets. The method assigns a data-driven novelty score to image frames using a novel dynamic Mean Shift algorithm. It models normal content based on mean and covariance statistics to identify frames with novel objects, discarding those with redundant elements. The main findings show that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. While random sampling can offer some gains, it often leads to overfitting and unpredictability in outcomes. The proposed method supports real-time deployment with 32 frames per second and is constant over time. By continuously updating the definition of normal content, it enables efficient detection of novelties in a continuous data stream.

Title: DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Authors: Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04531
Pdf URL: https://arxiv.org/pdf/2507.04531
Copy Paste: [[2507.04531]] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models(https://arxiv.org/abs/2507.04531)
Keywords: privacy, defense, large language model
Abstract: Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.

Title: MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Authors: Dawit Mureja Argaw, Xian Liu, Joon Son Chung, Ming-Yu Liu, Fitsum Reda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04559
Pdf URL: https://arxiv.org/pdf/2507.04559
Copy Paste: [[2507.04559]] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization(https://arxiv.org/abs/2507.04559)
Keywords: robust, transformer, generative
Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequencebased tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolutionbased and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

Title: Evaluating LLMs on Real-World Forecasting Against Human Superforecasters

Authors: Janna Lu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04562
Pdf URL: https://arxiv.org/pdf/2507.04562
Copy Paste: [[2507.04562]] Evaluating LLMs on Real-World Forecasting Against Human Superforecasters(https://arxiv.org/abs/2507.04562)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggle to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.

Title: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04569
Pdf URL: https://arxiv.org/pdf/2507.04569
Copy Paste: [[2507.04569]] Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts(https://arxiv.org/abs/2507.04569)
Keywords: generative
Abstract: We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts, into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.

Title: S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Authors: Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, Igor Gilitschenski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04584
Pdf URL: https://arxiv.org/pdf/2507.04584
Copy Paste: [[2507.04584]] S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control(https://arxiv.org/abs/2507.04584)
Keywords: diffusion
Abstract: Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity with semantically disentangled and spatially focused identity token learned. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.

Title: CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

Authors: Hanzhi Zhong, Zhiyu Xiang, Ruoyu Xu, Jingyun Fu, Peng Xu, Shaohong Wang, Zhihao Yang, Tianyu Pu, Eryun Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04587
Pdf URL: https://arxiv.org/pdf/2507.04587
Copy Paste: [[2507.04587]] CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection(https://arxiv.org/abs/2507.04587)
Keywords: robust
Abstract: 4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weathers. Due to the sparse points and noisy measurements of the 4D radar, most of the research finish the 3D object detection task by integrating images from camera and perform modality fusion in BEV space. However, the potential of the radar and the fusion mechanism is still largely unexplored, hindering the performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views including points, image, and BEV for each proposal. These comprehensive instance level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10% and 3.68% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be made publicly available.

Title: Photon Splatting: A Physics-Guided Neural Surrogate for Real-Time Wireless Channel Prediction

Authors: Ge Cao, Gabriele Gradoni, Zhen Peng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04595
Pdf URL: https://arxiv.org/pdf/2507.04595
Copy Paste: [[2507.04595]] Photon Splatting: A Physics-Guided Neural Surrogate for Real-Time Wireless Channel Prediction(https://arxiv.org/abs/2507.04595)
Keywords: interpretability
Abstract: We present Photon Splatting, a physics-guided neural surrogate model for real-time wireless channel prediction in complex environments. The proposed framework introduces surface-attached virtual sources, referred to as photons, which carry directional wave signatures informed by the scene geometry and transmitter configuration. At runtime, channel impulse responses (CIRs) are predicted by splatting these photons onto the angular domain of the receiver using a geodesic rasterizer. The model is trained to learn a physically grounded representation that maps transmitter-receiver configurations to full channel responses. Once trained, it generalizes to new transmitter positions, antenna beam patterns, and mobile receivers without requiring model retraining. We demonstrate the effectiveness of the framework through a series of experiments, from canonical 3D scenes to a complex indoor cafe with 1,000 receivers. Results show 30 millisecond-level inference latency and accurate CIR predictions across a wide range of configurations. The approach supports real-time adaptability and interpretability, making it a promising candidate for wireless digital twin platforms and future 6G network planning.

Title: QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Authors: Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04599
Pdf URL: https://arxiv.org/pdf/2507.04599
Copy Paste: [[2507.04599]] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation(https://arxiv.org/abs/2507.04599)
Keywords: generative
Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.

Title: PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Authors: Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04607
Pdf URL: https://arxiv.org/pdf/2507.04607
Copy Paste: [[2507.04607]] PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes(https://arxiv.org/abs/2507.04607)
Keywords: large language model
Abstract: Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME's effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.

Title: any4: Learned 4-bit Numeric Representation for LLMs

Authors: Mostafa Elhoushi, Jeff Johnson
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04610
Pdf URL: https://arxiv.org/pdf/2507.04610
Copy Paste: [[2507.04610]] any4: Learned 4-bit Numeric Representation for LLMs(https://arxiv.org/abs/2507.04610)
Keywords: large language model
Abstract: We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at this https URL .

Title: Information-Guided Diffusion Sampling for Dataset Distillation

Authors: Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
Subjects: cs.LG, cs.AI, cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2507.04619
Pdf URL: https://arxiv.org/pdf/2507.04619
Copy Paste: [[2507.04619]] Information-Guided Diffusion Sampling for Dataset Distillation(https://arxiv.org/abs/2507.04619)
Keywords: diffusion
Abstract: Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.

Title: Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences

Authors: Yusong Zhang, Yuxuan Sun, Lei Guo, Wei Chen, Bo Ai, Deniz Gunduz
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2507.04621
Pdf URL: https://arxiv.org/pdf/2507.04621
Copy Paste: [[2507.04621]] Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences(https://arxiv.org/abs/2507.04621)
Keywords: diffusion, generative, large language model
Abstract: 6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real-time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.

Title: Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Authors: Swayamjit Saha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04625
Pdf URL: https://arxiv.org/pdf/2507.04625
Copy Paste: [[2507.04625]] Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs(https://arxiv.org/abs/2507.04625)
Keywords: large language model
Abstract: Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.

Title: Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Authors: Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04631
Pdf URL: https://arxiv.org/pdf/2507.04631
Copy Paste: [[2507.04631]] Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts(https://arxiv.org/abs/2507.04631)
Keywords: robust, extraction
Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at \textcolor{red}{this https URL}.

Title: LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction

Authors: Yixin Yan, Yang Li, Yuanfan Wang, Xiaozhou Zhou, Beihao Xia, Manjiang Hu, Hongmao Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04634
Pdf URL: https://arxiv.org/pdf/2507.04634
Copy Paste: [[2507.04634]] LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction(https://arxiv.org/abs/2507.04634)
Keywords: transformer
Abstract: It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.

Title: MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Authors: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04635
Pdf URL: https://arxiv.org/pdf/2507.04635
Copy Paste: [[2507.04635]] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding(https://arxiv.org/abs/2507.04635)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available in this https URL.

Title: Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

Authors: Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04636
Pdf URL: https://arxiv.org/pdf/2507.04636
Copy Paste: [[2507.04636]] Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework(https://arxiv.org/abs/2507.04636)
Keywords: privacy
Abstract: In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across \textbf{8.4 million daily active devices}.

Title: UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

Authors: Xixi Wan, Aihua Zheng, Bo Jiang, Beibei Wang, Chenglong Li, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04638
Pdf URL: https://arxiv.org/pdf/2507.04638
Copy Paste: [[2507.04638]] UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification(https://arxiv.org/abs/2507.04638)
Keywords: robust
Abstract: Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. Existing methods primarily aim to improve identification performance, but often overlook the uncertainty arising from inherent defects, such as intra-modal noise and inter-modal conflicts. This uncertainty is particularly significant in the case of fine-grained local occlusion and frame loss, which becomes a challenge in multi-modal learning. To address the above challenge, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code will be made public upon acceptance.

Title: R1-RE: Cross-Domain Relationship Extraction with RLVR

Authors: Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04642
Pdf URL: https://arxiv.org/pdf/2507.04642
Copy Paste: [[2507.04642]] R1-RE: Cross-Domain Relationship Extraction with RLVR(https://arxiv.org/abs/2507.04642)
Keywords: robust, extraction
Abstract: Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels-an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.

Title: VectorLLM: Human-like Extraction of Structured Building Contours vis Multimodal LLMs

Authors: Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo, Shunping Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04664
Pdf URL: https://arxiv.org/pdf/2507.04664
Copy Paste: [[2507.04664]] VectorLLM: Human-like Extraction of Structured Building Contours vis Multimodal LLMs(https://arxiv.org/abs/2507.04664)
Keywords: extraction, large language model, segmentation
Abstract: Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperformed the previous SOTA methods by 5.6 AP, 7.1 AP, 13.6 AP, respectively in the three datasets. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All the codes and weights will be published for promoting community development.

Title: Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction

Authors: Suiyan Shang, Chi Fai Cheung, Pai Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04665
Pdf URL: https://arxiv.org/pdf/2507.04665
Copy Paste: [[2507.04665]] Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction(https://arxiv.org/abs/2507.04665)
Keywords: transformer, generative
Abstract: Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving >0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction."

Title: What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Authors: Hahyeon Choi, Junhoo Lee, Nojun Kwak
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04667
Pdf URL: https://arxiv.org/pdf/2507.04667
Copy Paste: [[2507.04667]] What's Making That Sound Right Now? Video-centric Audio-Visual Localization(https://arxiv.org/abs/2507.04667)
Keywords: robust
Abstract: Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

Title: DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation

Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04671
Pdf URL: https://arxiv.org/pdf/2507.04671
Copy Paste: [[2507.04671]] DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation(https://arxiv.org/abs/2507.04671)
Keywords: robust
Abstract: Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE's effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at this https URL.

Title: ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Authors: Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Zhuo Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04678
Pdf URL: https://arxiv.org/pdf/2507.04678
Copy Paste: [[2507.04678]] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing(https://arxiv.org/abs/2507.04678)
Keywords: diffusion, generative
Abstract: Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land managementChangeBridge: Spatiotemporal Image Generation with Multimodal Controls, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to modeling the noise-to-image diffusion model, as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.

Title: Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge

Authors: Alper Bahcekapili, Duygu Arslan, Umut Ozdemir, Berkay Ozkirli, Emre Akbas, Ahmet Acar, Gozde B. Akar, Bingdou He, Shuoyu Xu, Umit Mert Caglar, Alptekin Temizel, Guillaume Picaud, Marc Chaumont, Gérard Subsol, Luc Téot, Fahad Alsharekh, Shahad Alghannam, Hexiang Mao, Wenhua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04681
Pdf URL: https://arxiv.org/pdf/2507.04681
Copy Paste: [[2507.04681]] Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge(https://arxiv.org/abs/2507.04681)
Keywords: transformer, segmentation
Abstract: Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, dataset, and the top-performing methods

Title: TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Authors: Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04685
Pdf URL: https://arxiv.org/pdf/2507.04685
Copy Paste: [[2507.04685]] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation(https://arxiv.org/abs/2507.04685)
Keywords: diffusion
Abstract: Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at this https URL.

Title: Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

Authors: Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04692
Pdf URL: https://arxiv.org/pdf/2507.04692
Copy Paste: [[2507.04692]] Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal(https://arxiv.org/abs/2507.04692)
Keywords: robust, extraction, diffusion, generative
Abstract: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure map including facial details while excluding the unwanted shadow boundaries. The structure map is then used as condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective to avoid previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at this https URL.

Title: Interpretable Reward Modeling with Active Concept Bottlenecks

Authors: Sonia Laguna, Katarzyna Kobalczyk, Julia E. Vogt, Mihaela Van der Schaar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04695
Pdf URL: https://arxiv.org/pdf/2507.04695
Copy Paste: [[2507.04695]] Interpretable Reward Modeling with Active Concept Bottlenecks(https://arxiv.org/abs/2507.04695)
Keywords: interpretability
Abstract: We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.

Title: Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation

Authors: Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri
Subjects: cs.LG, cs.DC, cs.MS
Abstract URL: https://arxiv.org/abs/2507.04697
Pdf URL: https://arxiv.org/pdf/2507.04697
Copy Paste: [[2507.04697]] Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation(https://arxiv.org/abs/2507.04697)
Keywords: transformer, generative, large language model
Abstract: Generative AI technology based on Large Language Models (LLM) has been developed and applied to assist or automatically generate program codes. In this paper, we evaluate the capability of existing general LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of Reasoning models. Both have been released in April 2025. For the routines from level-1 to 3 BLAS, we tried to generate (1) C code without optimization from routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only routine name are given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the code is faster than the reference code.

Title: A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04699
Pdf URL: https://arxiv.org/pdf/2507.04699
Copy Paste: [[2507.04699]] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets(https://arxiv.org/abs/2507.04699)
Keywords: diffusion, large language model
Abstract: Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.

Title: XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

Authors: Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04701
Pdf URL: https://arxiv.org/pdf/2507.04701
Copy Paste: [[2507.04701]] XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL(https://arxiv.org/abs/2507.04701)
Keywords: robust
Abstract: To leverage the advantages of LLM in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple highquality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.

Title: Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning

Authors: Feng Yue, Zhaoxing Zhang, Junming Jiao, Zhengyu Liang, Shiwen Cao, Feifei Zhang, Rong Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04702
Pdf URL: https://arxiv.org/pdf/2507.04702
Copy Paste: [[2507.04702]] Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning(https://arxiv.org/abs/2507.04702)
Keywords: large language model
Abstract: Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments from video based on language query, has always been a highly challenging task in the field of video understanding. Videos often have a larger volume of information and redundancy than texts or images. Models should present comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We thus propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ Self-adaptive Attention Allocation (SAA) method based on frame content variation to efficiently use the MLLM's limited attention. The Explicit Timestamp-modal Aligned (ETA) method is also utilized to strengthen our model's capability to perceive the boundaries of events in the video. In the fine-tuning part of our pipeline, we creatively apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) in TVG area to foster model's temporal reasoning from not only accepting relevant video-query pairs but also refusing irrelevant ones. Experiments demonstrate that our method accomplishes a notable advantage over SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground truth annotations.

Title: Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04705
Pdf URL: https://arxiv.org/pdf/2507.04705
Copy Paste: [[2507.04705]] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations(https://arxiv.org/abs/2507.04705)
Keywords: secure, robust
Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in 2025 ACM MultiMedia Challenge.

Title: Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04708
Pdf URL: https://arxiv.org/pdf/2507.04708
Copy Paste: [[2507.04708]] Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce(https://arxiv.org/abs/2507.04708)
Keywords: extraction, large language model
Abstract: Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews - a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik's theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques, across e-commerce domains.

Title: Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication

Authors: Samuel Pfrommer, George Ma, Yixiao Huang, Somayeh Sojoudi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04709
Pdf URL: https://arxiv.org/pdf/2507.04709
Copy Paste: [[2507.04709]] Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication(https://arxiv.org/abs/2507.04709)
Keywords: diffusion
Abstract: This work shows that normalization layers can facilitate a surprising degree of communication across the spatial dimensions of an input tensor. We study a toy localization task with a convolutional architecture and show that normalization layers enable an iterative message passing procedure, allowing information aggregation from well outside the local receptive field. Our results suggest that normalization layers should be employed with caution in applications such as diffusion-based trajectory generation, where maintaining a spatially limited receptive field is crucial.

Title: Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model

Authors: Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan, Yanqi Yang, Xiaomeng Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04710
Pdf URL: https://arxiv.org/pdf/2507.04710
Copy Paste: [[2507.04710]] Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model(https://arxiv.org/abs/2507.04710)
Keywords: robust
Abstract: Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold-a standard widely recognized in dental diagnostics. Code is available at: this https URL.

Title: LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Authors: Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04723
Pdf URL: https://arxiv.org/pdf/2507.04723
Copy Paste: [[2507.04723]] LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework(https://arxiv.org/abs/2507.04723)
Keywords: large language model
Abstract: Long-context processing has become a fundamental capability for large language models~(LLMs). To assess model's long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: this https URL

Title: Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet

Authors: Raz Lapid, Almog Dubin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04726
Pdf URL: https://arxiv.org/pdf/2507.04726
Copy Paste: [[2507.04726]] Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet(https://arxiv.org/abs/2507.04726)
Keywords: defense, attack, robust, steal, diffusion
Abstract: Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets -- and the increasing use of community-shared data for fine-tuning -- exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. By injecting poisoned samples -- each pairing a subtly triggered input with an NSFW target -- the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNets pipelines and underscore the need for robust data sanitization and defense mechanisms.

Title: "This Suits You the Best": Query Focused Comparative Explainable Summarization

Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.04733
Pdf URL: https://arxiv.org/pdf/2507.04733
Copy Paste: [[2507.04733]] "This Suits You the Best": Query Focused Comparative Explainable Summarization(https://arxiv.org/abs/2507.04733)
Keywords: privacy, large language model
Abstract: Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. M-OS as an intermediate step reduces inference latency approximately by 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.

Title: An analysis of vision-language models for fabric retrieval

Authors: Francesco Giuliari, Asif Khan Pattan, Mohamed Lamine Mekhalfi, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04735
Pdf URL: https://arxiv.org/pdf/2507.04735
Copy Paste: [[2507.04735]] An analysis of vision-language models for fabric retrieval(https://arxiv.org/abs/2507.04735)
Keywords: robust, large language model
Abstract: Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision Language Models(VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three Vision-Language Models: CLIP, LAION-CLIP, and Meta's Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly enhance retrieval accuracy, particularly for visually complex fabric classes, with the Perception Encoder outperforming other models due to its robust feature alignment capabilities. However, zero-shot retrieval remains challenging in this fine-grained domain, underscoring the need for domain-adapted approaches. Our findings highlight the importance of combining technical textual descriptions with advanced VLMs to optimize cross-modal retrieval in industrial applications.

Title: MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images

Authors: Chengyu Wang, Isabella Bennett, Henry Scott, Liang Zhang, Mei Chen, Hao Li, Rui Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04749
Pdf URL: https://arxiv.org/pdf/2507.04749
Copy Paste: [[2507.04749]] MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images(https://arxiv.org/abs/2507.04749)
Keywords: robust
Abstract: We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.

Title: MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry

Authors: Zicheng Lin (International School, Beijing University of Posts and Telecommunications), Xiaoqiang Li (College of Engineering, Peking University), Yichao Wang (College of Physics and Optoelectronic Engineering, Harbin Engineering University), Chuan Zhu (School of Artificial Intelligence, Beijing University of Posts and Telecommunications)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04750
Pdf URL: https://arxiv.org/pdf/2507.04750
Copy Paste: [[2507.04750]] MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry(https://arxiv.org/abs/2507.04750)
Keywords: fair
Abstract: Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV's sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.

Title: LLMs as Architects and Critics for Multi-Source Opinion Summarization

Authors: Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04751
Pdf URL: https://arxiv.org/pdf/2507.04751
Copy Paste: [[2507.04751]] LLMs as Architects and Critics for Multi-Source Opinion Summarization(https://arxiv.org/abs/2507.04751)
Keywords: large language model
Abstract: Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of \r{ho} = 0.74, which surpasses the performance of previous methodologies.

Title: Large Language Models for Network Intrusion Detection Systems: Foundations, Implementations, and Future Directions

Authors: Shuo Yang, Xinran Zheng, Xinchen Zhang, Jinfeng Xu, Jinze Li, Donglin Xie, Weicai Long, Edith C.H. Ngai
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2507.04752
Pdf URL: https://arxiv.org/pdf/2507.04752
Copy Paste: [[2507.04752]] Large Language Models for Network Intrusion Detection Systems: Foundations, Implementations, and Future Directions(https://arxiv.org/abs/2507.04752)
Keywords: security, explainability, large language model
Abstract: Large Language Models (LLMs) have revolutionized various fields with their exceptional capabilities in understanding, processing, and generating human-like text. This paper investigates the potential of LLMs in advancing Network Intrusion Detection Systems (NIDS), analyzing current challenges, methodologies, and future opportunities. It begins by establishing a foundational understanding of NIDS and LLMs, exploring the enabling technologies that bridge the gap between intelligent and cognitive systems in AI-driven NIDS. While Intelligent NIDS leverage machine learning and deep learning to detect threats based on learned patterns, they often lack contextual awareness and explainability. In contrast, Cognitive NIDS integrate LLMs to process both structured and unstructured security data, enabling deeper contextual reasoning, explainable decision-making, and automated response for intrusion behaviors. Practical implementations are then detailed, highlighting LLMs as processors, detectors, and explainers within a comprehensive AI-driven NIDS pipeline. Furthermore, the concept of an LLM-centered Controller is proposed, emphasizing its potential to coordinate intrusion detection workflows, optimizing tool collaboration and system performance. Finally, this paper identifies critical challenges and opportunities, aiming to foster innovation in developing reliable, adaptive, and explainable NIDS. By presenting the transformative potential of LLMs, this paper seeks to inspire advancement in next-generation network security systems.

Title: CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Authors: Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04756
Pdf URL: https://arxiv.org/pdf/2507.04756
Copy Paste: [[2507.04756]] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering(https://arxiv.org/abs/2507.04756)
Keywords: privacy
Abstract: Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.

Title: Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking

Authors: Maria Damanaki, Ioulia Kapsali, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04762
Pdf URL: https://arxiv.org/pdf/2507.04762
Copy Paste: [[2507.04762]] Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking(https://arxiv.org/abs/2507.04762)
Keywords: defense, attack, robust
Abstract: The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single-agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi-agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least-squares graph on multi-agent adversarial bounding boxes. Specifically, we employ the least-squares graph tool to reduce the induced positional error of each detection's centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi-vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real-world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state-of-the-art single and multi-agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.

Title: GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation

Authors: Weilin Lai, Tie Xu, Hu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04765
Pdf URL: https://arxiv.org/pdf/2507.04765
Copy Paste: [[2507.04765]] GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation(https://arxiv.org/abs/2507.04765)
Keywords: diffusion
Abstract: Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.

Title: ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems

Authors: Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04766
Pdf URL: https://arxiv.org/pdf/2507.04766
Copy Paste: [[2507.04766]] ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems(https://arxiv.org/abs/2507.04766)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs' physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.

Title: From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection

Authors: Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Ying Deng, Zhiqiang Yuan, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04769
Pdf URL: https://arxiv.org/pdf/2507.04769
Copy Paste: [[2507.04769]] From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection(https://arxiv.org/abs/2507.04769)
Keywords: protect, large language model
Abstract: Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.

Title: Efficient Unlearning with Privacy Guarantees

Authors: Josep Domingo-Ferrer, Najeeb Jebreel, David Sánchez
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04771
Pdf URL: https://arxiv.org/pdf/2507.04771
Copy Paste: [[2507.04771]] Efficient Unlearning with Privacy Guarantees(https://arxiv.org/abs/2507.04771)
Keywords: privacy, protect
Abstract: Privacy protection laws, such as the GDPR, grant individuals the right to request the forgetting of their personal data not only from databases but also from machine learning (ML) models trained on them. Machine unlearning has emerged as a practical means to facilitate model forgetting of data instances seen during training. Although some existing machine unlearning methods guarantee exact forgetting, they are typically costly in computational terms. On the other hand, more affordable methods do not offer forgetting guarantees and are applicable only to specific ML models. In this paper, we present \emph{efficient unlearning with privacy guarantees} (EUPG), a novel machine unlearning framework that offers formal privacy guarantees to individuals whose data are being unlearned. EUPG involves pre-training ML models on data protected using privacy models, and it enables {\em efficient unlearning with the privacy guarantees offered by the privacy models in use}. Through empirical evaluation on four heterogeneous data sets protected with $k$-anonymity and $\epsilon$-differential privacy as privacy models, our approach demonstrates utility and forgetting effectiveness comparable to those of exact unlearning methods, while significantly reducing computational and storage costs. Our code is available at this https URL.

Title: FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs

Authors: Carlos Agulló-Domingo (1), Óscar Vera-López (1), Seyda Guzelhan (2), Lohit Daksha (2), Aymane El Jerari (3), Kaustubh Shivdikar (4), Rashmi Agrawal (2), David Kaeli (3), Ajay Joshi (2), José L. Abellán (1) ((1) Universidad de Murcia, (2) Boston University, (3) Northeastern University, (4) Advanced Micro Devices)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04775
Pdf URL: https://arxiv.org/pdf/2507.04775
Copy Paste: [[2507.04775]] FIDESlib: A Fully-Fledged Open-Source FHE Library for Efficient CKKS on GPUs(https://arxiv.org/abs/2507.04775)
Keywords: privacy, robust
Abstract: Word-wise Fully Homomorphic Encryption (FHE) schemes, such as CKKS, are gaining significant traction due to their ability to provide post-quantum-resistant, privacy-preserving approximate computing; an especially desirable feature in Machine-Learning-as-a-Service (MLaaS) cloud-computing paradigms. OpenFHE is a leading CPU-based FHE library with robust CKKS operations, but its server-side performance is not yet sufficient for practical cloud deployment. As GPU computing becomes more common in data centers, many FHE libraries are adding GPU support. However, integrating an efficient GPU backend into OpenFHE is challenging. While OpenFHE uses a Hardware Abstraction Layer (HAL), its flexible architecture sacrifices performance due to the abstraction layers required for multi-scheme and multi-backend compatibility. In this work, we introduce FIDESlib, the first open-source server-side CKKS GPU library that is fully interoperable with well-established client-side OpenFHE operations. Unlike other existing open-source GPU libraries, FIDESlib provides the first implementation featuring heavily optimized GPU kernels for all CKKS primitives, including bootstrapping. Our library also integrates robust benchmarking and testing, ensuring it remains adaptable to further optimization. Furthermore, its software architecture is designed to support extensions to a multi-GPU backend for enhanced acceleration. Our experiments across various GPU systems and the leading open-source CKKS library to date, Phantom, show that FIDESlib offers superior performance and scalability. For bootstrapping, FIDESlib achieves no less than 70x speedup over the AVX-optimized OpenFHE implementation.

Title: FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift

Authors: Yong Zhang, Feng Liang, Guanghu Yuan, Min Yang, Chengming Li, Xiping Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04781
Pdf URL: https://arxiv.org/pdf/2507.04781
Copy Paste: [[2507.04781]] FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift(https://arxiv.org/abs/2507.04781)
Keywords: privacy, extraction, federate
Abstract: Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model, thereby affecting personalized local models. Among various cases of data heterogeneity, feature drift, feature space difference among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can distract feature extraction learning in clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall's consistently superior performance in classification with feature-drifted data in the FL scenario.

Title: Reason to Rote: Rethinking Memorization in Reasoning

Authors: Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04782
Pdf URL: https://arxiv.org/pdf/2507.04782
Copy Paste: [[2507.04782]] Reason to Rote: Rethinking Memorization in Reasoning(https://arxiv.org/abs/2507.04782)
Keywords: large language model
Abstract: Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding lights on the intriguing phenomenon of benign memorization.

Title: SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

Authors: Mengwei Xie, Shuang Zeng, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Xing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04822
Pdf URL: https://arxiv.org/pdf/2507.04822
Copy Paste: [[2507.04822]] SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions(https://arxiv.org/abs/2507.04822)
Keywords: transformer
Abstract: Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures-such as loops and bidirectional lanes-prevalent in real-world road structure. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph $G=(V,E)$, with intersections ($V$) and centerlines ($E$), SeqGrowGraph incrementally constructs this graph by introducing one vertex at a time. At each step, an adjacency matrix ($A$) expands from $n \times n$ to $(n+1) \times (n+1)$ to encode connectivity, while a geometric matrix ($M$) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.

Title: Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

Authors: Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04832
Pdf URL: https://arxiv.org/pdf/2507.04832
Copy Paste: [[2507.04832]] Discrete Diffusion Trajectory Alignment via Stepwise Decomposition(https://arxiv.org/abs/2507.04832)
Keywords: diffusion
Abstract: Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through a principled diffusion trajectory alignment. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 80.7 on LLaDA-8B-Instruct for language modeling.

Title: RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Authors: Johannes Künzel, Anna Hilsmann, Peter Eisert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04839
Pdf URL: https://arxiv.org/pdf/2507.04839
Copy Paste: [[2507.04839]] RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction(https://arxiv.org/abs/2507.04839)
Keywords: robust, extraction
Abstract: We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at this https URL.

Title: Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Authors: Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04841
Pdf URL: https://arxiv.org/pdf/2507.04841
Copy Paste: [[2507.04841]] Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems(https://arxiv.org/abs/2507.04841)
Keywords: large language model
Abstract: Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.

Title: Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing

Authors: Colin Laganier, Liam Fletcher, Elim Kwan, Richard Walters, Victoria Nockles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04842
Pdf URL: https://arxiv.org/pdf/2507.04842
Copy Paste: [[2507.04842]] Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing(https://arxiv.org/abs/2507.04842)
Keywords: security
Abstract: Rapid analysis of satellite data is vital for many remote sensing applications, from disaster response to environmental monitoring, but is becoming harder to achieve with the increasing volumes of data generated by modern satellites. On-satellite machine learning (ML) offers a potential solution, by reducing latency associated with transmission of these large data volumes to ground stations, but state-of-the-art models are often too large or power-hungry for satellite deployment. Vessel detection using Synthetic Aperture Radar (SAR) is a critical time-sensitive task for maritime security that exemplifies this challenge. SAR vessel detection has previously been demonstrated only by ML models that either are too large for satellite deployment, have not been developed for sufficiently low-power hardware, or have only been developed and tested on small SAR datasets that do not sufficiently represent the real-world task. Here we address this issue by developing and deploying a new efficient and highly performant SAR vessel detection model, using a customised YOLOv8 architecture specifically optimized for FPGA-based processing within common satellite power constraints (<10W). We train and evaluate our model on the largest and most diverse open SAR vessel dataset, xView3-SAR, and deploy it on a Kria KV260 MPSoC. We show that our FPGA-based model has detection and classification performance only ~2% and 3% lower than values from state-of-the-art GPU-based models, despite being two to three orders of magnitude smaller in size. This work demonstrates small yet highly performant ML models for time-critical SAR analysis, paving the way for more autonomous, responsive, and scalable Earth observation systems.

Title: Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Authors: Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04852
Pdf URL: https://arxiv.org/pdf/2507.04852
Copy Paste: [[2507.04852]] Dialogue-Based Multi-Dimensional Relationship Extraction from Novels(https://arxiv.org/abs/2507.04852)
Keywords: extraction, large language model
Abstract: Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model's ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.

Title: $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models

Authors: Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04854
Pdf URL: https://arxiv.org/pdf/2507.04854
Copy Paste: [[2507.04854]] $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models(https://arxiv.org/abs/2507.04854)
Keywords: large language model
Abstract: Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present $\textbf{Grahak-Nyay}$ (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: $\textit{GeneralQA}$ (general consumer law), $\textit{SectoralQA}$ (sector-specific knowledge) and $\textit{SyntheticQA}$ (for RAG evaluation), along with $\textit{NyayChat}$, a dataset of 300 annotated chatbot conversations. We also introduce $\textit{Judgments}$ data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose $\textbf{HAB}$ metrics ($\textbf{Helpfulness, Accuracy, Brevity}$) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay's effectiveness. Code and datasets will be released.

Title: Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling

Authors: Chinmay Prabhakar, Suprosanna Shit, Tamaz Amiranashvili, Hongwei Bran Li, Bjoern Menze
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04856
Pdf URL: https://arxiv.org/pdf/2507.04856
Copy Paste: [[2507.04856]] Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling(https://arxiv.org/abs/2507.04856)
Keywords: diffusion, generative
Abstract: 3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels,neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictior.

Title: NTSFormer: A Self-Teaching Graph Transformer for Multimodal Cold-Start Node Classification

Authors: Jun Hu, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04870
Pdf URL: https://arxiv.org/pdf/2507.04870
Copy Paste: [[2507.04870]] NTSFormer: A Self-Teaching Graph Transformer for Multimodal Cold-Start Node Classification(https://arxiv.org/abs/2507.04870)
Keywords: transformer
Abstract: Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a "student" prediction based only on self-information (i.e., the node's own features), and a "teacher" prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer's capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by a Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experimental results on public datasets show that NTSFormer achieves superior performance on multimodal cold-start node classification tasks.

Title: HGNet: High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention Network for Colorectal Polyp Detection

Authors: Xiaofang Liu, Lingling Sun, Xuqing Zhang, Yuannong Ye, Bin zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04880
Pdf URL: https://arxiv.org/pdf/2507.04880
Copy Paste: [[2507.04880]] HGNet: High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention Network for Colorectal Polyp Detection(https://arxiv.org/abs/2507.04880)
Keywords: interpretability
Abstract: Colorectal cancer (CRC) is closely linked to the malignant transformation of colorectal polyps, making early detection essential. However, current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions. To address these issues, we propose HGNet, which integrates High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Key innovations include: (1) an Efficient Multi-Scale Context Attention (EMCA) module to enhance lesion feature representation and boundary modeling; (2) the deployment of a spatial hypergraph convolution module before the detection head to capture higher-order spatial relationships between nodes; (3) the application of transfer learning to address the scarcity of medical image data; and (4) Eigen Class Activation Map (Eigen-CAM) for decision visualization. Experimental results show that HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, significantly improving small lesion differentiation and clinical interpretability. The source code will be made publicly available upon publication of this paper.

Title: Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning

Authors: Sanyam Vyas, Alberto Caron, Chris Hicks, Pete Burnap, Vasilios Mavroudis
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.04883
Pdf URL: https://arxiv.org/pdf/2507.04883
Copy Paste: [[2507.04883]] Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning(https://arxiv.org/abs/2507.04883)
Keywords: security, defense, attack, robust
Abstract: Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. Existing DRL backdoor research focuses solely on training-time attacks requiring unrealistic access to the training pipeline. In contrast, we reveal critical vulnerabilities across the DRL supply chain where backdoors can be embedded with significantly reduced adversarial privileges. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, nor test data. Empirical and analytical evaluations across six Atari environments show our attacks rival state-of-the-art training-time backdoor attacks while operating under much stricter adversarial constraints. We also demonstrate that InfrectroRL further evades two leading DRL backdoor defenses. These findings challenge the current research focus and highlight the urgent need for robust defenses.

Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Authors: A. Bochkov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04886
Pdf URL: https://arxiv.org/pdf/2507.04886
Copy Paste: [[2507.04886]] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations(https://arxiv.org/abs/2507.04886)
Keywords: interpretability, transformer, large language model
Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

Title: O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine

Authors: Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04895
Pdf URL: https://arxiv.org/pdf/2507.04895
Copy Paste: [[2507.04895]] O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine(https://arxiv.org/abs/2507.04895)
Keywords: defense
Abstract: This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the \texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. -- Ce document présente les travaux réalisés par l'équipe O_FT conjointe à Orange et Ouest-France sur l'adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l'adaptation du modèle \texttt{Mistral-7B-Instruct-v0.3} avec des techniques classiques de poursuite du pré-entraînement et d'affinage sur instructions. L'essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l'évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures de connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux démontrent ainsi la viabilité de l'adaptation à un domaine de modèles relativement petits.

Title: BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning

Authors: Thinh Dao, Dung Thuy Nguyen, Khoa D Doan, Kok-Seng Wong
Subjects: cs.CR, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2507.04903
Pdf URL: https://arxiv.org/pdf/2507.04903
Copy Paste: [[2507.04903]] BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning(https://arxiv.org/abs/2507.04903)
Keywords: security, defense, attack, federate, fair
Abstract: Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at this https URL.

Title: HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

Authors: Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, Xiang Bai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04909
Pdf URL: https://arxiv.org/pdf/2507.04909
Copy Paste: [[2507.04909]] HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding(https://arxiv.org/abs/2507.04909)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address above limitations, we propose a modern HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30min) durations, supporting systematic analysis of models temporal reasoning abilities across diverse contextual lengths.

Title: Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images

Authors: Dibyabha Deb, Ujjwal Verma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04915
Pdf URL: https://arxiv.org/pdf/2507.04915
Copy Paste: [[2507.04915]] Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images(https://arxiv.org/abs/2507.04915)
Keywords: segmentation
Abstract: Identifying regions affected by disasters is a vital step in effectively managing and planning relief and rescue efforts. Unlike the traditional approaches of manually assessing post-disaster damage, analyzing images of Unmanned Aerial Vehicles (UAVs) offers an objective and reliable way to assess the damage. In the past, segmentation techniques have been adopted to identify post-flood damage in UAV aerial images. However, most of these supervised learning approaches rely on manually annotated datasets. Indeed, annotating images is a time-consuming and error-prone task that requires domain expertise. This work focuses on leveraging self-supervised features to accurately identify flooded regions in UAV aerial images. This work proposes two encoder-decoder-based segmentation approaches, which integrate the visual features learned from DINOv2 with the traditional encoder backbone. This study investigates the generalization of self-supervised features for UAV aerial images. Specifically, we evaluate the effectiveness of features from the DINOv2 model, trained on non-aerial images, for segmenting aerial images, noting the distinct perspectives between the two image types. Our results demonstrate that DINOv2's self-supervised pretraining on natural images generates transferable, general-purpose visual features that streamline the development of aerial segmentation workflows. By leveraging these features as a foundation, we significantly reduce reliance on labor-intensive manual annotation processes, enabling high-accuracy segmentation with limited labeled aerial data.

Title: Object-centric Denoising Diffusion Models for Physical Reasoning

Authors: Moritz Lange, Raphael C. Engelhardt, Wolfgang Konen, Andrew Melnik, Laurenz Wiskott
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04920
Pdf URL: https://arxiv.org/pdf/2507.04920
Copy Paste: [[2507.04920]] Object-centric Denoising Diffusion Models for Physical Reasoning(https://arxiv.org/abs/2507.04920)
Keywords: diffusion
Abstract: Reasoning about the trajectories of multiple, interacting objects is integral to physical reasoning tasks in machine learning. This involves conditions imposed on the objects at different time steps, for instance initial states or desired goal states. Existing approaches in physical reasoning generally rely on autoregressive modeling, which can only be conditioned on initial states, but not on later states. In fields such as planning for reinforcement learning, similar challenges are being addressed with denoising diffusion models. In this work, we propose an object-centric denoising diffusion model architecture for physical reasoning that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. We demonstrate how this model can solve tasks with multiple conditions and examine its performance when changing object numbers and trajectory lengths during inference.

Title: RainShift: A Benchmark for Precipitation Downscaling Across Geographies

Authors: Paula Harder, Luca Schmidt, Francis Pelletier, Nicole Ludwig, Matthew Chantry, Christian Lessig, Alex Hernandez-Garcia, David Rolnick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04930
Pdf URL: https://arxiv.org/pdf/2507.04930
Copy Paste: [[2507.04930]] RainShift: A Benchmark for Precipitation Downscaling Across Geographies(https://arxiv.org/abs/2507.04930)
Keywords: diffusion
Abstract: Earth System Models (ESM) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk-assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area-demanding high-resolution observational data, which is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.

Title: LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks

Authors: Ruoxi Wang, Kun Li, Minghui Xu, Yue Zhang, Kaidi Xu, Chunchi Liu, Yinhao Xiao, Xiuzhen Cheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04931
Pdf URL: https://arxiv.org/pdf/2507.04931
Copy Paste: [[2507.04931]] LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks(https://arxiv.org/abs/2507.04931)
Keywords: large language model
Abstract: Dynamic Symbolic Execution (DSE) is a key technique in program analysis, widely used in software testing, vulnerability discovery, and formal verification. In distributed AI systems, DSE plays a crucial role in identifying hard-to-detect bugs, especially those arising from complex network communication patterns. However, traditional approaches to symbolic execution are often hindered by scalability issues and inefficiencies, particularly in large-scale systems. This paper introduces LIFT (Large-language-model Integrated Functional-equivalent-IR Transformation), a novel framework that leverages Large Language Models (LLMs) to automate the optimization of Intermediate Representations (IRs) in symbolic execution. LIFT addresses the challenges of symbolic execution by providing a scalable, context-sensitive solution for IR transformation. The framework consists of two phases: IR Analysis and Optimization, where LLMs optimize time-intensive IR blocks, and Symbolic Execution and Validation, which includes benchmarking and semantic verification to ensure correctness and generalizability. Experiments on real-world binaries demonstrated significant performance improvements, including a 53.5\% reduction in execution time for bigtest and a 10.24\% reduction for random, along with reductions in IR statements, PUT instructions, and temporary variables. These results demonstrate that LLMs simplify IRs while maintaining functional correctness, enhancing symbolic execution in distributed AI systems.

Title: ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

Authors: Jianjiang Yang, Ziyan Huang, Yanshu Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04943
Pdf URL: https://arxiv.org/pdf/2507.04943
Copy Paste: [[2507.04943]] ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding(https://arxiv.org/abs/2507.04943)
Keywords: robust, large language model
Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to the reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to "seeing twice and thinking backwards". Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.

Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Authors: Jianjiang Yang, Ziyan Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04946
Pdf URL: https://arxiv.org/pdf/2507.04946
Copy Paste: [[2507.04946]] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation(https://arxiv.org/abs/2507.04946)
Keywords: diffusion, generative
Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the \textbf{Hallucination Tri-Space} and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.

Title: DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04947
Pdf URL: https://arxiv.org/pdf/2507.04947
Copy Paste: [[2507.04947]] DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer(https://arxiv.org/abs/2507.04947)
Keywords: diffusion
Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.

Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Authors: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2507.04952
Pdf URL: https://arxiv.org/pdf/2507.04952
Copy Paste: [[2507.04952]] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation(https://arxiv.org/abs/2507.04952)
Keywords: generative, large language model
Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at this https URL, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.

Title: Boosting Temporal Sentence Grounding via Causal Inference

Authors: Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.04958
Pdf URL: https://arxiv.org/pdf/2507.04958
Copy Paste: [[2507.04958]] Boosting Temporal Sentence Grounding via Causal Inference(https://arxiv.org/abs/2507.04958)
Keywords: robust
Abstract: Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at this https URL.

Title: InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

Authors: Minghao Wen, Shengjie Wu, Kangkan Wang, Dong Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04961
Pdf URL: https://arxiv.org/pdf/2507.04961
Copy Paste: [[2507.04961]] InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior(https://arxiv.org/abs/2507.04961)
Keywords: diffusion
Abstract: 3D Gaussian Splatting based 3D editing has demonstrated impressive performance in recent years. However, the multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which lead to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that the existing editing methods, which rely entirely on text prompts make the editing process a "one-shot deal", making it difficult for users to control the editing degree flexibly. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views with users' preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior ($GAP^{3D}$). We project $GAP^{3D}$ to obtain 3D-constrained attention, which are fused with 2D cross-attention via Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference, and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during the later inference. Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with improved user experience.

Title: Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Authors: Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04976
Pdf URL: https://arxiv.org/pdf/2507.04976
Copy Paste: [[2507.04976]] Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models(https://arxiv.org/abs/2507.04976)
Keywords: large language model
Abstract: In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions-not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.

Title: Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading

Authors: Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang, Yitian Zhao, Huazhu Fu, Xujiong Ye, Yalin Zheng, Yanda Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04978
Pdf URL: https://arxiv.org/pdf/2507.04978
Copy Paste: [[2507.04978]] Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading(https://arxiv.org/abs/2507.04978)
Keywords: robust, diffusion
Abstract: As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2)The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model's effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at this https URL.

Title: Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning

Authors: Ruihao Zhang, Fei Ye, Dandan Meng, Yixuan Huang, Maochen, Xiao Liu
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2507.04981
Pdf URL: https://arxiv.org/pdf/2507.04981
Copy Paste: [[2507.04981]] Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning(https://arxiv.org/abs/2507.04981)
Keywords: robust, extraction
Abstract: T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, utilizing the SLEDAI score to stratify SLE patients by disease severity as well as to diagnose the site of damage in SLE patients, and effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.

Title: TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Authors: Zonglin Lyu, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04984
Pdf URL: https://arxiv.org/pdf/2507.04984
Copy Paste: [[2507.04984]] TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation(https://arxiv.org/abs/2507.04984)
Keywords: diffusion
Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3times fewer parameters. Such a parameter reduction results in 2.3x speed up. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: this https URL.

Title: Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport

Authors: Qinkai Yu, Jianyang Xie, Yitian Zhao, Cheng Chen, Lijun Zhang, Liming Chen, Jun Cheng, Lu Liu, Yalin Zheng, Yanda Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04999
Pdf URL: https://arxiv.org/pdf/2507.04999
Copy Paste: [[2507.04999]] Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport(https://arxiv.org/abs/2507.04999)
Keywords: robust
Abstract: Multimodal ophthalmic imaging-based diagnosis integrates color fundus image with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1)Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2)distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-modal feature alignment under the missing modality. Specifically, we leverage the Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving Sota performance in both complete modality and inter-modality incompleteness conditions. Code is available at this https URL

Title: Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Authors: Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05010
Pdf URL: https://arxiv.org/pdf/2507.05010
Copy Paste: [[2507.05010]] Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification(https://arxiv.org/abs/2507.05010)
Keywords: large language model
Abstract: We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. Extensive user study, qualitative and quantitative analyses prove the effectiveness of Co-DETECT.

Title: Verified Language Processing with Hybrid Explainability: A Technical Report

Authors: Oliver Robert Fox, Giacomo Bergami, Graham Morgan
Subjects: cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2507.05017
Pdf URL: https://arxiv.org/pdf/2507.05017
Copy Paste: [[2507.05017]] Verified Language Processing with Hybrid Explainability: A Technical Report(https://arxiv.org/abs/2507.05017)
Keywords: explainability, generative
Abstract: The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.

Title: Meta-Learning Transformers to Improve In-Context Generalization

Authors: Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05019
Pdf URL: https://arxiv.org/pdf/2507.05019
Copy Paste: [[2507.05019]] Meta-Learning Transformers to Improve In-Context Generalization(https://arxiv.org/abs/2507.05019)
Keywords: privacy, robust, transformer
Abstract: In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

Title: Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens

Authors: Konstantin Nikolaou, Sven Krippendorf, Samuel Tovey, Christian Holm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.05035
Pdf URL: https://arxiv.org/pdf/2507.05035
Copy Paste: [[2507.05035]] Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens(https://arxiv.org/abs/2507.05035)
Keywords: large language model
Abstract: Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings of standard vision tasks show that similar performance scaling exponents can occur even though the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.

Title: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

Authors: Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05063
Pdf URL: https://arxiv.org/pdf/2507.05063
Copy Paste: [[2507.05063]] AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics(https://arxiv.org/abs/2507.05063)
Keywords: privacy, robust, diffusion
Abstract: Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3\% to 78.4\% (+51.1\%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8\% to 76.8\% (+15.0\%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.

Title: Replacing thinking with tool usage enables reasoning in small language models

Authors: Corrado Rainone, Tim Bakker, Roland Memisevic
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05065
Pdf URL: https://arxiv.org/pdf/2507.05065
Copy Paste: [[2507.05065]] Replacing thinking with tool usage enables reasoning in small language models(https://arxiv.org/abs/2507.05065)
Keywords: large language model
Abstract: Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used for training Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens necessary to control the tool via a custom DSL. We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows for faster sampling of experience and a denser reward signal, allowing even models of size up to 3B parameters to learn how to proficiently expend additional compute on the task.

Title: ICAS: Detecting Training Data from Autoregressive Image Generative Models

Authors: Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.05068
Pdf URL: https://arxiv.org/pdf/2507.05068
Copy Paste: [[2507.05068]] ICAS: Detecting Training Data from Autoregressive Image Generative Models(https://arxiv.org/abs/2507.05068)
Keywords: privacy, robust, membership infer, generative
Abstract: Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive this http URL code is available at this https URL.

Title: Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph

Authors: Ferdinand Mütsch, Maximilian Zipfl, Nikolai Polley, J. Marius Zöllner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.05086
Pdf URL: https://arxiv.org/pdf/2507.05086
Copy Paste: [[2507.05086]] Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph(https://arxiv.org/abs/2507.05086)
Keywords: robust
Abstract: Scenario-based testing is an indispensable instrument for the comprehensive validation and verification of automated vehicles (AVs). However, finding a manageable and finite, yet representative subset of scenarios in a scalable, possibly unsupervised manner is notoriously challenging. Our work is meant to constitute a cornerstone to facilitate sample-efficient testing, while still capturing the diversity of relevant operational design domains (ODDs) and accounting for the "long tail" phenomenon in particular. To this end, we first propose an expressive and flexible heterogeneous, spatio-temporal graph model for representing traffic scenarios. Leveraging recent advances of graph neural networks (GNNs), we then propose a self-supervised method to learn a universal embedding space for scenario graphs that enables clustering and similarity search. In particular, we implement contrastive learning alongside a bootstrapping-based approach and evaluate their suitability for partitioning the scenario space. Experiments on the nuPlan dataset confirm the model's ability to capture semantics and thus group related scenarios in a meaningful way despite the absence of discrete class labels. Different scenario types materialize as distinct clusters. Our results demonstrate how variable-length traffic scenarios can be condensed into single vector representations that enable nearest-neighbor retrieval of representative candidates for distinct scenario categories. Notably, this is achieved without manual labeling or bias towards an explicit objective such as criticality. Ultimately, our approach can serve as a basis for scalable selection of scenarios to further enhance the efficiency and robustness of testing AVs in simulation.

Title: MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

Authors: Yucheng Wang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05092
Pdf URL: https://arxiv.org/pdf/2507.05092
Copy Paste: [[2507.05092]] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation(https://arxiv.org/abs/2507.05092)
Keywords: extraction, diffusion, transformer
Abstract: Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.

Title: The Hidden Threat in Plain Text: Attacking RAG Data Loaders

Authors: Alberto Castagnaro, Umberto Salviati, Mauro Conti, Luca Pajola, Simeone Pizzi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05093
Pdf URL: https://arxiv.org/pdf/2507.05093
Copy Paste: [[2507.05093]] The Hidden Threat in Plain Text: Attacking RAG Data Loaders(https://arxiv.org/abs/2507.05093)
Keywords: secure, security, attack, steal, large language model
Abstract: Large Language Models (LLMs) have transformed human-machine interaction since ChatGPT's 2022 debut, with Retrieval-Augmented Generation (RAG) emerging as a key framework that enhances LLM outputs by integrating external knowledge. However, RAG's reliance on ingesting external documents introduces new vulnerabilities. This paper exposes a critical security gap at the data loading stage, where malicious actors can stealthily corrupt RAG pipelines by exploiting document ingestion. We propose a taxonomy of 9 knowledge-based poisoning attacks and introduce two novel threat vectors -- Content Obfuscation and Content Injection -- targeting common formats (DOCX, HTML, PDF). Using an automated toolkit implementing 19 stealthy injection techniques, we test five popular data loaders, finding a 74.4% attack success rate across 357 scenarios. We further validate these threats on six end-to-end RAG systems -- including white-box pipelines and black-box services like NotebookLM and OpenAI Assistants -- demonstrating high success rates and critical vulnerabilities that bypass filters and silently compromise output integrity. Our results emphasize the urgent need to secure the document ingestion process in RAG systems against covert content manipulations.

Title: DICE: Discrete inverse continuity equation for learning population dynamics

Authors: Tobias Blickhan, Jules Berman, Andrew Stuart, Benjamin Peherstorfer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.05107
Pdf URL: https://arxiv.org/pdf/2507.05107
Copy Paste: [[2507.05107]] DICE: Discrete inverse continuity equation for learning population dynamics(https://arxiv.org/abs/2507.05107)
Keywords: robust, generative
Abstract: We introduce the Discrete Inverse Continuity Equation (DICE) method, a generative modeling approach that learns the evolution of a stochastic process from given sample populations at a finite number of time points. Models learned with DICE capture the typically smooth and well-behaved population dynamics, rather than the dynamics of individual sample trajectories that can exhibit complex or even chaotic behavior. The DICE loss function is developed specifically to be invariant, even in discrete time, to spatially constant but time-varying spurious constants that can emerge during training; this invariance increases training stability and robustness. Generating a trajectory of sample populations with DICE is fast because samples evolve directly in the time interval over which the stochastic process is formulated, in contrast to approaches that condition on time and then require multiple sampling steps per time step. DICE is stable to train, in situations where other methods for learning population dynamics fail, and DICE generates representative samples with orders of magnitude lower costs than methods that have to condition on time. Numerical experiments on a wide range of problems from random waves, Vlasov-Poisson instabilities and high-dimensional chaos are included to justify these assertions.

Title: VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Authors: Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05116
Pdf URL: https://arxiv.org/pdf/2507.05116
Copy Paste: [[2507.05116]] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting(https://arxiv.org/abs/2507.05116)
Keywords: diffusion, segmentation
Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In details, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35$\times$ faster inference and 145 Hz throughput. All the details and codes will be open-sourced.

Title: An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Authors: Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05123
Pdf URL: https://arxiv.org/pdf/2507.05123
Copy Paste: [[2507.05123]] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques(https://arxiv.org/abs/2507.05123)
Keywords: large language model
Abstract: Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For Long documents, introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.

Title: Extreme Learning Machine Based System for DDoS Attacks Detections on IoMT Devices

Authors: Nelly Elsayed, Lily Dzamesi, Zag ElSayed, Murat Ozer
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.05132
Pdf URL: https://arxiv.org/pdf/2507.05132
Copy Paste: [[2507.05132]] Extreme Learning Machine Based System for DDoS Attacks Detections on IoMT Devices(https://arxiv.org/abs/2507.05132)
Keywords: attack
Abstract: The Internet of Medical Things (IoMT) represents a paradigm shift in the healthcare sector, enabling the interconnection of medical devices, sensors, and systems to enhance patient monitoring, diagnosis, and management. The rapid evolution of IoMT presents significant benefits to the healthcare domains. However, there is a rapid increase in distributed denial of service (DDoS) attacks on the IoMT networks due to several vulnerabilities in the IoMT-connected devices, which negatively impact patients' health and can even lead to deaths. Thus, in this paper, we aim to save lives via investigating an extreme learning machine for detecting DDoS attacks on IoMT devices. The proposed approach achieves a high accuracy at a low implementation budget. Thus, it can reduce the implementation cost of the DDoS detection system, making the model capable of executing on the fog level.

Title: Deep Learning to Automate Parameter Extraction and Model Fitting of Two-Dimensional Transistors

Authors: Robert K. A. Bennett, Jan-Lucas Uslu, Harmon F. Gault, Asir Intisar Khan, Lauren Hoang, Tara Peña, Kathryn Neilson, Young Suh Song, Zhepeng Zhang, Andrew J. Mannix, Eric Pop
Subjects: cs.LG, cond-mat.mtrl-sci, physics.app-ph
Abstract URL: https://arxiv.org/abs/2507.05134
Pdf URL: https://arxiv.org/pdf/2507.05134
Copy Paste: [[2507.05134]] Deep Learning to Automate Parameter Extraction and Model Fitting of Two-Dimensional Transistors(https://arxiv.org/abs/2507.05134)
Keywords: extraction
Abstract: We present a deep learning approach to extract physical parameters (e.g., mobility, Schottky contact barrier height, defect profiles) of two-dimensional (2D) transistors from electrical measurements, enabling automated parameter extraction and technology computer-aided design (TCAD) fitting. To facilitate this task, we implement a simple data augmentation and pre-training approach by training a secondary neural network to approximate a physics-based device simulator. This method enables high-quality fits after training the neural network on electrical data generated from physics-based simulations of ~500 devices, a factor >40$\times$ fewer than other recent efforts. Consequently, fitting can be achieved by training on physically rigorous TCAD models, including complex geometry, self-consistent transport, and electrostatic effects, and is not limited to computationally inexpensive compact models. We apply our approach to reverse-engineer key parameters from experimental monolayer WS$_2$ transistors, achieving a median coefficient of determination ($R^2$) = 0.99 when fitting measured electrical data. We also demonstrate that this approach generalizes and scales well by reverse-engineering electrical data on high-electron-mobility transistors while fitting 35 parameters simultaneously. To facilitate future research on deep learning approaches for inverse transistor design, we have published our code and sample data sets online.

Title: Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Authors: Jaewook Lee, Alexander Scarlatos, Andrew Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05137
Pdf URL: https://arxiv.org/pdf/2507.05137
Copy Paste: [[2507.05137]] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization(https://arxiv.org/abs/2507.05137)
Keywords: interpretability, generative, large language model
Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.

Title: VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems

Authors: Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05146
Pdf URL: https://arxiv.org/pdf/2507.05146
Copy Paste: [[2507.05146]] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems(https://arxiv.org/abs/2507.05146)
Keywords: diffusion, generative
Abstract: The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at this https URL .

Title: AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Authors: Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05157
Pdf URL: https://arxiv.org/pdf/2507.05157
Copy Paste: [[2507.05157]] AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models(https://arxiv.org/abs/2507.05157)
Keywords: transformer, generative, large language model
Abstract: Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content including style and topic and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT_4o-mini and the BERT model has achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.

Title: InfoSteer: Steering Information Utility in Language Model Post-Training

Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05158
Pdf URL: https://arxiv.org/pdf/2507.05158
Copy Paste: [[2507.05158]] InfoSteer: Steering Information Utility in Language Model Post-Training(https://arxiv.org/abs/2507.05158)
Keywords: interpretability
Abstract: Recent advancements in language models (LMs) gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present \ours, a lightweight method that encourages parametric information utilization in LMs during post-training. This is achieved via treating FFN layer as associate key-value memory, and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find this simple guidance during post-training phase delivers consistent performance improvements across diverse model families--including Qwen, Gemma and Llama-spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information-placing more emphasis on generating semantically meaningful tokens, while using fewer resources on simple transition ones (e.g., `,' or `and'). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.

Title: 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Authors: Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05163
Pdf URL: https://arxiv.org/pdf/2507.05163
Copy Paste: [[2507.05163]] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture(https://arxiv.org/abs/2507.05163)
Keywords: diffusion, generative
Abstract: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.

Title: Critiques of World Models

Authors: Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05169
Pdf URL: https://arxiv.org/pdf/2507.05169
Copy Paste: [[2507.05169]] Critiques of World Models(https://arxiv.org/abs/2507.05169)
Keywords: generative
Abstract: World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Title: OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model

Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.05177
Pdf URL: https://arxiv.org/pdf/2507.05177
Copy Paste: [[2507.05177]] OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model(https://arxiv.org/abs/2507.05177)
Keywords: large language model
Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL

Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Authors: Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05179
Pdf URL: https://arxiv.org/pdf/2507.05179
Copy Paste: [[2507.05179]] From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations(https://arxiv.org/abs/2507.05179)
Keywords: robust
Abstract: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters -- Actuality and Finesse -- into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.

Title: $φ$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery

Authors: Hoang-Quan Nguyen, Xuan Bac Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05184
Pdf URL: https://arxiv.org/pdf/2507.05184
Copy Paste: [[2507.05184]] $φ$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery(https://arxiv.org/abs/2507.05184)
Keywords: interpretability
Abstract: Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present $\varphi$-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.

Title: Satellite-based Rabi rice paddy field mapping in India: a case study on Telangana state

Authors: Prashanth Reddy Putta, Fabio Dell'Acqua (University of Pavia)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05189
Pdf URL: https://arxiv.org/pdf/2507.05189
Copy Paste: [[2507.05189]] Satellite-based Rabi rice paddy field mapping in India: a case study on Telangana state(https://arxiv.org/abs/2507.05189)
Keywords: security
Abstract: Accurate rice area monitoring is critical for food security and agricultural policy in smallholder farming regions, yet conventional remote sensing approaches struggle with the spatiotemporal heterogeneity characteristic of fragmented agricultural landscapes. This study developed a phenology-driven classification framework that systematically adapts to local agro-ecological variations across 32 districts in Telangana, India during the 2018-19 Rabi rice season. The research reveals significant spatiotemporal diversity, with phenological timing varying by up to 50 days between districts and field sizes ranging from 0.01 to 2.94 hectares. Our district-specific calibration approach achieved 93.3% overall accuracy, an 8.0 percentage point improvement over conventional regional clustering methods, with strong validation against official government statistics (R^2 = 0.981) demonstrating excellent agreement between remotely sensed and ground truth data. The framework successfully mapped 732,345 hectares by adapting to agro-climatic variations, with Northern districts requiring extended land preparation phases (up to 55 days) while Southern districts showed compressed cultivation cycles. Field size analysis revealed accuracy declining 6.8 percentage points from medium to tiny fields, providing insights for operational monitoring in fragmented landscapes. These findings demonstrate that remote sensing frameworks must embrace rather than simplify landscape complexity, advancing region-specific agricultural monitoring approaches that maintain scientific rigor while serving practical policy and food security applications.

Title: Pre-Trained Policy Discriminators are General Reward Models

Authors: Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05197
Pdf URL: https://arxiv.org/pdf/2507.05197
Copy Paste: [[2507.05197]] Pre-Trained Policy Discriminators are General Reward Models(https://arxiv.org/abs/2507.05197)
Keywords: robust
Abstract: We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.

Title: All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Authors: Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05211
Pdf URL: https://arxiv.org/pdf/2507.05211
Copy Paste: [[2507.05211]] All in One: Visual-Description-Guided Unified Point Cloud Segmentation(https://arxiv.org/abs/2507.05211)
Keywords: large language model, segmentation
Abstract: Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at this https URL.

Title: Hunting in the Dark: Metrics for Early Stage Traffic Discovery

Authors: Max Gao, Michael Collins, Ricky Mok, kc Claffy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.05213
Pdf URL: https://arxiv.org/pdf/2507.05213
Copy Paste: [[2507.05213]] Hunting in the Dark: Metrics for Early Stage Traffic Discovery(https://arxiv.org/abs/2507.05213)
Keywords: security, attack
Abstract: Threat hunting is an operational security process where an expert analyzes traffic, applying knowledge and lightweight tools on unlabeled data in order to identify and classify previously unknown phenomena. In this paper, we examine threat hunting metrics and practice by studying the detection of Crackonosh, a cryptojacking malware package, has on various metrics for identifying its behavior. Using a metric for discoverability, we model the ability of defenders to measure Crackonosh traffic as the malware population decreases, evaluate the strength of various detection methods, and demonstrate how different darkspace sizes affect both the ability to track the malware, but enable emergent behaviors by exploiting attacker mistakes.

Title: CTA: Cross-Task Alignment for Better Test Time Training

Authors: Samuel Barbeau, Pedram Fekri, David Osowiechi, Ali Bahri, Moslem YazdanpanahMasih Aminbeidokhti, Christian Desrosiers
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05221
Pdf URL: https://arxiv.org/pdf/2507.05221
Copy Paste: [[2507.05221]] CTA: Cross-Task Alignment for Better Test Time Training(https://arxiv.org/abs/2507.05221)
Keywords: robust
Abstract: Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.

Title: Cascade: Token-Sharded Private LLM Inference

Authors: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2507.05228
Pdf URL: https://arxiv.org/pdf/2507.05228
Copy Paste: [[2507.05228]] Cascade: Token-Sharded Private LLM Inference(https://arxiv.org/abs/2507.05228)
Keywords: secure, privacy, attack
Abstract: As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services -- where LLMs are hosted by third parties with significant computational resources -- are becoming increasingly popular. However, third party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and scalability. We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes, and that it is further resistant to learning-based attacks. As Cascade is orders of magnitude faster than existing schemes, our findings offer practical solutions for secure deployment of modern state-of-the-art LLMs.

Title: Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage

Authors: Markiyan Kostiv, Anatolii Adamovskyi, Yevhen Cherniavskyi, Mykyta Varenyk, Ostap Viniavskyi, Igor Krashenyi, Oles Dobosevych
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05229
Pdf URL: https://arxiv.org/pdf/2507.05229
Copy Paste: [[2507.05229]] Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage(https://arxiv.org/abs/2507.05229)
Keywords: robust
Abstract: Multi-object tracking (MOT) aims to maintain consistent identities of objects across video frames. Associating objects in low-frame-rate videos captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios is complex due to rapid changes in object appearance and position within the frame. The task becomes even more challenging due to image degradation caused by cloud video streaming and compression algorithms. We present how instance association learning from single-frame annotations can overcome these challenges. We show that global features of the scene provide crucial context for low-FPS instance association, allowing our solution to be robust to distractors and gaps in detections. We also demonstrate that such a tracking approach maintains high association quality even when reducing the input image resolution and latent representation size for faster inference. Finally, we present a benchmark dataset of annotated military vehicles collected from publicly available data sources. This paper was initially presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST)Scientific and Technical Committee, IST-209-RSY - the ICMCIS, held in Oeiras, Portugal, 13-14 May 2025.

Title: Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Authors: Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05248
Pdf URL: https://arxiv.org/pdf/2507.05248
Copy Paste: [[2507.05248]] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models(https://arxiv.org/abs/2507.05248)
Keywords: attack, large language model
Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. They are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at this https URL.

Title: From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving

Authors: Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05254
Pdf URL: https://arxiv.org/pdf/2507.05254
Copy Paste: [[2507.05254]] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving(https://arxiv.org/abs/2507.05254)
Keywords: generative
Abstract: Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at this https URL.

Title: Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.05255
Pdf URL: https://arxiv.org/pdf/2507.05255
Copy Paste: [[2507.05255]] Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning(https://arxiv.org/abs/2507.05255)
Keywords: large language model
Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

Title: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05257
Pdf URL: https://arxiv.org/pdf/2507.05257
Copy Paste: [[2507.05257]] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions(https://arxiv.org/abs/2507.05257)
Keywords: large language model
Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

Title: Spatio-Temporal LLM: Reasoning about Environments and Actions

Authors: Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt, Alex Schwing
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05258
Pdf URL: https://arxiv.org/pdf/2507.05258
Copy Paste: [[2507.05258]] Spatio-Temporal LLM: Reasoning about Environments and Actions(https://arxiv.org/abs/2507.05258)
Keywords: large language model
Abstract: Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at this https URL.

Title: Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05259
Pdf URL: https://arxiv.org/pdf/2507.05259
Copy Paste: [[2507.05259]] Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing(https://arxiv.org/abs/2507.05259)
Keywords: diffusion, large language model, segmentation
Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.

Title: Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

Authors: Xiang Xu, Lingdong Kong, Song Wang, Chuanwei Zhou, Qingshan Liu
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05260
Pdf URL: https://arxiv.org/pdf/2507.05260
Copy Paste: [[2507.05260]] Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations(https://arxiv.org/abs/2507.05260)
Keywords: segmentation
Abstract: LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has be made publicly accessible for future research.