2025-07-18

Title: MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Authors: Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12508
Pdf URL: https://arxiv.org/pdf/2507.12508
Copy Paste: [[2507.12508]] MindJourney: Test-Time Scaling with World Models for Spatial Reasoning(https://arxiv.org/abs/2507.12508)
Keywords: robust, diffusion
Abstract: Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over this multi-view evidence gathered during the interactive exploration. Without any fine-tuning, our MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Meanwhile, our method also improves upon the test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of our method that utilizes world models for test-time scaling.

Title: Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility

Authors: Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12553
Pdf URL: https://arxiv.org/pdf/2507.12553
Copy Paste: [[2507.12553]] Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility(https://arxiv.org/abs/2507.12553)
Keywords: interpretability
Abstract: Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.

Title: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Authors: Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12566
Pdf URL: https://arxiv.org/pdf/2507.12566
Copy Paste: [[2507.12566]] Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models(https://arxiv.org/abs/2507.12566)
Keywords: large language model
Abstract: This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at this https URL.

Title: Safeguarding Federated Learning-based Road Condition Classification

Authors: Sheng Liu, Panos Papadimitratos
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12568
Pdf URL: https://arxiv.org/pdf/2507.12568
Copy Paste: [[2507.12568]] Safeguarding Federated Learning-based Road Condition Classification(https://arxiv.org/abs/2507.12568)
Keywords: privacy, attack, federate
Abstract: Federated Learning (FL) has emerged as a promising solution for privacy-preserving autonomous driving, specifically camera-based Road Condition Classification (RCC) systems, harnessing distributed sensing, computing, and communication resources on board vehicles without sharing sensitive image data. However, the collaborative nature of FL-RCC frameworks introduces new vulnerabilities: Targeted Label Flipping Attacks (TLFAs), in which malicious clients (vehicles) deliberately alter their training data labels to compromise the learned model inference performance. Such attacks can, e.g., cause a vehicle to misclassify slippery, dangerous road conditions as pristine and exceed recommended speed. Yet studies of TLFAs on FL-based RCC systems are largely missing. We address this challenge with a threefold contribution: 1) we disclose the vulnerability of existing FL-RCC systems to TLFAs; 2) we introduce a novel label-distance-based metric to precisely quantify the safety risks posed by TLFAs; and 3) we propose FLARE, a defensive mechanism leveraging neuron-wise analysis of the output layer to mitigate TLFA effects. Extensive experiments across three RCC tasks, four evaluation metrics, six baselines, and three deep learning models demonstrate both the severity of TLFAs on FL-RCC systems and the effectiveness of FLARE in mitigating the attack impact.

Title: Assay2Mol: large language model-based drug design using BioAssay context

Authors: Yifan Deng, Spencer S. Ericksen, Anthony Gitter
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.12574
Pdf URL: https://arxiv.org/pdf/2507.12574
Copy Paste: [[2507.12574]] Assay2Mol: large language model-based drug design using BioAssay context(https://arxiv.org/abs/2507.12574)
Keywords: large language model
Abstract: Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, molecule screening assays evaluate the functional responses of candidate molecules against disease targets. Unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for new drug discovery campaigns but has remained untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate molecules using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand molecules for target protein structures, while also promoting more synthesizable molecule generation.

Title: Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows

Authors: Judy Long, Tao Liu, Sean Alexander Woznicki, Miljana Marković, Oskar Marko, Molly Sears
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12590
Pdf URL: https://arxiv.org/pdf/2507.12590
Copy Paste: [[2507.12590]] Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows(https://arxiv.org/abs/2507.12590)
Keywords: robust, transformer
Abstract: Crop mapping involves identifying and classifying crop types using spatial data, primarily derived from remote sensing imagery. This study presents the first comprehensive review of large-scale, pixel-wise crop mapping workflows, encompassing both conventional supervised methods and emerging transfer learning approaches. To identify the optimal supervised crop mapping workflows, we conducted systematic experiments, comparing six widely adopted satellite image-based preprocessing methods, alongside eleven supervised pixel-wise classification models. Additionally, we assessed the synergistic impact of varied training sample sizes and variable combinations. Moreover, we identified optimal transfer learning techniques for different magnitudes of domain shift. The evaluation of best methods was conducted across five diverse agricultural sites. Landsat 8 served as the primary satellite data source. Labels come from CDL trusted pixels and field surveys. Our findings reveal three key insights. First, fine-scale interval preprocessing paired with Transformer models consistently delivered optimal performance for both supervised and transferable workflows. RF offered rapid training and competitive performance in conventional supervised learning and direct transfer to similar domains. Second, transfer learning techniques enhanced workflow adaptability, with UDA being effective for homogeneous crop classes while fine-tuning remains robust across diverse scenarios. Finally, workflow choice depends heavily on the availability of labeled samples. With a sufficient sample size, supervised training typically delivers more accurate and generalizable results. Below a certain threshold, transfer learning that matches the level of domain shift is a viable alternative to achieve crop mapping. Repository: Best-Practices-for-Large-Scale-Pixel-Wise-Crop-Mapping-and-Transfer-Learning-Workflows

Title: MS-DGCNN++: A Multi-Scale Fusion Dynamic Graph Neural Network with Biological Knowledge Integration for LiDAR Tree Species Classification

Authors: Said Ohamouddou, Abdellatif El Afia, Hanaa El Afia, Raddouane Chiheb
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12602
Pdf URL: https://arxiv.org/pdf/2507.12602
Copy Paste: [[2507.12602]] MS-DGCNN++: A Multi-Scale Fusion Dynamic Graph Neural Network with Biological Knowledge Integration for LiDAR Tree Species Classification(https://arxiv.org/abs/2507.12602)
Keywords: extraction, transformer
Abstract: Tree species classification from terrestrial LiDAR point clouds is challenging because of the complex multi-scale geometric structures in forest environments. Existing approaches using multi-scale dynamic graph convolutional neural networks (MS-DGCNN) employ parallel multi-scale processing, which fails to capture the semantic relationships between the hierarchical levels of the tree architecture. We present MS-DGCNN++, a hierarchical multiscale fusion dynamic graph convolutional network that uses semantically meaningful feature extraction at local, branch, and canopy scales with cross-scale information propagation. Our method employs scale-specific feature engineering, including standard geometric features for the local scale, normalized relative vectors for the branch scale, and distance information for the canopy scale. This hierarchical approach replaces uniform parallel processing with semantically differentiated representations that are aligned with the natural tree structure. Under the same proposed tree species data augmentation strategy for all experiments, MS-DGCNN++ achieved an accuracy of 94.96% on STPCTLS, outperforming DGCNN, MS-DGCNN, and the state-of-the-art model PPT. On FOR-species20K, it achieves 67.25% accuracy (6.1% improvement compared to MS-DGCNN). For standard 3D object recognition, our method outperformed DGCNN and MS-DGCNN with overall accuracies of 93.15% on ModelNet40 and 94.05% on ModelNet10. With lower parameters and reduced complexity compared to state-of-the-art transformer approaches, our method is suitable for resource-constrained applications while maintaining a competitive accuracy. Beyond tree classification, the method generalizes to standard 3D object recognition, establishing it as a versatile solution for diverse point cloud processing applications. The implementation code is publicly available at this https URL.

Title: Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning

Authors: Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12612
Pdf URL: https://arxiv.org/pdf/2507.12612
Copy Paste: [[2507.12612]] Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning(https://arxiv.org/abs/2507.12612)
Keywords: robust, large language model
Abstract: The performance of finetuned large language models (LLMs) hinges critically on the composition of the training mixture. However, selecting an optimal blend of task datasets remains a largely manual, heuristic driven process, with practitioners often relying on uniform or size based sampling strategies. We introduce TASKPGM, a principled and scalable framework for mixture optimization that selects continuous task proportions by minimizing an energy function over a Markov Random Field (MRF). Task relationships are modeled using behavioral divergences such as Jensen Shannon Divergence and Pointwise Mutual Information computed from the predictive distributions of single task finetuned models. Our method yields a closed form solution under simplex constraints and provably balances representativeness and diversity among tasks. We provide theoretical guarantees, including weak submodularity for budgeted variants, and demonstrate consistent empirical improvements on Llama 2 and Mistral across evaluation suites such as MMLU and BIGBench. Beyond performance, TASKPGM offers interpretable insights into task influence and mixture composition, making it a powerful tool for efficient and robust LLM finetuning.
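
The abstract names Jensen-Shannon Divergence as one of the behavioral divergences computed between the predictive distributions of single-task finetuned models. The following is a minimal sketch of that divergence only, using the standard textbook definition; it does not reproduce TASKPGM's energy function or closed-form solution, and the function names and example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two discrete distributions."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical predictive distributions of two single-task models
# over the same three-token vocabulary.
task_a = [0.7, 0.2, 0.1]
task_b = [0.1, 0.2, 0.7]
divergence_ab = jensen_shannon_divergence(task_a, task_b)
```

With natural logarithms the value lies between 0 (identical distributions) and log 2 (disjoint support), which makes it convenient as a pairwise task-dissimilarity score.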

Title: BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training

Authors: Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2507.12619
Pdf URL: https://arxiv.org/pdf/2507.12619
Copy Paste: [[2507.12619]] BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training(https://arxiv.org/abs/2507.12619)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLMs, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of Bootseer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, Bootseer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. Bootseer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.

Title: Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection

Authors: Sandipan Sarma, Agney Talwarr, Arijit Sur
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12628
Pdf URL: https://arxiv.org/pdf/2507.12628
Copy Paste: [[2507.12628]] Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection(https://arxiv.org/abs/2507.12628)
Keywords: transformer
Abstract: Human-object interaction detection (HOID) refers to localizing interactive human-object pairs in images and identifying the interactions. Since there could be an exponential number of object-action combinations, labeled data is limited - leading to a long-tail distribution problem. Recently, zero-shot learning emerged as a solution, with end-to-end transformer-based object detectors adapted for HOID becoming successful frameworks. However, their primary focus is designing improved decoders for learning entangled or disentangled interpretations of interactions. We advocate that HOI-specific cues must be anticipated at the encoder stage itself to obtain a stronger scene interpretation. Consequently, we build a top-down framework named Funnel-HOI inspired by the human tendency to grasp well-defined concepts first and then associate them with abstract concepts during scene understanding. We first probe an image for the presence of objects (well-defined concepts) and then probe for actions (abstract concepts) associated with them. A novel asymmetric co-attention mechanism mines these cues utilizing multimodal information (incorporating zero-shot capabilities) and yields stronger interaction representations at the encoder level. Furthermore, a novel loss is devised that considers object-action relatedness and regulates misclassification penalty better than existing loss functions for guiding the interaction classifier. Extensive experiments on the HICO-DET and V-COCO datasets across fully-supervised and six zero-shot settings reveal our state-of-the-art performance, with up to 12.4% and 8.4% gains for unseen and rare HOI categories, respectively.

Title: Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Authors: Kaihua Chen, Tarasha Khurana, Deva Ramanan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12646
Pdf URL: https://arxiv.org/pdf/2507.12646
Copy Paste: [[2507.12646]] Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos(https://arxiv.org/abs/2507.12646)
Keywords: diffusion
Abstract: We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.

Title: Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective

Authors: Kai Malcolm, César Uribe, Momona Yamagami
Subjects: cs.LG, cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2507.12652
Pdf URL: https://arxiv.org/pdf/2507.12652
Copy Paste: [[2507.12652]] Federated Learning in Open- and Closed-Loop EMG Decoding: A Privacy and Performance Perspective(https://arxiv.org/abs/2507.12652)
Keywords: privacy, federate
Abstract: Invasive and non-invasive neural interfaces hold promise as high-bandwidth input devices for next-generation technologies. However, neural signals inherently encode sensitive information about an individual's identity and health, making data sharing for decoder training a critical privacy challenge. Federated learning (FL), a distributed, privacy-preserving learning framework, presents a promising solution, but it remains unexplored in closed-loop adaptive neural interfaces. Here, we introduce FL-based neural decoding and systematically evaluate its performance and privacy using high-dimensional electromyography signals in both open- and closed-loop scenarios. In open-loop simulations, FL significantly outperformed local learning baselines, demonstrating its potential for high-performance, privacy-conscious neural decoding. In contrast, closed-loop user studies required adapting FL methods to accommodate single-user, real-time interactions, a scenario not supported by standard FL. This modification resulted in local learning decoders surpassing the adapted FL approach in closed-loop performance, yet local learning still carried higher privacy risks. Our findings highlight a critical performance-privacy tradeoff in real-time adaptive applications and indicate the need for FL methods specifically designed for co-adaptive, single-user applications.

Title: Improving physics-informed neural network extrapolation via transfer learning and adaptive activation functions

Authors: Athanasios Papastathopoulos-Katsaros, Alexandra Stavrianidi, Zhandong Liu
Subjects: cs.LG, cs.AI, math.DS, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2507.12659
Pdf URL: https://arxiv.org/pdf/2507.12659
Copy Paste: [[2507.12659]] Improving physics-informed neural network extrapolation via transfer learning and adaptive activation functions(https://arxiv.org/abs/2507.12659)
Keywords: robust
Abstract: Physics-Informed Neural Networks (PINNs) are deep learning models that incorporate the governing physical laws of a system into the learning process, making them well-suited for solving complex scientific and engineering problems. Recently, PINNs have gained widespread attention as a powerful framework for combining physical principles with data-driven modeling to improve prediction accuracy. Despite their successes, however, PINNs often exhibit poor extrapolation performance outside the training domain and are highly sensitive to the choice of activation functions (AFs). In this paper, we introduce a transfer learning (TL) method to improve the extrapolation capability of PINNs. Our approach applies TL within an extended training domain, using only a small number of carefully selected collocation points. Additionally, we propose an adaptive AF that takes the form of a linear combination of standard AFs, which improves both the robustness and accuracy of the model. Through a series of experiments, we demonstrate that our method achieves an average of 40% reduction in relative L2 error and an average of 50% reduction in mean absolute error in the extrapolation domain, all without a significant increase in computational cost. The code is available at this https URL.
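
The abstract describes the adaptive AF as a linear combination of standard AFs. A minimal sketch of that idea follows, assuming a hypothetical basis of tanh, sin, ReLU, and sigmoid with fixed weights; the abstract does not state which functions are combined, and in a real PINN the weights would presumably be trained alongside the network parameters.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical basis of standard activation functions (an assumption,
# not taken from the paper).
BASIS = (math.tanh, math.sin, relu, sigmoid)

def adaptive_activation(x, weights, basis=BASIS):
    """Return sum_i weights[i] * basis[i](x) for a scalar input x."""
    if len(weights) != len(basis):
        raise ValueError("need exactly one weight per basis function")
    return sum(w * f(x) for w, f in zip(weights, basis))

# Putting all weight on one basis function recovers that function,
# so the combination can fall back to any single standard AF.
only_tanh = adaptive_activation(0.5, [1.0, 0.0, 0.0, 0.0])
blended = adaptive_activation(0.5, [0.5, 0.5, 0.0, 0.0])
```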

Title: Integrated Oculomics and Lipidomics Reveal Microvascular Metabolic Signatures Associated with Cardiovascular Health in a Healthy Cohort

Authors: Inamullah, Ernesto Elias Vidal Rosas, Imran Razzak, Shoaib Jameel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12663
Pdf URL: https://arxiv.org/pdf/2507.12663
Copy Paste: [[2507.12663]] Integrated Oculomics and Lipidomics Reveal Microvascular Metabolic Signatures Associated with Cardiovascular Health in a Healthy Cohort(https://arxiv.org/abs/2507.12663)
Keywords: robust
Abstract: Cardiovascular disease (CVD) remains the leading global cause of mortality, yet current risk stratification methods often fail to detect early, subclinical changes. Previous studies have generally not integrated retinal microvasculature characteristics with comprehensive serum lipidomic profiles as potential indicators of CVD risk. In this study, an innovative imaging omics framework was introduced, combining retinal microvascular traits derived through deep learning based image processing with serum lipidomic data to highlight asymptomatic biomarkers of cardiovascular risk beyond the conventional lipid panel. This represents the first large scale, covariate adjusted and stratified correlation analysis conducted in a healthy population, which is essential for identifying early indicators of disease. Retinal phenotypes were quantified using automated image analysis tools, while serum lipid profiling was performed by Ultra High Performance Liquid Chromatography Electrospray ionization High resolution mass spectrometry (UHPLC ESI HRMS). Strong, age- and sex-independent correlations were established, particularly between average artery width, vessel density, and lipid subclasses such as triacylglycerols (TAGs), diacylglycerols (DAGs), and ceramides (Cers). These associations suggest a converging mechanism of microvascular remodeling under metabolic stress. By linking detailed vascular structural phenotypes to specific lipid species, this study fills a critical gap in the understanding of early CVD pathogenesis. This integration not only offers a novel perspective on microvascular metabolic associations but also presents a significant opportunity for the identification of robust, non-invasive biomarkers. Ultimately, these findings may support improved early detection, targeted prevention, and personalized approaches in cardiovascular healthcare.

Title: The first open machine translation system for the Chechen language

Authors: Abu-Viskhan A. Umishov, Vladislav A. Grigorian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.12672
Pdf URL: https://arxiv.org/pdf/2507.12672
Copy Paste: [[2507.12672]] The first open machine translation system for the Chechen language(https://arxiv.org/abs/2507.12672)
Keywords: large language model
Abstract: We introduce the first open-source model for translation between the vulnerable Chechen language and Russian, and the dataset collected to train and evaluate it. We explore fine-tuning capabilities for including a new language into a large language model system for multilingual translation NLLB-200. The BLEU / ChrF++ scores for our model are 8.34 / 34.69 and 20.89 / 44.55 for translation from Russian to Chechen and reverse direction, respectively. The release of the translation models is accompanied by the distribution of parallel words, phrases and sentences corpora and multilingual sentence encoder adapted to the Chechen language.

Title: FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation via Kolmogorov-Arnold Enhanced Spatial Attention Networks

Authors: Christina Thrainer, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Christian Guetl, Steven Sloan, Kendall N. Niles, Ken Pathak
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2507.12675
Pdf URL: https://arxiv.org/pdf/2507.12675
Copy Paste: [[2507.12675]] FORTRESS: Function-composition Optimized Real-Time Resilient Structural Segmentation via Kolmogorov-Arnold Enhanced Spatial Attention Networks(https://arxiv.org/abs/2507.12675)
Keywords: robust, segmentation
Abstract: Automated structural defect segmentation in civil infrastructure faces a critical challenge: achieving high accuracy while maintaining computational efficiency for real-time deployment. This paper presents FORTRESS (Function-composition Optimized Real-Time Resilient Structural Segmentation), a new architecture that balances accuracy and speed by using a special method that combines depthwise separable convolutions with adaptive Kolmogorov-Arnold Network integration. FORTRESS incorporates three key innovations: a systematic depthwise separable convolution framework achieving a 3.6x parameter reduction per layer, adaptive TiKAN integration that selectively applies function composition transformations only when computationally beneficial, and multi-scale attention fusion combining spatial, channel, and KAN-enhanced features across decoder levels. The architecture achieves remarkable efficiency gains with 91% parameter reduction (31M to 2.9M), 91% computational complexity reduction (13.7 to 1.17 GFLOPs), and 3x inference speed improvement while delivering superior segmentation performance. Evaluation on benchmark infrastructure datasets demonstrates state-of-the-art results with an F1-score of 0.771 and a mean IoU of 0.677, significantly outperforming existing methods including U-Net, SA-UNet, and U-KAN. The dual optimization strategy proves essential for optimal performance, establishing FORTRESS as a robust solution for practical structural defect segmentation in resource-constrained environments where both accuracy and computational efficiency are paramount. Comprehensive architectural specifications are provided in the Supplemental Material. Source code is available at URL: this https URL.
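
The quoted reduction percentages can be checked directly from the before and after figures in the abstract. The sketch below does that arithmetic and also shows the textbook weight count for a standard versus a depthwise separable convolution (bias terms ignored). The layer shape is a hypothetical example, so its ratio is not expected to match the 3.6x per-layer figure, which depends on the authors' specific layer design.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrinks going from before to after."""
    return 1.0 - after / before

def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution mapping c_in to c_out channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# Figures quoted in the abstract.
param_cut = relative_reduction(31e6, 2.9e6)  # parameters: 31M to 2.9M
flop_cut = relative_reduction(13.7, 1.17)    # GFLOPs: 13.7 to 1.17

# Hypothetical layer: 3 x 3 kernel, 64 input and 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
ratio = standard / separable
```

Both reductions round to 91%, consistent with the abstract.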

Title: Improving Drug Identification in Overdose Death Surveillance using Large Language Models

Authors: Arthur J. Funnell, Panayiotis Petousis, Fabrice Harel-Canada, Ruby Romero, Alex A. T. Bui, Adam Koncsol, Hritika Chaturvedi, Chelsea Shover, David Goodman-Meza
Subjects: cs.CL, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.12679
Pdf URL: https://arxiv.org/pdf/2507.12679
Copy Paste: [[2507.12679]] Improving Drug Identification in Overdose Death Surveillance using Large Language Models(https://arxiv.org/abs/2507.12679)
Keywords: robust, transformer, large language model
Abstract: The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into ICD (International Classification of Disease)-10 classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple U.S. jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3,335 records from 2023-2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores >=0.998 on the internal test set. External validation confirmed robustness (macro F1=0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only large language models. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
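
Model performance above is reported as macro-averaged F1, which weights every class equally regardless of how often it occurs. A minimal sketch of the metric follows, assuming a simplified single-label setting with hypothetical drug labels; the study itself classifies specific drug involvement and may compute the metric per drug in a multi-label fashion.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention: a class with no true or predicted instances scores 0.
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical records, for illustration only.
y_true = ["fentanyl", "fentanyl", "cocaine", "other"]
y_pred = ["fentanyl", "cocaine", "cocaine", "other"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally, a rare drug that is often missed lowers the macro score as much as a common one would.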

Title: AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

Authors: S M Rafiuddin, Sadia Kamal, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.12695
Pdf URL: https://arxiv.org/pdf/2507.12695
Copy Paste: [[2507.12695]] AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2507.12695)
Keywords: extraction
Abstract: We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model's ability to adjust its focus dynamically based on the context's relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.

Title: PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform

Authors: Xiangyi Chen, Kousik Rajesh, Matthew Lawhon, Zelun Wang, Hanyu Li, Haomiao Li, Saurabh Vishwas Joshi, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, Jiajing Xu, Charles Rosenberg
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2507.12704
Pdf URL: https://arxiv.org/pdf/2507.12704
Copy Paste: [[2507.12704]] PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform(https://arxiv.org/abs/2507.12704)
Keywords: transformer
Abstract: User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than half a billion users across various applications.

Title: AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation

Authors: Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.12705
Pdf URL: https://arxiv.org/pdf/2507.12705
Copy Paste: [[2507.12705]] AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation(https://arxiv.org/abs/2507.12705)
Keywords: robust
Abstract: Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.

Title: From SGD to Spectra: A Theory of Neural Network Weight Dynamics

Authors: Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12709
Pdf URL: https://arxiv.org/pdf/2507.12709
Copy Paste: [[2507.12709]] From SGD to Spectra: A Theory of Neural Network Weight Dynamics(https://arxiv.org/abs/2507.12709)
Keywords: transformer
Abstract: Deep neural networks have revolutionized machine learning, yet their training dynamics remain theoretically unclear. We develop a continuous-time, matrix-valued stochastic differential equation (SDE) framework that rigorously connects the microscopic dynamics of SGD to the macroscopic evolution of singular-value spectra in weight matrices. We derive exact SDEs showing that squared singular values follow Dyson Brownian motion with eigenvalue repulsion, and characterize stationary distributions as gamma-type densities with power-law tails, providing the first theoretical explanation for the empirically observed 'bulk+tail' spectral structure in trained networks. Through controlled experiments on transformer and MLP architectures, we validate our theoretical predictions and demonstrate quantitative agreement between SDE-based forecasts and observed spectral evolution, providing a rigorous foundation for understanding why deep learning works.

Title: A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique

Authors: Homare Sueyoshi, Kiyoshi Nishikawa, Hitoshi Kiya
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2507.12730
Pdf URL: https://arxiv.org/pdf/2507.12730
Copy Paste: [[2507.12730]] A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique(https://arxiv.org/abs/2507.12730)
Keywords: privacy, transformer, segmentation
Abstract: We propose a privacy-preserving semantic-segmentation method for applying perceptual encryption to images used for model training in addition to test images. This method also provides almost the same accuracy as models without any encryption. The above performance is achieved using a domain-adaptation technique on the embedding structure of the Vision Transformer (ViT). The effectiveness of the proposed method was experimentally confirmed in terms of the accuracy of semantic segmentation when using a powerful semantic-segmentation model with ViT called Segmentation Transformer.

Title: Strategy Adaptation in Large Language Model Werewolf Agents

Authors: Fuya Nakamori, Yin Jou Huang, Fei Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.12732
Pdf URL: https://arxiv.org/pdf/2507.12732
Copy Paste: [[2507.12732]] Strategy Adaptation in Large Language Model Werewolf Agents(https://arxiv.org/abs/2507.12732)
Keywords: large language model
Abstract: This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior work on Werewolf agents using prompt engineering has employed methods where effective strategies are implicitly defined, such methods cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.

Title: Transformer-based Spatial Grounding: A Comprehensive Survey

Authors: Ijazul Haq, Muhammad Saqib, Yingjie Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12739
Pdf URL: https://arxiv.org/pdf/2507.12739
Copy Paste: [[2507.12739]] Transformer-based Spatial Grounding: A Comprehensive Survey(https://arxiv.org/abs/2507.12739)
Keywords: robust, transformer
Abstract: Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.

Title: Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning

Authors: Suorong Yang, Peijia Li, Yujie Liu, Zhiming Xu, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.12750
Pdf URL: https://arxiv.org/pdf/2507.12750
Copy Paste: [[2507.12750]] Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning(https://arxiv.org/abs/2507.12750)
Keywords: robust
Abstract: Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.

Title: Domain-Enhanced Dual-Branch Model for Efficient and Interpretable Accident Anticipation

Authors: Yanchen Guan, Haicheng Liao, Chengyue Wang, Bonan Wang, Jiaxun Zhang, Jia Hu, Zhenning Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12755
Pdf URL: https://arxiv.org/pdf/2507.12755
Copy Paste: [[2507.12755]] Domain-Enhanced Dual-Branch Model for Efficient and Interpretable Accident Anticipation(https://arxiv.org/abs/2507.12755)
Keywords: interpretability
Abstract: Developing a precise and computationally efficient traffic accident anticipation system is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework employing a dual-branch architecture that effectively integrates visual information from dashcam videos with structured textual data derived from accident reports. Furthermore, we introduce a feature aggregation method that facilitates seamless integration of multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies to produce actionable feedback and standardized accident archives. Comprehensive evaluations conducted on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, thus establishing a new benchmark for state-of-the-art performance in traffic accident anticipation.

Title: HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation

Authors: Wangzheng Shi, Yinglin Zheng, Yuxin Lin, Jianmin Bao, Ming Zeng, Dong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12758
Pdf URL: https://arxiv.org/pdf/2507.12758
Copy Paste: [[2507.12758]] HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation(https://arxiv.org/abs/2507.12758)
Keywords: robust
Abstract: Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates an Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.

Title: Unified Medical Image Segmentation with State Space Modeling Snake

Authors: Ruicheng Zhang, Haowei Guo, Kanghui Tian, Jun Zhou, Mingliang Yan, Zeyu Zhang, Shen Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12760
Pdf URL: https://arxiv.org/pdf/2507.12760
Copy Paste: [[2507.12760]] Unified Medical Image Segmentation with State Space Modeling Snake(https://arxiv.org/abs/2507.12760)
Keywords: robust, segmentation
Abstract: Unified Medical Image Segmentation (UMIS) is critical for comprehensive anatomical assessment but faces challenges due to multi-scale structural heterogeneity. Conventional pixel-based approaches, lacking object-level anatomical insight and inter-organ relational modeling, struggle with morphological complexity and feature conflicts, limiting their efficacy in UMIS. We propose Mamba Snake, a novel deep snake framework enhanced by state space modeling for UMIS. Mamba Snake frames multi-contour evolution as a hierarchical state space atlas, effectively modeling macroscopic inter-organ topological relationships and microscopic contour refinements. We introduce a snake-specific vision state space module, the Mamba Evolution Block (MEB), which leverages effective spatiotemporal information aggregation for adaptive refinement of complex morphologies. Energy map shape priors further ensure robust long-range contour evolution in heterogeneous data. Additionally, a dual-classification synergy mechanism is incorporated to concurrently optimize detection and segmentation, mitigating under-segmentation of microstructures in UMIS. Extensive evaluations across five clinical datasets reveal Mamba Snake's superior performance, with an average Dice improvement of 3% over state-of-the-art methods.

Title: Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation

Authors: Hanlei Shi, Leyuan Qu, Yu Liu, Di Gao, Yuhua Zheng, Taihao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12761
Pdf URL: https://arxiv.org/pdf/2507.12761
Copy Paste: [[2507.12761]] Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation(https://arxiv.org/abs/2507.12761)
Keywords: large language model
Abstract: Emotional talking-head generation has emerged as a pivotal research area at the intersection of computer vision and multimodal artificial intelligence, with its core value lying in enhancing human-computer interaction through immersive and empathetic engagement. With the advancement of multimodal large language models, the driving signals for emotional talking-head generation have shifted from audio and video to more flexible text. However, current text-driven methods rely on predefined discrete emotion label texts, oversimplifying the dynamic complexity of real facial muscle movements and thus failing to achieve natural emotional expressiveness. This study proposes the Think-Before-Draw framework to address two key challenges: (1) In-depth semantic parsing of emotions--by innovatively introducing Chain-of-Thought (CoT), abstract emotion labels are transformed into physiologically grounded facial muscle movement descriptions, enabling the mapping from high-level semantics to actionable motion features; and (2) Fine-grained expressiveness optimization--inspired by artists' portrait painting process, a progressive guidance denoising strategy is proposed, employing a "global emotion localization--local muscle control" mechanism to refine micro-expression dynamics in generated videos. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including MEAD and HDTF. Additionally, we collected a set of portrait images to evaluate our model's zero-shot generation capability.

Title: World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving

Authors: Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang, Zhenning Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12762
Pdf URL: https://arxiv.org/pdf/2507.12762
Copy Paste: [[2507.12762]] World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving(https://arxiv.org/abs/2507.12762)
Keywords: robust, generative
Abstract: Reliable anticipation of traffic accidents is essential for advancing autonomous driving systems. However, this objective is limited by two fundamental challenges: the scarcity of diverse, high-quality training data and the frequent absence of crucial object-level cues due to environmental disruptions or sensor deficiencies. To tackle these issues, we propose a comprehensive framework combining generative scene augmentation with adaptive temporal reasoning. Specifically, we develop a video generation pipeline that utilizes a world model guided by domain-informed prompts to create high-resolution, statistically consistent driving scenarios, particularly enriching the coverage of edge cases and complex interactions. In parallel, we construct a dynamic prediction model that encodes spatio-temporal relationships through strengthened graph convolutions and dilated temporal operators, effectively addressing data incompleteness and transient visual noise. Furthermore, we release a new benchmark dataset designed to better capture diverse real-world driving risks. Extensive experiments on public and newly released datasets confirm that our framework enhances both the accuracy and lead time of accident anticipation, offering a robust solution to current data and modeling limitations in safety-critical autonomous driving applications.

Title: Continuous Marine Tracking via Autonomous UAV Handoff

Authors: Heegyeong Kim (1), Alice James (1), Avishkar Seth (1), Endrowednes Kuantama (1), Jane Williamson (2), Yimeng Feng (1), Richard Han (1) ((1) School of Computing, Macquarie University, (2) School of Natural Sciences, Macquarie University)
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12763
Pdf URL: https://arxiv.org/pdf/2507.12763
Copy Paste: [[2507.12763]] Continuous Marine Tracking via Autonomous UAV Handoff(https://arxiv.org/abs/2507.12763)
Keywords: robust
Abstract: This paper introduces an autonomous UAV vision system for continuous, real-time tracking of marine animals, specifically sharks, in dynamic marine environments. The system integrates an onboard computer with a stabilised RGB-D camera and a custom-trained OSTrack pipeline, enabling visual identification under challenging lighting, occlusion, and sea-state conditions. A key innovation is the inter-UAV handoff protocol, which enables seamless transfer of tracking responsibilities between drones, extending operational coverage beyond single-drone battery limitations. Performance is evaluated on a curated shark dataset of 5,200 frames, achieving a tracking success rate of 81.9% during real-time flight control at 100 Hz, and robustness to occlusion, illumination variation, and background clutter. We present a seamless UAV handoff framework, where target transfer is attempted via high-confidence feature matching, achieving 82.9% target coverage. These results confirm the viability of coordinated UAV operations for extended marine tracking and lay the groundwork for scalable, autonomous monitoring.

Title: Synergy: End-to-end Concept Model

Authors: Keli Zheng, Zerong Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12769
Pdf URL: https://arxiv.org/pdf/2507.12769
Copy Paste: [[2507.12769]] Synergy: End-to-end Concept Model(https://arxiv.org/abs/2507.12769)
Keywords: robust
Abstract: In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.

Title: Local Representative Token Guided Merging for Text-to-Image Generation

Authors: Min-Jeong Lee, Hee-Dong Kim, Seong-Whan Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12771
Pdf URL: https://arxiv.org/pdf/2507.12771
Copy Paste: [[2507.12771]] Local Representative Token Guided Merging for Text-to-Image Generation(https://arxiv.org/abs/2507.12771)
Keywords: diffusion
Abstract: Stable Diffusion is an outstanding image generation model for text-to-image, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token for each window, chosen by computing similarity at a specific timestep and selecting the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
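
Note: The abstract above describes picking, within each window, the token with the highest average similarity to the others. A minimal sketch of that selection rule follows, assuming cosine similarity and plain Python lists as token vectors. The names `cosine` and `representative_index` are hypothetical, and the window construction, timestep choice, and merging step of ReToM are not shown because the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_index(window):
    """Index of the token with the highest mean similarity to the other tokens."""
    best_idx, best_score = 0, float("-inf")
    for i, token in enumerate(window):
        sims = [cosine(token, other) for j, other in enumerate(window) if j != i]
        avg = sum(sims) / len(sims) if sims else 0.0
        if avg > best_score:  # strict '>' keeps the earliest token on ties
            best_idx, best_score = i, avg
    return best_idx


# Toy window of three 2-D token vectors, invented for illustration.
window = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
rep = representative_index(window)  # the middle token is closest to both others
```

This brute-force version costs O(n^2) similarity evaluations per window, which stays cheap only because windows are small relative to the full token set.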

Title: A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Authors: Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12774
Pdf URL: https://arxiv.org/pdf/2507.12774
Copy Paste: [[2507.12774]] A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models(https://arxiv.org/abs/2507.12774)
Keywords: explainability, large language model
Abstract: Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to this https URL.

Title: Compact Vision Transformer by Reduction of Kernel Complexity

Authors: Yancheng Wang, Yingzhen Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12780
Pdf URL: https://arxiv.org/pdf/2507.12780
Copy Paste: [[2507.12780]] Compact Vision Transformer by Reduction of Kernel Complexity(https://arxiv.org/abs/2507.12780)
Keywords: transformer
Abstract: Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of the vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformers achieve superior performance on various computer vision tasks, outperforming the original models with fewer FLOPs and parameters.

Title: Learning Robust Negation Text Representations

Authors: Thinh Hung Truong, Karin Verspoor, Trevor Cohn, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.12782
Pdf URL: https://arxiv.org/pdf/2507.12782
Copy Paste: [[2507.12782]] Learning Robust Negation Text Representations(https://arxiv.org/abs/2507.12782)
Keywords: robust, large language model
Abstract: Despite rapid adoption of autoregressive large language models, smaller text encoders still play an important role in text understanding tasks that require rich contextualized representations. Negation is an important semantic function that is still not properly captured by such methods, affecting many downstream applications relying on text embeddings. We propose a strategy to improve negation robustness of text encoders, by distilling data from large language models using diverse patterns of negation and hedging. We adopt a standard contrastive learning strategy to finetune a strong BERT-based model, and observe a large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. In addition, we show that our method can be adapted to LLMs, leading to improved performance on negation benchmarks.

Title: Multi-Channel Graph Neural Network for Financial Risk Prediction of NEEQ Enterprises

Authors: Jianyu Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12787
Pdf URL: https://arxiv.org/pdf/2507.12787
Copy Paste: [[2507.12787]] Multi-Channel Graph Neural Network for Financial Risk Prediction of NEEQ Enterprises(https://arxiv.org/abs/2507.12787)
Keywords: robust
Abstract: With the continuous evolution of China's multi-level capital market, the National Equities Exchange and Quotations (NEEQ), also known as the "New Third Board," has become a critical financing platform for small and medium-sized enterprises (SMEs). However, due to their limited scale and financial resilience, many NEEQ-listed companies face elevated risks of financial distress. To address this issue, we propose a multi-channel deep learning framework that integrates structured financial indicators, textual disclosures, and enterprise relationship data for comprehensive financial risk prediction. Specifically, we design a Triple-Channel Graph Isomorphism Network (GIN) that processes numeric, textual, and graph-based inputs separately. These modality-specific representations are fused using an attention-based mechanism followed by a gating unit to enhance robustness and prediction accuracy. Experimental results on data from 7,731 real-world NEEQ companies demonstrate that our model significantly outperforms traditional machine learning methods and single-modality baselines in terms of AUC, Precision, Recall, and F1 Score. This work provides theoretical and practical insights into risk modeling for SMEs and offers a data-driven tool to support financial regulators and investors.

Title: DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment

Authors: Junjie Gao, Runze Liu, Yingzhe Peng, Shujian Yang, Jin Zhang, Kai Yang, Zhiyuan You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12796
Pdf URL: https://arxiv.org/pdf/2507.12796
Copy Paste: [[2507.12796]] DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment(https://arxiv.org/abs/2507.12796)
Keywords: robust, large language model
Abstract: Document quality assessment is critical for a wide range of applications including document digitization, OCR, and archival. However, existing approaches often struggle to provide accurate and robust quality scores, limiting their applicability in practical scenarios. With the rapid progress in Multi-modal Large Language Models (MLLMs), recent MLLM-based methods have achieved remarkable performance in image quality assessment. In this work, we extend this success to the document domain by adapting DeQA-Score, a state-of-the-art MLLM-based image quality scorer, for document quality assessment. We propose DeQA-Doc, a framework that leverages the visual language capabilities of MLLMs and a soft label strategy to regress continuous document quality scores. To adapt DeQA-Score to DeQA-Doc, we adopt two complementary solutions to construct soft labels without the variance information. We also relax the resolution constraints to support the high resolution of document images. Finally, we introduce ensemble methods to further enhance the performance. Extensive experiments demonstrate that DeQA-Doc significantly outperforms existing baselines, offering accurate and generalizable document quality assessment across diverse degradation types. Code and model weights are available at this https URL.

Title: FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction

Authors: Qianru Zhang, Chenglei Yu, Haixin Wang, Yudong Yan, Yuansheng Cao, Siu-Ming Yiu, Tailin Wu, Hongzhi Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12803
Pdf URL: https://arxiv.org/pdf/2507.12803
Copy Paste: [[2507.12803]] FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction(https://arxiv.org/abs/2507.12803)
Keywords: robust, transformer
Abstract: Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they cannot capture multi-scale periodicity and transient dynamics effectively. They are also susceptible to noise in time series data. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to capture multi-scale periodicity and transient dynamics within time series data and to improve the model's robustness to data noise. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. To promote the reproducibility of our method, we have made both the code and data accessible via this https URL.

Title: ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion

Authors: Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim, Hyung-Jeong Yang, Soonja Yeom, Soo-Hyung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12804
Pdf URL: https://arxiv.org/pdf/2507.12804
Copy Paste: [[2507.12804]] ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion(https://arxiv.org/abs/2507.12804)
Keywords: diffusion
Abstract: Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at this https URL.

Title: PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Authors: Hui Sun, Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai
Subjects: cs.LG, cs.AI, cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2507.12805
Pdf URL: https://arxiv.org/pdf/2507.12805
Copy Paste: [[2507.12805]] PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database(https://arxiv.org/abs/2507.12805)
Keywords: robust
Abstract: Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated (s,k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036× and 10.710×, respectively. In addition, PMKLC-S/M achieve the best robustness and competitive memory cost, indicating their greater stability against datasets with different probability distribution perturbations and their strong ability to run on memory-constrained devices.

Title: Large Language Models' Internal Perception of Symbolic Music

Authors: Andrew Shin, Kunitake Kaneko
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.12808
Pdf URL: https://arxiv.org/pdf/2507.12808
Copy Paste: [[2507.12808]] Large Language Models' Internal Perception of Symbolic Music(https://arxiv.org/abs/2507.12808)
Keywords: generative, large language model
Abstract: Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.

Title: RONOM: Reduced-Order Neural Operator Modeling

Authors: Sven Dummer, Dongwei Ye, Christoph Brune
Subjects: cs.LG, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2507.12814
Pdf URL: https://arxiv.org/pdf/2507.12814
Copy Paste: [[2507.12814]] RONOM: Reduced-Order Neural Operator Modeling(https://arxiv.org/abs/2507.12814)
Keywords: robust
Abstract: Time-dependent partial differential equations are ubiquitous in physics-based modeling, but they remain computationally intensive in many-query scenarios, such as real-time forecasting, optimal control, and uncertainty quantification. Reduced-order modeling (ROM) addresses these challenges by constructing a low-dimensional surrogate model but relies on a fixed discretization, which limits flexibility across varying meshes during evaluation. Operator learning approaches, such as neural operators, offer an alternative by parameterizing mappings between infinite-dimensional function spaces, enabling adaptation to data across different resolutions. Whereas ROM provides rigorous numerical error estimates, neural operator learning largely focuses on discretization convergence and invariance without quantifying the error between the infinite-dimensional and the discretized operators. This work introduces the reduced-order neural operator modeling (RONOM) framework, which bridges concepts from ROM and operator learning. We establish a discretization error bound analogous to those in ROM, and get insights into RONOM's discretization convergence and discretization robustness. Moreover, two numerical examples are presented that compare RONOM to existing neural operators for solving partial differential equations. The results demonstrate that RONOM using standard vector-to-vector neural networks achieves comparable performance in input generalization and superior performance in both spatial super-resolution and discretization robustness, while also offering novel insights into temporal super-resolution scenarios.

Title: From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Authors: Gaurav Chaudhary, Laxmidhar Behera
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12815
Pdf URL: https://arxiv.org/pdf/2507.12815
Copy Paste: [[2507.12815]] From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning(https://arxiv.org/abs/2507.12815)
Keywords: robust
Abstract: Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network's embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that offers insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.
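
Illustrative note (not from the paper): the reward-annotation idea described above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The linear predictor, the synthetic Gaussian "transitions", all function names, and the choice of reward as the negative prediction error are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_embed(x, w1, w2):
    """Fixed, randomly initialised target network; it is never trained."""
    return np.tanh(x @ w1) @ w2

def fit_predictor(x, y, ridge=1e-6):
    """Fit a linear predictor (with bias) to the target embeddings of expert data."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ y)

def prediction_error(x, weights, w1, w2):
    """Squared distance between predictor output and target embedding, per row."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    diff = xb @ weights - target_embed(x, w1, w2)
    return np.sum(diff ** 2, axis=1)

dim, hidden, embed = 8, 32, 16
w1 = rng.normal(size=(dim, hidden))
w2 = rng.normal(size=(hidden, embed))

# Synthetic stand-ins for flattened (state, next_state) transitions.
expert = rng.normal(0.0, 0.1, size=(500, dim))   # hypothetical expert data
other = rng.normal(3.0, 1.0, size=(500, dim))    # hypothetical non-expert data

weights = fit_predictor(expert, target_embed(expert, w1, w2))
err_expert = prediction_error(expert, weights, w1, w2)
err_other = prediction_error(other, weights, w1, w2)

# Assumption: use the negative error so that expert-like transitions score higher.
reward_expert = -err_expert
reward_other = -err_other
print(err_expert.mean(), err_other.mean())
```

Because the predictor only ever sees expert transitions, its error stays small near the expert distribution and grows away from it, which is what lets the error act as a signal for how expert-like a transition is.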

Title: MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval

Authors: Jeong-Woo Park, Seong-Whan Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12819
Pdf URL: https://arxiv.org/pdf/2507.12819
Copy Paste: [[2507.12819]] MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval(https://arxiv.org/abs/2507.12819)
Keywords: large language model
Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a composed query consisting of a reference image and a modification text. Among various CIR approaches, training-free zero-shot methods based on pre-trained models are cost-effective but still face notable limitations. For example, sequential VLM-LLM pipelines process each modality independently, which often results in information loss and limits cross-modal interaction. In contrast, methods based on multimodal large language models (MLLMs) often focus exclusively on applying changes indicated by the text, without fully utilizing the contextual visual information from the reference image. To address these issues, we propose multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), a training-free zero-shot CIR framework. MCoT-RE utilizes multi-faceted Chain-of-Thought to guide the MLLM to balance explicit modifications and contextual visual cues, generating two distinct captions: one focused on modification and the other integrating comprehensive visual-textual context. The first caption is used to filter candidate images. Subsequently, we combine these two captions and the reference image to perform multi-grained re-ranking. This two-stage approach facilitates precise retrieval by aligning with the textual modification instructions while preserving the visual context of the reference image. Through extensive experiments, MCoT-RE achieves state-of-the-art results among training-free methods, yielding improvements of up to 6.24% in Recall@10 on FashionIQ and 8.58% in Recall@1 on CIRR.
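
Illustrative note (not from the paper): the Recall@K figures quoted above count the share of queries whose target image appears among the top K retrieved candidates. A minimal sketch, with a hypothetical helper name and toy image identifiers:

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears in the first k ranked candidates."""
    hits = sum(
        1 for ranking, target in zip(ranked_ids, target_ids) if target in ranking[:k]
    )
    return hits / len(target_ids)

# Two toy queries, each with a ranked gallery and one ground-truth target.
rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "b"]

print(recall_at_k(rankings, targets, 1))  # 0.0
print(recall_at_k(rankings, targets, 2))  # 0.5
print(recall_at_k(rankings, targets, 3))  # 1.0
```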

Title: FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval

Authors: Jeong-Woo Park, Young-Eun Kim, Seong-Whan Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12823
Pdf URL: https://arxiv.org/pdf/2507.12823
Copy Paste: [[2507.12823]] FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval(https://arxiv.org/abs/2507.12823)
Keywords: robust
Abstract: Composed image retrieval (CIR) is a vision language task that retrieves a target image using a reference image and modification text, enabling intuitive specification of desired changes. While effectively fusing visual and textual modalities is crucial, existing methods typically adopt either early or late fusion. Early fusion tends to excessively focus on explicitly mentioned textual details and neglect visual context, whereas late fusion struggles to capture fine-grained semantic alignments between image regions and textual tokens. To address these issues, we propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation, integrating two complementary modules. The enhanced semantic alignment module (ESAM) employs late fusion with cross-attention to capture fine-grained semantic relationships, while the adaptive reconciliation module (ARM) applies early fusion with uncertainty embeddings to enhance robustness and adaptability. Experiments on CIRR and FashionIQ show consistent performance gains, improving Recall@1 by up to 2.4% and Recall@50 by 1.04% over existing state-of-the-art methods, empirically demonstrating that FAR-Net provides a robust and scalable solution to CIR tasks.

Title: Feature-Enhanced TResNet for Fine-Grained Food Image Classification

Authors: Lulu Liu, Zhiyong Xiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12828
Pdf URL: https://arxiv.org/pdf/2507.12828
Copy Paste: [[2507.12828]] Feature-Enhanced TResNet for Fine-Grained Food Image Classification(https://arxiv.org/abs/2507.12828)
Keywords: extraction
Abstract: Food is not only a core component of humans' daily diets, but also an important carrier of cultural heritage and emotional bonds. With the development of technology, the need for accurate classification of food images has grown, which is crucial for a variety of application scenarios. However, existing Convolutional Neural Networks (CNNs) face significant challenges when dealing with fine-grained food images that are similar in shape but subtle in detail. To address this challenge, this study presents an innovative method for classifying food images, named Feature-Enhanced TResNet (FE-TResNet), specifically designed to address fine-grained food images and accurately capture subtle features within them. The FE-TResNet method is based on the TResNet model and integrates Style-based Recalibration Module (StyleRM) and Deep Channel-wise Attention (DCA) technologies to enhance feature extraction capabilities. In experimental validation on Chinese food image datasets ChineseFoodNet and CNFOOD-241, the FE-TResNet method significantly improved classification accuracy, achieving rates of 81.37% and 80.29%, respectively, demonstrating its effectiveness and superiority in fine-grained food image classification.

Title: Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?

Authors: Xi Ai, Mahardika Krisna Ihsani, Min-Yen Kan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.12838
Pdf URL: https://arxiv.org/pdf/2507.12838
Copy Paste: [[2507.12838]] Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?(https://arxiv.org/abs/2507.12838)
Keywords: interpretability
Abstract: Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements that convey identical knowledge across languages to study cross-lingual knowledge consistency. We use interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families, linguistic factors, and a bottleneck in cross-lingual consistency on a particular layer. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the noteworthiness of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency enhancement.

Title: SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12845
Pdf URL: https://arxiv.org/pdf/2507.12845
Copy Paste: [[2507.12845]] SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning(https://arxiv.org/abs/2507.12845)
Keywords: transformer
Abstract: Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which the techniques of Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer are evaluated and integrated. We evaluate our proposed models using two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms the state-of-the-art systems on most evaluation metrics, which demonstrates its potential for real-life remote sensing image systems.

Title: Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations

Authors: Danilo Avola, Andrea Bernardini, Francesco Danese, Mario Lezoche, Maurizio Mancini, Daniele Pannone, Amedeo Ranaldi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12854
Pdf URL: https://arxiv.org/pdf/2507.12854
Copy Paste: [[2507.12854]] Transformer-Based Person Identification via Wi-Fi CSI Amplitude and Phase Perturbations(https://arxiv.org/abs/2507.12854)
Keywords: privacy, biometric, transformer
Abstract: Wi-Fi sensing is gaining momentum as a non-intrusive and privacy-preserving alternative to vision-based systems for human identification. However, person identification through wireless signals, particularly without user motion, remains largely unexplored. Most prior wireless-based approaches rely on movement patterns, such as walking gait, to extract biometric cues. In contrast, we propose a transformer-based method that identifies individuals from Channel State Information (CSI) recorded while the subject remains stationary. CSI captures fine-grained amplitude and phase distortions induced by the unique interaction between the human body and the radio signal. To support evaluation, we introduce a dataset acquired with ESP32 devices in a controlled indoor environment, featuring six participants observed across multiple orientations. A tailored preprocessing pipeline, including outlier removal, smoothing, and phase calibration, enhances signal quality. Our dual-branch transformer architecture processes amplitude and phase modalities separately and achieves 99.82% classification accuracy, outperforming convolutional and multilayer perceptron baselines. These results demonstrate the discriminative potential of CSI perturbations, highlighting their capacity to encode biometric traits in a consistent manner. They further confirm the viability of passive, device-free person identification using low-cost commodity Wi-Fi hardware in real-world settings.

Title: Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)

Authors: Chongli Qin, Jost Tobias Springenberg
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12856
Pdf URL: https://arxiv.org/pdf/2507.12856
Copy Paste: [[2507.12856]] Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)(https://arxiv.org/abs/2507.12856)
Keywords: large language model
Abstract: Behavior Cloning (BC) on curated (or filtered) data is the predominant paradigm for supervised fine-tuning (SFT) of large language models, as well as for imitation learning of control policies. Here, we draw on a connection between this successful strategy and the theory and practice of finding optimal policies via Reinforcement Learning (RL). Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting, which gives support to its often observed good performance. From this viewpoint, we realize that a small modification to SFT leads to an importance weighted variant that behaves closer to training with RL as it: i) optimizes a tighter bound to the RL objective and, ii) can improve performance compared to SFT on curated data. We refer to this variant as importance weighted supervised fine-tuning (iw-SFT). We show that it is easy to implement and can be further generalized to training with quality scored data. The resulting SFT variants are competitive with more advanced RL algorithms for large language models and for training policies in continuous control tasks, for example achieving 66.7% on the AIME 2024 dataset.

Title: SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation

Authors: Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12857
Pdf URL: https://arxiv.org/pdf/2507.12857
Copy Paste: [[2507.12857]] SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation(https://arxiv.org/abs/2507.12857)
Keywords: robust, segmentation
Abstract: Most existing remote sensing instance segmentation approaches are designed for closed-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large-scale, real-world geospatial analysis. Our code is available at this https URL.

Title: WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding

Authors: Danilo Avola, Daniele Pannone, Dario Montagnini, Emad Emam
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12869
Pdf URL: https://arxiv.org/pdf/2507.12869
Copy Paste: [[2507.12869]] WhoFi: Deep Person Re-Identification via Wi-Fi Channel Signal Encoding(https://arxiv.org/abs/2507.12869)
Keywords: robust, biometric, transformer
Abstract: Person Re-Identification is a key and challenging task in video surveillance. While traditional methods rely on visual data, issues like poor lighting, occlusion, and suboptimal angles often hinder performance. To address these challenges, we introduce WhoFi, a novel pipeline that utilizes Wi-Fi signals for person re-identification. Biometric features are extracted from Channel State Information (CSI) and processed through a modular Deep Neural Network (DNN) featuring a Transformer-based encoder. The network is trained using an in-batch negative loss function to learn robust and generalizable biometric signatures. Experiments on the NTU-Fi dataset show that our approach achieves competitive results compared to state-of-the-art methods, confirming its effectiveness in identifying individuals via Wi-Fi signals.
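
Illustrative note (not from the paper): an in-batch negative loss, as mentioned above, typically treats each query's matching sample as the positive and every other sample in the batch as a negative. The abstract does not give the exact formulation, so the cosine similarity, the temperature value, and the function name below are assumptions in a minimal NumPy sketch.

```python
import numpy as np

def in_batch_negative_loss(queries, keys, temperature=0.1):
    """Cross-entropy over a batch similarity matrix.

    Row i of `queries` is assumed to match row i of `keys`; all other rows
    of `keys` act as negatives for that query.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy "signatures": four identities with orthogonal embeddings.
signatures = np.eye(4)
matched = in_batch_negative_loss(signatures, signatures)
mismatched = in_batch_negative_loss(signatures, np.roll(signatures, 1, axis=0))
print(matched, mismatched)
```

The loss is close to zero when each query is most similar to its own key and large when the pairing is shuffled, which is the behaviour that pushes signatures of the same person together and different people apart.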

Title: An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System

Authors: Danilo Avola, Giancarlo Crocetti, Gian Luca Foresti, Daniele Pannone, Claudio Piciarelli, Amedeo Ranaldi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12873
Pdf URL: https://arxiv.org/pdf/2507.12873
Copy Paste: [[2507.12873]] An Investigation of Ear-EEG Signals for a Novel Biometric Authentication System(https://arxiv.org/abs/2507.12873)
Keywords: secure, biometric
Abstract: This work explores the feasibility of biometric authentication using EEG signals acquired through in-ear devices, commonly referred to as ear-EEG. Traditional EEG-based biometric systems, while secure, often suffer from low usability due to cumbersome scalp-based electrode setups. In this study, we propose a novel and practical framework leveraging ear-EEG signals as a user-friendly alternative for everyday biometric authentication. The system extracts an original combination of temporal and spectral features from ear-EEG signals and feeds them into a fully connected deep neural network for subject identification. Experimental results on the only currently available ear-EEG dataset suitable for different purposes, including biometric authentication, demonstrate promising performance, with an average accuracy of 82% in a subject identification scenario. These findings confirm the potential of ear-EEG as a viable and deployable direction for next-generation real-world biometric systems.

Title: HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation

Authors: Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12883
Pdf URL: https://arxiv.org/pdf/2507.12883
Copy Paste: [[2507.12883]] HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation(https://arxiv.org/abs/2507.12883)
Keywords: segmentation
Abstract: The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg's superior performance.

Title: From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation

Authors: Mengxi Liu, Lala Shakti Swarup Ray, Sizhen Bian, Ko Watanabe, Ankur Bhatt, Joanna Sorysz, Russel Torah, Bo Zhou, Paul Lukowicz
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2507.12884
Pdf URL: https://arxiv.org/pdf/2507.12884
Copy Paste: [[2507.12884]] From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation(https://arxiv.org/abs/2507.12884)
Keywords: robust
Abstract: We present NeckSense, a novel wearable system for head pose tracking that leverages multi-channel bio-impedance sensing with soft, dry electrodes embedded in a lightweight, necklace-style form factor. NeckSense captures dynamic changes in tissue impedance around the neck, which are modulated by head rotations and subtle muscle activations. To robustly estimate head pose, we propose a deep learning framework that integrates anatomical priors, including joint constraints and natural head rotation ranges, into the loss function design. We validate NeckSense on 7 participants using the current SOTA pose estimation model as ground truth. Our system achieves a mean per-vertex error of 25.9 mm across various head movements with a leave-one-person-out cross-validation method, demonstrating that a compact, line-of-sight-free bio-impedance wearable can deliver head-tracking performance comparable to SOTA vision-based methods.

Title: LanePerf: a Performance Estimation Framework for Lane Detection

Authors: Yin Wu, Daniel Slieter, Ahmed Abouelazm, Christian Hubschneider, J. Marius Zöllner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12894
Pdf URL: https://arxiv.org/pdf/2507.12894
Copy Paste: [[2507.12894]] LanePerf: a Performance Estimation Framework for Lane Detection(https://arxiv.org/abs/2507.12894)
Keywords: robust
Abstract: Lane detection is a critical component of Advanced Driver-Assistance Systems (ADAS) and Automated Driving Systems (ADS), providing essential spatial information for lateral control. However, domain shifts often undermine model reliability when deployed in new environments. Ensuring the robustness and safety of lane detection models typically requires collecting and annotating target domain data, which is resource-intensive. Estimating model performance without ground-truth labels offers a promising alternative for efficient robustness assessment, yet remains underexplored in lane detection. While previous work has addressed performance estimation in image classification, these methods are not directly applicable to lane detection tasks. This paper first adapts five well-performing performance estimation methods from image classification to lane detection, building a baseline. Addressing the limitations of prior approaches that solely rely on softmax scores or lane features, we further propose a new Lane Performance Estimation Framework (LanePerf), which integrates image and lane features using a pretrained image encoder and a DeepSets-based architecture, effectively handling zero-lane detection scenarios and large domain-shift cases. Extensive experiments on the OpenLane dataset, covering diverse domain shifts (scenes, weather, hours), demonstrate that our LanePerf outperforms all baselines, achieving a lower MAE of 0.117 and a higher Spearman's rank correlation coefficient of 0.727. These findings pave the way for robust, label-free performance estimation in ADAS, supporting more efficient testing and improved safety in challenging driving scenarios.
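
Editor's sketch: the two metrics reported above can be illustrated in a few lines. This is not LanePerf itself; it only shows how MAE and Spearman's rank correlation might be computed between estimated and true performance scores. The function names and the toy numbers are hypothetical, and the rank formula assumes there are no tied values.

```python
from typing import List


def mean_absolute_error(estimated: List[float], actual: List[float]) -> float:
    """Average absolute gap between estimated and true performance."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)


def ranks(values: List[float]) -> List[int]:
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(estimated: List[float], actual: List[float]) -> float:
    """Spearman's rank correlation via the closed form valid without ties."""
    n = len(actual)
    d2 = sum((re - ra) ** 2 for re, ra in zip(ranks(estimated), ranks(actual)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical per-domain scores: estimator output vs. ground truth.
est = [0.62, 0.48, 0.71, 0.55]
true = [0.60, 0.40, 0.80, 0.58]
mae = mean_absolute_error(est, true)
rho = spearman_rho(est, true)
```

A low MAE means the estimates are close in absolute terms, while a high rank correlation means the estimator orders the domains correctly even when the absolute values are off.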

Title: Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Authors: Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.12898
Pdf URL: https://arxiv.org/pdf/2507.12898
Copy Paste: [[2507.12898]] Generalist Bimanual Manipulation via Foundation Video Diffusion Models(https://arxiv.org/abs/2507.12898)
Keywords: diffusion
Abstract: Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

Title: Federated Learning for Commercial Image Sources

Authors: Shreyansh Jain, Koteswar Rao Jerripothula
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.12903
Pdf URL: https://arxiv.org/pdf/2507.12903
Copy Paste: [[2507.12903]] Federated Learning for Commercial Image Sources(https://arxiv.org/abs/2507.12903)
Keywords: secure, privacy, federate
Abstract: Federated Learning is a collaborative machine learning paradigm that enables multiple clients to learn a global model without exposing their data to each other. Consequently, it provides a secure learning platform with privacy-preserving capabilities. This paper introduces a new dataset containing 23,326 images collected from eight different commercial sources and classified into 31 categories, similar to the Office-31 dataset. To the best of our knowledge, this is the first image classification dataset specifically designed for Federated Learning. We also propose two new Federated Learning algorithms, namely Fed-Cyclic and Fed-Star. In Fed-Cyclic, a client receives weights from its previous client, updates them through local training, and passes them to the next client, thus forming a cyclic topology. In Fed-Star, a client receives weights from all other clients, updates its local weights through pre-aggregation (to address statistical heterogeneity) and local training, and sends its updated local weights to all other clients, thus forming a star-like topology. Our experiments reveal that both algorithms perform better than existing baselines on our newly introduced dataset.
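
Editor's sketch: the cyclic topology described for Fed-Cyclic can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: each client holds a list of scalars, the "model" is a single weight, and local training is a few gradient steps on a squared-error loss. The names `local_train` and `fed_cyclic` are hypothetical. Fed-Star is not sketched because the abstract does not specify its pre-aggregation step.

```python
from typing import List


def local_train(w: float, data: List[float], lr: float = 0.1, steps: int = 5) -> float:
    """A few gradient steps on the local loss mean((w - x)^2)."""
    for _ in range(steps):
        grad = sum(2.0 * (w - x) for x in data) / len(data)
        w -= lr * grad
    return w


def fed_cyclic(clients: List[List[float]], rounds: int, w0: float = 0.0) -> float:
    """Pass the weight around the ring: each client trains, then hands it on."""
    w = w0
    for _ in range(rounds):
        for data in clients:  # client i receives the weight from client i-1
            w = local_train(w, data)
    return w


# Two hypothetical clients with different local data distributions.
clients = [[1.0, 1.0], [3.0, 3.0]]
w_final = fed_cyclic(clients, rounds=3)
```

Because raw data never leaves a client and only the weight is handed on, the sketch keeps the privacy property the abstract emphasizes; the final weight depends on the visiting order, which is inherent to a cyclic scheme.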

Title: Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services

Authors: Jiadong Chen, Hengyu Ye, Fuxin Jiang, Xiao He, Tieying Zhang, Jianjun Chen, Xiaofeng Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12908
Pdf URL: https://arxiv.org/pdf/2507.12908
Copy Paste: [[2507.12908]] Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services(https://arxiv.org/abs/2507.12908)
Keywords: robust, transformer
Abstract: Workload forecasting is pivotal in cloud service applications, such as auto-scaling and scheduling, with profound implications for operational efficiency. Although Transformer-based forecasting models have demonstrated remarkable success in general tasks, their computational efficiency often falls short of the stringent requirements in large-scale cloud environments. Given that most workload series exhibit complicated periodic patterns, addressing these challenges in the frequency domain offers substantial advantages. To this end, we propose Fremer, an efficient and effective deep forecasting model. Fremer fulfills three critical requirements: it demonstrates superior efficiency, outperforming most Transformer-based forecasting models; it achieves exceptional accuracy, surpassing all state-of-the-art (SOTA) models in workload forecasting; and it exhibits robust performance for multi-period series. Furthermore, we collect and open-source four high-quality workload datasets derived from ByteDance's cloud services, encompassing workload data from thousands of computing instances. Extensive experiments on both our proprietary datasets and public benchmarks demonstrate that Fremer consistently outperforms baseline models, achieving average improvements of 5.5% in MSE, 4.7% in MAE, and 8.6% in SMAPE over SOTA models, while simultaneously reducing parameter scale and computational costs. Additionally, in a proactive auto-scaling test based on Kubernetes, Fremer improves average latency by 18.78% and reduces resource consumption by 2.35%, underscoring its practical efficacy in real-world applications.

Title: Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI

Authors: Chenrui Zhu, Louenas Bounia, Vu Linh Nguyen, Sébastien Destercke, Arthur Hoarau
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12913
Pdf URL: https://arxiv.org/pdf/2507.12913
Copy Paste: [[2507.12913]] Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI(https://arxiv.org/abs/2507.12913)
Keywords: robust, interpretability, explainability
Abstract: Recent advancements in machine learning have emphasized the need for transparency in model predictions, particularly as interpretability diminishes when using increasingly complex architectures. In this paper, we propose leveraging prediction uncertainty as a complementary approach to classical explainability methods. Specifically, we distinguish between aleatoric (data-related) and epistemic (model-related) uncertainty to guide the selection of appropriate explanations. Epistemic uncertainty serves as a rejection criterion for unreliable explanations and, in itself, provides insight into insufficient training (a new form of explanation). Aleatoric uncertainty informs the choice between feature-importance explanations and counterfactual explanations. This leverages a framework of explainability methods driven by uncertainty quantification and disentanglement. Our experiments demonstrate the impact of this uncertainty-aware approach on the robustness and attainability of explanations in both traditional machine learning and deep learning scenarios.
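
Editor's sketch: the split into aleatoric and epistemic uncertainty, and the use of the latter as a rejection criterion, can be illustrated as follows. The abstract does not say which estimator the authors use; this sketch assumes a common entropy-based decomposition over an ensemble of classifiers, and the function names and the threshold value are hypothetical. The choice between feature-importance and counterfactual explanations is left out because the abstract does not state the rule.

```python
import math
from typing import List, Tuple


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose(member_probs: List[List[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) uncertainty for one input.

    member_probs[m][c] is the probability ensemble member m gives to class c.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[c] for p in member_probs) / m for c in range(k)]
    total = entropy(mean_p)                                   # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / m     # mean of the entropies
    return total, aleatoric, total - aleatoric


def accept_explanation(member_probs: List[List[float]], max_epistemic: float = 0.1) -> bool:
    """Reject an explanation when the model itself is too unsure (assumed threshold)."""
    _, _, epistemic = decompose(member_probs)
    return epistemic <= max_epistemic


agree = [[0.5, 0.5], [0.5, 0.5]]      # noisy data, members agree
disagree = [[1.0, 0.0], [0.0, 1.0]]   # confident members that contradict each other
```

In the first case all uncertainty is aleatoric and the explanation is kept; in the second it is entirely epistemic, which signals insufficient training and leads to rejection.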

Title: Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Authors: Yifan Xu, Chao Zhang, Hanqi Jiang, Xiaoyan Wang, Ruifei Ma, Yiwei Li, Zihao Wu, Zeju Li, Xiangde Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12916
Pdf URL: https://arxiv.org/pdf/2507.12916
Copy Paste: [[2507.12916]] Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models(https://arxiv.org/abs/2507.12916)
Keywords: large language model
Abstract: Advancements in foundation models have made it possible to conduct applications in various downstream tasks. In particular, the new era has witnessed a remarkable capability to extend Large Language Models (LLMs) for tackling tasks of 3D scene understanding. Current methods rely heavily on 3D point clouds, but the 3D point cloud reconstruction of an indoor scene often results in information loss. Some textureless planes or repetitive patterns are prone to omission and manifest as voids within the reconstructed 3D point clouds. Besides, objects with complex structures tend to introduce distortion of details caused by misalignments between the captured images and the dense reconstructed point clouds. 2D multi-view images present visual consistency with 3D point clouds and provide more detailed representations of scene components, which can naturally compensate for these deficiencies. Based on these insights, we propose Argus, a novel 3D multimodal framework that leverages multi-view images for enhanced 3D scene understanding with LLMs. In general, Argus can be treated as a 3D Large Multimodal Foundation Model (3D-LMM) since it takes various modalities as input (text instructions, 2D multi-view images, and 3D point clouds) and expands the capability of LLMs to tackle 3D tasks. Argus involves fusing and integrating multi-view images and camera poses into view-as-scene features, which interact with the 3D features to create comprehensive and detailed 3D-aware scene embeddings. Our approach compensates for the information loss while reconstructing 3D point clouds and helps LLMs better understand the 3D world. Extensive experiments demonstrate that our method outperforms existing 3D-LMMs in various downstream tasks.

Title: Architectural Backdoors in Deep Learning: A Survey of Vulnerabilities, Detection, and Defense

Authors: Victoria Childress, Josh Collyer, Jodie Knapp
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.12919
Pdf URL: https://arxiv.org/pdf/2507.12919
Copy Paste: [[2507.12919]] Architectural Backdoors in Deep Learning: A Survey of Vulnerabilities, Detection, and Defense(https://arxiv.org/abs/2507.12919)
Keywords: security, defense, steal
Abstract: Architectural backdoors pose an under-examined but critical threat to deep neural networks, embedding malicious logic directly into a model's computational graph. Unlike traditional data poisoning or parameter manipulation, architectural backdoors evade standard mitigation techniques and persist even after clean retraining. This survey systematically consolidates research on architectural backdoors, spanning compiler-level manipulations, tainted AutoML pipelines, and supply-chain vulnerabilities. We assess emerging detection and defense strategies, including static graph inspection, dynamic fuzzing, and partial formal verification, and highlight their limitations against distributed or stealth triggers. Despite recent progress, scalable and practical defenses remain elusive. We conclude by outlining open challenges and proposing directions for strengthening supply-chain security, cryptographic model attestations, and next-generation benchmarks. This survey aims to guide future research toward comprehensive defenses against structural backdoor threats in deep learning systems.

Title: DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization

Authors: Dongyeun Lee, Jiwan Hur, Hyounguk Shon, Jae Young Lee, Junmo Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12933
Pdf URL: https://arxiv.org/pdf/2507.12933
Copy Paste: [[2507.12933]] DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization(https://arxiv.org/abs/2507.12933)
Keywords: robust, diffusion
Abstract: Diffusion models have achieved remarkable success in image generation but come with significant computational costs, posing challenges for deployment in resource-constrained environments. Recent post-training quantization (PTQ) methods have attempted to mitigate this issue by focusing on the iterative nature of diffusion models. However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. Learned Equivalent Scaling optimizes channel-wise scaling factors to redistribute quantization difficulty between weights and activations, reducing overall quantization error. Recognizing that early denoising steps, despite having small quantization errors, crucially impact the final output due to error accumulation, we incorporate an adaptive timestep weighting scheme to prioritize these critical steps during learning. Furthermore, identifying that layers such as skip connections exhibit high inter-channel variance, we introduce channel-wise Power-of-Two Scaling for activations. To ensure robust selection of PTS factors even with a small calibration set, we introduce a voting algorithm that enhances reliability. Extensive experiments demonstrate that our method significantly outperforms existing works, especially at low bit-widths such as W4A6 (4-bit weight, 6-bit activation) and W4A8, maintaining high image generation quality and model stability. The code is available at this https URL.
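
Editor's sketch: the idea of channel-wise power-of-two scaling with a vote over calibration samples can be illustrated with a toy quantizer. This is a rough sketch under stated assumptions, not the DMQ code: the error criterion (squared reconstruction error), the candidate exponent range, and the majority-vote rule are all assumptions, and every function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def quantize(x: float, step: float, bits: int) -> float:
    """Uniform signed quantizer with a shared step size and clipping."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    q = max(qmin, min(qmax, round(x / step)))
    return q * step


def best_pow2_exponent(channel: List[float], step: float, bits: int,
                       candidates: Iterable[int] = range(0, 5)) -> int:
    """Pick k so that dividing the channel by 2**k minimizes reconstruction error."""
    def err(k: int) -> float:
        s = 2.0 ** k
        return sum((x - quantize(x / s, step, bits) * s) ** 2 for x in channel)
    return min(candidates, key=err)


def vote_exponent(samples: List[List[float]], step: float, bits: int) -> int:
    """Majority vote over per-sample choices (assumed voting rule)."""
    votes = Counter(best_pow2_exponent(s, step, bits) for s in samples)
    return votes.most_common(1)[0][0]


small_channel = [0.1, -0.2, 0.3]   # fits the quantizer range as is
large_channel = [5.0, -6.0, 4.0]   # outlier channel that would be clipped
k_small = best_pow2_exponent(small_channel, step=0.1, bits=4)
k_large = best_pow2_exponent(large_channel, step=0.1, bits=4)
k_voted = vote_exponent([large_channel, large_channel, small_channel], step=0.1, bits=4)
```

Restricting the scale to powers of two means the rescaling can be done with bit shifts, which is the usual motivation for this kind of factor; voting makes the chosen exponent less sensitive to a single unusual calibration sample.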

Title: Enterprise Security Incident Analysis and Countermeasures Based on the T-Mobile Data Breach

Authors: Zhuohan Cui, Zikun Song
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.12937
Pdf URL: https://arxiv.org/pdf/2507.12937
Copy Paste: [[2507.12937]] Enterprise Security Incident Analysis and Countermeasures Based on the T-Mobile Data Breach(https://arxiv.org/abs/2507.12937)
Keywords: security, segmentation
Abstract: This paper presents a comprehensive analysis of T-Mobile's critical data breaches in 2021 and 2023, alongside a full-spectrum security audit targeting its systems, infrastructure, and publicly exposed endpoints. By combining case-based vulnerability assessments with active ethical hacking techniques--including Shodan reconnaissance, API misuse simulations, VNC brute-forcing, firmware reverse engineering, and web application scans--we uncover structural weaknesses persisting beyond the initial breach events. Building on these findings, we propose a multi-layered defensive strategy encompassing Zero Trust Architecture, granular role-based access control, network segmentation, firmware encryption using AES with integrity checks, and API rate limiting and token lifecycle control. Financial modelling demonstrates that a five-year investment yields less than 1.1% of expected breach losses, validating the cost-effectiveness of proactive security measures. Our work bridges post-incident forensic analysis with hands-on security evaluation, providing an actionable blueprint for large-scale telecoms seeking operational resilience, regulatory compliance, and cross-domain threat readiness.

Title: A Deep-Learning Framework for Land-Sliding Classification from Remote Sensing Image

Authors: Hieu Tang, Truong Vo, Dong Pham, Toan Nguyen, Lam Pham, Truong Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12939
Pdf URL: https://arxiv.org/pdf/2507.12939
Copy Paste: [[2507.12939]] A Deep-Learning Framework for Land-Sliding Classification from Remote Sensing Image(https://arxiv.org/abs/2507.12939)
Keywords: robust
Abstract: The use of satellite imagery combined with deep learning to support automatic landslide detection is becoming increasingly widespread. However, selecting an appropriate deep learning architecture to optimize performance while avoiding overfitting remains a critical challenge. To address these issues, we propose a deep-learning based framework for landslide detection from remote sensing images in this paper. The proposed framework presents an effective combination of online and offline data augmentation to tackle the imbalanced data, a backbone EfficientNet_Large deep learning model for extracting robust embedding features, and a post-processing SVM classifier to balance and enhance the classification performance. The proposed model achieved an F1-score of 0.8938 on the public test set of the Zindi challenge.

Title: Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning

Authors: Yafei Zhang, Lingqi Kong, Huafeng Li, Jie Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12942
Pdf URL: https://arxiv.org/pdf/2507.12942
Copy Paste: [[2507.12942]] Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning(https://arxiv.org/abs/2507.12942)
Keywords: robust
Abstract: To reduce the reliance of visible-infrared person re-identification (ReID) models on labeled cross-modal samples, this paper explores a weakly supervised cross-modal person ReID method that uses only single-modal sample identity labels, addressing scenarios where cross-modal identity labels are unavailable. To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. This framework leverages labeled data from each modality to independently train dedicated classification experts. To associate cross-modal samples, these classification experts act as heterogeneous predictors, predicting the identities of samples from the other modality. To improve prediction accuracy, we design a cross-modal relationship fusion mechanism that effectively integrates predictions from different experts. Under the implicit supervision provided by cross-modal identity correspondences, collaborative and consistent learning among the experts is encouraged, significantly enhancing the model's ability to extract modality-invariant features and improve cross-modal identity recognition. Experimental results on two challenging datasets validate the effectiveness of the proposed method.

Title: Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications

Authors: Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C. Alexander, Rhodri Davies, Yipeng Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12945
Pdf URL: https://arxiv.org/pdf/2507.12945
Copy Paste: [[2507.12945]] Analysis of Image-and-Text Uncertainty Propagation in Multimodal Large Language Models with Cardiac MR-Based Applications(https://arxiv.org/abs/2507.12945)
Keywords: robust, large language model
Abstract: Multimodal large language models (MLLMs) can process and integrate information from multimodality sources, such as text and images. However, the interrelationship among input modalities, the uncertainties due to individual uni-modal data, and potential clinical applications following such an uncertainty decomposition are yet to be fully understood in the context of large-scale MLLMs. In this work, we propose a multimodal uncertainty propagation model (MUPM) based on uncertainty propagation, to characterise the relationship among the uncertainties arising from image-only, text-only, and joint image-text variations in MLLM inputs. Using real clinical data consisting of cardiac MR scans and digital health records, we show that MUPMs can be optimised robustly with a few samples. We then show that the fitted MUPMs are generalisable across different input data distributions and, perhaps surprisingly, across different downstream tasks. Such transferability may be explained by the shared pretraining, comparatively light MLLM fine-tuning, along with the low-dimensional nature of the MUPMs. More importantly, this learned transferability, quantifying the relationship between these uncertainties, led to direct clinical applications in which uncertainties may be estimated and thus analysed robustly for varying data or even a novel set of cardiac disease prediction tasks. In addition, we show experimentally the efficiency in multimodal data required for estimating the overall uncertainty and its ability to identify redundant factors, both of which are considered practical yet clinically useful applications with the proposed MUPMs. Codes are available at this https URL.

Title: Probabilistic Soundness Guarantees in LLM Reasoning Chains

Authors: Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12948
Pdf URL: https://arxiv.org/pdf/2507.12948
Copy Paste: [[2507.12948]] Probabilistic Soundness Guarantees in LLM Reasoning Chains(https://arxiv.org/abs/2507.12948)
Keywords: robust, large language model
Abstract: In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
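
Editor's sketch: the idea of judging each claim only against earlier claims that were themselves assessed as sound can be illustrated with a toy sampler. This is an assumption-laden sketch, not the ARES implementation: it treats each earlier claim's score as the probability of keeping it as a premise, averages a hypothetical entailment oracle over sampled premise sets, and omits the certified statistical guarantees mentioned in the abstract. All names are hypothetical.

```python
import random
from typing import Callable, List, Set


def soundness_scores(
    n_claims: int,
    entails: Callable[[Set[int], int], float],
    n_samples: int = 200,
    seed: int = 0,
) -> List[float]:
    """Score claims in order; claim i only sees earlier claims sampled as sound."""
    rng = random.Random(seed)
    scores: List[float] = []
    for i in range(n_claims):
        total = 0.0
        for _ in range(n_samples):
            # Keep earlier claim j with probability equal to its own score.
            kept = {j for j in range(i) if rng.random() < scores[j]}
            total += entails(kept, i)
        scores.append(total / n_samples)
    return scores


# Toy chain: claim 1 builds on claim 0, claim 2 builds on claim 1.
deps = {0: set(), 1: {0}, 2: {1}}
flawed = {1}  # claim 1 is an error in its own right


def toy_entails(kept: Set[int], i: int) -> float:
    """Hypothetical oracle: a claim holds iff it is not flawed and its support was kept."""
    if i in flawed:
        return 0.0
    return 1.0 if deps[i] <= kept else 0.0


scores = soundness_scores(3, toy_entails)
```

In this toy chain, claim 2 is valid given claim 1, yet it receives a low score because its only support was judged unsound; this is the propagated-error case that a judge conditioning on all earlier text would tend to miss.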

Title: Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Authors: Kenza Bouzid, Shruthi Bannur, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12950
Pdf URL: https://arxiv.org/pdf/2507.12950
Copy Paste: [[2507.12950]] Insights into a radiology-specialised multimodal large language model with sparse autoencoders(https://arxiv.org/abs/2507.12950)
Keywords: interpretability, transformer, large language model
Abstract: Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: this https URL.

Title: LoViC: Efficient Long Video Generation with Context Compression

Authors: Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12952
Pdf URL: https://arxiv.org/pdf/2507.12952
Copy Paste: [[2507.12952]] LoViC: Efficient Long Video Generation with Context Compression(https://arxiv.org/abs/2507.12952)
Keywords: diffusion, transformer
Abstract: Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.

Title: cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration

Authors: Sidaty El Hadramy, Oumeymah Cherkaoui, Philippe C. Cattin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12953
Pdf URL: https://arxiv.org/pdf/2507.12953
Copy Paste: [[2507.12953]] cIDIR: Conditioned Implicit Neural Representation for Regularized Deformable Image Registration(https://arxiv.org/abs/2507.12953)
Keywords: robust, segmentation
Abstract: Regularization is essential in deformable image registration (DIR) to ensure that the estimated Deformation Vector Field (DVF) remains smooth, physically plausible, and anatomically consistent. However, fine-tuning regularization parameters in learning-based DIR frameworks is computationally expensive, often requiring multiple training iterations. To address this, we propose cIDIR, a novel DIR framework based on Implicit Neural Representations (INRs) that conditions the registration process on regularization hyperparameters. Unlike conventional methods that require retraining for each regularization hyperparameter setting, cIDIR is trained over a prior distribution of these hyperparameters, then optimized over the regularization hyperparameters by using the segmentation masks as an observation. Additionally, cIDIR models a continuous and differentiable DVF, enabling seamless integration of advanced regularization techniques via automatic differentiation. Evaluated on the DIR-LAB dataset, cIDIR achieves high accuracy and robustness across the dataset.

Title: FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers

Authors: Qiang Wang, Mengchao Wang, Fan Jiang, Yaqi Fan, Yonggang Qi, Mu Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12956
Pdf URL: https://arxiv.org/pdf/2507.12956
Copy Paste: [[2507.12956]] FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers(https://arxiv.org/abs/2507.12956)
Keywords: diffusion, transformer
Abstract: Producing expressive facial animations from static images is a challenging task. Prior methods relying on explicit geometric priors (e.g., facial landmarks or 3DMM) often suffer from artifacts in cross reenactment and struggle to capture subtle emotions. Furthermore, existing approaches lack support for multi-character animation, as driving features from different individuals frequently interfere with one another, complicating the task. To address these challenges, we propose FantasyPortrait, a diffusion transformer based framework capable of generating high-fidelity and emotion-rich animations for both single- and multi-character scenarios. Our method introduces an expression-augmented learning strategy that utilizes implicit representations to capture identity-agnostic facial dynamics, enhancing the model's ability to render fine-grained emotions. For multi-character control, we design a masked cross-attention mechanism that ensures independent yet coordinated expression generation, effectively preventing feature interference. To advance research in this area, we propose the Multi-Expr dataset and ExprBench, which are specifically designed datasets and benchmarks for training and evaluating multi-character portrait animations. Extensive experiments demonstrate that FantasyPortrait significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations, excelling particularly in challenging cross reenactment and multi-character contexts. Our project page is this https URL.

Title: A Spectral Interpretation of Redundancy in a Graph Reservoir

Authors: Anna Bison, Alessandro Sperduti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.12963
Pdf URL: https://arxiv.org/pdf/2507.12963
Copy Paste: [[2507.12963]] A Spectral Interpretation of Redundancy in a Graph Reservoir(https://arxiv.org/abs/2507.12963)
Keywords: fair
Abstract: Reservoir computing has been successfully applied to graphs as a preprocessing method to improve the training efficiency of Graph Neural Networks (GNNs). However, a common issue that arises when repeatedly applying layer operators on graphs is over-smoothing, which consists in the convergence of graph signals toward low-frequency components of the graph Laplacian. This work revisits the definition of the reservoir in the Multiresolution Reservoir Graph Neural Network (MRGNN), a spectral reservoir model, and proposes a variant based on a Fairing algorithm originally introduced in the field of surface design in computer graphics. This algorithm provides a pass-band spectral filter that allows smoothing without shrinkage, and it can be adapted to the graph setting through the Laplacian operator. Given its spectral formulation, this method naturally connects to GNN architectures for tasks where smoothing, when properly controlled, can be beneficial, such as graph classification. The core contribution of the paper lies in the theoretical analysis of the algorithm from a random walks perspective. In particular, it shows how tuning the spectral coefficients can be interpreted as modulating the contribution of redundant random walks. Exploratory experiments based on the MRGNN architecture illustrate the potential of this approach and suggest promising directions for future research.
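
The "smoothing without shrinkage" idea can be illustrated with a small sketch. It assumes the Fairing algorithm mentioned above is a two-step lambda|mu scheme in the style of Taubin's surface fairing, with transfer function (1 - lam*k)(1 - mu*k) at Laplacian eigenvalue k; the abstract does not spell this out, and the coefficient values below are illustrative only, not taken from the paper.

```python
# Sketch only: assumes a Taubin-style lambda|mu filter; lam and mu are
# illustrative values, not the coefficients used in the paper.

def fairing_transfer(k, lam, mu, steps=1):
    """Gain applied to a Laplacian eigen-component with eigenvalue k
    after `steps` passes of (I - lam*L) followed by (I - mu*L)."""
    return ((1.0 - lam * k) * (1.0 - mu * k)) ** steps


def laplacian_transfer(k, lam, steps=1):
    """Gain of plain Laplacian smoothing with the same number of
    operator applications (two per step), for comparison."""
    return (1.0 - lam * k) ** (2 * steps)


lam, mu = 0.5, -0.6            # requires lam > 0 and mu < -lam
k_pb = 1.0 / lam + 1.0 / mu    # pass-band frequency: gain returns to 1 here

for k in (0.0, k_pb, 1.0, 2.0):
    print(f"k={k:.3f}  fairing={fairing_transfer(k, lam, mu):.3f}  "
          f"laplacian={laplacian_transfer(k, lam):.3f}")
```

Under these assumptions the gain stays close to 1 up to the pass-band frequency and decays for higher eigenvalues, whereas plain Laplacian smoothing attenuates every non-zero frequency.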

Title: RGB Pre-Training Enhanced Unobservable Feature Latent Diffusion Model for Spectral Reconstruction

Authors: Keli Deng, Jie Nie, Yuntao Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.12967
Pdf URL: https://arxiv.org/pdf/2507.12967
Copy Paste: [[2507.12967]] RGB Pre-Training Enhanced Unobservable Feature Latent Diffusion Model for Spectral Reconstruction(https://arxiv.org/abs/2507.12967)
Keywords: diffusion
Abstract: Spectral reconstruction (SR) is a crucial problem in image processing that requires reconstructing hyperspectral images (HSIs) from the corresponding RGB images. A key difficulty in SR is estimating the unobservable feature, which encapsulates significant spectral information not captured by RGB imaging sensors. The solution lies in effectively constructing the spectral-spatial joint distribution conditioned on the RGB image to complement the unobservable feature. Since HSIs share a similar spatial structure with the corresponding RGB images, it is rational to capitalize on the rich spatial knowledge in RGB pre-trained models for spectral-spatial joint distribution learning. To this end, we extend the RGB pre-trained latent diffusion model (RGB-LDM) to an unobservable feature LDM (ULDM) for SR. As the RGB-LDM and its corresponding spatial autoencoder (SpaAE) already excel in spatial knowledge, the ULDM can focus on modeling spectral structure. Moreover, separating the unobservable feature from the HSI reduces the redundant spectral information and empowers the ULDM to learn the joint distribution in a compact latent space. Specifically, we propose a two-stage pipeline consisting of spectral structure representation learning and spectral-spatial joint distribution learning to transform the RGB-LDM into the ULDM. In the first stage, a spectral unobservable feature autoencoder (SpeUAE) is trained to extract and compress the unobservable feature into a 3D manifold aligned with RGB space. In the second stage, the spectral and spatial structures are sequentially encoded by the SpeUAE and the SpaAE, respectively. The ULDM is then acquired to model the distribution of the coded unobservable feature with guidance from the corresponding RGB images. Experimental results on SR and downstream relighting tasks demonstrate that our proposed method achieves state-of-the-art performance.

Title: WaveletInception Networks for Drive-by Vibration-Based Infrastructure Health Monitoring

Authors: Reza Riahi Samani, Alfredo Nunez, Bart De Schutter
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.12969
Pdf URL: https://arxiv.org/pdf/2507.12969
Copy Paste: [[2507.12969]] WaveletInception Networks for Drive-by Vibration-Based Infrastructure Health Monitoring(https://arxiv.org/abs/2507.12969)
Keywords: extraction
Abstract: This paper presents a novel deep learning-based framework for infrastructure health monitoring using drive-by vibration response signals. Recognizing the importance of spectral and temporal information, we introduce the WaveletInception-BiLSTM network. The WaveletInception feature extractor utilizes a Learnable Wavelet Packet Transform (LWPT) as the stem for extracting vibration signal features, incorporating spectral information in the early network layers. This is followed by 1D Inception networks that extract multi-scale, high-level features at deeper layers. The extracted vibration signal features are then integrated with operational conditions via a Long Short-term Memory (LSTM) layer. The resulting feature extraction network effectively analyzes drive-by vibration signals across various measurement speeds without preprocessing and uses LSTM to capture interrelated temporal dependencies among different modes of information and to create feature vectors for health condition estimation. The estimator head is designed with a sequential modeling architecture using bidirectional LSTM (BiLSTM) networks, capturing bi-directional temporal relationships from drive-by measurements. This architecture allows for a high-resolution, beam-level assessment of infrastructure health conditions. A case study focusing on railway track stiffness estimation with simulated drive-by vibration signals shows that the model significantly outperforms state-of-the-art methods in estimating railway ballast and railpad stiffness parameters. Results underscore the potential of this approach for accurate, localized, and fully automated drive-by infrastructure health monitoring.

Title: A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints

Authors: Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.12979
Pdf URL: https://arxiv.org/pdf/2507.12979
Copy Paste: [[2507.12979]] A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints(https://arxiv.org/abs/2507.12979)
Keywords: security, privacy, federate, generative
Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI -- particularly Generative Adversarial Networks (GANs) -- has achieved remarkable success across a wide range of domains, such as healthcare, security, and Image Generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices -- such as IoT devices and edge devices -- with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints -- ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics: it achieves 1.1x -- 2.2x higher image generation scores and an average 10% boost in classification metrics (up to 50% in multi-domain non-IID settings), with much lower latency compared to several benchmarks. Find our code at this https URL.

Title: FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient

Authors: ShanBin Liu
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2507.12983
Pdf URL: https://arxiv.org/pdf/2507.12983
Copy Paste: [[2507.12983]] FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient(https://arxiv.org/abs/2507.12983)
Keywords: federate, fair
Abstract: Fairness has emerged as one of the key challenges in federated learning. In horizontal federated settings, data heterogeneity often leads to substantial performance disparities across clients, raising concerns about equitable model behavior. To address this issue, we propose FedGA, a fairness-aware federated learning algorithm. We first employ the Gini coefficient to measure the performance disparity among clients. Based on this, we establish a relationship between the Gini coefficient $G$ and the update scale of the global model ${U_s}$, and use this relationship to adaptively determine the timing of fairness intervention. Subsequently, we dynamically adjust the aggregation weights according to the system's real-time fairness status, enabling the global model to better incorporate information from clients with relatively poor performance. We conduct extensive experiments on the Office-Caltech-10, CIFAR-10, and Synthetic datasets. The results show that FedGA effectively improves fairness metrics such as variance and the Gini coefficient, while maintaining strong overall performance, demonstrating the effectiveness of our approach.
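
As a rough illustration of the disparity measure, the sketch below computes the standard Gini coefficient over per-client accuracies. The function name and the sample accuracies are hypothetical, and the abstract does not specify how FedGA maps $G$ to intervention timing or aggregation weights, so none of that is modeled here.

```python
# Sketch only: standard Gini coefficient; how FedGA uses G is not shown
# in the abstract and is not modeled here.

def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with i = 1..n on sorted x
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical per-client test accuracies.
balanced = [0.80, 0.81, 0.79, 0.80]
skewed = [0.95, 0.90, 0.40, 0.35]
print(gini(balanced), gini(skewed))
```

A larger value for the skewed list than for the balanced one signals greater performance disparity across clients.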

Title: Teach Old SAEs New Domain Tricks with Boosting

Authors: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.12990
Pdf URL: https://arxiv.org/pdf/2507.12990
Copy Paste: [[2507.12990]] Teach Old SAEs New Domain Tricks with Boosting(https://arxiv.org/abs/2507.12990)
Keywords: interpretability, large language model
Abstract: Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.

Title: Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Authors: Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.12998
Pdf URL: https://arxiv.org/pdf/2507.12998
Copy Paste: [[2507.12998]] Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning(https://arxiv.org/abs/2507.12998)
Keywords: robust
Abstract: The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection, as an alternative efficient paradigm, is an important direction for accelerating the training process. However, recent advances on sample selection either mostly rely on an oracle model to offline select a high-quality coreset, which is limited in the cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered the noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative to characterize sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods. Source code is available at: this https URL.

Title: Measuring CEX-DEX Extracted Value and Searcher Profitability: The Darkest of the MEV Dark Forest

Authors: Fei Wu, Danning Sui, Thomas Thiery, Mallesh Pai
Subjects: cs.CR, q-fin.TR
Abstract URL: https://arxiv.org/abs/2507.13023
Pdf URL: https://arxiv.org/pdf/2507.13023
Copy Paste: [[2507.13023]] Measuring CEX-DEX Extracted Value and Searcher Profitability: The Darkest of the MEV Dark Forest(https://arxiv.org/abs/2507.13023)
Keywords: robust
Abstract: This paper provides a comprehensive empirical analysis of the economics and dynamics behind arbitrages between centralized and decentralized exchanges (CEX-DEX) on Ethereum. We refine heuristics to identify arbitrage transactions from on-chain data and introduce a robust empirical framework to estimate arbitrage revenue without knowing traders' actual behaviors on CEX. Leveraging an extensive dataset spanning 19 months from August 2023 to March 2025, we estimate a total of 233.8M USD extracted by 19 major CEX-DEX searchers from 7,203,560 identified CEX-DEX arbitrages. Our analysis reveals increasing centralization trends as three searchers captured three-quarters of both volume and extracted value. We also demonstrate that searchers' profitability is tied to their integration level with block builders and uncover exclusive searcher-builder relationships and their market impact. Finally, we correct the previously underestimated profitability of block builders who vertically integrate with a searcher. These insights illuminate the darkest corner of the MEV landscape and highlight the critical implications of CEX-DEX arbitrages for Ethereum's decentralization.

Title: From Paranoia to Compliance: The Bumpy Road of System Hardening Practices on Stack Exchange

Authors: Niklas Busch (1), Philip Klostermeyer (1), Jan H. Klemmer (1), Yasemin Acar (2), Sascha Fahl (1) ((1) CISPA Helmholtz Center for Information Security, (2) Paderborn University)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.13028
Pdf URL: https://arxiv.org/pdf/2507.13028
Copy Paste: [[2507.13028]] From Paranoia to Compliance: The Bumpy Road of System Hardening Practices on Stack Exchange(https://arxiv.org/abs/2507.13028)
Keywords: secure, security, attack
Abstract: Hardening computer systems against cyberattacks is crucial for security. However, past incidents illustrated that many system operators struggle with effective system hardening. Hence, many computer systems and applications remain insecure. So far, the research community lacks an in-depth understanding of system operators' motivation, practices, and challenges around system hardening. With a focus on practices and challenges, we qualitatively analyzed 316 Stack Exchange (SE) posts related to system hardening. We find that access control and deployment-related issues are the most challenging, and system operators suffer from misconceptions and unrealistic expectations. Most frequently, posts focused on operating systems and server applications. System operators were driven by the fear of their systems getting attacked or by compliance reasons. Finally, we discuss our research questions, make recommendations for future system hardening, and illustrate the implications of our work.

Title: Confidence-Filtered Relevance (CFR): An Interpretable and Uncertainty-Aware Machine Learning Framework for Naturalness Assessment in Satellite Imagery

Authors: Ahmed Emam, Ribana Roscher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13034
Pdf URL: https://arxiv.org/pdf/2507.13034
Copy Paste: [[2507.13034]] Confidence-Filtered Relevance (CFR): An Interpretable and Uncertainty-Aware Machine Learning Framework for Naturalness Assessment in Satellite Imagery(https://arxiv.org/abs/2507.13034)
Keywords: protect, interpretability
Abstract: Protected natural areas play a vital role in ecological balance and ecosystem services. Monitoring these regions at scale using satellite imagery and machine learning is promising, but current methods often lack interpretability and uncertainty-awareness, and do not address how uncertainty affects naturalness assessment. In contrast, we propose Confidence-Filtered Relevance (CFR), a data-centric framework that combines LRP Attention Rollout with Deep Deterministic Uncertainty (DDU) estimation to analyze how model uncertainty influences the interpretability of relevance heatmaps. CFR partitions the dataset into subsets based on uncertainty thresholds, enabling systematic analysis of how uncertainty shapes the explanations of naturalness in satellite imagery. Applied to the AnthroProtect dataset, CFR assigned higher relevance to shrublands, forests, and wetlands, aligning with other research on naturalness assessment. Moreover, our analysis shows that as uncertainty increases, the interpretability of these relevance heatmaps declines and their entropy grows, indicating less selective and more ambiguous attributions. CFR provides a data-centric approach to assess the relevance of patterns to naturalness in satellite imagery based on their associated certainty.
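
The partition-then-measure idea can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not define the entropy or the partition scheme, so the code assumes Shannon entropy over normalized absolute relevance and disjoint uncertainty bins; all names, thresholds, and sample values are hypothetical.

```python
# Sketch only: entropy definition and disjoint binning are assumptions,
# not details given in the abstract.
import math


def heatmap_entropy(relevance):
    """Shannon entropy (nats) of a flattened relevance map, after
    normalizing absolute relevance values to sum to 1."""
    weights = [abs(r) for r in relevance]
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def partition_by_uncertainty(samples, thresholds):
    """Split (uncertainty, relevance) pairs into len(thresholds) + 1 bins;
    bin i holds samples whose uncertainty exceeds exactly i thresholds."""
    bounds = sorted(thresholds)
    bins = [[] for _ in range(len(bounds) + 1)]
    for uncertainty, relevance in samples:
        idx = sum(1 for b in bounds if uncertainty > b)
        bins[idx].append((uncertainty, relevance))
    return bins


# Hypothetical samples: (uncertainty score, flattened relevance heatmap).
samples = [
    (0.1, [0.9, 0.05, 0.05, 0.0]),   # confident, focused attribution
    (0.5, [0.5, 0.3, 0.1, 0.1]),
    (0.9, [0.25, 0.25, 0.25, 0.25]), # uncertain, diffuse attribution
]
for i, group in enumerate(partition_by_uncertainty(samples, [0.3, 0.7])):
    mean_h = sum(heatmap_entropy(r) for _, r in group) / max(len(group), 1)
    print(f"bin {i}: n={len(group)} mean entropy={mean_h:.3f}")
```

Higher entropy in a bin corresponds to less selective attributions, which is the kind of trend the abstract reports as uncertainty grows.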

Title: MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems

Authors: Yu Cui, Hongyang Du
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.13038
Pdf URL: https://arxiv.org/pdf/2507.13038
Copy Paste: [[2507.13038]] MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems(https://arxiv.org/abs/2507.13038)
Keywords: security, attack, large language model
Abstract: Multi-agent debate (MAD) systems leverage collaborative interactions among large language model (LLM) agents to improve reasoning capabilities. While recent studies have focused on increasing the accuracy and scalability of MAD systems, their security vulnerabilities have received limited attention. In this work, we introduce MAD-Spear, a targeted prompt injection attack that compromises a small subset of agents but significantly disrupts the overall MAD process. Manipulated agents produce multiple plausible yet incorrect responses, exploiting LLMs' conformity tendencies to propagate misinformation and degrade consensus quality. Furthermore, the attack can be composed with other strategies, such as communication attacks, to further amplify its impact by increasing the exposure of agents to incorrect responses. To assess MAD's resilience under attack, we propose a formal definition of MAD fault-tolerance and develop a comprehensive evaluation framework that jointly considers accuracy, consensus efficiency, and scalability. Extensive experiments on five benchmark datasets with varying difficulty levels demonstrate that MAD-Spear consistently outperforms the baseline attack in degrading system performance. Additionally, we observe that agent diversity substantially improves MAD performance in mathematical reasoning tasks, which challenges prior work suggesting that agent diversity has minimal impact on performance. These findings highlight the urgent need to improve security in MAD design.

Title: Backscattering-Based Security in Wireless Power Transfer Applied to Battery-Free BLE Sensors

Authors: Taki Eddine Djidjekh (INSA Toulouse, LAAS-MINC), Gaël Loubet (LAAS-MINC, INSA Toulouse), Alexandru Takacs (LAAS-MINC, UT)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.13042
Pdf URL: https://arxiv.org/pdf/2507.13042
Copy Paste: [[2507.13042]] Backscattering-Based Security in Wireless Power Transfer Applied to Battery-Free BLE Sensors(https://arxiv.org/abs/2507.13042)
Keywords: secure, security
Abstract: The integration of security and energy efficiency in Internet of Things systems remains a critical challenge, particularly for battery-free and resource-constrained devices. This paper explores the scalability and protocol-agnostic nature of a backscattering-based security mechanism by integrating it into a Bluetooth Low Energy battery-free Wireless Sensor Network. The proposed approach leverages the Wireless Power Transfer link, traditionally used for energy harvesting, to generate additional identification signals without increasing energy consumption or computational demands. Experimental validation demonstrates the solution's functionality using a compact, low-gain antenna, ensuring compatibility with size-constrained applications such as Structural Health Monitoring and smart transport. Furthermore, this work addresses the challenges associated with backscattering dynamic range and multi-node Wireless Sensor Network scenarios, discussing potential collisions between identification signals and proposing future improvements to enhance generalizability and scalability. The findings underscore the potential of the backscattering-based security mechanism for creating secure, sustainable, and scalable IoT deployments across diverse protocols and applications.

Title: The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

Authors: Lefei Shen, Mouxiang Chen, Han Fu, Xiaoxue Ren, Xiaoyun Joy Wang, Jianling Sun, Zhuo Li, Chenghao Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13043
Pdf URL: https://arxiv.org/pdf/2507.13043
Copy Paste: [[2507.13043]] The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting(https://arxiv.org/abs/2507.13043)
Keywords: transformer
Abstract: Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at this https URL.

Title: Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection

Authors: Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13061
Pdf URL: https://arxiv.org/pdf/2507.13061
Copy Paste: [[2507.13061]] Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection(https://arxiv.org/abs/2507.13061)
Keywords: robust
Abstract: Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advancements in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still face challenges in adaptation to unseen complex wide-area scenes. To address the challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. It progressively refines the selected regions based on the proposed theoretically guaranteed importance function, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to achieve rapid understandings of unseen scenes at any scale using minimal interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.

Title: Label-Consistent Dataset Distillation with Detector-Guided Refinement

Authors: Yawen Zou, Guang Li, Zi Wang, Chunzhi Gu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13074
Pdf URL: https://arxiv.org/pdf/2507.13074
Copy Paste: [[2507.13074]] Label-Consistent Dataset Distillation with Detector-Guided Refinement(https://arxiv.org/abs/2507.13074)
Keywords: diffusion
Abstract: Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
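
The candidate-selection step can be illustrated with a short sketch. The abstract only says confidence and dissimilarity are considered jointly, so the additive score, the weight `alpha`, the Euclidean distance on feature vectors, and all names below are assumptions made for illustration, not the authors' formulation.

```python
# Sketch only: additive scoring, alpha, and Euclidean distance are
# assumptions; the abstract does not give the actual selection rule.
import math


def select_candidate(candidates, existing, alpha=1.0):
    """Pick the candidate maximizing
    confidence + alpha * (distance to the nearest existing sample).

    candidates: list of (feature_vector, detector_confidence)
    existing:   list of feature vectors of already accepted samples
    """
    def diversity(features):
        if not existing:
            return 0.0
        return min(math.dist(features, e) for e in existing)

    return max(candidates, key=lambda c: c[1] + alpha * diversity(c[0]))


# Hypothetical 2-D features and detector confidences.
existing = [[0.0, 0.0]]
candidates = [([0.0, 0.0], 0.9), ([1.0, 1.0], 0.8)]
print(select_candidate(candidates, existing))             # favors the distinct one
print(select_candidate(candidates, existing, alpha=0.0))  # confidence only
```

With `alpha=0` the rule reduces to picking the most confident candidate; a positive `alpha` trades some confidence for intra-class diversity.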

Title: Formalizing Attack Scenario Description: A Proposed Model

Authors: Quentin Goux (CEDRIC - ISID), Nadira Lammari (CEDRIC - ISID)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13076
Pdf URL: https://arxiv.org/pdf/2507.13076
Copy Paste: [[2507.13076]] Formalizing Attack Scenario Description: A Proposed Model(https://arxiv.org/abs/2507.13076)
Keywords: security, protect, attack
Abstract: Organizations face an ever-changing threat landscape. They must continuously dedicate significant efforts to protect their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. Through this paper, we address this need for processes that use attack scenarios as input. Among these processes, one can mention both the generation of scripts for attack simulation and training purposes, as well as the analysis of attacks. Therefore, the paper's main research contribution is a novel formal model that encompasses the attack's context description and its scenario. It is abstracted using a UML class model. Once the description of our model is done, we will show how it could serve an upstream attack analysis process. We will also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this present research work.

Title: DASViT: Differentiable Architecture Search for Vision Transformer

Authors: Pengjin Wu, Ferrante Neri, Zhenhua Feng
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.13079
Pdf URL: https://arxiv.org/pdf/2507.13079
Copy Paste: [[2507.13079]] DASViT: Differentiable Architecture Search for Vision Transformer(https://arxiv.org/abs/2507.13079)
Keywords: transformer
Abstract: Designing effective neural networks is a cornerstone of deep learning, and Neural Architecture Search (NAS) has emerged as a powerful tool for automating this process. Among the existing NAS approaches, Differentiable Architecture Search (DARTS) has gained prominence for its efficiency and ease of use, inspiring numerous advancements. Since the rise of Vision Transformers (ViT), researchers have applied NAS to explore ViT architectures, often focusing on macro-level search spaces and relying on discrete methods like evolutionary algorithms. While these methods ensure reliability, they face challenges in discovering innovative architectural designs, demand extensive computational resources, and are time-intensive. To address these limitations, we introduce Differentiable Architecture Search for Vision Transformer (DASViT), which bridges the gap in differentiable search for ViTs and uncovers novel designs. Experiments show that DASViT delivers architectures that break traditional Transformer encoder designs, outperform ViT-B/16 on multiple datasets, and achieve superior efficiency with fewer parameters and FLOPs.

Title: Channel-wise Motion Features for Efficient Motion Segmentation

Authors: Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13082
Pdf URL: https://arxiv.org/pdf/2507.13082
Copy Paste: [[2507.13082]] Channel-wise Motion Features for Efficient Motion Segmentation(https://arxiv.org/abs/2507.13082)
Keywords: segmentation
Abstract: For safety-critical robotics applications such as autonomous driving, it is important to detect all required objects accurately in real-time. Motion segmentation offers a solution by identifying dynamic objects from the scene in a class-agnostic manner. Recently, various motion segmentation models have been proposed, most of which jointly use subnetworks to estimate Depth, Pose, Optical Flow, and Scene Flow. As a result, the overall computational cost of the model increases, hindering real-time performance. In this paper, we propose a novel cost-volume-based motion feature representation, Channel-wise Motion Features. By extracting depth features of each instance in the feature map and capturing the scene's 3D motion information, it offers enhanced efficiency. The only subnetwork used to build Channel-wise Motion Features is the Pose Network, and no others are required. Our method not only achieves about 4 times the FPS of state-of-the-art models in the KITTI Dataset and Cityscapes of the VCAS-Motion Dataset, but also demonstrates equivalent accuracy while reducing the parameters to about 25%.

Title: Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection

Authors: Riku Inoue, Masamitsu Tsuchiya, Yuji Yasui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13085
Pdf URL: https://arxiv.org/pdf/2507.13085
Copy Paste: [[2507.13085]] Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection(https://arxiv.org/abs/2507.13085)
Keywords: transformer
Abstract: Open World Object Detection (OWOD) is a challenging computer vision task that extends standard object detection by (1) detecting and classifying unknown objects without supervision, and (2) incrementally learning new object classes without forgetting previously learned ones. The absence of ground truths for unknown objects makes OWOD tasks particularly challenging. Many methods have addressed this by using pseudo-labels for unknown objects. The recently proposed Probabilistic Objectness transformer-based open-world detector (PROB) is a state-of-the-art model that does not require pseudo-labels for unknown objects, as it predicts probabilistic objectness. However, this method faces issues with learning conflicts between objectness and class predictions. To address this issue and further enhance performance, we propose a novel model, Decoupled PROB. Decoupled PROB introduces Early Termination of Objectness Prediction (ETOP) to stop objectness predictions at appropriate layers in the decoder, resolving the learning conflicts between class and objectness predictions in PROB. Additionally, we introduce Task-Decoupled Query Initialization (TDQI), which efficiently extracts features of known and unknown objects, thereby improving performance. TDQI is a query initialization method that combines query selection and learnable queries, and it is a module that can be easily integrated into existing DETR-based OWOD models. Extensive experiments on OWOD benchmarks demonstrate that Decoupled PROB surpasses all existing methods across several metrics, significantly improving performance.

Title: DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model

Authors: Han Zhang, Xiangde Luo, Yong Chen, Kang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13087
Pdf URL: https://arxiv.org/pdf/2507.13087
Copy Paste: [[2507.13087]] DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model(https://arxiv.org/abs/2507.13087)
Keywords: diffusion, segmentation
Abstract: Annotation variability remains a substantial challenge in medical image segmentation, stemming from ambiguous imaging boundaries and diverse clinical expertise. Traditional deep learning methods producing single deterministic segmentation predictions often fail to capture these annotator biases. Although recent studies have explored multi-rater segmentation, existing methods typically focus on a single perspective -- either generating a probabilistic "gold standard" consensus or preserving expert-specific preferences -- thus struggling to provide a more omni view. In this study, we propose DiffOSeg, a two-stage diffusion-based framework, which aims to simultaneously achieve both consensus-driven (combining all experts' opinions) and preference-driven (reflecting experts' individual assessments) segmentation. Stage I establishes population consensus through a probabilistic consensus strategy, while Stage II captures expert-specific preference via adaptive prompts. Demonstrated on two public datasets (LIDC-IDRI and NPC-170), our model outperforms existing state-of-the-art methods across all evaluated metrics. Source code is available at this https URL.

Title: GLAD: Generalizable Tuning for Vision-Language Models

Authors: Yuqi Peng, Pengfei Wang, Jianzhuang Liu, Shifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13089
Pdf URL: https://arxiv.org/pdf/2507.13089
Copy Paste: [[2507.13089]] GLAD: Generalizable Tuning for Vision-Language Models(https://arxiv.org/abs/2507.13089)
Keywords: robust
Abstract: Pre-trained vision-language models, such as CLIP, show impressive zero-shot recognition ability and can be easily transferred to specific downstream tasks via prompt tuning, even with limited training data. However, existing prompt tuning methods face two main challenges: (1) In few-shot scenarios, data scarcity often leads to overfitting, making the model sensitive to changes in the input domain. (2) To mitigate overfitting, these methods typically rely on complex task-specific model architectures and sensitive hyperparameter tuning, severely restricting their general applicability. To address these issues, we propose a simpler and more general framework called GLAD (Generalizable LoRA tuning with RegulArized GraDient). We show that merely applying LoRA achieves performance in downstream tasks comparable to current state-of-the-art prompt-based methods. While LoRA is effective and easy to use, it remains susceptible to overfitting in few-shot learning scenarios. To mitigate this risk, we introduce a gradient-based regularization technique. This technique effectively steers the optimization trajectory, encouraging the model to find a more stable parameter region that is robust to variations in data distribution. Through extensive experiments conducted on 15 benchmark datasets, we demonstrate that GLAD outperforms previous tuning approaches in terms of base-to-novel class generalization, image domain generalization, and cross-dataset generalization. The code will be publicly available.
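
For readers unfamiliar with LoRA, the low-rank update that GLAD builds on can be sketched as follows. This is a minimal illustration of the generic LoRA formulation (frozen weight `W` plus a scaled product `B A`), not the GLAD implementation; the gradient-based regularizer is not specified in the abstract and is omitted. The names `matmul`, `lora_forward`, and `alpha` are illustrative assumptions.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_forward(x, W, A, B, alpha):
    """Compute (W + (alpha / r) * B A) x without materializing the updated weight.

    W is the frozen d_out x d_in weight, A is r x d_in, B is d_out x r,
    and x is a d_in x 1 column vector. Only A and B would be trained.
    """
    r = len(A)                       # rank of the update
    base = matmul(W, x)              # frozen path
    delta = matmul(B, matmul(A, x))  # low-rank path
    scale = alpha / r
    return [[b[0] + scale * d[0]] for b, d in zip(base, delta)]


W = [[1, 0, 0], [0, 1, 0]]           # frozen 2 x 3 weight
x = [[1], [2], [3]]
A = [[1, 1, 1]]                      # rank r = 1
B_init = [[0], [0]]                  # the usual zero init leaves the model unchanged
B_tuned = [[0.5], [-1]]              # hypothetical values after tuning

out_init = lora_forward(x, W, A, B_init, alpha=2)
out_tuned = lora_forward(x, W, A, B_tuned, alpha=2)
print(out_init, out_tuned)
```

With rank 1 the adapter here holds 5 trainable values against 6 in the full weight; the saving grows quickly for realistic layer sizes.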

Title: MUPAX: Multidimensional Problem Agnostic eXplainable AI

Authors: Vincenzo Dentamaro, Felice Franchini, Giuseppe Pirlo, Irina Voiculescu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.13090
Pdf URL: https://arxiv.org/pdf/2507.13090
Copy Paste: [[2507.13090]] MUPAX: Multidimensional Problem Agnostic eXplainable AI(https://arxiv.org/abs/2507.13090)
Keywords: robust, explainability
Abstract: Robust XAI techniques should ideally be simultaneously deterministic, model agnostic, and guaranteed to converge. We propose MULTIDIMENSIONAL PROBLEM AGNOSTIC EXPLAINABLE AI (MUPAX), a deterministic, model agnostic explainability technique with guaranteed convergence. MUPAX's measure-theoretic formulation gives principled feature importance attribution through structured perturbation analysis that discovers inherent input patterns and eliminates spurious relationships. We evaluate MUPAX on an extensive range of data modalities and tasks: audio classification (1D), image classification (2D), volumetric medical image analysis (3D), and anatomical landmark detection, demonstrating dimension agnostic effectiveness. The rigorous convergence guarantees extend to any loss function and arbitrary dimensions, making MUPAX applicable to virtually any problem context for AI. By contrast with other XAI methods that typically decrease performance when masking, MUPAX not only preserves but actually enhances model accuracy by capturing only the most important patterns of the original data. Extensive benchmarking against the state of the XAI art demonstrates MUPAX's ability to generate precise, consistent and understandable explanations, a crucial step towards explainable and trustworthy AI systems. The source code will be released upon publication.

Title: Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction

Authors: Zhennan Xiao, Katharine Brudkiewicz, Zhen Yuan, Rosalind Aughwane, Magdalena Sokolska, Joanna Chappell, Trevor Gaunt, Anna L. David, Andrew P. King, Andrew Melbourne
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13106
Pdf URL: https://arxiv.org/pdf/2507.13106
Copy Paste: [[2507.13106]] Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction(https://arxiv.org/abs/2507.13106)
Keywords: robust, diffusion, segmentation
Abstract: Fetal lung maturity is a critical indicator for predicting neonatal outcomes and the need for post-natal intervention, especially for pregnancies affected by fetal growth restriction. Intra-voxel incoherent motion analysis has shown promising results for non-invasive assessment of fetal lung development, but its reliance on manual segmentation is time-consuming, thus limiting its clinical applicability. In this work, we present an automated lung maturity evaluation pipeline for diffusion-weighted magnetic resonance images that consists of a deep learning-based fetal lung segmentation model and a model-fitting lung maturity assessment. A 3D nnU-Net model was trained on manually segmented images selected from the baseline frames of 4D diffusion-weighted MRI scans. The segmentation model demonstrated robust performance, yielding a mean Dice coefficient of 82.14%. Next, voxel-wise model fitting was performed based on both the nnU-Net-predicted and manual lung segmentations to quantify IVIM parameters reflecting tissue microstructure and perfusion. The results suggested no differences between the two. Our work shows that a fully automated pipeline is possible for supporting fetal lung maturity assessment and clinical decision-making.
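
The Dice coefficient reported above measures overlap between a predicted mask and a manual one. A minimal sketch of the standard definition follows, assuming binary masks flattened to 0/1 sequences; the function name and the toy masks are illustrative and are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 |A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [1, 1, 0, 0, 1, 0]   # toy flattened segmentation
manual = [1, 0, 0, 0, 1, 1]      # toy flattened reference
score = dice_coefficient(predicted, manual)
print(round(score, 4))
```

Two of the three foreground voxels in each toy mask coincide, so the score is 2·2 / (3 + 3) = 2/3.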

Title: R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning

Authors: Xiaohan Guo, Yusong Cai, Zejia Liu, Zhengning Wang, Lili Pan, Hongliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13107
Pdf URL: https://arxiv.org/pdf/2507.13107
Copy Paste: [[2507.13107]] R^2MoE: Redundancy-Removal Mixture of Experts for Lifelong Concept Learning(https://arxiv.org/abs/2507.13107)
Keywords: generative
Abstract: Enabling large-scale generative models to continuously learn new visual concepts is essential for personalizing pre-trained models to meet individual user preferences. Existing approaches for continual visual concept learning are constrained by two fundamental challenges: catastrophic forgetting and parameter expansion. In this paper, we propose Redundancy-Removal Mixture of Experts (R^2MoE), a parameter-efficient framework for lifelong visual concept learning that effectively learns new concepts while incurring minimal parameter overhead. Our framework includes three key innovative contributions: First, we propose a mixture-of-experts framework with a routing distillation mechanism that enables experts to acquire concept-specific knowledge while preserving the gating network's routing capability, thereby effectively mitigating catastrophic forgetting. Second, we propose a strategy for eliminating redundant layer-wise experts that reduces the number of expert parameters by fully utilizing previously learned experts. Third, we employ a hierarchical local attention-guided inference approach to mitigate interference between generated visual concepts. Extensive experiments have demonstrated that our method generates images with superior conceptual fidelity compared to the state-of-the-art (SOTA) method, achieving an impressive 87.8% reduction in forgetting rates and 63.3% fewer parameters on the CustomConcept 101 dataset. Our code is available at this https URL.

Title: A Computational Framework to Identify Self-Aspects in Text

Authors: Jaya Caporusso, Matthew Purver, Senja Pollak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13115
Pdf URL: https://arxiv.org/pdf/2507.13115
Copy Paste: [[2507.13115]] A Computational Framework to Identify Self-Aspects in Text(https://arxiv.org/abs/2507.13115)
Keywords: interpretability, generative, large language model
Abstract: This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.

Title: NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation

Authors: Yuanxin Zhuang, Dazhong Shen, Ying Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13133
Pdf URL: https://arxiv.org/pdf/2507.13133
Copy Paste: [[2507.13133]] NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation(https://arxiv.org/abs/2507.13133)
Keywords: interpretability, generative
Abstract: Graph generation plays a pivotal role across numerous domains, including molecular design and knowledge graph construction. Although existing methods achieve considerable success in generating realistic graphs, their interpretability remains limited, often obscuring the rationale behind structural decisions. To address this challenge, we propose the Neural Graph Topic Model (NGTM), a novel generative framework inspired by topic modeling in natural language processing. NGTM represents graphs as mixtures of latent topics, each defining a distribution over semantically meaningful substructures, which facilitates explicit interpretability at both local and global scales. The generation process transparently integrates these topic distributions with a global structural variable, enabling clear semantic tracing of each generated graph. Experiments demonstrate that NGTM achieves competitive generation quality while uniquely enabling fine-grained control and interpretability, allowing users to tune structural features or induce biological properties through topic-level adjustments.

Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Authors: Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13138
Pdf URL: https://arxiv.org/pdf/2507.13138
Copy Paste: [[2507.13138]] Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation(https://arxiv.org/abs/2507.13138)
Keywords: robust, fair, generative
Abstract: Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than persona simulation.

Title: DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Authors: Maulana Bisyir Azhari, David Hyunchul Shim
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13145
Pdf URL: https://arxiv.org/pdf/2507.13145
Copy Paste: [[2507.13145]] DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model(https://arxiv.org/abs/2507.13145)
Keywords: robust, transformer
Abstract: Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust-semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.

Title: SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Authors: Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13152
Pdf URL: https://arxiv.org/pdf/2507.13152
Copy Paste: [[2507.13152]] SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models(https://arxiv.org/abs/2507.13152)
Keywords: large language model
Abstract: Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing them from fully incorporating experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that a multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

Title: Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities

Authors: Hao Sun, Mihaela van der Schaar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.13158
Pdf URL: https://arxiv.org/pdf/2507.13158
Copy Paste: [[2507.13158]] Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities(https://arxiv.org/abs/2507.13158)
Keywords: large language model
Abstract: In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.

Title: Prompt Injection 2.0: Hybrid AI Threats

Authors: Jeremy McHugh, Kristina Šekrst, Jon Cefalu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13169
Pdf URL: https://arxiv.org/pdf/2507.13169
Copy Paste: [[2507.13169]] Prompt Injection 2.0: Hybrid AI Threats(https://arxiv.org/abs/2507.13169)
Keywords: security, attack
Abstract: Prompt injection attacks, where malicious input is designed to manipulate AI systems into ignoring their original instructions and following unauthorized commands instead, were first discovered by Preamble, Inc. in May 2022 and responsibly disclosed to OpenAI. Over the last three years, these attacks have continued to pose a critical security threat to LLM-integrated systems. The emergence of agentic AI systems, where LLMs autonomously perform multistep tasks through tools and coordination with other agents, has fundamentally transformed the threat landscape. Modern prompt injection attacks can now combine with traditional cybersecurity exploits to create hybrid threats that systematically evade traditional security controls. This paper presents a comprehensive analysis of Prompt Injection 2.0, examining how prompt injections integrate with Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and other web security vulnerabilities to bypass traditional security measures. We build upon Preamble's foundational research and mitigation technologies, evaluating them against contemporary threats, including AI worms, multi-agent infections, and hybrid cyber-AI attacks. Our analysis incorporates recent benchmarks that demonstrate how traditional web application firewalls, XSS filters, and CSRF tokens fail against AI-enhanced attacks. We also present architectural solutions that combine prompt isolation, runtime security, and privilege separation with novel threat detection capabilities.

Title: Automatically assessing oral narratives of Afrikaans and isiXhosa children

Authors: R. Louw (1), E. Sharratt (1), F. de Wet (1), C. Jacobs (1), A. Smith (1), H. Kamper (1) ((1) Stellenbosch University)
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2507.13205
Pdf URL: https://arxiv.org/pdf/2507.13205
Copy Paste: [[2507.13205]] Automatically assessing oral narratives of Afrikaans and isiXhosa children(https://arxiv.org/abs/2507.13205)
Keywords: large language model
Abstract: Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children's learning.

Title: MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling

Authors: Etienne Le Naour, Tahar Nabil, Ghislain Agoua
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13207
Pdf URL: https://arxiv.org/pdf/2507.13207
Copy Paste: [[2507.13207]] MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling(https://arxiv.org/abs/2507.13207)
Keywords: robust
Abstract: Recent years have witnessed a growing interest in time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.
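
The idea of fitting a ridge regressor over a fixed basis of continuous functions at inference time can be illustrated roughly as below. This is only a sketch under strong assumptions: `math.sin` and `math.cos` stand in for pretrained INRs, the regressor is assumed to mix their outputs linearly, and `solve`, `ridge_fit`, and the toy series are hypothetical. The abstract does not describe MoTM's actual features or training.

```python
import math


def solve(M, v):
    """Solve M w = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_fit(features, targets, lam):
    """Closed-form ridge weights: (F^T F + lam I)^{-1} F^T y."""
    k = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(k)]
    return solve(gram, rhs)


basis = [math.sin, math.cos]                 # stand-ins for pretrained INRs
observed_t = [0.0, 0.5, 1.0, 2.0, 2.5]       # irregularly sampled context
observed_y = [2.0 * math.sin(t) - 1.0 * math.cos(t) for t in observed_t]

features = [[b(t) for b in basis] for t in observed_t]
weights = ridge_fit(features, observed_y, lam=1e-8)

missing_t = 1.5                              # timestamp to impute
imputed = sum(w * b(missing_t) for w, b in zip(weights, basis))
print(weights, imputed)
```

Because the basis functions are continuous in time, the fitted mixture can be evaluated at any timestamp, which is what makes irregular sampling and block gaps easy to handle in this kind of setup.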

Title: Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection

Authors: Hongyang Zhao, Tianyu Liang, Sina Davari, Daeho Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13221
Pdf URL: https://arxiv.org/pdf/2507.13221
Copy Paste: [[2507.13221]] Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection(https://arxiv.org/abs/2507.13221)
Keywords: generative
Abstract: While recent advancements in deep neural networks (DNNs) have substantially enhanced visual AI's capabilities, the challenge of inadequate data diversity and volume remains, particularly in the construction domain. This study presents a novel image synthesis methodology tailored for construction worker detection, leveraging the generative-AI platform Midjourney. The approach entails generating a collection of 12,000 synthetic images by formulating 3,000 different prompts, with an emphasis on image realism and diversity. These images, after manual labeling, serve as a dataset for DNN training. Evaluation on a real construction image dataset yielded promising results, with the model attaining average precisions (APs) of 0.937 and 0.642 at intersection-over-union (IoU) thresholds of 0.5 and 0.5 to 0.95, respectively. Notably, the model demonstrated near-perfect performance on the synthetic dataset, achieving APs of 0.994 and 0.919 at the two mentioned thresholds. These findings reveal both the potential and weakness of generative AI in addressing DNN training data scarcity.
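
The IoU thresholds mentioned above decide whether a predicted box counts as a correct detection. A minimal sketch of box IoU and threshold matching follows, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form and a 0.05 step between 0.5 and 0.95 (the COCO-style convention; the abstract does not state the step). The function name and boxes are illustrative.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes have empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
loose_prediction = (5, 0, 15, 10)   # half-overlapping box
tight_prediction = (1, 0, 11, 10)   # shifted by one unit

loose_iou = box_iou(ground_truth, loose_prediction)
tight_iou = box_iou(ground_truth, tight_prediction)

# Assumed threshold grid: 0.50, 0.55, ..., 0.95.
thresholds = [t / 100 for t in range(50, 100, 5)]
tight_matches = sum(tight_iou >= t for t in thresholds)
print(loose_iou, tight_iou, tight_matches)
```

The half-overlapping box has IoU 1/3 and fails even the 0.5 threshold, while the tight box (IoU 9/11) passes the seven thresholds up to 0.80. This is why AP averaged over 0.5 to 0.95 is much stricter than AP at 0.5 alone.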

Title: Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

Authors: Keerthi Veeramachaneni, Praveen Tirupattur, Amrit Singh Bedi, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13224
Pdf URL: https://arxiv.org/pdf/2507.13224
Copy Paste: [[2507.13224]] Leveraging Pre-Trained Visual Models for AI-Generated Video Detection(https://arxiv.org/abs/2507.13224)
Keywords: security, privacy, generative
Abstract: Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond deepfakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.

Title: $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation

Authors: Junhong Min, Youngpil Jeon, Jimin Kim, Minyong Choi
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13229
Pdf URL: https://arxiv.org/pdf/2507.13229
Copy Paste: [[2507.13229]] $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation(https://arxiv.org/abs/2507.13229)
Keywords: robust, transformer
Abstract: The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.

Title: VITA: Vision-to-Action Flow Matching Policy

Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.13231
Pdf URL: https://arxiv.org/pdf/2507.13231
Copy Paste: [[2507.13231]] VITA: Vision-to-Action Flow Matching Policy(https://arxiv.org/abs/2507.13231)
Keywords: diffusion, generative
Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.

Title: Enhancing Cross-task Transfer of Large Language Models via Activation Steering

Authors: Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13236
Pdf URL: https://arxiv.org/pdf/2507.13236
Copy Paste: [[2507.13236]] Enhancing Cross-task Transfer of Large Language Models via Activation Steering(https://arxiv.org/abs/2507.13236)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model's internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.

Title: HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

Authors: Ashray Gupta, Rohan Joseph, Sunny Rai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13238
Pdf URL: https://arxiv.org/pdf/2507.13238
Copy Paste: [[2507.13238]] HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models(https://arxiv.org/abs/2507.13238)
Keywords: large language model
Abstract: Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.

Title: Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets

Authors: Maria Margarida Mascarenhas, Jilles De Blauwe, Mikael Amelin, Hussain Kazmi
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2507.13250
Pdf URL: https://arxiv.org/pdf/2507.13250
Copy Paste: [[2507.13250]] Leveraging Asynchronous Cross-border Market Data for Improved Day-Ahead Electricity Price Forecasting in European Markets(https://arxiv.org/abs/2507.13250)
Keywords: interpretability
Abstract: Accurate short-term electricity price forecasting is crucial for strategically scheduling demand and generation bids in day-ahead markets. While data-driven techniques have shown considerable prowess in achieving high forecast accuracy in recent years, they rely heavily on the quality of input covariates. In this paper, we investigate whether asynchronously published prices as a result of differing gate closure times (GCTs) in some bidding zones can improve forecasting accuracy in other markets with later GCTs. Using a state-of-the-art ensemble of models, we show significant improvements of 22% and 9% in forecast accuracy in the Belgian (BE) and Swedish bidding zones (SE3) respectively, when including price data from interconnected markets with earlier GCT (Germany-Luxembourg, Austria, and Switzerland). This improvement holds for both general as well as extreme market conditions. Our analysis also yields further important insights: frequent model recalibration is necessary for maximum accuracy but comes at substantial additional computational costs, and using data from more markets does not always lead to better performance - a fact we delve deeper into with interpretability analysis of the forecast models. Overall, these findings provide valuable guidance for market participants and decision-makers aiming to optimize bidding strategies within increasingly interconnected and volatile European energy markets.

Title: Automating Steering for Safe Multimodal Large Language Models

Authors: Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2507.13255
Pdf URL: https://arxiv.org/pdf/2507.13255
Copy Paste: [[2507.13255]] Automating Steering for Safe Multimodal Large Language Models(https://arxiv.org/abs/2507.13255)
Keywords: attack, large language model
Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.

Title: Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

Authors: Yiting Yang, Hao Luo, Yuan Sun, Qingsen Yan, Haokui Zhang, Wei Dong, Guoqing Wang, Peng Wang, Yang Yang, Hengtao Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13260
Pdf URL: https://arxiv.org/pdf/2507.13260
Copy Paste: [[2507.13260]] Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy(https://arxiv.org/abs/2507.13260)
Keywords: transformer
Abstract: A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this work, we observe an approximate orthogonality among any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices.

Title: Merge Kernel for Bayesian Optimization on Permutation Space

Authors: Zikai Xie, Linjiang Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13263
Pdf URL: https://arxiv.org/pdf/2507.13263
Copy Paste: [[2507.13263]] Merge Kernel for Bayesian Optimization on Permutation Space(https://arxiv.org/abs/2507.13263)
Keywords: robust
Abstract: Bayesian Optimization (BO) is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
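Note: The quadratic cost attributed to the Mallows kernel comes from comparing every pair of elements. The sketch below illustrates only that baseline, using the standard textbook definition k(p, q) = exp(-lam * d(p, q)), where d is the Kendall tau distance (the number of pairs the two permutations order differently). It is not the paper's Merge Kernel, and the function names and the `lam` parameter are illustrative assumptions.

```python
import math
from itertools import combinations


def kendall_tau_distance(p, q):
    """Count the pairs of items that permutations p and q order differently.

    Visits all n*(n-1)/2 pairs, which is the quadratic pairwise enumeration.
    """
    pos_p = {item: idx for idx, item in enumerate(p)}
    pos_q = {item: idx for idx, item in enumerate(q)}
    discordant = 0
    for a, b in combinations(p, 2):
        # A negative product means a and b appear in opposite order in p and q.
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0:
            discordant += 1
    return discordant


def mallows_kernel(p, q, lam=1.0):
    """Mallows kernel: exp(-lam * Kendall tau distance). lam is a hypothetical default."""
    return math.exp(-lam * kendall_tau_distance(p, q))
```

For example, `mallows_kernel([0, 1, 2], [0, 1, 2])` evaluates to 1.0 because identical permutations have no discordant pairs, and the value decays as more pairs disagree.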

Title: Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management

Authors: Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.13275
Pdf URL: https://arxiv.org/pdf/2507.13275
Copy Paste: [[2507.13275]] Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management(https://arxiv.org/abs/2507.13275)
Keywords: robust, fair, large language model
Abstract: Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.

Title: Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis

Authors: Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13285
Pdf URL: https://arxiv.org/pdf/2507.13285
Copy Paste: [[2507.13285]] Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis(https://arxiv.org/abs/2507.13285)
Keywords: robust, extraction
Abstract: Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.

Title: DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation

Authors: Ekta Balkrishna Gavas, Chinmay Hegde, Nasir Memon, Sudipta Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13292
Pdf URL: https://arxiv.org/pdf/2507.13292
Copy Paste: [[2507.13292]] DiffClean: Diffusion-based Makeup Removal for Accurate Age Estimation(https://arxiv.org/abs/2507.13292)
Keywords: protect, attack, diffusion
Abstract: Accurate age verification can protect underage users from unauthorized access to online platforms and e-commerce sites that provide age-restricted services. However, accurate age estimation can be confounded by several factors, including facial makeup that can induce changes to alter perceived identity and age to fool both humans and machines. In this work, we propose DiffClean which erases makeup traces using a text-guided diffusion model to defend against makeup attacks. DiffClean improves age estimation (minor vs. adult accuracy by 4.8%) and face verification (TMR by 8.9% at FMR=0.01%) over competing baselines on digitally simulated and real makeup images.

Title: AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13300
Pdf URL: https://arxiv.org/pdf/2507.13300
Copy Paste: [[2507.13300]] AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research(https://arxiv.org/abs/2507.13300)
Keywords: large language model
Abstract: We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.

Title: Boosting Team Modeling through Tempo-Relational Representation Learning

Authors: Vincenzo Marco De Luca, Giovanna Varni, Andrea Passerini
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13305
Pdf URL: https://arxiv.org/pdf/2507.13305
Copy Paste: [[2507.13305]] Boosting Team Modeling through Tempo-Relational Representation Learning(https://arxiv.org/abs/2507.13305)
Keywords: explainability
Abstract: Team modeling remains a fundamental challenge at the intersection of Artificial Intelligence and the Social Sciences. Social Science research emphasizes the need to jointly model dynamics and relations, while practical applications demand unified models capable of inferring multiple team constructs simultaneously, providing interpretable insights and actionable recommendations to enhance team performance. However, existing works do not meet these practical demands. To bridge this gap, we present TRENN, a novel tempo-relational architecture that integrates: (i) an automatic temporal graph extractor, (ii) a tempo-relational encoder, (iii) a decoder for team construct prediction, and (iv) two complementary explainability modules. TRENN jointly captures relational and temporal team dynamics, providing a solid foundation for MT-TRENN, which extends TRENN by replacing the decoder with a multi-task head, enabling the model to learn shared Social Embeddings and simultaneously predict multiple team constructs, including Emergent Leadership, Leadership Style, and Teamwork components. Experimental results demonstrate that our approach significantly outperforms approaches that rely exclusively on temporal or relational information. Additionally, experimental evaluation has shown that the explainability modules integrated in MT-TRENN yield interpretable insights and actionable suggestions to support team improvement. These capabilities make our approach particularly well-suited for Human-Centered AI applications, such as intelligent decision-support systems in high-stakes collaborative environments.

Title: FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization

Authors: Chuancheng Shi, Yixiang Chen, Burong Lei, Jichao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13311
Pdf URL: https://arxiv.org/pdf/2507.13311
Copy Paste: [[2507.13311]] FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization(https://arxiv.org/abs/2507.13311)
Keywords: diffusion
Abstract: Realistic and controllable garment visualization is critical for fashion e-commerce, where users expect personalized previews under diverse poses and lighting conditions. Existing methods often rely on predefined poses, limiting semantic flexibility and illumination adaptability. To address this, we introduce FashionPose, the first unified text-to-pose-to-relighting generation framework. Given a natural language description, our method first predicts a 2D human pose, then employs a diffusion model to generate high-fidelity person images, and finally applies a lightweight relighting module, all guided by the same textual input. By replacing explicit pose annotations with text-driven conditioning, FashionPose enables accurate pose alignment, faithful garment rendering, and flexible lighting control. Experiments demonstrate fine-grained pose synthesis and efficient, consistent relighting, providing a practical solution for personalized virtual fashion display.

Title: A Crowdsensing Intrusion Detection Dataset For Decentralized Federated Learning Models

Authors: Chao Feng, Alberto Huertas Celdran, Jing Han, Heqing Ren, Xi Cheng, Zien Zeng, Lucas Krauter, Gerome Bovet, Burkhard Stiller
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.13313
Pdf URL: https://arxiv.org/pdf/2507.13313
Copy Paste: [[2507.13313]] A Crowdsensing Intrusion Detection Dataset For Decentralized Federated Learning Models(https://arxiv.org/abs/2507.13313)
Keywords: security, federate
Abstract: This paper introduces a dataset and experimental study for decentralized federated learning (DFL) applied to IoT crowdsensing malware detection. The dataset comprises behavioral records from benign and eight malware families. A total of 21,582,484 original records were collected from system calls, file system activities, resource usage, kernel events, input/output events, and network records. These records were aggregated into 30-second windows, resulting in 342,106 features used for model training and evaluation. Experiments on the DFL platform compare traditional machine learning (ML), centralized federated learning (CFL), and DFL across different node counts, topologies, and data distributions. Results show that DFL maintains competitive performance while preserving data locality, outperforming CFL in most settings. This dataset provides a solid foundation for studying the security of IoT crowdsensing environments.
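Note: The abstract states that raw records were aggregated into 30-second windows. A minimal sketch of that kind of windowing is shown below, assuming each record is a hypothetical `(timestamp_seconds, event_type)` pair and that a window is summarized by per-event counts. The dataset's actual record schema and aggregation functions are not given in the abstract.

```python
from collections import defaultdict

WINDOW_SECONDS = 30  # window length stated in the abstract


def aggregate_windows(records, window=WINDOW_SECONDS):
    """Group (timestamp_seconds, event_type) records into fixed-length windows.

    Returns {window_index: {event_type: count}}, ordered by window index.
    The count-per-event summary is an assumption made for illustration.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, event_type in records:
        # Integer division assigns each record to the window that contains it.
        windows[int(timestamp // window)][event_type] += 1
    return {idx: dict(counts) for idx, counts in sorted(windows.items())}
```

With this layout, a record stamped at 29.9 seconds falls in window 0 and a record stamped at exactly 30 seconds falls in window 1.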

Title: Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

Authors: Junsu Kim, Naeun Kim, Jaeho Lee, Incheol Park, Dongyoon Han, Seungryul Baek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13314
Pdf URL: https://arxiv.org/pdf/2507.13314
Copy Paste: [[2507.13314]] Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark(https://arxiv.org/abs/2507.13314)
Keywords: fair, large language model
Abstract: The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluations. Most notably, the benchmark utilizes different image indices from those of the original 3DPW dataset, forcing researchers into tedious and error-prone manual matching processes to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, collectively undermining reliable evaluations across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release these refined annotations as an open-source resource, thereby promoting consistent quantitative evaluations and facilitating future advancements in human pose-aware multimodal reasoning.

Title: GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

Authors: Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, Meeyoung Cha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13323
Pdf URL: https://arxiv.org/pdf/2507.13323
Copy Paste: [[2507.13323]] GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM(https://arxiv.org/abs/2507.13323)
Keywords: large language model
Abstract: Socio-economic indicators like regional GDP, population, and education levels are crucial to shaping policy decisions and fostering sustainable development. This research introduces GeoReg, a regression model that integrates diverse data sources, including satellite imagery and web-based geospatial information, to estimate these indicators even for data-scarce regions such as developing countries. Our approach leverages the prior knowledge of a large language model (LLM) to address the scarcity of labeled data, with the LLM functioning as a data engineer by extracting informative features to enable effective estimation in few-shot settings. Specifically, our model obtains contextual relationships between data features and the target indicator, categorizing their correlations as positive, negative, mixed, or irrelevant. These features are then fed into a linear estimator with tailored weight constraints for each category. To capture nonlinear patterns, the model also identifies meaningful feature interactions and integrates them, along with nonlinear transformations. Experiments across three countries at different stages of development demonstrate that our model outperforms baselines in estimating socio-economic indicators, even for low-income countries with limited data availability.

Title: The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Authors: Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13332
Pdf URL: https://arxiv.org/pdf/2507.13332
Copy Paste: [[2507.13332]] The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner(https://arxiv.org/abs/2507.13332)
Keywords: transformer, large language model
Abstract: Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine, linearly expanding the reasoning steps into atomic states to alleviate shortcut learning and using an explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.

Title: A Survey of Context Engineering for Large Language Models

Authors: Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13334
Pdf URL: https://arxiv.org/pdf/2507.13334
Copy Paste: [[2507.13334]] A Survey of Context Engineering for Large Language Models(https://arxiv.org/abs/2507.13334)
Keywords: large language model
Abstract: The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.

Title: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

Authors: Tyler Loakman, William Thorne, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.13335
Pdf URL: https://arxiv.org/pdf/2507.13335
Copy Paste: [[2507.13335]] Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes(https://arxiv.org/abs/2507.13335)
Keywords: large language model
Abstract: Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond "common sense", rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

Title: Training Transformers with Enforced Lipschitz Constants

Authors: Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.13338
Pdf URL: https://arxiv.org/pdf/2507.13338
Copy Paste: [[2507.13338]] Training Transformers with Enforced Lipschitz Constants(https://arxiv.org/abs/2507.13338)
Keywords: transformer
Abstract: Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods -- weight decay and spectral normalization -- allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon's update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to $10^{264}$. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.
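
Illustrative note (not from the paper): the abstract mentions spectral normalization and Lipschitz bounds without showing either. The sketch below is a minimal, generic version of both, assuming a plain stack of linear layers with 1-Lipschitz activations, so that the product of the layers' spectral norms is a valid upper bound. It is not the authors' weight constraint method, and all function names (`spectral_norm`, `spectral_normalize`, `lipschitz_upper_bound`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def spectral_norm(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(a * a for a in u))
        if norm == 0.0:
            return 0.0
        v = [a / norm for a in u]
    Wv = matvec(W, v)
    return math.sqrt(sum(a * a for a in Wv))


def spectral_normalize(W, max_norm=1.0):
    """Rescale W so its spectral norm is at most max_norm (assumed cap)."""
    sigma = spectral_norm(W)
    if sigma <= max_norm:
        return W
    scale = max_norm / sigma
    return [[w * scale for w in row] for row in W]


def lipschitz_upper_bound(layers):
    """Product of spectral norms: an upper bound on the Lipschitz constant
    of a stack of linear layers separated by 1-Lipschitz activations."""
    bound = 1.0
    for W in layers:
        bound *= spectral_norm(W)
    return bound


layers = [[[3.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 2.0]]]
constrained = [spectral_normalize(W, max_norm=1.0) for W in layers]
print(lipschitz_upper_bound(layers), lipschitz_upper_bound(constrained))
```

The product bound is usually loose, which is one reason a certified bound can grow very large as depth and per-layer norms increase.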

Title: Taming Diffusion Transformer for Real-Time Mobile Video Generation

Authors: Yushu Wu, Yanyu Li, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ke Ma, Arpit Sahni, Ju Hu, Aliaksandr Siarohin, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.13343
Pdf URL: https://arxiv.org/pdf/2507.13343
Copy Paste: [[2507.13343]] Taming Diffusion Transformer for Real-Time Mobile Video Generation(https://arxiv.org/abs/2507.13343)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a KD-guided, sensitivity-aware tri-level pruning strategy to shrink the model size to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.

Title: Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Authors: Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13344
Pdf URL: https://arxiv.org/pdf/2507.13344
Copy Paste: [[2507.13344]] Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models(https://arxiv.org/abs/2507.13344)
Keywords: diffusion
Abstract: This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: this https URL.

Title: $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning

Authors: Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.13347
Pdf URL: https://arxiv.org/pdf/2507.13347
Copy Paste: [[2507.13347]] $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning(https://arxiv.org/abs/2507.13347)
Keywords: robust
Abstract: We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.
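
Illustrative note (not from the paper): the abstract contrasts reference-anchored reconstruction with a permutation-equivariant design. The toy sketch below shows only what that property means, using 2D points as stand-ins for per-view outputs. It is not the $\pi^3$ architecture, and every function name is hypothetical. A function `fn` is permutation-equivariant when reordering the inputs reorders the outputs in the same way.

```python
def center_on_mean(views):
    """Order-independent: each view is expressed relative to the mean of all views."""
    n = len(views)
    dim = len(views[0])
    mean = [sum(v[d] for v in views) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in views]


def center_on_first(views):
    """Reference-anchored: each view is expressed relative to the first view."""
    ref = views[0]
    return [[a - r for a, r in zip(v, ref)] for v in views]


def permute(items, order):
    return [items[i] for i in order]


def is_permutation_equivariant(fn, views, order, tol=1e-9):
    """Check fn(permute(views)) == permute(fn(views)) for one given ordering."""
    lhs = fn(permute(views, order))
    rhs = permute(fn(views), order)
    return all(
        abs(a - b) <= tol
        for u, w in zip(lhs, rhs)
        for a, b in zip(u, w)
    )


views = [[0.0, 0.0], [1.0, 2.0], [4.0, 6.0]]
order = [2, 0, 1]
print(is_permutation_equivariant(center_on_mean, views, order))   # True
print(is_permutation_equivariant(center_on_first, views, order))  # False
```

The reference-anchored variant fails the check because its output depends on which view happens to come first, which is the kind of ordering sensitivity the abstract describes.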

Title: Hierarchical Rectified Flow Matching with Mini-Batch Couplings

Authors: Yichi Zhang, Yici Yan, Alex Schwing, Zhizhen Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.13350
Pdf URL: https://arxiv.org/pdf/2507.13350
Copy Paste: [[2507.13350]] Hierarchical Rectified Flow Matching with Mini-Batch Couplings(https://arxiv.org/abs/2507.13350)
Keywords: generative
Abstract: Flow matching has emerged as a compelling generative modeling approach that is widely used across domains. To generate data via a flow matching model, an ordinary differential equation (ODE) is numerically solved via forward integration of the modeled velocity field. To better capture the multi-modality that is inherent in typical velocity fields, hierarchical flow matching was recently introduced. It uses a hierarchy of ODEs that are numerically integrated when generating data. This hierarchy of ODEs captures the multi-modal velocity distribution, just as vanilla flow matching can model a multi-modal data distribution. While this hierarchy makes it possible to model multi-modal velocity distributions, the complexity of the modeled distribution remains identical across levels of the hierarchy. In this paper, we study how to gradually adjust the complexity of the distributions across different levels of the hierarchy via mini-batch couplings. We show the benefits of mini-batch couplings in hierarchical rectified flow matching via compelling results on synthetic and imaging data. Code is available at this https URL.
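
Illustrative note (not from the paper): the abstract says samples are generated by forward integration of a velocity field. The sketch below shows that step with a fixed-step Euler solver on a toy, closed-form velocity field, assuming linear interpolation paths toward a single target point, for which the velocity is (target - x) / (1 - t). A real model would replace this with a learned network. The hierarchical ODEs and mini-batch couplings are not shown, and all names are hypothetical.

```python
def euler_integrate(velocity, x0, num_steps=100):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_point_target_velocity(target):
    """Toy velocity field for straight-line paths ending at one target point."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target = [2.0, -1.0]
velocity = make_point_target_velocity(target)
sample = euler_integrate(velocity, x0=[0.0, 0.0], num_steps=100)
print(sample)  # close to the target [2.0, -1.0]
```

In this toy case the paths are straight lines, so Euler steps land on the target up to floating-point error. With a learned, curved velocity field the step count trades accuracy against compute.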

Title: VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Authors: Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.13353
Pdf URL: https://arxiv.org/pdf/2507.13353
Copy Paste: [[2507.13353]] VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding(https://arxiv.org/abs/2507.13353)
Keywords: large language model
Abstract: Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potential for video understanding.