2025-07-03

Title: A Systematic Review of Security Vulnerabilities in Smart Home Devices and Mitigation Techniques

Authors: Mohammed K. Alzaylaee
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01018
Pdf URL: https://arxiv.org/pdf/2507.01018
Copy Paste: [[2507.01018]] A Systematic Review of Security Vulnerabilities in Smart Home Devices and Mitigation Techniques(https://arxiv.org/abs/2507.01018)
Keywords: anomaly
Abstract: Smart homes that integrate Internet of Things (IoT) devices face increasing cybersecurity risks, posing significant challenges to these environments. The study explores security threats in smart homes ecosystems, categorizing them into vulnerabilities at the network layer, device level, and those from cloud-based and AI-driven systems. Research findings indicate that post-quantum encryption, coupled with AI-driven anomaly detection, is highly effective in enhancing security; however, computational resource demands present significant challenges. Blockchain authentication together with zero-trust structures builds security resilience, although they need changes to existing infrastructure. The specific security strategies show their effectiveness through ANOVA, Chi-square tests, and Monte Carlo simulations yet lack sufficient scalability according to the results. The research demonstrates the requirement for improvement in cryptographic techniques, alongside AI-enhanced threat detection and adaptive security models which must achieve a balance between performance and efficiency and real-time applicability within smart home ecosystems.

Title: Few-Shot Inspired Generative Zero-Shot Learning

Authors: Md Shakil Ahamed Shohag, Q. M. Jonathan Wu, Farhad Pourpanah
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01026
Pdf URL: https://arxiv.org/pdf/2507.01026
Copy Paste: [[2507.01026]] Few-Shot Inspired Generative Zero-Shot Learning(https://arxiv.org/abs/2507.01026)
Keywords: generative
Abstract: Generative zero-shot learning (ZSL) methods typically synthesize visual features for unseen classes using predefined semantic attributes, followed by training a fully supervised classification model. While effective, these methods require substantial computational resources and extensive synthetic data, thereby relaxing the original ZSL assumptions. In this paper, we propose FSIGenZ, a few-shot-inspired generative ZSL framework that reduces reliance on large-scale feature synthesis. Our key insight is that class-level attributes exhibit instance-level variability, i.e., some attributes may be absent or partially visible, yet conventional ZSL methods treat them as uniformly present. To address this, we introduce Model-Specific Attribute Scoring (MSAS), which dynamically re-scores class attributes based on model-specific optimization to approximate instance-level variability without access to unseen data. We further estimate group-level prototypes as clusters of instances based on MSAS-adjusted attribute scores, which serve as representative synthetic features for each unseen class. To mitigate the resulting data imbalance, we introduce a Dual-Purpose Semantic Regularization (DPSR) strategy while training a semantic-aware contrastive classifier (SCC) using these prototypes. Experiments on SUN, AwA2, and CUB benchmarks demonstrate that FSIGenZ achieves competitive performance using far fewer synthetic features.

Title: Dual Perspectives on Non-Contrastive Self-Supervised Learning

Authors: Jean Ponce (WILLOW), Martial Hebert (CMU), Basile Terver (FAIR, WILLOW)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01028
Pdf URL: https://arxiv.org/pdf/2507.01028
Copy Paste: [[2507.01028]] Dual Perspectives on Non-Contrastive Self-Supervised Learning(https://arxiv.org/abs/2507.01028)
Keywords: self-supervised
Abstract: The objective of non-contrastive approaches to self-supervised learning is to train on pairs of different views of the data an encoder and a predictor that minimize the mean discrepancy between the code predicted from the embedding of the first view and the embedding of the second one. In this setting, the stop gradient and exponential moving average iterative procedures are commonly used to avoid representation collapse, with excellent performance in downstream supervised applications. This presentation investigates these procedures from the dual theoretical viewpoints of optimization and dynamical systems. We first show that, in general, although they do not optimize the original objective, or for that matter, any other smooth function, they do avoid collapse. Following Tian et al. [2021], but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we finally show that the limit points of the dynamical systems associated with these two procedures are, in general, asymptotically stable equilibria, with no risk of degenerating to trivial solutions.

Title: PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning

Authors: Junjie Zhou, Yingli Zuo, Shichang Feng, Peng Wan, Qi Zhu, Daoqiang Zhang, Wei Shao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.01029
Pdf URL: https://arxiv.org/pdf/2507.01029
Copy Paste: [[2507.01029]] PathCoT: Chain-of-Thought Prompting for Zero-shot Pathology Visual Reasoning(https://arxiv.org/abs/2507.01029)
Keywords: generative
Abstract: With the development of generative artificial intelligence and instruction tuning techniques, multimodal large language models (MLLMs) have made impressive progress on general reasoning tasks. Benefiting from the chain-of-thought (CoT) methodology, MLLMs can solve the visual reasoning problem step-by-step. However, existing MLLMs still face significant challenges when applied to pathology visual reasoning tasks: (1) LLMs often underperforms because they lack domain-specific information, which can lead to model hallucinations. (2) The additional reasoning steps in CoT may introduce errors, leading to the divergence of answers. To address these limitations, we propose PathCoT, a novel zero-shot CoT prompting method which integrates the pathology expert-knowledge into the reasoning process of MLLMs and incorporates self-evaluation to mitigate divergence of answers. Specifically, PathCoT guides the MLLM with prior knowledge to perform as pathology experts, and provides comprehensive analysis of the image with their domain-specific knowledge. By incorporating the experts' knowledge, PathCoT can obtain the answers with CoT reasoning. Furthermore, PathCoT incorporates a self-evaluation step that assesses both the results generated directly by MLLMs and those derived through CoT, finally determining the reliable answer. The experimental results on the PathMMU dataset demonstrate the effectiveness of our method on pathology visual understanding and reasoning.

Title: Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals

Authors: Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H. Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, Antonio Luiz P. Ribeiro, Zhangdaihong Liu, David A. Clifton
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2507.01045
Pdf URL: https://arxiv.org/pdf/2507.01045
Copy Paste: [[2507.01045]] Sensing Cardiac Health Across Scenarios and Devices: A Multi-Modal Foundation Model Pretrained on Heterogeneous Data from 1.7 Million Individuals(https://arxiv.org/abs/2507.01045)
Keywords: foundation model, generative
Abstract: Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robustness and generalizability across diverse clinical settings and acquisition protocols. In this study, we present a cardiac sensing foundation model (CSFM) that leverages advanced transformer architectures and a generative, masked pretraining strategy to learn unified representations from vast, heterogeneous health records. Our model is pretrained on an innovative multi-modal integration of data from multiple large-scale datasets (including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE), comprising cardiac signals and the corresponding clinical or machine-generated text reports from approximately 1.7 million individuals. We demonstrate that the embeddings derived from our CSFM not only serve as effective feature extractors across diverse cardiac sensing scenarios, but also enable seamless transfer learning across varying input configurations and sensor modalities. Extensive evaluations across diagnostic tasks, demographic information recognition, vital sign measurement, clinical outcome prediction, and ECG question answering reveal that CSFM consistently outperforms traditional one-modal-one-task approaches. Notably, CSFM exhibits robust performance across multiple ECG lead configurations from standard 12-lead systems to single-lead setups, and in scenarios where only ECG, only PPG, or a combination thereof is available. These findings highlight the potential of CSFM as a versatile and scalable solution, for comprehensive cardiac monitoring.

Title: XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science

Authors: Jithendaraa Subramanian, Linda Hung, Daniel Schweigert, Santosh Suram, Weike Ye
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01054
Pdf URL: https://arxiv.org/pdf/2507.01054
Copy Paste: [[2507.01054]] XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science(https://arxiv.org/abs/2507.01054)
Keywords: self-supervised, foundation model
Abstract: Recent advances in materials discovery have been driven by structure-based models, particularly those using crystal graphs. While effective for computational datasets, these models are impractical for real-world applications where atomic structures are often unknown or difficult to obtain. We propose a scalable multimodal framework that learns directly from elemental composition and X-ray diffraction (XRD) -- two of the more available modalities in experimental workflows without requiring crystal structure input. Our architecture integrates modality-specific encoders with a cross-attention fusion module and is trained on the 5-million-sample Alexandria dataset. We present masked XRD modeling (MXM), and apply MXM and contrastive alignment as self-supervised pretraining strategies. Pretraining yields faster convergence (up to 4.2x speedup) and improves both accuracy and representation quality. We further demonstrate that multimodal performance scales more favorably with dataset size than unimodal baselines, with gains compounding at larger data regimes. Our results establish a path toward structure-free, experimentally grounded foundation models for materials science.

Title: Good Enough to Learn: LLM-based Anomaly Detection in ECU Logs without Reliable Labels

Authors: Bogdan Bogdan, Arina Cazacu, Laura Vasilie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01077
Pdf URL: https://arxiv.org/pdf/2507.01077
Copy Paste: [[2507.01077]] Good Enough to Learn: LLM-based Anomaly Detection in ECU Logs without Reliable Labels(https://arxiv.org/abs/2507.01077)
Keywords: generative, anomaly
Abstract: Anomaly detection often relies on supervised or clustering approaches, with limited success in specialized domains like automotive communication systems where scalable solutions are essential. We propose a novel decoder-only Large Language Model (LLM) to detect anomalies in Electronic Control Unit (ECU) communication logs. Our approach addresses two key challenges: the lack of LLMs tailored for ECU communication and the complexity of inconsistent ground truth data. By learning from UDP communication logs, we formulate anomaly detection simply as identifying deviations in time from normal behavior. We introduce an entropy regularization technique that increases model's uncertainty in known anomalies while maintaining consistency in similar scenarios. Our solution offers three novelties: a decoder-only anomaly detection architecture, a way to handle inconsistent labeling, and an adaptable LLM for different ECU communication use cases. By leveraging the generative capabilities of decoder-only models, we present a new technique that addresses the high cost and error-prone nature of manual labeling through a more scalable system that is able to learn from a minimal set of examples, while improving detection accuracy in complex communication environments.

Title: Event-based evaluation of abstractive news summarization

Authors: Huiling You, Samia Touileb, Erik Velldal, Lilja Øvrelid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01160
Pdf URL: https://arxiv.org/pdf/2507.01160
Copy Paste: [[2507.01160]] Event-based evaluation of abstractive news summarization(https://arxiv.org/abs/2507.01160)
Keywords: generative
Abstract: An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of automatically generated summaries by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both events annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.

Title: Diffusion Explorer: Interactive Exploration of Diffusion Models

Authors: Alec Helbling, Duen Horng Chau
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01178
Pdf URL: https://arxiv.org/pdf/2507.01178
Copy Paste: [[2507.01178]] Diffusion Explorer: Interactive Exploration of Diffusion Models(https://arxiv.org/abs/2507.01178)
Keywords: diffusion
Abstract: Diffusion models have been central to the development of recent image, video, and even text generation systems. They posses striking geometric properties that can be faithfully portrayed in low-dimensional settings. However, existing resources for explaining diffusion either require an advanced theoretical foundation or focus on their neural network architectures rather than their rich geometric properties. We introduce Diffusion Explorer, an interactive tool to explain the geometric properties of diffusion models. Users can train 2D diffusion models in the browser and observe the temporal dynamics of their sampling process. Diffusion Explorer leverages interactive animation, which has been shown to be a powerful tool for making engaging visualizations of dynamic systems, making it well suited to explaining diffusion models which represent stochastic processes that evolve over time. Diffusion Explorer is open source and a live demo is available at this http URL.

Title: Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-tuning

Authors: Na Lee, Konstantinos Barmpas, Yannis Panagakis, Dimitrios Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.01196
Pdf URL: https://arxiv.org/pdf/2507.01196
Copy Paste: [[2507.01196]] Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-tuning(https://arxiv.org/abs/2507.01196)
Keywords: foundation model
Abstract: Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs' current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.

Title: Escaping Platos Cave: JAM for Aligning Independently Trained Vision and Language Models

Authors: Hyoseo (Lauren)Yoon, Yisong Yue, Been Kim
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.01201
Pdf URL: https://arxiv.org/pdf/2507.01201
Copy Paste: [[2507.01201]] Escaping Platos Cave: JAM for Aligning Independently Trained Vision and Language Models(https://arxiv.org/abs/2507.01201)
Keywords: foundation model
Abstract: Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. Yet an emerging hypothesis - the Platonic Representation Hypothesis - suggests that such models may nonetheless converge toward a shared statistical model of reality. This compatibility, if it exists, raises a fundamental question: can we move beyond post-hoc statistical detection of alignment and explicitly optimize for it between such disjoint representations? We cast this Platonic alignment problem as a multi-objective optimization task - preserve each modality's native structure while aligning for mutual coherence. We introduce the Joint Autoencoder Modulator (JAM) framework that jointly trains modality-specific autoencoders on the latent representations of pre-trained single modality models, encouraging alignment through both reconstruction and cross-modal objectives. By analogy, this framework serves as a method to escape Plato's Cave, enabling the emergence of shared structure from disjoint inputs. We evaluate this framework across three critical design axes: (i) the alignment objective - comparing contrastive loss (Con), its hard-negative variant (NegCon), and our Spread loss, (ii) the layer depth at which alignment is most effective, and (iii) the impact of foundation model scale on representational convergence. Our findings show that our lightweight Pareto-efficient framework reliably induces alignment, even across frozen, independently trained representations, offering both theoretical insight and practical pathways for transforming generalist unimodal foundations into specialist multimodal models.

Title: Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

Authors: Chengxu Liu, Lu Qi, Jinshan Pan, Xueming Qian, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01275
Pdf URL: https://arxiv.org/pdf/2507.01275
Copy Paste: [[2507.01275]] Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing(https://arxiv.org/abs/2507.01275)
Keywords: diffusion, generative
Abstract: Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (\ie,~haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model, named \ours, for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and perform the DMs to yield the amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains, as well as provide supervision for the DMs training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our \ours outperforms other state-of-the-art methods on both synthetic and real-world datasets.

Title: DiffusionLight-Turbo: Accelerated Light Probes for Free via Single-Pass Chrome Ball Inpainting

Authors: Worameth Chinchuthakun, Pakkapon Phongthawee, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01305
Pdf URL: https://arxiv.org/pdf/2507.01305
Copy Paste: [[2507.01305]] DiffusionLight-Turbo: Accelerated Light Probes for Free via Single-Pass Chrome Ball Inpainting(https://arxiv.org/abs/2507.01305)
Keywords: diffusion
Abstract: We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight, which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results that show our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. Our code is available at this https URL

Title: ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks

Authors: Zhiyao Ren, Siyuan Liang, Aishan Liu, Dacheng Tao
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.01321
Pdf URL: https://arxiv.org/pdf/2507.01321
Copy Paste: [[2507.01321]] ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks(https://arxiv.org/abs/2507.01321)
Keywords: in-context
Abstract: In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4).

Title: Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy

Authors: Xiaoyun Zhang, Jingqing Ruan, Xing Ma, Yawen Zhu, Jiansong Chen, Ke Zeng, Xunliang Cai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01327
Pdf URL: https://arxiv.org/pdf/2507.01327
Copy Paste: [[2507.01327]] Reasoner for Real-World Event Detection: Scaling Reinforcement Learning via Adaptive Perplexity-Aware Sampling Strategy(https://arxiv.org/abs/2507.01327)
Keywords: anomaly
Abstract: Detecting abnormal events in real-world customer service dialogues is highly challenging due to the complexity of business data and the dynamic nature of customer interactions. Moreover, models must demonstrate strong out-of-domain (OOD) generalization to enable rapid adaptation across different business scenarios and maximize commercial value. In this work, we propose a novel Adaptive Perplexity-Aware Reinforcement Learning (APARL) framework that leverages the advanced reasoning capabilities of large language models for abnormal event detection. APARL introduces a dual-loop dynamic curriculum learning architecture, enabling the model to progressively focus on more challenging samples as its proficiency increases. This design effectively addresses performance bottlenecks and significantly enhances OOD transferability. Extensive evaluations on food delivery dialogue tasks show that our model achieves significantly enhanced adaptability and robustness, attaining the highest F1 score with an average improvement of 17.19\%, and an average improvement of 9.59\% in OOD transfer tests. This method provides a superior solution for industrial deployment of anomaly detection models, contributing to improved operational efficiency and commercial benefits.

Title: Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

Authors: Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01347
Pdf URL: https://arxiv.org/pdf/2507.01347
Copy Paste: [[2507.01347]] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation(https://arxiv.org/abs/2507.01347)
Keywords: self-supervised
Abstract: We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which unlike other existing Test-Time Augmentation approaches from the literature is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation, that randomly perturbs multiple times the PCA subspace projection of a test input, GTTA forms robust ensembles at test time in which, due to sound statistical properties, the structural and systematic noises in the initial input data is filtered out and final estimator errors are reduced. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus reducing significantly the test time computational cost, at no loss in accuracy. Our tests and comparisons to strong TTA approaches and SoTA models on various vision and non-vision well-known datasets and tasks, such as image classification and segmentation, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.

Title: Efficient Kilometer-Scale Precipitation Downscaling with Conditional Wavelet Diffusion

Authors: Chugang Yi, Minghan Yu, Weikang Qian, Yixin Wen, Haizhao Yang
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2507.01354
Pdf URL: https://arxiv.org/pdf/2507.01354
Copy Paste: [[2507.01354]] Efficient Kilometer-Scale Precipitation Downscaling with Conditional Wavelet Diffusion(https://arxiv.org/abs/2507.01354)
Keywords: diffusion, generative
Abstract: Effective hydrological modeling and extreme weather analysis demand precipitation data at a kilometer-scale resolution, which is significantly finer than the 10 km scale offered by standard global products like IMERG. To address this, we propose the Wavelet Diffusion Model (WDM), a generative framework that achieves 10x spatial super-resolution (downscaling to 1 km) and delivers a 9x inference speedup over pixel-based diffusion models. WDM is a conditional diffusion model that learns the learns the complex structure of precipitation from MRMS radar data directly in the wavelet domain. By focusing on high-frequency wavelet coefficients, it generates exceptionally realistic and detailed 1-km precipitation fields. This wavelet-based approach produces visually superior results with fewer artifacts than pixel-space models, and delivers a significant gains in sampling efficiency. Our results demonstrate that WDM provides a robust solution to the dual challenges of accuracy and speed in geoscience super-resolution, paving the way for more reliable hydrological forecasts.

Title: Activation Reward Models for Few-Shot Model Alignment

Authors: Tianning Chai, Chancharik Mitra, Brandon Huang, Gautam Rajendrakumar Gare, Zhiqiu Lin, Assaf Arbelle, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Deva Ramanan, Roei Herzig
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01368
Pdf URL: https://arxiv.org/pdf/2507.01368
Copy Paste: [[2507.01368]] Activation Reward Models for Few-Shot Model Alignment(https://arxiv.org/abs/2507.01368)
Keywords: generative, in-context
Abstract: Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel few-shot setting benchmark, the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RM achieves state-of-the-art performance on this benchmark, surpassing even GPT-4o.

Title: Distributional Soft Actor-Critic with Diffusion Policy

Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01381
Pdf URL: https://arxiv.org/pdf/2507.01381
Copy Paste: [[2507.01381]] Distributional Soft Actor-Critic with Diffusion Policy(https://arxiv.org/abs/2507.01381)
Keywords: diffusion
Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional reinforcement learning algorithm called DSAC-D (Distributed Soft Actor Critic with Diffusion Policy) to address the challenges of estimating bias in value functions and obtaining multimodal policy representations. A multimodal distributional policy iteration framework that can converge to the optimal policy was established by introducing policy entropy and value distribution function. A diffusion value network that can accurately characterize the distribution of multi peaks was constructed by generating a set of reward samples through reverse sampling using a diffusion model. Based on this, a distributional reinforcement learning algorithm with dual diffusion of the value network and the policy network was derived. MuJoCo testing tasks demonstrate that the proposed algorithm not only learns multimodal policy, but also achieves state-of-the-art (SOTA) performance in all 9 control tasks, with significant suppression of estimation bias and total average return improvement of over 10\% compared to existing mainstream algorithms. The results of real vehicle testing show that DSAC-D can accurately characterize the multimodal distribution of different driving styles, and the diffusion policy network can characterize multimodal trajectories.

Title: Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound

Authors: Huanwen Liang, Jingxian Xu, Yuanji Zhang, Yuhao Huang, Yuhan Zhang, Xin Yang, Ran Li, Xuedong Deng, Yanjun Liu, Guowei Tao, Yun Wu, Sheng Zhao, Xinru Gao, Dong Ni
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01401
Pdf URL: https://arxiv.org/pdf/2507.01401
Copy Paste: [[2507.01401]] Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound(https://arxiv.org/abs/2507.01401)
Keywords: self-supervised
Abstract: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Secondly, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case-level. Finally, we propose a prompt-based prototype learning (PPL) to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms the state-of-the-art competitors. Codes are available at:this https URL.

Title: CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

Authors: Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, Yoshitaka Ushiku
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01409
Pdf URL: https://arxiv.org/pdf/2507.01409
Copy Paste: [[2507.01409]] CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning(https://arxiv.org/abs/2507.01409)
Keywords: generative
Abstract: An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition its language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506\% despite better lexical alignment. Code will be available on this https URL.

Title: Decomposing Prediction Mechanisms for In-Context Recall

Authors: Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01414
Pdf URL: https://arxiv.org/pdf/2507.01414
Copy Paste: [[2507.01414]] Decomposing Prediction Mechanisms for In-Context Recall(https://arxiv.org/abs/2507.01414)
Keywords: in-context
Abstract: We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study if the transformer models can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continuing to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism (manifesting as separate phase transitions) phenomenon is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

Title: DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal

Authors: Wenjie Liu, Bingshu Wang, Ze Wang, C.L. Philip Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01422
Pdf URL: https://arxiv.org/pdf/2507.01422
Copy Paste: [[2507.01422]] DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal(https://arxiv.org/abs/2507.01422)
Keywords: diffusion
Abstract: Document shadow removal is a crucial task in the field of document image enhancement. However, existing methods tend to remove shadows with constant color background and ignore color shadows. In this paper, we first design a diffusion model in latent space for document image shadow removal, called DocShaDiffusion. It translates shadow images from pixel space to latent space, enabling the model to more easily capture essential features. To address the issue of color shadows, we design a shadow soft-mask generation module (SSGM). It is able to produce accurate shadow mask and add noise into shadow regions specially. Guided by the shadow mask, a shadow mask-aware guided diffusion module (SMGDM) is proposed to remove shadows from document images by supervising the diffusion and denoising process. We also propose a shadow-robust perceptual feature loss to preserve details and structures in document images. Moreover, we develop a large-scale synthetic document color shadow removal dataset (SDCSRD). It simulates the distribution of realistic color shadows and provides powerful supports for the training of models. Experiments on three public datasets validate the proposed method's superiority over state-of-the-art. Our code and dataset will be publicly available.

Title: DiffMark: Diffusion-based Robust Watermark Against Deepfakes

Authors: Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.01428
Pdf URL: https://arxiv.org/pdf/2507.01428
Copy Paste: [[2507.01428]] DiffMark: Diffusion-based Robust Watermark Against Deepfakes(https://arxiv.org/abs/2507.01428)
Keywords: diffusion
Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack the sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image during generation. In this study, we propose a novel robust watermarking framework based on diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate corresponding watermarked image. In the construction of facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity with the decrease of noise, thus better adapting to the sampling process of diffusion model. To achieve the fusion of watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at this https URL.

Title: OoDDINO:A Multi-level Framework for Anomaly Segmentation on Complex Road Scenes

Authors: Yuxing Liu, Ji Zhang, Zhou Xuchuan, Jingzhong Xiao, Huimin Yang, Jiaxin Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01455
Pdf URL: https://arxiv.org/pdf/2507.01455
Copy Paste: [[2507.01455]] OoDDINO:A Multi-level Framework for Anomaly Segmentation on Complex Road Scenes(https://arxiv.org/abs/2507.01455)
Keywords: anomaly
Abstract: Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variabil ity in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models, which acts as a plug-in to boost the performance. Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods.

Title: NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation

Authors: Max Gandyra, Alessandro Santonicola, Michael Beetz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01463
Pdf URL: https://arxiv.org/pdf/2507.01463
Copy Paste: [[2507.01463]] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation(https://arxiv.org/abs/2507.01463)
Keywords: foundation model
Abstract: Instance segmentation of novel objects instances in RGB images, given some example images for each object, is a well known problem in computer vision. Designing a model general enough to be employed, for all kinds of novel objects, without (re-) training, has proven to be a difficult task. To handle this, we propose a simple, yet powerful, framework, called: Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). This work stems from and improves upon previous ones like CNOS, SAM-6D and NIDS-Net; thus, it also leverages on recent vision foundation models, namely: Grounded-SAM 2 and DINOv2. It utilises Grounded-SAM 2 to obtain object proposals with precise bounding boxes and their corresponding segmentation masks; while DINOv2's zero-shot capabilities are employed to generate the image embeddings. The quality of those masks, together with their embeddings, is of vital importance to our approach; as the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings. Differently to SAM-6D, calculating the latter involves a prior patch filtering based on the distance between each patch and its corresponding cyclic/roundtrip patch in the image grid. Furthermore, the average confidence of the proposals' bounding box and mask is used as an additional weighting factor for the object matching score. We empirically show that NOCTIS, without further training/fine tuning, outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.

Title: Representation Entanglement for Generation:Training Diffusion Transformers Is Much Easier Than You Think

Authors: Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01467
Pdf URL: https://arxiv.org/pdf/2507.01467
Copy Paste: [[2507.01467]] Representation Entanglement for Generation:Training Diffusion Transformers Is Much Easier Than You Think(https://arxiv.org/abs/2507.01467)
Keywords: diffusion, foundation model
Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: this https URL.

Title: Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities

Authors: Yingqiang Gao, Kaede Johnson, David Froehlich, Luisa Carrer, Sarah Ebling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01479
Pdf URL: https://arxiv.org/pdf/2507.01479
Copy Paste: [[2507.01479]] Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities(https://arxiv.org/abs/2507.01479)
Keywords: generative
Abstract: Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in generative AI, especially large language models (LLMs), have substantially improved the quality of machine-generated text simplifications, thereby mitigating information barriers for the target group. However, existing LLM-based ATS systems do not incorporate preference feedback on text simplifications during training, resulting in a lack of personalization tailored to the specific needs of target group representatives. In this work, we extend the standard supervised fine-tuning (SFT) approach for adapting LLM-based ATS models by leveraging a computationally efficient LLM alignment technique -- direct preference optimization (DPO). Specifically, we post-train LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences on paired text simplifications generated by mainstream LLMs. Furthermore, we propose a pipeline for developing personalized LLM-based ATS systems, encompassing data collection, model selection, SFT and DPO post-training, and evaluation. Our findings underscore the necessity of active participation of target group persons in designing personalized AI accessibility solutions aligned with human expectations. This work represents a step towards personalizing inclusive AI systems at the target-group level, incorporating insights not only from text simplification experts but also from target group persons themselves.

Title: AVC-DPO: Aligned Video Captioning via Direct Preference Optimization

Authors: Jiyang Tang, Hengyi Li, Yifan Du, Wayne Xin Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01492
Pdf URL: https://arxiv.org/pdf/2507.01492
Copy Paste: [[2507.01492]] AVC-DPO: Aligned Video Captioning via Direct Preference Optimization(https://arxiv.org/abs/2507.01492)
Keywords: foundation model
Abstract: Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information-two key factors that humans care about when watching a video-thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model's caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. Using this framework, we have achieved exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, achieving first place on the Video Detailed Captioning (VDC) benchmark according to the VDCSCORE evaluation metric.

Title: ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

Authors: Jimyeong Kim, Jungwon Park, Yeji Song, Nojun Kwak, Wonjong Rhee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01496
Pdf URL: https://arxiv.org/pdf/2507.01496
Copy Paste: [[2507.01496]] ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation(https://arxiv.org/abs/2507.01496)
Keywords: diffusion
Abstract: Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.

Title: Loss Functions in Diffusion Models: A Comparative Study

Authors: Dibyanshu Kumar, Philipp Vaeth, Magda Gregorová
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01516
Pdf URL: https://arxiv.org/pdf/2507.01516
Copy Paste: [[2507.01516]] Loss Functions in Diffusion Models: A Comparative Study(https://arxiv.org/abs/2507.01516)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative models, inspiring extensive research into their underlying mechanisms. One of the key questions in this area is the loss functions these models shall train with. Multiple formulations have been introduced in the literature over the past several years with some links and some critical differences stemming from various initial considerations. In this paper, we explore the different target objectives and corresponding loss functions in detail. We present a systematic overview of their relationships, unifying them under the framework of the variational lower bound objective. We complement this theoretical analysis with an empirical study providing insights into the conditions under which these objectives diverge in performance and the underlying factors contributing to such deviations. Additionally, we evaluate how the choice of objective impacts the model ability to achieve specific goals, such as generating high-quality samples or accurately estimating likelihoods. This study offers a unified understanding of loss functions in diffusion models, contributing to more efficient and goal-oriented model designs in future research.

Title: MARVIS: Modality Adaptive Reasoning over VISualizations

Authors: Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01544
Pdf URL: https://arxiv.org/pdf/2507.01544
Copy Paste: [[2507.01544]] MARVIS: Modality Adaptive Reasoning over VISualizations(https://arxiv.org/abs/2507.01544)
Keywords: foundation model
Abstract: Scientific applications of machine learning often rely on small, specialized models tuned to particular domains. Such models often achieve excellent performance, but lack flexibility. Foundation models offer versatility, but typically underperform specialized approaches, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to predict any data modality with high accuracy. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to successfully interpret and utilize them. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B parameter model, achieving results that beat Gemini by 16\% on average and approach specialized methods, without exposing personally identifiable information (P.I.I.) or requiring any domain-specific training. We open source our code and datasets at this https URL

Title: A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

Authors: Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01573
Pdf URL: https://arxiv.org/pdf/2507.01573
Copy Paste: [[2507.01573]] A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation(https://arxiv.org/abs/2507.01573)
Keywords: diffusion, generative
Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at this https URL.

Title: SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation

Authors: Bryan Constantine Sadihin, Michael Hua Wang, Shei Pern Chua, Hang Su
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01586
Pdf URL: https://arxiv.org/pdf/2507.01586
Copy Paste: [[2507.01586]] SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation(https://arxiv.org/abs/2507.01586)
Keywords: diffusion
Abstract: The production of high-quality 2D animation is highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters accompanied with LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: this https URL .

Title: DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

Authors: Yue-Jiang Dong, Wang Zhao, Jiale Xu, Ying Shan, Song-Hai Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01603
Pdf URL: https://arxiv.org/pdf/2507.01603
Copy Paste: [[2507.01603]] DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation(https://arxiv.org/abs/2507.01603)
Keywords: diffusion
Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.

Title: Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

Authors: Xu Zhang, Ming Lu, Yan Chen, Zhan Ma
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.01608
Pdf URL: https://arxiv.org/pdf/2507.01608
Copy Paste: [[2507.01608]] Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference(https://arxiv.org/abs/2507.01608)
Keywords: generative
Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at this https URL.

Title: SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement

Authors: Weijie Yin, Dingkang Yang, Hongyuan Dong, Zijian Kang, Jiacong Wang, Xiao Liang, Chao Feng, Jiao Ran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01643
Pdf URL: https://arxiv.org/pdf/2507.01643
Copy Paste: [[2507.01643]] SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement(https://arxiv.org/abs/2507.01643)
Keywords: self-supervised
Abstract: Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter initialization conflicts and modality semantic gaps. To address the above challenges, this paper proposes SAILViT, a gradual feature learning-enhanced ViT for facilitating MLLMs to break through performance bottlenecks in complex multimodal interactions. SAILViT achieves coarse-to-fine-grained feature alignment and world knowledge infusion with gradual feature refinement, which better serves target training demands. We perform thorough empirical analyses to confirm the powerful robustness and generalizability of SAILViT across different dimensions, including parameter sizes, model architectures, training strategies, and data scales. Equipped with SAILViT, existing MLLMs show significant and consistent performance improvements on the OpenCompass benchmark across extensive downstream tasks. SAILViT series models are released at this https URL.

Title: RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

Authors: Yuran Wang, Yingping Liang, Yutao Hu, Ying Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01653
Pdf URL: https://arxiv.org/pdf/2507.01653
Copy Paste: [[2507.01653]] RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather(https://arxiv.org/abs/2507.01653)
Keywords: diffusion
Abstract: Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose \textbf{RobuSTereo}, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models' robustness to unseen weather conditions. The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that \textbf{RobuSTereo} significantly improves the robustness and generalization of stereo matching models across diverse adverse weather scenarios.

Title: Graph Representation-based Model Poisoning on Federated LLMs in CyberEdge Networks

Authors: Hanlin Cai, Haofan Dong, Houtianfu Wang, Kai Li, Ozgur B. Akan
Subjects: cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2507.01694
Pdf URL: https://arxiv.org/pdf/2507.01694
Copy Paste: [[2507.01694]] Graph Representation-based Model Poisoning on Federated LLMs in CyberEdge Networks(https://arxiv.org/abs/2507.01694)
Keywords: generative
Abstract: Federated large language models (FedLLMs) provide powerful generative capabilities in CyberEdge networks while protecting data privacy. However, FedLLMs remains highly vulnerable to model poisoning attacks. This article first reviews recent model poisoning techniques and existing defense mechanisms for FedLLMs, highlighting critical limitations, particularly under non-IID text distributions. In particular, current defenses primarily utilize distance-based outlier detection or norm constraints, operating under the assumption that adversarial updates significantly diverge from benign statistics. This assumption can fail when facing adaptive attackers targeting billionparameter LLMs. Next, this article investigates emerging Graph Representation-Based Model Poisoning (GRMP), a novel attack paradigm that leverages higher-order correlations among honest client gradients to synthesize malicious updates indistinguishable from legitimate model updates. GRMP can effectively evade advanced defenses, resulting in substantial accuracy loss and performance degradation. Moreover, this article outlines a research roadmap emphasizing the importance of graph-aware secure aggregation methods, FedLLMs-specific vulnerability metrics, and evaluation frameworks to strengthen the robustness of future federated language model deployments.

Title: LLMs for Legal Subsumption in German Employment Contracts

Authors: Oliver Wardas, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01734
Pdf URL: https://arxiv.org/pdf/2507.01734
Copy Paste: [[2507.01734]] LLMs for Legal Subsumption in German Employment Contracts(https://arxiv.org/abs/2507.01734)
Keywords: in-context
Abstract: Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as "valid," "unfair," or "void" under three legal context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (referred to as examination guidelines). Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80\%. Despite these advancements, LLMs' performance when using full-text sources remains substantially below that of human lawyers. We contribute an extended dataset, including examination guidelines, referenced legal sources, and corresponding annotations, alongside our code and all log files. Our findings highlight the potential of LLMs to assist lawyers in contract legality review while also underscoring the limitations of the methods presented.

Title: HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

Authors: Lin Wu, Zhixiang Chen, Jianglin Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01737
Pdf URL: https://arxiv.org/pdf/2507.01737
Copy Paste: [[2507.01737]] HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion(https://arxiv.org/abs/2507.01737)
Keywords: diffusion
Abstract: Generating realistic 3D human-object interactions (HOIs) remains a challenging task due to the difficulty of modeling detailed interaction dynamics. Existing methods treat human and object motions independently, resulting in physically implausible and causally inconsistent behaviors. In this work, we present HOI-Dyn, a novel framework that formulates HOI generation as a driver-responder system, where human actions drive object responses. At the core of our method is a lightweight transformer-based interaction dynamics model that explicitly predicts how objects should react to human motion. To further enforce consistency, we introduce a residual-based dynamics loss that mitigates the impact of dynamics prediction errors and prevents misleading optimization signals. The dynamics model is used only during training, preserving inference efficiency. Through extensive qualitative and quantitative experiments, we demonstrate that our approach not only enhances the quality of HOI generation but also establishes a feasible metric for evaluating the quality of generated interactions.

Title: Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans

Authors: Benjamin Jin, Grant Mair, Joanna M. Wardlaw, Maria del C. Valdés Hernández
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01744
Pdf URL: https://arxiv.org/pdf/2507.01744
Copy Paste: [[2507.01744]] Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans(https://arxiv.org/abs/2507.01744)
Keywords: self-supervised
Abstract: Vision Transformers (ViTs) have gained significant popularity in the natural image domain but have been less successful in 3D medical image segmentation. Nevertheless, 3D ViTs are particularly interesting for large medical imaging volumes due to their efficient self-supervised training within the masked autoencoder (MAE) framework, which enables the use of imaging data without the need for expensive manual annotations. intracranial arterial calcification (IAC) is an imaging biomarker visible on routinely acquired CT scans linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could enable their large-scale risk assessment. We pre-train ViTs with MAE and fine-tune them for IAC segmentation for the first time. To develop our models, we use highly heterogeneous data from a large clinical trial, the third International Stroke Trial (IST-3). We evaluate key aspects of MAE pre-trained ViTs in IAC segmentation, and analyse the clinical implications. We show: 1) our calibrated self-supervised ViT beats a strong supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes are crucial for ViTs for IAC segmentation and interpolation upsampling with regular convolutions is preferable to transposed convolutions for ViT-based models, and 3) our ViTs increase robustness to higher slice thicknesses and improve risk group classification in a clinical scenario by 46%. Our code is available online.

Title: SSL4SAR: Self-Supervised Learning for Glacier Calving Front Extraction from SAR Imagery

Authors: Nora Gourmelon, Marcel Dreier, Martin Mayr, Thorsten Seehaus, Dakota Pyles, Matthias Braun, Andreas Maier, Vincent Christlein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01747
Pdf URL: https://arxiv.org/pdf/2507.01747
Copy Paste: [[2507.01747]] SSL4SAR: Self-Supervised Learning for Glacier Calving Front Extraction from SAR Imagery(https://arxiv.org/abs/2507.01747)
Keywords: self-supervised
Abstract: Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, they are suboptimal due to the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular for Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the "CAlving Fronts and where to Find thEm" (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.

Title: Enhanced Generative Model Evaluation with Clipped Density and Coverage

Authors: Nicolas Salvy, Hugues Talbot, Bertrand Thirion
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.01761
Pdf URL: https://arxiv.org/pdf/2507.01761
Copy Paste: [[2507.01761]] Enhanced Generative Model Evaluation with Clipped Density and Coverage(https://arxiv.org/abs/2507.01761)
Keywords: generative
Abstract: Although generative models have made remarkable progress in recent years, their use in critical applications has been hindered by their incapacity to reliably evaluate sample quality. Quality refers to at least two complementary concepts: fidelity and coverage. Current quality metrics often lack reliable, interpretable values due to an absence of calibration or insufficient robustness to outliers. To address these shortcomings, we introduce two novel metrics, Clipped Density and Clipped Coverage. By clipping individual sample contributions and, for fidelity, the radii of nearest neighbor balls, our metrics prevent out-of-distribution samples from biasing the aggregated values. Through analytical and empirical calibration, these metrics exhibit linear score degradation as the proportion of poor samples increases. Thus, they can be straightforwardly interpreted as equivalent proportions of good samples. Extensive experiments on synthetic and real-world datasets demonstrate that Clipped Density and Clipped Coverage outperform existing methods in terms of robustness, sensitivity, and interpretability for evaluating generative models.

Title: FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization

Authors: Peng Zheng, Ye Wang, Rui Ma, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01792
Pdf URL: https://arxiv.org/pdf/2507.01792
Copy Paste: [[2507.01792]] FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization(https://arxiv.org/abs/2507.01792)
Keywords: generative
Abstract: Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimization. We present FreeLoRA, a simple and generalizable framework that enables training-free fusion of subject-specific LoRA modules for multi-subject personalization. Each LoRA module is adapted on a few images of a specific subject using a Full Token Tuning strategy, where it is applied across all tokens in the prompt to encourage weakly supervised token-content alignment. At inference, we adopt Subject-Aware Inference, activating each module only on its corresponding subject tokens. This enables training-free fusion of multiple personalized subjects within a single image, while mitigating overfitting and mutual interference between subjects. Extensive experiments show that FreeLoRA achieves strong performance in both subject fidelity and prompt consistency.

Title: Towards Decentralized and Sustainable Foundation Model Training with the Edge

Authors: Leyang Xue, Meghana Madhyastha, Randal Burns, Myungjin Lee, Mahesh K. Marina
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.01803
Pdf URL: https://arxiv.org/pdf/2507.01803
Copy Paste: [[2507.01803]] Towards Decentralized and Sustainable Foundation Model Training with the Edge(https://arxiv.org/abs/2507.01803)
Keywords: foundation model
Abstract: Foundation models are at the forefront of AI research, appealing for their ability to learn from vast datasets and cater to diverse tasks. Yet, their significant computational demands raise issues of environmental impact and the risk of centralized control in their development. We put forward a vision towards decentralized and sustainable foundation model training that leverages the collective compute of sparingly used connected edge AI devices. We present the rationale behind our vision, particularly in support of its sustainability benefit. We further outline a set of challenges that need to be addressed to turn this vision into reality.

Title: Out-of-Distribution Detection Methods Answer the Wrong Questions

Authors: Yucen Lily Li, Daohan Lu, Polina Kirichenko, Shikai Qiu, Tim G. J. Rudner, C. Bayan Bruss, Andrew Gordon Wilson
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.01831
Pdf URL: https://arxiv.org/pdf/2507.01831
Copy Paste: [[2507.01831]] Out-of-Distribution Detection Methods Answer the Wrong Questions(https://arxiv.org/abs/2507.01831)
Keywords: generative
Abstract: To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection. There is no simple fix to this misalignment, since a classifier trained only on in-distribution classes cannot be expected to identify OOD points; for instance, a cat-dog classifier may confidently misclassify an airplane if it contains features that distinguish cats from dogs, despite generally appearing nothing alike. We find that uncertainty-based methods incorrectly conflate high uncertainty with being OOD, while feature-based methods incorrectly conflate far feature-space distance with being OOD. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, epistemic uncertainty representation, and outlier exposure also fail to address this fundamental misalignment in objectives. We additionally consider unsupervised density estimation and generative models for OOD detection, which we show have their own fundamental limitations.

Title: Towards Foundation Auto-Encoders for Time-Series Anomaly Detection

Authors: Gastón García González, Pedro Casas, Emilio Martínez, Alicia Fernández
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01875
Pdf URL: https://arxiv.org/pdf/2507.01875
Copy Paste: [[2507.01875]] Towards Foundation Auto-Encoders for Time-Series Anomaly Detection(https://arxiv.org/abs/2507.01875)
Keywords: foundation model, generative, anomaly
Abstract: We investigate a novel approach to time-series modeling, inspired by the successes of large pretrained foundation models. We introduce FAE (Foundation Auto-Encoders), a foundation generative-AI model for anomaly detection in time-series data, based on Variational Auto-Encoders (VAEs). By foundation, we mean a model pretrained on massive amounts of time-series data which can learn complex temporal patterns useful for accurate modeling, forecasting, and detection of anomalies on previously unseen datasets. FAE leverages VAEs and Dilated Convolutional Neural Networks (DCNNs) to build a generic model for univariate time-series modeling, which could eventually perform properly in out-of-the-box, zero-shot anomaly detection applications. We introduce the main concepts of FAE, and present preliminary results in different multi-dimensional time-series datasets from various domains, including a real dataset from an operational mobile ISP, and the well known KDD 2021 Anomaly Detection dataset.

Title: Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

Authors: Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01908
Pdf URL: https://arxiv.org/pdf/2507.01908
Copy Paste: [[2507.01908]] Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning(https://arxiv.org/abs/2507.01908)
Keywords: diffusion
Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

Title: Exploring a Hybrid Deep Learning Approach for Anomaly Detection in Mental Healthcare Provider Billing: Addressing Label Scarcity through Semi-Supervised Anomaly Detection

Authors: Samirah Bakker, Yao Ma, Seyed Sahand Mohammadi Ziabari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01924
Pdf URL: https://arxiv.org/pdf/2507.01924
Copy Paste: [[2507.01924]] Exploring a Hybrid Deep Learning Approach for Anomaly Detection in Mental Healthcare Provider Billing: Addressing Label Scarcity through Semi-Supervised Anomaly Detection(https://arxiv.org/abs/2507.01924)
Keywords: anomaly
Abstract: The complexity of mental healthcare billing enables anomalies, including fraud. While machine learning methods have been applied to anomaly detection, they often struggle with class imbalance, label scarcity, and complex sequential patterns. This study explores a hybrid deep learning approach combining Long Short-Term Memory (LSTM) networks and Transformers, with pseudo-labeling via Isolation Forests (iForest) and Autoencoders (AE). Prior work has not evaluated such hybrid models trained on pseudo-labeled data in the context of healthcare billing. The approach is evaluated on two real-world billing datasets related to mental healthcare. The iForest LSTM baseline achieves the highest recall (0.963) on declaration-level data. On the operation-level data, the hybrid iForest-based model achieves the highest recall (0.744), though at the cost of lower precision. These findings highlight the potential of combining pseudo-labeling with hybrid deep learning in complex, imbalanced anomaly detection settings.

Title: IC-Custom: Diverse Image Customization via In-Context Learning

Authors: Yaowei Li, Xiaoyu Li, Zhaoyang Zhang, Yuxuan Bian, Gan Liu, Xinyuan Li, Jiale Xu, Wenbo Hu, Yating Liu, Lingen Li, Jing Cai, Yuexian Zou, Yancheng He, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01926
Pdf URL: https://arxiv.org/pdf/2507.01926
Copy Paste: [[2507.01926]] IC-Custom: Diverse Image Customization via In-Context Learning(https://arxiv.org/abs/2507.01926)
Keywords: in-context
Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches. IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: this https URL

Title: Kwai Keye-VL Technical Report

Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01949
Pdf URL: https://arxiv.org/pdf/2507.01949
Copy Paste: [[2507.01949]] Kwai Keye-VL Technical Report(https://arxiv.org/abs/2507.01949)
Keywords: foundation model
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce \textbf{Kwai Keye-VL}, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode ``cold-start'' data mixture, which includes ``thinking'', ``non-thinking'', ``auto-think'', ``think with image'', and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the \textbf{KC-MMBench}, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.

Title: Test-Time Scaling with Reflective Generative Model

Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.01951
Pdf URL: https://arxiv.org/pdf/2507.01951
Copy Paste: [[2507.01951]] Test-Time Scaling with Reflective Generative Model(https://arxiv.org/abs/2507.01951)
Keywords: self-supervised, generative
Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model(PRM) into a unified interface without extra process annotation, reducing over 99% PRM parameters for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suitable for test time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high), based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI-o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at this https URL.

Title: FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

Authors: Yukang Cao, Chenyang Si, Jinghao Wang, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.01953
Pdf URL: https://arxiv.org/pdf/2507.01953
Copy Paste: [[2507.01953]] FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model(https://arxiv.org/abs/2507.01953)
Keywords: diffusion
Abstract: We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-linear nature of the multi-step denoising process and biases inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address these challenges by integrating two key innovations. 1) We first propose a guidance-aware spherical interpolation design that incorporates explicit guidance from the input images by modifying the self-attention modules, thereby addressing identity loss and ensuring directional transitions throughout the generated sequence. 2) We further introduce a step-oriented variation trend that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both inputs. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods, being 10x ~ 50x faster and establishing a new state-of-the-art for image morphing.

Title: How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Authors: Rahul Ramachandran, Ali Garjani, Roman Bachmann, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01955
Pdf URL: https://arxiv.org/pdf/2507.01955
Copy Paste: [[2507.01955]] How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks(https://arxiv.org/abs/2507.01955)
Keywords: foundation model
Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.