2025-08-14

Title: To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA

Authors: Shugang Hao, Hongbo Li, Lingjie Duan
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2508.09146
Pdf URL: https://arxiv.org/pdf/2508.09146
Copy Paste: [[2508.09146]] To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA(https://arxiv.org/abs/2508.09146)
Keywords: in-context
Abstract: The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer to pre-collect collision-threshold data examples and a query collision case. They are constructed as a prompt as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach's fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.

Title: Motif 2.6B Technical Report

Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09148
Pdf URL: https://arxiv.org/pdf/2508.09148
Copy Paste: [[2508.09148]] Motif 2.6B Technical Report(https://arxiv.org/abs/2508.09148)
Keywords: foundation model, in-context
Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment.

Title: A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models

Authors: Wenkai Wang, Hongcan Guo, Zheqi Lv, Shengyu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09155
Pdf URL: https://arxiv.org/pdf/2508.09155
Copy Paste: [[2508.09155]] A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models(https://arxiv.org/abs/2508.09155)
Keywords: foundation model
Abstract: Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking , AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task's training state from the distribution of model generated multi-turn trajectories' performance. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients which is modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.

Title: Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems

Authors: Jan Tauberschmidt, Sophie Fellenz, Sebastian J. Vollmer, Andrew B. Duncan
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2508.09156
Pdf URL: https://arxiv.org/pdf/2508.09156
Copy Paste: [[2508.09156]] Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems(https://arxiv.org/abs/2508.09156)
Keywords: generative
Abstract: We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physicsaware manner. We validate our method on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems.

Title: EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving

Authors: Siwen Jiao, Kangan Qian, Hao Ye, Yang Zhong, Ziang Luo, Sicong Jiang, Zilin Huang, Yangyi Fang, Jinyu Miao, Zheng Fu, Yunlong Wang, Kun Jiang, Diange Yang, Rui Fan, Baoyun Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09158
Pdf URL: https://arxiv.org/pdf/2508.09158
Copy Paste: [[2508.09158]] EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving(https://arxiv.org/abs/2508.09158)
Keywords: diffusion
Abstract: Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization this http URL overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization this http URL adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory this http URL experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach.

Title: An Unsupervised Deep XAI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals

Authors: Konstantinos Vasili, Zachery T. Dahm, William Richards, Stylianos Chatzidakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09162
Pdf URL: https://arxiv.org/pdf/2508.09162
Copy Paste: [[2508.09162]] An Unsupervised Deep XAI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals(https://arxiv.org/abs/2508.09162)
Keywords: anomaly
Abstract: Next generation advanced nuclear reactors are expected to be smaller both in size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, conveying simultaneously various non linear cyber physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and a requirement for safe and reliable operation. Current efforts to address replay attacks, almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear-time invariant state-space which may fail to capture any unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on characterization of replay attacks and explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of autoencoder and customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., detection, source identification, timing and type, of increasing complexity during a dynamic time evolving reactor process. The proposed XAI framework was benchmarked on several real world datasets from Purdue's nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.

Title: Generating Feasible and Diverse Synthetic Populations Using Diffusion Models

Authors: Min Tang, Peng Lu, Qing Feng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09164
Pdf URL: https://arxiv.org/pdf/2508.09164
Copy Paste: [[2508.09164]] Generating Feasible and Diverse Synthetic Populations Using Diffusion Models(https://arxiv.org/abs/2508.09164)
Keywords: diffusion, generative
Abstract: Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. It is a fundamental problem in agent-based modeling (ABM), which has become the standard to analyze intelligent transportation systems. The synthetic population serves as the primary input for ABM transportation simulation, with traveling agents represented by population members. However, when the number of attributes describing agents becomes large, survey data often cannot densely support the joint distribution of the attributes in the population due to the curse of dimensionality. This sparsity makes it difficult to accurately model and produce the population. Interestingly, deep generative models trained from available sample data can potentially synthesize possible attribute combinations that present in the actual population but do not exist in the sample data(called sampling zeros). Nevertheless, this comes at the cost of falsely generating the infeasible attribute combinations that do not exist in the population (called structural zeros). In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population. This approach enables the recovery of numerous missing sampling zeros while keeping the generated structural zeros minimal. Our method is compared with other recently proposed approaches such as Variational Autoencoders (VAE) and Generative Adversarial Network (GAN) approaches, which have shown success in high dimensional tabular population synthesis. We assess the performance of the synthesized outputs using a range of metrics, including marginal distribution similarity, feasibility, and diversity. The results demonstrate that our proposed method outperforms previous approaches in achieving a better balance between the feasibility and diversity of the synthesized population.

Title: IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

Authors: Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09178
Pdf URL: https://arxiv.org/pdf/2508.09178
Copy Paste: [[2508.09178]] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection(https://arxiv.org/abs/2508.09178)
Keywords: anomaly
Abstract: Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at this https URL.

Title: From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

Authors: Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09191
Pdf URL: https://arxiv.org/pdf/2508.09191
Copy Paste: [[2508.09191]] From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization(https://arxiv.org/abs/2508.09191)
Keywords: generative
Abstract: Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, an LLM-driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained large language model (LLM), further optimized with autoregressive generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on diverse real-world datasets enriched with contextual features demonstrate the effectiveness and generalizability of TokenCast.

Title: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Authors: Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09192
Pdf URL: https://arxiv.org/pdf/2508.09192
Copy Paste: [[2508.09192]] Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing(https://arxiv.org/abs/2508.09192)
Keywords: diffusion
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ inference speed than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at this https URL.

Title: Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation RL

Authors: Sung-Hyun Kim, In-Chang Baek, Seo-Young Lee, Geum-Hwan Hwang, Kyung-Joong Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09193
Pdf URL: https://arxiv.org/pdf/2508.09193
Copy Paste: [[2508.09193]] Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation RL(https://arxiv.org/abs/2508.09193)
Keywords: generative
Abstract: Recent advancements in generative modeling emphasize the importance of natural language as a highly expressive and accessible modality for controlling content generation. However, existing instructed reinforcement learning for procedural content generation (IPCGRL) method often struggle to leverage the expressive richness of textual input, especially under complex, multi-objective instructions, leading to limited controllability. To address this problem, we propose \textit{MIPCGRL}, a multi-objective representation learning method for instructed content generators, which incorporates sentence embeddings as conditions. MIPCGRL effectively trains a multi-objective embedding space by incorporating multi-label classification and multi-head regression networks. Experimental results show that the proposed method achieves up to a 13.8\% improvement in controllability with multi-objective instructions. The ability to process complex instructions enables more expressive and flexible content generation.

Title: Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach

Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.09201
Pdf URL: https://arxiv.org/pdf/2508.09201
Copy Paste: [[2508.09201]] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach(https://arxiv.org/abs/2508.09201)
Keywords: anomaly
Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.

Title: Towards Scalable Training for Handwritten Mathematical Expression Recognition

Authors: Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09220
Pdf URL: https://arxiv.org/pdf/2508.09220
Copy Paste: [[2508.09220]] Towards Scalable Training for Handwritten Mathematical Expression Recognition(https://arxiv.org/abs/2508.09220)
Keywords: foundation model
Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \textbf{H}andwritten \textbf{M}athematical \textbf{E}xpression \textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed \texttt{Tex80M}, comprising over 80 million high-quality training instances. Then we propose \texttt{TexTeller}, the first HMER model trained at scale, by mix-training \texttt{Tex80M} with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped \texttt{TexTeller} with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.

Title: FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09241
Pdf URL: https://arxiv.org/pdf/2508.09241
Copy Paste: [[2508.09241]] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents(https://arxiv.org/abs/2508.09241)
Keywords: generative
Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI proxy operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments, quantifying the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9\%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI proxies is basic visual positioning this http URL resources are fully open-source. github: this https URL huggingface: this https URL

Title: Leveraging Large Language Models for Rare Disease Named Entity Recognition

Authors: Nan Miles Xi, Yu Deng, Lin Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09323
Pdf URL: https://arxiv.org/pdf/2508.09323
Copy Paste: [[2508.09323]] Leveraging Large Language Models for Rare Disease Named Entity Recognition(https://arxiv.org/abs/2508.09323)
Keywords: in-context
Abstract: Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.

Title: Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

Authors: Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09327
Pdf URL: https://arxiv.org/pdf/2508.09327
Copy Paste: [[2508.09327]] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model(https://arxiv.org/abs/2508.09327)
Keywords: diffusion, generative
Abstract: Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8$\times$ fewer FLOPs (floating point operations per second), 6.8$\times$ lower GPU memory consumption, and 14$\times$ faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at this https URL.

Title: The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains

Authors: Cathy Speed, Ahmed A. Metwally
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09349
Pdf URL: https://arxiv.org/pdf/2508.09349
Copy Paste: [[2508.09349]] The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains(https://arxiv.org/abs/2508.09349)
Keywords: generative
Abstract: Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved >90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale.

Title: Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Authors: Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09350
Pdf URL: https://arxiv.org/pdf/2508.09350
Copy Paste: [[2508.09350]] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling(https://arxiv.org/abs/2508.09350)
Keywords: generative
Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.

Title: X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

Authors: Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Linjie Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09383
Pdf URL: https://arxiv.org/pdf/2508.09383
Copy Paste: [[2508.09383]] X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents(https://arxiv.org/abs/2508.09383)
Keywords: self-supervised, generative
Abstract: We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens -- one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion--identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.

Title: Understanding Dementia Speech Alignment with Diffusion-Based Image Generation

Authors: Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09385
Pdf URL: https://arxiv.org/pdf/2508.09385
Copy Paste: [[2508.09385]] Understanding Dementia Speech Alignment with Diffusion-Based Image Generation(https://arxiv.org/abs/2508.09385)
Keywords: diffusion
Abstract: Text-to-image models generate highly realistic images based on natural language descriptions and millions of users use them to create and share images online. While it is expected that such models can align input text and generated image in the same latent space little has been done to understand whether this alignment is possible between pathological speech and generated images. In this work, we examine the ability of such models to align dementia-related speech information with the generated images and develop methods to explain this alignment. Surprisingly, we found that dementia detection is possible from generated images alone achieving 75% accuracy on the ADReSS dataset. We then leverage explainability methods to show which parts of the language contribute to the detection.

Title: Graph Neural Network and Transformer Integration for Unsupervised System Anomaly Discovery

Authors: Yun Zi, Ming Gong, Zhihao Xue, Yujun Zou, Nia Qi, Yingnan Deng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09401
Pdf URL: https://arxiv.org/pdf/2508.09401
Copy Paste: [[2508.09401]] Graph Neural Network and Transformer Integration for Unsupervised System Anomaly Discovery(https://arxiv.org/abs/2508.09401)
Keywords: anomaly
Abstract: This study proposes an unsupervised anomaly detection method for distributed backend service systems, addressing practical challenges such as complex structural dependencies, diverse behavioral evolution, and the absence of labeled data. The method constructs a dynamic graph based on service invocation relationships and applies graph convolution to extract high-order structural representations from multi-hop topologies. A Transformer is used to model the temporal behavior of each node, capturing long-term dependencies and local fluctuations. During the feature fusion stage, a learnable joint embedding mechanism integrates structural and behavioral representations into a unified anomaly vector. A nonlinear mapping is then applied to compute anomaly scores, enabling an end-to-end detection process without supervision. Experiments on real-world cloud monitoring data include sensitivity analyses across different graph depths, sequence lengths, and data perturbations. Results show that the proposed method outperforms existing models on several key metrics, demonstrating stronger expressiveness and stability in capturing anomaly propagation paths and modeling dynamic behavior sequences, with high potential for practical deployment.

Title: Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation

Authors: Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.09423
Pdf URL: https://arxiv.org/pdf/2508.09423
Copy Paste: [[2508.09423]] Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation(https://arxiv.org/abs/2508.09423)
Keywords: generative
Abstract: The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at this https URL.

Title: RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration

Authors: Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jie Tang, Gangshan Wu, Jie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09449
Pdf URL: https://arxiv.org/pdf/2508.09449
Copy Paste: [[2508.09449]] RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration(https://arxiv.org/abs/2508.09449)
Keywords: diffusion
Abstract: Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.

Title: A Unified Contrastive-Generative Framework for Time Series Classification

Authors: Ziyu Liu, Azadeh Alavi, Minyi Li, Xiang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09451
Pdf URL: https://arxiv.org/pdf/2508.09451
Copy Paste: [[2508.09451]] A Unified Contrastive-Generative Framework for Time Series Classification(https://arxiv.org/abs/2508.09451)
Keywords: self-supervised, generative
Abstract: Self-supervised learning (SSL) for multivariate time series mainly includes two paradigms: contrastive methods that excel at instance discrimination and generative approaches that model data distributions. While effective individually, their complementary potential remains unexplored. We propose a Contrastive Generative Time series framework (CoGenT), the first framework to unify these paradigms through joint contrastive-generative optimization. CoGenT addresses fundamental limitations of both approaches: it overcomes contrastive learning's sensitivity to high intra-class similarity in temporal data while reducing generative methods' dependence on large datasets. We evaluate CoGenT on six diverse time series datasets. The results show consistent improvements, with up to 59.2% and 14.27% F1 gains over standalone SimCLR and MAE, respectively. Our analysis reveals that the hybrid objective preserves discriminative power while acquiring generative robustness. These findings establish a foundation for hybrid SSL in temporal domains. We will release the code shortly.

Title: HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss

Authors: Abdul Matin, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09453
Pdf URL: https://arxiv.org/pdf/2508.09453
Copy Paste: [[2508.09453]] HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss(https://arxiv.org/abs/2508.09453)
Keywords: foundation model
Abstract: The proliferation of foundation models, pretrained on large-scale unlabeled datasets, has emerged as an effective approach in creating adaptable and reusable architectures that can be leveraged for various downstream tasks using satellite observations. However, their direct application to hyperspectral remote sensing remains challenging due to inherent spectral disparities and the scarcity of available observations. In this work, we present HyperKD, a novel knowledge distillation framework that enables transferring learned representations from a teacher model into a student model for effective development of a foundation model on hyperspectral images. Unlike typical knowledge distillation frameworks, which use a complex teacher to guide a simpler student, HyperKD enables an inverse form of knowledge transfer across different types of spectral data, guided by a simpler teacher model. Building upon a Masked Autoencoder, HyperKD distills knowledge from the Prithvi foundational model into a student tailored for EnMAP hyperspectral imagery. HyperKD addresses the inverse domain adaptation problem with spectral gaps by introducing a feature-based strategy that includes spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function tailored for hyperspectral images. HyperKD bridges the substantial spectral domain gap, enabling the effective use of pretrained foundation models for geospatial applications. Extensive experiments show that HyperKD significantly improves representation learning in MAEs, leading to enhanced reconstruction fidelity and more robust performance on downstream tasks such as land cover classification, crop type identification, and soil organic carbon prediction, underpinning the potential of knowledge distillation frameworks in remote sensing analytics with hyperspectral imagery.

Title: Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

Authors: Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09461
Pdf URL: https://arxiv.org/pdf/2508.09461
Copy Paste: [[2508.09461]] Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy(https://arxiv.org/abs/2508.09461)
Keywords: diffusion
Abstract: Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

Title: CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

Authors: Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09470
Pdf URL: https://arxiv.org/pdf/2508.09470
Copy Paste: [[2508.09470]] CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios(https://arxiv.org/abs/2508.09470)
Keywords: foundation model
Abstract: Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.

Title: EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models

Authors: Omar Bazarbachi, Zijun Sun, Yanning Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09471
Pdf URL: https://arxiv.org/pdf/2508.09471
Copy Paste: [[2508.09471]] EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models(https://arxiv.org/abs/2508.09471)
Keywords: foundation model
Abstract: As Large Language Models (LLMs) become more widely adopted and scale up in size, the computational and memory challenges involved in deploying these massive foundation models have grown increasingly severe. This underscores the urgent need to develop more efficient model variants. Faced with this challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided Structured Post-training Pruning method. The proposed approach leverages graph theory to guide the design of N:M structured pruning, effectively reducing model size and computational demands. By incorporating concepts from expander graphs, EGGS-PTP ensures information flow within the pruned network, preserving essential model functionality. Extensive numerical experiments demonstrate that EGGS-PTP not only achieves significant acceleration and memory savings due to structured sparsity but also outperforms existing structured pruning techniques in terms of accuracy across various LLMs.

Title: Leveraging Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection

Authors: Shibo Yao, Renshuai Tao, Xiaolong Zheng, Chao Liang, Chunjie Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09475
Pdf URL: https://arxiv.org/pdf/2508.09475
Copy Paste: [[2508.09475]] Leveraging Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection(https://arxiv.org/abs/2508.09475)
Keywords: generative
Abstract: Recent deepfake detection studies often treat unseen sample detection as a ``zero-shot" task, training on images generated by known models but generalizing to unknown ones. A key real-world challenge arises when a model performs poorly on unknown samples, yet these samples remain available for analysis. This highlights that it should be approached as a ``few-shot" task, where effectively utilizing a small number of samples can lead to significant improvement. Unlike typical few-shot tasks focused on semantic understanding, deepfake detection prioritizes image realism, which closely mirrors real-world distributions. In this work, we propose the Few-shot Training-free Network (FTNet) for real-world few-shot deepfake detection. Simple yet effective, FTNet differs from traditional methods that rely on large-scale known data for training. Instead, FTNet uses only one fake samplefrom an evaluation set, mimicking the scenario where new samples emerge in the real world and can be gathered for use, without any training or parameter updates. During evaluation, each test sample is compared to the known fake and real samples, and it is classified based on the category of the nearest sample. We conduct a comprehensive analysis of AI-generated images from 29 different generative models and achieve a new SoTA performance, with an average improvement of 8.7\% compared to existing methods. This work introduces a fresh perspective on real-world deepfake detection: when the model struggles to generalize on a few-shot sample, leveraging the failed samples leads to better performance.

Title: CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection

Authors: Zhipeng Yuan, Kai Wang, Weize Quan, Dong-Ming Yan, Tieru Wu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2508.09477
Pdf URL: https://arxiv.org/pdf/2508.09477
Copy Paste: [[2508.09477]] CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection(https://arxiv.org/abs/2508.09477)
Keywords: generative, anomaly
Abstract: With the rapid advancement of AI generative models, the visual quality of AI-generated images (AIIs) has become increasingly close to natural images, which inevitably raises security concerns. Most AII detectors often employ the conventional image classification pipeline with natural images and AIIs (generated by a generative model), which can result in limited detection performance for AIIs from unseen generative models. To solve this, we proposed a universal AI-generated image detector from the perspective of anomaly detection. Our discriminator does not need to access any AIIs and learn a generalizable representation with unsupervised learning. Specifically, we use the pre-trained CLIP encoder as the feature extractor and design a normalizing flow-like unsupervised model. Instead of AIIs, proxy images, e.g., obtained by applying a spectral modification operation on natural images, are used for training. Our models are trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Extensive experiments demonstrate the effectiveness of our method on AIIs produced by various image generators.

Title: SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Authors: Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, Yongjun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09479
Pdf URL: https://arxiv.org/pdf/2508.09479
Copy Paste: [[2508.09479]] SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images(https://arxiv.org/abs/2508.09479)
Keywords: self-supervised
Abstract: Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for its high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset significantly, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.

Title: SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection

Authors: Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09487
Pdf URL: https://arxiv.org/pdf/2508.09487
Copy Paste: [[2508.09487]] SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection(https://arxiv.org/abs/2508.09487)
Keywords: diffusion, generative
Abstract: Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.

Title: Large-Small Model Collaborative Framework for Federated Continual Learning

Authors: Hao Yu, Xin Yang, Boyang Fan, Xuemei Cao, Hanlin Gu, Lixin Fan, Qiang Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09489
Pdf URL: https://arxiv.org/pdf/2508.09489
Copy Paste: [[2508.09489]] Large-Small Model Collaborative Framework for Federated Continual Learning(https://arxiv.org/abs/2508.09489)
Keywords: foundation model
Abstract: Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models.

Title: Causal Graph Profiling via Structural Divergence for Robust Anomaly Detection in Cyber-Physical Systems

Authors: Arun Vignesh Malarkkan, Haoyue Bai, Dongjie Wang, Yanjie Fu
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2508.09504
Pdf URL: https://arxiv.org/pdf/2508.09504
Copy Paste: [[2508.09504]] Causal Graph Profiling via Structural Divergence for Robust Anomaly Detection in Cyber-Physical Systems(https://arxiv.org/abs/2508.09504)
Keywords: anomaly
Abstract: With the growing complexity of cyberattacks targeting critical infrastructures such as water treatment networks, there is a pressing need for robust anomaly detection strategies that account for both system vulnerabilities and evolving attack patterns. Traditional methods -- statistical, density-based, and graph-based models struggle with distribution shifts and class imbalance in multivariate time series, often leading to high false positive rates. To address these challenges, we propose CGAD, a Causal Graph-based Anomaly Detection framework designed for reliable cyberattack detection in public infrastructure systems. CGAD follows a two-phase supervised framework -- causal profiling and anomaly scoring. First, it learns causal invariant graph structures representing the system's behavior under "Normal" and "Attack" states using Dynamic Bayesian Networks. Second, it employs structural divergence to detect anomalies via causal graph comparison by evaluating topological deviations in causal graphs over time. By leveraging causal structures, CGAD achieves superior adaptability and accuracy in non-stationary and imbalanced time series environments compared to conventional machine learning approaches. By uncovering causal structures beneath volatile sensor data, our framework not only detects cyberattacks with markedly higher precision but also redefines robustness in anomaly detection, proving resilience where traditional models falter under imbalance and drift. Our framework achieves substantial gains in F1 and ROC-AUC scores over best-performing baselines across four industrial datasets, demonstrating robust detection of delayed and structurally complex anomalies.

Title: LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09515
Pdf URL: https://arxiv.org/pdf/2508.09515
Copy Paste: [[2508.09515]] LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation(https://arxiv.org/abs/2508.09515)
Keywords: generative
Abstract: Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models.

Title: Generation of Indian Sign Language Letters, Numbers, and Words

Authors: Ajeet Kumar Yadav, Nishant Kumar, Rathna G N
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09522
Pdf URL: https://arxiv.org/pdf/2508.09522
Copy Paste: [[2508.09522]] Generation of Indian Sign Language Letters, Numbers, and Words(https://arxiv.org/abs/2508.09522)
Keywords: generative
Abstract: Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don't know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.

Title: Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks

Authors: Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Xiaoxi Zhang
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2508.09532
Pdf URL: https://arxiv.org/pdf/2508.09532
Copy Paste: [[2508.09532]] Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks(https://arxiv.org/abs/2508.09532)
Keywords: foundation model
Abstract: Federated fine-tuning has emerged as a promising approach for adapting foundation models (FMs) to diverse downstream tasks in edge environments. In Internet of Vehicles (IoV) systems, enabling efficient and low-latency multi-task adaptation is particularly challenging due to client mobility, heterogeneous resources, and intermittent connectivity. This paper proposes a hierarchical federated fine-tuning framework that coordinates roadside units (RSUs) and vehicles to support resource-aware and mobility-resilient learning across dynamic IoV scenarios. Leveraging Low-Rank Adaptation (LoRA), we introduce a decentralized, energy-aware rank adaptation mechanism formulated as a constrained multi-armed bandit problem. A novel UCB-DUAL algorithm is developed to enable adaptive exploration under per-task energy budgets, achieving provable sublinear regret. To evaluate our method, we construct a large-scale IoV simulator based on real-world trajectories, capturing dynamic participation, RSU handoffs, and communication variability. Extensive experiments show that our approach achieves the best accuracy-efficiency trade-off among all baselines, reducing latency by over 24\% and improving average accuracy by more than 2.5\%.

Title: Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification

Authors: Haowen Wang, Guowei Zhang, Xiang Zhang, Zeyuan Chen, Haiyang Xu, Dou Hoon Kwark, Zhuowen Tu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09550
Pdf URL: https://arxiv.org/pdf/2508.09550
Copy Paste: [[2508.09550]] Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification(https://arxiv.org/abs/2508.09550)
Keywords: generative
Abstract: In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.

Title: Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges

Authors: Changyuan Zhao, Guangyuan Liu, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Zan Li, Xuemin (Sherman)Shen, Zhu Han, Sumei Sun, Chau Yuen, Dong In Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09561
Pdf URL: https://arxiv.org/pdf/2508.09561
Copy Paste: [[2508.09561]] Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges(https://arxiv.org/abs/2508.09561)
Keywords: foundation model
Abstract: Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously across diverse, dynamic environments. Central to this vision are world models, which act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi-step actions with foresight. This proactive nature allows agents to anticipate potential outcomes and optimize decisions ahead of real-world interactions. While prior works in robotics and gaming have showcased the potential of world models, their integration into the wireless edge for EGI remains underexplored. This survey bridges this gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge. We first examine the architectural foundations of world models, including latent representation learning, dynamics modeling, and imagination-based planning. Building on these core capabilities, we illustrate their proactive applications across EGI scenarios such as vehicular networks, unmanned aerial vehicle (UAV) networks, the Internet of Things (IoT) systems, and network functions virtualization, thereby highlighting how they can enhance optimization under latency, energy, and privacy constraints. We then explore their synergy with foundation models and digital twins, positioning world models as the cognitive backbone of EGI. Finally, we highlight open challenges, such as safety guarantees, efficient training, and constrained deployment, and outline future research directions. This survey provides both a conceptual foundation and a practical roadmap for realizing the next generation of intelligent, autonomous edge systems.

Title: Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

Authors: Jiwon Kim, Pureum Kim, SeonHwa Kim, Soobin Park, Eunju Cha, Kyong Hwan Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09575
Pdf URL: https://arxiv.org/pdf/2508.09575
Copy Paste: [[2508.09575]] Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion(https://arxiv.org/abs/2508.09575)
Keywords: diffusion
Abstract: Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even between class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at this https URL.

Title: Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality

Authors: Jie Shao, Ke Zhu, Minghao Fu, Guo-hua Wang, Jianxin Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09598
Pdf URL: https://arxiv.org/pdf/2508.09598
Copy Paste: [[2508.09598]] Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality(https://arxiv.org/abs/2508.09598)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive FID scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment, while ignoring the perceptual quality of individual samples. We further examine the role of CFG, a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.

Title: MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

Authors: Daniel Barco (1), Marc Stadelmann (1), Martin Oswald (1), Ivo Herzig (2), Lukas Lichtensteiger (2), Pascal Paysan (3), Igor Peterlik (3), Michal Walczak (3), Bjoern Menze (4), Frank-Peter Schilling (1) ((1) Centre for Artificial Intelligence (CAI), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (2) Institute of Applied Mathematics and Physics (IAMP), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (3) Varian Medical Systems Imaging Lab, Baden, Switzerland, (4) Biomedical Image Analysis and Machine Learning, University of Zurich, Zurich, Switzerland)
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09616
Pdf URL: https://arxiv.org/pdf/2508.09616
Copy Paste: [[2508.09616]] MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography(https://arxiv.org/abs/2508.09616)
Keywords: diffusion
Abstract: We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the "InDI" concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D's effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.

Title: Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation

Authors: Xu Tang, Junan Jia, Yijing Wang, Jingjing Ma, Xiangrong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09626
Pdf URL: https://arxiv.org/pdf/2508.09626
Copy Paste: [[2508.09626]] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation(https://arxiv.org/abs/2508.09626)
Keywords: foundation model
Abstract: In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.

Title: NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation

Authors: Eduarda Caldeira, Naser Damer, Fadi Boutros
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09661
Pdf URL: https://arxiv.org/pdf/2508.09661
Copy Paste: [[2508.09661]] NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation(https://arxiv.org/abs/2508.09661)
Keywords: diffusion
Abstract: The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.

Title: GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors

Authors: Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09667
Pdf URL: https://arxiv.org/pdf/2508.09667
Copy Paste: [[2508.09667]] GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors(https://arxiv.org/abs/2508.09667)
Keywords: diffusion, foundation model, generative
Abstract: Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: this https URL.

Title: MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers

Authors: Qianru Qiu, Jiafeng Mao, Kento Masui, Xueting Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09709
Pdf URL: https://arxiv.org/pdf/2508.09709
Copy Paste: [[2508.09709]] MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers(https://arxiv.org/abs/2508.09709)
Keywords: diffusion
Abstract: Recent advances in diffusion models have significantly improved the performance of reference-guided line art colorization. However, existing methods still struggle with region-level color consistency, especially when the reference and target images differ in character pose or motion. Instead of relying on external matching annotations between the reference and target, we propose to discover semantic correspondences implicitly through internal attention mechanisms. In this paper, we present MangaDiT, a powerful model for reference-guided line art colorization based on Diffusion Transformers (DiT). Our model takes both line art and reference images as conditional inputs and introduces a hierarchical attention mechanism with a dynamic attention weighting strategy. This mechanism augments the vanilla attention with an additional context-aware path that leverages pooled spatial features, effectively expanding the model's receptive field and enhancing region-level color alignment. Experiments on two benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, achieving superior performance in both qualitative and quantitative evaluations.

Title: GraphTreeGen: Subtree-Centric Approach to Efficient and Supervised Graph Generation

Authors: Yitong Luo, Islem Rekik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09710
Pdf URL: https://arxiv.org/pdf/2508.09710
Copy Paste: [[2508.09710]] GraphTreeGen: Subtree-Centric Approach to Efficient and Supervised Graph Generation(https://arxiv.org/abs/2508.09710)
Keywords: self-supervised, generative
Abstract: Brain connectomes, representing neural connectivity as graphs, are crucial for understanding brain organization but costly and time-consuming to acquire, motivating generative approaches. Recent advances in graph generative modeling offer a data-driven alternative, enabling synthetic connectome generation and reducing dependence on large neuroimaging datasets. However, current models face key limitations: (i) compressing the whole graph into a single latent code (e.g., VGAEs) blurs fine-grained local motifs; (ii) relying on rich node attributes rarely available in connectomes reduces reconstruction quality; (iii) edge-centric models emphasize topology but overlook accurate edge-weight prediction, harming quantitative fidelity; and (iv) computationally expensive designs (e.g., edge-conditioned convolutions) impose high memory demands, limiting scalability. We propose GraphTreeGen (GTG), a subtree-centric generative framework for efficient, accurate connectome synthesis. GTG decomposes each connectome into entropy-guided k-hop trees capturing informative local structure, encoded by a shared GCN. A bipartite message-passing layer fuses subtree embeddings with global node features, while a dual-branch decoder jointly predicts edge existence and weights to reconstruct the adjacency matrix. GTG outperforms state-of-the-art baselines in self-supervised tasks and remains competitive in supervised settings, delivering higher structural fidelity and more precise weights with far less memory. Its modular design enables extensions to connectome super-resolution and cross-modality synthesis. Code: this https URL

Title: NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation

Authors: Devvrat Joshi, Islem Rekik
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09715
Pdf URL: https://arxiv.org/pdf/2508.09715
Copy Paste: [[2508.09715]] NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation(https://arxiv.org/abs/2508.09715)
Keywords: generative
Abstract: The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed, graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus dataset for pneumonia detection, NEURAL achieves a 93.4-97.7\% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at this https URL.

Title: Generative Modeling with Multi-Instance Reward Learning for E-commerce Creative Optimization

Authors: Qiaolei Gu, Yu Li, DingYi Zeng, Lu Wang, Ming Pang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09730
Pdf URL: https://arxiv.org/pdf/2508.09730
Copy Paste: [[2508.09730]] Generative Modeling with Multi-Instance Reward Learning for E-commerce Creative Optimization(https://arxiv.org/abs/2508.09730)
Keywords: generative
Abstract: In e-commerce advertising, selecting the most compelling combination of creative elements -- such as titles, images, and highlights -- is critical for capturing user attention and driving conversions. However, existing methods often evaluate creative components individually, failing to navigate the exponentially large search space of possible combinations. To address this challenge, we propose a novel framework named GenCO that integrates generative modeling with multi-instance reward learning. Our unified two-stage architecture first employs a generative model to efficiently produce a diverse set of creative combinations. This generative process is optimized with reinforcement learning, enabling the model to effectively explore and refine its selections. Next, to overcome the challenge of sparse user feedback, a multi-instance learning model attributes combination-level rewards, such as clicks, to the individual creative elements. This allows the reward model to provide a more accurate feedback signal, which in turn guides the generative model toward creating more effective combinations. Deployed on a leading e-commerce platform, our approach has significantly increased advertising revenue, demonstrating its practical value. Additionally, we are releasing a large-scale industrial dataset to facilitate further research in this important domain.

Title: Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection

Authors: Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09746
Pdf URL: https://arxiv.org/pdf/2508.09746
Copy Paste: [[2508.09746]] Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection(https://arxiv.org/abs/2508.09746)
Keywords: diffusion, generative
Abstract: The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion model (LDM) are applied for harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions. To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, We propose a novel model R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitation of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access.

Title: Provable In-Context Vector Arithmetic via Retrieving Task Concepts

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09820
Pdf URL: https://arxiv.org/pdf/2508.09820
Copy Paste: [[2508.09820]] Provable In-Context Vector Arithmetic via Retrieving Task Concepts(https://arxiv.org/abs/2508.09820)
Keywords: in-context
Abstract: In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.

Title: KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging

Authors: Valentin Boussot, Jean-Louis Dillenseger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09823
Pdf URL: https://arxiv.org/pdf/2508.09823
Copy Paste: [[2508.09823]] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging(https://arxiv.org/abs/2508.09823)
Keywords: generative
Abstract: KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at \href{this https URL}{this https URL}.

Title: Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.09834
Pdf URL: https://arxiv.org/pdf/2508.09834
Copy Paste: [[2508.09834]] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models(https://arxiv.org/abs/2508.09834)
Keywords: diffusion, foundation model
Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, reasoning, and pushes the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost the efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above category, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

Title: Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance

Authors: Dhruvraj Singh Rawat, Enggen Sherpa, Rishikesan Kirupanantha, Tin Hoang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09847
Pdf URL: https://arxiv.org/pdf/2508.09847
Copy Paste: [[2508.09847]] Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance(https://arxiv.org/abs/2508.09847)
Keywords: diffusion
Abstract: We present a benchmark of diffusion models for human face generation on a small-scale CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Our study compares UNet and DiT architectures for unconditional generation and explores LoRA-based fine-tuning of pretrained Stable Diffusion models as a separate experiment. Building on the multi-conditioning approach of Giambi and Lisanti, which uses both attribute vectors and segmentation masks, our main contribution is the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder. These enhancements improve the semantic alignment and controllability of attribute-guided synthesis. Our results highlight the effectiveness of contrastive embedding learning and advanced segmentation encoding for controlled face generation in limited data settings.

Title: PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Authors: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09848
Pdf URL: https://arxiv.org/pdf/2508.09848
Copy Paste: [[2508.09848]] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts(https://arxiv.org/abs/2508.09848)
Keywords: in-context
Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

Title: HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09858
Pdf URL: https://arxiv.org/pdf/2508.09858
Copy Paste: [[2508.09858]] HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics(https://arxiv.org/abs/2508.09858)
Keywords: diffusion, generative
Abstract: \textbf{Synthetic human dynamics} aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) \emph{geometric inconsistency} and \emph{coarse reconstruction}, due to limited 3D modeling and detail preservation; and (2) \emph{motion generalization limitations} and \emph{scene inharmonization}, stemming from weak generative capabilities. To address these, we present \textbf{HumanGenesis}, a framework that integrates geometric and generative modeling through four collaborative agents: (1) \textbf{Reconstructor} builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) \textbf{Critique Agent} enhances reconstruction fidelity by identifying and refining poor regions via multi-round MLLM-based reflection. (3) \textbf{Pose Guider} enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) \textbf{Video Harmonizer} synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.

Title: Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?

Authors: Viacheslav Barkov, Jonas Schmidinger, Robin Gebbers, Martin Atzmueller
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09888
Pdf URL: https://arxiv.org/pdf/2508.09888
Copy Paste: [[2508.09888]] Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?(https://arxiv.org/abs/2508.09888)
Keywords: foundation model, in-context
Abstract: In the field of pedometrics, tabular machine learning is the predominant method for predicting soil properties from remote and proximal soil sensing data, forming a central component of digital soil mapping. At the field-scale, this predictive soil modeling (PSM) task is typically constrained by small training sample sizes and high feature-to-sample ratios in soil spectroscopy. Traditionally, these conditions have proven challenging for conventional deep learning methods. Classical machine learning algorithms, particularly tree-based models like Random Forest and linear models such as Partial Least Squares Regression, have long been the default choice for field-scale PSM. Recent advances in artificial neural networks (ANN) for tabular data challenge this view, yet their suitability for field-scale PSM has not been proven. We introduce a comprehensive benchmark that evaluates state-of-the-art ANN architectures, including the latest multilayer perceptron (MLP)-based models (TabM, RealMLP), attention-based transformer variants (FT-Transformer, ExcelFormer, T2G-Former, AMFormer), retrieval-augmented approaches (TabR, ModernNCA), and an in-context learning foundation model (TabPFN). Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties: soil organic matter or soil organic carbon, pH, and clay content. Our results reveal that modern ANNs consistently outperform classical methods on the majority of tasks, demonstrating that deep learning has matured sufficiently to overcome the long-standing dominance of classical machine learning for PSM. Notably, TabPFN delivers the strongest overall performance, showing robustness across varying conditions. We therefore recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.

Title: Rare anomalies require large datasets: About proving the existence of anomalies

Authors: Simon Klüttermann, Emmanuel Müller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09894
Pdf URL: https://arxiv.org/pdf/2508.09894
Copy Paste: [[2508.09894]] Rare anomalies require large datasets: About proving the existence of anomalies(https://arxiv.org/abs/2508.09894)
Keywords: anomaly
Abstract: Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $ \alpha_{\text{algo}} $. Our results demonstrate that, for an unlabeled dataset of size $ N $ and contamination rate $ \nu $, the condition $ N \ge \frac{\alpha_{\text{algo}}}{\nu^2} $ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.

Title: SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection

Authors: Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, Weiqing Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09913
Pdf URL: https://arxiv.org/pdf/2508.09913
Copy Paste: [[2508.09913]] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection(https://arxiv.org/abs/2508.09913)
Keywords: self-supervised
Abstract: Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at this https URL.

Title: Prototype-Guided Diffusion: Visual Conditioning without External Memory

Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.09922
Pdf URL: https://arxiv.org/pdf/2508.09922
Copy Paste: [[2508.09922]] Prototype-Guided Diffusion: Visual Conditioning without External Memory(https://arxiv.org/abs/2508.09922)
Keywords: diffusion
Abstract: Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.

Title: Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?

Authors: Vittorio Pippi, Konstantina Nikolaidou, Silvia Cascianelli, George Retsinas, Giorgos Sfikas, Rita Cucchiara, Marcus Liwicki
Subjects: cs.CV, cs.DL
Abstract URL: https://arxiv.org/abs/2508.09936
Pdf URL: https://arxiv.org/pdf/2508.09936
Copy Paste: [[2508.09936]] Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?(https://arxiv.org/abs/2508.09936)
Keywords: diffusion, generative
Abstract: The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.

Title: AST-n: A Fast Sampling Approach for Low-Dose CT Reconstruction using Diffusion Models

Authors: Tomás de la Sotta, José M. Saavedra, Héctor Henríquez, Violeta Chang, Aline Xavier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09943
Pdf URL: https://arxiv.org/pdf/2508.09943
Copy Paste: [[2508.09943]] AST-n: A Fast Sampling Approach for Low-Dose CT Reconstruction using Diffusion Models(https://arxiv.org/abs/2508.09943)
Keywords: diffusion, generative
Abstract: Low-dose CT (LDCT) protocols reduce radiation exposure but increase image noise, compromising diagnostic confidence. Diffusion-based generative models have shown promise for LDCT denoising by learning image priors and performing iterative refinement. In this work, we introduce AST-n, an accelerated inference framework that initiates reverse diffusion from intermediate noise levels, and integrate high-order ODE solvers within conditioned models to further reduce sampling steps. We evaluate two acceleration paradigms--AST-n sampling and standard scheduling with high-order solvers -- on the Low Dose CT Grand Challenge dataset, covering head, abdominal, and chest scans at 10-25 % of standard dose. Conditioned models using only 25 steps (AST-25) achieve peak signal-to-noise ratio (PSNR) above 38 dB and structural similarity index (SSIM) above 0.95, closely matching standard baselines while cutting inference time from ~16 seg to under 1 seg per slice. Unconditional sampling suffers substantial quality loss, underscoring the necessity of conditioning. We also assess DDIM inversion, which yields marginal PSNR gains at the cost of doubling inference time, limiting its clinical practicality. Our results demonstrate that AST-n with high-order samplers enables rapid LDCT reconstruction without significant loss of image fidelity, advancing the feasibility of diffusion-based methods in clinical workflows.

Title: Stable Diffusion Models are Secretly Good at Visual In-Context Learning

Authors: Trevine Oorloff, Vishwanath Sindagi, Wele Gedara Chaminda Bandara, Ali Shafahi, Amin Ghiasi, Charan Prakash, Reza Ardekani
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09949
Pdf URL: https://arxiv.org/pdf/2508.09949
Copy Paste: [[2508.09949]] Stable Diffusion Models are Secretly Good at Visual In-Context Learning(https://arxiv.org/abs/2508.09949)
Keywords: diffusion, in-context
Abstract: Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.

Title: MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification

Authors: Tianqi Xiang, Yi Li, Qixiang Zhang, Xiaomeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09967
Pdf URL: https://arxiv.org/pdf/2508.09967
Copy Paste: [[2508.09967]] MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification(https://arxiv.org/abs/2508.09967)
Keywords: foundation model
Abstract: Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through fewshot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior arts in multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at this https URL.

Title: Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models

Authors: Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, Zeynep Akata
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.09968
Pdf URL: https://arxiv.org/pdf/2508.09968
Copy Paste: [[2508.09968]] Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models(https://arxiv.org/abs/2508.09968)
Keywords: diffusion, generative
Abstract: The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at this https URL

Title: PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Authors: Geonhee Sim, Gyeongsik Moon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09973
Pdf URL: https://arxiv.org/pdf/2508.09973
Copy Paste: [[2508.09973]] PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image(https://arxiv.org/abs/2508.09973)
Keywords: diffusion
Abstract: Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.

Title: A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

Authors: Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.09977
Pdf URL: https://arxiv.org/pdf/2508.09977
Copy Paste: [[2508.09977]] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation(https://arxiv.org/abs/2508.09977)
Keywords: foundation model
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at this https URL.

Title: Story2Board: A Training-Free Approach for Expressive Storyboard Generation

Authors: David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09983
Pdf URL: https://arxiv.org/pdf/2508.09983
Copy Paste: [[2508.09983]] Story2Board: A Training-Free Approach for Expressive Storyboard Generation(https://arxiv.org/abs/2508.09983)
Keywords: diffusion
Abstract: We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.

Title: Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Authors: Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.09987
Pdf URL: https://arxiv.org/pdf/2508.09987
Copy Paste: [[2508.09987]] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation(https://arxiv.org/abs/2508.09987)
Keywords: foundation model
Abstract: Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the datasets strong transferability.