2025-08-06

Title: ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model

Authors: Yongfan Lai, Bo Liu, Xinyan Guan, Qinghao Zhao, Hongyan Li, Shenda Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02720
Pdf URL: https://arxiv.org/pdf/2508.02720
Copy Paste: [[2508.02720]] ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model(https://arxiv.org/abs/2508.02720)
Keywords: diffusion, generative
Abstract: Personalized electrocardiogram (ECG) generation is to simulate a patient's ECG digital twins tailored to specific conditions. It has the potential to transform traditional healthcare into a more accurate individualized paradigm, while preserving the key benefits of conventional population-level ECG synthesis. However, this promising task presents two fundamental challenges: extracting individual features without ground truth and injecting various types of conditions without confusing generative model. In this paper, we present ECGTwin, a two-stage framework designed to address these challenges. In the first stage, an Individual Base Extractor trained via contrastive learning robustly captures personal features from a reference ECG. In the second stage, the extracted individual features, along with a target cardiac condition, are integrated into the diffusion-based generation process through our novel AdaX Condition Injector, which injects these signals via two dedicated and specialized pathways. Both qualitative and quantitative experiments have demonstrated that our model can not only generate ECG signals of high fidelity and diversity by offering a fine-grained generation controllability, but also preserving individual-specific features. Furthermore, ECGTwin shows the potential to enhance ECG auto-diagnosis in downstream application, confirming the possibility of precise personalized healthcare solutions.

Title: DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

Authors: Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, Xin Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02807
Pdf URL: https://arxiv.org/pdf/2508.02807
Copy Paste: [[2508.02807]] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework(https://arxiv.org/abs/2508.02807)
Keywords: diffusion
Abstract: Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM), to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. \textbf{In the second stage}, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page this https URL

Title: Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation

Authors: Radhika Dua, Young Joon (Fred)Kwon, Siddhant Dogra, Daniel Freedman, Diana Ruan, Motaz Nashawaty, Danielle Rigau, Daniel Alexander Alber, Kang Zhang, Kyunghyun Cho, Eric Karl Oermann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02808
Pdf URL: https://arxiv.org/pdf/2508.02808
Copy Paste: [[2508.02808]] Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation(https://arxiv.org/abs/2508.02808)
Keywords: foundation model
Abstract: Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. Agreement on answers captures preservation and consistency of findings, serving as interpretable proxies for clinical precision and recall. By linking scores to question-answer pairs, ICARE enables transparent, and interpretable assessment. Clinician studies show ICARE aligns significantly more with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns.

Title: Elucidating the Role of Feature Normalization in IJEPA

Authors: Adam Colton
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02829
Pdf URL: https://arxiv.org/pdf/2508.02829
Copy Paste: [[2508.02829]] Elucidating the Role of Feature Normalization in IJEPA(https://arxiv.org/abs/2508.02829)
Keywords: self-supervised
Abstract: In the standard image joint embedding predictive architecture (IJEPA), features at the output of the teacher encoder are layer normalized (LN) before serving as a distillation target for the student encoder and predictor. We propose that this feature normalization disrupts the natural energy hierarchy of visual tokens, where high-energy tokens (those with larger L2 norms) encode semantically important image regions. LN forces all features to have identical L2 norms, effectively equalizing their energies and preventing the model from prioritizing semantically rich regions. We find that IJEPA models trained with feature LN exhibit loss maps with significant checkerboard-like artifacts. We propose that feature LN be replaced with a DynTanh activation as the latter better preserves token energies and allows high-energy tokens to greater contribute to the prediction loss. We show that IJEPA trained with feature DynTanh exhibits a longer-tailed loss distribution and fixes the checkerboard artifacts in the loss map. Our empirical results show that our simple modification improves ImageNet linear probe accuracy from 38% to 42.7% for ViT-Small and reduces RMSE by 0.08 on NYU Depth V2 monocular depth estimation. These results suggest that preserving natural token energies is crucial for effective self-supervised visual representation learning.

Title: Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization

Authors: Hanqi Feng, Peng Qiu, Mengchun Zhang, Yiran Tao, You Fan, Jingtao Xu, Barnabas Poczos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02834
Pdf URL: https://arxiv.org/pdf/2508.02834
Copy Paste: [[2508.02834]] Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization(https://arxiv.org/abs/2508.02834)
Keywords: diffusion
Abstract: Recent advances in diffusion models have shown remarkable potential for antibody design, yet existing approaches apply uniform generation strategies that cannot adapt to each antigen's unique requirements. Inspired by B cell affinity maturation, where antibodies evolve through multi-objective optimization balancing affinity, stability, and self-avoidance, we propose the first biologically-motivated framework that leverages physics-based domain knowledge within an online meta-learning system. Our method employs multiple specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry) whose parameters evolve during generation based on iterative feedback, mimicking natural antibody refinement cycles. Instead of fixed protocols, this adaptive guidance discovers personalized optimization strategies for each target. Our experiments demonstrate that this approach: (1) discovers optimal SE(3)-equivariant guidance strategies for different antigen classes without pre-training, preserving molecular symmetries throughout optimization; (2) significantly enhances hotspot coverage and interface quality through target-specific adaptation, achieving balanced multi-objective optimization characteristic of therapeutic antibodies; (3) establishes a paradigm for iterative refinement where each antibody-antigen system learns its unique optimization profile through online evaluation; (4) generalizes effectively across diverse design challenges, from small epitopes to large protein interfaces, enabling precision-focused campaigns for individual targets.

Title: Highlight & Summarize: RAG without the jailbreaks

Authors: Giovanni Cherubin, Andrew Paverd
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02872
Pdf URL: https://arxiv.org/pdf/2508.02872
Copy Paste: [[2508.02872]] Highlight & Summarize: RAG without the jailbreaks(https://arxiv.org/abs/2508.02872)
Keywords: generative
Abstract: Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM's system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts relevant passages ("highlights") from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.

Title: CauKer: classification time series foundation models can be pretrained on synthetic data only

Authors: Shifeng Xie, Vasilii Feofanov, Marius Alonso, Ambroise Odonnat, Jianfeng Zhang, Themis Palpanas, Ievgen Redko
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02879
Pdf URL: https://arxiv.org/pdf/2508.02879
Copy Paste: [[2508.02879]] CauKer: classification time series foundation models can be pretrained on synthetic data only(https://arxiv.org/abs/2508.02879)
Keywords: foundation model
Abstract: Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.

Title: RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation

Authors: Mehrdad Moradi, Kamran Paynabar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02903
Pdf URL: https://arxiv.org/pdf/2508.02903
Copy Paste: [[2508.02903]] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation(https://arxiv.org/abs/2508.02903)
Keywords: diffusion, anomaly
Abstract: Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state of the art diffusion models, for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08\% higher AUROC and 10.37\% higher AUPRC on MVTec datasets. The implementation code is available at: this https URL

Title: How Diffusion Prior Landscapes Shape the Posterior in Blind Deconvolution

Authors: Minh-Hai Nguyen, Edouard Pauwels, Pierre Weiss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02923
Pdf URL: https://arxiv.org/pdf/2508.02923
Copy Paste: [[2508.02923]] How Diffusion Prior Landscapes Shape the Posterior in Blind Deconvolution(https://arxiv.org/abs/2508.02923)
Keywords: diffusion
Abstract: The Maximum A Posteriori (MAP) estimation is a widely used framework in blind deconvolution to recover sharp images from blurred observations. The estimated image and blur filter are defined as the maximizer of the posterior distribution. However, when paired with sparsity-promoting image priors, MAP estimation has been shown to favors blurry solutions, limiting its effectiveness. In this paper, we revisit this result using diffusion-based priors, a class of models that capture realistic image distributions. Through an empirical examination of the prior's likelihood landscape, we uncover two key properties: first, blurry images tend to have higher likelihoods; second, the landscape contains numerous local minimizers that correspond to natural images. Building on these insights, we provide a theoretical analysis of the blind deblurring posterior. This reveals that the MAP estimator tends to produce sharp filters (close to the Dirac delta function) and blurry solutions. However local minimizers of the posterior, which can be obtained with gradient descent, correspond to realistic, natural images, effectively solving the blind deconvolution problem. Our findings suggest that overcoming MAP's limitations requires good local initialization to local minima in the posterior landscape. We validate our analysis with numerical experiments, demonstrating the practical implications of our insights for designing improved priors and optimization techniques.

Title: GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

Authors: Arthur Cho
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.02926
Pdf URL: https://arxiv.org/pdf/2508.02926
Copy Paste: [[2508.02926]] GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics(https://arxiv.org/abs/2508.02926)
Keywords: generative
Abstract: Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.

Title: X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

Authors: Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02944
Pdf URL: https://arxiv.org/pdf/2508.02944
Copy Paste: [[2508.02944]] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio(https://arxiv.org/abs/2508.02944)
Keywords: diffusion
Abstract: We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

Title: Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Authors: Jonathan Patsenker, Henry Li, Myeongseob Ko, Ruoxi Jia, Yuval Kluger
Subjects: cs.LG, stat.CO
Abstract URL: https://arxiv.org/abs/2508.02964
Pdf URL: https://arxiv.org/pdf/2508.02964
Copy Paste: [[2508.02964]] Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver(https://arxiv.org/abs/2508.02964)
Keywords: diffusion
Abstract: Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie's formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criteria that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.

Title: Diffusion Models with Adaptive Negative Sampling Without External Resources

Authors: Alakh Desai, Nuno Vasconcelos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02973
Pdf URL: https://arxiv.org/pdf/2508.02973
Copy Paste: [[2508.02973]] Diffusion Models with Adaptive Negative Sampling Without External Resources(https://arxiv.org/abs/2508.02973)
Keywords: diffusion
Abstract: Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (CFG) to develop a sampling procedure, {\it Adaptive Negative Sampling Without External Resources} (ANSWER), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. ANSWER is a training-free technique, applicable to any model that supports CFG, and allows for negative grounding of image concepts without an explicit negative prompts, which are lossy and incomplete. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more over the other methods.

Title: Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

Authors: Fan Yang, Yihao Huang, Jiayi Zhu, Ling Shi, Geguang Pu, Jin Song Dong, Kailong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03006
Pdf URL: https://arxiv.org/pdf/2508.03006
Copy Paste: [[2508.03006]] Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models(https://arxiv.org/abs/2508.03006)
Keywords: diffusion
Abstract: Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.

Title: Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation

Authors: Xinhui Li, Xiaojie Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03007
Pdf URL: https://arxiv.org/pdf/2508.03007
Copy Paste: [[2508.03007]] Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation(https://arxiv.org/abs/2508.03007)
Keywords: foundation model
Abstract: Domain Generalized Semantic Segmentation (DGSS) aims to improve the generalization ability of models across unseen domains without access to target data during training. Recent advances in DGSS have increasingly exploited vision foundation models (VFMs) via parameter-efficient fine-tuning strategies. However, most existing approaches concentrate on global feature fine-tuning, while overlooking hierarchical adaptation across feature levels, which is crucial for precise dense prediction. In this paper, we propose Multi-Granularity Feature Calibration (MGFC), a novel framework that performs coarse-to-fine alignment of VFM features to enhance robustness under domain shifts. Specifically, MGFC first calibrates coarse-grained features to capture global contextual semantics and scene-level structure. Then, it refines medium-grained features by promoting category-level feature discriminability. Finally, fine-grained features are calibrated through high-frequency spatial detail enhancement. By performing hierarchical and granularity-aware calibration, MGFC effectively transfers the generalization strengths of VFMs to the domain-specific task of DGSS. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art DGSS approaches, highlighting the effectiveness of multi-granularity adaptation for the semantic segmentation task of domain generalization.

Title: MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention

Authors: Qi Xie (1), Yongjia Ma (2), Donglin Di (2), Xuehao Gao (3), Xun Yang (1) ((1) University of Science and Technology of China, (2) Li Auto, (3) Northwestern Polytechnical University)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03034
Pdf URL: https://arxiv.org/pdf/2508.03034
Copy Paste: [[2508.03034]] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention(https://arxiv.org/abs/2508.03034)
Keywords: diffusion
Abstract: Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% across Face similarity.

Title: When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025

Authors: Ariya Mukherjee-Gandhi, Oliver Muellerklein
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2508.03037
Pdf URL: https://arxiv.org/pdf/2508.03037
Copy Paste: [[2508.03037]] When Algorithms Meet Artists: Topic Modeling the AI-Art Debate, 2013-2025(https://arxiv.org/abs/2508.03037)
Keywords: generative
Abstract: As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists' perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.

Title: Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling

Authors: Ruixing Zhang, Bo Wang, Tongyu Zhu, Leilei Sun, Weifeng Lv
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03042
Pdf URL: https://arxiv.org/pdf/2508.03042
Copy Paste: [[2508.03042]] Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling(https://arxiv.org/abs/2508.03042)
Keywords: diffusion, self-supervised, in-context
Abstract: Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. Existing approaches typically follow a two-stage paradigm: first, learning representations of urban areas; second, performing downstream prediction via linear probing, which originates from the BERT era. Inspired by the development of GPT style models, recent studies have shown that novel self-supervised pretraining schemes can endow models with direct applicability to downstream tasks, thereby eliminating the need for task-specific fine-tuning. This is largely because GPT unifies the form of pretraining and inference through next-token prediction. However, urban data exhibit structural characteristics that differ fundamentally from language, making it challenging to design a one-stage model that unifies both pretraining and inference. In this work, we propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. To capture the distribution of urban profiles, we introduce the Urban Masked Diffusion Transformer, which enables each region' s prediction to be represented as a distribution rather than a deterministic value. Furthermore, to stabilize diffusion training, we propose the Urban Representation Alignment Mechanism, which regularizes the model's intermediate features by aligning them with those from classical urban profiling methods. Extensive experiments on three indicators across two cities demonstrate that our one-stage method consistently outperforms state-of-the-art two-stage approaches. Ablation studies and case studies further validate the effectiveness of each proposed module, particularly the use of diffusion modeling.

Title: A Novel Multimodal Framework for Early Detection of Alzheimers Disease Using Deep Learning

Authors: Tatwadarshi P Nagarhalli, Sanket Patil, Vishal Pande, Uday Aswalekar, Prafulla Patil
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03046
Pdf URL: https://arxiv.org/pdf/2508.03046
Copy Paste: [[2508.03046]] A Novel Multimodal Framework for Early Detection of Alzheimers Disease Using Deep Learning(https://arxiv.org/abs/2508.03046)
Keywords: generative
Abstract: Alzheimers Disease (AD) is a progressive neurodegenerative disorder that poses significant challenges in its early diagnosis, often leading to delayed treatment and poorer outcomes for patients. Traditional diagnostic methods, typically reliant on single data modalities, fall short of capturing the multifaceted nature of the disease. In this paper, we propose a novel multimodal framework for the early detection of AD that integrates data from three primary sources: MRI imaging, cognitive assessments, and biomarkers. This framework employs Convolutional Neural Networks (CNN) for analyzing MRI images and Long Short-Term Memory (LSTM) networks for processing cognitive and biomarker data. The system enhances diagnostic accuracy and reliability by aggregating results from these distinct modalities using advanced techniques like weighted averaging, even in incomplete data. The multimodal approach not only improves the robustness of the detection process but also enables the identification of AD at its earliest stages, offering a significant advantage over conventional methods. The integration of biomarkers and cognitive tests is particularly crucial, as these can detect Alzheimer's long before the onset of clinical symptoms, thereby facilitating earlier intervention and potentially altering the course of the disease. This research demonstrates that the proposed framework has the potential to revolutionize the early detection of AD, paving the way for more timely and effective treatments

Title: Untraceable DeepFakes via Traceable Fingerprint Elimination

Authors: Jiewei Lai, Lan Zhang, Chen Tang, Pengcheng Sun, Xinming Wang, Yunhao Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03067
Pdf URL: https://arxiv.org/pdf/2508.03067
Copy Paste: [[2508.03067]] Untraceable DeepFakes via Traceable Fingerprint Elimination(https://arxiv.org/abs/2508.03067)
Keywords: generative
Abstract: Recent advancements in DeepFakes attribution technologies have significantly enhanced forensic capabilities, enabling the extraction of traces left by generative models (GMs) in images, making DeepFakes traceable back to their source GMs. Meanwhile, several attacks have attempted to evade attribution models (AMs) for exploring their limitations, calling for more robust AMs. However, existing attacks fail to eliminate GMs' traces, thus can be mitigated by defensive measures. In this paper, we identify that untraceable DeepFakes can be achieved through a multiplicative attack, which can fundamentally eliminate GMs' traces, thereby evading AMs even enhanced with defensive measures. We design a universal and black-box attack method that trains an adversarial model solely using real data, applicable for various GMs and agnostic to AMs. Experimental results demonstrate the outstanding attack capability and universal applicability of our method, achieving an average attack success rate (ASR) of 97.08\% against 6 advanced AMs on DeepFakes generated by 9 GMs. Even in the presence of defensive mechanisms, our method maintains an ASR exceeding 72.39\%. Our work underscores the potential challenges posed by multiplicative attacks and highlights the need for more robust AMs.

Title: HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

Authors: Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03104
Pdf URL: https://arxiv.org/pdf/2508.03104
Copy Paste: [[2508.03104]] HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation(https://arxiv.org/abs/2508.03104)
Keywords: self-supervised
Abstract: Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an $s$-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC.

Title: H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction

Authors: Heng Jia, Linchao Zhu, Na Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03118
Pdf URL: https://arxiv.org/pdf/2508.03118
Copy Paste: [[2508.03118]] H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction(https://arxiv.org/abs/2508.03118)
Keywords: foundation model
Abstract: Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2$\times$ faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at this https URL.

Title: UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying

Authors: Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03142
Pdf URL: https://arxiv.org/pdf/2508.03142
Copy Paste: [[2508.03142]] UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying(https://arxiv.org/abs/2508.03142)
Keywords: diffusion
Abstract: In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis from OpenAI's GPT-4o suggests a promising generation pipeline: Understanding VLM->Visual Feature->Projector->Diffusion Model->Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline maintains the strong capability of understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLM, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. 1. The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. 2. The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. 3. The verification step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method based on the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.

Title: SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance

Authors: Yanshu Wang, Xichen Xu, Xiaoning Lei, Guoyang Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03143
Pdf URL: https://arxiv.org/pdf/2508.03143
Copy Paste: [[2508.03143]] SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance(https://arxiv.org/abs/2508.03143)
Keywords: diffusion, anomaly
Abstract: Synthesizing realistic and spatially precise anomalies is essential for enhancing the robustness of industrial anomaly detection systems. While recent diffusion-based methods have demonstrated strong capabilities in modeling complex defect patterns, they often struggle with spatial controllability and fail to maintain fine-grained regional fidelity. To overcome these limitations, we propose SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance), a novel diffusion-based framework specifically designed for anomaly generation. Our approach introduces a Region-Constrained Diffusion (RCD) process that preserves the background by freezing it and selectively updating only the foreground anomaly regions during the reverse denoising phase, thereby effectively reducing background artifacts. Additionally, we incorporate a Discriminative Mask Guidance (DMG) module into the discriminator, enabling joint evaluation of both global realism and local anomaly fidelity, guided by pixel-level masks. Extensive experiments on the MVTec-AD and BTAD datasets show that SARD surpasses existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.

Title: Quantum Spectral Reasoning: A Non-Neural Architecture for Interpretable Machine Learning

Authors: Andrew Kiruluta
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03170
Pdf URL: https://arxiv.org/pdf/2508.03170
Copy Paste: [[2508.03170]] Quantum Spectral Reasoning: A Non-Neural Architecture for Interpretable Machine Learning(https://arxiv.org/abs/2508.03170)
Keywords: anomaly
Abstract: We propose a novel machine learning architecture that departs from conventional neural network paradigms by leveraging quantum spectral methods, specifically Pade approximants and the Lanczos algorithm, for interpretable signal analysis and symbolic reasoning. The core innovation of our approach lies in its ability to transform raw time-domain signals into sparse, physically meaningful spectral representations without the use of backpropagation, high-dimensional embeddings, or data-intensive black-box models. Through rational spectral approximation, the system extracts resonant structures that are then mapped into symbolic predicates via a kernel projection function, enabling logical inference through a rule-based reasoning engine. This architecture bridges mathematical physics, sparse approximation theory, and symbolic artificial intelligence, offering a transparent and physically grounded alternative to deep learning models. We develop the full mathematical formalism underlying each stage of the pipeline, provide a modular algorithmic implementation, and demonstrate the system's effectiveness through comparative evaluations on time-series anomaly detection, symbolic classification, and hybrid reasoning tasks. Our results show that this spectral-symbolic architecture achieves competitive accuracy while maintaining interpretability and data efficiency, suggesting a promising new direction for physically-informed, reasoning-capable machine learning.

Title: SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

Authors: Zhaoxu Li, Chenqi Kong, Yi Yu, Qiangqiang Wu, Xinghao Jiang, Ngai-Man Cheung, Bihan Wen, Alex Kot, Xudong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03177
Pdf URL: https://arxiv.org/pdf/2508.03177
Copy Paste: [[2508.03177]] SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision(https://arxiv.org/abs/2508.03177)
Keywords: generative
Abstract: Large Vision-Language Models (LVLMs) recently achieve significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision SAVER, a novel mechanism that dynamically adjusts LVLMs' final outputs based on the token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.

Title: Convergence of Deterministic and Stochastic Diffusion-Model Samplers: A Simple Analysis in Wasserstein Distance

Authors: Eliot Beyler (SIERRA), Francis Bach (SIERRA)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.03210
Pdf URL: https://arxiv.org/pdf/2508.03210
Copy Paste: [[2508.03210]] Convergence of Deterministic and Stochastic Diffusion-Model Samplers: A Simple Analysis in Wasserstein Distance(https://arxiv.org/abs/2508.03210)
Keywords: diffusion, generative
Abstract: We provide new convergence guarantees in Wasserstein distance for diffusion-based generative models, covering both stochastic (DDPM-like) and deterministic (DDIM-like) sampling methods. We introduce a simple framework to analyze discretization, initialization, and score estimation errors. Notably, we derive the first Wasserstein convergence bound for the Heun sampler and improve existing results for the Euler sampler of the probability flow ODE. Our analysis emphasizes the importance of spatial regularity of the learned score function and argues for controlling the score error with respect to the true reverse process, in line with denoising score matching. We also incorporate recent results on smoothed Wasserstein distances to sharpen initialization error bounds.

Title: ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow

Authors: Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03218
Pdf URL: https://arxiv.org/pdf/2508.03218
Copy Paste: [[2508.03218]] ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow(https://arxiv.org/abs/2508.03218)
Keywords: self-supervised
Abstract: Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While the challenges in high-level perception and planning are continually addressed along the progress of general large pre-trained models, the low precision of low-level action estimation has emerged as the key limiting factor in manipulation performance. To this end, this paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimations in the field of learning-based robot manipulation. As the name suggests, ActionSink reformulates the actions of robots as action-caused optical flows from videos, called "action flow", in a self-supervised manner, which are then used to be retrieved and integrated to enhance the action estimation. Specifically, ActionSink incorporates two primary modules. The first module is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via iterative retrieval and denoising process. The second module is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows that should be used to integrate to enhance the current action estimation. In this module, a multi-layer fusion module is proposed to integrate direct estimation and action flows from both the current and the working memory, achieving highly accurate action estimation through a series of estimation-integration processes. Our ActionSink framework outperformed prior SOTA on the LIBERO benchmark by a 7.9\% success rate, and obtained nearly an 8\% accuracy gain on the challenging long-horizon visual task LIBERO-Long.

Title: BadBlocks: Low-Cost and Stealthy Backdoor Attacks Tailored for Text-to-Image Diffusion Models

Authors: Yu Pan, Jiahao Chen, Lin Wang, Bingrong Dai, Yi Du
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03221
Pdf URL: https://arxiv.org/pdf/2508.03221
Copy Paste: [[2508.03221]] BadBlocks: Low-Cost and Stealthy Backdoor Attacks Tailored for Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.03221)
Keywords: diffusion
Abstract: In recent years,Diffusion models have achieved remarkable progress in the field of image this http URL,recent studies have shown that diffusion models are susceptible to backdoor attacks,in which attackers can manipulate the output by injecting covert triggers such as specific visual patterns or textual phrases into the training this http URL,with the continuous advancement of defense techniques,defenders have become increasingly capable of identifying and mitigating most backdoor attacks using visual inspection and neural network-based detection this http URL,in this paper,we identify a novel type of backdoor threat that is more lightweight and covert than existing approaches,which we name BadBlocks,requires only about 30\% of the computational resources and 20\% GPU time typically needed by previous backdoor attacks,yet it successfully injects backdoors and evades the most advanced defense this http URL enables attackers to selectively contaminate specific blocks within the UNet architecture of diffusion models while maintaining normal functionality in the remaining this http URL results demonstrate that BadBlocks achieves a high attack success rate (ASR) and low perceptual quality loss (as measured by FID Score),even under extremely constrained computational resources and GPU this http URL,BadBlocks is able to bypass existing defense frameworks,especially the attention-based backdoor detection method, highlighting it as a novel and noteworthy this http URL studies further demonstrate that effective backdoor injection does not require fine-tuning the entire network and highlight the pivotal role of certain neural network layers in backdoor this http URL,BadBlocks significantly reduces the barrier to conducting backdoor attacks in all this http URL enables attackers to inject backdoors into large-scale diffusion models even using consumer-grade GPUs.

Title: Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models

Authors: Freida Barnatan, Emunah Goldstein, Einav Kalimian, Orchen Madar, Avi Huri, David Zitoun, Ya'akov Mandelbaum, Moshe Amitay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03235
Pdf URL: https://arxiv.org/pdf/2508.03235
Copy Paste: [[2508.03235]] Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models(https://arxiv.org/abs/2508.03235)
Keywords: foundation model
Abstract: Accurate and efficient characterization of nanoparticle morphology in Scanning Electron Microscopy (SEM) images is critical for ensuring product quality in nanomaterial synthesis and accelerating development. However, conventional deep learning methods for shape classification require extensive labeled datasets and computationally demanding training, limiting their accessibility to the typical nanoparticle practitioner in research and industrial settings. In this study, we introduce a zero-shot classification pipeline that leverages two vision foundation models: the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding. By combining these models with a lightweight classifier, we achieve high-precision shape classification across three morphologically diverse nanoparticle datasets - without the need for extensive parameter fine-tuning. Our methodology outperforms a fine-tuned YOLOv11 and ChatGPT o4-mini-high baselines, demonstrating robustness to small datasets, subtle morphological variations, and domain shifts from natural to scientific imaging. Quantitative clustering metrics on PCA plots of the DINOv2 features are discussed as a means of assessing the progress of the chemical synthesis. This work highlights the potential of foundation models to advance automated microscopy image analysis, offering an alternative to traditional deep learning pipelines in nanoparticle research which is both more efficient and more accessible to the user.

Title: Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion

Authors: Wentao Qu, Guofeng Mei, Jing Wang, Yujiao Wu, Xiaoshui Huang, Liang Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03252
Pdf URL: https://arxiv.org/pdf/2508.03252
Copy Paste: [[2508.03252]] Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion(https://arxiv.org/abs/2508.03252)
Keywords: diffusion
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a \textbf{R}obust single-stage fully \textbf{S}parse 3D object \textbf{D}etection \textbf{Net}work with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks like multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, enhancing RSDNet robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive the object boundaries and shapes, alleviating the center feature missing problem in sparse representations, enabling RSDNet to perform in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.

Title: V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models

Authors: Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03254
Pdf URL: https://arxiv.org/pdf/2508.03254
Copy Paste: [[2508.03254]] V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models(https://arxiv.org/abs/2508.03254)
Keywords: diffusion
Abstract: With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher's outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reduction of 36.2% and 67.5% each, while maintaining or even surpassing the performance of full models. Further experiments demonstrate the effectiveness of both ReDPO and V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at this https URL.

Title: Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

Authors: Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03256
Pdf URL: https://arxiv.org/pdf/2508.03256
Copy Paste: [[2508.03256]] Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation(https://arxiv.org/abs/2508.03256)
Keywords: diffusion
Abstract: Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at this https URL.

Title: Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes

Authors: Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03292
Pdf URL: https://arxiv.org/pdf/2508.03292
Copy Paste: [[2508.03292]] Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes(https://arxiv.org/abs/2508.03292)
Keywords: generative
Abstract: As Large Language Models (LLMs) are increasingly used across different applications, concerns about their potential to amplify gender biases in various tasks are rising. Prior research has often probed gender bias using explicit gender cues as counterfactual, or studied them in sentence completion and short question answering tasks. These formats might overlook more implicit forms of bias embedded in generative behavior of longer content. In this work, we investigate gender bias in LLMs using gender stereotypes studied in psychology (e.g., aggressiveness or gossiping) in an open-ended task of narrative generation. We introduce a novel dataset called StereoBias-Stories containing short stories either unconditioned or conditioned on (one, two, or six) random attributes from 25 psychological stereotypes and three task-related story endings. We analyze how the gender contribution in the overall story changes in response to these attributes and present three key findings: (1) While models, on average, are highly biased towards male in unconditioned prompts, conditioning on attributes independent from gender stereotypes mitigates this bias. (2) Combining multiple attributes associated with the same gender stereotype intensifies model behavior, with male ones amplifying bias and female ones alleviating it. (3) Model biases align with psychological ground-truth used for categorization, and alignment strength increases with model size. Together, these insights highlight the importance of psychology-grounded evaluation of LLMs.

Title: Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation

Authors: Jun Luo, Zijing Zhao, Yang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03300
Pdf URL: https://arxiv.org/pdf/2508.03300
Copy Paste: [[2508.03300]] Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation(https://arxiv.org/abs/2508.03300)
Keywords: diffusion
Abstract: Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain's style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at this https URL

Title: Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Authors: Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03334
Pdf URL: https://arxiv.org/pdf/2508.03334
Copy Paste: [[2508.03334]] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation(https://arxiv.org/abs/2508.03334)
Keywords: diffusion
Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.

Title: FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models

Authors: Matteo Caligiuri, Francesco Barbato, Donald Shenaj, Umberto Michieli, Pietro Zanuttigh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03356
Pdf URL: https://arxiv.org/pdf/2508.03356
Copy Paste: [[2508.03356]] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models(https://arxiv.org/abs/2508.03356)
Keywords: foundation model
Abstract: Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.

Title: Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Authors: Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03363
Pdf URL: https://arxiv.org/pdf/2508.03363
Copy Paste: [[2508.03363]] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models(https://arxiv.org/abs/2508.03363)
Keywords: in-context
Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6\% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, It achieves comparable in-distribution performance to training-based SOTA method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

Title: Diffusion Once and Done: Degradation-Aware LoRA for Efficient All-in-One Image Restoration

Authors: Ni Tang, Xiaotong Luo, Zihan Cheng, Liangtai Zhou, Dongxiao Zhang, Yanyun Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03373
Pdf URL: https://arxiv.org/pdf/2508.03373
Copy Paste: [[2508.03373]] Diffusion Once and Done: Degradation-Aware LoRA for Efficient All-in-One Image Restoration(https://arxiv.org/abs/2508.03373)
Keywords: diffusion
Abstract: Diffusion models have revealed powerful potential in all-in-one image restoration (AiOIR), which is talented in generating abundant texture details. The existing AiOIR methods either retrain a diffusion model or fine-tune the pretrained diffusion model with extra conditional guidance. However, they often suffer from high inference costs and limited adaptability to diverse degradation types. In this paper, we propose an efficient AiOIR method, Diffusion Once and Done (DOD), which aims to achieve superior restoration performance with only one-step sampling of Stable Diffusion (SD) models. Specifically, multi-degradation feature modulation is first introduced to capture different degradation prompts with a pretrained diffusion model. Then, parameter-efficient conditional low-rank adaptation integrates the prompts to enable the fine-tuning of the SD model for adapting to different degradation types. Besides, a high-fidelity detail enhancement module is integrated into the decoder of SD to improve structural and textural details. Experiments demonstrate that our method outperforms existing diffusion-based restoration approaches in both visual quality and inference efficiency.

Title: SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models

Authors: Pingchuan Ma, Xiaopei Yang, Yusong Li, Ming Gui, Felix Krause, Johannes Schusterbauer, Björn Ommer
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03402
Pdf URL: https://arxiv.org/pdf/2508.03402
Copy Paste: [[2508.03402]] SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models(https://arxiv.org/abs/2508.03402)
Keywords: diffusion, generative
Abstract: Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles $\times$ 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.

Title: Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN

Authors: Shivangi Nigam, Adarsh Prasad Behera, Shekhar Verma, P. Nagabhushan
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2508.03415
Pdf URL: https://arxiv.org/pdf/2508.03415
Copy Paste: [[2508.03415]] Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN(https://arxiv.org/abs/2508.03415)
Keywords: diffusion, generative
Abstract: This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets -- Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output.

Title: R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation

Authors: Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03426
Pdf URL: https://arxiv.org/pdf/2508.03426
Copy Paste: [[2508.03426]] R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation(https://arxiv.org/abs/2508.03426)
Keywords: foundation model
Abstract: X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground truth medical report using the GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features and interact with the knowledge using cross-attention. The vision tokens are fed into a Q-former and retrieved the disease-aware vision tokens using another cross-attention. Finally, we adopt the large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validated the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released on this https URL.

Title: AI on the Pulse: Real-Time Health Anomaly Detection with Wearable and Ambient Intelligence

Authors: Davide Gabrielli, Bardh Prenkaj, Paola Velardi, Stefano Faralli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03436
Pdf URL: https://arxiv.org/pdf/2508.03436
Copy Paste: [[2508.03436]] AI on the Pulse: Real-Time Health Anomaly Detection with Wearable and Ambient Intelligence(https://arxiv.org/abs/2508.03436)
Keywords: anomaly
Abstract: We introduce AI on the Pulse, a real-world-ready anomaly detection system that continuously monitors patients using a fusion of wearable sensors, ambient intelligence, and advanced AI models. Powered by UniTS, a state-of-the-art (SoTA) universal time-series model, our framework autonomously learns each patient's unique physiological and behavioral patterns, detecting subtle deviations that signal potential health risks. Unlike classification methods that require impractical, continuous labeling in real-world scenarios, our approach uses anomaly detection to provide real-time, personalized alerts for reactive home-care interventions. Our approach outperforms 12 SoTA anomaly detection methods, demonstrating robustness across both high-fidelity medical devices (ECG) and consumer wearables, with a ~ 22% improvement in F1 score. However, the true impact of AI on the Pulse lies in @HOME, where it has been successfully deployed for continuous, real-world patient monitoring. By operating with non-invasive, lightweight devices like smartwatches, our system proves that high-quality health monitoring is possible without clinical-grade equipment. Beyond detection, we enhance interpretability by integrating LLMs, translating anomaly scores into clinically meaningful insights for healthcare professionals.

Title: Spatial Imputation Drives Cross-Domain Alignment for EEG Classification

Authors: Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun wang, Xiaojuan Ban
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03437
Pdf URL: https://arxiv.org/pdf/2508.03437
Copy Paste: [[2508.03437]] Spatial Imputation Drives Cross-Domain Alignment for EEG Classification(https://arxiv.org/abs/2508.03437)
Keywords: self-supervised
Abstract: Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that separately models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC's superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to $35$\% in integrity scores while maintaining consistent classification accuracy.

Title: MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis

Authors: Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03441
Pdf URL: https://arxiv.org/pdf/2508.03441
Copy Paste: [[2508.03441]] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis(https://arxiv.org/abs/2508.03441)
Keywords: self-supervised, foundation model
Abstract: Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at this https URL.

Title: RAAG: Ratio Aware Adaptive Guidance

Authors: Shangwen Zhu, Qianyu Peng, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Ruili Feng, Fan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03442
Pdf URL: https://arxiv.org/pdf/2508.03442
Copy Paste: [[2508.03442]] RAAG: Ratio Aware Adaptive Guidance(https://arxiv.org/abs/2508.03442)
Keywords: generative
Abstract: Flow-based generative models have recently achieved remarkable progress in image and video synthesis, with classifier-free guidance (CFG) becoming the standard tool for high-fidelity, controllable generation. However, despite their practical success, little is known about how guidance interacts with different stages of the sampling process-especially in the fast, low-step regimes typical of modern flow-based pipelines. In this work, we uncover and analyze a fundamental instability: the earliest reverse steps are acutely sensitive to the guidance scale, owing to a pronounced spike in the relative strength (RATIO) of conditional to unconditional predictions. Through rigorous theoretical analysis and empirical validation, we show that this RATIO spike is intrinsic to the data distribution, independent of the model architecture, and causes exponential error amplification when paired with strong guidance. To address this, we propose a simple, theoretically grounded, RATIO-aware adaptive guidance schedule that automatically dampens the guidance scale at early steps based on the evolving RATIO, using a closed-form exponential decay. Our method is lightweight, requires no additional inference overhead, and is compatible with standard flow frameworks. Experiments across state-of-the-art image (SD3.5, Lumina) and video (WAN2.1) models demonstrate that our approach enables up to 3x faster sampling while maintaining or improving generation quality, robustness, and semantic alignment. Extensive ablation studies further confirm the generality and stability of our schedule across models, datasets, and hyperparameters. Our findings highlight the critical role of stepwise guidance adaptation in unlocking the full potential of fast flow-based generative models.

Title: CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

Authors: Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03447
Pdf URL: https://arxiv.org/pdf/2508.03447
Copy Paste: [[2508.03447]] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection(https://arxiv.org/abs/2508.03447)
Keywords: anomaly
Abstract: Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, integrated with our spatially-aware alignment mechanism, extensive experiments demonstrate that CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets. Code will be available at this https URL.

Title: Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Authors: Rita González-Márquez, Philipp Berens, Dmitry Kobak
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03453
Pdf URL: https://arxiv.org/pdf/2508.03453
Copy Paste: [[2508.03453]] Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings(https://arxiv.org/abs/2508.03453)
Keywords: self-supervised
Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.

Title: READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation

Authors: Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2508.03457
Pdf URL: https://arxiv.org/pdf/2508.03457
Copy Paste: [[2508.03457]] READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation(https://arxiv.org/abs/2508.03457)
Keywords: diffusion
Abstract: The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference process of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.

Title: VideoGuard: Protecting Video Content from Unauthorized Editing

Authors: Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, Xiaoping Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03480
Pdf URL: https://arxiv.org/pdf/2508.03480
Copy Paste: [[2508.03480]] VideoGuard: Protecting Video Content from Unauthorized Editing(https://arxiv.org/abs/2508.03480)
Keywords: diffusion, generative
Abstract: With the rapid development of generative technology, current generative models can generate high-fidelity digital content and edit it in a controlled manner. However, there is a risk that malicious individuals might misuse these capabilities for misleading activities. Although existing research has attempted to shield photographic images from being manipulated by generative models, there remains a significant disparity in the protection offered to video content editing. To bridge the gap, we propose a protection method named VideoGuard, which can effectively protect videos from unauthorized malicious editing. This protection is achieved through the subtle introduction of nearly unnoticeable perturbations that interfere with the functioning of the intended generative diffusion models. Due to the redundancy between video frames, and inter-frame attention mechanism in video diffusion models, simply applying image-based protection methods separately to every video frame can not shield video from unauthorized editing. To tackle the above challenge, we adopt joint frame optimization, treating all video frames as an optimization entity. Furthermore, we extract video motion information and fuse it into optimization objectives. Thus, these alterations can effectively force the models to produce outputs that are implausible and inconsistent. We provide a pipeline to optimize this perturbation. Finally, we use both objective metrics and subjective metrics to demonstrate the efficacy of our method, and the results show that the protection performance of VideoGuard is superior to all the baseline methods.

Title: Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models

Authors: Hyungjin Kim, Seokho Ahn, Young-Duk Seo
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.03481
Pdf URL: https://arxiv.org/pdf/2508.03481
Copy Paste: [[2508.03481]] Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.03481)
Keywords: diffusion
Abstract: Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.

Title: When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Authors: Dasol Choi Jihwan Lee, Minjae Lee, Minsuk Kahng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03483
Pdf URL: https://arxiv.org/pdf/2508.03483
Copy Paste: [[2508.03483]] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models(https://arxiv.org/abs/2508.03483)
Keywords: diffusion, generative
Abstract: While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., "for young people'') to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today's generative models. We see this as an essential step toward more systematic and responsible AI development.

Title: LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation

Authors: Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, Qingyi Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03485
Pdf URL: https://arxiv.org/pdf/2508.03485
Copy Paste: [[2508.03485]] LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation(https://arxiv.org/abs/2508.03485)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have achieved impressive performance in text-to-image generation. However, their high computational cost and large parameter sizes pose significant challenges for usage in resource-constrained scenarios. Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference, but existing PTQ methods suffer from severe performance degradation under extreme low-bit settings. We identify two key obstacles to low-bit post-training quantization for DiT models: (1) model weights follow a Gaussian-like distribution with long tails, causing uniform quantization to poorly allocate intervals and leading to significant errors; (2) two types of activation outliers: (i) Mild Outliers with slightly elevated values, and (ii) Salient Outliers with large magnitudes concentrated in specific channels, which disrupt activation quantization. To address these issues, we propose LRQ-DiT, an efficient and accurate PTQ framework. We introduce Twin-Log Quantization (TLQ), a log-based method that aligns well with the weight distribution and reduces quantization errors. We also propose an Adaptive Rotation Scheme (ARS) that dynamically applies Hadamard or outlier-aware rotations based on activation fluctuation, effectively mitigating the impact of both types of outliers. We evaluate LRQ-DiT on PixArt and FLUX under various bit-width settings, and validate the performance on COCO, MJHQ, and sDCI datasets. LRQ-DiT achieves low-bit quantization of DiT models while preserving image quality, outperforming existing PTQ baselines.

Title: ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes

Authors: Yu Zhou, Pelle Thielmann, Ayush Chamoli, Bruno Mirbach, Didier Stricker, Jason Rambach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03490
Pdf URL: https://arxiv.org/pdf/2508.03490
Copy Paste: [[2508.03490]] ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes(https://arxiv.org/abs/2508.03490)
Keywords: foundation model
Abstract: The construction industry represents a major sector in terms of resource consumption. Recycled construction material has high reuse potential, but quality monitoring of the aggregates is typically still performed with manual methods. Vision-based machine learning methods could offer a faster and more efficient solution to this problem, but existing segmentation methods are by design not directly applicable to images with hundreds of small particles. In this paper, we propose ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as the ones often encountered in construction material particles. Moreover, we create a new dense multi-particle dataset simulated from isolated particle images with the assistance of an automated data generation and labeling pipeline. This dataset serves as a benchmark for visual material quality control automation while our segmentation approach has the potential to be valuable in application areas beyond construction where small-particle segmentation is needed. Our experimental results validate the advantages of our method by comparing to the original SAM method both in quantitative and qualitative experiments.

Title: MAUP: Training-free Multi-center Adaptive Uncertainty-aware Prompting for Cross-domain Few-shot Medical Image Segmentation

Authors: Yazhou Zhu, Haofeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03511
Pdf URL: https://arxiv.org/pdf/2508.03511
Copy Paste: [[2508.03511]] MAUP: Training-free Multi-center Adaptive Uncertainty-aware Prompting for Cross-domain Few-shot Medical Image Segmentation(https://arxiv.org/abs/2508.03511)
Keywords: foundation model
Abstract: Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) is a potential solution for segmenting medical images with limited annotation using knowledge from other domains. The significant performance of current CD-FSMIS models relies on the heavily training procedure over other source medical domains, which degrades the universality and ease of model deployment. With the development of large visual models of natural images, we propose a training-free CD-FSMIS model that introduces the Multi-center Adaptive Uncertainty-aware Prompting (MAUP) strategy for adapting the foundation model Segment Anything Model (SAM), which is trained with natural images, into the CD-FSMIS task. To be specific, MAUP consists of three key innovations: (1) K-means clustering based multi-center prompts generation for comprehensive spatial coverage, (2) uncertainty-aware prompts selection that focuses on the challenging regions, and (3) adaptive prompt optimization that can dynamically adjust according to the target region complexity. With the pre-trained DINOv2 feature encoder, MAUP achieves precise segmentation results across three medical datasets without any additional training compared with several conventional CD-FSMIS models and training-free FSMIS model. The source code is available at: this https URL.

Title: Semantic Mosaicing of Histo-Pathology Image Fragments using Visual Foundation Models

Authors: Stefan Brandstätter, Maximilian Köller, Philipp Seeböck, Alissa Blessing, Felicitas Oberndorfer, Svitlana Pochepnia, Helmut Prosch, Georg Langs
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03524
Pdf URL: https://arxiv.org/pdf/2508.03524
Copy Paste: [[2508.03524]] Semantic Mosaicing of Histo-Pathology Image Fragments using Visual Foundation Models(https://arxiv.org/abs/2508.03524)
Keywords: foundation model
Abstract: In histopathology, tissue samples are often larger than a standard microscope slide, making stitching of multiple fragments necessary to process entire structures such as tumors. Automated stitching is a prerequisite for scaling analysis, but is challenging due to possible tissue loss during preparation, inhomogeneous morphological distortion, staining inconsistencies, missing regions due to misalignment on the slide, or frayed tissue edges. This limits state-of-the-art stitching methods using boundary shape matching algorithms to reconstruct artificial whole mount slides (WMS). Here, we introduce SemanticStitcher using latent feature representations derived from a visual histopathology foundation model to identify neighboring areas in different fragments. Robust pose estimation based on a large number of semantic matching candidates derives a mosaic of multiple fragments to form the WMS. Experiments on three different histopathology datasets demonstrate that SemanticStitcher yields robust WMS mosaicing and consistently outperforms the state of the art in correct boundary matches.

Title: MoKA: Mixture of Kronecker Adapters

Authors: Mohammadreza Sadeghi, Mahsa Ghazvini Nejad, MirHamed Jafarzadeh Asl, Yu Gu, Yuanhao Yu, Masoud Asgharian, Vahid Partovi Nia
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.03527
Pdf URL: https://arxiv.org/pdf/2508.03527
Copy Paste: [[2508.03527]] MoKA: Mixture of Kronecker Adapters(https://arxiv.org/abs/2508.03527)
Keywords: generative
Abstract: Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Low-rank family adapters are commonly used to control the parameter size efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA enables a rank flexibility that provides a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters up to 27x, achieving state-of-the-art trade-offs between performance and parameter efficiency.

Title: EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models

Authors: Xiaoming Hou, Jiquan Zhang, Zibin Lin, DaCheng Tao, Shengli Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03533
Pdf URL: https://arxiv.org/pdf/2508.03533
Copy Paste: [[2508.03533]] EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models(https://arxiv.org/abs/2508.03533)
Keywords: foundation model
Abstract: Effectively adapting powerful pretrained foundation models to diverse tasks remains a key challenge in AI deployment. Current approaches primarily follow two paradigms:discrete optimization of text prompts through prompt engineering, or continuous adaptation via additional trainable parameters. Both exhibit limitations-discrete methods lack refinement precision while parameter-based techniques increase complexity and reduce interpretability. To address these constraints, we propose EmbedGrad, a novel framework that optimizes text prompt embeddings through gradient-based refinement. Our approach uniquely decouples training from deployment:during optimization,labeled examples guide precise embedding adjustments while preserving semantic meaning; during inference, only optimized embeddings integrate with user queries. This enables fine-grained calibration impossible in text space, such as enhancing the reasoning capability of prompts like please reason step by step. Comprehensive evaluations across mathematical reasoning, sentiment analysis, and causal judgment tasks demonstrate EmbedGrad's effectiveness:optimizing this reasoning prompt for Qwen2.5-Math-1.5B increased accuracy from 14.74\% to 58.96\% on mathematical problems. Consistent improvements were observed across model scales (0.5B-14B) and all tasks, with particularly significant gains for smaller models on complex problems like causal judgment. By bridging prompt engineering and parameter efficiency without architectural changes, our work establishes embedding refinement as a powerful new paradigm for task adaptation.

Title: CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation

Authors: Kaishen Yuan, Yuting Zhang, Shang Gao, Yijie Zhu, Wenshuo Chen, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03535
Pdf URL: https://arxiv.org/pdf/2508.03535
Copy Paste: [[2508.03535]] CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation(https://arxiv.org/abs/2508.03535)
Keywords: diffusion
Abstract: Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen's superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at this https URL.

Title: Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection

Authors: Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03539
Pdf URL: https://arxiv.org/pdf/2508.03539
Copy Paste: [[2508.03539]] Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection(https://arxiv.org/abs/2508.03539)
Keywords: diffusion, anomaly
Abstract: Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets-MVTec AD, VisA, and BTAD, demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5 times synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.

Title: SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks

Authors: Xinyu Xiong, Zihuang Wu, Lei Zhang, Lei Lu, Ming Li, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03566
Pdf URL: https://arxiv.org/pdf/2508.03566
Copy Paste: [[2508.03566]] SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks(https://arxiv.org/abs/2508.03566)
Keywords: foundation model
Abstract: Recent studies have highlighted the potential of adapting the Segment Anything Model (SAM) for various downstream tasks. However, constructing a more powerful and generalizable encoder to further enhance performance remains an open challenge. In this work, we propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet while extending the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. By incorporating a dual-resolution strategy and a dense glue layer, our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs. Extensive experiments conducted on four benchmarks, including dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection, demonstrate the superior performance of our proposed method. The code is available at this https URL.

Title: Zero-Variance Gradients for Variational Autoencoders

Authors: Zilei Shao, Anji Liu, Guy Van den Broeck
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03587
Pdf URL: https://arxiv.org/pdf/2508.03587
Copy Paste: [[2508.03587]] Zero-Variance Gradients for Variational Autoencoders(https://arxiv.org/abs/2508.03587)
Keywords: generative
Abstract: Training deep generative models like Variational Autoencoders (VAEs) is often hindered by the need to backpropagate gradients through the stochastic sampling of their latent variables, a process that inherently introduces estimation variance, which can slow convergence and degrade performance. In this paper, we propose a new perspective that sidesteps this problem, which we call Silent Gradients. Instead of improving stochastic estimators, we leverage specific decoder architectures to analytically compute the expected ELBO, yielding a gradient with zero variance. We first provide a theoretical foundation for this method and demonstrate its superiority over existing estimators in a controlled setting with a linear decoder. To generalize our approach for practical use with complex, expressive decoders, we introduce a novel training dynamic that uses the exact, zero-variance gradient to guide the early stages of encoder training before annealing to a standard stochastic estimator. Our experiments show that this technique consistently improves the performance of established baselines, including reparameterization, Gumbel-Softmax, and REINFORCE, across multiple datasets. This work opens a new direction for training generative models by combining the stability of analytical computation with the expressiveness of deep, nonlinear architecture.

Title: VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting

Authors: Adib Hasan, Mardavij Roozbehani, Munther Dahleh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03589
Pdf URL: https://arxiv.org/pdf/2508.03589
Copy Paste: [[2508.03589]] VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting(https://arxiv.org/abs/2508.03589)
Keywords: self-supervised
Abstract: Accurate crop yield forecasting is essential for global food security. However, current AI models systematically underperform when yields deviate from historical trends. This issue arises from key data challenges, including a major asymmetry between rich pretraining weather datasets and the limited data available for fine-tuning. We introduce VITA (Variational Inference Transformer for Asymmetric data), a variational pretraining framework that addresses this asymmetry. Instead of relying on input reconstruction, VITA uses detailed weather variables as proxy targets during pretraining and learns to predict rich atmospheric states through self-supervised feature masking. This allows the model to be fine-tuned using only basic weather statistics during deployment. Applied to 763 counties in the U.S. Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all evaluation scenarios. While it consistently delivers superior performance under normal conditions, its advantages are particularly pronounced during extreme weather years, with statistically significant improvements (paired t-test, $p \approx 0.01$). Importantly, VITA outperforms prior frameworks like GNN-RNN using less data, making it more practical for real-world use--particularly in data-scarce regions. This work highlights how domain-aware AI design can overcome data limitations and support resilient agricultural forecasting in a changing climate.

Title: evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition

Authors: Rodrigo Verschae, Ignacio Bugueno-Cordova
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03609
Pdf URL: https://arxiv.org/pdf/2508.03609
Copy Paste: [[2508.03609]] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition(https://arxiv.org/abs/2508.03609)
Keywords: generative
Abstract: Event-based cameras are bio-inspired vision sensors that asynchronously capture per-pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing valuable information about the spatio-temporal dynamics of the scene. In the present work, we propose evTransFER, a transfer learning-based framework and architecture for face expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode the spatio-temporal dynamics of faces, built by training an adversarial generative method on a different problem (facial reconstruction) and then transferring the trained encoder weights to the face expression recognition system. We show that this proposed transfer learning method greatly improves the ability to recognize facial expressions compared to training a network from scratch. In addition, we propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics, and we introduce a new event-based representation, referred to as TIE, both of which further improve the results. We evaluate the proposed framework on the event-based facial expression database e-CK+ and compare it to state-of-the-art methods. The results show that the proposed framework evTransFER achieves a 93.6\% recognition rate on the e-CK+ database, significantly improving the accuracy (25.9\% points or more) when compared to state-of-the-art performance for similar problems.

Title: A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

Authors: Claudiu Leoveanu-Condrei
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03665
Pdf URL: https://arxiv.org/pdf/2508.03665
Copy Paste: [[2508.03665]] A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design(https://arxiv.org/abs/2508.03665)
Keywords: generative
Abstract: Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are \emph{functionally equivalent} with respect to those contracts.

Title: OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World

Authors: Katherine Liu, Sergey Zakharov, Dian Chen, Takuya Ikeda, Greg Shakhnarovich, Adrien Gaidon, Rares Ambrus
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.03669
Pdf URL: https://arxiv.org/pdf/2508.03669
Copy Paste: [[2508.03669]] OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World(https://arxiv.org/abs/2508.03669)
Keywords: diffusion
Abstract: We would like to estimate the pose and full shape of an object from a single observation, without assuming known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real world datasets. Project website: this https URL

Title: Veila: Panoramic LiDAR Generation from a Monocular RGB Image

Authors: Youquan Liu, Lingdong Kong, Weidong Yang, Ao Liang, Jianxiong Gao, Yang Wu, Xiang Xu, Xin Li, Linfeng Li, Runnan Chen, Ben Fei
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.03690
Pdf URL: https://arxiv.org/pdf/2508.03690
Copy Paste: [[2508.03690]] Veila: Panoramic LiDAR Generation from a Monocular RGB Image(https://arxiv.org/abs/2508.03690)
Keywords: diffusion, generative
Abstract: Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, which remains an open problem. However, it faces three core challenges: (i) semantic and depth cues from RGB are vary spatially, complicating reliable conditioning generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in non-overlap regions between images and LiDAR. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; a Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and a Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.

Title: La La LiDAR: Large-Scale Layout Generation from LiDAR Data

Authors: Youquan Liu, Lingdong Kong, Weidong Yang, Xin Li, Ao Liang, Runnan Chen, Ben Fei, Tongliang Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.03691
Pdf URL: https://arxiv.org/pdf/2508.03691
Copy Paste: [[2508.03691]] La La LiDAR: Large-Scale Layout Generation from LiDAR Data(https://arxiv.org/abs/2508.03691)
Keywords: diffusion, generative
Abstract: Controllable generation of realistic LiDAR scenes is crucial for applications such as autonomous driving and robotics. While recent diffusion-based models achieve high-fidelity LiDAR generation, they lack explicit control over foreground objects and spatial relationships, limiting their usefulness for scenario simulation and safety validation. To address these limitations, we propose Large-scale Layout-guided LiDAR generation model ("La La LiDAR"), a novel layout-guided generative framework that introduces semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured LiDAR layout generation, followed by foreground-aware control injection for complete scene generation. This enables customizable control over object placement while ensuring spatial and semantic consistency. To support our structured LiDAR generation, we introduce Waymo-SG and nuScenes-SG, two large-scale LiDAR scene graph datasets, along with new evaluation metrics for layout synthesis. Extensive experiments demonstrate that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, establishing a new benchmark for controllable 3D scene generation.

Title: LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

Authors: Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.03692
Pdf URL: https://arxiv.org/pdf/2508.03692
Copy Paste: [[2508.03692]] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences(https://arxiv.org/abs/2508.03692)
Keywords: diffusion, generative
Abstract: Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.