2025-06-02

Title: Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale

Authors: Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z. Leibo, Hoyt Long, Emily Robinson, Adam Sobey
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.23785
Pdf URL: https://arxiv.org/pdf/2505.23785
Copy Paste: [[2505.23785]] Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale(https://arxiv.org/abs/2505.23785)
Keywords: generative
Abstract: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.

Title: Zero-Trust Foundation Models: A New Paradigm for Secure and Collaborative Artificial Intelligence for Internet of Things

Authors: Kai Li, Conggai Li, Xin Yuan, Shenghong Li, Sai Zou, Syed Sohail Ahmed, Wei Ni, Dusit Niyato, Abbas Jamalipour, Falko Dressler, Ozgur B. Akan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23792
Pdf URL: https://arxiv.org/pdf/2505.23792
Copy Paste: [[2505.23792]] Zero-Trust Foundation Models: A New Paradigm for Secure and Collaborative Artificial Intelligence for Internet of Things(https://arxiv.org/abs/2505.23792)
Keywords: foundation model, anomaly
Abstract: This paper focuses on Zero-Trust Foundation Models (ZTFMs), a novel paradigm that embeds zero-trust security principles into the lifecycle of foundation models (FMs) for Internet of Things (IoT) systems. By integrating core tenets, such as continuous verification, least privilege access (LPA), data confidentiality, and behavioral analytics into the design, training, and deployment of FMs, ZTFMs can enable secure, privacy-preserving AI across distributed, heterogeneous, and potentially adversarial IoT environments. We present the first structured synthesis of ZTFMs, identifying their potential to transform conventional trust-based IoT architectures into resilient, self-defending ecosystems. Moreover, we propose a comprehensive technical framework, incorporating federated learning (FL), blockchain-based identity management, micro-segmentation, and trusted execution environments (TEEs) to support decentralized, verifiable intelligence at the network edge. In addition, we investigate emerging security threats unique to ZTFM-enabled systems and evaluate countermeasures, such as anomaly detection, adversarial training, and secure aggregation. Through this analysis, we highlight key open research challenges in terms of scalability, secure orchestration, interpretable threat attribution, and dynamic trust calibration. This survey lays a foundational roadmap for secure, intelligent, and trustworthy IoT infrastructures powered by FMs.

Title: My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals

Authors: Jian Lan, Yifei Fu, Udo Schlegel, Gengyuan Zhang, Tanveer Hannan, Haokun Chen, Thomas Seidl
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.23798
Pdf URL: https://arxiv.org/pdf/2505.23798
Copy Paste: [[2505.23798]] My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals(https://arxiv.org/abs/2505.23798)
Keywords: generative
Abstract: Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model's response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating their responses are fair, but indeed having mis-calibrated confidence levels towards particular social groups. While investigating why VLMs are unfair in this study, we observe that VLMs' hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while some lead to increased bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.

Title: MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection

Authors: Yinuo Xue, Eric Spero, Yun Sing Koh, Giovanni Russello
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23803
Pdf URL: https://arxiv.org/pdf/2505.23803
Copy Paste: [[2505.23803]] MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection(https://arxiv.org/abs/2505.23803)
Keywords: generative
Abstract: Phishing email detection faces critical challenges from evolving adversarial tactics and heterogeneous attack patterns. Traditional detection methods, such as rule-based filters and denylists, often struggle to keep pace with these evolving tactics, leading to false negatives and compromised security. While machine learning approaches have improved detection accuracy, they still face challenges adapting to novel phishing strategies. We present MultiPhishGuard, a dynamic LLM-based multi-agent detection system that synergizes specialized expertise with adversarial-aware reinforcement learning. Our framework employs five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents) with automatically adjusted decision weights powered by a Proximal Policy Optimization reinforcement learning algorithm. To address emerging threats, we introduce an adversarial training loop featuring an adversarial agent that generates subtle context-aware email variants, creating a self-improving defense ecosystem and enhancing system robustness. Experimental evaluations on public datasets demonstrate that MultiPhishGuard significantly outperforms Chain-of-Thoughts, single-agent baselines and state-of-the-art detectors, as validated by ablation studies and comparative analyses. Experiments demonstrate that MultiPhishGuard achieves high accuracy (97.89\%) with low false positive (2.73\%) and false negative rates (0.20\%). Additionally, we incorporate an explanation simplifier agent, which provides users with clear and easily understandable explanations for why an email is classified as phishing or legitimate. This work advances phishing defense through dynamic multi-agent collaboration and generative adversarial resilience.

Title: Watermarking Without Standards Is Not AI Governance

Authors: Alexander Nemecek, Yuzhou Jiang, Erman Ayday
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.23814
Pdf URL: https://arxiv.org/pdf/2505.23814
Copy Paste: [[2505.23814]] Watermarking Without Standards Is Not AI Governance(https://arxiv.org/abs/2505.23814)
Keywords: generative
Abstract: Watermarking has emerged as a leading technical proposal for attributing generative AI content and is increasingly cited in global governance frameworks. This paper argues that current implementations risk serving as symbolic compliance rather than delivering effective oversight. We identify a growing gap between regulatory expectations and the technical limitations of existing watermarking schemes. Through analysis of policy proposals and industry practices, we show how incentive structures disincentivize robust, auditable deployments. To realign watermarking with governance goals, we propose a three-layer framework encompassing technical standards, audit infrastructure, and enforcement mechanisms. Without enforceable requirements and independent verification, watermarking will remain inadequate for accountability and ultimately undermine broader efforts in AI safety and regulation.

Title: Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams

Authors: Masoud Safilian, Amin Beheshti, Stephen Elbourn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23818
Pdf URL: https://arxiv.org/pdf/2505.23818
Copy Paste: [[2505.23818]] Ratas framework: A comprehensive genai-based approach to rubric-based marking of real-world textual exams(https://arxiv.org/abs/2505.23818)
Keywords: generative
Abstract: Automated answer grading is a critical challenge in educational technology, with the potential to streamline assessment processes, ensure grading consistency, and provide timely feedback to students. However, existing approaches are often constrained to specific exam formats, lack interpretability in score assignment, and struggle with real-world applicability across diverse subjects and assessment types. To address these limitations, we introduce RATAS (Rubric Automated Tree-based Answer Scoring), a novel framework that leverages state-of-the-art generative AI models for rubric-based grading of textual responses. RATAS is designed to support a wide range of grading rubrics, enable subject-agnostic evaluation, and generate structured, explainable rationales for assigned scores. We formalize the automatic grading task through a mathematical framework tailored to rubric-based assessment and present an architecture capable of handling complex, real-world exam structures. To rigorously evaluate our approach, we construct a unique, contextualized dataset derived from real-world project-based courses, encompassing diverse response formats and varying levels of complexity. Empirical results demonstrate that RATAS achieves high reliability and accuracy in automated grading while providing interpretable feedback that enhances transparency for both students and nstructors.

Title: LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation

Authors: Chaeeun Kim, Jinu Lee, Wonseok Hwang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.23832
Pdf URL: https://arxiv.org/pdf/2505.23832
Copy Paste: [[2505.23832]] LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation(https://arxiv.org/abs/2505.23832)
Keywords: generative
Abstract: Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M legal cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6-20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.

Title: GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance

Authors: Zaixi Zhang, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang
Subjects: cs.CR, q-bio.GN
Abstract URL: https://arxiv.org/abs/2505.23839
Pdf URL: https://arxiv.org/pdf/2505.23839
Copy Paste: [[2505.23839]] GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance(https://arxiv.org/abs/2505.23839)
Keywords: foundation model
Abstract: DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Foundation Models have achieved success in designing synthetic functional DNA sequences, even whole genomes, but their susceptibility to jailbreaking remains underexplored, leading to potential concern of generating harmful sequences such as pathogens or toxin-producing genes. In this paper, we introduce GeneBreaker, the first framework to systematically evaluate jailbreak vulnerabilities of DNA foundation models. GeneBreaker employs (1) an LLM agent with customized bioinformatic tools to design high-homology, non-pathogenic jailbreaking prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer generation toward pathogen-like sequences, and (3) a BLAST-based evaluation pipeline against a curated Human Pathogen Database (JailbreakDNABench) to detect successful jailbreaks. Evaluated on our JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60\% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA foundation models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms. Our code is at this https URL.

Title: Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs

Authors: Jakub Podolak, Rajeev Verma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23845
Pdf URL: https://arxiv.org/pdf/2505.23845
Copy Paste: [[2505.23845]] Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs(https://arxiv.org/abs/2505.23845)
Keywords: generative
Abstract: We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks. In the default answer-then-confidence setting, the model is regularly over-confident, whereas semantic entropy - obtained by sampling many responses - remains reliable. We hypothesize that this is because of semantic entropy's larger test-time compute, which lets us explore the model's predictive distribution. We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness, even on simple fact-retrieval questions that normally require no reasoning. Furthermore, a separate reader model that sees only the chain can reconstruct very similar confidences, indicating the verbal score might be merely a statistic of the alternatives surfaced during reasoning. Our analysis concludes that reliable uncertainty estimation requires explicit exploration of the generative space, and self-reported confidence is trustworthy only after such exploration.

Title: Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting

Authors: Chang Liu, Bohao Zhao, Jingtao Ding, Huandong Wang, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23863
Pdf URL: https://arxiv.org/pdf/2505.23863
Copy Paste: [[2505.23863]] Mamba Integrated with Physics Principles Masters Long-term Chaotic System Forecasting(https://arxiv.org/abs/2505.23863)
Keywords: generative
Abstract: Long-term forecasting of chaotic systems from short-term observations remains a fundamental and underexplored challenge due to the intrinsic sensitivity to initial conditions and the complex geometry of strange attractors. Existing approaches often rely on long-term training data or focus on short-term sequence correlations, struggling to maintain predictive stability and dynamical coherence over extended horizons. We propose PhyxMamba, a novel framework that integrates a Mamba-based state-space model with physics-informed principles to capture the underlying dynamics of chaotic systems. By reconstructing the attractor manifold from brief observations using time-delay embeddings, PhyxMamba extracts global dynamical features essential for accurate forecasting. Our generative training scheme enables Mamba to replicate the physical process, augmented by multi-token prediction and attractor geometry regularization for physical constraints, enhancing prediction accuracy and preserving key statistical invariants. Extensive evaluations on diverse simulated and real-world chaotic systems demonstrate that PhyxMamba delivers superior long-term forecasting and faithfully captures essential dynamical invariants from short-term data. This framework opens new avenues for reliably predicting chaotic systems under observation-scarce conditions, with broad implications across climate science, neuroscience, epidemiology, and beyond. Our code is open-source at this https URL.

Title: MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23870
Pdf URL: https://arxiv.org/pdf/2505.23870
Copy Paste: [[2505.23870]] MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection(https://arxiv.org/abs/2505.23870)
Keywords: foundation model
Abstract: We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition's most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.

Title: ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning

Authors: Zeyuan Liu, Zhihe Yang, Jiawei Xu, Rui Yang, Jiafei Lyu, Baoxiang Wang, Yunjian Xu, Xiu Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23871
Pdf URL: https://arxiv.org/pdf/2505.23871
Copy Paste: [[2505.23871]] ADG: Ambient Diffusion-Guided Dataset Recovery for Corruption-Robust Offline Reinforcement Learning(https://arxiv.org/abs/2505.23871)
Keywords: diffusion
Abstract: Real-world datasets collected from sensors or human inputs are prone to noise and errors, posing significant challenges for applying offline reinforcement learning (RL). While existing methods have made progress in addressing corrupted actions and rewards, they remain insufficient for handling corruption in high-dimensional state spaces and for cases where multiple elements in the dataset are corrupted simultaneously. Diffusion models, known for their strong denoising capabilities, offer a promising direction for this problem-but their tendency to overfit noisy samples limits their direct applicability. To overcome this, we propose Ambient Diffusion-Guided Dataset Recovery (ADG), a novel approach that pioneers the use of diffusion models to tackle data corruption in offline RL. First, we introduce Ambient Denoising Diffusion Probabilistic Models (DDPM) from approximated distributions, which enable learning on partially corrupted datasets with theoretical guarantees. Second, we use the noise-prediction property of Ambient DDPM to distinguish between clean and corrupted data, and then use the clean subset to train a standard DDPM. Third, we employ the trained standard DDPM to refine the previously identified corrupted data, enhancing data quality for subsequent offline RL training. A notable strength of ADG is its versatility-it can be seamlessly integrated with any offline RL algorithm. Experiments on a range of benchmarks, including MuJoCo, Kitchen, and Adroit, demonstrate that ADG effectively mitigates the impact of corrupted data and improves the robustness of offline RL under various noise settings, achieving state-of-the-art results.

Title: KGMark: A Diffusion Watermark for Knowledge Graphs

Authors: Hongrui Peng, Haolang Lu, Yuanlong Yu, Weiye Fu, Kun Wang, Guoshun Nan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23873
Pdf URL: https://arxiv.org/pdf/2505.23873
Copy Paste: [[2505.23873]] KGMark: A Diffusion Watermark for Knowledge Graphs(https://arxiv.org/abs/2505.23873)
Keywords: diffusion
Abstract: Knowledge graphs (KGs) are ubiquitous in numerous real-world applications, and watermarking facilitates protecting intellectual property and preventing potential harm from AI-generated content. Existing watermarking methods mainly focus on static plain text or image data, while they can hardly be applied to dynamic graphs due to spatial and temporal variations of structured data. This motivates us to propose KGMARK, the first graph watermarking framework that aims to generate robust, detectable, and transparent diffusion fingerprints for dynamic KG data. Specifically, we propose a novel clustering-based alignment method to adapt the watermark to spatial variations. Meanwhile, we present a redundant embedding strategy to harden the diffusion watermark against various attacks, facilitating the robustness of the watermark to the temporal variations. Additionally, we introduce a novel learnable mask matrix to improve the transparency of diffusion fingerprints. By doing so, our KGMARK properly tackles the variation challenges of structured data. Experiments on various public benchmarks show the effectiveness of our proposed KGMARK.

Title: BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

Authors: Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23883
Pdf URL: https://arxiv.org/pdf/2505.23883
Copy Paste: [[2505.23883]] BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning(https://arxiv.org/abs/2505.23883)
Keywords: foundation model
Abstract: Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.

Title: Test-Time Training Done Right

Authors: Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, Hao Tan
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.23884
Pdf URL: https://arxiv.org/pdf/2505.23884
Copy Paste: [[2505.23884]] Test-Time Training Done Right(https://arxiv.org/abs/2505.23884)
Keywords: diffusion
Abstract: Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g. Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image set, language models, and auto-regressive video diffusion. Our approach can scale up to 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: this https URL

Title: Generating Fit Check Videos with a Handheld Camera

Authors: Bowei Chen, Brian Curless, Ira Kemelmacher-Shlizerman, Steven M. Seitz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23886
Pdf URL: https://arxiv.org/pdf/2505.23886
Copy Paste: [[2505.23886]] Generating Fit Check Videos with a Handheld Camera(https://arxiv.org/abs/2505.23886)
Keywords: diffusion
Abstract: Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully-framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy, as well as a multi-reference attention mechanism, that effectively integrate appearance information from both the front and back selfies into the video diffusion model. Additionally, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections, achieving a more realistic human-scene composition.

Title: Cora: Correspondence-aware image editing using few step diffusion

Authors: Amirhossein Almohammadi, Aryan Mikaeili, Sauradip Nag, Negar Hassanpour, Andrea Tagliasacchi, Ali Mahdavi-Amiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.23907
Pdf URL: https://arxiv.org/pdf/2505.23907
Copy Paste: [[2505.23907]] Cora: Correspondence-aware image editing using few step diffusion(https://arxiv.org/abs/2505.23907)
Keywords: diffusion
Abstract: Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few step editing approaches produce artifacts such as irrelevant texture or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.

Title: One Task Vector is not Enough: A Large-Scale Study for In-Context Learning

Authors: Pavel Tikhonov, Ivan Oseledets, Elena Tutubalina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23911
Pdf URL: https://arxiv.org/pdf/2505.23911
Copy Paste: [[2505.23911]] One Task Vector is not Enough: A Large-Scale Study for In-Context Learning(https://arxiv.org/abs/2505.23911)
Keywords: in-context
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples, with task vectors - specific hidden state activations - hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QuiteAFew, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QuiteAFew reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.

Title: Simplifying Bayesian Optimization Via In-Context Direct Optimum Sampling

Authors: Gustavo Sutter Pessurno de Carvalho, Mohammed Abdulrahman, Hao Wang, Sriram Ganapathi Subramanian, Marc St-Aubin, Sharon O'Sullivan, Lawrence Wan, Luis Ricardez-Sandoval, Pascal Poupart, Agustinus Kristiadi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.23913
Pdf URL: https://arxiv.org/pdf/2505.23913
Copy Paste: [[2505.23913]] Simplifying Bayesian Optimization Via In-Context Direct Optimum Sampling(https://arxiv.org/abs/2505.23913)
Keywords: foundation model, generative, in-context
Abstract: The optimization of expensive black-box functions is ubiquitous in science and engineering. A common solution to this problem is Bayesian optimization (BO), which is generally comprised of two components: (i) a surrogate model and (ii) an acquisition function, which generally require expensive re-training and optimization steps at each iteration, respectively. Although recent work enabled in-context surrogate models that do not require re-training, virtually all existing BO methods still require acquisition function maximization to select the next observation, which introduces many knobs to tune, such as Monte Carlo samplers and multi-start optimizers. In this work, we propose a completely in-context, zero-shot solution for BO that does not require surrogate fitting or acquisition function optimization. This is done by using a pre-trained deep generative model to directly sample from the posterior over the optimum point. We show that this process is equivalent to Thompson sampling and demonstrate the capabilities and cost-effectiveness of our foundation model on a suite of real-world benchmarks. We achieve an efficiency gain of more than 35x in terms of wall-clock time when compared with Gaussian process-based BO, enabling efficient parallel and distributed BO, e.g., for high-throughput optimization.

Title: Multi-Group Proportional Representation for Text-to-Image Models

Authors: Sangwon Jung, Alex Oesterling, Claudio Mayrink Verdun, Sajani Vithana, Taesup Moon, Flavio P. Calmon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24023
Pdf URL: https://arxiv.org/pdf/2505.24023
Copy Paste: [[2505.24023]] Multi-Group Proportional Representation for Text-to-Image Models(https://arxiv.org/abs/2505.24023)
Keywords: generative
Abstract: Text-to-image (T2I) generative models can create vivid, realistic images from textual descriptions. As these models proliferate, they expose new concerns about their ability to represent diverse demographic groups, propagate stereotypes, and efface minority populations. Despite growing attention to the "safe" and "responsible" design of artificial intelligence (AI), there is no established methodology to systematically measure and control representational harms in image generation. This paper introduces a novel framework to measure the representation of intersectional groups in images generated by T2I models by applying the Multi-Group Proportional Representation (MPR) metric. MPR evaluates the worst-case deviation of representation statistics across given population groups in images produced by a generative model, allowing for flexible and context-specific measurements based on user requirements. We also develop an algorithm to optimize T2I models for this metric. Through experiments, we demonstrate that MPR can effectively measure representation statistics across multiple intersectional groups and, when used as a training objective, can guide models toward a more balanced generation across demographic groups while maintaining generation quality.

Title: DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

Authors: Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24025
Pdf URL: https://arxiv.org/pdf/2505.24025
Copy Paste: [[2505.24025]] DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models(https://arxiv.org/abs/2505.24025)
Keywords: foundation model, in-context
Abstract: The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose \textbf{DINO-R1}, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces \textbf{Group Relative Query Optimization (GRQO)}, a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to stabilize the objectness distribution to reduce the training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.

Title: From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?

Authors: Ziming Zhao, ChengAo Shen, Hanghang Tong, Dongjin Song, Zhigang Deng, Qingsong Wen, Jingchao Ni
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.24030
Pdf URL: https://arxiv.org/pdf/2505.24030
Copy Paste: [[2505.24030]] From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?(https://arxiv.org/abs/2505.24030)
Keywords: foundation model
Abstract: Transformer-based models have gained increasing attention in time series research, driving interest in Large Language Models (LLMs) and foundation models for time series analysis. As the field moves toward multi-modality, Large Vision Models (LVMs) are emerging as a promising direction. In the past, the effectiveness of Transformer and LLMs in time series has been debated. When it comes to LVMs, a similar question arises: are LVMs truely useful for time series analysis? To address it, we design and conduct the first principled study involving 4 LVMs, 8 imaging methods, 18 datasets and 26 baselines across both high-level (classification) and low-level (forecasting) tasks, with extensive ablation analysis. Our findings indicate LVMs are indeed useful for time series classification but face challenges in forecasting. Although effective, the contemporary best LVM forecasters are limited to specific types of LVMs and imaging methods, exhibit a bias toward forecasting periods, and have limited ability to utilize long look-back windows. We hope our findings could serve as a cornerstone for future research on LVM- and multimodal-based solutions to different time series tasks.

Title: Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning

Authors: Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Ling Pan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24061
Pdf URL: https://arxiv.org/pdf/2505.24061
Copy Paste: [[2505.24061]] Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning(https://arxiv.org/abs/2505.24061)
Keywords: diffusion
Abstract: Deep reinforcement learning (RL) agents frequently suffer from neuronal activity loss, which impairs their ability to adapt to new data and learn continually. A common method to quantify and address this issue is the tau-dormant neuron ratio, which uses activation statistics to measure the expressive ability of neurons. While effective for simple MLP-based agents, this approach loses statistical power in more complex architectures. To address this, we argue that in advanced RL agents, maintaining a neuron's learning capacity, its ability to adapt via gradient updates, is more critical than preserving its expressive ability. Based on this insight, we shift the statistical objective from activations to gradients, and introduce GraMa (Gradient Magnitude Neural Activity Metric), a lightweight, architecture-agnostic metric for quantifying neuron-level learning capacity. We show that GraMa effectively reveals persistent neuron inactivity across diverse architectures, including residual networks, diffusion models, and agents with varied activation functions. Moreover, resetting neurons guided by GraMa (ReGraMa) consistently improves learning performance across multiple deep RL algorithms and benchmarks, such as MuJoCo and the DeepMind Control Suite.

Title: ComposeAnything: Composite Object Priors for Text-to-Image Generation

Authors: Zeeshan Khan, Shizhe Chen, Cordelia Schmid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24086
Pdf URL: https://arxiv.org/pdf/2505.24086
Copy Paste: [[2505.24086]] ComposeAnything: Composite Object Priors for Text-to-Image Generation(https://arxiv.org/abs/2505.24086)
Keywords: diffusion
Abstract: Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial and depth aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.

Title: Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting

Authors: Chen Huang, Skyler Seto, Hadi Pouransari, Mehrdad Farajtabar, Raviteja Vemulapalli, Fartash Faghri, Oncel Tuzel, Barry-John Theobald, Josh Susskind
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.24088
Pdf URL: https://arxiv.org/pdf/2505.24088
Copy Paste: [[2505.24088]] Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting(https://arxiv.org/abs/2505.24088)
Keywords: foundation model
Abstract: Vision foundation models pre-trained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method Proxy-FDA that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.

Title: Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors

Authors: Peiran Xu, Yadong Mu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24103
Pdf URL: https://arxiv.org/pdf/2505.24103
Copy Paste: [[2505.24103]] Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors(https://arxiv.org/abs/2505.24103)
Keywords: foundation model
Abstract: In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at this https URL .

Title: Federated Foundation Model for GI Endoscopy Images

Authors: Alina Devkota, Annahita Amireskandari, Joel Palko, Shyam Thakkar, Donald Adjeroh, Xiajun Jiang, Binod Bhattarai, Prashnna K. Gyawali
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24108
Pdf URL: https://arxiv.org/pdf/2505.24108
Copy Paste: [[2505.24108]] Federated Foundation Model for GI Endoscopy Images(https://arxiv.org/abs/2505.24108)
Keywords: foundation model
Abstract: Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose a FL framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, conducting experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks--classification, detection, and segmentation--and demonstrate that it achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting.

Title: S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Modelwith Spatio-Temporal Visual Representation

Authors: Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, Dragomir Anguelov, Mingxing Tan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24139
Pdf URL: https://arxiv.org/pdf/2505.24139
Copy Paste: [[2505.24139]] S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Modelwith Spatio-Temporal Visual Representation(https://arxiv.org/abs/2505.24139)
Keywords: self-supervised
Abstract: The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches--which directly learn from sensor inputs to generate planning trajectories without human annotations often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.

Title: The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models

Authors: Jiashuai Liu, Yingjia Shang, Yingkang Zhan, Di Zhang, Yi Niu, Dong Wei, Xian Wu, Zeyu Gao, Chen Li, Yefeng Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24141
Pdf URL: https://arxiv.org/pdf/2505.24141
Copy Paste: [[2505.24141]] The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation Models(https://arxiv.org/abs/2505.24141)
Keywords: foundation model
Abstract: With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image~(WSI) analysis against adversarial attacks. Specifically, we introduce the principle of \textit{local perturbation with global impact} and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1\% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20\% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: this https URL.

Title: CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer

Authors: Jinglong Gao, Xiao Ding, Lingxiao Zou, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24143
Pdf URL: https://arxiv.org/pdf/2505.24143
Copy Paste: [[2505.24143]] CrossICL: Cross-Task In-Context Learning via Unsupervised Demonstration Transfer(https://arxiv.org/abs/2505.24143)
Keywords: in-context
Abstract: In-Context Learning (ICL) enhances the performance of large language models (LLMs) with demonstrations. However, obtaining these demonstrations primarily relies on manual effort. In most real-world scenarios, users are often unwilling or unable to provide such demonstrations. Inspired by the human analogy, we explore a new ICL paradigm CrossICL to study how to utilize existing source task demonstrations in the ICL for target tasks, thereby obtaining reliable guidance without any additional manual effort. To explore this, we first design a two-stage alignment strategy to mitigate the interference caused by gaps across tasks, as the foundation for our experimental exploration. Based on it, we conduct comprehensive exploration of CrossICL, with 875 NLP tasks from the Super-NI benchmark and six types of LLMs, including GPT-4o. Experimental results demonstrate the effectiveness of CrossICL and provide valuable insights on questions like the criteria for selecting cross-task demonstrations, as well as the types of task-gap-induced interference in CrossICL.

Title: Autoregressive regularized score-based diffusion models for multi-scenarios fluid flow prediction

Authors: Wilfried Genuist, Éric Savin, Filippo Gatti, Didier Clouteau
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2505.24145
Pdf URL: https://arxiv.org/pdf/2505.24145
Copy Paste: [[2505.24145]] Autoregressive regularized score-based diffusion models for multi-scenarios fluid flow prediction(https://arxiv.org/abs/2505.24145)
Keywords: diffusion, generative
Abstract: Building on recent advances in scientific machine learning and generative modeling for computational fluid dynamics, we propose a conditional score-based diffusion model designed for multi-scenarios fluid flow prediction. Our model integrates an energy constraint rooted in the statistical properties of turbulent flows, improving prediction quality with minimal training, while enabling efficient sampling at low cost. The method features a simple and general architecture that requires no problem-specific design, supports plug-and-play enhancements, and enables fast and flexible solution generation. It also demonstrates an efficient conditioning mechanism that simplifies training across different scenarios without demanding a redesign of existing models. We further explore various stochastic differential equation formulations to demonstrate how thoughtful design choices enhance performance. We validate the proposed methodology through extensive experiments on complex fluid dynamics datasets encompassing a variety of flow regimes and configurations. Results demonstrate that our model consistently achieves stable, robust, and physically faithful predictions, even under challenging turbulent conditions. With properly tuned parameters, it achieves accurate results across multiple scenarios while preserving key physical and statistical properties. We present a comprehensive analysis of stochastic differential equation impact and discuss our approach across diverse fluid mechanics tasks.

Title: Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction

Authors: Chenyou Fan, Fangzheng Yan, Chenjia Bai, Jiepeng Wang, Chi Zhang, Zhen Wang, Xuelong Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2505.24156
Pdf URL: https://arxiv.org/pdf/2505.24156
Copy Paste: [[2505.24156]] Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction(https://arxiv.org/abs/2505.24156)
Keywords: diffusion
Abstract: Learning a generalizable bimanual manipulation policy is extremely challenging for embodied agents due to the large action space and the need for coordinated arm movements. Existing approaches rely on Vision-Language-Action (VLA) models to acquire bimanual policies. However, transferring knowledge from single-arm datasets or pre-trained VLA models often fails to generalize effectively, primarily due to the scarcity of bimanual data and the fundamental differences between single-arm and bimanual manipulation. In this paper, we propose a novel bimanual foundation policy by fine-tuning the leading text-to-video models to predict robot trajectories and training a lightweight diffusion policy for action generation. Given the lack of embodied knowledge in text-to-video models, we introduce a two-stage paradigm that fine-tunes independent text-to-flow and flow-to-video models derived from a pre-trained text-to-video model. Specifically, optical flow serves as an intermediate variable, providing a concise representation of subtle movements between images. The text-to-flow model predicts optical flow to concretize the intent of language instructions, and the flow-to-video model leverages this flow for fine-grained video prediction. Our method mitigates the ambiguity of language in single-stage text-to-video prediction and significantly reduces the robot-data requirement by avoiding direct use of low-level actions. In experiments, we collect high-quality manipulation data for real dual-arm robot, and the results of simulation and real-world experiments demonstrate the effectiveness of our method.

Title: Pretraining Deformable Image Registration Networks with Random Images

Authors: Junyu Chen, Shuwen Wei, Yihao Liu, Aaron Carass, Yong Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24167
Pdf URL: https://arxiv.org/pdf/2505.24167
Copy Paste: [[2505.24167]] Pretraining Deformable Image Registration Networks with Random Images(https://arxiv.org/abs/2505.24167)
Keywords: foundation model
Abstract: Recent advances in deep learning-based medical image registration have shown that training deep neural networks~(DNNs) does not necessarily require medical images. Previous work showed that DNNs trained on randomly generated images with carefully designed noise and contrast properties can still generalize well to unseen medical data. Building on this insight, we propose using registration between random images as a proxy task for pretraining a foundation model for image registration. Empirical results show that our pretraining strategy improves registration accuracy, reduces the amount of domain-specific data needed to achieve competitive performance, and accelerates convergence during downstream training, thereby enhancing computational efficiency.

Title: Invariant Link Selector for Spatial-Temporal Out-of-Distribution Problem

Authors: Katherine Tieu, Dongqi Fu, Jun Wu, Jingrui He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24178
Pdf URL: https://arxiv.org/pdf/2505.24178
Copy Paste: [[2505.24178]] Invariant Link Selector for Spatial-Temporal Out-of-Distribution Problem(https://arxiv.org/abs/2505.24178)
Keywords: foundation model
Abstract: In the era of foundation models, Out-of- Distribution (OOD) problems, i.e., the data discrepancy between the training environments and testing environments, hinder AI generalization. Further, relational data like graphs disobeying the Independent and Identically Distributed (IID) condition makes the problem more challenging, especially much harder when it is associated with time. Motivated by this, to realize the robust invariant learning over temporal graphs, we want to investigate what components in temporal graphs are most invariant and representative with respect to labels. With the Information Bottleneck (IB) method, we propose an error-bounded Invariant Link Selector that can distinguish invariant components and variant components during the training process to make the deep learning model generalizable for different testing scenarios. Besides deriving a series of rigorous generalizable optimization functions, we also equip the training with task-specific loss functions, e.g., temporal link prediction, to make pretrained models solve real-world application tasks like citation recommendation and merchandise recommendation, as demonstrated in our experiments with state-of-the-art (SOTA) methods. Our code is available at this https URL.

Title: CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Authors: Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24196
Pdf URL: https://arxiv.org/pdf/2505.24196
Copy Paste: [[2505.24196]] CLaSp: In-Context Layer Skip for Self-Speculative Decoding(https://arxiv.org/abs/2505.24196)
Keywords: in-context
Abstract: Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x ~ 1.7x on LLaMA3 series models without altering the original distribution of the generated text.

Title: STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models

Authors: Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2505.24210
Pdf URL: https://arxiv.org/pdf/2505.24210
Copy Paste: [[2505.24210]] STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models(https://arxiv.org/abs/2505.24210)
Keywords: diffusion
Abstract: Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generations with DMs typically require a large number of function evaluations (NFEs), resulting in slow sampling, there has been extensive research successfully reducing the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20-50 NFE) to achieve superior results, and, despite the practical relevance, research on the effective sampling within this mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge--Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which is dependent on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampling, including noise-based and flow matching-based models. Within the 20-50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at this https URL.

Title: Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?

Authors: Jiwan Chung, Janghan Yoon, Junhyeong Park, Sangeyl Lee, Joowon Yang, Sooyeon Park, Youngjae Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24211
Pdf URL: https://arxiv.org/pdf/2505.24211
Copy Paste: [[2505.24211]] Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?(https://arxiv.org/abs/2505.24211)
Keywords: generative
Abstract: Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria-cyclic consistency, forward equivariance, and conjugated equivariance-our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at this https URL.

Title: Benchmarking Foundation Models for Zero-Shot Biometric Tasks

Authors: Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24214
Pdf URL: https://arxiv.org/pdf/2505.24214
Copy Paste: [[2505.24214]] Benchmarking Foundation Models for Zero-Shot Biometric Tasks(https://arxiv.org/abs/2505.24214)
Keywords: foundation model
Abstract: The advent of foundation models, particularly Vision-Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), has redefined the frontiers of artificial intelligence, enabling remarkable generalization across diverse tasks with minimal or no supervision. Yet, their potential in biometric recognition and analysis remains relatively underexplored. In this work, we introduce a comprehensive benchmark that evaluates the zero-shot and few-shot performance of state-of-the-art publicly available VLMs and MLLMs across six biometric tasks spanning the face and iris modalities: face verification, soft biometric attribute prediction (gender and race), iris recognition, presentation attack detection (PAD), and face manipulation detection (morphs and deepfakes). A total of 41 VLMs were used in this evaluation. Experiments show that embeddings from these foundation models can be used for diverse biometric tasks with varying degrees of success. For example, in the case of face verification, a True Match Rate (TMR) of 96.77 percent was obtained at a False Match Rate (FMR) of 1 percent on the Labeled Face in the Wild (LFW) dataset, without any fine-tuning. In the case of iris recognition, the TMR at 1 percent FMR on the IITD-R-Full dataset was 97.55 percent without any fine-tuning. Further, we show that applying a simple classifier head to these embeddings can help perform DeepFake detection for faces, Presentation Attack Detection (PAD) for irides, and extract soft biometric attributes like gender and ethnicity from faces with reasonably high accuracy. This work reiterates the potential of pretrained models in achieving the long-term vision of Artificial General Intelligence.

Title: Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin

Authors: Fangyikang Wang, Hubery Yin, Lei Qian, Yinan Li, Shaobin Zhuang, Huminhao Zhu, Yilin Zhang, Yanlong Tang, Chao Zhang, Hanbin Zhao, Hui Qian, Chen Li
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24222
Pdf URL: https://arxiv.org/pdf/2505.24222
Copy Paste: [[2505.24222]] Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin(https://arxiv.org/abs/2505.24222)
Keywords: diffusion
Abstract: The diffusion models (DMs) have demonstrated the remarkable capability of generating images via learning the noised score function of data distribution. Current DM sampling techniques typically rely on first-order Langevin dynamics at each noise level, with efforts concentrated on refining inter-level denoising strategies. While leveraging additional second-order Hessian geometry to enhance the sampling quality of Langevin is a common practice in Markov chain Monte Carlo (MCMC), the naive attempts to utilize Hessian geometry in high-dimensional DMs lead to quadratic-complexity computational costs, rendering them non-scalable. In this work, we introduce a novel Levenberg-Marquardt-Langevin (LML) method that approximates the diffusion Hessian geometry in a training-free manner, drawing inspiration from the celebrated Levenberg-Marquardt optimization algorithm. Our approach introduces two key innovations: (1) A low-rank approximation of the diffusion Hessian, leveraging the DMs' inherent structure and circumventing explicit quadratic-complexity computations; (2) A damping mechanism to stabilize the approximated Hessian. This LML approximated Hessian geometry enables the diffusion sampling to execute more accurate steps and improve the image generation quality. We further conduct a theoretical analysis to substantiate the approximation error bound of low-rank approximation and the convergence property of the damping mechanism. Extensive experiments across multiple pretrained DMs validate that the LML method significantly improves image generation quality, with negligible computational overhead.

Title: From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

Authors: Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.24232
Pdf URL: https://arxiv.org/pdf/2505.24232
Copy Paste: [[2505.24232]] From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models(https://arxiv.org/abs/2505.24232)
Keywords: foundation model
Abstract: Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) \textit{Similar Loss Convergence} - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) \textit{Gradient Consistency in Attention Redistribution} - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.

Title: LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework

Authors: Xin Kang, Zihan Zheng, Lei Chu, Yue Gao, Jiahao Li, Hao Pan, Xuejin Chen, Yan Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24245
Pdf URL: https://arxiv.org/pdf/2505.24245
Copy Paste: [[2505.24245]] LTM3D: Bridging Token Spaces for Conditional 3D Generation with Auto-Regressive Diffusion Framework(https://arxiv.org/abs/2505.24245)
Keywords: diffusion
Abstract: We present LTM3D, a Latent Token space Modeling framework for conditional 3D shape generation that integrates the strengths of diffusion and auto-regressive (AR) models. While diffusion-based methods effectively model continuous latent spaces and AR models excel at capturing inter-token dependencies, combining these paradigms for 3D shape generation remains a challenge. To address this, LTM3D features a Conditional Distribution Modeling backbone, leveraging a masked autoencoder and a diffusion model to enhance token dependency learning. Additionally, we introduce Prefix Learning, which aligns condition tokens with shape latent tokens during generation, improving flexibility across modalities. We further propose a Latent Token Reconstruction module with Reconstruction-Guided Sampling to reduce uncertainty and enhance structural fidelity in generated shapes. Our approach operates in token space, enabling support for multiple 3D representations, including signed distance fields, point clouds, meshes, and 3D Gaussian Splatting. Extensive experiments on image- and text-conditioned shape generation tasks demonstrate that LTM3D outperforms existing methods in prompt fidelity and structural accuracy while offering a generalizable framework for multi-modal, multi-representation 3D generation.

Title: Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization

Authors: Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Hongbin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24249
Pdf URL: https://arxiv.org/pdf/2505.24249
Copy Paste: [[2505.24249]] Harnessing Foundation Models for Robust and Generalizable 6-DOF Bronchoscopy Localization(https://arxiv.org/abs/2505.24249)
Keywords: foundation model
Abstract: Vision-based 6-DOF bronchoscopy localization offers a promising solution for accurate and cost-effective interventional guidance. However, existing methods struggle with 1) limited generalization across patient cases due to scarce labeled data, and 2) poor robustness under visual degradation, as bronchoscopy procedures frequently involve artifacts such as occlusions and motion blur that impair visual information. To address these challenges, we propose PANSv2, a generalizable and robust bronchoscopy localization framework. Motivated by PANS that leverages multiple visual cues for pose likelihood measurement, PANSv2 integrates depth estimation, landmark detection, and centerline constraints into a unified pose optimization framework that evaluates pose probability and solves for the optimal bronchoscope pose. To further enhance generalization capabilities, we leverage the endoscopic foundation model EndoOmni for depth estimation and the video foundation model EndoMamba for landmark detection, incorporating both spatial and temporal analyses. Pretrained on diverse endoscopic datasets, these models provide stable and transferable visual representations, enabling reliable performance across varied bronchoscopy scenarios. Additionally, to improve robustness to visual degradation, we introduce an automatic re-initialization module that detects tracking failures and re-establishes pose using landmark detections once clear views are available. Experimental results on bronchoscopy dataset encompassing 10 patient cases show that PANSv2 achieves the highest tracking success rate, with an 18.1% improvement in SR-5 (percentage of absolute trajectory error under 5 mm) compared to existing methods, showing potential towards real clinical usage.

Title: Interactive Video Generation via Domain Adaptation

Authors: Ishaan Rawal, Suryansh Kumar
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.24253
Pdf URL: https://arxiv.org/pdf/2505.24253
Copy Paste: [[2505.24253]] Interactive Video Generation via Domain Adaptation(https://arxiv.org/abs/2505.24253)
Keywords: diffusion
Abstract: Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over the existing state-of-the-art IVG techniques.

Title: Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

Authors: Sahithya Ravi, Gabriel Sarch, Vibhav Vineet, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24257
Pdf URL: https://arxiv.org/pdf/2505.24257
Copy Paste: [[2505.24257]] Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames(https://arxiv.org/abs/2505.24257)
Keywords: generative
Abstract: An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago lies relative to an object B encountered later. We introduce Disjoint-3DQA , a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% to 30 %) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.

Title: MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection

Authors: Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Hengrui Zhang, Zhongfen Deng, Philip S. Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.24267
Pdf URL: https://arxiv.org/pdf/2505.24267
Copy Paste: [[2505.24267]] MUSE: Model-Agnostic Tabular Watermarking via Multi-Sample Selection(https://arxiv.org/abs/2505.24267)
Keywords: diffusion, generative
Abstract: We introduce MUSE, a watermarking algorithm for tabular generative models. Previous approaches typically leverage DDIM invertibility to watermark tabular diffusion models, but tabular diffusion models exhibit significantly poorer invertibility compared to other modalities, compromising performance. Simultaneously, tabular diffusion models require substantially less computation than other modalities, enabling a multi-sample selection approach to tabular generative model watermarking. MUSE embeds watermarks by generating multiple candidate samples and selecting one based on a specialized scoring function, without relying on model invertibility. Our theoretical analysis establishes the relationship between watermark detectability, candidate count, and dataset size, allowing precise calibration of watermarking strength. Extensive experiments demonstrate that MUSE achieves state-of-the-art watermark detectability and robustness against various attacks while maintaining data quality, and remains compatible with any tabular generative model supporting repeated sampling, effectively addressing key challenges in tabular data watermarking. Specifically, it reduces the distortion rates on fidelity metrics by 81-89%, while achieving a 1.0 TPR@0.1%FPR detection rate. Implementation of MUSE can be found at this https URL.

Title: Large Language Models are Locally Linear Mappings

Authors: James R. Golden
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.24293
Pdf URL: https://arxiv.org/pdf/2505.24293
Copy Paste: [[2505.24293]] Large Language Models are Locally Linear Mappings(https://arxiv.org/abs/2505.24293)
Keywords: diffusion
Abstract: We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.

Title: Category-aware EEG image generation based on wavelet transform and contrast semantic loss

Authors: Enshang Zhang, Zhicheng Zhang, Takashi Hanakawa
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2505.24301
Pdf URL: https://arxiv.org/pdf/2505.24301
Copy Paste: [[2505.24301]] Category-aware EEG image generation based on wavelet transform and contrast semantic loss(https://arxiv.org/abs/2505.24301)
Keywords: diffusion
Abstract: Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, which achieves a maximum single-subject accuracy of 43\%, outperforming other state-of-the-art methods. The source code and supplementary material is available at this https URL.

Title: InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

Authors: Jinlu Zhang, Yixin Chen, Zan Wang, Jie Yang, Yizhou Wang, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24315
Pdf URL: https://arxiv.org/pdf/2505.24315
Copy Paste: [[2505.24315]] InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing(https://arxiv.org/abs/2505.24315)
Keywords: diffusion
Abstract: Recent advances in 3D human-aware generation have made significant progress. However, existing methods still struggle with generating novel Human Object Interaction (HOI) from text, particularly for open-set objects. We identify three main challenges of this task: precise human-object relation reasoning, affordance parsing for any object, and detailed human interaction pose synthesis aligning description and object geometry. In this work, we propose a novel zero-shot 3D HOI generation framework without training on specific datasets, leveraging the knowledge from large-scale pre-trained models. Specifically, the human-object relations are inferred from large language models (LLMs) to initialize object properties and guide the optimization process. Then we utilize a pre-trained 2D image diffusion model to parse unseen objects and extract contact points, avoiding the limitations imposed by existing 3D asset knowledge. The initial human pose is generated by sampling multiple hypotheses through multi-view SDS based on the input text and object geometry. Finally, we introduce a detailed optimization to generate fine-grained, precise, and natural interaction, enforcing realistic 3D contact between the 3D object and the involved body parts, including hands in grasping. This is achieved by distilling human-level feedback from LLMs to capture detailed human-object relations from the text instruction. Extensive experiments validate the effectiveness of our approach compared to prior works, particularly in terms of the fine-grained nature of interactions and the ability to handle open-set 3D objects.

Title: KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices

Authors: Uzair Khan, Franco Fummi, Luigi Capogrosso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24334
Pdf URL: https://arxiv.org/pdf/2505.24334
Copy Paste: [[2505.24334]] KairosAD: A SAM-Based Model for Industrial Anomaly Detection on Embedded Devices(https://arxiv.org/abs/2505.24334)
Keywords: anomaly
Abstract: In the era of intelligent manufacturing, anomaly detection has become essential for maintaining quality control on modern production lines. However, while many existing models show promising performance, they are often too large, computationally demanding, and impractical to deploy on resource-constrained embedded devices that can be easily installed on the production lines of Small and Medium Enterprises (SMEs). To bridge this gap, we present KairosAD, a novel supervised approach that uses the power of the Mobile Segment Anything Model (MobileSAM) for image-based anomaly detection. KairosAD has been evaluated on the two well-known industrial anomaly detection datasets, i.e., MVTec-AD and ViSA. The results show that KairosAD requires 78% fewer parameters and boasts a 4x faster inference time compared to the leading state-of-the-art model, while maintaining comparable AUROC performance. We deployed KairosAD on two embedded devices, the NVIDIA Jetson NX, and the NVIDIA Jetson AGX. Finally, KairosAD was successfully installed and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at this https URL.

Title: Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

Authors: Shujian Yang, Shiyao Cui, Chuanrui Hu, Haicheng Wang, Tianwei Zhang, Minlie Huang, Jialiang Lu, Han Qiu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.24341
Pdf URL: https://arxiv.org/pdf/2505.24341
Copy Paste: [[2505.24341]] Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings(https://arxiv.org/abs/2505.24341)
Keywords: in-context
Abstract: Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic contents. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs "overcorrect'': misidentify many normal Chinese contents as toxic.

Title: Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model

Authors: Sihan Tan, Taro Miyazaki, Kazuhiro Nakadai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.24355
Pdf URL: https://arxiv.org/pdf/2505.24355
Copy Paste: [[2505.24355]] Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model(https://arxiv.org/abs/2505.24355)
Keywords: foundation model
Abstract: Sign Language Translation (SLT) aims to convert sign language (SL) videos into spoken language text, thereby bridging the communication gap between the sign and the spoken community. While most existing works focus on translating a single sign language into a single spoken language (one-to-one SLT), leveraging multilingual resources could mitigate low-resource issues and enhance accessibility. However, multilingual SLT (MLSLT) remains unexplored due to language conflicts and alignment difficulties across SLs and spoken languages. To address these challenges, we propose a multilingual gloss-free model with dual CTC objectives for token-level SL identification and spoken text generation. Our model supports 10 SLs and handles one-to-one, many-to-one, and many-to-many SLT tasks, achieving competitive performance compared to state-of-the-art methods on three widely adopted benchmarks: multilingual SP-10, PHOENIX14T, and CSL-Daily.

Title: Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning

Authors: Stepan Shabalin, Ayush Panda, Dmitrii Kharlapenko, Abdur Raheem Ali, Yixiong Hao, Arthur Conmy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24360
Pdf URL: https://arxiv.org/pdf/2505.24360
Copy Paste: [[2505.24360]] Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning(https://arxiv.org/abs/2505.24360)
Keywords: diffusion
Abstract: Sparse autoencoders are a promising new approach for decomposing language model activations for interpretation and control. They have been applied successfully to vision transformer image encoders and to small-scale diffusion models. Inference-Time Decomposition of Activations (ITDA) is a recently proposed variant of dictionary learning that takes the dictionary to be a set of data points from the activation distribution and reconstructs them with gradient pursuit. We apply Sparse Autoencoders (SAEs) and ITDA to a large text-to-image diffusion model, Flux 1, and consider the interpretability of embeddings of both by introducing a visual automated interpretation pipeline. We find that SAEs accurately reconstruct residual stream embeddings and beat MLP neurons on interpretability. We are able to use SAE features to steer image generation through activation addition. We find that ITDA has comparable interpretability to SAEs.

Title: Anomaly Detection and Improvement of Clusters using Enhanced K-Means Algorithm

Authors: Vardhan Shorewala, Shivam Shorewala
Subjects: cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2505.24365
Pdf URL: https://arxiv.org/pdf/2505.24365
Copy Paste: [[2505.24365]] Anomaly Detection and Improvement of Clusters using Enhanced K-Means Algorithm(https://arxiv.org/abs/2505.24365)
Keywords: anomaly
Abstract: This paper introduces a unified approach to cluster refinement and anomaly detection in datasets. We propose a novel algorithm that iteratively reduces the intra-cluster variance of N clusters until a global minimum is reached, yielding tighter clusters than the standard k-means algorithm. We evaluate the method using intrinsic measures for unsupervised learning, including the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index, and extend it to anomaly detection by identifying points whose assignment causes a significant variance increase. External validation on synthetic data and the UCI Breast Cancer and UCI Wine Quality datasets employs the Jaccard similarity score, V-measure, and F1 score. Results show variance reductions of 18.7% and 88.1% on the synthetic and Wine Quality datasets, respectively, along with accuracy and F1 score improvements of 22.5% and 20.8% on the Wine Quality dataset.

Title: Adversarial Preference Learning for Robust LLM Alignment

Authors: Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24369
Pdf URL: https://arxiv.org/pdf/2505.24369
Copy Paste: [[2505.24369]] Adversarial Preference Learning for Robust LLM Alignment(https://arxiv.org/abs/2505.24369)
Keywords: generative
Abstract: Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.

Title: IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models

Authors: Hanting Wang, Tao Jin, Wang Lin, Shulei Wang, Hai Huang, Shengpeng Ji, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24406
Pdf URL: https://arxiv.org/pdf/2505.24406
Copy Paste: [[2505.24406]] IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models(https://arxiv.org/abs/2505.24406)
Keywords: diffusion, generative
Abstract: Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available at GitHub.

Title: EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Authors: Runnan Lu, Yuxuan Zhang, Jailing Liu, Haifa Wang, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24417
Pdf URL: https://arxiv.org/pdf/2505.24417
Copy Paste: [[2505.24417]] EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering(https://arxiv.org/abs/2505.24417)
Keywords: diffusion
Abstract: Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.

Title: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

Authors: Bozhong Zheng, Jinye Gan, Xiaohao Xu, Wenqiao Li, Xiaonan Huang, Na Ni, Yingna Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24431
Pdf URL: https://arxiv.org/pdf/2505.24431
Copy Paste: [[2505.24431]] Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation(https://arxiv.org/abs/2505.24431)
Keywords: anomaly
Abstract: 3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and a SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. The code is available at this https URL to support further research.

Title: Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields

Authors: Md Shahriar Rahim Siddiqui, Moshe Eliasof, Eldad Haber
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.24434
Pdf URL: https://arxiv.org/pdf/2505.24434
Copy Paste: [[2505.24434]] Graph Flow Matching: Enhancing Image Generation with Neighbor-Aware Flow Fields(https://arxiv.org/abs/2505.24434)
Keywords: diffusion
Abstract: Flow matching casts sample generation as learning a continuous-time velocity field that transports noise to data. Existing flow matching networks typically predict each point's velocity independently, considering only its location and time along its flow trajectory, and ignoring neighboring points. However, this pointwise approach may overlook correlations between points along the generation trajectory that could enhance velocity predictions, thereby improving downstream generation quality. To address this, we propose Graph Flow Matching (GFM), a lightweight enhancement that decomposes the learned velocity into a reaction term -- any standard flow matching network -- and a diffusion term that aggregates neighbor information via a graph neural module. This reaction-diffusion formulation retains the scalability of deep flow models while enriching velocity predictions with local context, all at minimal additional computational cost. Operating in the latent space of a pretrained variational autoencoder, GFM consistently improves Fréchet Inception Distance (FID) and recall across five image generation benchmarks (LSUN Church, LSUN Bedroom, FFHQ, AFHQ-Cat, and CelebA-HQ at $256\times256$), demonstrating its effectiveness as a modular enhancement to existing flow matching architectures.

Title: Object Centric Concept Bottlenecks

Authors: David Steinmann, Wolfgang Stammer, Antonia Wüst, Kristian Kersting
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24492
Pdf URL: https://arxiv.org/pdf/2505.24492
Copy Paste: [[2505.24492]] Object Centric Concept Bottlenecks(https://arxiv.org/abs/2505.24492)
Keywords: foundation model
Abstract: Developing high-performing, yet interpretable models remains a critical challenge in modern AI. Concept-based models (CBMs) attempt to address this by extracting human-understandable concepts from a global encoding (e.g., image encoding) and then applying a linear classifier on the resulting concept activations, enabling transparent decision-making. However, their reliance on holistic image encodings limits their expressiveness in object-centric real-world settings and thus hinders their ability to solve complex vision tasks beyond single-label classification. To tackle these challenges, we introduce Object-Centric Concept Bottlenecks (OCB), a framework that combines the strengths of CBMs and pre-trained object-centric foundation models, boosting performance and interpretability. We evaluate OCB on complex image datasets and conduct a comprehensive ablation study to analyze key components of the framework, such as strategies for aggregating object-concept encodings. The results show that OCB outperforms traditional CBMs and allows one to make interpretable decisions for complex visual tasks.

Title: un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP

Authors: Yinqi Li, Jiahe Zhao, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24517
Pdf URL: https://arxiv.org/pdf/2505.24517
Copy Paste: [[2505.24517]] un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP(https://arxiv.org/abs/2505.24517)
Keywords: foundation model, generative
Abstract: Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks. However, recent works indicate that CLIP falls short in distinguishing detailed differences in images and shows suboptimal performance on dense-prediction and vision-centric multimodal tasks. Therefore, this work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible. We find that a specific type of generative models, unCLIP, provides a suitable framework for achieving our goal. Specifically, unCLIP trains an image generator conditioned on the CLIP image embedding. In other words, it inverts the CLIP image encoder. Compared to discriminative models like CLIP, generative models are better at capturing image details because they are trained to learn the data distribution of images. Additionally, the conditional input space of unCLIP aligns with CLIP's original image-text embedding space. Therefore, we propose to invert unCLIP (dubbed un$^2$CLIP) to improve the CLIP model. In this way, the improved image encoder can gain unCLIP's visual detail capturing ability while preserving its alignment with the original text encoder simultaneously. We evaluate our improved CLIP across various tasks to which CLIP has been applied, including the challenging MMVP-VLM benchmark, the dense-prediction open-vocabulary segmentation task, and multimodal large language model tasks. Experiments show that un$^2$CLIP significantly improves the original CLIP and previous CLIP improvement methods. Code and models will be available at this https URL.

Title: UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation

Authors: Yang-Tian Sun, Xin Yu, Zehuan Huang, Yi-Hua Huang, Yuan-Chen Guo, Ziyi Yang, Yan-Pei Cao, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24521
Pdf URL: https://arxiv.org/pdf/2505.24521
Copy Paste: [[2505.24521]] UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation(https://arxiv.org/abs/2505.24521)
Keywords: diffusion
Abstract: Recently, methods leveraging diffusion model priors to assist monocular geometric estimation (e.g., depth and normal) have gained significant attention due to their strong generalization ability. However, most existing works focus on estimating geometric properties within the camera coordinate system of individual video frames, neglecting the inherent ability of diffusion models to determine inter-frame correspondence. In this work, we demonstrate that, through appropriate design and fine-tuning, the intrinsic consistency of video generation models can be effectively harnessed for consistent geometric estimation. Specifically, we 1) select geometric attributes in the global coordinate system that share the same correspondence with video frames as the prediction targets, 2) introduce a novel and efficient conditioning method by reusing positional encodings, and 3) enhance performance through joint training on multiple geometric attributes that share the same correspondence. Our results achieve superior performance in predicting global geometric attributes in videos and can be directly applied to reconstruction tasks. Even when trained solely on static video data, our approach exhibits the potential to generalize to dynamic video scenes.

Title: Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Authors: Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell'Orletta, Andrea Esuli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24523
Pdf URL: https://arxiv.org/pdf/2505.24523
Copy Paste: [[2505.24523]] Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors(https://arxiv.org/abs/2505.24523)
Keywords: generative
Abstract: Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.

Title: Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

Authors: Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.24528
Pdf URL: https://arxiv.org/pdf/2505.24528
Copy Paste: [[2505.24528]] Geospatial Foundation Models to Enable Progress on Sustainable Development Goals(https://arxiv.org/abs/2505.24528)
Keywords: foundation model
Abstract: Foundation Models (FMs) are large-scale, pre-trained AI systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.

Title: Transformers Are Universally Consistent

Authors: Sagar Ghosh, Kushal Bose, Swagatam Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24531
Pdf URL: https://arxiv.org/pdf/2505.24531
Copy Paste: [[2505.24531]] Transformers Are Universally Consistent(https://arxiv.org/abs/2505.24531)
Keywords: in-context
Abstract: Despite their central role in the success of foundational models and large-scale language modeling, the theoretical foundations governing the operation of Transformers remain only partially understood. Contemporary research has largely focused on their representational capacity for language comprehension and their prowess in in-context learning, frequently under idealized assumptions such as linearized attention mechanisms. Initially conceived to model sequence-to-sequence transformations, a fundamental and unresolved question is whether Transformers can robustly perform functional regression over sequences of input tokens. This question assumes heightened importance given the inherently non-Euclidean geometry underlying real-world data distributions. In this work, we establish that Transformers equipped with softmax-based nonlinear attention are uniformly consistent when tasked with executing Ordinary Least Squares (OLS) regression, provided both the inputs and outputs are embedded in hyperbolic space. We derive deterministic upper bounds on the empirical error which, in the asymptotic regime, decay at a provable rate of $\mathcal{O}(t^{-1/2d})$, where $t$ denotes the number of input tokens and $d$ the embedding dimensionality. Notably, our analysis subsumes the Euclidean setting as a special case, recovering analogous convergence guarantees parameterized by the intrinsic dimensionality of the data manifold. These theoretical insights are corroborated through empirical evaluations on real-world datasets involving both continuous and categorical response variables.

Title: HLSAD: Hodge Laplacian-based Simplicial Anomaly Detection

Authors: Florian Frantzen, Michael T. Schaub
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2505.24534
Pdf URL: https://arxiv.org/pdf/2505.24534
Copy Paste: [[2505.24534]] HLSAD: Hodge Laplacian-based Simplicial Anomaly Detection(https://arxiv.org/abs/2505.24534)
Keywords: anomaly
Abstract: In this paper, we propose HLSAD, a novel method for detecting anomalies in time-evolving simplicial complexes. While traditional graph anomaly detection techniques have been extensively studied, they often fail to capture changes in higher-order interactions that are crucial for identifying complex structural anomalies. These higher-order interactions can arise either directly from the underlying data itself or through graph lifting techniques. Our approach leverages the spectral properties of Hodge Laplacians of simplicial complexes to effectively model multi-way interactions among data points. By incorporating higher-dimensional simplicial structures into our method, our method enhances both detection accuracy and computational efficiency. Through comprehensive experiments on both synthetic and real-world datasets, we demonstrate that our approach outperforms existing graph methods in detecting both events and change points.

Title: AutoChemSchematic AI: A Closed-Loop, Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams

Authors: Sakhinana Sagar Srinivas, Shivam Gupta, Venkataramana Runkana
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.24584
Pdf URL: https://arxiv.org/pdf/2505.24584
Copy Paste: [[2505.24584]] AutoChemSchematic AI: A Closed-Loop, Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams(https://arxiv.org/abs/2505.24584)
Keywords: generative
Abstract: Recent advancements in generative AI have accelerated the discovery of novel chemicals and materials; however, transitioning these discoveries to industrial-scale production remains a critical bottleneck, as it requires the development of entirely new chemical manufacturing processes. Current AI methods cannot auto-generate PFDs or PIDs, despite their critical role in scaling chemical processes, while adhering to engineering constraints. We present a closed loop, physics aware framework for the automated generation of industrially viable PFDs and PIDs. The framework integrates domain specialized small scale language models (SLMs) (trained for chemical process QA tasks) with first principles simulation, leveraging three key components: (1) a hierarchical knowledge graph of process flow and instrumentation descriptions for 1,020+ chemicals, (2) a multi-stage training pipeline that fine tunes domain specialized SLMs on synthetic datasets via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Retrieval-Augmented Instruction Tuning (RAIT), and (3) DWSIM based simulator in the loop validation to ensure feasibility. To improve both runtime efficiency and model compactness, the framework incorporates advanced inference time optimizations including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test Time Inference Scaling and independently applies structural pruning techniques (width and depth) guided by importance heuristics to reduce model size with minimal accuracy loss. Experiments demonstrate that the framework generates simulator-validated process descriptions with high fidelity, outperforms baseline methods in correctness, and generalizes to unseen chemicals. By bridging AI-driven design with industrial-scale feasibility, this work significantly reduces R&D timelines from lab discovery to plant deployment.

Title: Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Authors: Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24616
Pdf URL: https://arxiv.org/pdf/2505.24616
Copy Paste: [[2505.24616]] Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX(https://arxiv.org/abs/2505.24616)
Keywords: generative
Abstract: We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.

Title: Category-Level 6D Object Pose Estimation in Agricultural Settings Using a Lattice-Deformation Framework and Diffusion-Augmented Synthetic Data

Authors: Marios Glytsos, Panagiotis P. Filntisis, George Retsinas, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24636
Pdf URL: https://arxiv.org/pdf/2505.24636
Copy Paste: [[2505.24636]] Category-Level 6D Object Pose Estimation in Agricultural Settings Using a Lattice-Deformation Framework and Diffusion-Augmented Synthetic Data(https://arxiv.org/abs/2505.24636)
Keywords: diffusion
Abstract: Accurate 6D object pose estimation is essential for robotic grasping and manipulation, particularly in agriculture, where fruits and vegetables exhibit high intra-class variability in shape, size, and texture. The vast majority of existing methods rely on instance-specific CAD models or require depth sensors to resolve geometric ambiguities, making them impractical for real-world agricultural applications. In this work, we introduce PLANTPose, a novel framework for category-level 6D pose estimation that operates purely on RGB input. PLANTPose predicts both the 6D pose and deformation parameters relative to a base mesh, allowing a single category-level CAD model to adapt to unseen instances. This enables accurate pose estimation across varying shapes without relying on instance-specific data. To enhance realism and improve generalization, we also leverage Stable Diffusion to refine synthetic training images with realistic texturing, mimicking variations due to ripeness and environmental factors and bridging the domain gap between synthetic data and the real world. Our evaluations on a challenging benchmark that includes bananas of various shapes, sizes, and ripeness status demonstrate the effectiveness of our framework in handling large intraclass variations while maintaining accurate 6D pose predictions, significantly outperforming the state-of-the-art RGB-based approach MegaPose.

Title: A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning

Authors: Chengzhi Wu, Qianliang Huang, Kun Jin, Julius Pfrommer, Jürgen Beyerer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24641
Pdf URL: https://arxiv.org/pdf/2505.24641
Copy Paste: [[2505.24641]] A Cross Branch Fusion-Based Contrastive Learning Framework for Point Cloud Self-supervised Learning(https://arxiv.org/abs/2505.24641)
Keywords: self-supervised
Abstract: Contrastive learning is an essential method in self-supervised learning. It primarily employs a multi-branch strategy to compare latent representations obtained from different branches and train the encoder. In the case of multi-modal input, diverse modalities of the same object are fed into distinct branches. When using single-modal data, the same input undergoes various augmentations before being fed into different branches. However, all existing contrastive learning frameworks have so far only performed contrastive operations on the learned features at the final loss end, with no information exchange between different branches prior to this stage. In this paper, for point cloud unsupervised learning without the use of extra training data, we propose a Contrastive Cross-branch Attention-based framework for Point cloud data (termed PoCCA), to learn rich 3D point cloud representations. By introducing sub-branches, PoCCA allows information exchange between different branches before the loss end. Experimental results demonstrate that in the case of using no extra training data, the representations learned with our self-supervised model achieve state-of-the-art performances when used for downstream tasks on point clouds.

Title: MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR

Authors: Dimitrios Damianos, Georgios Paraskevopoulos, Alexandros Potamianos
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.24656
Pdf URL: https://arxiv.org/pdf/2505.24656
Copy Paste: [[2505.24656]] MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR(https://arxiv.org/abs/2505.24656)
Keywords: self-supervised
Abstract: In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results, significantly outperforming state-of-the-art methods, and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of utilizing a cascading approach when combining self-supervision with self-training.

Title: Conformal Prediction for Zero-Shot Models

Authors: Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24693
Pdf URL: https://arxiv.org/pdf/2505.24693
Copy Paste: [[2505.24693]] Conformal Prediction for Zero-Shot Models(https://arxiv.org/abs/2505.24693)
Keywords: foundation model
Abstract: Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although its discriminative potential has been widely explored, its reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductive over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits but still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.

Title: PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations

Authors: Benjamin Holzschuh, Qiang Liu, Georg Kohl, Nils Thuerey
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24717
Pdf URL: https://arxiv.org/pdf/2505.24717
Copy Paste: [[2505.24717]] PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations(https://arxiv.org/abs/2505.24717)
Keywords: diffusion, foundation model
Abstract: We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations.

Title: HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24722
Pdf URL: https://arxiv.org/pdf/2505.24722
Copy Paste: [[2505.24722]] HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts(https://arxiv.org/abs/2505.24722)
Keywords: generative
Abstract: Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in Hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We thus introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

Title: AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption

Authors: Yajie Zhou, Xiaoyi Pang, Zhibo Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24773
Pdf URL: https://arxiv.org/pdf/2505.24773
Copy Paste: [[2505.24773]] AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption(https://arxiv.org/abs/2505.24773)
Keywords: foundation model
Abstract: Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model's performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous environments in the real world.

Title: Diffusion-Based Symbolic Regression

Authors: Zachary Bastiani, Robert M. Kirby, Jacob Hochhalter, Shandian Zhe
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24776
Pdf URL: https://arxiv.org/pdf/2505.24776
Copy Paste: [[2505.24776]] Diffusion-Based Symbolic Regression(https://arxiv.org/abs/2505.24776)
Keywords: diffusion, generative
Abstract: Diffusion has emerged as a powerful framework for generative modeling, achieving remarkable success in applications such as image and audio synthesis. Enlightened by this progress, we propose a novel diffusion-based approach for symbolic regression. We construct a random mask-based diffusion and denoising process to generate diverse and high-quality equations. We integrate this generative processes with a token-wise Group Relative Policy Optimization (GRPO) method to conduct efficient reinforcement learning on the given measurement dataset. In addition, we introduce a long short-term risk-seeking policy to expand the pool of top-performing candidates, further enhancing performance. Extensive experiments and ablation studies have demonstrated the effectiveness of our approach.

Title: EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation

Authors: Yidong Luo, Chenguang Wang, Jiahao Yang, Fanzeng Xia, Tianshu Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24779
Pdf URL: https://arxiv.org/pdf/2505.24779
Copy Paste: [[2505.24779]] EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation(https://arxiv.org/abs/2505.24779)
Keywords: generative
Abstract: Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning's demand for diverse optimization datasets and the limitations of static benchmarks, has significantly outpaced standardized evaluation techniques. Consequently, assessing the fidelity and utility of synthetic MILP instances remains a critical, multifaceted challenge. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods. Our framework provides a unified and extensible methodology, assessing instance quality across crucial dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream machine learning tasks. A key innovation is its in-depth analysis of solver-internal features -- particularly by comparing distributions of key solver outputs including root node gap, heuristic success rates, and cut plane usage -- leveraging the solver's dynamic solution behavior as an `expert assessment' to reveal nuanced computational resemblances. By offering a structured approach with clearly defined solver-independent and solver-dependent metrics, our benchmark aims to facilitate robust comparisons among diverse generation techniques, spur the development of higher-quality instance generators, and ultimately enhance the reliability of research reliant on synthetic MILP data. The framework's effectiveness in systematically comparing the fidelity of instance sets is demonstrated using contemporary generative models.

Title: QGAN-based data augmentation for hybrid quantum-classical neural networks

Authors: Run-Ze He, Jun-Jian Su, Su-Juan Qin, Zheng-Ping Jin, Fei Gao
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2505.24780
Pdf URL: https://arxiv.org/pdf/2505.24780
Copy Paste: [[2505.24780]] QGAN-based data augmentation for hybrid quantum-classical neural networks(https://arxiv.org/abs/2505.24780)
Keywords: generative
Abstract: Quantum neural networks converge faster and achieve higher accuracy than classical models. However, data augmentation in quantum machine learning remains underexplored. To tackle data scarcity, we integrate quantum generative adversarial networks (QGANs) with hybrid quantum-classical neural networks (HQCNNs) to develop an augmentation framework. We propose two strategies: a general approach to enhance data processing and classification across HQCNNs, and a customized strategy that dynamically generates samples tailored to the HQCNN's performance on specific data categories, improving its ability to learn from complex datasets. Simulation experiments on the MNIST dataset demonstrate that QGAN outperforms traditional data augmentation methods and classical GANs. Compared to baseline DCGAN, QGAN achieves comparable performance with half the parameters, balancing efficiency and effectiveness. This suggests that QGANs can simplify models and generate high-quality data, enhancing HQCNN accuracy and performance. These findings pave the way for applying quantum data augmentation techniques in machine learning.

Title: Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding

Authors: Jiaru Zhang, Juanwu Lu, Ziran Wang, Ruqi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.24791
Pdf URL: https://arxiv.org/pdf/2505.24791
Copy Paste: [[2505.24791]] Inference Acceleration of Autoregressive Normalizing Flows by Selective Jacobi Decoding(https://arxiv.org/abs/2505.24791)
Keywords: generative
Abstract: Normalizing flows are promising generative models with advantages such as theoretical rigor, analytical log-likelihood computation, and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. However, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment. In this paper, we first identify that strict sequential dependency in inference is unnecessary to generate high-quality samples. We observe that patches in sequential modeling can also be approximated without strictly conditioning on all preceding patches. Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers. Leveraging these observations, we propose a selective Jacobi decoding (SeJD) strategy that accelerates autoregressive inference through parallel iterative optimization. Theoretical analyses demonstrate the method's superlinear convergence rate and guarantee that the number of iterations required is no greater than the original sequential approach. Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique. Experiments demonstrate substantial speed improvements up to 4.7 times faster inference while keeping the generation quality and fidelity.

Title: Guiding Generative Storytelling with Knowledge Graphs

Authors: Zhijun Pan, Antonios Andronis, Eva Hayek, Oscar AP Wilkinson, Ilya Lasy, Annette Parry, Guy Gadney, Tim J. Smith, Mick Grierson
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.24803
Pdf URL: https://arxiv.org/pdf/2505.24803
Copy Paste: [[2505.24803]] Guiding Generative Storytelling with Knowledge Graphs(https://arxiv.org/abs/2505.24803)
Keywords: generative
Abstract: Large Language Models (LLMs) have shown great potential in automated story generation, but challenges remain in maintaining long-form coherence and providing users with intuitive and effective control. Retrieval-Augmented Generation (RAG) has proven effective in reducing hallucinations in text generation; however, the use of structured data to support generative storytelling remains underexplored. This paper investigates how knowledge graphs (KGs) can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications. We propose a KG-assisted storytelling pipeline and evaluate its effectiveness through a user study with 15 participants. Participants created their own story prompts, generated stories, and edited knowledge graphs to shape their narratives. Through quantitative and qualitative analysis, our findings demonstrate that knowledge graphs significantly enhance story quality in action-oriented and structured narratives within our system settings. Additionally, editing the knowledge graph increases users' sense of control, making storytelling more engaging, interactive, and playful.

Title: Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking

Authors: Heli Ben-Hamu, Itai Gat, Daniel Severo, Niklas Nolte, Brian Karrer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24857
Pdf URL: https://arxiv.org/pdf/2505.24857
Copy Paste: [[2505.24857]] Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking(https://arxiv.org/abs/2505.24857)
Keywords: diffusion
Abstract: Recent masked diffusion models (MDMs) have shown competitive performance compared to autoregressive models (ARMs) for language modeling. While most literature has focused on performance enhancing sampling procedures, efficient sampling from MDMs has been scarcely explored. We make the observation that often a given sequence of partially masked tokens determines the values of multiple unknown tokens deterministically, meaning that a single prediction of a masked model holds additional information unused by standard sampling procedures. Based on this observation, we introduce EB-Sampler, a simple drop-in replacement for existing samplers, utilizing an Entropy Bounded unmasking procedure that dynamically unmasks multiple tokens in one function evaluation with predefined approximate error tolerance. We formulate the EB-Sampler as part of a broad family of adaptive samplers for which we provide an error analysis that motivates our algorithmic choices. EB-Sampler accelerates sampling from current state of the art MDMs by roughly 2-3x on standard coding and math reasoning benchmarks without loss in performance. We also validate the same procedure works well on smaller reasoning tasks including maze navigation and Sudoku, tasks ARMs often struggle with.

Title: ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

Authors: Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24862
Pdf URL: https://arxiv.org/pdf/2505.24862
Copy Paste: [[2505.24862]] ViStoryBench: Comprehensive Benchmark Suite for Story Visualization(https://arxiv.org/abs/2505.24862)
Keywords: generative
Abstract: Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.

Title: TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

Authors: Xinqi Xiong, Prakrut Patel, Qingyuan Fan, Amisha Wadhwa, Sarathy Selvam, Xiao Guo, Luchao Qi, Xiaoming Liu, Roni Sengupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24866
Pdf URL: https://arxiv.org/pdf/2505.24866
Copy Paste: [[2505.24866]] TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection(https://arxiv.org/abs/2505.24866)
Keywords: generative
Abstract: The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on this https URL with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.

Title: GenSpace: Benchmarking Spatially-Aware Image Generation

Authors: Zehan Wang, Jiayang Xu, Ziang Zhang, Tianyu Pan, Chao Du, Hengshuang Zhao, Zhou Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24870
Pdf URL: https://arxiv.org/pdf/2505.24870
Copy Paste: [[2505.24870]] GenSpace: Benchmarking Spatially-Aware Image Generation(https://arxiv.org/abs/2505.24870)
Keywords: foundation model
Abstract: Humans can intuitively compose and arrange scenes in the 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Furthermore, standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture the detailed spatial errors. To handle this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned metric of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details like object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.

Title: MiniMax-Remover: Taming Bad Noise Helps Video Object Removal

Authors: Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, Kam-Fai Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24873
Pdf URL: https://arxiv.org/pdf/2505.24873
Copy Paste: [[2505.24873]] MiniMax-Remover: Taming Bad Noise Helps Video Object Removal(https://arxiv.org/abs/2505.24873)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have driven rapid progress in video editing techniques. However, video object removal, a critical subtask of video editing, remains challenging due to issues such as hallucinated objects and visual artifacts. Furthermore, existing methods often rely on computationally expensive sampling procedures and classifier-free guidance (CFG), resulting in slow inference. To address these limitations, we propose MiniMax-Remover, a novel two-stage video object removal approach. Motivated by the observation that text condition is not best suited for this task, we simplify the pretrained video generation model by removing textual input and cross-attention layers, resulting in a more lightweight and efficient model architecture in the first stage. In the second stage, we distilled our remover on successful videos produced by the stage-1 model and curated by human annotators, using a minimax optimization strategy to further improve editing quality and inference speed. Specifically, the inner maximization identifies adversarial input noise ("bad noise") that makes failure removals, while the outer minimization step trains the model to generate high-quality removal results even under such challenging conditions. As a result, our method achieves a state-of-the-art video object removal results with as few as 6 sampling steps and doesn't rely on CFG, significantly improving inference efficiency. Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. Codes and Videos are available at: this https URL.

Title: The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models

Authors: Adam Stein, Aaditya Naik, Neelay Velingker, Mayur Naik, Eric Wong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.24874
Pdf URL: https://arxiv.org/pdf/2505.24874
Copy Paste: [[2505.24874]] The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models(https://arxiv.org/abs/2505.24874)
Keywords: foundation model
Abstract: Neuro-symbolic learning was proposed to address challenges with training neural networks for complex reasoning tasks with the added benefits of interpretability, reliability, and efficiency. Neuro-symbolic learning methods traditionally train neural models in conjunction with symbolic programs, but they face significant challenges that limit them to simplistic problems. On the other hand, purely-neural foundation models now reach state-of-the-art performance through prompting rather than training, but they are often unreliable and lack interpretability. Supplementing foundation models with symbolic programs, which we call neuro-symbolic prompting, provides a way to use these models for complex reasoning tasks. Doing so raises the question: What role does specialized model training as part of neuro-symbolic learning have in the age of foundation models? To explore this question, we highlight three pitfalls of traditional neuro-symbolic learning with respect to the compute, data, and programs leading to generalization problems. This position paper argues that foundation models enable generalizable neuro-symbolic solutions, offering a path towards achieving the original goals of neuro-symbolic learning without the downsides of training from scratch.

Title: ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Authors: Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.24875
Pdf URL: https://arxiv.org/pdf/2505.24875
Copy Paste: [[2505.24875]] ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL(https://arxiv.org/abs/2505.24875)
Keywords: generative
Abstract: Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization. To enable the model to reason through text before generating images, We automatically generate and release a corpus of model crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: this http URL.

Title: AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion

Authors: Yangyi Huang, Ye Yuan, Xueting Li, Jan Kautz, Umar Iqbal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.24877
Pdf URL: https://arxiv.org/pdf/2505.24877
Copy Paste: [[2505.24877]] AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion(https://arxiv.org/abs/2505.24877)
Keywords: diffusion
Abstract: Existing methods for image-to-3D avatar generation struggle to produce highly detailed, animation-ready avatars suitable for real-world applications. We introduce AdaHuman, a novel framework that generates high-fidelity animatable 3D avatars from a single in-the-wild image. AdaHuman incorporates two key innovations: (1) A pose-conditioned 3D joint diffusion model that synthesizes consistent multi-view images in arbitrary poses alongside corresponding 3D Gaussian Splats (3DGS) reconstruction at each diffusion step; (2) A compositional 3DGS refinement module that enhances the details of local body parts through image-to-image refinement and seamlessly integrates them using a novel crop-aware camera ray map, producing a cohesive detailed 3D avatar. These components allow AdaHuman to generate highly realistic standardized A-pose avatars with minimal self-occlusion, enabling rigging and animation with any input motion. Extensive evaluation on public benchmarks and in-the-wild images demonstrates that AdaHuman significantly outperforms state-of-the-art methods in both avatar reconstruction and reposing. Code and models will be publicly available for research purposes.