2025-12-01

Title: Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation

Authors: Fatima Kazi
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21711
Pdf URL: https://arxiv.org/pdf/2511.21711
Copy Paste: [[2511.21711]] Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation(https://arxiv.org/abs/2511.21711)
Keywords: generative
Abstract: Large Language models (LLMs), such as ChatGPT, have gained popularity in recent years with the advancement of Natural Language Processing (NLP), with use cases spanning many disciplines and daily lives as well. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to comprehensively examine such shortcomings by identifying the existence and extent of such biases, recognizing the origin, and attempting to mitigate such biased outputs to ensure fair outputs to reduce harmful stereotypes and misinformation. This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrated that despite some cases of success, LLMs often over-rely on keywords in prompts and its outputs. This demonstrates the incapability of LLMs to attempt to truly understand the accuracy and authenticity of its outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning, models using different prompting techniques, and data augmentation of the bias benchmarks. We found fine-tuned models to exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.

Title: HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

Authors: Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21732
Pdf URL: https://arxiv.org/pdf/2511.21732
Copy Paste: [[2511.21732]] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation(https://arxiv.org/abs/2511.21732)
Keywords: generative
Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in the generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.

Title: EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants

Authors: Meenakshi Mittal, Rishi Khare, Mihran Miroyan, Chancharik Mitra, Narges Norouzi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21742
Pdf URL: https://arxiv.org/pdf/2511.21742
Copy Paste: [[2511.21742]] EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants(https://arxiv.org/abs/2511.21742)
Keywords: generative
Abstract: With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce {\model}, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment. Website and Supplementary Material: this https URL

Title: DELTA: Language Diffusion-based EEG-to-Text Architecture

Authors: Mingyu Jeon, Hyobin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21746
Pdf URL: https://arxiv.org/pdf/2511.21746
Copy Paste: [[2511.21746]] DELTA: Language Diffusion-based EEG-to-Text Architecture(https://arxiv.org/abs/2511.21746)
Keywords: diffusion
Abstract: Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.

Title: Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness

Authors: Svitlana Volkova, Will Dupree, Hsien-Te Kao, Peter Bautista, Gabe Ganberg, Jeff Beaubien, Laura Cassani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21749
Pdf URL: https://arxiv.org/pdf/2511.21749
Copy Paste: [[2511.21749]] Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness(https://arxiv.org/abs/2511.21749)
Keywords: generative
Abstract: This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetorical, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.

Title: Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

Authors: Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye, Runsheng Wang, Meng Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21759
Pdf URL: https://arxiv.org/pdf/2511.21759
Copy Paste: [[2511.21759]] Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models(https://arxiv.org/abs/2511.21759)
Keywords: diffusion
Abstract: Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance its inference efficiency by enabling KV caching. However, its bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.

Title: fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21760
Pdf URL: https://arxiv.org/pdf/2511.21760
Copy Paste: [[2511.21760]] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding(https://arxiv.org/abs/2511.21760)
Keywords: foundation model
Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

Title: Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach

Authors: Aamiruddin Syed, Mohammed Ilyas Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.21795
Pdf URL: https://arxiv.org/pdf/2511.21795
Copy Paste: [[2511.21795]] Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach(https://arxiv.org/abs/2511.21795)
Keywords: anomaly
Abstract: Cloud security is an important concern. To identify and stop cyber threats, efficient data collection methods are necessary. This research presents an innovative method to cloud security by integrating numerous data sources and modalities with multi-modal deep learning autoencoders. The Multi-Modal Deep Learning Ensemble Architecture (MMDLEA), a unique approach for anomaly detection and classification in multi-modal data, is proposed in this study. The proposed design integrates the best features of six deep learning models: Multi-Modal Deep Learning Autoencoder (MMDLA), Anomaly Detection using Adaptive Metric Learning (ADAM), ADADELTA, ADAGRAD, RMSPROP, and Stacked Graph Transformer (SGT). A final prediction is produced by combining the outputs of all the models, each of which is trained using a distinct modality of the data. Based on the test dataset, the recommended MMDLA architecture achieves an accuracy of 98.5% and an F1-score of 0.985, demonstrating its superior performance over each individual model. Of the different models, the ADAM model performs the best, with an accuracy of 96.2% and an F1-score of 0.962. With an F1-score of 0.955 and an accuracy of 95.5%, the ADADELTA model trails closely behind. MMDLA obtains an F1-score of 0.948 and an accuracy of 94.8%. Additionally, the suggested MMDLEA design exhibits enhanced resilience to fluctuating modalities and noisy data, proving its usefulness in practical settings. Future study in this area is made possible by the results, which show the potential of the proposed framework for abnormal identification and categorization in multi-modal data.

Title: Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison

Authors: Md. Sad Abdullah Sami, Mushfiquzzaman Abid
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2511.21842
Pdf URL: https://arxiv.org/pdf/2511.21842
Copy Paste: [[2511.21842]] Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison(https://arxiv.org/abs/2511.21842)
Keywords: anomaly
Abstract: The rapid expansion of Internet of Things (IoT) deployments across diverse sectors has significantly enhanced operational efficiency, yet concurrently elevated cybersecurity vulnerabilities due to increased exposure to cyber threats. Given the limitations of traditional signature-based Anomaly Detection Systems (ADS) in identifying emerging and zero-day threats, this study investigates the effectiveness of two unsupervised anomaly detection techniques, Isolation Forest (IF) and One-Class Support Vector Machine (OC-SVM), using the TON_IoT thermostat dataset. A comprehensive evaluation was performed based on standard metrics (accuracy, precision, recall, and F1-score) alongside critical resource utilization metrics such as inference time, model size, and peak RAM usage. Experimental results revealed that IF consistently outperformed OC-SVM, achieving higher detection accuracy, superior precision, and recall, along with a significantly better F1-score. Furthermore, Isolation Forest demonstrated a markedly superior computational footprint, making it more suitable for deployment on resource-constrained IoT edge devices. These findings underscore Isolation Forest's robustness in high-dimensional and imbalanced IoT environments and highlight its practical viability for real-time anomaly detection.

Title: Towards a Foundation Model for Partial Differential Equations Across Physics Domains

Authors: Eduardo Soares, Emilio Vital Brazil, Victor Shirasuna, Breno W. S. R. de Carvalho, Cristiano Malossi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21861
Pdf URL: https://arxiv.org/pdf/2511.21861
Copy Paste: [[2511.21861]] Towards a Foundation Model for Partial Differential Equations Across Physics Domains(https://arxiv.org/abs/2511.21861)
Keywords: foundation model
Abstract: We present PDE-FM, a modular foundation model for physics-informed machine learning that unifies spatial, spectral, and temporal reasoning across heterogeneous partial differential equation (PDE) systems. PDE-FM combines spatial-spectral tokenization, physics-aware conditioning, and a Mamba-based state-space backbone with an operator-theoretic decoder, enabling scalable and data-efficient modeling of complex physical dynamics. In contrast to task-specific neural operators, PDE-FM is pretrained once on diverse PDE datasets and can be transferred to new physical regimes without architectural or data-specific modifications. Evaluated on twelve 2D and 3D datasets from The Well benchmark - spanning hydrodynamic, radiative, elastic, and astrophysical phenomena - PDE-FM achieves state-of-the-art accuracy in six domains, reducing mean VRMSE by 46% relative to prior operator-learning baselines. The model demonstrates robust cross-physics generalization, excelling in turbulent and radiative systems while maintaining strong performance in linear and steady-state regimes. These results suggest that large-scale pretraining across diverse physical processes can yield transferable representations of dynamics, marking a step toward unified, foundation-level surrogates for multi-physics simulation and scientific discovery.

Title: Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training

Authors: Eric Yeats, Darryl Hannan, Wilson Fearn, Timothy Doster, Henry Kvinge, Scott Mahan
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.21863
Pdf URL: https://arxiv.org/pdf/2511.21863
Copy Paste: [[2511.21863]] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training(https://arxiv.org/abs/2511.21863)
Keywords: diffusion, generative
Abstract: Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG) which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost of classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve general state-of-the-art in FD-DINOv2 score. Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.

Title: Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium

Authors: Akbar Anbar Jafari, Gholamreza Anbarjafari
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.21882
Pdf URL: https://arxiv.org/pdf/2511.21882
Copy Paste: [[2511.21882]] Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium(https://arxiv.org/abs/2511.21882)
Keywords: diffusion
Abstract: Contemporary autoregressive transformers operate in open loop: each hidden state is computed in a single forward pass and never revised, causing errors to propagate uncorrected through the sequence. We identify this open-loop bottleneck as a fundamental architectural limitation underlying well-documented failures in long-range reasoning, factual consistency, and multi-step planning. To address this limitation, we introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium before committing to each token. We instantiate this principle as Equilibrium Transformers (EqT), which augment standard transformer layers with an Equilibrium Refinement Module that minimizes a learned energy function via gradient descent in latent space. The energy function enforces bidirectional prediction consistency, episodic memory coherence, and output confidence, all computed without external supervision. Theoretically, we prove that EqT performs approximate MAP inference in a latent energy-based model, establish linear convergence guarantees, and show that refinement improves predictions precisely on hard instances where one-shot inference is suboptimal. The framework unifies deep equilibrium models, diffusion language models, and test-time training as special cases. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance, validating that the benefit of deliberation scales with task difficulty. Just as attention mechanisms resolved the sequential bottleneck of recurrent networks, we propose that closed-loop equilibrium may resolve the commitment bottleneck of open-loop autoregression, representing a foundational step toward language models.

Title: UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation

Authors: Bu Jin, Weize Li, Songen Gu, Yupeng Zheng, Yuhang Zheng, Zhengyi Zhou, Yao Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21887
Pdf URL: https://arxiv.org/pdf/2511.21887
Copy Paste: [[2511.21887]] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation(https://arxiv.org/abs/2511.21887)
Keywords: diffusion
Abstract: Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage techniques, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding, which spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behaviors alongside structural formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for fixed joint semantics and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.

Title: Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Authors: Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21893
Pdf URL: https://arxiv.org/pdf/2511.21893
Copy Paste: [[2511.21893]] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings(https://arxiv.org/abs/2511.21893)
Keywords: foundation model, generative
Abstract: Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions (Zhang et al., 2025), where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 to 46) and 11% (32 to 43) in unperturbed and perturbed input settings respectively, providing an effective and model-agnostic defense against adversarial illusions.

Title: Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

Authors: Yifan Zhou, Sachin Grover, Mohamed El Mistiri, Kamalesh Kalirathnam, Pratyush Kerhalkar, Swaroop Mishra, Neelesh Kumar, Sanket Gaurav, Oya Aran, Heni Ben Amor
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21928
Pdf URL: https://arxiv.org/pdf/2511.21928
Copy Paste: [[2511.21928]] Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs(https://arxiv.org/abs/2511.21928)
Keywords: in-context
Abstract: Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augment existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop-directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.

Title: Modeling Quantum Autoencoder Trainable Kernel for IoT Anomaly Detection

Authors: Swathi Chandrasekhar, Shiva Raj Pokhrel, Swati Kumari, Navneet Singh
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2511.21932
Pdf URL: https://arxiv.org/pdf/2511.21932
Copy Paste: [[2511.21932]] Modeling Quantum Autoencoder Trainable Kernel for IoT Anomaly Detection(https://arxiv.org/abs/2511.21932)
Keywords: anomaly
Abstract: Escalating cyber threats and the high-dimensional complexity of IoT traffic have outpaced classical anomaly detection methods. While deep learning offers improvements, computational bottlenecks limit real-time deployment at scale. We present a quantum autoencoder (QAE) framework that compresses network traffic into discriminative latent representations and employs quantum support vector classification (QSVC) for intrusion detection. Evaluated on three datasets, our approach achieves improved accuracy on ideal simulators and on the IBM Quantum hardware demonstrating practical quantum advantage on current NISQ devices. Crucially, moderate depolarizing noise acts as implicit regularization, stabilizing training and enhancing generalization. This work establishes quantum machine learning as a viable, hardware-ready solution for real-world cybersecurity challenges.

Title: AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views

Authors: Junwei Zhou, Yu-Wing Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21945
Pdf URL: https://arxiv.org/pdf/2511.21945
Copy Paste: [[2511.21945]] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views(https://arxiv.org/abs/2511.21945)
Keywords: generative
Abstract: Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.

Title: WalkCLIP: Multimodal Learning for Urban Walkability Prediction

Authors: Shilong Xiang, JangHyeon Lee, Min Namgung, Yao-Yi Chiang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21947
Pdf URL: https://arxiv.org/pdf/2511.21947
Copy Paste: [[2511.21947]] WalkCLIP: Multimodal Learning for Urban Walkability Prediction(https://arxiv.org/abs/2511.21947)
Keywords: foundation model
Abstract: Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.

Title: DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models

Authors: Futian Wang, Chaoliu Weng, Xiao Wang, Zhen Chen, Zhicheng Zhao, Jin Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21982
Pdf URL: https://arxiv.org/pdf/2511.21982
Copy Paste: [[2511.21982]] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models(https://arxiv.org/abs/2511.21982)
Keywords: foundation model
Abstract: The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overly between thin pointers and scale markings. Up to now, this area still lacks large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments fully validated the effectiveness of our proposed framework on the newly proposed benchmark dataset. Both the dataset and source code will be released on this https URL

Title: PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation

Authors: Xuchen Li, Hengrui Gu, Mohan Zhang, Qin Liu, Zhen Tan, Xinyuan Zhu, Huixue Zhou, Tianlong Chen, Kaixiong Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21984
Pdf URL: https://arxiv.org/pdf/2511.21984
Copy Paste: [[2511.21984]] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation(https://arxiv.org/abs/2511.21984)
Keywords: foundation model
Abstract: Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain the precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bboxes pairs are then leveraged to train a pseudo-labeled detector, producing the high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced spatially-grounding bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.

Title: StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

Authors: Sen Fang, Hongbin Zhong, Yalin Feng, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22009
Pdf URL: https://arxiv.org/pdf/2511.22009
Copy Paste: [[2511.22009]] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation(https://arxiv.org/abs/2511.22009)
Keywords: diffusion, generative
Abstract: New technologies such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, due to some differences in its theory, design, and existing diffusion models, the existing acceleration methods cannot be directly applied to the Rectified Flow model. In this article, we have comprehensively implemented an overall acceleration pipeline from the aspects of theory, design, and reasoning strategies. This pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for the new methods to comprehensively accelerate related models based on flow models. Currently, the existing public methods usually achieve an acceleration of 18%, while experiments have proved that our new method can accelerate the 512*512 image generation speed to up to 611%, which is far beyond the current non-generalized acceleration methods.

Title: ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resoultion

Authors: Junoh Kang, Donghun Ryu, Bohyung Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22048
Pdf URL: https://arxiv.org/pdf/2511.22048
Copy Paste: [[2511.22048]] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resoultion(https://arxiv.org/abs/2511.22048)
Keywords: diffusion, generative
Abstract: Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.

Title: Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Authors: Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22069
Pdf URL: https://arxiv.org/pdf/2511.22069
Copy Paste: [[2511.22069]] Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian(https://arxiv.org/abs/2511.22069)
Keywords: diffusion, generative
Abstract: Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.

Title: ARES: Anomaly Recognition Model For Edge Streams

Authors: Simone Mungari, Albert Bifet, Giuseppe Manco, Bernhard Pfahringer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22078
Pdf URL: https://arxiv.org/pdf/2511.22078
Copy Paste: [[2511.22078]] ARES: Anomaly Recognition Model For Edge Streams(https://arxiv.org/abs/2511.22078)
Keywords: anomaly
Abstract: Many real-world scenarios involving streaming information can be represented as temporal graphs, where data flows through dynamic changes in edges over time. Anomaly detection in this context has the objective of identifying unusual temporal connections within the graph structure. Detecting edge anomalies in real time is crucial for mitigating potential risks. Unlike traditional anomaly detection, this task is particularly challenging due to concept drifts, large data volumes, and the need for real-time response. To face these challenges, we introduce ARES, an unsupervised anomaly detection framework for edge streams. ARES combines Graph Neural Networks (GNNs) for feature extraction with Half-Space Trees (HST) for anomaly scoring. GNNs capture both spike and burst anomalous behaviors within streams by embedding node and edge properties in a latent space, while HST partitions this space to isolate anomalies efficiently. ARES operates in an unsupervised way without the need for prior data labeling. To further validate its detection capabilities, we additionally incorporate a simple yet effective supervised thresholding mechanism. This approach leverages statistical dispersion among anomaly scores to determine the optimal threshold using a minimal set of labeled data, ensuring adaptability across different domains. We validate ARES through extensive evaluations across several real-world cyber-attack scenarios, comparing its performance against existing methods while analyzing its space and time complexity.

Title: WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Authors: Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22098
Pdf URL: https://arxiv.org/pdf/2511.22098
Copy Paste: [[2511.22098]] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation(https://arxiv.org/abs/2511.22098)
Keywords: diffusion, in-context
Abstract: Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Title: MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation

Authors: Simon Joseph Clément Crête, Marta Kersten-Oertel, Yiming Xiao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22102
Pdf URL: https://arxiv.org/pdf/2511.22102
Copy Paste: [[2511.22102]] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation(https://arxiv.org/abs/2511.22102)
Keywords: generative
Abstract: MRI-based brain age estimation models aim to assess a subject's biological brain age based on information, such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging and measuring this phenomena could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representation and results. To address this, we propose to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age based on widely used T1w structural MRI for the first time and leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better or comparably with the state-of-the-art methods with significantly larger training data. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than conventional deep regression. As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer's Disease and Parkinson's disease patients, and revealed the correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.

Title: PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization

Authors: Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22119
Pdf URL: https://arxiv.org/pdf/2511.22119
Copy Paste: [[2511.22119]] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization(https://arxiv.org/abs/2511.22119)
Keywords: diffusion, generative
Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: this https URL

Title: Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

Authors: Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22121
Pdf URL: https://arxiv.org/pdf/2511.22121
Copy Paste: [[2511.22121]] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation(https://arxiv.org/abs/2511.22121)
Keywords: generative
Abstract: Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

Title: Benchmarking In-context Experiential Learning Through Repeated Product Recommendations

Authors: Gilbert Yang, Yaqin Chen, Thomson Yen, Hongseok Namkoong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22130
Pdf URL: https://arxiv.org/pdf/2511.22130
Copy Paste: [[2511.22130]] Benchmarking In-context Experiential Learning Through Repeated Product Recommendations(https://arxiv.org/abs/2511.22130)
Keywords: in-context
Abstract: To reliably navigate ever-shifting real-world environments, agents must grapple with incomplete knowledge and adapt their behavior through experience. However, current evaluations largely focus on tasks that leave no ambiguity, and do not measure agents' ability to adaptively learn and reason through the experiences they accrued. We exemplify the need for this in-context experiential learning in a product recommendation context, where agents must navigate shifting customer preferences and product landscapes through natural language dialogue. We curate a benchmark for experiential learning and active exploration (BELA) that combines (1) rich real-world products from Amazon, (2) a diverse collection of user personas to represent heterogeneous yet latent preferences, and (3) a LLM user simulator powered by the persona to create rich interactive trajectories. We observe that current frontier models struggle to meaningfully improve across episodes, underscoring the need for agentic systems with strong in-context learning capabilities.

Title: Autonomous labeling of surgical resection margins using a foundation model

Authors: Xilin Yang, Musa Aydin, Yuhong Lu, Sahan Yoruc Selcuk, Bijie Bai, Yijie Zhang, Andrew Birkeland, Katjana Ehrlich, Julien Bec, Laura Marcu, Nir Pillar, Aydogan Ozcan
Subjects: cs.CV, cs.LG, physics.med-ph
Abstract URL: https://arxiv.org/abs/2511.22131
Pdf URL: https://arxiv.org/pdf/2511.22131
Copy Paste: [[2511.22131]] Autonomous labeling of surgical resection margins using a foundation model(https://arxiv.org/abs/2511.22131)
Keywords: foundation model
Abstract: Assessing resection margins is central to pathological specimen evaluation and has profound implications for patient outcomes. Current practice employs physical inking, which is applied variably, and cautery artifacts can obscure the true margin on histological sections. We present a virtual inking network (VIN) that autonomously localizes the surgical cut surface on whole-slide images, reducing reliance on inks and standardizing margin-focused review. VIN uses a frozen foundation model as the feature extractor and a compact two-layer multilayer perceptron trained for patch-level classification of cautery-consistent features. The dataset comprised 120 hematoxylin and eosin (H&E) stained slides from 12 human tonsil tissue blocks, resulting in ~2 TB of uncompressed raw image data, where a board-certified pathologist provided boundary annotations. In blind testing with 20 slides from previously unseen blocks, VIN produced coherent margin overlays that qualitatively aligned with expert annotations across serial sections. Quantitatively, region-level accuracy was ~73.3% across the test set, with errors largely confined to limited areas that did not disrupt continuity of the whole-slide margin map. These results indicate that VIN captures cautery-related histomorphology and can provide a reproducible, ink-free margin delineation suitable for integration into routine digital pathology workflows and for downstream measurement of margin distances.

Title: EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

Authors: Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang, Kun Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22135
Pdf URL: https://arxiv.org/pdf/2511.22135
Copy Paste: [[2511.22135]] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation(https://arxiv.org/abs/2511.22135)
Keywords: diffusion
Abstract: Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

Title: C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models

Authors: Kairong Han, Nuanqiao Shan, Ziyu Zhao, Zijing Hu, Xinpeng Dong, Junjian Ye, Lujia Pan, Fei Wu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22146
Pdf URL: https://arxiv.org/pdf/2511.22146
Copy Paste: [[2511.22146]] C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models(https://arxiv.org/abs/2511.22146)
Keywords: diffusion
Abstract: Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a \underline{\textbf{C}}ausal \underline{\textbf{C}}oncept-Guided \underline{\textbf{D}}iffusion \underline{\textbf{L}}anguage \underline{\textbf{M}}odel (C$^2$DLM). Starting from DLM's fully connected attention, C$^2$DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C$^2$DLM improves 12\% with about 3.2 times training speedup in the COT-OrderPerturb task, and achieves an average gain of 1.31\% across six downstream reasoning tasks. More details in the repository ~\href{this https URL}{here}.

Title: Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

Authors: Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22169
Pdf URL: https://arxiv.org/pdf/2511.22169
Copy Paste: [[2511.22169]] Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization(https://arxiv.org/abs/2511.22169)
Keywords: foundation model
Abstract: Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast. Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios.

Title: BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch

Authors: Pu Li, Wenhao Zhang, Weize Quan, Biao Zhang, Peter Wonka, Dong-Ming Yan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.22171
Pdf URL: https://arxiv.org/pdf/2511.22171
Copy Paste: [[2511.22171]] BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch(https://arxiv.org/abs/2511.22171)
Keywords: generative
Abstract: Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive framework for B-rep generation. Our key innovation lies in the Voronoi Half-Patch (VHP) representation, which decomposes B-reps into unified local units by assigning geometry to nearest half-edges and sampling their next pointers. Unlike hierarchical representations that require multiple distinct encodings for different structural levels, our VHP representation facilitates unifying geometric attributes and topological relations in a single, coherent format. We further leverage dual VQ-VAEs to encode both vertex topology and Voronoi Half-Patches into vertex-based tokens, achieving a more compact sequential encoding. A decoder-only Transformer is then trained to autoregressively predict these tokens, which are subsequently mapped to vertex-based features and decoded into complete B-rep models. Experiments demonstrate that BrepGPT achieves state-of-the-art performance in unconditional B-rep generation. The framework also exhibits versatility in various applications, including conditional generation from category labels, point clouds, text descriptions, and images, as well as B-rep autocompletion and interpolation.

Title: Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22177
Pdf URL: https://arxiv.org/pdf/2511.22177
Copy Paste: [[2511.22177]] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage(https://arxiv.org/abs/2511.22177)
Keywords: diffusion, generative
Abstract: Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.

Title: HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving

Authors: Qiang Li, Yingwenqi Jiang, Tuoxi Li, Duyu Chen, Xiang Feng, Yucheng Ao, Shangyue Liu, Xingchen Yu, Youcheng Cai, Yumeng Liu, Yuexin Ma, Xin Hu, Li Liu, Yu Zhang, Linkun Xu, Bingtao Gao, Xueyuan Wang, Shuchang Zhou, Xianming Liu, Ligang Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22187
Pdf URL: https://arxiv.org/pdf/2511.22187
Copy Paste: [[2511.22187]] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving(https://arxiv.org/abs/2511.22187)
Keywords: generative
Abstract: Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.

Title: Controllable 3D Object Generation with Single Image Prompt

Authors: Jaeseok Lee, Jaekoo Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22194
Pdf URL: https://arxiv.org/pdf/2511.22194
Copy Paste: [[2511.22194]] Controllable 3D Object Generation with Single Image Prompt(https://arxiv.org/abs/2511.22194)
Keywords: diffusion, generative
Abstract: Recently, the impressive generative capabilities of diffusion models have been demonstrated, producing images with remarkable fidelity. Particularly, existing methods for the 3D object generation tasks, which is one of the fastest-growing segments in computer vision, pre-dominantly use text-to-image diffusion models with textual inversion which train a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and lacks control ability. To tackle this issues, we propose two innovative methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text. (2) a depth conditioned warmup strategy to enhance 3D consistency. In experimental results, ours show qualitatively and quantitatively comparable performance and improved 3D consistency to the existing text-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available at GitHub repository:this https URL

Title: PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units

Authors: Sejeong Jang, Joo Heung Yoon, Hyo Kyung Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22199
Pdf URL: https://arxiv.org/pdf/2511.22199
Copy Paste: [[2511.22199]] PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units(https://arxiv.org/abs/2511.22199)
Keywords: self-supervised, foundation model
Abstract: Intensive care unit (ICU) data are highly irregular, heterogeneous, and temporally fragmented, posing challenges for generalizable clinical prediction. We present PULSE-ICU, a self-supervised foundation model that learns event-level ICU representations from large-scale EHR sequences without resampling or manual feature engineering. A unified embedding module encodes event identity, continuous values, units, and temporal attributes, while a Longformer-based encoder enables efficient modeling of long trajectories. PULSE-ICU was fine-tuned across 18 prediction tasks, including mortality, intervention forecasting, and phenotype identification, achieving strong performance across task types. External validation on eICU, HiRID, and P12 showed substantial improvements with minimal fine-tuning, demonstrating robustness to domain shift and variable constraints. These findings suggest that foundation-style modeling can improve data efficiency and adaptability, providing a scalable framework for ICU decision support across diverse clinical environments.

Title: 3D-Consistent Multi-View Editing by Diffusion Guidance

Authors: Josef Bengtson, David Nilsson, Dong In Lee, Fredrik Kahl
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22228
Pdf URL: https://arxiv.org/pdf/2511.22228
Copy Paste: [[2511.22228]] 3D-Consistent Multi-View Editing by Diffusion Guidance(https://arxiv.org/abs/2511.22228)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: this https URL

Title: Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice

Authors: Simon Püttmann, Jonathan Jair Sànchez Contreras, Lennart Kowitz, Peter Lampen, Saumya Gupta, Davide Panzeri, Nina Hagemann, Qiaojie Xiong, Dirk M. Hermann, Cao Chen, Jianxu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22236
Pdf URL: https://arxiv.org/pdf/2511.22236
Copy Paste: [[2511.22236]] Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice(https://arxiv.org/abs/2511.22236)
Keywords: foundation model
Abstract: Accurate 3D microscopy image segmentation is critical for quantitative bioimage analysis but even state-of-the-art foundation models yield error-prone results. Therefore, manual curation is still widely used for either preparing high-quality training data or fixing errors before analysis. We present VessQC, an open-source tool for uncertainty-guided curation of large 3D microscopy segmentations. By integrating uncertainty maps, VessQC directs user attention to regions most likely containing biologically meaningful errors. In a preliminary user study uncertainty-guided correction significantly improved error detection recall from 67% to 94.0% (p=0.007) without a significant increase in total curation time. VessQC thus enables efficient, human-in-the-loop refinement of volumetric segmentations and bridges a key gap in real-world applications between uncertainty estimation and practical human-computer interaction. The software is freely available at this http URL.

Title: TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning

Authors: Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, Dylan Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22242
Pdf URL: https://arxiv.org/pdf/2511.22242
Copy Paste: [[2511.22242]] TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning(https://arxiv.org/abs/2511.22242)
Keywords: diffusion
Abstract: A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates are often inconsistent with those predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noise images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16\% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: this https URL.

Title: Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models

Authors: Seoyun Yang, Gihoon Kim, Taesup Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22245
Pdf URL: https://arxiv.org/pdf/2511.22245
Copy Paste: [[2511.22245]] Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models(https://arxiv.org/abs/2511.22245)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved remarkable progress in generating diverse and realistic images from textual descriptions. However, they still struggle with personalization, which requires adapting a pretrained model to depict user-specific subjects from only a few reference images. The key challenge lies in learning a new visual concept from a limited number of reference images while preserving the pretrained semantic prior that maintains text-image alignment. When the model focuses on subject fidelity, it tends to overfit the limited reference images and fails to leverage the pretrained distribution. Conversely, emphasizing prior preservation maintains semantic consistency but prevents the model from learning new personalized attributes. Building on these observations, we propose the personalization process through a semantic anchoring that guides adaptation by grounding new concepts in their corresponding distributions. We therefore reformulate personalization as the process of learning a rare concept guided by its frequent counterpart through semantic anchoring. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions while preserving its semantic structure. As a result, the proposed method achieves stable adaptation and consistent improvements in both subject fidelity and text-image alignment compared to baseline methods. Extensive experiments and ablation studies further demonstrate the robustness and effectiveness of the proposed anchoring strategy.

Title: Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Authors: Bolin Lai, Xudong Wang, Saketh Rambhatla, James M. Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22249
Pdf URL: https://arxiv.org/pdf/2511.22249
Copy Paste: [[2511.22249]] Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective(https://arxiv.org/abs/2511.22249)
Keywords: diffusion
Abstract: Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency encoding and decoding. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.

Title: UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation

Authors: Dengbo Chen, Ziwei Zhao, Kexin Zhang, Shishuang Zhao, Junjie Hou, Yaqian Wang, Nianxi Liao, Anlan Sun, Fei Gao, Jia Ding, Yuhang Liu, Dong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22256
Pdf URL: https://arxiv.org/pdf/2511.22256
Copy Paste: [[2511.22256]] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation(https://arxiv.org/abs/2511.22256)
Keywords: foundation model
Abstract: Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.

Title: Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques

Authors: Guifeng Wang, Yuanfeng Song, Meng Yang, Tao Zhu, Xiaoming Yin, Xing Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22258
Pdf URL: https://arxiv.org/pdf/2511.22258
Copy Paste: [[2511.22258]] Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques(https://arxiv.org/abs/2511.22258)
Keywords: generative
Abstract: Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during the RL training process, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.

Title: Structure is Supervision: Multiview Masked Autoencoders for Radiology

Authors: Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22294
Pdf URL: https://arxiv.org/pdf/2511.22294
Copy Paste: [[2511.22294]] Structure is Supervision: Multiview Masked Autoencoders for Radiology(https://arxiv.org/abs/2511.22294)
Keywords: self-supervised, foundation model
Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.

Title: Token-Level Marginalization for Multi-Label LLM Classifiers

Authors: Anjaneya Praharaj, Jaykumar Kasundra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22312
Pdf URL: https://arxiv.org/pdf/2511.22312
Copy Paste: [[2511.22312]] Token-Level Marginalization for Multi-Label LLM Classifiers(https://arxiv.org/abs/2511.22312)
Keywords: generative
Abstract: This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.

Title: Prompt-based Consistent Video Colorization

Authors: Silvia Dani, Tiberio Uricchio, Lorenzo Seidenari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22330
Pdf URL: https://arxiv.org/pdf/2511.22330
Copy Paste: [[2511.22330]] Prompt-based Consistent Video Colorization(https://arxiv.org/abs/2511.22330)
Keywords: diffusion
Abstract: Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.

Title: Test Time Training for AC Power Flow Surrogates via Physics and Operational Constraint Refinement

Authors: Panteleimon Dogoulis, Mohammad Iman Alizadeh, Sylvain Kubler, Maxime Cordy
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2511.22343
Pdf URL: https://arxiv.org/pdf/2511.22343
Copy Paste: [[2511.22343]] Test Time Training for AC Power Flow Surrogates via Physics and Operational Constraint Refinement(https://arxiv.org/abs/2511.22343)
Keywords: self-supervised
Abstract: Power Flow (PF) calculation based on machine learning (ML) techniques offer significant computational advantages over traditional numerical methods but often struggle to maintain full physical consistency. This paper introduces a physics-informed test-time training (PI-TTT) framework that enhances the accuracy and feasibility of ML-based PF surrogates by enforcing AC power flow equalities and operational constraints directly at inference time. The proposed method performs a lightweight self-supervised refinement of the surrogate outputs through few gradient-based updates, enabling local adaptation to unseen operating conditions without requiring labeled data. Extensive experiments on the IEEE 14-, 118-, and 300-bus systems and the PEGASE 1354-bus network show that PI-TTT reduces power flow residuals and operational constraint violations by one to two orders of magnitude compared with purely ML-based models, while preserving their computational advantage. The results demonstrate that PI-TTT provides fast, accurate, and physically reliable predictions, representing a promising direction for scalable and physics-consistent learning in power system analysis.

Title: Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning

Authors: Denis Huseljic, Marek Herde, Lukas Rauch, Paul Hahn, Bernhard Sick
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22344
Pdf URL: https://arxiv.org/pdf/2511.22344
Copy Paste: [[2511.22344]] Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning(https://arxiv.org/abs/2511.22344)
Keywords: foundation model
Abstract: Existing active learning (AL) strategies capture fundamentally different notions of data value, e.g., uncertainty or representativeness. Consequently, the effectiveness of strategies can vary substantially across datasets, models, and even AL cycles. Committing to a single strategy risks suboptimal performance, as no single strategy dominates throughout the entire AL process. We introduce REFINE, an ensemble AL method that combines multiple strategies without knowing in advance which will perform best. In each AL cycle, REFINE operates in two stages: (1) Progressive filtering iteratively refines the unlabeled pool by considering an ensemble of AL strategies, retaining promising candidates capturing different notions of value. (2) Coverage-based selection then chooses a final batch from this refined pool, ensuring all previously identified notions of value are accounted for. Extensive experiments across 6 classification datasets and 3 foundation models show that REFINE consistently outperforms individual strategies and existing ensemble methods. Notably, progressive filtering serves as a powerful preprocessing step that improves the performance of any individual AL strategy applied to the refined pool, which we demonstrate on an audio spectrogram classification use case. Finally, the ensemble of REFINE can be easily extended with upcoming state-of-the-art AL strategies.

Title: Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22345
Pdf URL: https://arxiv.org/pdf/2511.22345
Copy Paste: [[2511.22345]] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment(https://arxiv.org/abs/2511.22345)
Keywords: foundation model, generative
Abstract: Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3$\times$, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64$\times$64 and 256$\times$256. Our code is available at this https URL.

Title: INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

Authors: Anshul Bagaria
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22351
Pdf URL: https://arxiv.org/pdf/2511.22351
Copy Paste: [[2511.22351]] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts(https://arxiv.org/abs/2511.22351)
Keywords: diffusion, generative
Abstract: The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.

Title: AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

Authors: Zhenglin Zhou, Fan Ma, Chengzhuo Gui, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22357
Pdf URL: https://arxiv.org/pdf/2511.22357
Copy Paste: [[2511.22357]] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows(https://arxiv.org/abs/2511.22357)
Keywords: diffusion
Abstract: Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Code is at this https URL.

Title: TS2Vec-Ensemble: An Enhanced Self-Supervised Framework for Time Series Forecasting

Authors: Ganeshan Niroshan, Uthayasanker Thayasivam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22395
Pdf URL: https://arxiv.org/pdf/2511.22395
Copy Paste: [[2511.22395]] TS2Vec-Ensemble: An Enhanced Self-Supervised Framework for Time Series Forecasting(https://arxiv.org/abs/2511.22395)
Keywords: self-supervised
Abstract: Self-supervised representation learning, particularly through contrastive methods like TS2Vec, has advanced the analysis of time series data. However, these models often falter in forecasting tasks because their objective functions prioritize instance discrimination over capturing the deterministic patterns, such as seasonality and trend, that are critical for accurate prediction. This paper introduces TS2Vec-Ensemble, a novel hybrid framework designed to bridge this gap. Our approach enhances the powerful, implicitly learned dynamics from a pretrained TS2Vec encoder by fusing them with explicit, engineered time features that encode periodic cycles. This fusion is achieved through a dual-model ensemble architecture, where two distinct regression heads -- one focused on learned dynamics and the other on seasonal patterns -- are combined using an adaptive weighting scheme. The ensemble weights are optimized independently for each forecast horizon, allowing the model to dynamically prioritize short-term dynamics or long-term seasonality as needed. We conduct extensive experiments on the ETT benchmark datasets for both univariate and multivariate forecasting. The results demonstrate that TS2Vec-Ensemble consistently and significantly outperforms the standard TS2Vec baseline and other state-of-the-art models, validating our hypothesis that a hybrid of learned representations and explicit temporal priors is a superior strategy for long-horizon time series forecasting.

Title: DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention

Authors: Furkan Guzelant, Arda Goktogan, Tarık Kaya, Aysegul Dundar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22411
Pdf URL: https://arxiv.org/pdf/2511.22411
Copy Paste: [[2511.22411]] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention(https://arxiv.org/abs/2511.22411)
Keywords: diffusion
Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperaturebased key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.

Title: Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Authors: Minghao Yin, Yukang Cao, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22425
Pdf URL: https://arxiv.org/pdf/2511.22425
Copy Paste: [[2511.22425]] Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models(https://arxiv.org/abs/2511.22425)
Keywords: generative
Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

Title: ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection

Authors: Runzhi Deng, Yundi Hu, Xinshuang Zhang, Zhao Wang, Xixi Liu, Wang-Zhou Dai, Caifeng Shan, Fang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22436
Pdf URL: https://arxiv.org/pdf/2511.22436
Copy Paste: [[2511.22436]] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection(https://arxiv.org/abs/2511.22436)
Keywords: anomaly
Abstract: Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.

Title: Beyond Real versus Fake Towards Intent-Aware Video Analysis

Authors: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22455
Pdf URL: https://arxiv.org/pdf/2511.22455
Copy Paste: [[2511.22455]] Beyond Real versus Fake Towards Intent-Aware Video Analysis(https://arxiv.org/abs/2511.22455)
Keywords: self-supervised, generative
Abstract: The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

Title: ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models

Authors: Zhenglin Zhou, Fan Ma, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22456
Pdf URL: https://arxiv.org/pdf/2511.22456
Copy Paste: [[2511.22456]] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models(https://arxiv.org/abs/2511.22456)
Keywords: diffusion, generative
Abstract: We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at this https URL.

Title: Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval

Authors: Tien-Huy Nguyen, Huu-Loc Tran, Huu-Phong Phan-Nguyen, Quang-Vinh Dinh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22470
Pdf URL: https://arxiv.org/pdf/2511.22470
Copy Paste: [[2511.22470]] Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval(https://arxiv.org/abs/2511.22470)
Keywords: anomaly
Abstract: Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: How can the model be optimized to achieve greater fine-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of incorporating both fine-grained features alongside coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective loss functions, including Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) loss. Beyond this, we propose a novel iterative ensemble strategy, by combining iteratively instead of using model results simultaneously like other ensemble methods. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm based on its guidance, which helps improve the model's performance. Extensive experiments demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on PAB dataset, compared with previous work, with a 9.70\% improvement in R@1, 1.77\% improvement in R@5, and 1.01\% improvement in R@10.

Title: Rethinking Cross-Generator Image Forgery Detection through DINOv3

Authors: Zhenglin Huang, Jason Li, Haiquan Wen, Tianxiao Li, Xi Yang, Lu Qi, Bei Peng, Xiaowei Huang, Ming-Hsuan Yang, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22471
Pdf URL: https://arxiv.org/pdf/2511.22471
Copy Paste: [[2511.22471]] Rethinking Cross-Generator Image Forgery Detection through DINOv3(https://arxiv.org/abs/2511.22471)
Keywords: foundation model, generative
Abstract: As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.

Title: Adversarial Flow Models

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22475
Pdf URL: https://arxiv.org/pdf/2511.22475
Copy Paste: [[2511.22475]] Adversarial Flow Models(https://arxiv.org/abs/2511.22475)
Keywords: generative
Abstract: We present adversarial flow models, a class of generative models that unifies adversarial models and flow models. Our method supports native one-step or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without needing to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 2NFE and 4NFE counterparts.

Title: AI killed the video star. Audio-driven diffusion model for expressive talking head generation

Authors: Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22488
Pdf URL: https://arxiv.org/pdf/2511.22488
Copy Paste: [[2511.22488]] AI killed the video star. Audio-driven diffusion model for expressive talking head generation(https://arxiv.org/abs/2511.22488)
Keywords: diffusion
Abstract: We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

Title: What Shape Is Optimal for Masks in Text Removal?

Authors: Hyakka Nakada, Marika Kubota
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22499
Pdf URL: https://arxiv.org/pdf/2511.22499
Copy Paste: [[2511.22499]] What Shape Is Optimal for Masks in Text Removal?(https://arxiv.org/abs/2511.22499)
Keywords: generative
Abstract: The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, by removing specific text from document images, reconstructing original images is extremely important for industrial applications. However, most existing methods of text removal focus on deleting simple scene text which appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images including a large amount of text. From the data, we found that text-removal performance becomes vulnerable against mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.

Title: Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Authors: Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22533
Pdf URL: https://arxiv.org/pdf/2511.22533
Copy Paste: [[2511.22533]] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration(https://arxiv.org/abs/2511.22533)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).

Title: Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior

Authors: Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li, Xin Jin, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22549
Pdf URL: https://arxiv.org/pdf/2511.22549
Copy Paste: [[2511.22549]] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior(https://arxiv.org/abs/2511.22549)
Keywords: diffusion, generative
Abstract: Image compression methods are usually optimized isolatedly for human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework aiming for harmonizing machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model's generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH's superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: this https URL.

Title: Bringing Your Portrait to 3D Presence

Authors: Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22553
Pdf URL: https://arxiv.org/pdf/2511.22553
Copy Paste: [[2511.22553]] Bringing Your Portrait to 3D Presence(https://arxiv.org/abs/2511.22553)
Keywords: generative
Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Title: Text Condition Embedded Regression Network for Automated Dental Abutment Design

Authors: Mianjie Zheng, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuefen Liu, Kun Tang, He Meng, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22578
Pdf URL: https://arxiv.org/pdf/2511.22578
Copy Paste: [[2511.22578]] Text Condition Embedded Regression Network for Automated Dental Abutment Design(https://arxiv.org/abs/2511.22578)
Keywords: self-supervised
Abstract: The abutment is an important part of artificial dental implants, whose design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can quickly improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), the novel automated abutment design solution available in literature. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model's feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we designed a TGL module, which introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validated the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.

Title: AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection

Authors: Dayou Huang, Feng Xue, Xurui Li, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22595
Pdf URL: https://arxiv.org/pdf/2511.22595
Copy Paste: [[2511.22595]] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection(https://arxiv.org/abs/2511.22595)
Keywords: anomaly
Abstract: Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps as vision transformers (ViTs) extract patch-level features only. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps exactly provide complementary spatial cues that are largely absent from ZSAD's image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2\% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at this https URL.

Title: REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22625
Pdf URL: https://arxiv.org/pdf/2511.22625
Copy Paste: [[2511.22625]] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models(https://arxiv.org/abs/2511.22625)
Keywords: diffusion
Abstract: Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).

Title: Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning

Authors: Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh, Zebang Shen, Niao He, Andreas Krause
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22640
Pdf URL: https://arxiv.org/pdf/2511.22640
Copy Paste: [[2511.22640]] Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning(https://arxiv.org/abs/2511.22640)
Keywords: diffusion, generative
Abstract: Adapting large-scale foundation flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.

Title: Modèles de Fondation et Ajustement : Vers une Nouvelle Génération de Modèles pour la Prévision des Séries Temporelles

Authors: Morad Laglil, Emilie Devijver, Eric Gaussier, Bertrand Pracca
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22674
Pdf URL: https://arxiv.org/pdf/2511.22674
Copy Paste: [[2511.22674]] Modèles de Fondation et Ajustement : Vers une Nouvelle Génération de Modèles pour la Prévision des Séries Temporelles(https://arxiv.org/abs/2511.22674)
Keywords: foundation model
Abstract: Inspired by recent advances in large language models, foundation models have been developed for zero-shot time series forecasting, enabling prediction on datasets unseen during pretraining. These large-scale models, trained on vast collections of time series, learn generalizable representations for both point and probabilistic forecasting, reducing the need for task-specific architectures and manual tuning. In this work, we review the main architectures, pretraining strategies, and optimization methods used in such models, and study the effect of fine-tuning after pretraining to enhance their performance on specific datasets. Our empirical results show that fine-tuning generally improves zero-shot forecasting capabilities, especially for long-term horizons.

Title: Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

Authors: Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22677
Pdf URL: https://arxiv.org/pdf/2511.22677
Copy Paste: [[2511.22677]] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield(https://arxiv.org/abs/2511.22677)
Keywords: diffusion
Abstract: Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core ``engine'' of distillation, while the Distribution Matching (DM) term functions as a ``regularizer'' that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image ( this https URL ) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.

Title: Emergent Extreme-View Geometry in 3D Foundation Models

Authors: Yiwen Zhang, Joseph Tung, Ruojin Cai, David Fouhey, Hadar Averbuch-Elor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22686
Pdf URL: https://arxiv.org/pdf/2511.22686
Copy Paste: [[2511.22686]] Emergent Extreme-View Geometry in 3D Foundation Models(https://arxiv.org/abs/2511.22686)
Keywords: foundation model
Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.

Title: Test-time scaling of diffusions with flow maps

Authors: Amirmojtaba Sabour, Michael S. Albergo, Carles Domingo-Enrich, Nicholas M. Boffi, Sanja Fidler, Karsten Kreis, Eric Vanden-Eijnden
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22688
Pdf URL: https://arxiv.org/pdf/2511.22688
Copy Paste: [[2511.22688]] Test-time scaling of diffusions with flow maps(https://arxiv.org/abs/2511.22688)
Keywords: diffusion
Abstract: A common recipe to improve diffusion models at test-time so that samples score highly against a user-specified reward is to introduce the gradient of the reward into the dynamics of the diffusion itself. This procedure is often ill posed, as user-specified rewards are usually only well defined on the data distribution at the end of generation. While common workarounds to this problem are to use a denoiser to estimate what a sample would have been at the end of generation, we propose a simple solution to this problem by working directly with a flow map. By exploiting a relationship between the flow map and velocity field governing the instantaneous transport, we construct an algorithm, Flow Map Trajectory Tilting (FMTT), which provably performs better ascent on the reward than standard test-time methods involving the gradient of the reward. The approach can be used to either perform exact sampling via importance weighting or principled search that identifies local maximizers of the reward-tilted distribution. We demonstrate the efficacy of our approach against other look-ahead techniques, and show how the flow map enables engagement with complicated reward functions that make possible new forms of image editing, e.g. by interfacing with vision language models.

Title: Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Authors: Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22690
Pdf URL: https://arxiv.org/pdf/2511.22690
Copy Paste: [[2511.22690]] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation(https://arxiv.org/abs/2511.22690)
Keywords: diffusion
Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.

Title: Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra

Authors: Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Glenn Van Wallendael
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22693
Pdf URL: https://arxiv.org/pdf/2511.22693
Copy Paste: [[2511.22693]] Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra(https://arxiv.org/abs/2511.22693)
Keywords: generative
Abstract: We present Generative Anchored Fields (GAF), a generative model that learns independent endpoint predictors $J$ (noise) and $K$ (data) rather than a trajectory predictor. The velocity field $v=K-J$ emerges from their time-conditioned disagreement. This factorization enables \textit{Transport Algebra}: algebraic operation on learned $\{(J_n,K_n)\}_{n=1}^N$ heads for compositional control. With class-specific $K_n$ heads, GAF supports a rich family of directed transport maps between a shared base distribution and multiple modalities, enabling controllable interpolation, hybrid generation, and semantic morphing through vector arithmetic. We achieve strong sample quality (FID 7.5 on CelebA-HQ $64\times 64$) while uniquely providing compositional generation as an architectural primitive. We further demonstrate, GAF has lossless cyclic transport between its initial and final state with LPIPS=$0.0$. Code available at this https URL

Title: Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Authors: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22699
Pdf URL: https://arxiv.org/pdf/2511.22699
Copy Paste: [[2511.22699]] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer(https://arxiv.org/abs/2511.22699)
Keywords: diffusion, foundation model, generative
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

Title: Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction

Authors: Boyao Zhou, Shunyuan Zheng, Zhanfeng Liao, Zihan Ma, Hanzhang Tu, Boning Liu, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22704
Pdf URL: https://arxiv.org/pdf/2511.22704
Copy Paste: [[2511.22704]] Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction(https://arxiv.org/abs/2511.22704)
Keywords: self-supervised
Abstract: We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown its promising potential in rendering tasks, but it typically necessitates per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, such approaches still require largely overlapped input views to establish the geometry prior. To bridge this gap, we leverage pixel-wise point map reconstruction to represent geometry which is robust to large sparsity for its independent view modeling. In general, we propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates camera control in the following. In stage 2, we project point maps of two input views onto the target view plane and refine such geometry via stereo matching. Furthermore, we anchor Gaussian primitives on this refined plane in order to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision and stage 2 is supervised with photo-metric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.

Title: MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

Authors: Jorge Alberto Garza-Abdala, Gerardo A. Fumagal-González, Daly Avendano, Servando Cardona, Sadam Hussain, Eduardo de Avila-Armenta, Jasiel H. Toscano-Martínez, Diana S. M. Rosales Gurmendi, Alma A. Pedro-Pérez, Jose Gerardo Tamez-Pena
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22759
Pdf URL: https://arxiv.org/pdf/2511.22759
Copy Paste: [[2511.22759]] MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models(https://arxiv.org/abs/2511.22759)
Keywords: diffusion
Abstract: Purpose: This study aims to develop and evaluate a three channel denoising diffusion probabilistic model (DDPM) for synthesizing single breast dual view mammograms and to assess the impact of channel representations on image fidelity and cross view consistency. Materials and Methods: A pretrained three channel DDPM, sourced from Hugging Face, was fine tuned on a private dataset of 11020 screening mammograms to generate paired craniocaudal (CC) and mediolateral oblique (MLO) views. Three third channel encodings of the CC and MLO views were evaluated: sum, absolute difference, and zero channel. Each model produced 500 synthetic image pairs. Quantitative assessment involved breast mask segmentation using Intersection over Union (IoU) and Dice Similarity Coefficient (DSC), with distributional comparisons against 2500 real pairs using Earth Movers Distance (EMD) and Kolmogorov Smirnov (KS) tests. Qualitative evaluation included a visual Turing test by a non expert radiologist to assess cross view consistency and artifacts. Results: Synthetic mammograms showed IoU and DSC distributions comparable to real images, with EMD and KS values (0.020 and 0.077 respectively). Models using sum or absolute difference encodings outperformed others in IoU and DSC (p < 0.001), though distributions remained broadly similar. Generated CC and MLO views maintained cross view consistency, with 6 to 8 percent of synthetic images exhibiting artifacts consistent with those in the training data. Conclusion: Three channel DDPMs can generate realistic and anatomically consistent dual view mammograms with promising applications in dataset augmentation.

Title: Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data

Authors: Mahdieh Behjat Khatooni, Mohsen Soryani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22774
Pdf URL: https://arxiv.org/pdf/2511.22774
Copy Paste: [[2511.22774]] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data(https://arxiv.org/abs/2511.22774)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with some other non-image biomarkers, predicting each subject's cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05\% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer's disease.

Title: World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Authors: Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22787
Pdf URL: https://arxiv.org/pdf/2511.22787
Copy Paste: [[2511.22787]] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models(https://arxiv.org/abs/2511.22787)
Keywords: diffusion
Abstract: In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find supervised fine-tuning using a diverse culture mixing dataset substantially improve model consistency and reduce background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.

Title: LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Authors: Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22812
Pdf URL: https://arxiv.org/pdf/2511.22812
Copy Paste: [[2511.22812]] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer(https://arxiv.org/abs/2511.22812)
Keywords: diffusion, generative
Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID)-Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River-DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen' s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT' s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.

Title: TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE

Authors: Jiawen Wei, Lan Jiang, Pengbo Wei, Ziwen Ye, Teng Song, Chen Chen, Guangrui Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22853
Pdf URL: https://arxiv.org/pdf/2511.22853
Copy Paste: [[2511.22853]] TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE(https://arxiv.org/abs/2511.22853)
Keywords: generative
Abstract: Time series data is ubiquitous, with forecasting applications spanning from finance to healthcare. Beyond popular deterministic methods, generative models are gaining attention due to advancements in areas like image synthesis and video generation, as well as their inherent ability to provide probabilistic predictions. However, existing generative approaches mostly involve recurrent generative operations or repeated denoising steps, making the prediction laborious, particularly for long-term forecasting. Most of them only conduct experiments for relatively short-term forecasting, with limited comparison to deterministic methods in long-term forecasting, leaving their practical advantages unclear. This paper presents TARFVAE, a novel generative framework that combines the Transformer-based autoregressive flow (TARFLOW) and variational autoencoder (VAE) for efficient one-step generative time series forecasting. Inspired by the rethinking that complex architectures for extracting time series representations might not be necessary, we add a flow module, TARFLOW, to VAE to promote spontaneous learning of latent variables that benefit predictions. TARFLOW enhances VAE's posterior estimation by breaking the Gaussian assumption, thereby enabling a more informative latent space. TARFVAE uses only the forward process of TARFLOW, avoiding autoregressive inverse operations and thus ensuring fast generation. During generation, it samples from the prior latent space and directly generates full-horizon forecasts via the VAE decoder. With simple MLP modules, TARFVAE achieves superior performance over state-of-the-art deterministic and generative models across different forecast horizons on benchmark datasets while maintaining efficient prediction speed, demonstrating its effectiveness as an efficient and powerful solution for generative time series forecasting.

Title: CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Authors: Fengyi Fang, Sicheng Yang, Wenming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22863
Pdf URL: https://arxiv.org/pdf/2511.22863
Copy Paste: [[2511.22863]] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation(https://arxiv.org/abs/2511.22863)
Keywords: diffusion
Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

Title: Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis

Authors: Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.22870
Pdf URL: https://arxiv.org/pdf/2511.22870
Copy Paste: [[2511.22870]] Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis(https://arxiv.org/abs/2511.22870)
Keywords: diffusion
Abstract: Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.

Title: DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo, Zhenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22896
Pdf URL: https://arxiv.org/pdf/2511.22896
Copy Paste: [[2511.22896]] DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking(https://arxiv.org/abs/2511.22896)
Keywords: diffusion
Abstract: Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at this https URL.

Title: From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts

Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22897
Pdf URL: https://arxiv.org/pdf/2511.22897
Copy Paste: [[2511.22897]] From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts(https://arxiv.org/abs/2511.22897)
Keywords: diffusion
Abstract: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at this https URL.

Title: EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model

Authors: Yuhao Xu, Xiaoda Wang, Jiaying Lu, Sirui Ding, Defu Cao, Huaxiu Yao, Yan Liu, Xiao Hu, Carl Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22935
Pdf URL: https://arxiv.org/pdf/2511.22935
Copy Paste: [[2511.22935]] EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model(https://arxiv.org/abs/2511.22935)
Keywords: foundation model
Abstract: Electrocardiogram (ECG) analysis plays a vital role in the early detection, monitoring, and management of various cardiovascular conditions. While existing models have achieved notable success in ECG interpretation, they fail to leverage the interrelated nature of various cardiac abnormalities. Conversely, developing a specific model capable of extracting all relevant features for multiple ECG tasks remains a significant challenge. Large-scale foundation models, though powerful, are not typically pretrained on ECG data, making full re-training or fine-tuning computationally expensive. To address these challenges, we propose EnECG(Mixture of Experts-based Ensemble Learning for ECG Multi-tasks), an ensemble-based framework that integrates multiple specialized foundation models, each excelling in different aspects of ECG interpretation. Instead of relying on a single model or single task, EnECG leverages the strengths of multiple specialized models to tackle a variety of ECG-based tasks. To mitigate the high computational cost of full re-training or fine-tuning, we introduce a lightweight adaptation strategy: attaching dedicated output layers to each foundation model and applying Low-Rank Adaptation (LoRA) only to these newly added parameters. We then adopt a Mixture of Experts (MoE) mechanism to learn ensemble weights, effectively combining the complementary expertise of individual models. Our experimental results demonstrate that by minimizing the scope of fine-tuning, EnECG can help reduce computational and memory costs while maintaining the strong representational power of foundation models. This framework not only enhances feature extraction and predictive performance but also ensures practical efficiency for real-world clinical applications. The code is available at this https URL.

Title: One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfe

Authors: Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22940
Pdf URL: https://arxiv.org/pdf/2511.22940
Copy Paste: [[2511.22940]] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfe(https://arxiv.org/abs/2511.22940)
Keywords: diffusion, self-supervised
Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at this https URL.

Title: Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation

Authors: Taeyeong Kim, SeungJoon Lee, Jung Uk Kim, MyeongAh Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22948
Pdf URL: https://arxiv.org/pdf/2511.22948
Copy Paste: [[2511.22948]] Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation(https://arxiv.org/abs/2511.22948)
Keywords: diffusion
Abstract: Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that captures boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at this https URL.

Title: RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

Authors: Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22950
Pdf URL: https://arxiv.org/pdf/2511.22950
Copy Paste: [[2511.22950]] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video(https://arxiv.org/abs/2511.22950)
Keywords: foundation model
Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

Title: BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22973
Pdf URL: https://arxiv.org/pdf/2511.22973
Copy Paste: [[2511.22973]] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation(https://arxiv.org/abs/2511.22973)
Keywords: diffusion
Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it yet faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the state of the art approaches. Project website: this https URL. Inferix (Code): this https URL.

Title: McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

Authors: Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu, Miaomiao Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22974
Pdf URL: https://arxiv.org/pdf/2511.22974
Copy Paste: [[2511.22974]] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning(https://arxiv.org/abs/2511.22974)
Keywords: generative
Abstract: Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lacks the understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potential conflict dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high-motion dynamic.

Title: Ovis-Image Technical Report

Authors: Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22982
Pdf URL: https://arxiv.org/pdf/2511.22982
Copy Paste: [[2511.22982]] Ovis-Image Technical Report(https://arxiv.org/abs/2511.22982)
Keywords: diffusion
Abstract: We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.

Title: Guiding Visual Autoregressive Models through Spectrum Weakening

Authors: Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22991
Pdf URL: https://arxiv.org/pdf/2511.22991
Copy Paste: [[2511.22991]] Guiding Visual Autoregressive Models through Spectrum Weakening(https://arxiv.org/abs/2511.22991)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensures numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.

Title: Masked Diffusion for Generative Recommendation

Authors: Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2511.23021
Pdf URL: https://arxiv.org/pdf/2511.23021
Copy Paste: [[2511.23021]] Masked Diffusion for Generative Recommendation(https://arxiv.org/abs/2511.23021)
Keywords: diffusion, generative
Abstract: Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing GR with SIDs works frame the probability of a sequence of SIDs corresponding to a user's interaction history using autoregressive modeling. While this has led to impressive next item prediction performances in certain settings, these autoregressive GR with SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data and bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.

Title: GOATex: Geometry & Occlusion-Aware Texturing

Authors: Hyunjin Kim, Kunho Kim, Adam Lee, Wonkwang Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23051
Pdf URL: https://arxiv.org/pdf/2511.23051
Copy Paste: [[2511.23051]] GOATex: Geometry & Occlusion-Aware Texturing(https://arxiv.org/abs/2511.23051)
Keywords: diffusion
Abstract: We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture's contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: this https URL.

Title: Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

Authors: Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23066
Pdf URL: https://arxiv.org/pdf/2511.23066
Copy Paste: [[2511.23066]] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation(https://arxiv.org/abs/2511.23066)
Keywords: foundation model, generative
Abstract: Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.

Title: NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing

Authors: Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23105
Pdf URL: https://arxiv.org/pdf/2511.23105
Copy Paste: [[2511.23105]] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing(https://arxiv.org/abs/2511.23105)
Keywords: diffusion
Abstract: Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.

Title: db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Authors: Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23113
Pdf URL: https://arxiv.org/pdf/2511.23113
Copy Paste: [[2511.23113]] db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism(https://arxiv.org/abs/2511.23113)
Keywords: diffusion, generative
Abstract: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at this https URL.

Title: Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM

Authors: Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23119
Pdf URL: https://arxiv.org/pdf/2511.23119
Copy Paste: [[2511.23119]] Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM(https://arxiv.org/abs/2511.23119)
Keywords: generative
Abstract: Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Using well-pre-trained decoder-only generative language models offers excellent document comprehension capabilities, thereby effectively enhancing parsing quality. However, it remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces input token count to 22\% compared to raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining an ROUGE-N F1 score of 81.58\%( 83.13\% with fall-back strategy) on our proposed WebMainBench dataset.

Title: Freeze, Diffuse, Decode: Geometry-Aware Adaptation of Pretrained Transformer Embeddings for Antimicrobial Peptide Design

Authors: Pankhil Gawade, Adam Izdebski, Myriam Lizotte, Kevin R. Moon, Jake S. Rhodes, Guy Wolf, Ewa Szczurek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23120
Pdf URL: https://arxiv.org/pdf/2511.23120
Copy Paste: [[2511.23120]] Freeze, Diffuse, Decode: Geometry-Aware Adaptation of Pretrained Transformer Embeddings for Antimicrobial Peptide Design(https://arxiv.org/abs/2511.23120)
Keywords: diffusion
Abstract: Pretrained transformers provide rich, general-purpose embeddings, which are transferred to downstream tasks. However, current transfer strategies: fine-tuning and probing, either distort the pretrained geometric structure of the embeddings or lack sufficient expressivity to capture task-relevant signals. These issues become even more pronounced when supervised data are scarce. Here, we introduce Freeze, Diffuse, Decode (FDD), a novel diffusion-based framework that adapts pre-trained embeddings to downstream tasks while preserving their underlying geometric structure. FDD propagates supervised signal along the intrinsic manifold of frozen embeddings, enabling a geometry-aware adaptation of the embedding space. Applied to antimicrobial peptide design, FDD yields low-dimensional, predictive, and interpretable representations that support property prediction, retrieval, and latent-space interpolation.

Title: DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

Authors: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23127
Pdf URL: https://arxiv.org/pdf/2511.23127
Copy Paste: [[2511.23127]] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation(https://arxiv.org/abs/2511.23127)
Keywords: diffusion
Abstract: This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40\% reduction in camera motion errors compared with prior methods. Our project page: this https URL\-page/

Title: InstanceV: Instance-Level Video Generation

Authors: Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23146
Pdf URL: https://arxiv.org/pdf/2511.23146
Copy Paste: [[2511.23146]] InstanceV: Instance-Level Video Generation(https://arxiv.org/abs/2511.23146)
Keywords: diffusion
Abstract: Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, We introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation on instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.

Title: REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection

Authors: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23158
Pdf URL: https://arxiv.org/pdf/2511.23158
Copy Paste: [[2511.23158]] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection(https://arxiv.org/abs/2511.23158)
Keywords: generative
Abstract: With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce \textbf{REVEAL-Bench}, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models, then records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose \textbf{REVEAL} (\underline{R}easoning-\underline{e}nhanced Forensic E\underline{v}id\underline{e}nce \underline{A}na\underline{l}ysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.

Title: Fast Multi-view Consistent 3D Editing with Video Priors

Authors: Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23172
Pdf URL: https://arxiv.org/pdf/2511.23172
Copy Paste: [[2511.23172]] Fast Multi-view Consistent 3D Editing with Video Priors(https://arxiv.org/abs/2511.23172)
Keywords: generative
Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.

Title: Vision Bridge Transformer at Scale

Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23199
Pdf URL: https://arxiv.org/pdf/2511.23199
Copy Paste: [[2511.23199]] Vision Bridge Transformer at Scale(https://arxiv.org/abs/2511.23199)
Keywords: diffusion
Abstract: We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.

Title: Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings

Authors: Christian Grashei, Christian Brechenmacher, Rao Muhammad Umer, Jingsong Liu, Carsten Marr, Ewa Szczurek, Peter J. Schüffler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23204
Pdf URL: https://arxiv.org/pdf/2511.23204
Copy Paste: [[2511.23204]] Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings(https://arxiv.org/abs/2511.23204)
Keywords: foundation model
Abstract: Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86-92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.

Title: Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models

Authors: Praveen Gatla, Anushka, Nikita Kanwar, Gouri Sahoo, Rajesh Kumar Mundotiya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23235
Pdf URL: https://arxiv.org/pdf/2511.23235
Copy Paste: [[2511.23235]] Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models(https://arxiv.org/abs/2511.23235)
Keywords: foundation model
Abstract: This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on the Varanasi-a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains-Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple and Travel, the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. In this paper, a dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging foundation models-BERT and RoBERTa, fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA), to optimize parameter efficiency and task performance. Multiple variants of BERT, including pre-trained languages (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource domain-specific QA. Evaluation metrics - F1, BLEU, and ROUGE-L - highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3\% F1) while reducing trainable parameters by 98\% compared to SFT, striking a balance between efficiency and accuracy. Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LORA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.

Title: Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods

Authors: Jose Moises Araya-Martinez, Adrián Sanchis Reig, Gautham Mohan, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23241
Pdf URL: https://arxiv.org/pdf/2511.23241
Copy Paste: [[2511.23241]] Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods(https://arxiv.org/abs/2511.23241)
Keywords: diffusion, generative
Abstract: Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation at no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.

Title: Time Series Forecasting via Direct Per-Step Probability Distribution Modeling

Authors: Linghao Kong, Xiaopeng Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23260
Pdf URL: https://arxiv.org/pdf/2511.23260
Copy Paste: [[2511.23260]] Time Series Forecasting via Direct Per-Step Probability Distribution Modeling(https://arxiv.org/abs/2511.23260)
Keywords: self-supervised
Abstract: Deep neural network-based time series prediction models have recently demonstrated superior capabilities in capturing complex temporal dependencies. However, it is challenging for these models to account for uncertainty associated with their predictions, because they directly output scalar values at each time step. To address such a challenge, we propose a novel model named interleaved dual-branch Probability Distribution Network (interPDN), which directly constructs discrete probability distributions per step instead of a scalar. The regression output at each time step is derived by computing the expectation of the predictive distribution on a predefined support set. To mitigate prediction anomalies, a dual-branch architecture is introduced with interleaved support sets, augmented by coarse temporal-scale branches for long-term trend forecasting. Outputs from another branch are treated as auxiliary signals to impose self-supervised consistency constraints on the current branch's prediction. Extensive experiments on multiple real-world datasets demonstrate the superior performance of interPDN.

Title: Beyond Curve Fitting: Neuro-Symbolic Agents for Context-Aware Epidemic Forecasting

Authors: Joongwon Chae, Runming Wang, Chen Xiong, Gong Yunhan, Lian Zhang, Ji Jiansong, Dongmei Yu, Peiwu Qin
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2511.23276
Pdf URL: https://arxiv.org/pdf/2511.23276
Copy Paste: [[2511.23276]] Beyond Curve Fitting: Neuro-Symbolic Agents for Context-Aware Epidemic Forecasting(https://arxiv.org/abs/2511.23276)
Keywords: foundation model
Abstract: Effective surveillance of hand, foot and mouth disease (HFMD) requires forecasts accounting for epidemiological patterns and contextual drivers like school calendars and weather. While classical models and recent foundation models (e.g., Chronos, TimesFM) incorporate covariates, they often lack the semantic reasoning to interpret the causal interplay between conflicting drivers. In this work, we propose a two-agent framework decoupling contextual interpretation from probabilistic forecasting. An LLM "event interpreter" processes heterogeneous signals-including school schedules, meteorological summaries, and reports-into a scalar transmission-impact signal. A neuro-symbolic core then combines this with historical case counts to produce calibrated probabilistic forecasts. We evaluate the framework on real-world HFMD datasets from Hong Kong (2023-2024) and Lishui, China (2024). Compared to traditional and foundation-model baselines, our approach achieves competitive point forecasting accuracy while providing robust 90% prediction intervals (coverage 0.85-1.00) and human-interpretable rationales. Our results suggest that structurally integrating domain knowledge through LLMs can match state-of-the-art performance while yielding context-aware forecasts that align with public health workflows. Code is available at this https URL .

Title: Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Authors: Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23319
Pdf URL: https://arxiv.org/pdf/2511.23319
Copy Paste: [[2511.23319]] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models(https://arxiv.org/abs/2511.23319)
Keywords: in-context
Abstract: This work explores the challenge of building ``Machines that Can Remember'', framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: \textbf{sparsity}, \textbf{random-access flexibility}, and \textbf{length generalization}. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90\% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.

Title: Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach

Authors: Shuqi Liu, Han Wu, Guanzhi Deng, Jianshu Chen, Xiaoyang Wang, Linqi Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23335
Pdf URL: https://arxiv.org/pdf/2511.23335
Copy Paste: [[2511.23335]] Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach(https://arxiv.org/abs/2511.23335)
Keywords: generative
Abstract: Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.

Title: Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories

Authors: Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Alen Mrdovic, Dimitris Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23342
Pdf URL: https://arxiv.org/pdf/2511.23342
Copy Paste: [[2511.23342]] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories(https://arxiv.org/abs/2511.23342)
Keywords: generative
Abstract: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at this https URL.

Title: Scaling HuBERT for African Languages: From Base to Large and XL

Authors: Antoine Caubrière, Elodie Gauthier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23370
Pdf URL: https://arxiv.org/pdf/2511.23370
Copy Paste: [[2511.23370]] Scaling HuBERT for African Languages: From Base to Large and XL(https://arxiv.org/abs/2511.23370)
Keywords: self-supervised
Abstract: Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see this https URL. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.

Title: DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline

Authors: Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23377
Pdf URL: https://arxiv.org/pdf/2511.23377
Copy Paste: [[2511.23377]] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline(https://arxiv.org/abs/2511.23377)
Keywords: diffusion, foundation model
Abstract: Diffusion-based image editing has made semantic level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML this http URL dataset can be accessed via this https URL.

Title: VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23386
Pdf URL: https://arxiv.org/pdf/2511.23386
Copy Paste: [[2511.23386]] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction(https://arxiv.org/abs/2511.23386)
Keywords: foundation model
Abstract: Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.

Title: Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Authors: Jiajun Guo, Xin Luo, Jie Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.23402
Pdf URL: https://arxiv.org/pdf/2511.23402
Copy Paste: [[2511.23402]] Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning(https://arxiv.org/abs/2511.23402)
Keywords: foundation model
Abstract: Split learning is well known as a method for resolving data privacy concerns by training a model on distributed devices, thereby avoiding data sharing that raises privacy issues. However, high network communication costs are always an impediment to split learning, especially for large foundation models that require transmitting large amounts of high-dimensional data. To resolve this issue, we present a new multimodal model structure that incorporates a learning-based data compression method, which compresses model embeddings into low-bit integers while preserving the model's performance, greatly reducing the transmission costs between partitions. We then determine the optimal number of discrete representation levels based on a solid theoretical foundation from entropy coding.

Title: LFM2 Technical Report

Authors: Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T.H. Smith, Aditya Tadimeti, Neehal Tumma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23404
Pdf URL: https://arxiv.org/pdf/2511.23404
Copy Paste: [[2511.23404]] LFM2 Technical Report(https://arxiv.org/abs/2511.23404)
Keywords: foundation model
Abstract: We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, this http URL, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.

Title: Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Authors: Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23429
Pdf URL: https://arxiv.org/pdf/2511.23429
Copy Paste: [[2511.23429]] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model(https://arxiv.org/abs/2511.23429)
Keywords: foundation model, generative
Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally defined the concept of interactive video data and developed an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts(MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".

Title: ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

Authors: Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23442
Pdf URL: https://arxiv.org/pdf/2511.23442
Copy Paste: [[2511.23442]] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts(https://arxiv.org/abs/2511.23442)
Keywords: generative
Abstract: Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching's feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.

Title: Physics-Informed Neural Networks for Thermophysical Property Retrieval

Authors: Ali Waseem, Malcolm Mielle
Subjects: cs.LG, cs.AI, cs.CE, cs.CV
Abstract URL: https://arxiv.org/abs/2511.23449
Pdf URL: https://arxiv.org/pdf/2511.23449
Copy Paste: [[2511.23449]] Physics-Informed Neural Networks for Thermophysical Property Retrieval(https://arxiv.org/abs/2511.23449)
Keywords: diffusion
Abstract: Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have wide-ranging uses, but a critical application lies in quantifying how building facade renovation reduces thermal transmittance, a key determinant of building energy efficiency. However, solving inverse heat problems with non-invasive data collected in situ is error-prone due to environmental variability or deviations from theoretically assumed conditions. Hence, current methods for measuring thermal conductivity are either invasive, require lengthy observation periods, or are sensitive to environmental and experimental conditions. Here, we present a PINN-based iterative framework to estimate the thermal conductivity k of a wall from a set of thermographs; our framework alternates between estimating the forward heat problem with a PINN for a fixed k, and optimizing k by comparing the thermographs and surface temperatures predicted by the PINN, repeating until the estimated k's convergence. Using both environmental data captured by a weather station and data generated from Finite-Volume-Method software simulations, we accurately predict k across different environmental conditions and data collection sampling times, given the temperature profile of the wall at dawn is close to steady state. Although violating the steady-state assumption impacts the accuracy of k's estimation, we show that our proposed framework still only exhibits a maximum MAE of 4.0851. Our work demonstrates the potential of PINN-based methods for reliable estimation of material properties in situ and under realistic conditions, without lengthy measurement campaigns. Given the lack of research on using machine learning, and more specifically on PINNs, for solving in-situ inverse problems, we expect our work to be a starting point for more research on the topic.

Title: Object-Centric Data Synthesis for Category-level Object Detection

Authors: Vikhyat Agarwal, Jiayi Cora Guo, Declan Hoban, Sissi Zhang, Nicholas Moran, Peter Cho, Srilakshmi Pattabiraman, Shantanu Joshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23450
Pdf URL: https://arxiv.org/pdf/2511.23450
Copy Paste: [[2511.23450]] Object-Centric Data Synthesis for Category-level Object Detection(https://arxiv.org/abs/2511.23450)
Keywords: diffusion
Abstract: Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, when limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.

Title: SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments

Authors: Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Yubei Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23465
Pdf URL: https://arxiv.org/pdf/2511.23465
Copy Paste: [[2511.23465]] SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments(https://arxiv.org/abs/2511.23465)
Keywords: diffusion
Abstract: Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.

Title: Visual Generation Tuning

Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23469
Pdf URL: https://arxiv.org/pdf/2511.23469
Copy Paste: [[2511.23469]] Visual Generation Tuning(https://arxiv.org/abs/2511.23469)
Keywords: diffusion, foundation model
Abstract: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at this https URL.

Title: ThetaEvolve: Test-time Learning on Open Problems

Authors: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.23473
Pdf URL: https://arxiv.org/pdf/2511.23473
Copy Paste: [[2511.23473]] ThetaEvolve: Test-time Learning on Open Problems(https://arxiv.org/abs/2511.23473)
Keywords: in-context
Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system that models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, etc. ThetaEvolve is the first evolving framework that enable a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Besides, across two models and four open tasks, we find that ThetaEvolve with RL at test-time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both trained target task and other unseen tasks. We release our code publicly: this https URL

Title: AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Authors: Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23475
Pdf URL: https://arxiv.org/pdf/2511.23475
Copy Paste: [[2511.23475]] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement(https://arxiv.org/abs/2511.23475)
Keywords: diffusion, generative
Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.