2025-06-09

Title: Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching

Authors: Tinglin Huang, Tianyu Liu, Mehrtash Babadi, Wengong Jin, Rex Ying
Subjects: cs.CV, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.05361
Pdf URL: https://arxiv.org/pdf/2506.05361
Copy Paste: [[2506.05361]] Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching(https://arxiv.org/abs/2506.05361)
Keywords: generation, generative
Abstract: Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
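Summary: Spatial transcriptomics (ST) links histology imaging with gene expression profiling, but low throughput and the need for specialized facilities limit its use. Earlier methods that predict ST from whole-slide images treat each spot independently, so they ignore cell-cell interaction, and their encoders hit memory limits on slides that often exceed 10,000 spots. STFlow is a flow matching generative model that models the joint distribution of gene expression over an entire slide and uses a slide-level encoder with local spatial attention. On HEST-1k and STImage-1K4M it outperforms state-of-the-art baselines, with over 18% relative improvement over pathology foundation models.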

Title: Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards

Authors: Aakash Garg, Libing Zeng, Andrii Tsarov, Nima Khademi Kalantari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05367
Pdf URL: https://arxiv.org/pdf/2506.05367
Copy Paste: [[2506.05367]] Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards(https://arxiv.org/abs/2506.05367)
Keywords: generation
Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
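Summary: A diffusion-based method for generating stereo images from a text prompt. Because stereo datasets with large baselines are scarce, the authors fine-tune Stable Diffusion on stereo image data instead of training from scratch, then tune the model further with prompt alignment and a proposed stereo consistency reward. Experiments report higher-quality stereo images than existing methods across diverse scenarios.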

Title: Speaking images. A novel framework for the automated self-description of artworks

Authors: Valentine Bernasconi, Gustavo Marfia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05368
Pdf URL: https://arxiv.org/pdf/2506.05368
Copy Paste: [[2506.05368]] Speaking images. A novel framework for the automated self-description of artworks(https://arxiv.org/abs/2506.05368)
Keywords: generative
Abstract: Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease the access and highlight the content of digital collections. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation to the original historical object. Based on the concept of the autonomous image, we propose a new framework towards the production of self-explaining cultural artifacts using open-source large-language, face detection, text-to-speech and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of the latter where the main character animates to explain its content. The whole process questions cultural biases encapsulated in large-language models, the potential of digital images and deepfakes of artworks for educational purposes, along with concerns of the field of art history regarding such creative diversions.
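Summary: Building on the concept of the autonomous image, the authors propose a framework that turns a digitized artwork into a short video in which the main character animates to explain the work, using open-source large language, face detection, text-to-speech, and audio-to-animation models. The process raises questions about cultural biases in large language models, the educational potential of digital images and deepfakes of artworks, and art historians' concerns about such creative diversions.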

Title: An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos

Authors: Shayantani Kar, B. Shresth Bhimrajka, Aditya Kumar, Sahil Gupta, Sourav Ghosh, Subhamita Mukherjee, Shauvik Paul
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05377
Pdf URL: https://arxiv.org/pdf/2506.05377
Copy Paste: [[2506.05377]] An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos(https://arxiv.org/abs/2506.05377)
Keywords: generative
Abstract: Rapid spread of false images and videos on online platforms is an emerging problem. Anyone may add, delete, clone or modify people and entities from an image using various editing software which are readily available. This generates false and misleading proof to hide the crime. Now-a-days, these false and counterfeit images and videos are flooding on the internet. These spread false information. Many methods are available in literature for detecting those counterfeit contents but new methods of counterfeiting are also evolving. Generative Adversarial Networks (GAN) are observed to be one effective method as it modifies the context and definition of images producing plausible results via image-to-image translation. This work uses an independent discriminant network that can identify GAN generated image or video. A discriminant network has been created using a convolutional neural network based on InceptionResNetV2. The article also proposes a platform where users can detect forged images and videos. This proposed work has the potential to help the forensics domain to detect counterfeit videos and hidden criminal evidence towards the identification of criminal activities.
Summary: False and counterfeit images and videos spread quickly online, and counterfeiting methods keep evolving alongside detection methods. This work builds an independent discriminant network, a convolutional neural network based on InceptionResNetV2, to identify GAN-generated images and videos, and proposes a platform where users can check for forgeries. The authors suggest it could help forensic work detect counterfeit videos and hidden criminal evidence.

Title: Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Authors: Zhuoxuan Cai, Jian Zhang, Xinbin Yuan, Pengtao Jiang, Wenxiang Chen, Bowen Tang, Lujian Yao, Qiyuan Wang, Jinwen Chen, Bo Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05384
Pdf URL: https://arxiv.org/pdf/2506.05384
Copy Paste: [[2506.05384]] Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment(https://arxiv.org/abs/2506.05384)
Keywords: quality assessment
Abstract: Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, leading to a trade-off: models adept at quality reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation hinders the full potential of MLLMs in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. Specifically, in the first stage, we distill high-quality data from a teacher model through expert-designed prompts, initializing reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating the generalization potential over diverse tasks.
Summary: Multimodal large language models (MLLMs) can assess visual quality interpretably, but existing approaches optimize scoring and reasoning descriptions separately, which trades score accuracy against interpretability. Q-Ponder uses a unified two-stage pipeline: a cold-start stage that distills data from a teacher model under cross-entropy supervision (yielding Q-Ponder-CI), followed by reinforcement learning fine-tuning with Group Relative Policy Optimization (GRPO) and a novel reward that jointly targets scoring accuracy and reasoning consistency. Q-Ponder reports state-of-the-art score regression, with up to 6.5% higher SRCC on cross-domain datasets, and outperforms description-based models including its teacher, Qwen-2.5-VL-72B.
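The SRCC figure quoted above is the Spearman rank correlation coefficient between predicted and reference quality scores. As a minimal sketch, assuming scores arrive as plain Python lists (the function names `ranks` and `srcc` are illustrative and not taken from the paper), it can be computed as the Pearson correlation of the two rank vectors:

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def srcc(predicted, target):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rp, rt = ranks(predicted), ranks(target)
    n = len(rp)
    mean_p, mean_t = sum(rp) / n, sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    # Undefined (division by zero) if either input is constant.
    return cov / (var_p * var_t) ** 0.5
```

Because only ranks matter, a model whose scores are monotonically related to the reference scores reaches an SRCC of 1 even when the numeric values differ.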

Title: TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Authors: Mert Can Cakmak, Nitin Agarwal, Diwash Poudel
Subjects: cs.CV, cs.IR, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.05395
Pdf URL: https://arxiv.org/pdf/2506.05395
Copy Paste: [[2506.05395]] TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations(https://arxiv.org/abs/2506.05395)
Keywords: quality assessment
Abstract: Efficient keyframe extraction is critical for effective video summarization and retrieval, yet capturing the complete richness of video content remains challenging. In this work, we present TriPSS, a novel tri-modal framework that effectively integrates perceptual cues from color features in the CIELAB space, deep structural embeddings derived from ResNet-50, and semantic context from frame-level captions generated by Llama-3.2-11B-Vision-Instruct. By fusing these diverse modalities using principal component analysis, TriPSS constructs robust multi-modal embeddings that enable adaptive segmentation of video content via HDBSCAN clustering. A subsequent refinement stage incorporating quality assessment and duplicate filtering ensures that the final keyframe set is both concise and semantically rich. Comprehensive evaluations on benchmark datasets TVSum20 and SumMe demonstrate that TriPSS achieves state-of-the-art performance, substantially outperforming traditional unimodal and previous multi-modal methods. These results underscore TriPSS's ability to capture nuanced visual and semantic information, thereby setting a new benchmark for video content understanding in large-scale retrieval scenarios.
Summary: TriPSS is a tri-modal keyframe extraction framework that combines perceptual color features in the CIELAB space, structural embeddings from ResNet-50, and semantic context from frame-level captions generated by Llama-3.2-11B-Vision-Instruct. The modalities are fused with principal component analysis, the fused embeddings are segmented with HDBSCAN clustering, and a refinement stage applies quality assessment and duplicate filtering. On TVSum20 and SumMe, TriPSS reports state-of-the-art results over unimodal and earlier multi-modal methods.

Title: Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation

Authors: Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05399
Pdf URL: https://arxiv.org/pdf/2506.05399
Copy Paste: [[2506.05399]] Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation(https://arxiv.org/abs/2506.05399)
Keywords: generation
Abstract: Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic inconsistencies, data scarcity in non-English languages, and limitations in reasoning ability. Finally, we outline future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.
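Summary: A survey of attention-based image captioning models across multiple languages, grouped into transformer-based, deep learning-based, and hybrid approaches. It covers benchmark datasets, evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and the challenges of multilingual captioning, including semantic inconsistencies, data scarcity in non-English languages, and limited reasoning ability. It closes with future directions such as multimodal learning and real-time applications in AI-powered assistants, healthcare, and forensic analysis.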

Title: Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction

Authors: Zhihao Tang, Chaozhuo Li, Litian Zhang, Xi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05428
Pdf URL: https://arxiv.org/pdf/2506.05428
Copy Paste: [[2506.05428]] Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction(https://arxiv.org/abs/2506.05428)
Keywords: generation
Abstract: Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy--making fast predictions from a single baseline sMRI--and accuracy--leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven "linguistic compass" is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5-12%.
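Summary: Early prediction of Mild Cognitive Impairment (MCI) conversion faces a trade-off between immediacy (a single baseline sMRI) and accuracy (longitudinal scans). MCI-Diff is a diffusion-based framework that synthesizes clinically plausible future sMRI representations from baseline data. It trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling, and uses an LLM-driven "linguistic compass" in which generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers. On ADNI and AIBL cohorts it improves early conversion accuracy by 5-12% over state-of-the-art baselines.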

Title: BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Authors: Ludovic Arnould, Salim Khazem, Hugues Ali Mehenni
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05440
Pdf URL: https://arxiv.org/pdf/2506.05440
Copy Paste: [[2506.05440]] BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models(https://arxiv.org/abs/2506.05440)
Keywords: generation
Abstract: Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at this https URL.
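Summary: Current benchmarks for Visual Language Models (VLMs) pair annotated real images with pre-defined multiple choice questions, which is costly to annotate, risks information leakage, and does not separate failures of perception from failures of reasoning or general knowledge. Inspired by ophthalmologic diagnostics, BYO-Eval uses procedural generation of synthetic images to control visual attributes: image collections vary one factor of interest (e.g., the number of objects in a counting task) with increasing difficulty while other visual parameters stay constant. This supports systematic stress testing and fine-grained failure analysis of VLM capabilities.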

Title: Degradation-Aware Image Enhancement via Vision-Language Classification

Authors: Jie Cai, Kangning Yang, Jiaming Ding, Lan Fu, Ling Ouyang, Jiang Li, Jinglin Shen, Zibo Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05450
Pdf URL: https://arxiv.org/pdf/2506.05450
Copy Paste: [[2506.05450]] Degradation-Aware Image Enhancement via Vision-Language Classification(https://arxiv.org/abs/2506.05450)
Keywords: restoration, super-resolution
Abstract: Image degradation is a prevalent issue in various real-world applications, affecting visual quality and downstream processing tasks. In this study, we propose a novel framework that employs a Vision-Language Model (VLM) to automatically classify degraded images into predefined categories. The VLM categorizes an input image into one of four degradation types: (A) super-resolution degradation (including noise, blur, and JPEG compression), (B) reflection artifacts, (C) motion blur, or (D) no visible degradation (high-quality image). Once classified, images assigned to categories A, B, or C undergo targeted restoration using dedicated models tailored for each specific degradation type. The final output is a restored image with improved visual quality. Experimental results demonstrate the effectiveness of our approach in accurately classifying image degradations and enhancing image quality through specialized restoration models. Our method presents a scalable and automated solution for real-world image enhancement tasks, leveraging the capabilities of VLMs in conjunction with state-of-the-art restoration techniques.
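Summary: A framework in which a Vision-Language Model (VLM) classifies an input image into one of four categories: (A) super-resolution degradation (noise, blur, JPEG compression), (B) reflection artifacts, (C) motion blur, or (D) no visible degradation. Images in categories A, B, or C are then restored by a dedicated model for that degradation type. Experiments show accurate degradation classification and improved image quality, offering a scalable, automated approach to real-world image enhancement.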
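The classify-then-restore pipeline described above amounts to a dispatch on the predicted category. The following is a rough sketch under stated assumptions: the three restorer functions are hypothetical placeholders that only tag their input, and `classify` stands in for the VLM call, whose actual interface the abstract does not describe.

```python
# Hypothetical stand-ins for the dedicated restoration models.
def restore_super_resolution(image):
    return ("sr_restored", image)


def remove_reflection(image):
    return ("reflection_removed", image)


def deblur_motion(image):
    return ("motion_deblurred", image)


# Categories A-C map to a restorer; category D means no visible degradation.
RESTORERS = {
    "A": restore_super_resolution,
    "B": remove_reflection,
    "C": deblur_motion,
}


def enhance(image, classify):
    """Route an image to the restorer matching its predicted category."""
    label = classify(image)  # assumed to return "A", "B", "C", or "D"
    if label == "D":
        return image  # high-quality input passes through unchanged
    if label not in RESTORERS:
        raise ValueError(f"unexpected degradation label: {label!r}")
    return RESTORERS[label](image)
```

Rejecting unknown labels explicitly is a design choice in this sketch: a free-text classifier can return something outside the four categories, and silently passing such images through would hide classification failures.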

Title: Implicit Neural Representation for Video Restoration

Authors: Mary Aiyetigbo, Wanqi Yuan, Feng Luo, Nianyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05488
Pdf URL: https://arxiv.org/pdf/2506.05488
Copy Paste: [[2506.05488]] Implicit Neural Representation for Video Restoration(https://arxiv.org/abs/2506.05488)
Keywords: restoration, super-resolution
Abstract: High-resolution (HR) videos play a crucial role in many computer vision applications. Although existing video restoration (VR) methods can significantly enhance video quality by exploiting temporal information across video frames, they are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations beyond their training distribution. In this paper, we introduce VR-INR, a novel video restoration approach based on Implicit Neural Representations (INRs) that is trained only on a single upscaling factor ($\times 4$) but generalizes effectively to arbitrary, unseen super-resolution scales at test time. Notably, VR-INR also performs zero-shot denoising on noisy input, despite never having seen noisy data during training. Our method employs a hierarchical spatial-temporal-texture encoding framework coupled with multi-resolution implicit hash encoding, enabling adaptive decoding of high-resolution and noise-suppressed frames from low-resolution inputs at any desired magnification. Experimental results show that VR-INR consistently maintains high-quality reconstructions at unseen scales and noise during training, significantly outperforming state-of-the-art approaches in sharpness, detail preservation, and denoising efficacy.
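Summary: Existing video restoration (VR) methods are typically trained for fixed upscaling factors and handle scales or degradations outside their training distribution poorly. VR-INR is based on Implicit Neural Representations (INRs): it is trained only at a single upscaling factor ($\times 4$) yet generalizes to arbitrary, unseen super-resolution scales at test time, and it performs zero-shot denoising despite never seeing noisy data in training. It combines a hierarchical spatial-temporal-texture encoding framework with multi-resolution implicit hash encoding, and reports better sharpness, detail preservation, and denoising than state-of-the-art approaches.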

Title: Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models

Authors: Sima Noorani, Shayan Kiyani, George Pappas, Hamed Hassani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05497
Pdf URL: https://arxiv.org/pdf/2506.05497
Copy Paste: [[2506.05497]] Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models(https://arxiv.org/abs/2506.05497)
Keywords: generative
Abstract: Uncertainty quantification (UQ) is essential for safe deployment of generative AI models such as large language models (LLMs), especially in high-stakes applications. Conformal prediction (CP) offers a principled uncertainty quantification framework, but classical methods focus on regression and classification, relying on geometric distances or softmax scores: tools that presuppose structured outputs. We depart from this paradigm by studying CP in a query-only setting, where prediction sets must be constructed solely from finite queries to a black-box generative model, introducing a new trade-off between coverage, test-time query budget, and informativeness. We introduce Conformal Prediction with Query Oracle (CPQ), a framework characterizing the optimal interplay between these objectives. Our finite-sample algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets. Remarkably, both are rooted in the classical missing mass problem in statistics. Specifically, the optimal query policy depends on the rate of decay, or the derivative, of the missing mass, for which we develop a novel estimator. Meanwhile, the optimal mapping hinges on the missing mass itself, which we estimate using Good-Turing estimators. We then turn our focus to implementing our method for language models, where outputs are vast, variable, and often underspecified. Fine-grained experiments on three real-world open-ended tasks and two LLMs show CPQ's applicability to any black-box LLM and highlight: (1) the individual contribution of each principle to CPQ performance, and (2) CPQ's ability to yield significantly more informative prediction sets than existing conformal methods for language uncertainty quantification.
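Summary: Classical conformal prediction (CP) relies on geometric distances or softmax scores, which presuppose structured outputs. This work studies CP in a query-only setting, where prediction sets are built solely from a finite number of queries to a black-box generative model, creating a trade-off between coverage, test-time query budget, and informativeness. The proposed framework, Conformal Prediction with Query Oracle (CPQ), rests on the classical missing mass problem: the optimal query policy depends on the derivative of the missing mass, for which the authors develop a novel estimator, and the optimal mapping to prediction sets depends on the missing mass itself, estimated with Good-Turing estimators. Experiments on three open-ended tasks and two LLMs show more informative prediction sets than existing conformal methods.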
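For readers unfamiliar with the missing mass problem mentioned above: the missing mass is the total probability of outcomes not yet observed, and the classical Good-Turing estimate of it is the fraction of samples whose outcome appeared exactly once. The sketch below shows only that textbook estimator, assuming model responses have already been mapped to hashable canonical answers. It is not the CPQ procedure or the paper's derivative estimator, and the function name is illustrative.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Estimate the probability mass of unseen outcomes as N1 / n.

    N1 is the number of distinct outcomes observed exactly once and
    n is the total number of samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


# Seven hypothetical sampled answers; "b" and "d" each occur once.
answers = ["a", "a", "b", "c", "c", "c", "d"]
estimate = good_turing_missing_mass(answers)  # 2 singletons out of 7 samples
```

Intuitively, many one-off answers suggest that further queries would keep producing new answers, while few of them suggest the sampled set already covers most of the model's output distribution.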

Title: The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

Authors: Alex Damian, Jason D. Lee, Joan Bruna
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05500
Pdf URL: https://arxiv.org/pdf/2506.05500
Copy Paste: [[2506.05500]] The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models(https://arxiv.org/abs/2506.05500)
Keywords: generative
Abstract: In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the generative leap exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).
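Summary: The paper studies Gaussian multi-index models, where labels depend on $d$-dimensional Gaussian inputs only through their projection onto a low-dimensional subspace of dimension $r = O_d(1)$, and asks how to estimate that hidden subspace efficiently and agnostically. It introduces the generative leap exponent $k^\star$, extending the generative exponent of [Damian et al.'24] to the multi-index setting. A sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is shown to be necessary for algorithms in the Low-Degree-Polynomial framework and also sufficient, via a sequential estimation procedure based on a spectral U-statistic over Hermite tensors. The exponent is computed for examples including piecewise linear functions (deep ReLU networks with bias) and general deep neural networks.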
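To make the stated rate concrete, the block below evaluates the exponent $1 \vee k^\star/2$ for a few values of $k^\star$, reading $\vee$ as the maximum (the usual convention, assumed here). These are arithmetic instances of the formula only, not results quoted from the paper.

```latex
\[
n = \Theta\!\left(d^{\,1 \vee k^\star/2}\right),
\qquad 1 \vee \tfrac{k^\star}{2} = \max\!\left(1, \tfrac{k^\star}{2}\right)
\]
\[
\begin{aligned}
k^\star = 1 &: \quad \max(1, \tfrac12) = 1 &&\Rightarrow\; n = \Theta(d) \\
k^\star = 2 &: \quad \max(1, 1) = 1 &&\Rightarrow\; n = \Theta(d) \\
k^\star = 3 &: \quad \max(1, \tfrac32) = \tfrac32 &&\Rightarrow\; n = \Theta(d^{3/2}) \\
k^\star = 4 &: \quad \max(1, 2) = 2 &&\Rightarrow\; n = \Theta(d^{2})
\end{aligned}
\]
```

The sample requirement is therefore linear in $d$ whenever $k^\star \le 2$ and grows polynomially faster beyond that.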

Title: FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Authors: Kaihang Pan, Wendong Bu, Yuruo Wu, Yang Wu, Kai Shen, Yunfei Li, Hang Zhao, Juncheng Li, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05501
Pdf URL: https://arxiv.org/pdf/2506.05501
Copy Paste: [[2506.05501]] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL(https://arxiv.org/abs/2506.05501)
Keywords: generation
Abstract: Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
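Summary: Autoregressive text-to-image models now perform comparably to diffusion models, but the new PairComp benchmark, built from paired prompts with similar syntax and different fine-grained semantics, shows that they struggle with fine-grained text-image alignment and precise control over visual tokens. FocusDiff addresses this by focusing on subtle differences between similar text-image pairs: the authors construct a dataset of paired texts and images with similar overall expressions but distinct local semantics, and introduce a reinforcement learning algorithm that emphasizes those differences. The method reports state-of-the-art results on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.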

Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05523
Pdf URL: https://arxiv.org/pdf/2506.05523
Copy Paste: [[2506.05523]] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning(https://arxiv.org/abs/2506.05523)
Keywords: generation, generative
Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including various Gemini 2.5 Pro and OpenAI o3 which represent the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
Summary: Current multimodal reasoning benchmarks rely mostly on static images, focus narrowly on mathematical problem-solving, and saturate quickly. MORSE-500 (Multimodal Reasoning Stress-test Environment) is a video benchmark of 500 fully scripted clips with embedded questions across six reasoning categories. Instances are generated programmatically with deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage, so visual complexity, distractor density, and temporal dynamics can be scaled as models improve. Initial experiments with systems including Gemini 2.5 Pro, OpenAI o3, and strong open-source models show substantial performance gaps in every category, especially abstract and planning tasks. The dataset, generation scripts, and evaluation harness are released.

Title: On Fitting Flow Models with Large Sinkhorn Couplings

Authors: Michal Klein, Alireza Mousavi-Hosseini, Stephen Zhang, Marco Cuturi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05526
Pdf URL: https://arxiv.org/pdf/2506.05526
Copy Paste: [[2506.05526]] On Fitting Flow Models with Large Sinkhorn Couplings(https://arxiv.org/abs/2506.05526)
Keywords: generation
Abstract: Flow models transform data gradually from one modality (e.g. noise) onto another (e.g. images). Such models are parameterized by a time-dependent velocity field, trained to fit segments connecting pairs of source and target points. When the pairing between source and target points is given, training flow models boils down to a supervised regression problem. When no such pairing exists, as is the case when generating data from noise, training flows is much harder. A popular approach lies in picking source and target points independently. This can, however, lead to velocity fields that are not only slow to train but also costly to integrate at inference time. In theory, one would greatly benefit from training flow models by sampling pairs from an optimal transport (OT) measure coupling source and target, since this would lead to a highly efficient flow solving the Benamou and Brenier dynamical OT problem. In practice, recent works have proposed to sample mini-batches of $n$ source and $n$ target points and reorder them using an OT solver to form better pairs. These works have advocated using batches of size $n\approx 256$, and considered OT solvers that return couplings that are either sharp (using e.g. the Hungarian algorithm) or blurred (using e.g. entropic regularization, a.k.a. Sinkhorn). We follow in the footsteps of these works by exploring the benefits of increasing $n$ by three to four orders of magnitude, and look more carefully at the effect of the entropic regularization $\varepsilon$ used in the Sinkhorn algorithm. Our analysis is facilitated by new scale-invariant quantities to report the sharpness of a coupling, while our sharded computations across multiple GPUs or GPU nodes allow scaling up $n$. We show that in both synthetic and image generation tasks, flow models greatly benefit when fitted with large Sinkhorn couplings, with a low entropic regularization $\varepsilon$.
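The mini-batch pairing step described in this abstract can be illustrated with a small sketch. This is not the authors' sharded implementation: it is a plain-Python Sinkhorn loop on tiny point clouds, assuming uniform weights, a squared Euclidean cost, and equal batch sizes. The function names `sinkhorn_coupling` and `sample_pairs` are hypothetical. A small `epsilon` gives a sharp coupling and a large one gives a blurred coupling.

```python
import math
import random


def sinkhorn_coupling(source, target, epsilon, n_iters=200):
    """Entropic OT coupling between two equal-size point clouds.

    Assumptions (illustrative only): uniform marginals 1/n, squared
    Euclidean cost, and epsilon large enough that exp(-cost / epsilon)
    does not underflow; a log-domain variant would be needed otherwise.
    """
    n = len(source)
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in target]
            for x in source]
    # Gibbs kernel K = exp(-C / epsilon).
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * n
    marginal = 1.0 / n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [marginal / sum(kernel[i][j] * v[j] for j in range(n))
             for i in range(n)]
        v = [marginal / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(n)]


def sample_pairs(coupling, rng):
    """For each source index i, draw a target index j from row i."""
    pairs = []
    for i, row in enumerate(coupling):
        j = rng.choices(range(len(row)), weights=row, k=1)[0]
        pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    tgt = [(0.1, 0.0), (1.1, 0.0), (0.0, 1.1)]
    sharp = sinkhorn_coupling(src, tgt, epsilon=0.05)
    print(sample_pairs(sharp, random.Random(0)))
```

In a flow-matching training loop, the sampled `(i, j)` pairs would replace independent source/target pairing when building the regression targets for the velocity field.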

Title: SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Authors: Arnesh Batra, Anushk Kumar, Jashn Khemani, Arush Gumber, Arhan Jain, Somil Gupta
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05538
Pdf URL: https://arxiv.org/pdf/2506.05538
Copy Paste: [[2506.05538]] SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms(https://arxiv.org/abs/2506.05538)
Keywords: generative
Abstract: The rapid advancement of deep generative models has significantly improved the realism of synthetic media, presenting both opportunities and security challenges. While deepfake technology has valuable applications in entertainment and accessibility, it has emerged as a potent vector for misinformation campaigns, particularly on social media. Existing detection frameworks struggle to distinguish between benign and adversarially generated deepfakes engineered to manipulate public perception. To address this challenge, we introduce SocialDF, a curated dataset reflecting real-world deepfake challenges on social media platforms. This dataset encompasses high-fidelity deepfakes sourced from various online ecosystems, ensuring broad coverage of manipulative techniques. We propose a novel LLM-based multi-factor detection approach that combines facial recognition, automated speech transcription, and a multi-agent LLM pipeline to cross-verify audio-visual cues. Our methodology emphasizes robust, multi-modal verification techniques that incorporate linguistic, behavioral, and contextual analysis to effectively discern synthetic media from authentic content.

Title: EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

Authors: Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05554
Pdf URL: https://arxiv.org/pdf/2506.05554
Copy Paste: [[2506.05554]] EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh(https://arxiv.org/abs/2506.05554)
Keywords: generation
Abstract: Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data only from monocular videos. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.

Title: PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Authors: Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05573
Pdf URL: https://arxiv.org/pdf/2506.05573
Copy Paste: [[2506.05573]] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers(https://arxiv.org/abs/2506.05573)
Keywords: generation, generative
Abstract: We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

Title: UniRes: Universal Image Restoration for Complex Degradations

Authors: Mo Zhou, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, Hossein Talebi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05599
Pdf URL: https://arxiv.org/pdf/2506.05599
Copy Paste: [[2506.05599]] UniRes: Universal Image Restoration for Complex Degradations(https://arxiv.org/abs/2506.05599)
Keywords: restoration, generative
Abstract: Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors; however, generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which are frequently seen in the wild. A simple yet flexible diffusion-based framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gains, especially for images with complex degradations.

Title: Controlled Data Rebalancing in Multi-Task Learning for Real-World Image Super-Resolution

Authors: Shuchen Lin, Mingtao Feng, Weisheng Dong, Fangfang Wu, Jianqiao Luo, Yaonan Wang, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05607
Pdf URL: https://arxiv.org/pdf/2506.05607
Copy Paste: [[2506.05607]] Controlled Data Rebalancing in Multi-Task Learning for Real-World Image Super-Resolution(https://arxiv.org/abs/2506.05607)
Keywords: super-resolution
Abstract: Real-world image super-resolution (Real-SR) is a challenging problem due to the complex degradation patterns in low-resolution images. Unlike approaches that assume a broadly encompassing degradation space, we focus specifically on achieving an optimal balance in how SR networks handle different degradation patterns within a fixed degradation space. We propose an improved paradigm that frames Real-SR as a data-heterogeneous multi-task learning problem and addresses task imbalance in this paradigm through coordinated advancements in task definition, imbalance quantification, and adaptive data rebalancing. Specifically, we introduce a novel task definition framework that segments the degradation space by setting parameter-specific boundaries for degradation operators, effectively reducing the task quantity while maintaining task discrimination. We then develop a focal-loss-based multi-task weighting mechanism that precisely quantifies task imbalance dynamics during model training. Furthermore, to prevent sporadic outlier samples from dominating the gradient optimization of the shared multi-task SR model, we strategically convert the quantified task imbalance into controlled data rebalancing through deliberate regulation of task-specific training volumes. Extensive quantitative and qualitative experiments demonstrate that our method achieves consistent superiority across all degradation tasks.

Title: GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance

Authors: Jiri Navratil, Jarret Ross, Payel Das, Youssef Mroueh, Samuel C Hoffman, Vijil Chenthamarakshan, Brian Belgodere
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05628
Pdf URL: https://arxiv.org/pdf/2506.05628
Copy Paste: [[2506.05628]] GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance(https://arxiv.org/abs/2506.05628)
Keywords: generation, generative
Abstract: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed $\sim$47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that GP-MoLFormer-Sim combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods when the oracle remains a black box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.

Title: Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

Authors: Andrey Zhmoginov, Jihwan Lee, Mark Sandler
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05641
Pdf URL: https://arxiv.org/pdf/2506.05641
Copy Paste: [[2506.05641]] Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones(https://arxiv.org/abs/2506.05641)
Keywords: generation
Abstract: Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.

Title: Learning to Weight Parameters for Data Attribution

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05647
Pdf URL: https://arxiv.org/pdf/2506.05647
Copy Paste: [[2506.05647]] Learning to Weight Parameters for Data Attribution(https://arxiv.org/abs/2506.05647)
Keywords: generative
Abstract: We study data attribution in generative models, aiming to identify which training examples most influence a given output. Existing methods achieve this by tracing gradients back to training data. However, they typically treat all network parameters uniformly, ignoring the fact that different layers encode different types of information and may thus draw information differently from the training set. We propose a method that models this by learning parameter importance weights tailored for attribution, without requiring labeled data. This allows the attribution process to adapt to the structure of the model, capturing which training examples contribute to specific semantic aspects of an output, such as subject, style, or background. Our method improves attribution accuracy across diffusion models and enables fine-grained insights into how outputs borrow from training data.

Title: Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Authors: Sajjad Abdoli, Freeman Lewin, Gediminas Vasiliauskas, Fabian Schonholz
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05673
Pdf URL: https://arxiv.org/pdf/2506.05673
Copy Paste: [[2506.05673]] Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery(https://arxiv.org/abs/2506.05673)
Keywords: generation
Abstract: The development of modern Artificial Intelligence (AI) models, particularly diffusion-based models employed in computer vision and image generation tasks, is undergoing a paradigmatic shift in development methodologies. Traditionally dominated by a "Model-Centric" approach, in which performance gains were primarily pursued through increasingly complex model architectures and hyperparameter optimization, the field is now recognizing a more nuanced "Data-Centric" approach. This emergent framework foregrounds the quality, structure, and relevance of training data as the principal driver of model performance. To operationalize this paradigm shift, we introduce the DataSeeds sample dataset (the "DSD"), initially comprising approximately 10,610 high-quality human peer-ranked photography images accompanied by extensive multi-tier annotations. The DSD is a foundational computer vision dataset designed to usher in a new standard for commercial image datasets. Representing a small fraction of DataSeeds' 100 million-plus image catalog, the DSD provides a scalable foundation necessary for robust commercial and multimodal AI development. Through this in-depth exploratory analysis, we document the quantitative improvements generated by the DSD on specific models against known benchmarks and make the code and the trained models used in our evaluation publicly available.

Title: Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization

Authors: Tailin Zhou, Zhilin Chen, Wenlong Lyu, Zhitang Chen, Danny H.K. Tsang, Jun Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05680
Pdf URL: https://arxiv.org/pdf/2506.05680
Copy Paste: [[2506.05680]] Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization(https://arxiv.org/abs/2506.05680)
Keywords: generation
Abstract: Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.

Title: Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Authors: Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour
Subjects: cs.LG, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05683
Pdf URL: https://arxiv.org/pdf/2506.05683
Copy Paste: [[2506.05683]] Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR(https://arxiv.org/abs/2506.05683)
Keywords: generation
Abstract: Extended reality (XR) systems, which consist of virtual reality (VR), augmented reality (AR), and mixed reality (MR), offer a transformative interface for immersive, multi-modal, and embodied human-computer interaction. In this paper, we envision that multi-modal multi-task (M3T) federated foundation models (FedFMs) can offer transformative capabilities for XR systems through integrating the representational strength of M3T foundation models (FMs) with the privacy-preserving model training principles of federated learning (FL). We present a modular architecture for FedFMs, which entails different coordination paradigms for model training and aggregations. Central to our vision is the codification of XR challenges that affect the implementation of FedFMs under the SHIFT dimensions: (1) Sensor and modality diversity, (2) Hardware heterogeneity and system-level constraints, (3) Interactivity and embodied personalization, (4) Functional/task variability, and (5) Temporality and environmental variability. We illustrate the manifestation of these dimensions across a set of emerging and anticipated applications of XR systems. Finally, we propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs in XR. This perspective aims to chart the technical and conceptual foundations for context-aware privacy-preserving intelligence in the next generation of XR systems.

Title: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Authors: Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05709
Pdf URL: https://arxiv.org/pdf/2506.05709
Copy Paste: [[2506.05709]] Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration(https://arxiv.org/abs/2506.05709)
Keywords: generation
Abstract: Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in compressing vision transformers dynamically at the token level. Current methods mainly focus on token pruning or merging to reduce the number of tokens, in which tokens are compressed exclusively, causing great information loss; post-training is therefore inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by 1.5$\times$ with a marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
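The unifying view in this abstract, token reduction as a matrix transformation, can be made concrete with a toy sketch. It writes the reduced tokens as W·X, where pruning uses one-hot rows, merging uses group-averaging rows, and a many-to-many transform uses rows that mix several tokens. How the paper actually constructs its many-to-many matrix is not stated in the abstract; the normalized hand-picked weights below, and all function names, are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x d), both lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def pruning_matrix(keep, n):
    """One-hot rows: each output token copies exactly one input token."""
    return [[1.0 if j == k else 0.0 for j in range(n)] for k in keep]


def merging_matrix(groups, n):
    """Averaging rows: each input token contributes to exactly one output."""
    return [[1.0 / len(g) if j in g else 0.0 for j in range(n)]
            for g in groups]


def many_to_many_matrix(weights):
    """Row-normalized non-negative weights: an input token may contribute
    to several outputs. The weights here are hypothetical placeholders."""
    return [[w / sum(row) for w in row] for row in weights]


if __name__ == "__main__":
    # Four tokens with two features each.
    tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    print(matmul(pruning_matrix([0, 2], 4), tokens))
    print(matmul(merging_matrix([[0, 1], [2, 3]], 4), tokens))
    soft = many_to_many_matrix([[3, 1, 0, 0], [0, 1, 1, 2]])
    print(matmul(soft, tokens))
```

Pruning and merging are special cases in which each column of W has at most one non-zero entry; the many-to-many case relaxes that constraint.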

Title: Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application

Authors: Xiucheng Wang, Honggang Jia, Nan Cheng, Dusit Niyato
Subjects: cs.LG, cs.IT, eess.SY
Abstract URL: https://arxiv.org/abs/2506.05710
Pdf URL: https://arxiv.org/pdf/2506.05710
Copy Paste: [[2506.05710]] Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application(https://arxiv.org/abs/2506.05710)
Keywords: generative
Abstract: In this paper, a novel semantic communication framework empowered by generative artificial intelligence (GAI) is proposed, specifically leveraging the capabilities of diffusion models (DMs). A rigorous theoretical foundation is established based on stochastic differential equations (SDEs), which elucidates the denoising properties of DMs in mitigating additive white Gaussian noise (AWGN) in latent semantic representations. Crucially, a closed-form analytical relationship between the signal-to-noise ratio (SNR) and the denoising timestep is derived, enabling the optimal selection of diffusion parameters for any given channel condition. To address the distribution mismatch between the received signal and the DM's training data, a mathematically principled scaling mechanism is introduced, ensuring robust performance across a wide range of SNRs without requiring model fine-tuning. Built upon this theoretical insight, we develop a latent diffusion model (LDM)-based semantic transceiver, wherein a variational autoencoder (VAE) is employed for efficient semantic compression, and a pretrained DM serves as a universal denoiser. Notably, the proposed architecture is fully training-free at inference time, offering high modularity and compatibility with large-scale pretrained LDMs. This design inherently supports zero-shot generalization and mitigates the challenges posed by out-of-distribution inputs. Extensive experimental evaluations demonstrate that the proposed framework significantly outperforms conventional neural-network-based semantic communication baselines, particularly under low SNR conditions and distributional shifts, thereby establishing a promising direction for GAI-driven robust semantic transmission in future 6G systems.
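The abstract does not give the closed-form SNR-to-timestep relationship or the scaling mechanism, so the sketch below is not the paper's derivation. It shows one standard way such a mapping can arise. It assumes a variance-preserving diffusion, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, a unit-power latent, and an AWGN channel y = x + n with noise variance 1/SNR. Matching the two signal-to-noise ratios gives ᾱ_t = SNR / (1 + SNR), and scaling y by sqrt(ᾱ_t) makes its statistics match x_t. The linear beta schedule and all function names are assumptions.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def target_alpha_bar(snr_linear):
    """Solve alpha_bar / (1 - alpha_bar) = SNR for alpha_bar."""
    return snr_linear / (1.0 + snr_linear)


def match_timestep(snr_db, alpha_bar):
    """Index of the timestep whose alpha_bar is closest to the channel's."""
    target = target_alpha_bar(10.0 ** (snr_db / 10.0))
    return min(range(len(alpha_bar)), key=lambda t: abs(alpha_bar[t] - target))


def scale_received(y, alpha_bar_t):
    """Scale the received latent so it matches the statistics of x_t."""
    return [math.sqrt(alpha_bar_t) * value for value in y]


if __name__ == "__main__":
    alpha_bar = cumulative_alpha_bar(linear_beta_schedule(1000))
    for snr_db in (0.0, 10.0, 20.0):
        print(snr_db, match_timestep(snr_db, alpha_bar))
```

Under these assumptions a noisier channel maps to a later timestep, so the pretrained denoiser runs more reverse steps at low SNR and fewer at high SNR.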

Title: BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Authors: Yunpeng Qing, Shuo Chen, Yixiao Chi, Shunyu Liu, Sixu Lin, Changqing Zou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05762
Pdf URL: https://arxiv.org/pdf/2506.05762
Copy Paste: [[2506.05762]] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning(https://arxiv.org/abs/2506.05762)
Keywords: generation, generative
Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate state. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
摘要：离线强化学习（RL）的最新进展已证明，有效的政策学习可以从对预采用的数据集施加保守的限制中受益。但是，这种静态数据集经常表现出分布偏差，从而导致有限的概括。为了解决此限制，一个直接的解决方案是数据增强（DA），它利用生成模型丰富了数据分布。尽管结果有很有希望的结果，但当前的DA技术仅着眼于从给定州重建未来的轨迹，同时忽略了对它们到达历史过渡的探索。这种单向范式不可避免地阻碍了各种行为模式的发现，尤其是那些导致可能产生高回报结果的关键状态的行为模式。在这项工作中，我们引入了双向轨迹扩散（Bitrajdiff），这是一个新型的离线RL框架，它可以对任何中间状态的未来和历史轨迹进行建模。具体而言，我们将轨迹生成任务分解为两个独立但互补的扩散过程：一种产生前向轨迹以预测未来的动态，而另一个产生向后的轨迹以追踪基本历史此HTTP URL可以有效地利用关键状态作为锚点扩展到潜在的有价值的国家，而又不偏向于状态空间，从而扩展了该数据，从而使数据均不相同。在D4RL基准套件上进行的广泛实验表明，与各种离线RL主链中的其他高级DA方法相比，Bitrajdiff的性能表现出色。

Title: LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Authors: Haojie Yu, Zhaonian Wang, Yihan Pan, Meng Cheng, Hao Yang, Chao Wang, Tao Xie, Xiaoming Xu, Xiaoming Wei, Xunliang Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05806
Pdf URL: https://arxiv.org/pdf/2506.05806
Copy Paste: [[2506.05806]] LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models(https://arxiv.org/abs/2506.05806)
Keywords: generation
Abstract: Diffusion-based models have gained wide adoption in virtual human generation due to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Firstly, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or state transitions, which significantly enhances the user experience. Secondly, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the inference speed. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. These methods ensure real-time performance and low latency while maintaining high-fidelity output. Thirdly, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Lastly, we design a novel mechanism for fine-grained facial expression control to exploit our model's inherent capacity. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.
摘要：由于其出色的表现力，基于扩散的模型已在虚拟人类中获得广泛的采用。但是，他们的实质性计算要求已限制了其在实时交互式化身应用程序中的部署，在实时互动化的化身应用程序中，严格的速度，延迟和持续时间要求至关重要。我们提出了一个基于扩散模型的新型音频驱动的肖像视频生成框架，以应对这些挑战。首先，我们提出了强大的可变长度视频生成，以减少生成初始视频剪辑或状态过渡所需的最小时间，从而大大增强用户体验。其次，我们为音频图像到视频提出了一致性模型培训策略，以确保实时性能，从而快速生成几步。进一步利用模型量化和管道并行性来加速推理速度。为了减轻扩散过程和模型量化产生的稳定性损失，我们引入了一种针对长期视频生成的新推理策略。这些方法可确保实时性能和低潜伏期，同时保持高保真输出。第三，我们将类标签合并为有条件的输入，以在说话，听力和闲置状态之间无缝切换。最后，我们设计了一种新型机制，用于细粒度的面部表达控制，以利用模型的固有能力。广泛的实验表明，我们的方法实现了低延迟，流体和真实的双向交流。在NVIDIA RTX 4090D上，我们的模型以384x384分辨率和45 fps的分辨率达到78 fps，分辨率为512x512，初始视频产生延迟分别为140 ms和215 ms。

Title: NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces

Authors: Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, Zhe Zhang, Yang Yang, Wu Chen, Anlong Ming, Mingshuai Zhao, Mengying Yu, Shida Gao, Xiangfeng Wang, Feng Xue, Jun Shi, Yong Yang, Yong A, Yixiang Jin, Dingzhe Li, Aryan Shukla, Liam Frija-Altarac, Matthew Toews, Hui Geng, Tianjiao Wan, Zijian Gao, Qisheng Xu, Kele Xu, Zijian Zang, Jameer Babu Pinjari, Kuldeep Purohit, Mykola Lavreniuk, Jing Cao, Shenyi Li, Kui Jiang, Junjun Jiang, Yong Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05815
Pdf URL: https://arxiv.org/pdf/2506.05815
Copy Paste: [[2506.05815]] NTIRE 2025 Challenge on HR Depth from Images of Specular and Transparent Surfaces(https://arxiv.org/abs/2506.05815)
Keywords: restoration
Abstract: This paper reports on the NTIRE 2025 challenge on HR Depth from Images of Specular and Transparent Surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2025. This challenge aims to advance research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The challenge proposes two tracks, on stereo and single-image depth estimation, attracting about 177 registered participants. In the final testing stage, four participating teams in each of the two tracks submitted their models and fact sheets.
摘要：本文报告了NTIRE 2025对HR深度的挑战，该挑战是镜面和透明表面的图像，以及在CVPR 2025上的新趋势（NTIRE）讲习班的新趋势。这一挑战旨在促进深度估计的研究，尤其是针对领域的两个开放式问题，而不是高级分子和非贵族。挑战提出了有关立体声和单图像深度估计的两条曲目，吸引了约177名注册参与者。在最后的测试阶段，有4和4个参与的团队为这两个曲目提交了模型和事实表。

Title: Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning

Authors: Ngoc Bui, Menglin Yang, Runjin Chen, Leonardo Neves, Mingxuan Ju, Rex Ying, Neil Shah, Tong Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05826
Pdf URL: https://arxiv.org/pdf/2506.05826
Copy Paste: [[2506.05826]] Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning(https://arxiv.org/abs/2506.05826)
Keywords: generation
Abstract: Backward-compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding the need to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding model and force the new model to reconstruct outdated representations regardless of their quality, thereby hindering the learning process of the new model. In this paper, we propose to switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.
摘要：向后兼容表示学习使更新的模型能够与现有的模型无缝集成，避免重新处理存储的数据。尽管有最近的进步，但欧几里得空间中的现有兼容性方法忽略了旧嵌入模型中的不确定性，并迫使新模型重建过时的表示，无论其质量如何，因此妨碍了新模型的学习过程。在本文中，我们建议将视角切换到双曲线几何形状，在那里我们将时间视为捕获模型的置信度和演变的自然轴。通过将嵌入到双曲线空间中，并将更新的嵌入在旧模型的组成锥中，我们可以在模型之间保持世代的一致性，同时考虑表示中的不确定性。为了进一步提高兼容性，我们引入了强大的对比度比对损失，该损失基于旧嵌入的不确定性，动态调节对齐权重。实验验证了所提出的方法在实现兼容性方面的优越性，为更具弹性和适应性的机器学习系统铺平了道路。

Title: FontAdapter: Instant Font Adaptation in Visual Text Generation

Authors: Myungkyu Koo, Subin Kim, Sangkyung Kwak, Jaehyun Nam, Seojin Kim, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05843
Pdf URL: https://arxiv.org/pdf/2506.05843
Copy Paste: [[2506.05843]] FontAdapter: Instant Font Adaptation in Visual Text Generation(https://arxiv.org/abs/2506.05843)
Keywords: generation
Abstract: Text-to-image diffusion models have significantly improved the seamless integration of visual text into diverse image contexts. Recent approaches further improve control over font styles through fine-tuning with predefined font dictionaries. However, adapting unseen fonts outside the preset is computationally expensive, often requiring tens of minutes, making real-time customization impractical. In this paper, we present FontAdapter, a framework that enables visual text generation in unseen fonts within seconds, conditioned on a reference glyph image. To this end, we find that direct training on font datasets fails to capture nuanced font attributes, limiting generalization to new glyphs. To overcome this, we propose a two-stage curriculum learning approach: FontAdapter first learns to extract font attributes from isolated glyphs and then integrates these styles into diverse natural backgrounds. To support this two-stage training scheme, we construct synthetic datasets tailored to each stage, leveraging large-scale online fonts effectively. Experiments demonstrate that FontAdapter enables high-quality, robust font customization across unseen fonts without additional fine-tuning during inference. Furthermore, it supports visual text editing, font style blending, and cross-lingual font transfer, positioning FontAdapter as a versatile framework for font customization tasks.
摘要：文本到图像扩散模型已显着改善了将视觉文本无缝集成到各种图像上下文中。最近的方法通过使用预定义的字体词典进行微调进一步改善了对字体样式的控制。但是，在预设之外调整看不见的字体在计算上是昂贵的，通常需要数十分钟，从而使实时自定义不切实际。在本文中，我们提出了Fontadapter，该框架可以在几秒钟内以看不见的字体形成视觉文本生成，并以参考字形图像为条件。为此，我们发现字体数据集上的直接培训无法捕获细微的字体属性，从而将概括性限制为新字形。为了克服这一点，我们提出了一种两阶段的课程学习方法：Fontadapter首先学会从孤立的字形提取字体属性，然后将这些样式集成到各种自然背景中。为了支持这一两阶段培训方案，我们构建了针对每个阶段量身定制的合成数据集，从而有效利用了大规模的在线字体。实验表明，fontadapter可以在不看到的字体上进行高质量，健壮的字体自定义，而无需在推理过程中进行其他微调。此外，它支持视觉文本编辑，字体样式混合和跨语性字体传输，将fontadapter定位为字体自定义任务的多功能框架。

Title: Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Authors: Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05872
Pdf URL: https://arxiv.org/pdf/2506.05872
Copy Paste: [[2506.05872]] Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection(https://arxiv.org/abs/2506.05872)
Keywords: generation, generative
Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
摘要：跨域几乎没有射击对象检测（CD-FSOD）旨在检测只有少数几个来自以前看不见的域的标记样品的新物体。尽管数据增强和生成方法在几次学习中已经表现出了希望，但由于需要视觉现实主义和域的一致性，它们对CD-FSOD的有效性尚不清楚。现有的策略，例如复制式增强和文本对图像生成，通常无法保留正确的对象类别或产生与目标域相干的背景，从而使它们不乏味直接应用于CD-FSOD。为了应对这些挑战，我们提出了针对CD-FSOD量身定制的无训练，检索引导的成分图像生成框架。域窗格由三个阶段组成：域感知背景检索，域引导背景生成和前景背景组成。具体而言，首先将输入图像分解为前景和背景区域。然后，我们以语义和风格相似的图像检索，以指导生成模型合成新的背景，并以原始和检索到的上下文为条件。最后，保留的前景与新生成的域对准背景组成，以形成生成的图像。无需进行任何其他监督或培训，域rag会在各种任务（包括CD-FSOD，遥感FSOD和伪装的FSOD）中产生高质量的域一致样本。广泛的实验表明，对强基线的一致改进并建立了新的最新结果。代码将在接受后发布。

Title: Exponential Family Variational Flow Matching for Tabular Data Generation

Authors: Andrés Guzmán-Cordero, Floor Eijkelboom, Jan-Willem van de Meent
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05940
Pdf URL: https://arxiv.org/pdf/2506.05940
Copy Paste: [[2506.05940]] Exponential Family Variational Flow Matching for Tabular Data Generation(https://arxiv.org/abs/2506.05940)
Keywords: generation, generative
Abstract: While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a Variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We thereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.
摘要：尽管降解扩散和流匹配促进了生成建模方面的重大进展，但它们在表格数据中的应用仍然有限，尽管它在现实世界中的应用中无处不在。为此，我们开发了TabbyFlow，这是用于表格数据生成的变异流匹配（VFM）方法。为了将VFM应用于具有混合连续和离散功能的数据，我们介绍了指数式的家庭变分流匹配（EF-VFM），该流量匹配（EF-VFM）代表使用一般指数式的家庭分布代表异质数据类型。特此，我们基于力矩匹配获得了有效的，数据驱动的目标，从而使混合连续和离散变量的概率路径有原则性学习。我们还建立了基于布雷格曼差异的变异流匹配和广义流匹配目标之间的联系。对表格数据基准的评估证明了与基线相比的最先进性能。

Title: MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

Authors: Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05952
Pdf URL: https://arxiv.org/pdf/2506.05952
Copy Paste: [[2506.05952]] MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation(https://arxiv.org/abs/2506.05952)
Keywords: generation
Abstract: Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.
摘要：基于变压器的文本到动作生成的最新进展导致了综合高质量人类运动的令人印象深刻的进步。然而，共同实现高忠诚，流媒体能力，实时响应能力和可扩展性仍然是一个基本挑战。在本文中，我们提出了Mogo（带有一通运动的运动产生），这是一种针对高效和实时3D运动的新型自回归框架。 MOGO包括两个关键组成部分：（1）MOSA-VQ，一种运动尺度 - 自适应残留矢量量化模块，在层次上分别通过可学习的缩放尺度分别离散运动序列，以产生紧凑而表达的表示；（2）RQHC转换器，一种剩余的级别层次因果变压器，在单个正向传球中生成多层运动令牌，从而大大降低了推理潜伏期。为了增强语义保真度，我们进一步介绍了一种文本条件对准机制，该机制在文本控制下改善了运动解码。与基于最先进的变压器的方法相比，在包括HumanML3D，Kit-ML和CMP在内的基准数据集进行的广泛实验表明，MOGO可在实时性能，流媒体产生和零拍设置下的广泛性方面实现实质性改善。

Title: AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models

Authors: Adil Hasan, Thomas Peyrin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05960
Pdf URL: https://arxiv.org/pdf/2506.05960
Copy Paste: [[2506.05960]] AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models(https://arxiv.org/abs/2506.05960)
Keywords: generation
Abstract: Significant investments have been made towards the commodification of diffusion models for generation of diverse media. Their mass-market adoption is however still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically towards diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for the extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.92 points lower than the full-precision model at W4A8 and the best-reported results for FID, sFID and ISC at W2A8. We are also able to demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to savings resulting from small integer operations which may lack broad hardware support.
摘要：已经对生成不同媒体的扩散模型进行了大量投资。然而，它们的大众市场采用仍然受到扩散模型推断的强烈硬件资源要求的困扰。专门针对扩散模型量身定制的模型量化策略对于减轻了这一负担很有用，但通常探索了量子标量量化（USQ）的量化方法。相反，以多个相关权重作为压缩基本单位的组运行的矢量量化（VQ）方法在大语言模型（LLM）量化中已经取得了巨大成功。在这项工作中，我们将基于代码的添加矢量量化应用于扩散模型压缩问题。我们最终的方法在20推理时间步长以20个推理时间步骤的标准类别条件基准上实现了在标准的LDM-4的标准类条件基准上进行极低的重量量化。值得注意的是，我们报告的SFID比W4A8上的全精度模型低1.92点，以及在W2A8上为FID，SFID和ISC报告的最佳报告结果。我们还能够通过有效的推理内核来证明在任意硬件上节省的拖船，而不是由小型整数操作产生的节省，这些操作可能缺乏广泛的硬件支持。
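
Sketch: the AQUATIC-Diff abstract describes codebook-based additive vector quantization, in which a group of related weights is stored as a few small indices and reconstructed as the sum of one codeword per codebook. The Python below is only a minimal illustration of that general idea, not the paper's implementation. The function names, the toy codebooks, and the greedy codebook-by-codebook encoder are all assumptions made for brevity; practical additive quantizers typically search over code combinations more carefully.

```python
# Hypothetical illustration of additive vector quantization.
# Names, codebooks, and the greedy encoder are assumptions, not the paper's method.

def decode_group(codebooks, codes):
    """Reconstruct one weight group as the sum of one codeword per codebook."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for book, idx in zip(codebooks, codes):
        for j, value in enumerate(book[idx]):
            out[j] += value
    return out


def encode_group(codebooks, group):
    """Greedy encoding: for each codebook in turn, pick the codeword
    closest (in squared error) to the remaining residual."""
    residual = list(group)
    codes = []
    for book in codebooks:
        best = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return codes


# Two toy codebooks with four 2-dimensional codewords each (2 bits per index).
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
group = [1.25, 0.25]
codes = encode_group(codebooks, group)
restored = decode_group(codebooks, codes)
```

In this toy setting a group of two weights is stored as two 2-bit indices, so the storage cost is 2 bits per weight plus the shared codebooks.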

Title: Restereo: Diffusion stereo video generation and restoration

Authors: Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Christian Theobalt, Cengiz Oztireli, Gurprit Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06023
Pdf URL: https://arxiv.org/pdf/2506.06023
Copy Paste: [[2506.06023]] Restereo: Diffusion stereo video generation and restoration(https://arxiv.org/abs/2506.06023)
Keywords: restoration, generation
Abstract: Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.
摘要：随着视频扩散模型的最新进展，立体视频的生成一直在引起人们的关注。但是，大多数现有方法都致力于从单眼2D视频中生成3D立体视频。这些方法通常假定输入单眼视频具有高质量，这使得任务主要是关于扭曲视频中插入封闭区域的任务，同时保留了分离的区域。在本文中，我们介绍了一条新的管道，该管道不仅生成立体声视频，而且还可以通过单个模型稳定地增强左视图和右视频视频。我们的方法通过在降级数据上微调模型来实现这一目标，并在翘曲的面具上调节该模型以保持一致的立体声生成。结果，我们的方法可以在相对较小的合成立体声视频数据集中进行微调，并应用于低质量的现实世界视频，同时执行立体声视频的生成和修复。实验表明，我们的方法的表现优于现有方法，从低分辨率输入中的立体视频生成中既定性和定量。

Title: Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

Authors: Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06027
Pdf URL: https://arxiv.org/pdf/2506.06027
Copy Paste: [[2506.06027]] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification(https://arxiv.org/abs/2506.06027)
Keywords: generative
Abstract: Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, and in existing methods it is controlled by a constant noise level $t^*$ for all samples. In this paper, we discover that the optimal $t^*$ for each sample can indeed be different. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: this https URL.
摘要：基于扩散的纯化（DBP）方法旨在通过向前扩散过程首先注入高斯噪声，然后通过反向生成过程恢复干净的示例，从而从输入样本中消除对抗噪声。在上面的过程中，将多少高斯噪声注入输入样本是DBP方法成功的关键，DBP方法的关键是由现有方法中所有样本的恒定噪声级别$ t^*$控制的。在本文中，我们发现每个样本的最佳$ t^*$确实可能是不同的。直观地，样品的清洁度越多，应注入的噪声就越少，反之亦然。在这一发现的激励下，我们提出了一个新框架，称为样本特定的分数噪声注入（SSNI）。具体而言，SSNI使用预先训练的分数网络来估计数据点偏离了清洁数据分布（即得分规范）的程度。然后，根据分数规范的幅度，SSNI应用了重新加权函数，以适应每个样品的$ T^*$，从而实现特定于样本的噪声注射。从经验上讲，将我们的框架与现有的DBP方法结合在一起，从而在CIFAR-10和Imagenet-1K上的准确性和鲁棒性都显着提高，这突出了DBP方法中不同样本的不同噪声水平的必要性。我们的代码可用：此HTTPS URL。
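
Sketch: the SSNI abstract outlines a two-step recipe: estimate a score norm for each sample, then pass it through a reweighting function that raises or lowers the noise level $t^*$. The Python below illustrates only that control flow. Everything in it is an assumption: `toy_score_norm` stands in for a pre-trained score network (it uses the fact that a standard Gaussian has score -x), and the sigmoid-shaped reweighting with its `center`, `spread`, and `t_max` parameters is a hypothetical choice, not the function used in the paper.

```python
import math


def toy_score_norm(x):
    # Stand-in for a pre-trained score network (assumption): for a standard
    # Gaussian the score is -x, so its norm equals the Euclidean norm of x.
    return math.sqrt(sum(v * v for v in x))


def sample_specific_t(x, t_star, center=1.0, spread=0.5, t_max=1.0):
    """Hypothetical reweighting: scale the base noise level t_star by a
    sigmoid-shaped weight in (0, 2) that grows with the score norm."""
    norm = toy_score_norm(x)
    weight = 2.0 / (1.0 + math.exp(-(norm - center) / spread))
    return min(t_max, t_star * weight)


t_clean = sample_specific_t([0.1, 0.0], t_star=0.1)   # small score norm
t_base = sample_specific_t([1.0, 0.0], t_star=0.1)    # norm equals center
t_noisy = sample_specific_t([3.0, 0.0], t_star=0.1)   # large score norm
```

With these toy settings a sample whose score norm sits at `center` keeps the base level, while samples with smaller or larger score norms receive less or more noise.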

Title: HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Authors: Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06035
Pdf URL: https://arxiv.org/pdf/2506.06035
Copy Paste: [[2506.06035]] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion(https://arxiv.org/abs/2506.06035)
Keywords: generative
Abstract: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models.
摘要：重建来自大脑活动的视觉信息弥合神经科学与计算机视觉之间的差距。即使使用生成模型从功能磁共振成像中解码图像中取得了进展，但仍在准确恢复高度复杂的视觉刺激方面仍然存在挑战。这种困难源于它们的元素密度和多样性，复杂的空间结构以及多方面的语义信息。为了应对这些挑战，我们提出了包含两个适配器的Havir：（1）自动转移器将fMRI的体素转化为潜在的扩散，从而捕获拓扑结构；（2）剪辑适配器将体素转换为包含语义信息的剪辑文本和图像嵌入。这些互补表示通过多功能扩散融合在一起，以生成最终的重建图像。为了从复杂场景中提取最重要的语义信息，剪辑适配器经过训练，并用文本字幕描述了视觉刺激及其从这些字幕中合成的相应语义图像。实验结果表明，即使在复杂的场景中，Havir有效地重建了视觉刺激的结构特征和语义信息，表现优于现有模型。

Title: Feedback Guidance of Diffusion Models

Authors: Koulischer Felix, Handke Florian, Deleu Johannes, Demeester Thomas, Ambrogioni Luca
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06085
Pdf URL: https://arxiv.org/pdf/2506.06085
Copy Paste: [[2506.06085]] Feedback Guidance of Diffusion Models(https://arxiv.org/abs/2506.06085)
Keywords: generation
Abstract: While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet 512x512, where it significantly outperforms Classifier-Free Guidance and is competitive with Limited Interval Guidance (LIG) while benefiting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
摘要：尽管无分类器指导（CFG）已成为改善条件扩散模型样本保真度的标准，但无论特定样品是否需要校正，它都会通过应用恒定的指导来损害多样性并诱导记忆。我们提出了反馈指导（FBG），该指南使用国家依赖的系数根据需求自我调节指导金额。我们的方法是通过假设学习的条件分布被无条件的分布线性损坏来得出的，与CFG的隐式乘法假设形成鲜明对比。我们的计划取决于其自身对有条件信号信息的预测，以在推理过程中动态适应指导，从而挑战了指导视为固定的超参数的看法。该方法在ImagEnet512x512上进行了基准测试，在该方法中，它的表现明显优于无分类器的指导，并且在有限的间隔指导（LIG）中具有竞争力，同时受益于强大的数学框架。在文本到图像的生成上，我们证明，如预期的那样，我们的方法自动为复杂提示应用了更高的指导量表，而不是更简单的提示，并且可以轻松地与现有的指导方案（例如CFG或LIG）结合使用。
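
Sketch: the FBG abstract contrasts a constant guidance scale with a state-dependent one. The Python below shows the standard classifier-free guidance combination, in which the guided prediction is the unconditional prediction plus a scaled difference between the conditional and unconditional predictions. The `scale_fn` hook is a hypothetical placeholder for any rule that picks the scale from the current state. It is not the FBG coefficient, whose form the abstract does not give.

```python
def guided_prediction(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond).
    scale = 0 gives the unconditional prediction, scale = 1 the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_prediction_adaptive(eps_uncond, eps_cond, state, scale_fn):
    """Same combination, but the scale comes from a caller-supplied function
    of the current state (hypothetical hook, not the FBG rule)."""
    return guided_prediction(eps_uncond, eps_cond, scale_fn(state))


eps_u = [0.0, 0.0]
eps_c = [1.0, 2.0]
constant = guided_prediction(eps_u, eps_c, 3.0)
# Toy rule (assumption): guide strongly early in sampling, not at all later.
adaptive = guided_prediction_adaptive(
    eps_u, eps_c, state={"t": 0.9}, scale_fn=lambda s: 3.0 if s["t"] > 0.5 else 1.0
)
```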

Title: Synthetic Tabular Data: Methods, Attacks and Defenses

Authors: Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06108
Pdf URL: https://arxiv.org/pdf/2506.06108
Copy Paste: [[2506.06108]] Synthetic Tabular Data: Methods, Attacks and Defenses(https://arxiv.org/abs/2506.06108)
Keywords: generation
Abstract: Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.
摘要：合成数据通常被定位为解决方案，以替换敏感的固定尺寸数据集用无限匹配的数据来源，而摆脱了隐私问题。在过去的十年中，合成数据生成取得了很大进展，利用机器学习和数据分析的相应进展。在这项调查中，我们涵盖了表格合成数据生成的关键发展和主要概念，包括基于概率图形模型和深度学习的范例。在对方法进行技术深入研究之前，我们提供背景和动力。我们还通过研究试图检索有关原始敏感数据的信息的攻击来解决合成数据的局限性。最后，我们提出该领域的扩展和开放问题。

Title: Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models

Authors: Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06137
Pdf URL: https://arxiv.org/pdf/2506.06137
Copy Paste: [[2506.06137]] Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models(https://arxiv.org/abs/2506.06137)
Keywords: generation
Abstract: Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.
摘要：表推理（TR）需要在半结构化表格数据上进行结构化推理，并且仍然具有挑战性，特别是对于小语言模型（例如SLMS，例如Llama-8b），由于其容量有限，与大型LMS（例如LLMS，例如GPT-4O）相比。为了缩小这一差距，我们探索基于程序的TR（P-TR），该差距通过生成可执行程序来规避基于文本的TR（T-TR）的关键局限，尤其是在数值推理中。但是，将p-TR应用于SLMS引入了两个挑战：（i）表面布局中异质性的脆弱性，以及（ii）由于代码生成功能有限而导致推理的不一致。我们提出了Table-R1，这是一种设计用于SLMS的两阶段P-TR方法。阶段1引入了创新的自我监督学习任务，布局转换推理，以从程序化视图中提高表格布局的概括。第2阶段采用组相对策略优化的混合范式变体，增强了P-TR的一致性，同时允许在需要时动态后备到T-TR。在四个TREM测试基准上的实验表明，Table-R1的表现优于所有基于SLM的方法，在所有数据集中，基本模型（Llama-8B）的精度至少提高了15％，并且与LLMS达到了性能竞争。

Title: ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Authors: Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06158
Pdf URL: https://arxiv.org/pdf/2506.06158
Copy Paste: [[2506.06158]] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators(https://arxiv.org/abs/2506.06158)
Keywords: generation, generative
Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete, as is often the case, a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.
摘要：解决时间依赖性参数偏微分方程（PDE）仍然是神经求解器的基本挑战，尤其是当在广泛的物理参数和动力学上概括时。当数据不确定或不完整时，通常是一种自然方法来转向生成模型。我们介绍了Enma，这是一种生成神经操作员，旨在模拟由物理现象引起的时空动力学。 ENMA使用经过流动匹配损耗训练的生成性掩盖自回归变压器，预测压缩潜在空间中的未来动态，从而实现了令牌生成。不规则采样的空间观察通过注意机制编码为统一的潜在表示，并通过时空卷积编码器进一步压缩。这使ENMA可以通过在目标轨迹的过去状态或具有相似动力学的辅助上下文轨迹的状态进行调节，在推理时间内进行中文学习。结果是一个坚固且适应性的框架，该框架概括为新的PDE制度，并支持时间依赖性参数PDE的单发替代建模。

Title: Model-Driven Graph Contrastive Learning

Authors: Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06212
Pdf URL: https://arxiv.org/pdf/2506.06212
Copy Paste: [[2506.06212]] Model-Driven Graph Contrastive Learning(https://arxiv.org/abs/2506.06212)
Keywords: generative
Abstract: We propose $\textbf{MGCL}$, a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data's underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has demonstrated strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and operate at the individual graph level, ignoring similarities among graphs generated from the same model. Conversely, in our proposed approach, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. Additionally, for graph-level tasks, MGCL clusters the dataset and estimates a graphon per group, enabling contrastive pairs to reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.
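Summary: We propose $\textbf{MGCL}$, a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data's underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has shown strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and that operate at the level of individual graphs, ignoring similarities among graphs generated from the same model. In contrast, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. In addition, for graph-level tasks, MGCL clusters the dataset and estimates one graphon per group, so that contrastive pairs reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.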
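
To make the graphon idea concrete, here is a minimal sketch of how a graph is sampled from a graphon W: each node gets a latent position u_i drawn uniformly from [0, 1], and each edge (i, j) is drawn as Bernoulli(W(u_i, u_j)). The two-block graphon and the `graphon_augment` routine are hypothetical choices for illustration; the abstract does not specify MGCL's estimator or its augmentation rule.

```python
import random

def block_graphon(u, v):
    """Hypothetical two-community step-function graphon W(u, v) with values in [0, 1]."""
    same_block = (u < 0.5) == (v < 0.5)
    return 0.8 if same_block else 0.1

def sample_graph(graphon, n, rng):
    """Draw latent positions u_i ~ U[0, 1), then each edge (i, j) ~ Bernoulli(W(u_i, u_j))."""
    u = [rng.random() for _ in range(n)]
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < graphon(u[i], u[j])}
    return u, edges

def graphon_augment(edges, u, graphon, resample_prob, rng):
    """Assumed augmentation: redraw each node pair from the graphon with
    probability resample_prob, and keep its original edge status otherwise."""
    n = len(u)
    new_edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < resample_prob:
                if rng.random() < graphon(u[i], u[j]):
                    new_edges.add((i, j))
            elif (i, j) in edges:
                new_edges.add((i, j))
    return new_edges
```

With `resample_prob=0.0` the augmented view equals the input graph; larger values pull the view toward a fresh sample from the graphon, which is one simple way an augmentation could be tied to the estimated generative model.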

Title: Corrector Sampling in Language Models

Authors: Itai Gat, Neta Shaul, Uriel Singer, Yaron Lipman
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06215
Pdf URL: https://arxiv.org/pdf/2506.06215
Copy Paste: [[2506.06215]] Corrector Sampling in Language Models(https://arxiv.org/abs/2506.06215)
Keywords: generation
Abstract: Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. This method can be integrated into existing autoregressive models, preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to the standard sampling.
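Summary: Autoregressive language models accumulate errors because of their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. The method can be integrated into existing autoregressive models while preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B-parameter model with RPT for only 100B yielded ~10% relative improvements on reasoning and coding benchmarks compared with standard sampling.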
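
The abstract does not state the sampling rule, so the following is only a toy illustration of the general idea of revisiting a window of previously generated tokens; it is not the RPT procedure itself. It assumes a hypothetical bigram table in place of a neural language model and redraws each revisited token conditioned on its left and right neighbours.

```python
import random

VOCAB = ["a", "b", "c"]
# Hypothetical bigram table: BIGRAM[prev][nxt] = p(nxt | prev).
BIGRAM = {
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.1, "b": 0.1, "c": 0.8},
    "c": {"a": 0.8, "b": 0.1, "c": 0.1},
}

def sample_from(weights, rng):
    """Draw one token with probability proportional to its (unnormalized) weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]

def resample_previous(tokens, i, rng):
    """Redraw tokens[i] given both neighbours: p(x | left, right) is proportional
    to p(x | left) * p(right | x) under the bigram model."""
    left, right = tokens[i - 1], tokens[i + 1]
    weights = {tok: BIGRAM[left][tok] * BIGRAM[tok][right] for tok in VOCAB}
    return sample_from(weights, rng)

def generate(length, window, rng, start="a"):
    """Left-to-right sampling that, after each new token, revisits up to
    `window` earlier tokens (never the start token or the newest token)."""
    tokens = [start]
    while len(tokens) < length:
        tokens.append(sample_from(BIGRAM[tokens[-1]], rng))
        lo = max(1, len(tokens) - 1 - window)
        for i in range(lo, len(tokens) - 1):
            tokens[i] = resample_previous(tokens, i, rng)
    return tokens
```

Setting `window=0` recovers ordinary left-to-right sampling, which makes it easy to compare the two modes on the same toy model.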

Title: GenIR: Generative Visual Feedback for Mental Image Retrieval

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06220
Pdf URL: https://arxiv.org/pdf/2506.06220
Copy Paste: [[2506.06220]] GenIR: Generative Visual Feedback for Mental Image Retrieval(https://arxiv.org/abs/2506.06220)
Keywords: generation, generative
Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
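Summary: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, carrying this success over to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues held in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting in which users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the ability of machines to give users clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for helping users refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm that leverages diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.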

Title: Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

Authors: Leon Mayer, Tim Rädsch, Dominik Michael, Lucas Luttner, Amine Yamlahi, Evangelia Christodoulou, Patrick Godau, Marcel Knopp, Annika Reinke, Fiona Kolbinger, Lena Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06232
Pdf URL: https://arxiv.org/pdf/2506.06232
Copy Paste: [[2506.06232]] Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study(https://arxiv.org/abs/2506.06232)
Keywords: generation
Abstract: While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.
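Summary: While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) on endoscopic tasks, with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? (3) How do specialized medical VLMs compare with generalist models in this context? Our results show that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, at performance levels comparable to general-domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform generalist models on both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advances to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.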

Title: STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06276
Pdf URL: https://arxiv.org/pdf/2506.06276
Copy Paste: [[2506.06276]] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis(https://arxiv.org/abs/2506.06276)
Keywords: generation, generative
Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
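Summary: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations that significantly enhance scalability: (1) a deep-shallow design, in which a deep Transformer block captures most of the model's representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.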
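
The claim that a normalizing flow permits exact maximum likelihood training follows from the change-of-variables formula, log p(x) = log p_z(f(x)) + log |det df/dx|. The sketch below illustrates this with a tiny affine autoregressive flow in which each dimension is shifted and scaled by a function of the earlier dimensions. The `conditioner` here is a hypothetical hand-written stand-in; in STARFlow that role is played by Transformer blocks, whose details the abstract does not give.

```python
import math

def conditioner(prefix):
    """Hypothetical conditioner: (shift, log-scale) for the next dimension,
    computed only from the earlier dimensions in `prefix`."""
    s = sum(prefix)
    return 0.5 * s, 0.1 * math.tanh(s)

def forward(x):
    """Map data x to latent z; return z and log|det dz/dx|.

    The Jacobian is triangular, so its log-determinant is the sum of the
    per-dimension log-scales (here, minus log_sigma for each dimension).
    """
    z, log_det = [], 0.0
    for i, x_i in enumerate(x):
        mu, log_sigma = conditioner(x[:i])
        z.append((x_i - mu) * math.exp(-log_sigma))
        log_det -= log_sigma
    return z, log_det

def inverse(z):
    """Invert the flow one dimension at a time, reusing already recovered values."""
    x = []
    for z_i in z:
        mu, log_sigma = conditioner(x)
        x.append(z_i * math.exp(log_sigma) + mu)
    return x

def log_likelihood(x):
    """Exact log p(x) under a standard normal base distribution."""
    z, log_det = forward(x)
    log_base = sum(-0.5 * z_i * z_i - 0.5 * math.log(2.0 * math.pi) for z_i in z)
    return log_base + log_det
```

Because the forward map depends on each dimension only through earlier ones, `inverse(forward(x)[0])` recovers `x` up to floating-point error, and the likelihood needs no approximation or discretization.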