2025-05-08

Title: Hierarchical Multi-Label Generation with Probabilistic Level-Constraint

Authors: Linqing Chen, Weilei Wang, Wentao Wu, Hanmeng Zhong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.03775
Pdf URL: https://arxiv.org/pdf/2505.03775
Copy Paste: [[2505.03775]] Hierarchical Multi-Label Generation with Probabilistic Level-Constraint(https://arxiv.org/abs/2505.03775)
Keywords: generation, generative
Abstract: Hierarchical Extreme Multi-Label Classification poses greater difficulties than traditional multi-label classification because of the intricate hierarchical connections among labels within a domain-specific taxonomy and the substantial number of labels. Some prior research classified text through several ancillary stages such as clustering algorithms and multiphase classification. Others attempted to leverage generative methods yet were unable to properly control the output of the generative model. We redefine the task from hierarchical multi-label classification to Hierarchical Multi-Label Generation (HMG) and employ a generative framework with Probabilistic Level Constraints (PLC) to generate hierarchical labels within a specific taxonomy that have complex hierarchical relationships. The proposed approach enables the framework to generate all relevant labels across levels for each document without relying on preliminary operations like clustering. It can also control the model output precisely in terms of count, length, and level. Experiments demonstrate that our approach not only achieves new SOTA performance on the HMG task, but also constrains the model output much better than previous work.

Title: ALFRED: Ask a Large-language model For Reliable ECG Diagnosis

Authors: Jin Yu, JaeHo Park, TaeJun Park, Gyurin Kim, JiHyun Lee, Min Sung Lee, Joon-myoung Kwon, Jeong Min Son, Yong-Yeon Jo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.03781
Pdf URL: https://arxiv.org/pdf/2505.03781
Copy Paste: [[2505.03781]] ALFRED: Ask a Large-language model For Reliable ECG Diagnosis(https://arxiv.org/abs/2505.03781)
Keywords: generation
Abstract: Leveraging Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) for analyzing medical data, particularly Electrocardiogram (ECG), offers high accuracy and convenience. However, generating reliable, evidence-based results in specialized fields like healthcare remains a challenge, as RAG alone may not suffice. We propose a Zero-shot ECG diagnosis framework based on RAG for ECG analysis that incorporates expert-curated knowledge to enhance diagnostic accuracy and explainability. Evaluation on the PTB-XL dataset demonstrates the framework's effectiveness, highlighting the value of structured domain expertise in automated ECG interpretation. Our framework is designed to support comprehensive ECG analysis, addressing diverse diagnostic needs with potential applications beyond the tested dataset.

Title: When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator

Authors: Md Fahim Anjum
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.03786
Pdf URL: https://arxiv.org/pdf/2505.03786
Copy Paste: [[2505.03786]] When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator(https://arxiv.org/abs/2505.03786)
Keywords: generation
Abstract: Large Language Models (LLM) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs from reasoning that enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to $87\%$ higher F1 and $3.7\%$ better discrimination accuracy than CodeLlama-7B, as well as $3.7\%$ higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.

Title: Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning

Authors: Lang Feng, Weihao Tan, Zhiyi Lyu, Longtao Zheng, Haiyang Xu, Ming Yan, Fei Huang, Bo An
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.03792
Pdf URL: https://arxiv.org/pdf/2505.03792
Copy Paste: [[2505.03792]] Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning(https://arxiv.org/abs/2505.03792)
Keywords: generation
Abstract: Online fine-tuning vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at this https URL.

Title: Information Filtering Networks: Theoretical Foundations, Generative Methodologies, and Real-World Applications

Authors: Tomaso Aste
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.03812
Pdf URL: https://arxiv.org/pdf/2505.03812
Copy Paste: [[2505.03812]] Information Filtering Networks: Theoretical Foundations, Generative Methodologies, and Real-World Applications(https://arxiv.org/abs/2505.03812)
Keywords: generative
Abstract: Information Filtering Networks (IFNs) provide a powerful framework for modeling complex systems through globally sparse yet locally dense and interpretable structures that capture multivariate dependencies. This review offers a comprehensive account of IFNs, covering their theoretical foundations, construction methodologies, and diverse applications. Tracing their origins from early network-based models to advanced formulations such as the Triangulated Maximally Filtered Graph (TMFG) and the Maximally Filtered Clique Forest (MFCF), the paper highlights how IFNs address key challenges in high-dimensional data-driven modeling. IFNs and their construction methodologies are intrinsically higher-order networks that generate simplicial complexes, structures that are only now becoming popular in the broader literature. Applications span fields including finance, biology, psychology, and artificial intelligence, where IFNs improve interpretability, computational efficiency, and predictive performance. Special attention is given to their role in graphical modeling, where IFNs enable the estimation of sparse inverse covariance matrices with greater accuracy and scalability than traditional approaches like Graphical LASSO. Finally, the review discusses recent developments that integrate IFNs with machine learning and deep learning, underscoring their potential not only to bridge classical network theory with contemporary data-driven paradigms, but also to shape the architectures of deep learning models themselves.

Title: Program Semantic Inequivalence Game with Large Language Models

Authors: Antonio Valerio Miceli-Barone, Vaishak Belle, Ali Payani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.03818
Pdf URL: https://arxiv.org/pdf/2505.03818
Copy Paste: [[2505.03818]] Program Semantic Inequivalence Game with Large Language Models(https://arxiv.org/abs/2505.03818)
Keywords: generation
Abstract: Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. Finding training examples to teach LLMs to solve these tasks can be challenging. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ: a generator agent creates program variants that are semantically distinct, derived from a dataset of real-world programming tasks, while an evaluator agent has to identify input examples that cause the original programs and the generated variants to diverge in their behaviour, with the agents training each other semi-adversarially. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources. We evaluated our approach on multiple code generation and understanding benchmarks, including cross-language vulnerability detection (Lu et al., 2021), where our method improves vulnerability detection in C/C++ code despite being trained exclusively on Python code, and the challenging Python builtin identifier swap benchmark (Miceli-Barone et al., 2023), showing that whereas modern LLMs still struggle with this benchmark, our approach yields substantial improvements. We release the code needed to replicate the experiments, as well as the generated synthetic data, which can be used to fine-tune LLMs.

Title: Machine Learning: a Lecture Note

Authors: Kyunghyun Cho
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.03861
Pdf URL: https://arxiv.org/pdf/2505.03861
Copy Paste: [[2505.03861]] Machine Learning: a Lecture Note(https://arxiv.org/abs/2505.03861)
Keywords: generative
Abstract: This lecture note is intended to prepare early-year master's and PhD students in data science or a related discipline with foundational ideas in machine learning. It starts with basic ideas in modern machine learning with classification as a main target task. These basic ideas include loss formulation, backpropagation, stochastic gradient descent, generalization, model selection as well as fundamental blocks of artificial neural networks. Based on these basic ideas, the lecture note explores in depth the probabilistic approach to unsupervised learning, covering directed latent variable models, product of experts, generative adversarial networks and autoregressive models. Finally, the note ends by covering a diverse set of further topics, such as reinforcement learning, ensemble methods and meta-learning. After reading this lecture note, a student should be ready to embark on studying and researching more advanced topics in machine learning and more broadly artificial intelligence.

Title: Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces

Authors: Nikhil M. Pawar, Jorge A. Prozzi, Feng Hong, Surya Sarat Chandra Congress
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.03974
Pdf URL: https://arxiv.org/pdf/2505.03974
Copy Paste: [[2505.03974]] Deep Learning Framework for Infrastructure Maintenance: Crack Detection and High-Resolution Imaging of Infrastructure Surfaces(https://arxiv.org/abs/2505.03974)
Keywords: super-resolution
Abstract: Recently, there has been an impetus for the application of cutting-edge data collection platforms, such as drones mounted with camera sensors, for infrastructure asset management. However, the sensor characteristics, proximity to the structure, hard-to-reach access, and environmental conditions often limit the resolution of the datasets. A few studies used super-resolution techniques to address the problem of low-resolution images. Nevertheless, these techniques were observed to increase computational cost and false alarms in distress detection because they consider all the infrastructure images, i.e., both positive and negative distress classes. In order to address false alarms at the pre-processing stage and achieve efficient super-resolution, this study developed a framework consisting of a convolutional neural network (CNN) and an efficient sub-pixel convolutional neural network (ESPCNN). The CNN accurately classified both classes. ESPCNN, a lightweight super-resolution technique, generated high-resolution infrastructure images of the positive distress class obtained from the CNN. The ESPCNN outperformed bicubic interpolation in all the evaluation metrics for super-resolution. Based on the performance metrics, the combination of CNN and ESPCNN was observed to be effective in preprocessing the infrastructure images with negative distress, reducing the computational cost and false alarms in the next step of super-resolution. Visual inspection showed that ESPCNN is able to capture crack propagation and the complex geometry of even minor cracks. The proposed framework is expected to help highway agencies accurately perform distress detection and to assist in efficient asset management practices.
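
The abstract above names an efficient sub-pixel convolutional network as the super-resolution stage. As a rough illustration of the sub-pixel idea only (not the authors' implementation), the sketch below shows the rearrangement step that such networks typically end with: C*r*r low-resolution feature maps are interleaved into C maps that are r times larger in each direction. The function name `pixel_shuffle`, the nested-list representation, and the channel ordering are assumptions chosen for this example.

```python
def pixel_shuffle(channels, r):
    """Rearrange C*r*r feature maps of size HxW into C maps of size (H*r)x(W*r).

    `channels` is a list of 2D lists; channel c*r*r + i*r + j supplies the
    pixel at sub-position (i, j) inside each r x r output cell (assumed order).
    """
    n = len(channels)
    if n % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(channels[0]), len(channels[0][0])
    out_c = n // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for i in range(r):
            for j in range(r):
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1x1 feature maps become a single 2x2 map when r = 2.
low_res = [[[k]] for k in range(4)]
print(pixel_shuffle(low_res, 2))  # [[[0, 1], [2, 3]]]
```

Because the convolutions run at low resolution and only this final rearrangement produces the large image, the approach is usually cheaper than upsampling first and convolving afterwards.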

Title: Call for Action: towards the next generation of symbolic regression benchmark

Authors: Guilherme S. Imai Aldeia, Hengzhe Zhang, Geoffrey Bomarito, Miles Cranmer, Alcides Fonseca, Bogdan Burlacu, William G. La Cava, Fabrício Olivetti de França
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2505.03977
Pdf URL: https://arxiv.org/pdf/2505.03977
Copy Paste: [[2505.03977]] Call for Action: towards the next generation of symbolic regression benchmark(https://arxiv.org/abs/2505.03977)
Keywords: generation
Abstract: Symbolic Regression (SR) is a powerful technique for discovering interpretable mathematical expressions. However, benchmarking SR methods remains challenging due to the diversity of algorithms, datasets, and evaluation criteria. In this work, we present an updated version of SRBench. Our benchmark expands the previous one by nearly doubling the number of evaluated methods, refining evaluation metrics, and using improved visualizations of the results to understand the performances. Additionally, we analyze trade-offs between model complexity, accuracy, and energy consumption. Our results show that no single algorithm dominates across all datasets. We issue a call for action to the SR community to maintain and evolve SRBench as a living benchmark that reflects the state of the art in symbolic regression, by standardizing hyperparameter tuning, execution constraints, and computational resource allocation. We also propose deprecation criteria to maintain the benchmark's relevance and discuss best practices for improving SR algorithms, such as adaptive hyperparameter tuning and energy-efficient implementations.

Title: Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation

Authors: Hengyuan Hu, Aniket Das, Dorsa Sadigh, Nima Anari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.03983
Pdf URL: https://arxiv.org/pdf/2505.03983
Copy Paste: [[2505.03983]] Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Autospeculation(https://arxiv.org/abs/2505.03983)
Keywords: generative
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce Autospeculative Decoding (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O}(K^{1/3})$ parallel runtime speedup over the $K$-step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
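
To get a feel for the speedup rate quoted in the abstract above, the sketch below evaluates K^(1/3) for a few step counts. It is only a scale illustration: the tilde in the bound hides logarithmic factors and constants that the abstract does not give, so these numbers are not predicted speedups. The function name `speedup_scale` and the chosen values of K are assumptions for this example.

```python
def speedup_scale(num_steps: int) -> float:
    """Return K**(1/3), the leading term of the quoted speedup bound.

    Logarithmic factors and constants hidden by the soft-O notation are ignored.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return num_steps ** (1.0 / 3.0)


for k in (100, 1000, 8000):
    # Illustrative step counts, not taken from the paper.
    print(f"K = {k:5d}: K^(1/3) is about {speedup_scale(k):.1f}")
```

The cube-root growth means that multiplying the number of denoising steps by eight only doubles the leading term of the bound.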

Title: MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction

Authors: Andrew Zhang, Hao Wang, Shuchang Ye, Michael Fulham, Jinman Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04105
Pdf URL: https://arxiv.org/pdf/2505.04105
Copy Paste: [[2505.04105]] MAISY: Motion-Aware Image SYnthesis for Medical Image Motion Correction(https://arxiv.org/abs/2505.04105)
Keywords: generative
Abstract: Patient motion during medical image acquisition causes blurring, ghosting, and organ distortion, which makes image interpretation challenging. Current state-of-the-art algorithms using Generative Adversarial Network (GAN)-based methods, with their ability to learn the mappings between corrupted images and their ground truth via Structural Similarity Index Measure (SSIM) loss, effectively generate motion-free images. However, we identified the following limitations: (i) they mainly focus on global structural characteristics and therefore overlook localized features that often carry critical pathological information, and (ii) the SSIM loss function struggles to handle images with varying pixel intensities, luminance factors, and variance. In this study, we propose Motion-Aware Image SYnthesis (MAISY), which initially characterizes motion and then uses it for correction by: (a) leveraging the foundation model Segment Anything Model (SAM) to dynamically learn spatial patterns along anatomical boundaries where motion artifacts are most pronounced, and (b) introducing the Variance-Selective SSIM (VS-SSIM) loss, which adaptively emphasizes spatial regions with high pixel variance to preserve essential anatomical details during artifact correction. Experiments on chest and head CT datasets demonstrate that our model outperformed the state-of-the-art counterparts, with Peak Signal-to-Noise Ratio (PSNR) increasing by 40%, SSIM by 10%, and Dice by 16%.

Title: Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages

Authors: Yu Yamaoka or Weng Ian Chan, Shigeto Seno, Soichiro Fukada, Hideo Matsuda
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04150
Pdf URL: https://arxiv.org/pdf/2505.04150
Copy Paste: [[2505.04150]] Learning from Similarity Proportion Loss for Classifying Skeletal Muscle Recovery Stages(https://arxiv.org/abs/2505.04150)
Keywords: generation
Abstract: Evaluating the regeneration process of damaged muscle tissue is a fundamental analysis in muscle research to measure experimental effect sizes and uncover mechanisms behind muscle weakness due to aging and disease. The conventional approach to assessing muscle tissue regeneration involves whole-slide imaging and expert visual inspection of the recovery stages based on the morphological information of cells and fibers. There is a need to replace these tasks with automated methods incorporating machine learning techniques to ensure a quantitative and objective analysis. Given the limited availability of fully labeled data, a possible approach is Learning from Label Proportions (LLP), a weakly supervised learning method using class label proportions. However, current LLP methods have two limitations: (1) they cannot adapt the feature extractor for muscle tissues, and (2) they treat the classes representing recovery stages and cell morphological changes as nominal, resulting in the loss of ordinal information. To address these issues, we propose Ordinal Scale Learning from Similarity Proportion (OSLSP), which uses a similarity proportion loss derived from two bag combinations. OSLSP can update the feature extractor by using class proportion attention to the ordinal scale of the class. Our model with OSLSP outperforms large-scale pre-trained and fine-tuning models in classification tasks of skeletal muscle recovery stages.

Title: DiffPattern-Flex: Efficient Layout Pattern Generation via Discrete Diffusion

Authors: Zixiao Wang, Wenqian Zhao, Yunheng Shen, Yang Bai, Guojin Chen, Farzan Farnia, Bei Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04173
Pdf URL: https://arxiv.org/pdf/2505.04173
Copy Paste: [[2505.04173]] DiffPattern-Flex: Efficient Layout Pattern Generation via Discrete Diffusion(https://arxiv.org/abs/2505.04173)
Keywords: generation, generative
Abstract: Recent advancements in layout pattern generation have been dominated by deep generative models. However, relying solely on neural networks for legality guarantees raises concerns in many practical applications. In this paper, we present DiffPattern-Flex, a novel approach designed to generate reliable layout patterns efficiently. DiffPattern-Flex incorporates a new method for generating diverse topologies using a discrete diffusion model while maintaining a lossless and compute-efficient layout representation. To ensure legal pattern generation, we employ an optimization-based, white-box pattern assessment process based on specific design rules. Furthermore, fast sampling and efficient legalization technologies are employed to accelerate the generation process. Experimental results across various benchmarks demonstrate that DiffPattern-Flex significantly outperforms existing methods and excels at producing reliable layout patterns.

Title: DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation

Authors: Naphat Nithisopa, Teerapong Panboonyuen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04175
Pdf URL: https://arxiv.org/pdf/2505.04175
Copy Paste: [[2505.04175]] DOTA: Deformable Optimized Transformer Architecture for End-to-End Text Recognition with Retrieval-Augmented Generation(https://arxiv.org/abs/2505.04175)
Keywords: generation
Abstract: Text recognition in natural images remains a challenging yet essential task, with broad applications spanning computer vision and natural language processing. This paper introduces a novel end-to-end framework that combines ResNet and Vision Transformer backbones with advanced methodologies, including Deformable Convolutions, Retrieval-Augmented Generation, and Conditional Random Fields (CRF). These innovations collectively enhance feature representation and improve Optical Character Recognition (OCR) performance. Specifically, the framework substitutes standard convolution layers in the third and fourth blocks with Deformable Convolutions, leverages adaptive dropout for regularization, and incorporates CRF for more refined sequence modeling. Extensive experiments conducted on six benchmark datasets IC13, IC15, SVT, IIIT5K, SVTP, and CUTE80 validate the proposed method's efficacy, achieving notable accuracies: 97.32% on IC13, 58.26% on IC15, 88.10% on SVT, 74.13% on IIIT5K, 82.17% on SVTP, and 66.67% on CUTE80, resulting in an average accuracy of 77.77%. These results establish a new state-of-the-art for text recognition, demonstrating the robustness of the approach across diverse and challenging datasets.
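
As a quick arithmetic check of the average quoted in the abstract above, the sketch below recomputes the mean of the six per-dataset accuracies. It assumes the reported average is a plain unweighted mean over datasets (the abstract does not say whether it is weighted by dataset size); the names `accuracy` and `macro_average` are chosen for this example.

```python
# Per-dataset accuracies (%) exactly as quoted in the abstract.
accuracy = {
    "IC13": 97.32,
    "IC15": 58.26,
    "SVT": 88.10,
    "IIIT5K": 74.13,
    "SVTP": 82.17,
    "CUTE80": 66.67,
}


def macro_average(scores):
    """Unweighted mean over datasets (an assumption about how the average was formed)."""
    return sum(scores.values()) / len(scores)


mean_acc = macro_average(accuracy)
print(f"unweighted mean accuracy: {mean_acc:.3f}%")
```

Under this assumption the mean comes out at about 77.775%, which agrees with the quoted 77.77% when truncated to two decimals.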

Title: S3D: Sketch-Driven 3D Model Generation

Authors: Hail Song, Wonsik Shin, Naeun Lee, Soomin Chung, Nojun Kwak, Woontack Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04185
Pdf URL: https://arxiv.org/pdf/2505.04185
Copy Paste: [[2505.04185]] S3D: Sketch-Driven 3D Model Generation(https://arxiv.org/abs/2505.04185)
Keywords: generation
Abstract: Generating high-quality 3D models from 2D sketches is a challenging task due to the inherent ambiguity and sparsity of sketch data. In this paper, we present S3D, a novel framework that converts simple hand-drawn sketches into detailed 3D models. Our method utilizes a U-Net-based encoder-decoder architecture to convert sketches into face segmentation masks, which are then used to generate a 3D representation that can be rendered from novel views. To ensure robust consistency between the sketch domain and the 3D output, we introduce a novel style-alignment loss that aligns the U-Net bottleneck features with the initial encoder outputs of the 3D generation module, significantly enhancing reconstruction fidelity. To further enhance the network's robustness, we apply augmentation techniques to the sketch dataset. This streamlined framework demonstrates the effectiveness of S3D in generating high-quality 3D models from sketch inputs. The source code for this project is publicly available at this https URL.

Title: A Large Language Model for Feasible and Diverse Population Synthesis

Authors: Sung Yoo Lim, Hyunsoo Yun, Prateek Bansal, Dong-Kyu Kim, Eui-Jin Kim
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.04196
Pdf URL: https://arxiv.org/pdf/2505.04196
Copy Paste: [[2505.04196]] A Large Language Model for Feasible and Diverse Population Synthesis(https://arxiv.org/abs/2505.04196)
Keywords: generation, generative
Abstract: Generating a synthetic population that is both feasible and diverse is crucial for ensuring the validity of downstream activity schedule simulation in activity-based models (ABMs). While deep generative models (DGMs), such as variational autoencoders and generative adversarial networks, have been applied to this task, they often struggle to balance the inclusion of rare but plausible combinations (i.e., sampling zeros) with the exclusion of implausible ones (i.e., structural zeros). To improve feasibility while maintaining diversity, we propose a fine-tuning method for large language models (LLMs) that explicitly controls the autoregressive generation process through topological orderings derived from a Bayesian Network (BN). Experimental results show that our hybrid LLM-BN approach outperforms both traditional DGMs and proprietary LLMs (e.g., ChatGPT-4o) with few-shot learning. Specifically, our approach achieves approximately 95% feasibility, significantly higher than the ~80% observed in DGMs, while maintaining comparable diversity, making it well-suited for practical applications. Importantly, the method is based on a lightweight open-source LLM, enabling fine-tuning and inference on standard personal computing environments. This makes the approach cost-effective and scalable for large-scale applications, such as synthesizing populations in megacities, without relying on expensive infrastructure. By initiating the ABM pipeline with high-quality synthetic populations, our method improves overall simulation reliability and reduces downstream error propagation. The source code for these methods is available for research and practical application.
摘要：产生既可行又多样的合成人群对于确保基于活动的模型（ABM）中下游活动时间表模拟的有效性至关重要。尽管已将深层生成模型（DGM）（例如变分的自动编码器和生成的对抗网络）应用于此任务，但它们通常很难平衡包含稀有但可见的组合（即采样零）与不可用的组合（即采样零）（即结构Zeros）。为了提高可行性的同时保持多样性，我们为大型语言模型（LLMS）提出了一种微调方法，该方法通过贝叶斯网络（BN）得出的拓扑排序明确控制自回归的生成过程。实验结果表明，我们的混合LLM-BN方法的表现优于传统的DGM和专有LLM（例如ChatGpt-4O），而少数学习。具体而言，我们的方法达到了约95％的可行性，显着高于DGM中观察到的〜80％，同时保持可比的多样性，使其非常适合实际应用。重要的是，该方法基于轻巧的开源LLM，可对标准个人计算环境进行微调和推断。这使该方法具有成本效益且可扩展的大规模应用程序，例如大型综合种群，而无需依赖昂贵的基础设施。通过使用高质量合成种群启动ABM管道，我们的方法可提高整体模拟可靠性并减少下游误差传播。这些方法的源代码可用于研究和实际应用。
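
The abstract above says the autoregressive generation process is controlled through topological orderings derived from a Bayesian Network. As a minimal sketch of that one idea, the snippet below orders a handful of attributes so that every parent is emitted before its children, then serializes one individual in that order. The attribute names, the edges, and the `serialize` text format are hypothetical illustrations, not taken from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical Bayesian-network structure: each attribute maps to its parents.
# These attributes and edges are illustrative only, not taken from the paper.
BN_PARENTS = {
    "age": set(),
    "gender": set(),
    "education": {"age"},
    "employment": {"age", "education"},
    "income": {"employment", "education"},
}


def generation_order(parents):
    """Return an attribute order in which every parent precedes its children."""
    return list(TopologicalSorter(parents).static_order())


def serialize(person, order):
    """Write one synthetic individual as text, attributes in topological order."""
    return ", ".join(f"{name}={person[name]}" for name in order)


order = generation_order(BN_PARENTS)
person = {"age": "30-39", "gender": "F", "education": "college",
          "employment": "full-time", "income": "mid"}
print(serialize(person, order))
```

With such an ordering, a left-to-right language model always sees the conditioning attributes before it has to produce a dependent one, which is one plausible reading of how the ordering constrains generation.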

Title: Technology prediction of a 3D model using Neural Network

Authors: Grzegorz Miebs, Rafał A. Bachorz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04241
Pdf URL: https://arxiv.org/pdf/2505.04241
Copy Paste: [[2505.04241]] Technology prediction of a 3D model using Neural Network(https://arxiv.org/abs/2505.04241)
Keywords: generative
Abstract: Accurate estimation of production times is critical for effective manufacturing scheduling, yet traditional methods relying on expert analysis or historical data often fall short in dynamic or customized production environments. This paper introduces a data-driven approach that predicts manufacturing steps and their durations directly from a product's 3D model. By rendering the model into multiple 2D images and leveraging a neural network inspired by the Generative Query Network, the method learns to map geometric features into time estimates for predefined production steps, enabling scalable, adaptive, and precise process planning across varied product types.
摘要：对生产时间的准确估算对于有效的制造计划至关重要，但是依靠专家分析或历史数据的传统方法通常在动态或定制的生产环境中缺乏。本文介绍了一种数据驱动的方法，该方法可以直接从产品的3D模型中预测制造步骤及其持续时间。通过将模型渲染到多个2D图像中，并利用受生成查询网络启发的神经网络，该方法学会了将几何特征映射到预定义生产步骤的时间估计中，从而跨不同产品类型，可扩展，自适应和精确的过程计划。

Title: Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting

Authors: Feng Yang, Wenliang Qian, Wangmeng Zuo, Hui Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04262
Pdf URL: https://arxiv.org/pdf/2505.04262
Copy Paste: [[2505.04262]] Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting(https://arxiv.org/abs/2505.04262)
Keywords: generation
Abstract: Score Distillation Sampling (SDS) leverages pretrained 2D diffusion models to advance text-to-3D generation but neglects multi-view correlations, making it prone to geometric inconsistencies and multi-face artifacts in the generated 3D content. In this work, we propose Coupled Score Distillation (CSD), a framework that couples multi-view joint distribution priors to ensure geometrically consistent 3D generation while enabling the stable and direct optimization of 3D Gaussian Splatting. Specifically, by reformulating the optimization as a multi-view joint optimization problem, we derive an optimization rule that effectively couples multi-view priors to guide optimization across different viewpoints while preserving the diversity of generated 3D assets. Additionally, we propose a framework that directly optimizes 3D Gaussian Splatting (3D-GS) with random initialization to generate geometrically consistent 3D content. We further employ a deformable tetrahedral grid, initialized from 3D-GS and refined through CSD, to produce high-quality, refined meshes. Quantitative and qualitative experimental results demonstrate the efficiency and competitive quality of our approach.
摘要：得分蒸馏采样（SDS）利用了预估计的2D扩散模型来推进文本到3D的生成，但忽略了多视图相关性，容易在生成的3D内容中几何不一致和多面文物。在这项工作中，我们提出了耦合的评分蒸馏（CSD），该框架是多视图联合分布先验的框架，以确保几何一致的3D代，同时启用3D高斯分裂的稳定和直接优化。具体而言，通过将优化重新定义为多视图关节优化问题，我们得出了一个有效的优化规则，该规则有效地融合了多视图先验，以指导跨不同观点的优化，同时保留生成的3D资产的多样性。此外，我们提出了一个直接优化3D高斯脱落（3D-GS）的框架，并随机初始化以生成几何一致的3D含量。我们进一步采用了可变形的四面体网格，从3D-GS初始化并通过CSD进行了精制，以产生高质量的精制网格。定量和定性实验结果证明了我们方法的效率和竞争质量。

Title: Non-stationary Diffusion For Probabilistic Time Series Forecasting

Authors: Weiwei Ye, Zhuopeng Xu, Ning Gui
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04278
Pdf URL: https://arxiv.org/pdf/2505.04278
Copy Paste: [[2505.04278]] Non-stationary Diffusion For Probabilistic Time Series Forecasting(https://arxiv.org/abs/2505.04278)
Keywords: generative
Abstract: Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at this https URL.
摘要：由于潜在的物理和外部影响的动态，时间序列的不确定性通常会随着时间而变化。但是，现有的denoising扩散概率模型（DDPM）通常无法捕获这种非平稳性，受到其从添加噪声模型（ANM）的恒定方差假设的约束。在本文中，我们创新利用位置尺度噪声模型（LSNM）放松ANM的固定不确定性假设。基于非平稳扩散（NSDIFF）的基于扩散的概率预测框架是基于LSNM设计的，该框架能够建模不确定性的变化模式。具体而言，NSDIFF结合了基于扩散的条件生成模型与预训练的条件均值和方差估计器，从而实现适应性端点分布建模。此外，我们提出了一个不确定性感知的噪声时间表，该计划会动态调整噪声水平，以准确地反映每个步骤的数据不确定性，并将随时间变化的方差集成到扩散过程中。在9个现实世界和合成数据集上进行的广泛实验表明，与现有方法相比，NSDIFF的表现出色。代码可在此HTTPS URL上找到。
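
To make the contrast in the abstract above concrete, here is a small sketch of the two noise models in their general textbook form: an additive noise model (ANM) with a constant noise scale, and a location-scale noise model (LSNM) whose scale depends on the input. The functions `mean_fn` and `var_fn` are hypothetical stand-ins for learned conditional mean and variance estimators; nothing here reproduces NsDiff's diffusion process or noise schedule.

```python
import math
import random


def mean_fn(x):
    """Hypothetical conditional mean f(x); a stand-in for a learned estimator."""
    return 2.0 * x


def var_fn(x):
    """Hypothetical conditional variance g(x) > 0 that grows with |x|."""
    return 0.1 + x * x


def sample_anm(x, eps, sigma=1.0):
    """Additive noise model: y = f(x) + sigma * eps, with constant sigma."""
    return mean_fn(x) + sigma * eps


def sample_lsnm(x, eps):
    """Location-scale noise model: y = f(x) + sqrt(g(x)) * eps."""
    return mean_fn(x) + math.sqrt(var_fn(x)) * eps


def empirical_std(sampler, x, n=20000, seed=0):
    """Standard deviation of n samples drawn at a fixed input x."""
    rng = random.Random(seed)
    ys = [sampler(x, rng.gauss(0.0, 1.0)) for _ in range(n)]
    mu = sum(ys) / n
    return math.sqrt(sum((y - mu) ** 2 for y in ys) / n)


# The ANM spread is the same at every x; the LSNM spread follows var_fn(x).
for x in (0.0, 1.0, 3.0):
    print(x, round(empirical_std(sample_anm, x), 2),
          round(empirical_std(sample_lsnm, x), 2))
```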

Title: MoDE: Mixture of Diffusion Experts for Any Occluded Face Recognition

Authors: Qiannan Fan, Zhuoyang Li, Jitong Li, Chenyang Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04306
Pdf URL: https://arxiv.org/pdf/2505.04306
Copy Paste: [[2505.04306]] MoDE: Mixture of Diffusion Experts for Any Occluded Face Recognition(https://arxiv.org/abs/2505.04306)
Keywords: generative
Abstract: With the continuous impact of epidemics, people have become accustomed to wearing masks. However, most current occluded face recognition (OFR) algorithms lack prior knowledge of occlusions, resulting in poor performance when dealing with occluded faces of varying types and severity in reality. Recognizing occluded faces is still a significant challenge, which greatly affects the convenience of people's daily lives. In this paper, we propose an identity-gated mixture of diffusion experts (MoDE) for OFR. Each diffusion-based generative expert estimates one possible complete image for an occluded face. Because the random sampling process of the diffusion model introduces inevitable differences between the inpainted faces and the real ones, we introduce an identity-gating network that ensembles the effective information from the multiple reconstructed faces: it evaluates the contribution of each reconstructed face to the identity and adaptively integrates the predictions in the decision space. Moreover, MoDE is a plug-and-play module for most existing face recognition models. Extensive experiments on three public face datasets and two in-the-wild datasets validate its advanced performance on various occlusions in comparison with competing methods.
摘要：随着流行病的持续影响，人们已经习惯戴口罩。但是，当前大多数闭塞面部识别（OFR）算法缺乏闭塞的先验知识，在处理现实中不同类型和严重性的闭塞面时，表现不佳。认识到遮挡的面孔仍然是一个重大挑战，这极大地影响了人们日常生活的便利。在本文中，我们提出了OFR扩散专家（模式）的身份门控混合物。每个基于扩散的生成专家都会估算遮挡面孔的一个可能的完整图像。考虑到扩散模型的随机抽样过程，该过程引入了未能的差异以及涂漆面和真实面部之间的变化。为了从多重建面孔中整合有效的信息，我们引入了一个身份门网络，以评估每个重建面对身份的贡献，并自适应地整合了决策空间中的预测。此外，我们的模式是大多数现有面部识别模型的插件模块。与竞争方法相比，在三个公共面部数据集和两个数据集上进行了广泛的实验，以验证我们的高级性能。

Title: Riemannian Denoising Diffusion Probabilistic Models

Authors: Zichen Liu, Wei Zhang, Christof Schütte, Tiejun Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.04338
Pdf URL: https://arxiv.org/pdf/2505.04338
Copy Paste: [[2505.04338]] Riemannian Denoising Diffusion Probabilistic Models(https://arxiv.org/abs/2505.04338)
Keywords: generative
Abstract: We propose Riemannian Denoising Diffusion Probabilistic Models (RDDPMs) for learning distributions on submanifolds of Euclidean space that are level sets of functions, including most of the manifolds relevant to applications. Existing methods for generative modeling on manifolds rely on substantial geometric information such as geodesic curves or eigenfunctions of the Laplace-Beltrami operator and, as a result, they are limited to manifolds where such information is available. In contrast, our method, built on a projection scheme, can be applied to more general manifolds, as it only requires being able to evaluate the value and the first-order derivatives of the function that defines the submanifold. We provide a theoretical analysis of our method in the continuous-time limit, which elucidates the connection between our RDDPMs and score-based generative models on manifolds. The capability of our method is demonstrated on datasets from previous studies and on new datasets sampled from two high-dimensional manifolds, i.e., $\mathrm{SO}(10)$ and the configuration space of the molecular system alanine dipeptide with a fixed dihedral angle.
摘要：我们提出了Riemannian denoising扩散概率模型（RDDPM），以在欧几里得空间的子序列上学习分布，这些模型是级别的功能集，包括大多数与应用程序相关的流形。现有的用于流形的生成建模的方法依赖于大量的几何信息，例如Laplace-Beltrami操作员的地球曲线或本征函数，因此，它们仅限于可用此类信息的流形。相比之下，我们的方法（基于投影方案）可以应用于更通用的歧管，因为它仅需要能够评估定义Submanifold的函数的值和一阶导数。我们在连续时间限制中对我们的方法进行了理论分析，该分析阐明了我们的RDDPM和基于分数的生成模型之间的连接。在先前研究的数据集和从两个高维歧管中采样的新数据集中，我们的方法的能力证明了能力，即$ \ mathrm {so}（10）$以及分子系统丙氨酸二肽的配置空间，具有固定的二面角。
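
The abstract above stresses that the method only needs the value and first-order derivatives of the function defining the submanifold. The sketch below shows one generic way such information can be used to pull a point back onto a level set: Newton-type steps along the gradient direction, with the unit sphere as the example manifold. This is an illustrative assumption, not the projection scheme of the paper, whose details the abstract does not give.

```python
import math


def f(x):
    """Level-set function whose zero set is the unit sphere: f(x) = |x|^2 - 1."""
    return sum(v * v for v in x) - 1.0


def grad_f(x):
    """First-order derivatives of f."""
    return [2.0 * v for v in x]


def project_to_level_set(x, tol=1e-12, max_iter=50):
    """Pull x back onto {f = 0} by Newton steps along the gradient direction."""
    x = list(x)
    for _ in range(max_iter):
        value = f(x)
        if abs(value) < tol:
            break
        g = grad_f(x)
        step = value / sum(gi * gi for gi in g)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


noisy = [0.9, 0.7, -0.4]          # a point that has drifted off the sphere
on_manifold = project_to_level_set(noisy)
print(on_manifold, f(on_manifold))
```

Only `f` and `grad_f` are evaluated; no geodesics or Laplace-Beltrami eigenfunctions are needed, which is the property the abstract highlights.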

Title: CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion

Authors: Yanyu Li, Pencheng Wan, Liang Han, Yaowei Wang, Liqiang Nie, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04347
Pdf URL: https://arxiv.org/pdf/2505.04347
Copy Paste: [[2505.04347]] CountDiffusion: Text-to-Image Synthesis with Training-Free Counting-Guidance Diffusion(https://arxiv.org/abs/2505.04347)
Keywords: generation
Abstract: Stable Diffusion has advanced text-to-image synthesis, but training models to generate images with accurate object quantity is still difficult due to the high computational cost and the challenge of teaching models the abstract concept of quantity. In this paper, we propose CountDiffusion, a training-free framework that aims to generate images with the correct object quantity from textual descriptions. CountDiffusion consists of two stages. In the first stage, an intermediate denoising result is generated by the diffusion model to predict the final synthesized image with one-step denoising, and a counting model is used to count the number of objects in this image. In the second stage, a correction module is used to correct the object quantity by changing the attention map of the object with universal guidance. The proposed CountDiffusion can be plugged into any diffusion-based text-to-image (T2I) generation model without further training. Experimental results demonstrate the superiority of CountDiffusion, which improves the ability of T2I models to generate the correct object quantity by a large margin.
摘要：稳定的扩散具有先进的文本对图像综合，但是由于高计算成本和教学模型的挑战，训练模型仍然很困难，仍然很困难。在本文中，我们提出了Count-Diffusion，这是一个无训练的框架，旨在从文本描述中生成具有正确对象数量的图像。 Count -Diffusion包括两个阶段。在第一阶段，扩散模型生成了一个中间的脱索结果，以一种用一步denoising预测最终的合成图像，并且使用计数模型来计算该图像中对象的数量。在第二阶段，使用校正模块通过使用通用指导更改对象的注意力图来纠正对象数量。提出的count-diffusion可以插入任何基于扩散的文本对图（T2I）生成模型中，而无需进一步培训。实验结果证明了我们提出的count -fiffusion的优越性，这可以通过较大的边缘提高T2I模型的准确对象的数量产生能力。
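
The first stage described in the abstract above predicts the final image from an intermediate denoising result in one step. A minimal sketch of such a one-step estimate follows, assuming the standard DDPM noise-prediction parameterization; the sampler CountDiffusion actually uses may differ, and the counting model and attention-map correction are not shown. The three-element list is a toy stand-in for an image or latent.

```python
import math


def add_noise(x0, eps, alpha_bar):
    """Forward DDPM noising: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_hat, alpha_bar):
    """One-step estimate of the clean sample from x_t and predicted noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


x0 = [0.2, -0.5, 0.9]            # toy "image" with three pixels
eps = [0.3, -1.1, 0.4]           # noise that was added
alpha_bar = 0.6                  # cumulative noise-schedule product at step t
x_t = add_noise(x0, eps, alpha_bar)
x0_hat = predict_x0(x_t, eps, alpha_bar)   # exact when eps_hat equals eps
print(x0_hat)
```

In practice `eps_hat` comes from the denoising network, so the estimate is only approximate; the point is that a preview of the final image is available before sampling finishes and can be handed to a counter.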

Title: WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing

Authors: Jie Sun, Heng Liu, Yongzhen Wang, Xiao-Ping Zhang, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04369
Pdf URL: https://arxiv.org/pdf/2505.04369
Copy Paste: [[2505.04369]] WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing(https://arxiv.org/abs/2505.04369)
Keywords: restoration
Abstract: In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at this https URL.
摘要：在本文中，我们揭示了通过小波变换分析观察到的一种新型的雾片特异性小波降解，该分析表明，与有关性相关的信息主要驻留在低频成分中。利用这种见解，我们提出了一个新型的飞行框架WDMAMBA，该框架将图像除去任务分解为两个顺序阶段：低频恢复，然后进行细节增强。这种粗到最新的策略使WDMAMBA能够有效地捕获飞行过程每个阶段的特定功能，从而产生高质量的恢复图像。具体而言，在低频恢复阶段，我们将MAMBA块整合到具有线性复杂性的全局结构，有效地消除整体雾度并产生粗糙的恢复图像。此后，细节增强阶段恢复了可能在上一阶段可能被忽略的细粒度信息，最终达到了最终的脱壳输出。此外，为了增强细节保留并实现更自然的飞行，我们在网络培训期间引入了自我引导的对比度正规化。通过利用粗糙的恢复输出作为一个硬性负面示例，我们的模型可以学习更多的判别性表示，从而大大提高了整体飞行性能。对公共飞行基准的广泛评估表明，我们的方法在定性和定量上都超过了最先进的方法。代码可在此HTTPS URL上找到。
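
The observation in the abstract above, that haze-related information sits mainly in low-frequency wavelet components, can be illustrated with a toy example. The sketch below applies a single-level 1D Haar transform to a short signal before and after adding a slowly varying bright "veil". The veil is a deliberately simplified, hypothetical haze model chosen so the effect is exact; it is not the degradation prior or the dataset from the paper.

```python
import math


def haar_step(signal):
    """Single-level 1D Haar transform: returns (low-frequency, high-frequency)."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def inverse_haar_step(low, high):
    """Invert haar_step, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out


clean = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.6]
veil = [0.30, 0.30, 0.35, 0.35, 0.40, 0.40, 0.45, 0.45]   # toy smooth haze layer
hazy = [c + v for c, v in zip(clean, veil)]

low_clean, high_clean = haar_step(clean)
low_hazy, high_hazy = haar_step(hazy)
print("low-band change: ", [round(b - a, 3) for a, b in zip(low_clean, low_hazy)])
print("high-band change:", [round(b - a, 3) for a, b in zip(high_clean, high_hazy)])
```

Because the veil is constant within each sample pair, it cancels in the high band and shows up only in the low band, which is why restoring the low-frequency part first and refining details afterwards is a natural split.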

Title: DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution

Authors: Ming-Hui Liu, Xiao-Qian Liu, Xin Luo, Xin-Shun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04384
Pdf URL: https://arxiv.org/pdf/2505.04384
Copy Paste: [[2505.04384]] DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution(https://arxiv.org/abs/2505.04384)
Keywords: generation
Abstract: Deepfake attribution (DFA) aims to perform multi-class classification of different facial manipulation techniques, thereby mitigating the detrimental effects of forged content on the social order and personal reputations. However, previous methods focus only on method-specific clues, which easily lead to overfitting, while overlooking the crucial role of common forgery features. Additionally, they struggle to distinguish between uncertain novel classes in more practical open-world scenarios. To address these issues, in this paper we propose an innovative multi-DisentAnglement based conTrastive leArning framework, DATA, to enhance the generalization ability on novel classes for the open-world semi-supervised deepfake attribution (OSS-DFA) task. Specifically, since all generation techniques can be abstracted into a similar architecture, DATA defines the concept of 'Orthonormal Deepfake Basis' for the first time and utilizes it to disentangle method-specific features, thereby reducing overfitting to forgery-irrelevant information. Furthermore, an augmented-memory mechanism is designed to assist in novel class discovery and contrastive learning, which aims to obtain clear class boundaries for the novel classes through instance-level disentanglements. Additionally, to enhance the standardization and discrimination of features, DATA uses bases contrastive loss and center contrastive loss as auxiliaries for the aforementioned modules. Extensive experimental evaluations show that DATA achieves state-of-the-art performance on the OSS-DFA benchmark, e.g., notable accuracy improvements of 2.55% / 5.7% under different settings, compared with the existing methods.
摘要：DeepFake归因（DFA）旨在对不同的面部操纵技术进行多分类，从而减轻伪造内容对社会秩序和个人声誉的有害影响。但是，以前的方法仅着眼于特定方法的线索，这些线索很容易导致过度拟合，同时忽略了共同伪造特征的关键作用。此外，他们努力在更实用的开放世界情景中区分不确定的新颖班级。为了解决这些问题，在本文中，我们提出了一个创新的基于多透明的学习框架，数据，以增强开放世界半衰期的新型类别的概括能力，用于开放世界的半持有性深层效果归因（OSS-DFA）任务。具体而言，由于所有生成技术都可以抽象成类似的体系结构，因此数据首次定义了“正顺序深击基础”的概念，并利用它来删除特定于方法的特定特征，从而减少了对伪造的信息的过度拟合。此外，增强的记忆机制旨在帮助新的类发现和对比度学习，该学习旨在通过实例级别的删节来获得新型类别的清晰类界限。此外，为了增强特征的标准化和歧视，数据使用基础对比损失和中心对比损失，作为上述模块的辅助机构。广泛的实验评估表明，与现有方法相比，在不同的设置下，数据可以在OSS-DFA基准测试上实现最新性能，例如，在不同的设置下，有2.55％ / 5.7％的准确性提高。

Title: Localized Diffusion Models for High Dimensional Distributions Generation

Authors: Georg A. Gottwald, Shuigen Liu, Youssef Marzouk, Sebastian Reich, Xin T. Tong
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2505.04417
Pdf URL: https://arxiv.org/pdf/2505.04417
Copy Paste: [[2505.04417]] Localized Diffusion Models for High Dimensional Distributions Generation(https://arxiv.org/abs/2505.04417)
Keywords: generation, generative
Abstract: Diffusion models are the state-of-the-art tools for various generative tasks. However, estimating high-dimensional score functions makes them potentially suffer from the curse of dimensionality (CoD). This underscores the importance of better understanding and exploiting low-dimensional structure in the target distribution. In this work, we consider locality structure, which describes sparse dependencies between model components. Under locality structure, the score function is effectively low-dimensional, so that it can be estimated by a localized neural network with significantly reduced sample complexity. This motivates the localized diffusion model, where a localized score matching loss is used to train the score function within a localized hypothesis space. We prove that such localization enables diffusion models to circumvent CoD, at the price of additional localization error. Under realistic sample size scaling, we show both theoretically and numerically that a moderate localization radius can balance the statistical and localization error, leading to a better overall performance. The localized structure also facilitates parallel training of diffusion models, making it potentially more efficient for large-scale applications.
摘要：扩散模型是用于各种生成任务的最新工具。但是，估计高维分函数使它们可能遭受维数（COD）的诅咒。这强调了更好地理解和利用目标分布中低维结构的重要性。在这项工作中，我们考虑了局部结构，该结构描述了模型组件之间的稀疏依赖性。在局部结构下，得分函数实际上是低维的，因此可以通过具有显着降低样品复杂性的局部神经网络来估计它。这激发了局部扩散模型，其中使用局部分数匹配损失来训练局部假设空间内的得分函数。我们证明，这种本地化使扩散模型以其他本地化错误的价格绕过COD。在现实的样本尺寸缩放下，我们从理论和数字上都表明中等定位半径可以平衡统计和本地化误差，从而导致更好的总体性能。局部结构还促进了扩散模型的平行训练，从而使其在大规模应用中的效率更高。
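
To illustrate what "locality structure" buys in the abstract above, consider a toy Gaussian whose precision matrix is tridiagonal, so each component interacts only with its immediate neighbours. The sketch below checks that each score component computed from a small window of coordinates matches the full score. The Gaussian target and the radius of 1 are assumptions made for illustration; the paper's localized neural network and score-matching loss are not reproduced here.

```python
def tridiagonal_precision(d, diag=2.0, off=-0.5):
    """Precision matrix of a chain-structured Gaussian (hypothetical toy target)."""
    P = [[0.0] * d for _ in range(d)]
    for i in range(d):
        P[i][i] = diag
        if i + 1 < d:
            P[i][i + 1] = off
            P[i + 1][i] = off
    return P


def full_score(P, x):
    """Score of N(0, P^{-1}): grad log p(x) = -P x, using every coordinate."""
    return [-sum(pij * xj for pij, xj in zip(row, x)) for row in P]


def local_score(P, x, i, radius=1):
    """i-th score component computed from coordinates within `radius` of i only."""
    lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
    return -sum(P[i][j] * x[j] for j in range(lo, hi))


d = 8
P = tridiagonal_precision(d)
x = [0.3 * (i - 3) for i in range(d)]
print([round(local_score(P, x, i), 3) for i in range(d)])
print([round(s, 3) for s in full_score(P, x)])
```

Each local evaluation touches at most `2 * radius + 1` inputs regardless of `d`, which is the sense in which the score is "effectively low-dimensional" under locality.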

Title: RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation

Authors: Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04424
Pdf URL: https://arxiv.org/pdf/2505.04424
Copy Paste: [[2505.04424]] RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation(https://arxiv.org/abs/2505.04424)
Keywords: generation
Abstract: Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose RLMiniStyler, a novel reinforcement learning-based framework for arbitrary style transfer. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at this https URL.
摘要：任意样式转移旨在将任何给定艺术图像的样式应用于另一个内容图像。尽管如此，现有的基于深度学习的方法通常需要大量的计算成本才能产生多种程式化的结果。在此激励的情况下，我们提出了一个新颖的加强学习框架，用于任意风格转移rlministyler。该框架利用统一的加强学习政策通过探索和利用风格的反馈来迭代指导样式转移过程，从而在实现模型轻量级的同时，生成了风格化结果的平滑序列。此外，我们引入了一种不确定性的多任务学习策略，该策略会自动调整减损权重以适应不同培训阶段的内容和样式平衡要求，从而加速模型收敛。通过图像各种分辨率的一系列实验，我们验证了rministyler在以较低成本生成高质量，多样化的艺术图像序列方面的优势。代码可在此HTTPS URL上找到。

Title: CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation

Authors: Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, Guichun Zhou, Xiangdong Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04481
Pdf URL: https://arxiv.org/pdf/2505.04481
Copy Paste: [[2505.04481]] CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation(https://arxiv.org/abs/2505.04481)
Keywords: generation, generative
Abstract: Recently, Large Language Models (LLMs) have achieved significant success, prompting increased interest in expanding their generative capabilities beyond general text into domain-specific areas. This study investigates the generation of parametric sequences for computer-aided design (CAD) models using LLMs. This endeavor represents an initial step towards creating parametric 3D shapes with LLMs, as CAD model parameters directly correlate with shapes in three-dimensional space. Despite the formidable generative capacities of LLMs, this task remains challenging, as these models neither encounter parametric sequences during their pretraining phase nor possess direct awareness of 3D structures. To address this, we present CAD-Llama, a framework designed to enhance pretrained LLMs for generating parametric 3D CAD models. Specifically, we develop a hierarchical annotation pipeline and a code-like format to translate parametric 3D CAD command sequences into Structured Parametric CAD Code (SPCC), incorporating hierarchical semantic descriptions. Furthermore, we propose an adaptive pretraining approach utilizing SPCC, followed by an instruction tuning process aligned with CAD-specific guidelines. This methodology aims to equip LLMs with the spatial knowledge inherent in parametric sequences. Experimental results demonstrate that our framework significantly outperforms prior autoregressive methods and existing LLM baselines.
摘要：最近，大型语言模型（LLMS）取得了巨大的成功，促使人们对将其生成能力扩展到一般文本以外的领域的兴趣增加。这项研究研究了使用LLMS的计算机辅助设计（CAD）模型的参数序列的生成。这项工作代表了用LLM创建参数3D形状的初步步骤，因为CAD模型参数与三维空间中的形状直接相关。尽管LLM的生成能力强大，但此任务仍然具有挑战性，因为这些模型在预处理阶段既没有遇到参数序列，也没有对3D结构的直接认识。为了解决这个问题，我们提出了CAD-LLAMA，该框架旨在增强预识别的LLM，以生成参数3D CAD模型。具体来说，我们开发了分层注释管道和类似代码的格式，以将参数3D CAD命令序列转换为结构化参数CAD代码（SPCC），并结合了层次结构的语义描述。此外，我们提出了使用SPCC的自适应预处理方法，然后提出了与CAD特定指南一致的指令调整过程。该方法旨在为LLM配备参数序列中固有的空间知识。实验结果表明，我们的框架显着胜过先前的自回归方法和现有的LLM基准。

Title: Efficient Flow Matching using Latent Variables

Authors: Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04486
Pdf URL: https://arxiv.org/pdf/2505.04486
Copy Paste: [[2505.04486]] Efficient Flow Matching using Latent Variables(https://arxiv.org/abs/2505.04486)
Keywords: generation, generative
Abstract: Flow matching models have shown great potential in image generation tasks among probabilistic generative models. Building upon the ideas of continuous normalizing flows, flow matching models generalize the transport path of the diffusion models from a simple prior distribution to the data. Most flow matching models in the literature do not explicitly model the underlying structure/manifold in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. Existing strategies of incorporating manifolds, including data with underlying multi-modal distribution, often require expensive training and hence frequently lead to suboptimal performance. To this end, we present Latent-CFM, which provides simplified training/inference strategies to incorporate multi-modal data structures using pretrained deep latent variable models. Through experiments on multi-modal synthetic data and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training (~50% less in some cases) and computation than state-of-the-art flow matching models. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competitive approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features.
摘要：流量匹配模型在概率生成模型之间显示出很大的图像生成任务潜力。基于连续归一流流的思想，流匹配模型将扩散模型的传输路径从简单的先前分布到数据的传输路径推广。从文献中，大多数流量匹配模型都不会明确地对目标数据中的基础结构/歧管进行模拟，从而从类似标准高斯（标准高斯）等简单的源分布中学习流程。这导致学习效率低下，尤其是对于许多高维实际数据集，通常存在于低维歧管中。现有的合并流形的策略，包括具有基本多模式分布的数据，通常需要昂贵的培训，因此经常导致次优性能。为此，我们提出\ texttt {litent-cfm}，它提供了简化的培训/推理策略，可以使用预验证的深层可变模型合并多模式数据结构。通过对多模式合成数据和广泛使用的图像基准数据集进行的实验，我们表明\ texttt {litent-cfm}具有改进的生成质量，其培训的培训明显少得多（在某些情况下$ \ sim 50 \％$）和计算较少，而计算则比是先进的流动流量匹配模型。使用2D DARCY流数据集，我们证明我们的方法比竞争方法生成更精确的样本。此外，通过潜在空间分析，我们证明我们的方法可用于以潜在特征为条件的条件图像产生。
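
For readers unfamiliar with the flow matching setup the abstract above builds on, here is a minimal sketch of how one training pair is formed when the transport path is a straight line from a Gaussian source sample to a data sample. The straight-line path is a common choice assumed here for illustration; Latent-CFM's conditioning on features from a pretrained latent variable model is not shown, and the zero vector stands in for a model output.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and the velocity target there."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def squared_error(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [1.5, -0.5]                                   # a "data" sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]             # a sample from the Gaussian source
t = rng.random()                                   # time drawn uniformly from [0, 1)
x_t, u_t = cfm_training_pair(x0, x1, t)

# A model v_theta(x_t, t) would be regressed onto u_t; zeros stand in for its output.
loss = squared_error([0.0, 0.0], u_t)
print(x_t, u_t, loss)
```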

Title: Defining and Quantifying Creative Behavior in Popular Image Generators

Authors: Aditi Ramaswamy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04497
Pdf URL: https://arxiv.org/pdf/2505.04497
Copy Paste: [[2505.04497]] Defining and Quantifying Creative Behavior in Popular Image Generators(https://arxiv.org/abs/2505.04497)
Keywords: generation, generative
Abstract: The creativity of generative AI models has been a subject of scientific debate in recent years, without a conclusive answer. In this paper, we study creativity from a practical perspective and introduce quantitative measures that help the user choose a suitable AI model for a given task. We evaluated our measures on a number of popular image-to-image generation models, and the results suggest that our measures conform to human intuition.
摘要：在过去的几年中，生成性AI模型的创造力一直是科学辩论的主题，没有结论性的答案。在本文中，我们从实际角度研究了创造力，并引入了定量措施，以帮助用户为给定任务选择合适的AI模型。我们评估了许多流行的图像到图像生成模型的措施，结果表明我们的措施符合人类的直觉。

Title: HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04512
Pdf URL: https://arxiv.org/pdf/2505.04512
Copy Paste: [[2505.04512]] HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation(https://arxiv.org/abs/2505.04512)
Keywords: generation
Abstract: Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at this https URL.
摘要：定制的视频生成旨在制作具有灵活用户定义条件下特定主题的视频，但是现有方法通常会因身份一致性和有限的输入方式而苦苦挣扎。在本文中，我们提出了Hunyuancustom，这是一个多模式定制的视频生成框架，强调主题一致性，同时支持图像，音频，视频和文本条件。我们的模型建立在HunyuanVideo的基础上，首先通过引入基于LLAVA的文本图像融合模块来解决图像文本条件的生成任务，以增强多模式的理解，并利用临时关注的图像ID增强模块，以增强跨框架跨框架的标识功能。为了启用音频和视频条件的生成，我们进一步提出了特定于模式的状态注入机制：通过空间跨注意来实现层次对齐的AUDIONET模块，以及通过基于贴片的特征 - 对准网络集成潜在的潜在压缩条件视频的视频驱动注入模块。关于单一和多主体场景的广泛实验表明，就ID一致性，现实主义和文本视频对齐方式而言，Hunyuancustom显着超过了最先进的开放和封闭式方法。此外，我们验证了其在下游任务中的鲁棒性，包括音频和视频驱动的自定义视频生成。我们的结果突出了多模式调节和具有身份的策略在推进可控视频生成方面的有效性。所有代码和模型均可在此HTTPS URL上找到。

Title: Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model

Authors: Pengfei Guo, Can Zhao, Dong Yang, Yufan He, Vishwesh Nath, Ziyue Xu, Pedro R. A. S. Bassi, Zongwei Zhou, Benjamin D. Simon, Stephanie Anne Harmon, Baris Turkbey, Daguang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04522
Pdf URL: https://arxiv.org/pdf/2505.04522
Copy Paste: [[2505.04522]] Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model(https://arxiv.org/abs/2505.04522)
Keywords: generation
Abstract: Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using a diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from diverse, free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, effectively bridging the gap between semantic text inputs and detailed volumetric representations in a unified 3D framework. Our method demonstrates superior performance in preserving anatomical fidelity and capturing intricate structures as described in the input text. Extensive evaluations show that our approach achieves state-of-the-art results, offering promising potential applications in diagnostics and data augmentation.
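Summary: Generating 3D CT volumes from descriptive free-text inputs offers a transformative opportunity for diagnostics and research. This paper introduces Text2CT, a new method that uses a diffusion model to synthesize 3D CT volumes from textual descriptions. Unlike earlier methods that depend on fixed-format text input, Text2CT adopts a novel prompt formulation that supports generation from diverse free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, bridging the gap between semantic text inputs and detailed volumetric representations within a unified 3D framework. The method performs strongly at preserving anatomical fidelity and at capturing the intricate structures described in the input text. Extensive evaluations show that it achieves state-of-the-art results, with promising potential applications in diagnostics and data augmentation.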

Title: Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait

Authors: Feng Liu, Nicholas Chimitt, Lanqing Guo, Jitesh Jain, Aditya Kane, Minchul Kim, Wes Robbins, Yiyang Su, Dingqiang Ye, Xingguang Zhang, Jie Zhu, Siddharth Satyakam, Christopher Perry, Stanley H. Chan, Arun Ross, Humphrey Shi, Zhangyang Wang, Anil Jain, Xiaoming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04616
Pdf URL: https://arxiv.org/pdf/2505.04616
Copy Paste: [[2505.04616]] Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait(https://arxiv.org/abs/2505.04616)
Keywords: restoration
Abstract: We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end system for person recognition that integrates complementary biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work cohesively under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared to our preliminary system, this system achieves a 34.1% absolute gain in 1:1 verification accuracy (TAR@0.1% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). Furthermore, FarSight was evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.
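Summary: We address whole-body person recognition in unconstrained environments. The problem arises in surveillance scenarios such as the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, at elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end person recognition system that integrates complementary biometric cues from the face, gait, and body shape modalities. FarSight incorporates new algorithms in four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work together under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared with our preliminary system, it achieves a 34.1% absolute gain in 1:1 verification accuracy (TAR@0.1% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). FarSight was also evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.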

Title: On Path to Multimodal Generalist: General-Level and General-Bench

Authors: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04620
Pdf URL: https://arxiv.org/pdf/2505.04620
Copy Paste: [[2505.04620]] On Path to Multimodal Generalist: General-Level and General-Bench(https://arxiv.org/abs/2505.04620)
Keywords: generation
Abstract: The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: this https URL
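Summary: Driven by the advanced capabilities of LLMs, the Multimodal Large Language Model (MLLM) is growing rapidly. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models can now both comprehend and generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding, and from supporting a limited set of modalities to arbitrary ones. Although many benchmarks exist for assessing MLLMs, a critical question arises: can we simply assume that higher performance across tasks indicates stronger MLLM capability and brings us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines five levels of MLLM performance and generality, offering a methodology for comparing MLLMs and gauging the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which covers a broader spectrum of skills, modalities, formats, and capabilities, with over 700 tasks and 325,800 instances. Evaluation results covering more than 100 existing state-of-the-art MLLMs reveal the capability rankings of generalists and highlight the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: this https URL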