2025-05-02

Title: Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

Authors: Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00135
Pdf URL: https://arxiv.org/pdf/2505.00135
Copy Paste: [[2505.00135]] Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis(https://arxiv.org/abs/2505.00135)
Keywords: generation
Abstract: The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on this https URL
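Illustration (not from the paper): the warp-then-inpaint pipeline that the abstract contrasts against can be sketched on a single scanline. The function name `warp_row`, the integer disparities, and the sign convention (a pixel moves left by its disparity) are assumptions made for this sketch. Each pixel carries exactly one disparity value, which is why a reflection or a transparent surface, where two depths are visible at the same pixel, cannot be shifted correctly.

```python
from typing import List, Optional


def warp_row(row: List[int], disparity: List[int]) -> List[Optional[int]]:
    """Forward-warp one scanline: the pixel at x moves to x - disparity[x].

    Returns the warped row. None marks disoccluded pixels that a later
    inpainting stage would have to fill. Disparities are assumed to be
    non-negative integers.
    """
    width = len(row)
    out: List[Optional[int]] = [None] * width
    best = [-1] * width  # largest disparity written so far at each target pixel
    for x, (value, d) in enumerate(zip(row, disparity)):
        tx = x - d
        if 0 <= tx < width and d > best[tx]:
            out[tx] = value  # nearer pixels (larger disparity) overwrite farther ones
            best[tx] = d
    return out


row = [10, 20, 30, 40, 50, 60]
disparity = [0, 0, 2, 2, 0, 0]  # two foreground pixels in the middle of the row
warped = warp_row(row, disparity)
print(warped)  # [30, 40, None, None, 50, 60]
```

The two None entries are the hole left behind by the shifted foreground. This is the region a multi-phase method must inpaint, and a direct synthesis approach never creates it as a separate step.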

Title: Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

Authors: Minh-Hao Van, Xintao Wu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.00150
Pdf URL: https://arxiv.org/pdf/2505.00150
Copy Paste: [[2505.00150]] Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models(https://arxiv.org/abs/2505.00150)
Keywords: generation
Abstract: The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expressions with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on the hateful meme detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations for these tasks. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.

Title: GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation

Authors: Filipp Nikitin, Ian Dunn, David Ryan Koes, Olexandr Isayev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00169
Pdf URL: https://arxiv.org/pdf/2505.00169
Copy Paste: [[2505.00169]] GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation(https://arxiv.org/abs/2505.00169)
Keywords: generation, generative
Abstract: Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at this https URL.
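Illustration (not from the paper): a valency table of the kind the abstract mentions can be thought of as a lookup keyed by element and formal charge. The table below is a deliberately tiny, hypothetical excerpt covering only a few textbook cases, and the names `ALLOWED_VALENCE`, `atom_is_valid`, and `molecule_is_valid` are assumptions made for this sketch. It is not the authors' table or their code.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical excerpt: (element, formal charge) -> allowed total bond orders.
ALLOWED_VALENCE: Dict[Tuple[str, int], Set[int]] = {
    ("H", 0): {1},
    ("C", 0): {4},
    ("N", 0): {3},
    ("N", 1): {4},   # e.g. ammonium-type nitrogen
    ("O", 0): {2},
    ("O", -1): {1},  # e.g. alkoxide-type oxygen
}

Atom = Tuple[str, int, List[int]]  # (element, formal charge, bond orders)


def atom_is_valid(element: str, charge: int, bond_orders: List[int]) -> bool:
    """An atom passes if its summed bond order is listed for (element, charge)."""
    valence = sum(bond_orders)
    return valence in ALLOWED_VALENCE.get((element, charge), set())


def molecule_is_valid(atoms: List[Atom]) -> bool:
    """A molecule passes only if every atom passes."""
    return all(atom_is_valid(*atom) for atom in atoms)


water = [("O", 0, [1, 1]), ("H", 0, [1]), ("H", 0, [1])]
print(molecule_is_valid(water))                # True
print(atom_is_valid("N", 1, [1, 1, 1, 1]))     # True: charged N with four bonds
print(atom_is_valid("N", 0, [1, 1, 1, 1]))     # False: neutral N with four bonds
```

Keying the table on formal charge is the point of the example: a table indexed by element alone would have to either reject the charged nitrogen or accept the neutral one.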

Title: Direct Motion Models for Assessing Generated Videos

Authors: Kelsey Allen, Carl Doersch, Guangyao Zhou, Mohammed Suhail, Danny Driess, Ignacio Rocco, Yulia Rubanova, Thomas Kipf, Mehdi S. M. Sajjadi, Kevin Murphy, Joao Carreira, Sjoerd van Steenkiste
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.00209
Pdf URL: https://arxiv.org/pdf/2505.00209
Copy Paste: [[2505.00209]] Direct Motion Models for Assessing Generated Videos(https://arxiv.org/abs/2505.00209)
Keywords: generative
Abstract: A current limitation of generative video models is that they generate plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used not only to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also to evaluate the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: this http URL.

Title: Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review

Authors: Suk Ki Lee, Hyunwoong Ko
Subjects: cs.LG, cs.CE, eess.SY
Abstract URL: https://arxiv.org/abs/2505.00210
Pdf URL: https://arxiv.org/pdf/2505.00210
Copy Paste: [[2505.00210]] Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review(https://arxiv.org/abs/2505.00210)
Keywords: generation, generative
Abstract: Dynamic manufacturing processes exhibit complex characteristics defined by time-varying parameters, nonlinear behaviors, and uncertainties. These characteristics require sophisticated in-situ monitoring techniques utilizing multimodal sensor data and adaptive control systems that can respond to real-time feedback while maintaining product quality. Recently, generative machine learning (ML) has emerged as a powerful tool for modeling complex distributions and generating synthetic data while handling these manufacturing uncertainties. However, adopting these generative technologies in dynamic manufacturing systems lacks a functional control-oriented perspective to translate their probabilistic understanding into actionable process controls while respecting constraints. This review presents a functional classification of Prediction-Based, Direct Policy, Quality Inference, and Knowledge-Integrated approaches, offering a perspective for understanding existing ML-enhanced control systems and incorporating generative ML. The analysis of generative ML architectures within this framework demonstrates control-relevant properties and potential to extend current ML-enhanced approaches where conventional methods prove insufficient. We show generative ML's potential for manufacturing control through decision-making applications, process guidance, simulation, and digital twins, while identifying critical research gaps: separation between generation and control functions, insufficient physical understanding of manufacturing phenomena, and challenges adapting models from other domains. To address these challenges, we propose future research directions aimed at developing integrated frameworks that combine generative ML and control technologies to address the dynamic complexities of modern manufacturing systems.

Title: Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders

Authors: Xuwei Yang, Fatemeh Tavakoli, David B. Emerson, Anastasis Kratsios
Subjects: cs.LG, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2505.00216
Pdf URL: https://arxiv.org/pdf/2505.00216
Copy Paste: [[2505.00216]] Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders(https://arxiv.org/abs/2505.00216)
Keywords: generative
Abstract: Most industry-standard generative AIs and feature encoders are proprietary, offering only black-box access: their outputs are observable, but their internal parameters and architectures remain hidden from the end-user. This black-box access is especially limiting when constructing mixture-of-expert type ensemble models since the user cannot optimize each proprietary AI's internal parameters. Our problem naturally lends itself to a non-cooperative game-theoretic lens where each proprietary AI (agent) is inherently competing against the other AI agents, with this competition arising naturally because each agent is oblivious to the others' internal structure. In contrast, the user acts as a central planner trying to synchronize the ensemble of competing AIs. We show the existence of the unique Nash equilibrium in the online setting, which we even compute in closed form by eliciting a feedback mechanism between any given time series and the sequence generated by each (proprietary) AI agent. Our solution is implemented as a decentralized, federated-learning algorithm in which each agent optimizes their structure locally on their machine without ever releasing any internal structure to the others. We obtain refined expressions for pre-trained models such as transformers, random feature models, and echo-state networks. Our "proprietary federated learning" algorithm is implemented on a range of real-world and synthetic time-series benchmarks. It achieves orders-of-magnitude improvements in predictive accuracy over natural benchmarks, of which there are surprisingly few due to this natural problem still being largely unexplored.

Title: Predicting Estimated Times of Restoration for Electrical Outages Using Longitudinal Tabular Transformers

Authors: Bogireddy Sai Prasanna Teja, Valliappan Muthukaruppan, Carls Benjamin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00225
Pdf URL: https://arxiv.org/pdf/2505.00225
Copy Paste: [[2505.00225]] Predicting Estimated Times of Restoration for Electrical Outages Using Longitudinal Tabular Transformers(https://arxiv.org/abs/2505.00225)
Keywords: restoration
Abstract: As climate variability increases, the ability of utility providers to deliver precise Estimated Times of Restoration (ETR) during natural disasters has become increasingly critical. Accurate and timely ETRs are essential for enabling customer preparedness during extended power outages, where informed decision-making can be crucial, particularly in severe weather conditions. Nonetheless, prevailing utility practices predominantly depend on manual assessments or traditional statistical methods, which often fail to achieve the level of precision required for reliable and actionable predictions. To address these limitations, we propose a Longitudinal Tabular Transformer (LTT) model that leverages historical outage event data along with sequential updates of these events to improve the accuracy of ETR predictions. The model's performance was evaluated over 34,000 storm-related outage events from three major utility companies, collectively serving over 3 million customers over a 2-year period. Results demonstrate that the LTT model improves the Customer Satisfaction Impact (CSI) metric by an average of 19.08% (p > 0.001) compared to existing methods. Additionally, we introduce customer-informed regression metrics that align model evaluation with real-world satisfaction, ensuring the outcomes resonate with customer expectations. Furthermore, we employ interpretability techniques to analyze the temporal significance of incorporating sequential updates in modeling outage events and to identify the contributions of predictive features to a given ETR. This comprehensive approach not only improves predictive accuracy but also enhances transparency, fostering greater trust in the model's capabilities.

Title: ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

Authors: Xiaoman Zhang, Julián N. Acosta, Josh Miller, Ouwen Huang, Pranav Rajpurkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00228
Pdf URL: https://arxiv.org/pdf/2505.00228
Copy Paste: [[2505.00228]] ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports(https://arxiv.org/abs/2505.00228)
Keywords: generation
Abstract: We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at this https URL.

Title: Scaling On-Device GPU Inference for Large Generative Models

Authors: Jiuqiang Tang, Raman Sarokin, Ekaterina Ignasheva, Grant Jensen, Lin Chen, Juhyun Lee, Andrei Kulik, Matthias Grundmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00232
Pdf URL: https://arxiv.org/pdf/2505.00232
Copy Paste: [[2505.00232]] Scaling On-Device GPU Inference for Large Generative Models(https://arxiv.org/abs/2505.00232)
Keywords: generative
Abstract: Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift--an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.

Title: Empowering Agentic Video Analytics Systems with Video Language Models

Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00254
Pdf URL: https://arxiv.org/pdf/2505.00254
Copy Paste: [[2505.00254]] Empowering Agentic Video Analytics Systems with Video Language Models(https://arxiv.org/abs/2505.00254)
Keywords: generation
Abstract: AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.

Title: AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Authors: Biling Wang, Austen Maniscalco, Ti Bai, Siqiu Wang, Michael Dohopolski, Mu-Han Lin, Chenyang Shen, Dan Nguyen, Junzhou Huang, Steve Jiang, Xinlei Wang
Subjects: cs.CV, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2505.00308
Pdf URL: https://arxiv.org/pdf/2505.00308
Copy Paste: [[2505.00308]] AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality(https://arxiv.org/abs/2505.00308)
Keywords: quality assessment
Abstract: Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive manual labeling. Methods: We developed a BOC model to classify auto-contour quality and quantify prediction uncertainty. A calibration step was used to optimize uncertainty thresholds that meet clinical accuracy needs. The method was validated under three data scenarios: no manual labels, limited labels, and extensive labels. For rectum contours in prostate cancer, we applied geometric surrogate labels when manual labels were absent, transfer learning when limited, and direct supervision when ample labels were available. Results: The BOC model delivered robust performance across all scenarios. Fine-tuning with just 30 manual labels and calibrating with 34 subjects yielded over 90% accuracy on test data. Using the calibrated threshold, over 93% of the auto-contours' qualities were accurately predicted in over 98% of cases, reducing unnecessary manual reviews and highlighting cases needing correction. Conclusion: The proposed QA model enhances contouring efficiency in OART by reducing manual workload and enabling fast, informed clinical decisions. Through uncertainty quantification, it ensures safer, more reliable radiotherapy workflows.
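Illustration (not from the paper): one simple way to calibrate an uncertainty threshold against an accuracy requirement is to sort a held-out calibration set by uncertainty and keep the largest threshold whose accepted predictions still meet the target. The abstract does not describe the authors' calibration procedure, so the function `calibrate_threshold`, the acceptance rule (accept when uncertainty is at or below the threshold), and the toy numbers below are all assumptions made for this sketch.

```python
from typing import List, Optional


def calibrate_threshold(
    uncertainties: List[float],
    correct: List[bool],
    target_accuracy: float,
) -> Optional[float]:
    """Return the largest threshold t such that predictions with
    uncertainty <= t reach at least target_accuracy on the calibration
    set, or None if no threshold qualifies."""
    pairs = sorted(zip(uncertainties, correct))
    best: Optional[float] = None
    n_correct = 0
    for i, (u, ok) in enumerate(pairs, start=1):
        n_correct += int(ok)
        # Evaluate only at the end of a group of tied uncertainties,
        # because a threshold accepts all tied cases together.
        if i < len(pairs) and pairs[i][0] == u:
            continue
        if n_correct / i >= target_accuracy:
            best = u
    return best


# Toy calibration set: six cases, made up for illustration only.
unc = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60]
hit = [True, True, True, False, True, False]
print(calibrate_threshold(unc, hit, 0.75))  # 0.4
print(calibrate_threshold(unc, hit, 0.90))  # 0.2
```

A stricter accuracy target yields a lower threshold, so fewer predictions are accepted automatically and more cases are routed to manual review.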

Title: Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution

Authors: Luigi Sigillo, Christian Bianchi, Danilo Comminiello
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.00334
Pdf URL: https://arxiv.org/pdf/2505.00334
Copy Paste: [[2505.00334]] Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution(https://arxiv.org/abs/2505.00334)
Keywords: super-resolution, generative
Abstract: Image Super-Resolution (SR) is a fundamental problem in computer vision with broad applications ranging from medical imaging to satellite analysis. The ability to reconstruct high-resolution images from low-resolution inputs is crucial for enhancing downstream tasks such as object detection and segmentation. While deep learning has significantly advanced SR, achieving high-quality reconstructions with fine-grained details and realistic textures remains challenging, particularly at high upscaling factors. Recent approaches leveraging diffusion models have demonstrated promising results, yet they often struggle to balance perceptual quality with structural fidelity. In this work, we introduce ResQu, a novel SR framework that integrates a quaternion wavelet preprocessing framework with latent diffusion models, incorporating a new quaternion wavelet- and time-aware encoder. Unlike prior methods that simply apply wavelet transforms within diffusion models, our approach enhances the conditioning process by exploiting quaternion wavelet embeddings, which are dynamically integrated at different stages of denoising. We also leverage the generative priors of foundation models such as Stable Diffusion. Extensive experiments on domain-specific datasets demonstrate that our method achieves outstanding SR results, in many cases outperforming existing approaches in perceptual quality and standard evaluation metrics. The code will be available after the revision process.

Title: T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.00337
Pdf URL: https://arxiv.org/pdf/2505.00337
Copy Paste: [[2505.00337]] T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation(https://arxiv.org/abs/2505.00337)
Keywords: generation, generative
Abstract: Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.

Title: JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.00482
Pdf URL: https://arxiv.org/pdf/2505.00482
Copy Paste: [[2505.00482]] JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers(https://arxiv.org/abs/2505.00482)
Keywords: generation
Abstract: We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose: adaptive scheduling weights, which depend on the noise levels of each modality, and an unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as an alternative to conditional generation. The project page is available at this https URL.
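Summary: We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefits and strong image prior of a state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This joint distribution modeling is achieved through two simple yet effective techniques: adaptive scheduling weights, which depend on the noise level of each modality, and an unbalanced timestep sampling strategy. With these techniques, the model is trained across all noise levels for each modality, so JointDiT can naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, simply by controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. It also achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as an alternative to conditional generation. The project page is available at this https URL.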
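
Illustration (not from the paper): the abstract says that tasks are selected "by simply controlling the timestep of each branch". The sketch below shows one way such a selection could look. It assumes a normalized timestep where 1.0 means pure noise and 0.0 means a clean, given input; the names BranchTimesteps, initial_timesteps, T_NOISE and T_CLEAN are hypothetical and are not JointDiT's actual interface.

```python
from dataclasses import dataclass

# Assumed convention (hypothetical): 1.0 = fully noised, 0.0 = clean input.
T_NOISE = 1.0
T_CLEAN = 0.0


@dataclass(frozen=True)
class BranchTimesteps:
    """Starting timestep for the RGB branch and the depth branch."""
    rgb: float
    depth: float


def initial_timesteps(task: str) -> BranchTimesteps:
    """Map a task name to per-branch starting timesteps.

    A branch that starts at T_CLEAN is treated as a given condition;
    a branch that starts at T_NOISE is the one to be generated.
    """
    if task == "joint_generation":
        # Both modalities are generated from noise.
        return BranchTimesteps(rgb=T_NOISE, depth=T_NOISE)
    if task == "depth_estimation":
        # The image is given; only depth is denoised.
        return BranchTimesteps(rgb=T_CLEAN, depth=T_NOISE)
    if task == "depth_conditioned_generation":
        # The depth map is given; only the image is denoised.
        return BranchTimesteps(rgb=T_NOISE, depth=T_CLEAN)
    raise ValueError(f"unknown task: {task}")
```

Under these assumptions, initial_timesteps("depth_estimation") keeps the RGB branch clean and starts the depth branch from noise, so one network covers all three tasks without separate conditional models.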

Title: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00497
Pdf URL: https://arxiv.org/pdf/2505.00497
Copy Paste: [[2505.00497]] KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution(https://arxiv.org/abs/2505.00497)
Keywords: generation
Abstract: Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at this https URL.
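Summary: Lip synchronization, the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, in addition to the usual issues of talking-head generation (e.g., temporal consistency), lip synchronization presents significant new challenges, such as expression leakage from the input video and facial occlusions. These can severely affect real-world applications like automated dubbing, yet they are often neglected in existing work. To address these shortcomings, we present KeySync, a two-stage framework that solves the issue of temporal consistency while also incorporating solutions for leakage and occlusions through a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at this https URL.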

Title: Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks

Authors: Xinyu Wang, Jinbo Bi, Minghu Song
Subjects: cs.LG, cs.CE, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.00530
Pdf URL: https://arxiv.org/pdf/2505.00530
Copy Paste: [[2505.00530]] Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks(https://arxiv.org/abs/2505.00530)
Keywords: generation
Abstract: SMILES-based molecule generation has emerged as a powerful approach in drug discovery. Deep reinforcement learning (RL) using large language models (LLMs) has been incorporated into the molecule generation process to achieve a high matching score in terms of the likelihood of desired molecule candidates. However, a critical challenge in this approach is catastrophic forgetting during the RL phase, where knowledge such as molecule validity, which often exceeds 99% during pretraining, significantly deteriorates. Current RL algorithms applied in drug discovery, such as REINVENT, use prior models as anchors to retain pretraining knowledge, but these methods lack robust exploration mechanisms. To address these issues, we propose Partial SMILES Validation-PPO (PSV-PPO), a novel RL algorithm that incorporates real-time partial SMILES validation to prevent catastrophic forgetting while encouraging exploration. Unlike traditional RL approaches that validate molecule structures only after generating entire sequences, PSV-PPO performs stepwise validation at each auto-regressive step, evaluating not only the selected token candidate but also all potential branches stemming from the prior partial sequence. This enables early detection of invalid partial SMILES across all potential paths. As a result, PSV-PPO maintains high validity rates even during aggressive exploration of the vast chemical space. Our experiments on the PMO and GuacaMol benchmark datasets demonstrate that PSV-PPO significantly reduces the number of invalid generated structures while maintaining competitive exploration and optimization performance. While our work primarily focuses on maintaining validity, the framework of PSV-PPO can be extended in future research to incorporate additional forms of valuable domain knowledge, further enhancing reinforcement learning applications in drug discovery.
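Summary: SMILES-based molecule generation has emerged as a powerful approach in drug discovery. Deep reinforcement learning (RL) using large language models (LLMs) has been incorporated into the molecule generation process to achieve a high matching score in terms of the likelihood of desired molecule candidates. However, a critical challenge in this approach is catastrophic forgetting during the RL phase, where knowledge such as molecule validity, which often exceeds 99% during pretraining, deteriorates significantly. Current RL algorithms applied in drug discovery, such as REINVENT, use prior models as anchors to retain pretraining knowledge, but these methods lack robust exploration mechanisms. To address these issues, we propose Partial SMILES Validation-PPO (PSV-PPO), a novel RL algorithm that incorporates real-time partial SMILES validation to prevent catastrophic forgetting while encouraging exploration. Unlike traditional RL approaches that validate molecule structures only after generating entire sequences, PSV-PPO performs stepwise validation at each auto-regressive step, evaluating not only the selected token candidate but also all potential branches stemming from the prior partial sequence. This enables early detection of invalid partial SMILES across all potential paths. As a result, PSV-PPO maintains high validity rates even during aggressive exploration of the vast chemical space. Experiments on the PMO and GuacaMol benchmark datasets show that PSV-PPO significantly reduces the number of invalid generated structures while maintaining competitive exploration and optimization performance. Although the work focuses primarily on maintaining validity, the PSV-PPO framework can be extended in future research to incorporate other forms of valuable domain knowledge, further enhancing reinforcement learning applications in drug discovery.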
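
Illustration (not from the paper): the stepwise validation described above checks every candidate next token against the current partial sequence, rather than validating only the finished string. The sketch below is a deliberately tiny stand-in. It assumes a toy rule set that only tracks branch parentheses, which is far weaker than real partial SMILES validation; is_viable_prefix and valid_next_tokens are hypothetical names, not part of PSV-PPO.

```python
from typing import List, Optional


def is_viable_prefix(tokens: List[str]) -> bool:
    """Toy check: could this partial token sequence still become valid?

    Assumed rules (a small subset of SMILES branch syntax):
    - a branch "(" must follow an atom or a closed branch, never "(" or nothing;
    - ")" must close an open, non-empty branch.
    Unclosed branches are allowed, because the sequence is still partial.
    """
    open_branches = 0
    prev: Optional[str] = None
    for tok in tokens:
        if tok == "(":
            if prev is None or prev == "(":
                return False
            open_branches += 1
        elif tok == ")":
            if open_branches == 0 or prev == "(":
                return False
            open_branches -= 1
        prev = tok
    return True


def valid_next_tokens(prefix: List[str], vocab: List[str]) -> List[str]:
    """Evaluate every candidate branch from the prefix, not just one sample."""
    return [tok for tok in vocab if is_viable_prefix(prefix + [tok])]
```

With the vocabulary ["C", "N", "(", ")"], the prefix ["C", "("] admits only "C" and "N" under these toy rules, because "(" would nest an empty branch and ")" would close one. In an RL setting such a list could be used to mask or penalize invalid actions at each step; how PSV-PPO actually uses the validation signal is described in the paper, not here.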

Title: A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic

Authors: Muhammad Imran Zaman, Usama Ijaz Bajwa, Gulshan Saleem, Rana Hammad Raza
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.00534
Pdf URL: https://arxiv.org/pdf/2505.00534
Copy Paste: [[2505.00534]] A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic(https://arxiv.org/abs/2505.00534)
Keywords: generation
Abstract: Vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization as the number of network cameras continues to rise. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework utilizes Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Moreover, we leverage appropriate loss functions and distance measures to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The proposed framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Among these 46 camera streams, 40 are used for model training and validation, while the remaining six are utilized for model testing. The proposed framework achieves competitive performance with an IDF1 score of 0.8289, and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.
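Summary: As the number of network cameras continues to rise, vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework uses Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Appropriate loss functions and distance measures are used to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Of these 46 camera streams, 40 are used for model training and validation, and the remaining six are used for model testing. The framework achieves competitive performance, with an IDF1 score of 0.8289 and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.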
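
Illustration (not from the paper): the abstract mentions Non-Maximum Suppression for selecting target objects from overlapping detections. The sketch below is the standard greedy form of NMS in plain Python, shown only to make the step concrete. The box format (x1, y1, x2, y2) and the 0.5 IoU threshold are assumptions; the paper's actual settings are not stated in the abstract.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, with boxes (0, 0, 10, 10) and (1, 1, 11, 11) scored 0.9 and 0.8, the two overlap with an IoU of about 0.68, so only the higher-scoring one is kept, while a distant box such as (20, 20, 30, 30) survives.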

Title: Towards Autonomous Micromobility through Scalable Urban Simulation

Authors: Wayne Wu, Honglin He, Chaoyuan Zhang, Jack He, Seth Z. Zhao, Ran Gong, Quanyi Li, Bolei Zhou
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.00690
Pdf URL: https://arxiv.org/pdf/2505.00690
Copy Paste: [[2505.00690]] Towards Autonomous Micromobility through Scalable Urban Simulation(https://arxiv.org/abs/2505.00690)
Keywords: generation
Abstract: Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM - a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH - a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's strengths and limitations.
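Summary: Micromobility, which uses lightweight mobile machines that move through urban public spaces, such as delivery robots and mobility scooters, is emerging as a promising alternative to vehicular mobility. Current micromobility depends mostly on manual human operation (in person or by remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices offers a viable way to improve safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM, a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules, a Hierarchical Urban Generation pipeline, an Interactive Dynamics Generation strategy, and an Asynchronous Scene Sampling scheme, which improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH, a suite of essential tasks and benchmarks for gauging the capabilities of AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core agent skills: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot's strengths and limitations.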

Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.00703
Pdf URL: https://arxiv.org/pdf/2505.00703
Copy Paste: [[2505.00703]] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT(https://arxiv.org/abs/2505.00703)
Keywords: generation
Abstract: Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: this https URL
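Summary: Recent advances in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be used to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying these reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance, with a 13% improvement on T2I-CompBench and a 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: this https URL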
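
Illustration (not from the paper): BiCoT-GRPO builds on GRPO, in which several samples are generated for the same prompt and each sample's reward is normalized against the group's mean and standard deviation. The sketch below shows that normalization, together with one assumed way of combining an ensemble of reward scores (a plain average). The function names, the averaging rule, and the use of the population standard deviation are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def ensemble_reward(reward_scores: Sequence[float]) -> float:
    """Combine scores from several reward models (assumed: plain average)."""
    return mean(reward_scores)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each sample's reward against its own group.

    Samples that beat the group mean get a positive advantage,
    samples below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four images sampled for one prompt, each scored by two reward models.
per_sample_scores = [[0.2, 0.4], [0.6, 0.8], [0.5, 0.5], [0.9, 0.7]]
rewards = [ensemble_reward(s) for s in per_sample_scores]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, they sum to approximately zero, so no separate value network is needed to provide a baseline.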