2025-01-31

Title: Explainable Machine Learning: An Illustration of Kolmogorov-Arnold Network Model for Airfoil Lift Prediction

Authors: Sudhanva Kulkarni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.17896
Pdf URL: https://arxiv.org/pdf/2501.17896
Copy Paste: [[2501.17896]] Explainable Machine Learning: An Illustration of Kolmogorov-Arnold Network Model for Airfoil Lift Prediction(https://arxiv.org/abs/2501.17896)
Keywords: generation
Abstract: Data science has emerged as the fourth paradigm of scientific exploration. However, many machine learning models operate as black boxes, offering limited insight into the reasoning behind their predictions. This lack of transparency is one of the drawbacks in generating new knowledge from data. Recently, the Kolmogorov-Arnold Network (KAN) has been proposed as an alternative model that embeds explainable AI. This study demonstrates the potential of KAN for new scientific exploration. KAN, along with five other popular supervised machine learning models, is applied to the well-known problem of airfoil lift prediction in aerospace engineering. Standard data generated from an earlier study on 2900 different airfoils is used. KAN performed the best, with an R2 score of 96.17 percent on the test data, surpassing both the baseline model and the Multi Layer Perceptron. The explainability of KAN is shown by pruning and symbolizing the model, resulting in an equation for the coefficient of lift in terms of the input variables. The explainable information retrieved from the KAN model is found to be consistent with the known physics of lift generation by an airfoil, thus demonstrating its potential to aid in scientific exploration.
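
Illustration (not from the paper): the R2 score quoted in the abstract is the coefficient of determination, 1 - SS_res / SS_tot. A minimal sketch of that formula follows; the lift-coefficient values are made up for illustration and are not the paper's data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical lift coefficients, for illustration only.
cl_true = [0.2, 0.5, 0.9, 1.1]
cl_pred = [0.25, 0.45, 0.95, 1.05]
print(f"R2 = {r2_score(cl_true, cl_pred):.4f}")
```

A perfect prediction gives 1.0, and always predicting the mean of the targets gives 0.0.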

Title: Shared DIFF Transformer

Authors: Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Xiangju Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.17900
Pdf URL: https://arxiv.org/pdf/2501.17900
Copy Paste: [[2501.17900]] Shared DIFF Transformer(https://arxiv.org/abs/2501.17900)
Keywords: generation
Abstract: DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.
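
Illustration (not from the paper): the differential attention idea described in the abstract, taking the difference between two attention distributions so that shared noise cancels, can be sketched for a single query as below. The scaling factor `lam` and the toy scores are assumptions for illustration. The shared base matrix and low-rank updates of Shared DIFF Transformer are not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention(scores_a, scores_b, lam=1.0):
    """Subtract one attention distribution from another, scaled by lam."""
    attn_a = softmax(scores_a)
    attn_b = softmax(scores_b)
    return [a - lam * b for a, b in zip(attn_a, attn_b)]

# Hypothetical scores for one query over four keys; key 2 is the relevant one.
scores_a = [1.0, 1.0, 4.0, 1.0]
scores_b = [1.0, 1.0, 1.0, 1.0]
weights = differential_attention(scores_a, scores_b, lam=0.5)
print([round(w, 3) for w in weights])
```

When both score vectors are identical and `lam` is 1, every weight is zero, which is the common-mode cancellation that the differential amplifier analogy refers to.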

Title: Generative AI for Vision: A Comprehensive Study of Frameworks and Applications

Authors: Fouad Bousetouane
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18033
Pdf URL: https://arxiv.org/pdf/2501.18033
Copy Paste: [[2501.18033]] Generative AI for Vision: A Comprehensive Study of Frameworks and Applications(https://arxiv.org/abs/2501.18033)
Keywords: generation, generative
Abstract: Generative AI is transforming image synthesis, enabling the creation of high-quality, diverse, and photorealistic visuals across industries like design, media, healthcare, and autonomous systems. Advances in techniques such as image-to-image translation, text-to-image generation, domain transfer, and multimodal alignment have broadened the scope of automated visual content creation, supporting a wide spectrum of applications. These advancements are driven by models like Generative Adversarial Networks (GANs), conditional frameworks, and diffusion-based approaches such as Stable Diffusion. This work presents a structured classification of image generation techniques based on the nature of the input, organizing methods by input modalities like noisy vectors, latent representations, and conditional inputs. We explore the principles behind these models, highlight key frameworks including DALL-E, ControlNet, and DeepSeek Janus-Pro, and address challenges such as computational costs, data biases, and output alignment with user intent. By offering this input-centric perspective, this study bridges technical depth with practical insights, providing researchers and practitioners with a comprehensive resource to harness generative AI for real-world applications.

Title: FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models

Authors: Spencer Mateega, Carlos Georgescu, Danny Tang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2501.18062
Pdf URL: https://arxiv.org/pdf/2501.18062
Copy Paste: [[2501.18062]] FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models(https://arxiv.org/abs/2501.18062)
Keywords: generation
Abstract: FinanceQA is a testing suite that evaluates LLMs' performance on complex numerical financial analysis tasks that mirror real-world investment work. Despite recent advances, current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks that mimic on-the-job analyses at hedge funds, private equity firms, investment banks, and other financial institutions. The primary challenges include hand-spreading metrics, adhering to standard accounting and corporate valuation conventions, and performing analysis under incomplete information - particularly in multi-step tasks requiring assumption generation. This performance gap highlights the disconnect between existing LLM capabilities and the demands of professional financial analysis that are inadequately tested by current testing architectures. Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI's fine-tuning API. FinanceQA is publicly released at this https URL.

Title: LLMs can see and hear without any training

Authors: Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.18096
Pdf URL: https://arxiv.org/pdf/2501.18096
Copy Paste: [[2501.18096]] LLMs can see and hear without any training(https://arxiv.org/abs/2501.18096)
Keywords: generation
Abstract: We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
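
Illustration (not from the paper): the generate, score, and feed-back loop described in the abstract can be sketched as a gradient-free search. Everything below is a toy stand-in: `toy_generate` plays the role of the LLM and `toy_score` the role of a multimodal scorer, and both names, the vocabulary, and the parameters `steps` and `keep` are hypothetical.

```python
import random

def iterative_solve(generate, score, seeds, steps=20, keep=3):
    """Repeatedly score candidates and feed the best ones back to the generator."""
    pool = list(seeds)
    for _ in range(steps):
        best = sorted(pool, key=score, reverse=True)[:keep]
        pool = best + generate(best)  # survivors stay in the pool
    return max(pool, key=score)

# Toy stand-ins (hypothetical): a real system would call an LLM and a scorer model.
TARGET = {"dog", "beach", "running"}
VOCAB = ["dog", "cat", "beach", "city", "running", "sleeping"]
rng = random.Random(0)

def toy_score(caption):
    """Count how many target words the caption contains."""
    return len(TARGET & set(caption.split()))

def toy_generate(best):
    """Propose one extended caption per surviving candidate."""
    return [f"{c} {rng.choice(VOCAB)}" for c in best]

seeds = ["a photo", "a cat"]
result = iterative_solve(toy_generate, toy_score, seeds)
print(result, toy_score(result))
```

Because the top candidates are carried over each round, the best score in the pool never decreases from one iteration to the next.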

Title: Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss

Authors: Wenshuo Chen, Haozhe Jia, Songning Lai, Keming Wu, Hongru Xiao, Lijie Hu, Yutao Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18232
Pdf URL: https://arxiv.org/pdf/2501.18232
Copy Paste: [[2501.18232]] Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss(https://arxiv.org/abs/2501.18232)
Keywords: generation
Abstract: Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the **semantic planning stage** and the **fine-grained improving stage**. To address these phases effectively, we propose **Fre**quency **e**nhanced **t**ext-**to**-**m**otion diffusion model (**Free-T2M**), incorporating stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from **0.189** to **0.051**, establishing a new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.

Title: MAMS: Model-Agnostic Module Selection Framework for Video Captioning

Authors: Sangho Lee, Il Yong Chun, Hogun Park
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.18269
Pdf URL: https://arxiv.org/pdf/2501.18269
Copy Paste: [[2501.18269]] MAMS: Model-Agnostic Module Selection Framework for Video Captioning(https://arxiv.org/abs/2501.18269)
Keywords: generation
Abstract: Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When a limited number of frames are extracted, important frames with essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes consecutive frames, potentially causing redundancy in visual tokens extracted from consecutive video frames. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning that has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.

Title: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis

Authors: Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C Yao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.18310
Pdf URL: https://arxiv.org/pdf/2501.18310
Copy Paste: [[2501.18310]] Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis(https://arxiv.org/abs/2501.18310)
Keywords: generation
Abstract: The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls the method, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency through equipping proof-generation LLMs with automation methods in different granularities via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version), setting a new SOTA across all proof languages with a total sample budget of only 2100. Our code is available at this https URL.

Title: A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

Authors: Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2501.18324
Pdf URL: https://arxiv.org/pdf/2501.18324
Copy Paste: [[2501.18324]] A Video-grounded Dialogue Dataset and Metric for Event-driven Activities(https://arxiv.org/abs/2501.18324)
Keywords: generation
Abstract: This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities that require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct displays a notably challenging characteristic due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.

Title: State Stream Transformer (SST) : Emergent Metacognitive Behaviours Through Latent State Persistence

Authors: Thea Aviss
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.18356
Pdf URL: https://arxiv.org/pdf/2501.18356
Copy Paste: [[2501.18356]] State Stream Transformer (SST) : Emergent Metacognitive Behaviours Through Latent State Persistence(https://arxiv.org/abs/2501.18356)
Keywords: generation
Abstract: We introduce the State Stream Transformer (SST), a novel LLM architecture that reveals emergent reasoning behaviours and capabilities latent in pretrained weights through addressing a fundamental limitation in traditional transformer models: the lack of latent computational continuity across autoregressive generations in the state space. SST introduces a sliding window latent state (FFN) cache with weighted decay that maintains and evolves persistent latent processes throughout autoregressive generations. Through controlled experiments comparing base and SST architectures using the same frozen weights, we demonstrate that this architectural modification alone enables enhanced reasoning capabilities which appear best explained by some form of potential higher-order processing, as evidenced by emergent metacognitive behaviours. These behaviours persist under controlled conditions designed to eliminate confounding factors such as stochastic variation or learned response patterns. Analysis of latent state distributions and processing dynamics provides evidence that it is solely the 'state stream' that is responsible for these phenomena. In quantitative evaluations, the SST achieves substantial performance improvements over the base model on two reasoning benchmarks, reaching 89.01% accuracy on GSM-8K (0-shot) and 91.04% on ARC Challenge (0-shot CoT). These findings indicate that persistent computation in the latent state space enables fundamentally different information processing and internal reasoning strategies, with implications for our understanding of artificial intelligence systems.

Title: MatIR: A Hybrid Mamba-Transformer Image Restoration Model

Authors: Juan Wen (1 and 2), Weiyan Hou (1), Luc Van Gool (2 and 3 and 4), Radu Timofte (5) ((1) Zhengzhou University, (2) ETH Zurich, (3) KU Leuven, (4) INSAIT, Sofia University, (5) University of Wurzburg)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18401
Pdf URL: https://arxiv.org/pdf/2501.18401
Copy Paste: [[2501.18401]] MatIR: A Hybrid Mamba-Transformer Image Restoration Model(https://arxiv.org/abs/2501.18401)
Keywords: restoration
Abstract: In recent years, Transformers-based models have made significant progress in the field of image restoration by leveraging their inherent ability to capture complex contextual features. Recently, Mamba models have made a splash in the field of computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capabilities. To overcome the limitations of these two models, we propose a Mamba-Transformer hybrid image restoration model called MatIR. Specifically, MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features, thereby taking full advantage of the advantages of the two architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long sequence data. In the Transformer module, we combine triangular window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.

Title: SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Authors: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18427
Pdf URL: https://arxiv.org/pdf/2501.18427
Copy Paste: [[2501.18427]] SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer(https://arxiv.org/abs/2501.18427)
Keywords: generation
Abstract: This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
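
Illustration (not from the paper): the inference-time scaling described in item (3) of the abstract relies on repeated sampling. The abstract does not say how the final sample is chosen, so picking the highest-scoring one with a scoring function is an assumption here. The sampler and scorer below are toy stand-ins with hypothetical names.

```python
import random

def best_of_n(sample, score, n):
    """Draw n samples and keep the one the scorer rates highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def make_toy_sampler(seed):
    """Return a sampler whose outputs stand in for generated images."""
    rng = random.Random(seed)
    return lambda: rng.random()

def toy_score(x):
    """Stand-in for a text-image alignment score."""
    return x

small = best_of_n(make_toy_sampler(0), toy_score, n=4)
large = best_of_n(make_toy_sampler(0), toy_score, n=32)
print(small, large)
```

With the same seed, the first four draws are shared, so the 32-sample result can only match or beat the 4-sample result. That is the sense in which extra compute is traded for output quality.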

Title: CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization

Authors: Yanxia Deng, Aozhong Zhang, Naigang Wang, Selcuk Gurses, Zi Yang, Penghang Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.18475
Pdf URL: https://arxiv.org/pdf/2501.18475
Copy Paste: [[2501.18475]] CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization(https://arxiv.org/abs/2501.18475)
Keywords: generation
Abstract: Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks, particularly in scenarios with limited computational resources. However, applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights. In this paper, we introduce CLoQ (Calibrated LoRA initialization for Quantized LLMs), a simplistic initialization strategy designed to overcome these challenges. Our approach focuses on minimizing the layer-wise discrepancy between the original LLM and its quantized counterpart with LoRA components during initialization. By leveraging a small calibration dataset, CLoQ quantizes a pre-trained LLM and determines the optimal LoRA components for each layer, ensuring a strong foundation for subsequent fine-tuning. A key contribution of this work is a novel theoretical result that enables the accurate and closed-form construction of these optimal LoRA components. We validate the efficacy of CLoQ across multiple tasks such as language generation, arithmetic reasoning, and commonsense reasoning, demonstrating that it consistently outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at ultra low-bit widths.
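
Illustration (not from the paper): one way to write the layer-wise discrepancy that the abstract says is minimized at initialization. The symbols are chosen here and are assumptions: $X \in \mathbb{R}^{N \times m}$ holds calibration activations, $W \in \mathbb{R}^{m \times n}$ is the original weight, $Q \in \mathbb{R}^{m \times n}$ is its quantized counterpart, and $A$, $B$ are the rank-$r$ LoRA factors.

```latex
\min_{A \in \mathbb{R}^{m \times r},\; B \in \mathbb{R}^{n \times r}}
\left\| X W - X \left( Q + A B^{\top} \right) \right\|_F^2
```

Under this reading, the LoRA factors are initialized to absorb the quantization error $W - Q$ as seen through the calibration data, rather than starting from zero.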

Title: HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Super-Resolution

Authors: Shi Chen, Lefei Zhang, Liangpei Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2501.18500
Pdf URL: https://arxiv.org/pdf/2501.18500
Copy Paste: [[2501.18500]] HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Super-Resolution(https://arxiv.org/abs/2501.18500)
Keywords: restoration, super-resolution
Abstract: Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code will be available soon.

Title: Integrating Spatial and Frequency Information for Under-Display Camera Image Restoration

Authors: Kyusu Ahn, Jinpyo Kim, Chanwoo Park, JiSoo Kim, Jaejin Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18517
Pdf URL: https://arxiv.org/pdf/2501.18517
Copy Paste: [[2501.18517]] Integrating Spatial and Frequency Information for Under-Display Camera Image Restoration(https://arxiv.org/abs/2501.18517)
Keywords: restoration
Abstract: Under-Display Camera (UDC) houses a digital camera lens under a display panel. However, UDC introduces complex degradations such as noise, blur, decrease in transmittance, and flare. Despite the remarkable progress, previous research on UDC mainly focuses on eliminating diffraction in the spatial domain and rarely explores its potential in the frequency domain. It is essential to consider both the spatial and frequency domains effectively. For example, degradations, such as noise and blur, can be addressed by local information (e.g., CNN kernels in the spatial domain). At the same time, tackling flares may require leveraging global information (e.g., the frequency domain). In this paper, we revisit the UDC degradations in the Fourier space and figure out intrinsic frequency priors that imply the presence of the flares. Based on this observation, we propose a novel multi-level DNN architecture called SFIM. It efficiently restores UDC-distorted images by integrating local and global (the collective contribution of all points in the image) information. The architecture exploits CNNs to capture local information and FFT-based models to capture global information. SFIM comprises a spatial domain block (SDB), a Frequency Domain Block (FDB), and an Attention-based Multi-level Integration Block (AMIB). Specifically, SDB focuses more on detailed textures such as noise and blur, FDB emphasizes irregular texture loss in extensive areas such as flare, and AMIB enables effective cross-domain interaction. SFIM's superior performance over state-of-the-art approaches is demonstrated through rigorous quantitative and qualitative assessments across three UDC benchmarks.

Title: UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

Authors: Kyusu Ahn, JiSoo Kim, Sangik Lee, HyunGyu Lee, Byeonghyun Ko, Chanwoo Park, Jaejin Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.18545
Pdf URL: https://arxiv.org/pdf/2501.18545
Copy Paste: [[2501.18545]] UDC-VIT: A Real-World Video Dataset for Under-Display Cameras(https://arxiv.org/abs/2501.18545)
Keywords: restoration
Abstract: Under Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare issues. Tackling such issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos have yet to be significantly explored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, only UDC-VIT exclusively includes human motions that target facial recognition. We propose a video-capturing system to simultaneously acquire non-degraded and UDC-degraded videos of the same scene. Then, we align a pair of captured videos frame by frame, using discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy concerning PSNR, SSIM, and LPIPS scores. UDC-VIT enables further exploration in the UDC video restoration and offers better insights into the challenge. UDC-VIT is available at our project site.
摘要：显示屏下摄像头 (UDC) 是一种先进的成像系统，它将数码相机镜头置于显示面板下方，有效地隐藏了摄像头。然而，显示面板会严重降低所捕获图像或视频的质量，导致低透射率、模糊、噪声和眩光问题。由于 UDC 的退化复杂，包括各种眩光模式，解决这些问题具有挑战性。尽管对 UDC 图像及其恢复模型进行了广泛的研究，但对视频的研究尚未得到深入探索。虽然存在两个 UDC 视频数据集，但它们主要关注不切实际或合成的 UDC 退化，而不是现实世界的 UDC 退化。在本文中，我们提出了一个名为 UDC-VIT 的真实世界 UDC 视频数据集。与现有数据集不同，只有 UDC-VIT 专门包含针对面部识别的人体运动。我们提出了一个视频捕获系统，用于同时获取同一场景的未退化和 UDC 退化视频。然后，我们使用离散傅里叶变换 (DFT) 逐帧对齐一对捕获的视频。我们将 UDC-VIT 与六个代表性 UDC 静态图像数据集和两个现有的 UDC 视频数据集进行了比较。使用六个深度学习模型，我们将 UDC-VIT 与现有的合成 UDC 视频数据集进行了比较。结果表明，在早期合成 UDC 视频数据集上训练的模型无效，因为它们没有反映 UDC 降级视频的实际特征。我们还通过评估与 PSNR、SSIM 和 LPIPS 分数相关的人脸识别准确度来证明有效 UDC 恢复的重要性。UDC-VIT 使进一步探索 UDC 视频恢复成为可能，并提供了对挑战的更好见解。UDC-VIT 可在我们的项目网站上找到。
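Illustrative sketch: the abstract says paired frames are aligned with the DFT but does not describe the procedure. One common DFT-based approach is phase correlation, shown below for integer translations between two grayscale frames. The function names `estimate_shift` and `align_frame` are hypothetical, and this is an assumption about the general technique, not the authors' pipeline.

```python
import numpy as np

def estimate_shift(reference, degraded):
    """Estimate the integer (dy, dx) translation of `degraded` relative to `reference`.

    Uses phase correlation: the normalized cross-power spectrum of two
    translated frames is a pure phase ramp whose inverse DFT peaks at the shift.
    """
    f_ref = np.fft.fft2(reference)
    f_deg = np.fft.fft2(degraded)
    cross_power = f_deg * np.conj(f_ref)
    cross_power /= np.abs(cross_power) + 1e-12  # keep phase only; epsilon avoids 0/0
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint correspond to negative (wrapped-around) shifts.
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape)]
    return tuple(int(v) for v in shift)

def align_frame(reference, degraded):
    """Circularly shift `degraded` so that it lines up with `reference`."""
    dy, dx = estimate_shift(reference, degraded)
    return np.roll(degraded, shift=(-dy, -dx), axis=(0, 1))
```

Applied per frame pair, this recovers a circular integer shift exactly on clean data. Real UDC pairs would also involve sub-pixel offsets, photometric differences, and border handling, which this sketch ignores.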

Title: Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH

Authors: Evgenii Evstafev
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.18576
Pdf URL: https://arxiv.org/pdf/2501.18576
Copy Paste: [[2501.18576]] Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH(https://arxiv.org/abs/2501.18576)
Keywords: generation
Abstract: This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1's architecture, known for its reliance on token-based reasoning, can achieve accurate solutions through a multi-step process. The study compares DeepSeek R1 with four other models (gemini-1.5-flash-8b, gpt-4o-mini-2024-07-18, llama3.1:8b, and mistral-8b-latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token-intensive approach. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task-specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.
摘要：本研究调查了 DeepSeek R1 语言模型在 MATH 数据集中得出的 30 个具有挑战性的数学问题上的表现，这些问题之前被证明是其他模型在时间限制下无法解决的。与之前的研究不同，这项研究消除了时间限制，以探索 DeepSeek R1 的架构（以依赖基于 token 的推理而闻名）是否可以通过多步骤过程获得准确的解决方案。该研究在 11 种温度设置下将 DeepSeek R1 与其他四种模型（gemini-1.5-flash-8b、gpt-4o-mini-2024-07-18、llama3.1:8b 和 mistral-8b-latest）进行了比较。结果表明，DeepSeek R1 在这些复杂问题上实现了卓越的准确性，但生成的 token 明显多于其他模型，证实了其 token 密集型方法。研究结果强调了使用大型语言模型解决数学问题时准确性和效率之间的权衡：虽然 DeepSeek R1 在准确性方面表现出色，但它对大量标记生成的依赖可能不适合需要快速响应的应用程序。该研究强调了在选择 LLM 时考虑特定任务要求的重要性，并强调了温度设置在优化性能中的作用。
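Illustrative sketch: the accuracy versus token-count comparison described above can be tabulated per model and temperature from a log of individual runs. The record schema (`model`, `temperature`, `correct`, `tokens`) and the function name `summarize` are hypothetical; the abstract does not describe the study's actual logging format or its temperature values.

```python
from collections import defaultdict

def summarize(records):
    """Aggregate accuracy and mean generated tokens per (model, temperature).

    `records` is an iterable of dicts with the hypothetical keys
    'model', 'temperature', 'correct' (bool), and 'tokens' (int).
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "tokens": 0})
    for r in records:
        t = totals[(r["model"], r["temperature"])]
        t["n"] += 1
        t["correct"] += int(r["correct"])
        t["tokens"] += r["tokens"]
    return {
        key: {
            "accuracy": t["correct"] / t["n"],
            "mean_tokens": t["tokens"] / t["n"],
        }
        for key, t in totals.items()
    }
```

Reading the two columns side by side for each model exposes the trade-off the abstract describes: higher accuracy bought with more generated tokens.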

Title: Diffusion Autoencoders are Scalable Image Tokenizers

Authors: Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.18593
Pdf URL: https://arxiv.org/pdf/2501.18593
Copy Paste: [[2501.18593]] Diffusion Autoencoders are Scalable Image Tokenizers(https://arxiv.org/abs/2501.18593)
Keywords: generation, generative
Abstract: Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, and thus require a complex training recipe that depends on non-trivially balancing different losses and on pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer, which is supervised. DiTo achieves quality competitive with or better than the state of the art in image reconstruction and downstream image generation tasks.
摘要：将图像标记为紧凑的视觉表示是学习高效、高质量图像生成模型的关键步骤。我们提出了一个简单的扩散标记器 (DiTo)，它可以为图像生成模型学习紧凑的视觉表示。我们的主要见解是，单个学习目标，扩散 L2 损失，可用于训练可扩展的图像标记器。由于扩散已经广泛用于图像生成，我们的见解大大简化了训练此类标记器的过程。相比之下，目前最先进的标记器依赖于经验发现的启发式和损失组合，因此需要一种复杂的训练方法，该方法依赖于非平凡地平衡不同的损失和预训练的监督模型。我们展示了设计决策以及理论基础，使我们能够扩展 DiTo 以学习具有竞争力的图像表示。我们的结果表明，DiTo 是一种更简单、可扩展且自监督的替代方案，可替代目前最先进的受监督的图像标记器。DiTo 在图像重建和下游图像生成任务中实现了与最先进的技术相媲美或更好的质量。
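Illustrative sketch: the abstract names a single "diffusion L2 loss" but does not give its parameterization. The code below assumes a DDPM-style noise-prediction objective in which a decoder, conditioned on a latent `z` from the tokenizer's encoder, predicts the noise added to the image. The names `diffusion_l2_loss`, `denoiser`, and `alpha_bar` are hypothetical, and DiTo's actual formulation may differ.

```python
import numpy as np

def diffusion_l2_loss(x0, z, denoiser, alpha_bar, rng):
    """Mean squared error between true and predicted noise (assumed objective).

    x0        : clean images, shape (batch, ...)
    z         : latent codes for x0 (passed through to the denoiser unchanged)
    denoiser  : callable (x_t, t, z) -> predicted noise, same shape as x0
    alpha_bar : 1-D array of cumulative signal fractions, one per timestep
    rng       : numpy.random.Generator
    """
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)  # one random timestep per sample
    a = alpha_bar[t].reshape(batch, *([1] * (x0.ndim - 1)))  # broadcast over image dims
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise  # forward (noising) process
    predicted = denoiser(x_t, t, z)
    return float(np.mean((predicted - noise) ** 2))
```

In a tokenizer setting, gradients of this one loss would flow through both the denoiser and the encoder that produced `z`, which is what would let a single objective train the whole tokenizer. An autodiff framework would be needed for actual training; NumPy is used here only to make the computation explicit.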