2025-05-22

Title: Addressing the Challenges of Planning Language Generation

Authors: Prabhu Prakash Kagitha, Andrew Zhu, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14763
Pdf URL: https://arxiv.org/pdf/2505.14763
Copy Paste: [[2505.14763]] Addressing the Challenges of Planning Language Generation(https://arxiv.org/abs/2505.14763)
Keywords: llm
Abstract: Using LLMs to generate formal planning languages such as PDDL, which invoke symbolic solvers to deterministically derive plans, has been shown to outperform generating plans directly. While this success has been limited to closed-source models or particular LLM pipelines, we design and evaluate 8 different PDDL generation pipelines with open-source models under 50 billion parameters previously shown to be incapable of this task. We find that intuitive approaches such as using a high-resource language wrapper or constrained decoding with grammar decrease performance, yet inference-time scaling approaches such as revision with feedback from the solver and plan validator more than double the performance.
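Summary: Having LLMs generate formal planning languages such as PDDL, which symbolic solvers then use to derive plans deterministically, has been shown to outperform direct plan generation, but so far only with closed-source models or particular pipelines. This work designs and evaluates 8 PDDL generation pipelines with open-source models under 50 billion parameters. Intuitive approaches such as a high-resource language wrapper or grammar-constrained decoding reduce performance, while inference-time scaling through revision with solver and plan-validator feedback more than doubles it.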
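The revision-with-feedback idea from this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: `generate`, `solve`, and `validate` are placeholder callables standing in for an LLM, a planner, and a plan validator, and the toy stubs exist only so the loop runs. It is not the authors' pipeline.

```python
# Sketch of a generate -> solve -> validate -> revise loop.
# All callables are hypothetical stand-ins, not APIs from the paper.

def revise_until_valid(task, generate, solve, validate, max_rounds=3):
    """Generate PDDL and retry with solver/validator feedback until a plan validates.

    Returns (plan, round_number) on success, or (None, max_rounds) on failure.
    """
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        pddl = generate(task, feedback)           # LLM call (assumed interface)
        plan, solver_msg = solve(pddl)            # planner call (assumed interface)
        if plan is None:
            feedback = "solver: " + solver_msg    # feed the solver error back
            continue
        ok, validator_msg = validate(plan)        # validator call (assumed interface)
        if ok:
            return plan, round_idx
        feedback = "validator: " + validator_msg  # feed the validation error back
    return None, max_rounds


# Toy stubs so the sketch runs end to end.
def toy_generate(task, feedback):
    # The first draft has unbalanced parentheses; the revision fixes it.
    return "(define (domain toy))" if feedback else "(define (domain toy)"

def toy_solve(pddl):
    if pddl.count("(") != pddl.count(")"):
        return None, "unbalanced parentheses"
    return "(move a b)", "ok"

def toy_validate(plan):
    return True, "plan reaches the goal"


plan, rounds = revise_until_valid("stack blocks", toy_generate, toy_solve, toy_validate)
print(plan, rounds)
```

In this toy run the first draft fails in the solver, and the second draft succeeds once the error message is passed back to the generator.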

Title: Automated Journalistic Questions: A New Method for Extracting 5W1H in French

Authors: Richard Khoury, Maxence Verhaverbeke, Julie A. Gramaccia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14804
Pdf URL: https://arxiv.org/pdf/2505.14804
Copy Paste: [[2505.14804]] Automated Journalistic Questions: A New Method for Extracting 5W1H in French(https://arxiv.org/abs/2505.14804)
Keywords: language model, gpt
Abstract: The 5W1H questions - who, what, when, where, why and how - are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.
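Summary: The 5W1H questions (who, what, when, where, why, and how) help journalists describe events clearly and systematically, and answering them is a prerequisite for summarization, clustering, and news aggregation. The paper presents the first automated pipeline for extracting 5W1H information from French news articles, together with a corpus of 250 Quebec news articles annotated with 5W1H answers by four human annotators. The pipeline performs as well as GPT-4o on this task.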

Title: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Authors: Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14810
Pdf URL: https://arxiv.org/pdf/2505.14810
Copy Paste: [[2505.14810]] Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models(https://arxiv.org/abs/2505.14810)
Keywords: language model, llm
Abstract: Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at this https URL.
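Summary: Instruction following is essential for aligning LLMs with user intent, yet how well reasoning-oriented models obey natural language instructions remains underexplored. The paper introduces MathIF, a benchmark for instruction following in mathematical reasoning tasks. Models that reason more effectively often struggle to comply with user directives: tuning on distilled long chains-of-thought or reasoning-oriented reinforcement learning tends to degrade instruction adherence, especially as generation length grows. Simple interventions partially recover obedience, but at the cost of reasoning performance. Code and data are released.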

Title: Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes

Authors: Mingyang Wang, Lukas Lange, Heike Adel, Yunpu Ma, Jannik Strötgen, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14815
Pdf URL: https://arxiv.org/pdf/2505.14815
Copy Paste: [[2505.14815]] Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes(https://arxiv.org/abs/2505.14815)
Keywords: language model, prompt, chain-of-thought
Abstract: Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model's internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for controlling reasoning languages to build more interpretable and adaptable RLMs.
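Summary: Reasoning language models (RLMs) sometimes produce reasoning steps containing tokens from languages other than the prompt language. This first systematic study of language mixing examines its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas. The choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. The script composition of reasoning traces closely matches that of the model's internal representations, suggesting that language mixing reflects latent processing preferences.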
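One simple way to picture the "script composition" of a reasoning trace is to count characters per script. The sketch below is an illustration under stated assumptions, not the paper's measurement: it uses only the Python standard library, infers a coarse script label from the Unicode character name, and the function name and example trace are made up.

```python
import unicodedata
from collections import Counter


def script_composition(text):
    """Return the share of alphabetic characters per coarse script label.

    Assumption: the Unicode character-name prefix ("LATIN", "CJK") is a
    good-enough proxy for the script; everything else is lumped as "Other".
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            counts["Han"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["Other"] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


# A made-up mixed-language reasoning step: 3 Latin letters, 5 Han characters.
shares = script_composition("So x = 3, 因此 答案 是 3")
print(shares)
```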

Title: WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Authors: Leon Lin, Jun Zheng, Haidong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14818
Pdf URL: https://arxiv.org/pdf/2505.14818
Copy Paste: [[2505.14818]] WebNovelBench: Placing LLM Novelists on the Web Novel Distribution(https://arxiv.org/abs/2505.14818)
Keywords: language model, llm
Abstract: Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
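Summary: Robustly evaluating long-form storytelling by LLMs is hard because existing benchmarks often lack scale, diversity, or objective measures. WebNovelBench uses a dataset of over 4,000 Chinese web novels and frames evaluation as synopsis-to-story generation. Eight narrative quality dimensions are scored automatically with an LLM-as-Judge approach, aggregated with Principal Component Analysis, and mapped to a percentile rank against human-authored works. The benchmark separates human-written masterpieces, popular web novels, and LLM-generated content, and is used to rank 24 state-of-the-art LLMs.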
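The last step described above, mapping an aggregated score to a percentile rank against human-authored works, can be sketched as follows. This is a minimal illustration only: the PCA aggregation is assumed to have already produced one scalar per work, the tie-handling rule (counting reference scores less than or equal to the candidate) is an assumption, and the numbers are made up.

```python
from bisect import bisect_right


def percentile_rank(score, human_scores):
    """Percentage of human-authored reference scores that are <= score."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Made-up aggregated scores for five human-authored novels.
human = [0.2, 0.4, 0.5, 0.7, 0.9]
rank = percentile_rank(0.6, human)
print(rank)  # 3 of the 5 reference scores are <= 0.6
```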

Title: Tracing Multilingual Factual Knowledge Acquisition in Pretraining

Authors: Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14824
Pdf URL: https://arxiv.org/pdf/2505.14824
Copy Paste: [[2505.14824]] Tracing Multilingual Factual Knowledge Acquisition in Pretraining(https://arxiv.org/abs/2505.14824)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts -- an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at this https URL.
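Summary: LLMs can recall multilingual factual knowledge from their pretraining data, but most studies look only at the final model. Using OLMo-7B as a case study, this work traces how factual recall and crosslingual consistency evolve during pretraining. Both improve over time for most languages, driven mainly by fact frequency in the pretraining corpus regardless of language. Some low-frequency facts in non-English languages are still recalled correctly thanks to crosslingual transfer from English, an effect that emerges mostly early in pretraining. The two pathways are (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically confined to relation types involving named entities. Code and data are released.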

Title: Text Generation Beyond Discrete Token Sampling

Authors: Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14827
Pdf URL: https://arxiv.org/pdf/2505.14827
Copy Paste: [[2505.14827]] Text Generation Beyond Discrete Token Sampling(https://arxiv.org/abs/2505.14827)
Keywords: llm
Abstract: In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
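Summary: In standard autoregressive generation, an LLM samples a discrete token from the next-token distribution and then discards the distribution. Mixture of Inputs (MoI) is a training-free method that instead builds the next input by blending the sampled token with the discarded distribution: a Bayesian estimate treats the distribution as the prior and the sampled token as the observation, and the continuous posterior expectation replaces the usual one-hot vector. On mathematical reasoning, code generation, and PhD-level QA, MoI consistently improves QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B with no additional training and negligible computational overhead.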
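The blending step can be pictured with a toy calculation. The abstract does not give the estimator, so the sketch below assumes a Dirichlet prior with concentration `beta * probs` and a single observed count for the sampled token; this gives the posterior mean `(beta * p_i + 1[i == sampled]) / (beta + 1)`. The parameter `beta`, the function names, and the tiny embedding table are all hypothetical, and the paper's actual formulation may differ.

```python
def posterior_weights(probs, sampled_idx, beta=1.0):
    """Posterior-mean token weights under an ASSUMED Dirichlet-multinomial model.

    Prior: Dirichlet(beta * probs). Observation: one count for the sampled token.
    """
    return [
        (beta * p + (1.0 if i == sampled_idx else 0.0)) / (beta + 1.0)
        for i, p in enumerate(probs)
    ]


def mixed_input(probs, sampled_idx, embeddings, beta=1.0):
    """Blend token embeddings with the posterior weights instead of a one-hot pick."""
    weights = posterior_weights(probs, sampled_idx, beta)
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings (made up).
probs = [0.5, 0.25, 0.25]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = posterior_weights(probs, sampled_idx=1)
vector = mixed_input(probs, 1, embeddings)
print(weights, vector)
```

With `beta` close to 0 the weights approach the usual one-hot vector, and with large `beta` they approach the raw distribution, so in this sketch the parameter controls how much of the discarded distribution is kept.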

Title: SEPS: A Separability Measure for Robust Unlearning in LLMs

Authors: Wonje Jeung, Sangyeon Yoon, Albert No
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14832
Pdf URL: https://arxiv.org/pdf/2505.14832
Copy Paste: [[2505.14832]] SEPS: A Separability Measure for Robust Unlearning in LLMs(https://arxiv.org/abs/2505.14832)
Keywords: language model, llm, prompt
Abstract: Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model's ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
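Summary: Machine unlearning aims to remove targeted knowledge from LLMs while retaining essential information. Existing metrics test forget and retain queries in isolation, although in practice both often appear in the same prompt. SEPS is an evaluation framework that measures a model's ability to forget and retain information within a single prompt. Experiments on three benchmarks reveal two failure modes: (1) untargeted unlearning erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios and fails catastrophically on multiple queries. The proposed Mixed Prompt (MP) unlearning integrates forget and retain queries into one training objective and stays robust with up to eight mixed queries in a single prompt.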

Title: A Comparative Study of Large Language Models and Human Personality Traits

Authors: Wang Jiaqi, Wang bo, Guo fa, Cheng cheng, Yang li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14845
Pdf URL: https://arxiv.org/pdf/2505.14845
Copy Paste: [[2505.14845]] A Comparative Study of Large Language Models and Human Personality Traits(https://arxiv.org/abs/2505.14845)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation, becoming active participants in social and cognitive domains. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality, focusing on the applicability of conventional personality assessment tools. A behavior-based approach was used across three empirical studies. Study 1 examined test-retest stability and found that LLMs show higher variability and are more input-sensitive than humans, lacking long-term stability. Based on this, we propose the Distributed Personality Framework, conceptualizing LLM traits as dynamic and input-driven. Study 2 analyzed cross-variant consistency in personality measures and found LLMs' responses were highly sensitive to item wording, showing low internal consistency compared to humans. Study 3 explored personality retention during role-playing, showing LLM traits are shaped by prompt and parameter settings. These findings suggest that LLMs express fluid, externally dependent personality patterns, offering insights for constructing LLM-specific personality frameworks and advancing human-AI interaction. This work contributes to responsible AI development and extends the boundaries of personality psychology in the age of intelligent systems.
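Summary: This study asks whether LLMs exhibit personality-like traits and how these compare with human personality, focusing on the applicability of conventional personality assessment tools. Study 1 (test-retest stability) finds that LLMs are more variable and more input-sensitive than humans and lack long-term stability, which motivates the proposed Distributed Personality Framework. Study 2 (cross-variant consistency) finds that responses are highly sensitive to item wording, with low internal consistency compared to humans. Study 3 (role-playing) shows that LLM traits are shaped by prompt and parameter settings. Overall, LLMs express fluid, externally dependent personality patterns.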

Title: MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

Authors: Xi Wang, Jiaqian Hu, Safinah Ali
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.14848
Pdf URL: https://arxiv.org/pdf/2505.14848
Copy Paste: [[2505.14848]] MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation(https://arxiv.org/abs/2505.14848)
Keywords: language model, llm, agent
Abstract: We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.
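Summary: MAATS is a Multi Agent Automated Translation System that uses the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. Specialized agents each cover one MQM category (e.g., Accuracy, Fluency, Style, Terminology), and a synthesis agent integrates their annotations to refine translations iteratively. Across diverse language pairs and LLMs, MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in automatic metrics and human assessments, especially in semantic accuracy, locale adaptation, and linguistically distant language pairs.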

Title: EasyMath: A 0-shot Math Benchmark for SLMs

Authors: Drishya Karki, Michiel Kamphuis, Angelecia Frey
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14852
Pdf URL: https://arxiv.org/pdf/2505.14852
Copy Paste: [[2505.14852]] EasyMath: A 0-shot Math Benchmark for SLMs(https://arxiv.org/abs/2505.14852)
Keywords: language model, chain-of-thought
Abstract: EasyMath is a compact benchmark for practical math reasoning in small language models. It covers thirteen categories, from basic arithmetic and order of operations to word problems, algebraic expressions, edge cases, and omits specialist topics. We tested 23 models (14M to 4B parameters) using exact, numerical, and symbolic checks on free-form answers in a zero-shot setting. Accuracy rises with size and training, chain-of-thought adds modest gains, and consistency improves at scale.
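Summary: EasyMath is a compact benchmark for practical math reasoning in small language models. It covers thirteen categories, from basic arithmetic and order of operations to word problems, algebraic expressions, and edge cases, and omits specialist topics. 23 models (14M to 4B parameters) were tested with exact, numerical, and symbolic checks on free-form answers in a zero-shot setting. Accuracy rises with size and training, chain-of-thought adds modest gains, and consistency improves at scale.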
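To make the idea of layered answer checks concrete, here is a minimal standard-library sketch of an exact check followed by a numerical check. It is an illustration only, not EasyMath's grader: the function names and the tolerance are assumptions, and the symbolic check mentioned in the abstract is left out because it would need a computer algebra library.

```python
from fractions import Fraction


def exact_match(pred, gold):
    """Case-insensitive string equality after trimming whitespace."""
    return pred.strip().lower() == gold.strip().lower()


def numeric_match(pred, gold, tol=Fraction(1, 10**6)):
    """Compare answers as exact rationals; non-numeric strings never match."""
    try:
        return abs(Fraction(pred.strip()) - Fraction(gold.strip())) <= tol
    except (ValueError, ZeroDivisionError):
        return False


def is_correct(pred, gold):
    """Accept an answer if either the exact or the numerical check passes."""
    return exact_match(pred, gold) or numeric_match(pred, gold)


print(is_correct("0.5", "1/2"))    # numerically equal
print(is_correct("x + 1", "x+1"))  # would need a symbolic check, so this sketch rejects it
```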

Title: Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

Authors: Ryan Solgi, Kai Zhen, Rupak Vignesh Swaminathan, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14871
Pdf URL: https://arxiv.org/pdf/2505.14871
Copy Paste: [[2505.14871]] Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models(https://arxiv.org/abs/2505.14871)
Keywords: language model, llm
Abstract: The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, their applications to compress pre-trained large language models (LLMs) for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
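Summary: Efficient implementations of LLMs are crucial for deployment on resource-constrained devices. Low-rank tensor compression such as tensor-train (TT) networks is well studied for over-parameterized networks, but compressing pre-trained LLMs for downstream tasks remains difficult because of their high-rank nature and the lack of access to pretraining data. This study investigates low-rank tensorized LLMs during fine-tuning and proposes sparse augmented tensor networks (Saten), which enable full model compression and improve both accuracy and compression efficiency in tensorized language models.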

Title: Incorporating Token Usage into Prompting Strategy Evaluation

Authors: Chris Sypherd, Sergei Petrov, Sonny George, Vaishak Belle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14880
Pdf URL: https://arxiv.org/pdf/2505.14880
Copy Paste: [[2505.14880]] Incorporating Token Usage into Prompting Strategy Evaluation(https://arxiv.org/abs/2505.14880)
Keywords: language model, prompt
Abstract: In recent years, large language models have demonstrated remarkable performance across diverse tasks. However, their task effectiveness is heavily dependent on the prompting strategy used to elicit output, which can vary widely in both performance and token usage. While task performance is often used to determine prompting strategy success, we argue that efficiency--balancing performance and token usage--can be a more practical metric for real-world utility. To enable this, we propose Big-$O_{tok}$, a theoretical framework for describing the token usage growth of prompting strategies, and analyze Token Cost, an empirical measure of tokens per performance. We apply these to several common prompting strategies and find that increased token usage leads to drastically diminishing performance returns. Our results validate the Big-$O_{tok}$ analyses and reinforce the need for efficiency-aware evaluations.
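Summary: The effectiveness of large language models depends heavily on the prompting strategy, and strategies vary widely in both performance and token usage. The paper argues that efficiency, which balances performance against token usage, is a more practical metric than performance alone. It proposes Big-$O_{tok}$, a theoretical framework for describing how a strategy's token usage grows, and analyzes Token Cost, an empirical measure of tokens per performance. Applied to several common prompting strategies, the analysis shows that increased token usage yields drastically diminishing performance returns.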
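A tokens-per-performance measure can be illustrated with a few lines of arithmetic. The abstract does not give the exact formula for Token Cost, so the ratio below (mean tokens divided by accuracy in percentage points) is an assumption, and the strategy names and numbers are invented for illustration.

```python
def token_cost(mean_tokens, accuracy_pct):
    """ASSUMED definition: tokens spent per percentage point of accuracy."""
    if accuracy_pct <= 0:
        raise ValueError("accuracy must be positive to compute a cost")
    return mean_tokens / accuracy_pct


# Made-up measurements: (mean tokens per query, accuracy in percent).
strategies = {
    "direct": (50, 50.0),
    "chain-of-thought": (400, 80.0),
}
costs = {name: token_cost(tokens, acc) for name, (tokens, acc) in strategies.items()}
print(costs)
```

In this invented example the second strategy uses 8 times the tokens for 1.6 times the accuracy, so its cost per accuracy point is 5 times higher, which is the kind of diminishing return the abstract describes.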

Title: Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters

Authors: Danqing Wang, Zhuorui Ye, Xinran Zhao, Fei Fang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14886
Pdf URL: https://arxiv.org/pdf/2505.14886
Copy Paste: [[2505.14886]] Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters(https://arxiv.org/abs/2505.14886)
Keywords: llm, agent
Abstract: Winning competitive debates requires sophisticated reasoning and argument skills. There are unique challenges in the competitive debate: (1) The time constraints force debaters to make strategic choices about which points to pursue rather than covering all possible arguments; (2) The persuasiveness of the debate relies on the back-and-forth interaction between arguments, which a single final game status cannot evaluate. To address these challenges, we propose TreeDebater, a novel debate framework that excels in competitive debate. We introduce two tree structures: the Rehearsal Tree and Debate Flow Tree. The Rehearsal Tree anticipates the attack and defenses to evaluate the strength of the claim, while the Debate Flow Tree tracks the debate status to identify the active actions. TreeDebater allocates its time budget among candidate actions and uses the speech time controller and feedback from the simulated audience to revise its statement. The human evaluation on both the stage-level and the debate-level comparison shows that our TreeDebater outperforms the state-of-the-art multi-agent debate system. Further investigation shows that TreeDebater shows better strategies in limiting time to important debate actions, aligning with the strategies of human debate experts.
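Summary: Competitive debate poses two challenges: time limits force debaters to choose which points to pursue, and persuasiveness depends on back-and-forth interaction that a single final game status cannot capture. TreeDebater addresses these with two tree structures. The Rehearsal Tree anticipates attacks and defenses to evaluate the strength of a claim, and the Debate Flow Tree tracks the debate status to identify active actions. TreeDebater allocates its time budget among candidate actions and revises its statements using a speech time controller and simulated audience feedback. In human evaluation at both stage level and debate level, it outperforms the state-of-the-art multi-agent debate system, and its time allocation aligns with the strategies of human debate experts.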

Title: In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

Authors: Nathan Roll, Calbert Graham, Yuka Tatsumi, Kim Tien Nguyen, Meghan Sumner, Dan Jurafsky
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.14887
Pdf URL: https://arxiv.org/pdf/2505.14887
Copy Paste: [[2505.14887]] In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties(https://arxiv.org/abs/2505.14887)
Keywords: language model, prompt
Abstract: Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided--though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
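Summary: Human listeners adapt to unfamiliar speakers and language varieties through exposure. This work tests whether spoken language models do the same, using a scalable in-context learning (ICL) framework for Phi-4 Multimodal built from interleaved task prompts and audio-text pairs. As few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. Gains are largest for low-resource varieties, when context and target speaker match, and with more examples, although returns diminish with context length. Significant gaps remain for certain varieties. Prompts and code are released on GitHub.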
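The two reported figures, a relative reduction of 19.7% and an absolute reduction of 1.2 percentage points, together imply a rough average baseline. The calculation below is a back-of-envelope sketch only: the abstract reports averages across corpora, and an average of ratios need not equal the ratio of averages, so the derived values are approximate and do not appear in the source.

```python
# Figures taken from the abstract.
relative_reduction = 0.197    # 19.7% relative WER reduction
absolute_reduction_pp = 1.2   # 1.2 percentage points absolute reduction

# relative = absolute / baseline  =>  baseline = absolute / relative
baseline_wer = absolute_reduction_pp / relative_reduction
adapted_wer = baseline_wer - absolute_reduction_pp

print(f"implied baseline WER ~ {baseline_wer:.2f}%")
print(f"implied adapted WER  ~ {adapted_wer:.2f}%")
```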

Title: Scaling Laws for State Dynamics in Large Language Models

Authors: Jacob X Li, Shreyas S Raman, Jessica Wan, Fahad Samman, Jazlyn Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14892
Pdf URL: https://arxiv.org/pdf/2505.14892
Copy Paste: [[2505.14892]] Scaling Laws for State Dynamics in Large Language Models(https://arxiv.org/abs/2505.14892)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.
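Summary: LLMs are increasingly used for tasks that require internal state tracking, but their ability to model state transition dynamics is poorly understood. The paper evaluates deterministic state dynamics in 3 domains, each formalizable as a finite-state system: Box Tracking, Abstract DFA Sequences, and Complex Text Games. Next-state prediction accuracy degrades with larger state spaces and sparser transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Activation patching identifies the attention heads that propagate state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B heads at Layers 10, 11, 12, and 14. These heads move relevant state features, but action information is not reliably routed to the final token, indicating weak joint state-action reasoning.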

Title: Concept Incongruence: An Exploration of Time and Death in Role Playing

Authors: Xiaoyan Bai, Ike Peng, Aditya Singh, Chenhao Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14905
Pdf URL: https://arxiv.org/pdf/2505.14905
Copy Paste: [[2505.14905]] Concept Incongruence: An Exploration of Time and Death in Role Playing(https://arxiv.org/abs/2505.14905)
Keywords: language model, llm, prompt
Abstract: Consider this prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarifications, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics--abstention rate, conditional accuracy, and answer rate--to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer from an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) role playing causes shifts in the model's temporal representations, resulting in accuracy drops. We leverage these insights to improve consistency in the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.
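Summary: Given the prompt "Draw a unicorn with two horns", should an LLM point out that a unicorn has only one horn by definition and ask for clarification, or generate something anyway? The paper introduces concept incongruence for cases where concept boundaries clash, in user prompts or in model representations. Focusing on temporal boundaries in role play, it proposes three behavioral metrics (abstention rate, conditional accuracy, and answer rate) to quantify behavior after the role's death. Models fail to abstain after death and lose accuracy compared to the Non-Role-Play setting. Probing identifies two causes: (i) unreliable encoding of the "death" state across different years, and (ii) shifts in the model's temporal representations caused by role playing. These insights are used to improve the consistency of abstention and answer behaviors.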
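The three behavioral metrics can be sketched over a list of labeled responses. The abstract names the metrics but does not define them, so the definitions below are assumptions: abstention rate is the share of responses that abstain, answer rate is the share that give an answer, and conditional accuracy is accuracy among the responses that answer. The record fields and the example data are hypothetical.

```python
def behavior_metrics(records):
    """Compute ASSUMED definitions of the three metrics.

    Each record is a dict with boolean fields 'abstained', 'answered',
    and 'correct' (only meaningful when 'answered' is True).
    """
    total = len(records)
    answered = [r for r in records if r["answered"]]
    return {
        "abstention_rate": sum(r["abstained"] for r in records) / total,
        "answer_rate": len(answered) / total,
        "conditional_accuracy": (
            sum(r["correct"] for r in answered) / len(answered) if answered else 0.0
        ),
    }


# Made-up responses: one abstains, four answer, three of the answers are correct.
records = [
    {"abstained": True, "answered": False, "correct": False},
    {"abstained": False, "answered": True, "correct": True},
    {"abstained": False, "answered": True, "correct": True},
    {"abstained": False, "answered": True, "correct": True},
    {"abstained": False, "answered": True, "correct": False},
]
metrics = behavior_metrics(records)
print(metrics)
```

Keeping `abstained` and `answered` as separate flags avoids assuming the two rates are complements, since a response could in principle state that it should abstain and still give an answer.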

Title: Understanding 6G through Language Models: A Case Study on LLM-aided Structured Entity Extraction in Telecom Domain

Authors: Ye Yuan, Haolun Wu, Hao Zhou, Xue Liu, Hao Chen, Yan Xin, Jianzhong (Charlie) Zhang
Subjects: cs.CL, eess.SY
Abstract URL: https://arxiv.org/abs/2505.14906
Pdf URL: https://arxiv.org/pdf/2505.14906
Copy Paste: [[2505.14906]] Understanding 6G through Language Models: A Case Study on LLM-aided Structured Entity Extraction in Telecom Domain(https://arxiv.org/abs/2505.14906)
Keywords: language model, llm, prompt
Abstract: Knowledge understanding is a foundational part of envisioned 6G networks to advance network intelligence and AI-native network architectures. In this paradigm, information extraction plays a pivotal role in transforming fragmented telecom knowledge into well-structured formats, empowering diverse AI models to better understand network terminologies. This work proposes a novel language model-based information extraction technique, aiming to extract structured entities from the telecom context. The proposed telecom structured entity extraction (TeleSEE) technique applies a token-efficient representation method to predict entity types and attribute keys, aiming to reduce the number of output tokens and improve prediction accuracy. Meanwhile, TeleSEE involves a hierarchical parallel decoding method, improving the standard encoder-decoder architecture by integrating additional prompting and decoding strategies into entity extraction tasks. In addition, to better evaluate the performance of the proposed technique in the telecom domain, we further designed a dataset named 6GTech, including 2390 sentences and 23747 words from more than 100 6G-related technical publications. Finally, the experiment shows that the proposed TeleSEE method achieves higher accuracy than other baseline techniques, and also achieves 5 to 9 times higher sample processing speed.

Title: ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories

Authors: Zhiwei Liu, Paul Thompson, Jiaqi Rong, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14917
Pdf URL: https://arxiv.org/pdf/2505.14917
Copy Paste: [[2505.14917]] ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories(https://arxiv.org/abs/2505.14917)
Keywords: language model, llm
Abstract: Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can vary from LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat such issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at this https URL.

Title: Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Authors: Fadel M. Megahed, Ying-Ju Chen, L. Allision Jones-Farmer, Younghwa Lee, Jiawei Brooke Wang, Inez M. Zwetsloot
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.14918
Pdf URL: https://arxiv.org/pdf/2505.14918
Copy Paste: [[2505.14918]] Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications(https://arxiv.org/abs/2505.14918)
Keywords: language model, gpt, llm
Abstract: This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
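The intra-rater consistency figure in this abstract (perfect agreement on 90-98% of examples across five replicates) can be illustrated with a minimal sketch. It assumes that "perfect agreement" means all replicates of one model give the same label to an article. The function names and toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def perfect_agreement_rate(labels_per_item):
    """Fraction of items on which every replicate produced the same label."""
    if not labels_per_item:
        raise ValueError("need at least one item")
    unanimous = sum(1 for labels in labels_per_item if len(set(labels)) == 1)
    return unanimous / len(labels_per_item)


def majority_label(labels):
    """Most frequent label across replicates, e.g. for comparison with a reference label."""
    return Counter(labels).most_common(1)[0][0]


# Toy data: four articles, five replicate classifications each.
replicates = [
    ["pos", "pos", "pos", "pos", "pos"],
    ["neg", "neg", "neg", "neg", "pos"],   # one replicate disagrees
    ["neg", "neg", "neg", "neg", "neg"],
    ["pos", "pos", "pos", "pos", "pos"],
]
rate = perfect_agreement_rate(replicates)
consensus = [majority_label(labels) for labels in replicates]
```

With the toy data above, three of four articles are unanimous, so `rate` is 0.75.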

Title: Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels

Authors: Sil Hamilton, Rebecca M. M. Hicke, Matthew Wilkens, David Mimno
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14925
Pdf URL: https://arxiv.org/pdf/2505.14925
Copy Paste: [[2505.14925]] Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels(https://arxiv.org/abs/2505.14925)
Keywords: language model, llm
Abstract: Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.

Title: MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Authors: Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, Danielle S. Bitterman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14963
Pdf URL: https://arxiv.org/pdf/2505.14963
Copy Paste: [[2505.14963]] MedBrowseComp: Benchmarking Medical Deep Research and Computer Use(https://arxiv.org/abs/2505.14963)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: this https URL

Title: DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis

Authors: Prashanth Vijayaraghavan, Soroush Vosoughi, Lamogha Chizor, Raya Horesh, Rogerio Abreu de Paula, Ehsan Degan, Vandana Mukherjee
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.14971
Pdf URL: https://arxiv.org/pdf/2505.14971
Copy Paste: [[2505.14971]] DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis(https://arxiv.org/abs/2505.14971)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing (NLP) and expanded their applications across diverse domains. However, despite their impressive capabilities, LLMs have been shown to reflect and perpetuate harmful societal biases, including those based on ethnicity, gender, and religion. A critical and underexplored issue is the reinforcement of caste-based biases, particularly towards India's marginalized caste groups such as Dalits and Shudras. In this paper, we address this gap by proposing DECASTE, a novel, multi-dimensional framework designed to detect and assess both implicit and explicit caste biases in LLMs. Our approach evaluates caste fairness across four dimensions: socio-cultural, economic, educational, and political, using a range of customized prompting strategies. By benchmarking several state-of-the-art LLMs, we reveal that these models systematically reinforce caste biases, with significant disparities observed in the treatment of oppressed versus dominant caste groups. For example, bias scores are notably elevated when comparing Dalits and Shudras with dominant caste groups, reflecting societal prejudices that persist in model outputs. These results expose the subtle yet pervasive caste biases in LLMs and emphasize the need for more comprehensive and inclusive bias evaluation methodologies that assess the potential risks of deploying such models in real-world contexts.

Title: Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies

Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14972
Pdf URL: https://arxiv.org/pdf/2505.14972
Copy Paste: [[2505.14972]] Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies(https://arxiv.org/abs/2505.14972)
Keywords: language model, gpt
Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks.

Title: CRAFT: Training-Free Cascaded Retrieval for Tabular QA

Authors: Adarsh Singh, Kushal Raj Bhandari, Jianxi Gao, Soham Dan, Vivek Gupta
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.14984
Pdf URL: https://arxiv.org/pdf/2505.14984
Copy Paste: [[2505.14984]] CRAFT: Training-Free Cascaded Retrieval for Tabular QA(https://arxiv.org/abs/2505.14984)
Keywords: language model, llm
Abstract: Table Question Answering (TQA) involves retrieving relevant tables from a large corpus to answer natural language queries. Traditional dense retrieval models, such as DTR and ColBERT, not only incur high computational costs for large-scale retrieval tasks but also require retraining or fine-tuning on new datasets, limiting their adaptability to evolving domains and knowledge. In this work, we propose CRAFT, a cascaded retrieval approach that first uses a sparse retrieval model to filter a subset of candidate tables before applying more computationally expensive dense models and neural re-rankers. Our approach achieves better retrieval performance than state-of-the-art (SOTA) sparse, dense, and hybrid retrievers. We further enhance table representations by generating table descriptions and titles using Gemini Flash 1.5. End-to-end TQA results using various Large Language Models (LLMs) on NQ-Tables, a subset of the Natural Questions Dataset, demonstrate CRAFT's effectiveness.
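The cascade described in this abstract (a cheap sparse filter followed by a more expensive re-ranker on the shortlist) can be illustrated with a minimal sketch. This is not CRAFT itself. The term-overlap scorer stands in for a sparse retriever, and the character-trigram cosine is only a placeholder for a dense model. All names and the toy table descriptions are hypothetical.

```python
import math
from collections import Counter


def sparse_score(query, doc):
    """Stage 1 stand-in: count occurrences of query terms in the document."""
    query_terms = set(query.lower().split())
    doc_counts = Counter(doc.lower().split())
    return sum(doc_counts[t] for t in query_terms)


def trigram_vector(text):
    """Stage 2 stand-in for a dense encoder: bag of character trigrams."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cascaded_retrieve(query, tables, k_sparse=2, k_final=1):
    """Filter with the cheap scorer, then re-rank only the shortlist."""
    shortlist = sorted(
        tables, key=lambda name: sparse_score(query, tables[name]), reverse=True
    )[:k_sparse]
    qv = trigram_vector(query)
    reranked = sorted(
        shortlist, key=lambda name: cosine(qv, trigram_vector(tables[name])), reverse=True
    )
    return reranked[:k_final]


tables = {
    "olympics": "olympic games host cities by year",
    "planets": "planets of the solar system with mass and radius",
    "rivers": "longest rivers by length and country",
    "films": "highest grossing films by year",
}
top = cascaded_retrieve("which city hosted the olympic games", tables)
```

The point of the design is that the expensive scorer runs on at most `k_sparse` candidates, whatever the corpus size.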

Title: Language Specific Knowledge: Do Models Know Better in X than in English?

Authors: Ishika Agarwal, Nimet Beyza Bozdag, Dilek Hakkani-Tür
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14990
Pdf URL: https://arxiv.org/pdf/2505.14990
Copy Paste: [[2505.14990]] Language Specific Knowledge: Do Models Know Better in X than in English?(https://arxiv.org/abs/2505.14990)
Keywords: language model, chain-of-thought
Abstract: Code-switching is a common phenomenon of alternating between different languages in the same utterance, thought, or conversation. We posit that humans code-switch because they feel more comfortable talking about certain topics and domains in one language than another. With the rise of knowledge-intensive language models, we ask ourselves the next, natural question: Could models hold more knowledge on some topics in some language X? More importantly, could we improve reasoning by changing the language that reasoning is performed in? We coin the term Language Specific Knowledge (LSK) to represent this phenomenon. As ethnic cultures tend to develop alongside different languages, we employ culture-specific datasets (that contain knowledge about cultural and social behavioral norms). We find that language models can perform better when using chain-of-thought reasoning in some languages other than English, sometimes even better in low-resource languages. Paired with previous works showing that semantic similarity does not equate to representational similarity, we hypothesize that culturally specific texts occur more abundantly in corresponding languages, enabling specific knowledge to occur only in specific "expert" languages. Motivated by our initial results, we design a simple methodology called LSKExtractor to benchmark the language-specific knowledge present in a language model and, then, exploit it during inference. We show our results on various models and datasets, showing an average relative improvement of 10% in accuracy. Our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.

Title: Effective and Efficient Schema-aware Information Extraction Using On-Device Large Language Models

Authors: Zhihao Wen, Sheng Liang, Yaxiong Wu, Yongyue Zhang, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.14992
Pdf URL: https://arxiv.org/pdf/2505.14992
Copy Paste: [[2505.14992]] Effective and Efficient Schema-aware Information Extraction Using On-Device Large Language Models(https://arxiv.org/abs/2505.14992)
Keywords: language model, llm, hallucination
Abstract: Information extraction (IE) plays a crucial role in natural language processing (NLP) by converting unstructured text into structured knowledge. Deploying computationally intensive large language models (LLMs) on resource-constrained devices for information extraction is challenging, particularly due to issues like hallucinations, limited context length, and high latency, especially when handling diverse extraction schemas. To address these challenges, we propose a two-stage information extraction approach adapted for on-device LLMs, called Dual-LoRA with Incremental Schema Caching (DLISC), which enhances both schema identification and schema-aware extraction in terms of effectiveness and efficiency. In particular, DLISC adopts an Identification LoRA module for retrieving the most relevant schemas to a given query, and an Extraction LoRA module for performing information extraction based on the previously selected schemas. To accelerate extraction inference, Incremental Schema Caching is incorporated to reduce redundant computation, substantially improving efficiency. Extensive experiments across multiple information extraction datasets demonstrate notable improvements in both effectiveness and efficiency.

Title: Meta-Design Matters: A Self-Design Multi-Agent System

Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14996
Pdf URL: https://arxiv.org/pdf/2505.14996
Copy Paste: [[2505.14996]] Meta-Design Matters: A Self-Design Multi-Agent System(https://arxiv.org/abs/2505.14996)
Keywords: language model, llm, agent
Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time-only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM backbones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.

Title: Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems

Authors: Chengwei Wei, Bin Wang, Jung-jae Kim, Nancy F. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15000
Pdf URL: https://arxiv.org/pdf/2505.15000
Copy Paste: [[2505.15000]] Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems(https://arxiv.org/abs/2505.15000)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have led to strong reasoning ability across a wide range of tasks. However, their ability to perform mathematical reasoning from spoken input remains underexplored. Prior studies on speech modality have mostly focused on factual speech understanding or simple audio reasoning tasks, providing limited insight into logical step-by-step reasoning, such as that required for mathematical problem solving. To address this gap, we introduce Spoken Math Question Answering (Spoken-MQA), a new benchmark designed to evaluate the mathematical reasoning capabilities of speech-based models, including both cascade models (ASR + LLMs) and end-to-end speech LLMs. Spoken-MQA covers a diverse set of math problems, including pure arithmetic, single-step and multi-step contextual reasoning, and knowledge-oriented reasoning problems, all presented in unambiguous natural spoken language. Through extensive experiments, we find that: (1) while some speech LLMs perform competitively on contextual reasoning tasks involving basic arithmetic, they still struggle with direct arithmetic problems; (2) current LLMs exhibit a strong bias toward symbolic mathematical expressions written in LaTeX and have difficulty interpreting verbalized mathematical expressions; and (3) mathematical knowledge reasoning abilities are significantly degraded in current speech LLMs.

Title: Diagnosing our datasets: How does my language model learn clinical information?

Authors: Furong Jia, David Sontag, Monica Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15024
Pdf URL: https://arxiv.org/pdf/2505.15024
Copy Paste: [[2505.15024]] Diagnosing our datasets: How does my language model learn clinical information?(https://arxiv.org/abs/2505.15024)
Keywords: language model, llm
Abstract: Large language models (LLMs) have performed well across various clinical natural language processing tasks, despite not being directly trained on electronic health record (EHR) data. In this work, we examine how popular open-source LLMs learn clinical information from large mined corpora through two crucial but understudied lenses: (1) their interpretation of clinical jargon, a foundational ability for understanding real-world clinical notes, and (2) their responses to unsupported medical claims. For both use cases, we investigate the frequency of relevant clinical information in their corresponding pretraining corpora, the relationship between pretraining data composition and model outputs, and the sources underlying this data. To isolate clinical jargon understanding, we evaluate LLMs on a new dataset, MedLingo. Unsurprisingly, we find that the frequency of clinical jargon mentions across major pretraining corpora correlates with model performance. However, jargon that appears frequently in clinical notes often appears only rarely in pretraining corpora, revealing a mismatch between available data and real-world usage. Similarly, we find that a non-negligible portion of documents supports disputed claims that can then be parroted by models. Finally, we classified and analyzed the types of online sources in which clinical jargon and unsupported medical claims appear, with implications for future dataset composition.

Title: Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Authors: Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15038
Pdf URL: https://arxiv.org/pdf/2505.15038
Copy Paste: [[2505.15038]] Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering(https://arxiv.org/abs/2505.15038)
Keywords: language model, llm
Abstract: Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noise (i.e., irrelevant features) that challenges steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.
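The difference-in-means baseline mentioned in this abstract can be illustrated with a minimal sketch: the concept vector is the mean hidden representation of concept-positive examples minus the mean of concept-negative ones, and steering adds a scaled copy of it to a hidden state. The sparse-autoencoder denoising step that SDCV adds is not shown. The function names, the scale `alpha`, and the toy 3-dimensional activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_in_means(positive, negative):
    """Concept direction: mean(positive activations) - mean(negative activations)."""
    mp, mn = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mp, mn)]


def steer(hidden, concept, alpha=1.0):
    """Shift a hidden state along the concept direction by a factor alpha."""
    return [h + alpha * c for h, c in zip(hidden, concept)]


# Toy hidden representations (real ones would come from a model layer).
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]

concept = difference_in_means(positive, negative)
steered = steer([0.5, 0.5, 0.5], concept, alpha=0.5)
```

In the toy data the second dimension is identical in both groups, so it contributes nothing to the direction. Features that differ between the groups for reasons unrelated to the concept would leak into the vector, which is the kind of noise the abstract refers to.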

Title: Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Authors: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15045
Pdf URL: https://arxiv.org/pdf/2505.15045
Copy Paste: [[2505.15045]] Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective(https://arxiv.org/abs/2505.15045)
Keywords: language model, llm
Abstract: Large language model (LLM)-based embedding models, benefiting from large scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
摘要：大型语言模型（LLM）基于大规模培训和训练后受益的嵌入模型已开始超过通用文本嵌入任务（例如文档检索）的基于BERT和T5的模型。但是，LLM嵌入的基本局限性在于自回归预训练期间使用的单向关注，这与文本嵌入任务的双向性质失误。为此，我们建议采用用于文本嵌入的扩散语言模型，这是由于它们固有的双向体系结构以及最近在匹配或超越LLM的成功的动机，尤其是在推理任务上。我们介绍了扩散语言嵌入模型的首次系统研究，该模型的表现优于基于LLM的嵌入模型，在长期检索上的含量为20％，在推理密集型检索方面的表现为8％，在遵循指导遵循的检索中为2％，并在传统的文本嵌入基准上实现竞争性能。我们的分析验证了双向关注对于编码长期和复杂文本中的全球环境至关重要。

Title: ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding

Authors: Yifan Wu, Lutao Yan, Leixian Shen, Yinan Mei, Jiannan Wang, Yuyu Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15046
Pdf URL: https://arxiv.org/pdf/2505.15046
Copy Paste: [[2505.15046]] ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding(https://arxiv.org/abs/2505.15046)
Keywords: language model, llm
Abstract: The emergence of Multi-modal Large Language Models (MLLMs) presents new opportunities for chart understanding. However, due to the fine-grained nature of these tasks, applying MLLMs typically requires large, high-quality datasets for task-specific fine-tuning, leading to high data collection and training costs. To address this, we propose ChartCards, a unified chart-metadata generation framework for multi-task chart understanding. ChartCards systematically synthesizes various chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions. By structuring this information into organized metadata, ChartCards enables a single chart to support multiple downstream tasks, such as text-to-chart retrieval, chart summarization, chart-to-table conversion, chart description, and chart question answering. Using ChartCards, we further construct MetaChart, a large-scale high-quality dataset containing 10,862 data tables, 85K charts, and 170K high-quality chart captions. We validate the dataset through qualitative crowdsourcing evaluations and quantitative fine-tuning experiments across various chart understanding tasks. Fine-tuning six different models on MetaChart resulted in an average performance improvement of 5% across all tasks. The most notable improvements are seen in text-to-chart retrieval and chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements of 17% and 28%, respectively.
摘要：多模式大语言模型（MLLM）的出现为图表理解提供了新的机会。但是，由于这些任务的精细性质，应用MLLM通常需要大型，高质量的数据集进行特定于任务的微调，从而导致高数据收集和培训成本。为了解决这个问题，我们建议使用ChartCards，这是一个统一的图表 - 汇总生成框架，可用于多任务图表。 ChartCard系统地综合了各种图表信息，包括数据表，可视化代码，视觉元素和多维语义标题。通过将这些信息构造到有组织的元数据中，ChartCard可以使单个图表支持多个下游任务，例如文本到图案检索，图表摘要，图表到表转换，图表描述和图表问题回答。使用ChartCard，我们进一步构建了Metachart，这是一个大规模的高质量数据集，其中包含10,862个数据表，85K图表和170 K高质量的图表标题。我们通过定性众包评估和各种图表理解任务的定量微调实验来验证数据集。 Metachart上的六个不同模型的微调导致所有任务的平均性能提高5％。最显着的改进是在文本到创建的检索和图表到餐桌上的任务中可以看到的，长剪辑和骆驼3.2-11b分别取得了17％和28％的改善。

Title: Improving the fact-checking performance of language models by relying on their entailment ability

Authors: Gaurav Kumar, Debajyoti Mazumder, Ayush Garg, Jasabanta Patro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15050
Pdf URL: https://arxiv.org/pdf/2505.15050
Copy Paste: [[2505.15050]] Improving the fact-checking performance of language models by relying on their entailment ability(https://arxiv.org/abs/2505.15050)
Keywords: language model, prompt
Abstract: Automated fact-checking is a crucial task in this digital age. To verify a claim, current approaches mainly follow one of two strategies: (i) relying on the embedded knowledge of language models, and (ii) fine-tuning them with evidence pieces. While the former can cause systems to hallucinate, the latter has not been very successful to date. The primary reason behind this is that fact verification is a complex process. Language models have to parse multiple pieces of evidence before making a prediction. Further, the evidence pieces often contradict each other. This makes the reasoning process even more complex. We proposed a simple yet effective approach where we relied on entailment and the generative ability of language models to produce "supporting" and "refuting" justifications (for the truthfulness of a claim). We trained language models based on these justifications and achieved superior results. Apart from that, we conducted a systematic comparison of different prompting and fine-tuning strategies, which is currently lacking in the literature. Some of our observations are: (i) training language models with raw evidence sentences registered an improvement of up to 8.20% in macro-F1 over the best-performing baseline for the RAW-FC dataset; (ii) similarly, training language models with prompted claim-evidence understanding (TBE-2) registered an improvement (with a margin of up to 16.39%) over the baselines for the same dataset; (iii) training language models with entailed justifications (TBE-3) outperformed the baselines by a huge margin (up to 28.57% and 44.26% for LIAR-RAW and RAW-FC, respectively). We have shared our code repository to reproduce the results.
摘要：在这个数字时代，自动化事实检查是至关重要的任务。为了验证索赔，当前的方法主要遵循两种策略之一，即（i）依靠嵌入式语言模型的知识，以及（ii）用证据件对其进行微调。虽然前者可以制造幻觉的系统，但迄今为止，后来还没有非常成功。背后的主要原因是事实验证是一个复杂的过程。语言模型必须在做出预测之前通过多个证据解析。此外，证据片段经常相互矛盾。这使推理过程更加复杂。我们提出了一种简单而有效的方法，在该方法中，我们依靠元素以及语言模型产生“支持”和“反驳”的生成能力（对于索赔的真实性）。我们根据这些理由训练了语言模型，并取得了卓越的成果。除此之外，我们对不同的提示和微调策略进行了系统的比较，因为它目前缺乏文献。我们的某些观察结果是：（i）具有原始证据句子的培训语言模型，在宏F1中的提高了高达8.20％，而不是原始FC数据集的最佳性能基线，（ii）同样，同样，同样的是，培训语言模型具有促进索赔的语言理解（TBE-2）的培训语言（TBE-2），对基础的培训（最高为16.39％），而不是16.39％（III）（III）（III III），（III II III）（TBE-3）优于基准的巨大利润（分别为撒谎者和原始FC高达28.57％和44.26％）。我们共享了代码存储库来重现结果。

Title: MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

Authors: Feiyang Cai, Jiahui Bai, Tao Tang, Joshua Luo, Tianyu Zhu, Ling Liu, Feng Luo
Subjects: cs.CL, cs.AI, cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.15054
Pdf URL: https://arxiv.org/pdf/2505.15054
Copy Paste: [[2505.15054]] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation(https://arxiv.org/abs/2505.15054)
Keywords: prompt
Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (o3) achieves 79.2% and 78.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 29.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications.
摘要：精确的识别，编辑和分子的产生是化学家和AI系统的必要先决条件，这些系统可以应对各种化学任务。我们提出了Mollangbench，这是一种综合基准，旨在评估基本分子语言界面任务：语言促进的分子结构识别，编辑和生成。为了确保高质量，明确和确定性的输出，我们使用自动化的化学形式工具构建识别任务，并通过严格的专家注释和验证来策划编辑和生成任务。 Mollangbench支持与不同分子表示的模型的评估，包括线性字符串，分子图像和分子图。对最先进模型的评估揭示了重大局限性：最强的模型（O3）达到了$ 79.2 \％$和$ 78.5 \％$ $ $ $ $的准确性，而识别和编辑任务的准确性对于人类而言非常简单，并且在一代任务上的表现更糟，仅达到$ 29.0 \％\％$ $ $ $。这些结果突出了当前AI系统在处理甚至初步分子识别和操纵任务方面的缺点。我们希望Mollangbench能够促进进一步的研究，以实现化学应用的更有效和可靠的AI系统。

Title: Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

Authors: Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15055
Pdf URL: https://arxiv.org/pdf/2505.15055
Copy Paste: [[2505.15055]] Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory(https://arxiv.org/abs/2505.15055)
Keywords: language model, llm
Abstract: The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model capabilities. This paper provides a critical analysis of benchmark effectiveness, examining prominent mainstream LLM benchmarks using results from diverse models. We first propose a new framework for accurate and reliable estimation of item characteristics and model abilities. Specifically, we propose Pseudo-Siamese Network for Item Response Theory (PSN-IRT), an enhanced Item Response Theory framework that incorporates a rich set of item parameters within an IRT-grounded architecture. Based on PSN-IRT, we conduct extensive analysis which reveals significant and varied shortcomings in the measurement quality of current benchmarks. Furthermore, we demonstrate that PSN-IRT can be leveraged to construct smaller benchmarks while maintaining stronger alignment with human preference.
摘要：通过基准对大语言模型（LLM）的评估是广泛的，但不同排行榜之间的不一致性与顶级模型之间的差可分离性之间存在不一致性，这引起了人们对它们准确反映正宗模型能力的能力的担忧。本文对基准有效性进行了批判性分析，并使用各种模型的结果检查了主流突出的LLM基准。我们首先提出了一个新的框架，以准确可靠地估计项目特征和模型能力。具体而言，我们提出了用于项目响应理论（PSN-ript）的伪siamese网络，这是一个增强的项目响应理论框架，该框架将丰富的项目参数集中在IRT接地架构中。基于PSN-RIRT，我们进行了广泛的分析，该分析揭示了当前基准测量测量质量的显着和不同的缺点。此外，我们证明了利用PSN-Rirt能够构建较小的基准测试，同时保持与人类偏好更强的一致性。
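
As background for the Item Response Theory framing above, the sketch below shows the classical two-parameter logistic (2PL) model, in which the probability of a correct answer depends on the test-taker's ability `theta`, the item's difficulty `b`, and its discrimination `a`. This is the textbook formulation, not PSN-IRT itself; the abstract does not specify which item parameters PSN-IRT uses, so the parameter values here are purely illustrative.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a model with ability `theta` answers an item
    with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When ability equals difficulty, the model has an even chance on the item.
even_odds = p_correct(theta=0.0, a=1.0, b=0.0)

# A highly discriminating item separates a weak and a strong model more
# sharply than a weakly discriminating one (hypothetical abilities -1 and +1).
sharp_gap = p_correct(1.0, a=2.0, b=0.0) - p_correct(-1.0, a=2.0, b=0.0)
blunt_gap = p_correct(1.0, a=0.5, b=0.0) - p_correct(-1.0, a=0.5, b=0.0)
```

Poor separability among top models, as mentioned in the abstract, corresponds in this picture to items whose discrimination is low in the ability range where those models sit.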

Title: Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning

Authors: Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15062
Pdf URL: https://arxiv.org/pdf/2505.15062
Copy Paste: [[2505.15062]] Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning(https://arxiv.org/abs/2505.15062)
Keywords: language model, gpt, llm
Abstract: When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate "hormones helping mental disorders" with "melatonin being a hormone and insomnia a mental disorder" to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE's key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% → 71.4% and 78.6% → 90.5%, respectively, on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.
摘要：在解决需要新信息的复杂问题时，人们经常将问题与现有知识联系起来，以得出明智的答案。例如，在评估褪黑激素是否有助于失眠时，可能会将“激素帮助精神疾病”与“褪黑激素”为激素和失眠症的精神障碍，以完成推理。大型语言模型（LLM）还需要这种关联思维，尤其是在检索知识时解决科学询问的过程中不足，并且没有直接回答这个问题。图启发了通过知识图（kg）推断结构化知识来启发真实性外推（给予）。但是，它涉及许多假设三胞胎的构建和修剪，这限制了效率和概括性。我们提出了自我基因，这是一个检索RL框架，通过增强学习通过自动关联思维来增强LLM。自助提取物结构化信息和实体集，以协助模型链接到查询概念。我们解决了关键的局限性：（1）广泛的LLM呼叫和象征性的开销，以进行知识外推，（2）由于复杂的说明而在较小的LLM（3B或7B）上部署困难，以及（3）（3）LLM修剪的知识不准确。具体而言，在使用135个节点UMLS kg进行微调后，它将QWEN2.5 3B和7B型号的性能提高了高达$ \ textbf {28.5％$ \ rightArrow $ 71.4％} $，$ \ \ \ \ \ \ \\ textbf {78.6 $ \ rightarrow $ 90.5％$}具有挑战性的生物医学质量检查任务。特别是，自助式的7B模型可以与Give匹配或胜过GPT3.5 Turbo，同时将令牌用法降低了90 \％。自我基因增强了结构化检索和推理与关联思维的可扩展整合。

Title: UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Authors: Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15063
Pdf URL: https://arxiv.org/pdf/2505.15063
Copy Paste: [[2505.15063]] UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking(https://arxiv.org/abs/2505.15063)
Keywords: language model, llm, agent
Abstract: The rapidly growing use of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck's code and datasets are open-sourced and publicly available at this https URL.
摘要：大型语言模型（LLM）的快速使用引起了人们对其产出的事实可靠性的关键关注，尤其是在乌尔都语等低资源语言中。现有的自动化事实检查解决方案压倒性地集中在英语上，这给全球200多个乌尔都语演讲者留下了很大的差距。在这项工作中，我们介绍了Urdufactcheck，这是专门针对乌尔都语量身定制的第一个全面的，模块化的事实检查框架。我们的系统采用动态的，多策略的证据检索管道，结合了基于单语和翻译的方法来解决高质量乌尔都语证据的稀缺性。我们策划并释放两个新的手工宣布的基准：用于索赔验证的Urdufactbench和用于评估LLM事实的Urdufactqa。广泛的实验表明，Urdufactcheck，尤其是其翻译功能的变体，在多个指标上始终优于基准和开源替代方案。我们进一步基于乌尔都语的事实问题回答的十二个最先进的LLM（SOTA）LLM，强调了专有和开源模型之间的持续差距。 UrdufactCheck的代码和数据集是开源的，并在此HTTPS URL上公开可用。

Title: The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support

Authors: Suhas BN, Yash Mahajan, Dominik Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.15065
Pdf URL: https://arxiv.org/pdf/2505.15065
Copy Paste: [[2505.15065]] The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support(https://arxiv.org/abs/2505.15065)
Keywords: language model
Abstract: Can small language models with 0.5B to 5B parameters meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We address this question by introducing TIDE, a dataset of 10,000 two-turn dialogues spanning 500 diverse PTSD client personas and grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. We evaluate eight small language models before and after fine-tuning, comparing their outputs to a frontier model (Claude Sonnet 3.5). Our IRB-approved human evaluation and automatic metrics show that fine-tuning generally improves perceived empathy, but gains are highly scenario- and user-dependent, with smaller models facing an empathy ceiling. Demographic analysis shows older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight the limitations of automatic metrics and the need for context- and user-aware system design. Our findings, along with the planned release of TIDE, provide a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.
摘要：具有0.5b至5b参数的小语言模型是否可以有意义地为患有PTSD的人进行创伤性，同理心对话吗？我们通过引入潮汐（跨越500多种PTSD客户角色，并以三因素同理心模型：情感识别，痛苦的归一化和支持性反思的方式扎根）来解决这个问题。专门从事PTSD的临床心理学家对所有场景和参考反应进行了现实主义和创伤敏感性的审查。我们在微调之前和之后评估了八种小语言模型，将其输出与边境模型进行比较（Claude Sonnet 3.5）。我们经过IRB批准的人类评估和自动指标表明，微调通常会改善感知的同理心，但收益是高度的方案和用户依赖性的，较小的模型面临着同理心天花板。人口统计分析表明，老年人重视遇险验证，受过研究生教育的用户更喜欢细微的答复，而性别影响很小。我们强调了自动指标的局限性以及对上下文和用户感知系统设计的需求。我们的发现，以及计划的潮汐发行，为建立安全，资源效率和道德上声音的AI提供了基础，以补充而不是取代临床心理保健。

Title: In-Domain African Languages Translation Using LLMs and Multi-armed Bandits

Authors: Pratik Rakesh Singh, Kritarth Prasad, Mohammadi Zaki, Pankaj Wasnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15069
Pdf URL: https://arxiv.org/pdf/2505.15069
Copy Paste: [[2505.15069]] In-Domain African Languages Translation Using LLMs and Multi-armed Bandits(https://arxiv.org/abs/2505.15069)
Keywords: llm
Abstract: Neural Machine Translation (NMT) systems face significant challenges when working with low-resource languages, particularly in domain adaptation tasks. These difficulties arise due to limited training data and suboptimal model generalization. As a result, selecting an optimal model for translation is crucial for achieving strong performance on in-domain data, particularly in scenarios where fine-tuning is not feasible or practical. In this paper, we investigate strategies for selecting the most suitable NMT model for a given domain using bandit-based algorithms, including Upper Confidence Bound, Linear UCB, Neural Linear Bandit, and Thompson Sampling. Our method effectively addresses the resource constraints by facilitating optimal model selection with high confidence. We evaluate the approach across three African languages and domains, demonstrating its robustness and effectiveness both in scenarios where target data is available and where it is absent.
摘要：当使用低资源语言，尤其是在域适应任务中，神经机器翻译（NMT）系统在使用低资源语言时面临重大挑战。结果，由于训练数据有限和次优模型的概括，因此出现了这些困难，因此选择一个最佳翻译模型对于在内域数据上实现强劲的性能至关重要，尤其是在微观调整不可行或实用的情况下。在本文中，我们研究了使用基于Bandit的算法选择给定域中最合适的NMT模型的策略，包括上置信度结合，线性UCB，神经线性匪徒和Thompson采样。我们的方法通过高度置信度促进最佳模型选择有效地解决了资源限制。我们评估了三种非洲语言和领域的方法，在两种情况下都证明了它的稳健性和有效性，这些情况有可用的目标数据以及缺乏目标数据。
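
To make the bandit-based model selection above concrete, here is a minimal sketch of the standard UCB1 rule, treating each candidate NMT model as an arm. It assumes each translation yields a quality reward in [0, 1]; the three simulated models, their mean rewards, and the Bernoulli reward function are hypothetical stand-ins, since the abstract does not say how rewards are computed.

```python
import math
import random
from typing import Callable, List


def ucb1_select(counts: List[int], sums: List[float], t: int) -> int:
    """Pick the arm with the highest UCB1 index; untried arms are pulled first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    def index(i: int) -> float:
        # Empirical mean plus an exploration bonus that shrinks with more pulls.
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    return max(range(len(counts)), key=index)


def run_ucb1(reward_fns: List[Callable[[], float]], rounds: int) -> List[int]:
    """Run UCB1 for `rounds` steps and return how often each arm was pulled."""
    counts = [0] * len(reward_fns)
    sums = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        reward = reward_fns[arm]()
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Three hypothetical NMT models with different mean translation quality.
rng = random.Random(0)
means = [0.3, 0.5, 0.8]
reward_fns = [lambda m=m: 1.0 if rng.random() < m else 0.0 for m in means]

pulls = run_ucb1(reward_fns, rounds=2000)
best_model = max(range(len(pulls)), key=lambda i: pulls[i])
```

Over time the rule concentrates its pulls on the model with the highest observed quality while still occasionally sampling the others, which is the behaviour that makes it usable when fine-tuning every candidate is not an option.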

Title: Can Large Language Models Understand Internet Buzzwords Through User-Generated Content

Authors: Chen Huang, Junkai Luo, Xinzuo Wang, Wenqiang Lei, Jiancheng Lv
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15071
Pdf URL: https://arxiv.org/pdf/2505.15071
Copy Paste: [[2505.15071]] Can Large Language Models Understand Internet Buzzwords Through User-Generated Content(https://arxiv.org/abs/2505.15071)
Keywords: language model, llm
Abstract: The massive amount of user-generated content (UGC) available on Chinese social media makes it possible to study internet buzzwords. In this paper, we study whether large language models (LLMs) can generate accurate definitions for these buzzwords using UGC as examples. Our work makes a threefold contribution. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at this https URL.
摘要：中国社交媒体可用的大量用户生成的内容（UGC）导致了研究互联网流行语的可能性。在本文中，我们研究了大型语言模型（LLMS）是否可以基于UGC为这些流行语生成准确的定义。我们的工作提供了三倍的贡献。首先，我们引入Cheer，这是中国互联网流行语的第一个数据集，每个数据集都带有定义和相关的UGC。其次，我们提出了一种名为Ress的新颖方法，以有效地引导LLM的理解过程，以产生更准确的流行语定义，从而反映人类语言学习的技能。第三，我们加油助喜您，基于各种现成定义生成方法和我们的Ress的优点和缺点。我们的基准表明了Ress的有效性，同时揭示了至关重要的共同挑战：过度依赖于先前的暴露，欠发达的推论能力以及难以识别高质量的UGC以促进理解。我们相信我们的工作为基于LLM的定义生成中的未来进步奠定了基础。我们的数据集和代码可在此HTTPS URL上找到。

Title: DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Authors: Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15074
Pdf URL: https://arxiv.org/pdf/2505.15074
Copy Paste: [[2505.15074]] DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data(https://arxiv.org/abs/2505.15074)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups - assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
摘要：大型语言模型（LLM）越来越多地通过从人类反馈（RLHF）学习的增强来与人类的偏好保持一致。在RLHF方法中，小组相对策略优化（GRPO）因其简单性和强大的性能而引起了人们的关注，特别是消除了学习价值函数的需求。但是，GRPO隐含地假设跨组平衡的域分布和统一的语义对齐方式 - 在现实世界中很少存在的假设。当应用于多域，不平衡的数据时，GRPO不成比例地优化主导域，忽略了代表性不足的域，并导致概括和公平性差。我们提出了域信息的自通策略优化（DISCO），这是GRPO的原则扩展，该扩展与两个关键创新有关集团间的不平衡。域感知的奖励缩放量表通过基于域患病率重新加权优化来抵消频率偏差。难以理解的奖励缩放利用迅速级别的自符合性，以识别和确定提供更大学习价值的不确定提示。这些策略共同促进了跨领域的更公平和有效的政策学习。跨多个LLM和偏斜训练分布进行的大量实验表明，迪斯科曲线改善了概括，在QWEN3模型上超过了现有的GRPO变体，并为多域对齐基准测试了新的最新结果。
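
The sketch below illustrates the two ingredients the abstract above combines: GRPO's group-relative advantage (each sampled response's reward standardized within its prompt group, which is what removes the need for a learned value function) and a per-domain scaling factor. The abstract does not give DISCO's actual scaling formulas, so the inverse-frequency weighting here is only an illustrative assumption, and the function names and toy data are hypothetical. Difficulty-aware scaling is not shown.

```python
import statistics
from collections import Counter
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def inverse_frequency_weights(domains: List[str]) -> Dict[str, float]:
    """Hypothetical domain weights: rarer domains get larger weights.

    Normalized so that the average weight over all prompts is 1.
    """
    counts = Counter(domains)
    n, k = len(domains), len(counts)
    return {d: n / (k * c) for d, c in counts.items()}


# Toy training set: "math" dominates, "law" is underrepresented.
prompt_domains = ["math", "math", "math", "law"]
weights = inverse_frequency_weights(prompt_domains)

# Rewards for four sampled responses to a single "law" prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
scaled = [weights["law"] * a for a in advantages]
```

With this weighting, a prompt from the rare domain contributes a larger gradient signal than one from the dominant domain, which is the kind of frequency-bias correction the abstract describes.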

Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Authors: Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15075
Pdf URL: https://arxiv.org/pdf/2505.15075
Copy Paste: [[2505.15075]] Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs(https://arxiv.org/abs/2505.15075)
Keywords: language model, llm
Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
摘要：多模式大语言模型（MLLM）的快速演变显着增强了其现实世界的应用。但是，在整合文化知识的情况下，尤其是在整合文化知识的情况下，达到一致的表现仍然是一个重大挑战。为了更好地评估这个问题，我们介绍了两个新的基准：KnowRecall和VisreCall，它们评估了MLLM中的跨语性一致性。 KnowRecall是一个视觉问题，回答旨在衡量15种语言的事实知识一致性的基准，重点是有关全球地标的文化和历史问题。 VisreCall通过要求模型描述9种语言中的地标外观来评估视觉记忆一致性，而无需访问图像。实验结果表明，最新的MLLM，包括专有的MLLM，仍然难以实现跨语义的一致性。这强调了对更强大的方法的需求，从而产生真正多种语言和文化意识的模型。

Title: DeFT-X: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer

Authors: Sona Elza Simon, Preethi Jyothi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15090
Pdf URL: https://arxiv.org/pdf/2505.15090
Copy Paste: [[2505.15090]] DeFT-X: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer(https://arxiv.org/abs/2505.15090)
Keywords: language model
Abstract: Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer, which learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. These sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that denoises the weight matrices of a pretrained model before magnitude pruning using singular value decomposition, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
摘要：有效的跨语性转移仍然是将大语言模型从高资源扩展到低资源语言的关键挑战。为了实现这一目标，先前的研究探索了许多方法，将任务知识从特定于任务的数据结合在（高资源）源语言和语言知识中，从（低资源）目标语言中的未标记文本结合在一起。一种值得注意的方法提出了跨语性转移的可综合稀疏微调（SFT），该跨语性转移学习了特定于任务和特定语言的稀疏掩码，以选择预告片模型的参数的子集，这些参数是进一步调整的。这些稀疏的微调矢量（SFTS）随后与验证的模型一起组成，以促进零击的跨语性转移到目标语言中，仅使用来自源语言的特定于任务的数据。使用基于简单的基于简单的修剪来识别这些用于SFTS的稀疏面膜。在我们的工作中，我们引入了Deft-X，这是一种新型的可综合SFT方法，它在使用单数值分解的幅度修剪之前将预告片模型的重量矩阵降低，从而产生了更强大的SFT。我们评估了多种多样的低资源语言（NUSAX）和自然语言推断（Americasnli）的Deft-X，并证明它在PAR或跑赢大于SFT和其他杰出的跨语言转移基础上的性能。
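
The magnitude-based mask selection mentioned in the abstract above can be sketched as follows. The sketch assumes the mask keeps the `k` parameters whose values changed most between the pretrained and a fine-tuned model; that criterion, the function names, and the toy weights are assumptions for illustration. The SVD denoising step that DeFT-X applies before pruning is not implemented here.

```python
from typing import List


def top_k_magnitude_mask(delta: List[float], k: int) -> List[int]:
    """Binary mask keeping the k entries of `delta` with the largest absolute value."""
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(delta))]


def apply_sparse_update(pretrained: List[float], delta: List[float],
                        mask: List[int]) -> List[float]:
    """Compose a sparse update with the pretrained weights: w + mask * delta."""
    return [w + m * d for w, d, m in zip(pretrained, delta, mask)]


# Toy flattened weights (hypothetical values).
pretrained = [0.5, -1.0, 2.0, 0.0]
finetuned = [0.6, -2.0, 2.05, 0.9]
delta = [f - p for f, p in zip(finetuned, pretrained)]

mask = top_k_magnitude_mask(delta, k=2)
sparse_model = apply_sparse_update(pretrained, delta, mask)
```

Only the two parameters that moved the most are updated; the rest stay at their pretrained values, which is what lets separately learned task and language updates be composed with little interference.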

Title: SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models

Authors: Jing Yu, Yuqi Tang, Kehua Feng, Mingyang Rao, Lei Liang, Zhiqiang Zhang, Mengshu Sun, Wen Zhang, Qiang Zhang, Keyan Ding, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15094
Pdf URL: https://arxiv.org/pdf/2505.15094
Copy Paste: [[2505.15094]] SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models(https://arxiv.org/abs/2505.15094)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
摘要：大型语言模型（LLM）在上下文理解和推理方面表现出了令人印象深刻的能力。但是，评估它们在各种科学领域的性能仍然没有被忽视，因为现有的基准主要集中在通用领域，并且无法捕获科学数据的复杂复杂性。为了弥合这一差距，我们构建了Scicueval，这是一个量身定制的全面基准数据集，该数据集量化了用于评估LLM的科学环境理解能力。它包括涵盖生物学，化学，物理学，生物医学和材料科学的十个领域特异性子数据集，它们整合了包括结构化表，知识图和非结构化文本在内的各种数据模式。 Scicueval系统地评估了四个核心竞争力：相关信息识别，信息 - 放弃检测，多源信息集成以及通过各种问题格式进行上下文所感知的推断。我们对SciCueval的最先进的LLM进行了广泛的评估，对科学环境中的优势和局限性进行了精细的分析，并为科学域LLM的未来发展提供了宝贵的见解。

Title: Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English

Authors: Ishmanbir Singh, Dipankar Srirag, Aditya Joshi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15095
Pdf URL: https://arxiv.org/pdf/2505.15095
Copy Paste: [[2505.15095]] Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English(https://arxiv.org/abs/2505.15095)
Keywords: llm, prompt, agent
Abstract: Sarcasm is a challenge to sentiment analysis because of the incongruity between stated and implied sentiment. The challenge is exacerbated when the implication may be relevant to a specific country or geographical region. Pragmatic metacognitive prompting (PMP) is a cognition-inspired technique that has been used for pragmatic reasoning. In this paper, we harness PMP for explainable sarcasm detection for Australian and Indian English, alongside a benchmark dataset for standard English. We manually add sarcasm explanations to an existing sarcasm-labeled dataset for Australian and Indian English called BESSTIE, and compare the performance for explainable sarcasm detection for them with FLUTE, a standard English dataset containing sarcasm explanations. Our approach utilising PMP when evaluated on two open-weight LLMs (GEMMA and LLAMA) achieves statistically significant performance improvement across all tasks and datasets when compared with four alternative prompting strategies. We also find that alternative techniques such as agentic prompting mitigate context-related failures by enabling external knowledge retrieval. The focused contribution of our work is utilising PMP in generating sarcasm explanations for varieties of English.
摘要：讽刺是对情感分析的挑战，因为所述和隐含情感之间的不一致。当含义可能与特定国家或地理区域有关时，挑战会加剧。务实的元认知提示（PMP）是一种以认知为灵感的技术，用于实用推理。在本文中，我们利用PMP来解释澳大利亚和印度英语的可解释讽刺检测，以及用于标准英语的基准数据集。我们将讽刺性解释添加到现有的澳大利亚和印度英语的讽刺标签数据集中，称为Besstie，并比较了它们的可解释讽刺检测的性能与长笛，这是一种标准的英语数据集，该数据集包含讽刺解释。与四种替代提示策略相比，在两个开放量LLM（Gemma和Llama）上进行评估时，我们利用PMP的方法可以在所有任务和数据集上实现统计学意义的性能改善。我们还发现，通过启用外部知识检索，诸如代理促使与上下文相关的失败等替代技术。我们工作的重点贡献是利用PMP来为各种英语的讽刺解释。

Title: Mechanistic evaluation of Transformers and state space models

Authors: Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csórdas, Dan Jurafsky, Christopher Potts
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15105
Pdf URL: https://arxiv.org/pdf/2505.15105
Copy Paste: [[2505.15105]] Mechanistic evaluation of Transformers and state space models(https://arxiv.org/abs/2505.15105)
Keywords: language model
Abstract: State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why--on a mechanistic level--certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba a close third, whereas the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction heads. By contrast, the SSMs compute these associations only at the last state, with only Mamba succeeding because of its short convolution component. To extend and deepen these findings, we introduce Associative Treecall (ATR), a synthetic task similar to AR based on PCFG induction. ATR introduces language-like hierarchical structure into the AR setting. We find that all architectures learn the same mechanism as they did for AR, and the same three models succeed at the task. These results reveal that architectures with similar accuracy may still have substantive differences, motivating the adoption of mechanistic evaluations.
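
To make the Associative Recall (AR) setup above concrete, here is a minimal sketch of how one AR example could be generated: a context of key-value pairs followed by a query key, with the paired value as the target. The function names, vocabulary sizes, and sequence layout are assumptions for illustration and are not taken from the paper.

```python
import random


def make_ar_example(num_pairs=4, vocab_size=20, seed=0):
    """Build one toy Associative Recall sequence (hypothetical layout).

    Returns (sequence, target): the sequence is k1 v1 k2 v2 ... followed by
    a query key; the target is the value paired with that key.
    """
    rng = random.Random(seed)
    # Keys are distinct ids in [0, vocab_size); values live in a disjoint range.
    keys = rng.sample(range(vocab_size), num_pairs)
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    context = [tok for pair in zip(keys, values) for tok in pair]
    query_idx = rng.randrange(num_pairs)
    return context + [keys[query_idx]], values[query_idx]


def lookup_baseline(sequence):
    """Solve the task exactly with a dictionary, as a reference answer."""
    *context, query = sequence
    table = dict(zip(context[0::2], context[1::2]))
    return table[query]


if __name__ == "__main__":
    seq, target = make_ar_example()
    print(seq, target, lookup_baseline(seq) == target)
```

A sequence model is scored on whether it predicts the target after reading the sequence; the dictionary baseline only shows what a correct answer looks like.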

Title: StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

Authors: Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.15107
Pdf URL: https://arxiv.org/pdf/2505.15107
Copy Paste: [[2505.15107]] StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization(https://arxiv.org/abs/2505.15107)
Keywords: language model, llm, agent
Abstract: Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex, multi-hop QA because their rewards are sparse and come only from a global signal. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories based on open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, it significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. Our implementation is publicly available at this https URL.

Title: A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents

Authors: Ian Steenstra, Timothy W. Bickmore
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.15108
Pdf URL: https://arxiv.org/pdf/2505.15108
Copy Paste: [[2505.15108]] A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents(https://arxiv.org/abs/2505.15108)
Keywords: language model, llm, agent
Abstract: The proliferation of Large Language Models (LLMs) and Intelligent Virtual Agents acting as psychotherapists presents significant opportunities for expanding mental healthcare access. However, their deployment has also been linked to serious adverse outcomes, including user harm and suicide, facilitated by a lack of standardized evaluation methodologies capable of capturing the nuanced risks of therapeutic interaction. Current evaluation techniques lack the sensitivity to detect subtle changes in patient cognition and behavior during therapy sessions that may lead to subsequent decompensation. We introduce a novel risk taxonomy specifically designed for the systematic evaluation of conversational AI psychotherapists. Developed through an iterative process including review of the psychotherapy risk literature, qualitative interviews with clinical and legal experts, and alignment with established clinical criteria (e.g., DSM-5) and existing assessment tools (e.g., NEQ, UE-ATR), the taxonomy aims to provide a structured approach to identifying and assessing user/patient harms. We provide a high-level overview of this taxonomy, detailing its grounding, and discuss potential use cases. We discuss two use cases in detail: monitoring cognitive model-based risk factors during a counseling conversation to detect unsafe deviations, in both human-AI counseling sessions and in automated benchmarking of AI psychotherapists with simulated patients. The proposed taxonomy offers a foundational step towards establishing safer and more responsible innovation in the domain of AI-driven mental health support.

Title: RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals

Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15110
Pdf URL: https://arxiv.org/pdf/2505.15110
Copy Paste: [[2505.15110]] RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals(https://arxiv.org/abs/2505.15110)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) have significantly enhanced reasoning capabilities, leading to strong performance on table reasoning. However, Long CoT suffers from high training cost and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iterative row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Scaling reasoning length by row-wise traversal and leveraging reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.

Title: An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents

Authors: Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, Jiawei Han
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.15117
Pdf URL: https://arxiv.org/pdf/2505.15117
Copy Paste: [[2505.15117]] An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents(https://arxiv.org/abs/2505.15117)
Keywords: language model, llm, agent
Abstract: Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors -- such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process -- require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at this https URL.

Title: Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning

Authors: Jinghui Lu, Haiyang Yu, Siliang Xu, Shiwei Ran, Guozhi Tang, Siqi Wang, Bin Shan, Teng Fu, Hao Feng, Jingqun Tang, Han Wang, Can Huang
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2505.15154
Pdf URL: https://arxiv.org/pdf/2505.15154
Copy Paste: [[2505.15154]] Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning(https://arxiv.org/abs/2505.15154)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in reasoning have significantly enhanced the capabilities of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) across diverse tasks. However, excessive reliance on chain-of-thought (CoT) reasoning can impair model performance and produce unnecessarily long outputs, reducing efficiency. Our work reveals that prolonged reasoning does not universally improve accuracy and can even degrade performance on simpler tasks. To address this, we propose Certainty-based Adaptive Reasoning (CAR), a novel framework that dynamically switches between short answers and long-form reasoning based on the model's perplexity. CAR first generates a short answer and evaluates its perplexity, triggering reasoning only when the model exhibits low confidence (i.e., high perplexity). Experiments across diverse multimodal VQA/KIE benchmarks and text reasoning datasets show that CAR outperforms both short-answer and long-form reasoning approaches, striking an optimal balance between accuracy and efficiency.
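
The routing idea described above (answer briefly first, then reason only if the short answer looks uncertain) can be sketched as follows. This is a minimal illustration, assuming the per-token log-probabilities of the short answer are available; the fixed `threshold` and the function names are hypothetical, and the paper's actual decision rule may differ.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def route(short_answer_logprobs, threshold=1.5):
    """Decide whether to keep the short answer or fall back to long reasoning.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    Returns (decision, perplexity) where decision is "short" or "reason".
    """
    ppl = perplexity(short_answer_logprobs)
    decision = "reason" if ppl > threshold else "short"
    return decision, ppl


if __name__ == "__main__":
    print(route([-0.1, -0.1, -0.1]))  # confident short answer
    print(route([-2.0, -2.0, -2.0]))  # uncertain short answer
```

A low perplexity means the model assigned high probability to its own short answer, so the more expensive long-form pass is skipped.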

Title: ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection

Authors: Jeonghye Kim, Sojeong Rhee, Minbeom Kim, Dohyung Kim, Sangmook Lee, Youngchul Sung, Kyomin Jung
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15182
Pdf URL: https://arxiv.org/pdf/2505.15182
Copy Paste: [[2505.15182]] ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection(https://arxiv.org/abs/2505.15182)
Keywords: llm, hallucination, agent
Abstract: Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent's actual state and goal. Our analysis finds that this stems from ReAct's inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent's state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.

Title: EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association

Authors: Weiqi Wang, Limeng Cui, Xin Liu, Sreyashi Nag, Wenju Xu, Chen Luo, Sheikh Muhammad Sarwar, Yang Li, Hansu Gu, Hui Liu, Changlong Yu, Jiaxin Bai, Yifan Gao, Haiyang Zhang, Qi He, Shuiwang Ji, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15196
Pdf URL: https://arxiv.org/pdf/2505.15196
Copy Paste: [[2505.15196]] EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association(https://arxiv.org/abs/2505.15196)
Keywords: llm
Abstract: Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.

Title: DUSK: Do Not Unlearn Shared Knowledge

Authors: Wonje Jeung, Sangyeon Yoon, Hyesoo Hong, Soeun Kim, Seungju Han, Youngjae Yu, Albert No
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15209
Pdf URL: https://arxiv.org/pdf/2505.15209
Copy Paste: [[2505.15209]] DUSK: Do Not Unlearn Shared Knowledge(https://arxiv.org/abs/2505.15209)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such 'forget' data while preserving utility and information from the 'retain' set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.

Title: Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

Authors: Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.15210
Pdf URL: https://arxiv.org/pdf/2505.15210
Copy Paste: [[2505.15210]] Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs(https://arxiv.org/abs/2505.15210)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at this https URL.
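
For readers unfamiliar with the Hit@1 metric cited above, the sketch below computes it in its usual form: the fraction of questions whose top-ranked prediction is among the gold answers. The function name and data layout are assumptions for illustration; the paper's evaluation script may normalise answers differently.

```python
def hit_at_1(ranked_predictions, gold_answers):
    """Fraction of questions whose first prediction is a gold answer.

    ranked_predictions: list of ranked answer lists, one per question.
    gold_answers: list of sets of acceptable answers, one per question.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    if not gold_answers:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold
    )
    return hits / len(gold_answers)


if __name__ == "__main__":
    preds = [["Paris", "Lyon"], ["Berlin"], []]
    gold = [{"Paris"}, {"Munich"}, {"Rome"}]
    print(hit_at_1(preds, gold))
```

A question with no predictions counts as a miss, so empty outputs lower the score instead of being ignored.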

Title: R-TOFU: Unlearning in Large Reasoning Models

Authors: Sangyeon Yoon, Wonje Jeung, Albert No
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15214
Pdf URL: https://arxiv.org/pdf/2505.15214
Copy Paste: [[2505.15214]] R-TOFU: Unlearning in Large Reasoning Models(https://arxiv.org/abs/2505.15214)
Keywords: llm, chain-of-thought
Abstract: Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.

Title: Multilingual Prompting for Improving LLM Generation Diversity

Authors: Qihan Wang, Shidong Pan, Tal Linzen, Emily Black
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.15229
Pdf URL: https://arxiv.org/pdf/2505.15229
Copy Paste: [[2505.15229]] Multilingual Prompting for Improving LLM Generation Diversity(https://arxiv.org/abs/2505.15229)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and personas prompting. Further analyses show that the benefits of multilingual prompting vary with language resource level and model size, and that aligning the prompting language with the cultural cues reduces hallucination about culturally-specific information.

Title: Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework

Authors: Zihao Jiang, Ben Liu, Miao Peng, Wenjie Xu, Yao Xiao, Zhenyan Shan, Min Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15245
Pdf URL: https://arxiv.org/pdf/2505.15245
Copy Paste: [[2505.15245]] Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework(https://arxiv.org/abs/2505.15245)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance and demonstrates strong generalization capabilities. Our dataset and code are available at this https URL.

Title: Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation

Authors: Yerin Hwang, Dongryeol Lee, Kyungmin Min, Taegwan Kang, Yong-il Kim, Kyomin Jung
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15249
Pdf URL: https://arxiv.org/pdf/2505.15249
Copy Paste: [[2505.15249]] Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation(https://arxiv.org/abs/2505.15249)
Keywords: language model, prompt
Abstract: Recently, large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image-induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist under prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.

Title: MentalMAC: Enhancing Large Language Models for Detecting Mental Manipulation via Multi-Task Anti-Curriculum Distillation

Authors: Yuansheng Gao, Han Bao, Tong Zhang, Bin Li, Zonghui Wang, Wenzhi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15255
Pdf URL: https://arxiv.org/pdf/2505.15255
Copy Paste: [[2505.15255]] MentalMAC: Enhancing Large Language Models for Detecting Mental Manipulation via Multi-Task Anti-Curriculum Distillation(https://arxiv.org/abs/2505.15255)
Keywords: language model, llm
Abstract: Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Its covert nature and the complexity of manipulation strategies make it challenging to detect, even for state-of-the-art large language models (LLMs). This concealment also hinders the manual collection of large-scale, high-quality annotations essential for training effective models. Although recent efforts have sought to improve LLMs' performance on this task, progress remains limited due to the scarcity of real-world annotated datasets. To address these challenges, we propose MentalMAC, a multi-task anti-curriculum distillation method that enhances LLMs' ability to detect mental manipulation in multi-turn dialogue. Our approach includes: (i) EvoSA, an unsupervised data expansion method based on evolutionary operations and speech act theory; (ii) teacher-model-generated multi-task supervision; and (iii) progressive knowledge distillation from complex to simpler tasks. We then constructed the ReaMent dataset with 5,000 real-world dialogue samples, using a MentalMAC-distilled model to assist human annotation. Extensive experiments demonstrate that our method significantly narrows the gap between student and teacher models and outperforms competitive LLMs across key evaluation metrics. All code, datasets, and checkpoints will be released upon paper acceptance. Warning: This paper contains content that may be offensive to readers.
摘要：心理操纵是一种微妙而普遍的心理虐待形式，对心理健康构成了严重威胁。它的秘密性质和操纵策略的复杂性使检测到最先进的大语言模型（LLMS）的挑战。这种隐藏还阻碍了对培训有效模型必不可少的大规模，高质量注释的手动收集。尽管最近的努力试图改善LLM在这项任务上的表现，但由于现实世界注释的数据集缺乏，进度仍然有限。为了应对这些挑战，我们提出了一种多任务抗外交蒸馏方法，增强了LLMS在多转化对话中检测精神操纵的能力。我们的方法包括：（i）Evosa，一种基于进化操作和言语ACT理论的无监督数据扩展方法；（ii）教师模型生成的多任务监督；（iii）从复杂到更简单任务的渐进知识蒸馏。然后，我们使用精神杀伤模型来帮助人类注释，并使用5,000个真实世界对话样本构建了Reament数据集。巨大的实验表明，我们的方法显着缩小了学生和教师模型之间的差距，并且在关键评估指标中胜过竞争性LLM。所有代码，数据集和检查点将在纸上接受后发布。警告：本文包含可能对读者冒犯的内容。

Title: When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

Authors: Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, Yulin Hu, Xingyu Sui, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15257
Pdf URL: https://arxiv.org/pdf/2505.15257
Copy Paste: [[2505.15257]] When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners(https://arxiv.org/abs/2505.15257)
Keywords: language model, llm
Abstract: Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning. To evaluate this, we perform a causal intervention by ablating language-specific representations at inference time. Experiments on 10 open-source LLMs spanning 11 typologically diverse languages show that this language-specific ablation consistently boosts multilingual reasoning performance. Layer-wise analyses further confirm that language and reasoning representations can be effectively decoupled throughout the model, yielding improved multilingual reasoning capabilities, while preserving top-layer language features remains essential for maintaining linguistic fidelity. Compared to post-training such as supervised fine-tuning or reinforcement learning, our training-free ablation achieves comparable or superior results with minimal computational overhead. These findings shed light on the internal mechanisms underlying multilingual reasoning in LLMs and suggest a lightweight and interpretable strategy for improving cross-lingual generalization.
摘要：对于大型语言模型（LLM），多语言推理仍然是一个重大挑战，表现不成比例地赞成高资源语言。我们从认知神经科学中汲取灵感，这表明人类推理在很大程度上独立于语言处理，我们假设LLMS类似地将推理和语言与可分开的组件编码为可分离的组件，这些组件可以被解散以增强多语言推理。为了评估这一点，我们在推理时通过消融语言特定表示来进行因果干预。跨越11种类型的语言的10个开源LLMS的实验表明，这种特定于语言的消融始终提高多语言推理性能。层次分析进一步证实，语言和推理表示可以在整个模型中有效地解耦，从而提高了多种语言的推理能力，同时保留顶级语言功能对于维持语言忠诚度仍然是必不可少的。与训练后的训练（例如监督的微调或加强学习）相比，我们的无训练消融可在最小的计算开销中获得可比或优越的结果。这些发现阐明了LLMS中多语言推理的内部机制，并提出了改善跨语性概括的轻巧且可解释的策略。

Title: AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection

Authors: Jiatao Li, Mao Ye, Cheng Peng, Xunjian Yin, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15261
Pdf URL: https://arxiv.org/pdf/2505.15261
Copy Paste: [[2505.15261]] AGENT-X: Adaptive Guideline-based Expert Network for Threshold-free AI-generated teXt detection(https://arxiv.org/abs/2505.15261)
Keywords: agent
Abstract: Existing AI-generated text detection methods heavily depend on large annotated datasets and external threshold tuning, restricting interpretability, adaptability, and zero-shot effectiveness. To address these limitations, we propose AGENT-X, a zero-shot multi-agent framework informed by classical rhetoric and systemic functional linguistics. Specifically, we organize detection guidelines into semantic, stylistic, and structural dimensions, each independently evaluated by specialized linguistic agents that provide explicit reasoning and robust calibrated confidence via semantic steering. A meta agent integrates these assessments through confidence-aware aggregation, enabling threshold-free, interpretable classification. Additionally, an adaptive Mixture-of-Agent router dynamically selects guidelines based on inferred textual characteristics. Experiments on diverse datasets demonstrate that AGENT-X substantially surpasses state-of-the-art supervised and zero-shot approaches in accuracy, interpretability, and generalization.
摘要：现有的AI生成的文本检测方法在很大程度上取决于大量注释的数据集和外部阈值调整，限制了可解释性，适应性和零摄像的有效性。为了解决这些局限性，我们提出了Agent-X，这是一个零击的多代理框架，该框架由经典的言论和系统性功能语言学告知。具体而言，我们将检测指南组织为语义，风格和结构维度，每个尺寸都由专门的语言代理人独立评估，这些语言通过语义转向提供了明确的推理和强大的校准置信度。元代理通过置信度的聚合整合了这些评估，从而实现了无阈值，可解释的分类。此外，一种自适应的代理路由器会根据推断的文本特征动态选择指南。在各种数据集上的实验表明，Agent-X在准确性，可解释性和概括方面实质上超过了最新的监督和零击方法。

Title: Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Authors: Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15277
Pdf URL: https://arxiv.org/pdf/2505.15277
Copy Paste: [[2505.15277]] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents(https://arxiv.org/abs/2505.15277)
Keywords: language model, gpt, llm, agent
Abstract: Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, at 10× lower cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
摘要：Web导航是一个独特的领域，可以自动化许多重复的现实生活任务，并且具有挑战性，因为它需要长期胜利的顺序决策，而不是典型的多模式大语言模型（MLLM）任务。然而，到目前为止，在培训和测试时间期间可以使用的Web导航的专门奖励模型。尽管速度和成本效益很重要，但先前的工作还是利用了MLLM作为奖励模型，这对现实世界部署构成了重大限制。为了解决这个问题，在这项工作中，我们提出了一个称为Web-Shepherd的过程奖励模型（PRM），该模型可以在阶梯级中评估Web导航轨迹。为了实现这一目标，我们首先构建了WebPRM集合，这是一个具有40k级别偏好对的大规模数据集和涵盖不同域和难度级别的注释清单。接下来，我们还介绍了WebRewardBench，这是第一个评估PRMS的元评估基准。在我们的实验中，我们观察到，与在WebRewardbench上使用GPT-4O相比，我们的网络居民的精度约为30分。此外，当使用GPT-4O-Mini作为策略和Web-Shepherd作为验证者对Webarena-Lite进行测试时，我们取得了10.9分更好的性能，与使用GPT-4O-MINI作为验证者相比，成本少10个。我们的模型，数据集和代码可在链接上公开可用。

Title: Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization

Authors: Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15291
Pdf URL: https://arxiv.org/pdf/2505.15291
Copy Paste: [[2505.15291]] Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization(https://arxiv.org/abs/2505.15291)
Keywords: language model, llm, long context, hallucination
Abstract: Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.
摘要：大型语言模型（LLMS）具有明显的高级文本生成功能，包括诸如摘要之类的任务，通常会产生连贯和流利的输出。但是，由于幻觉的产生，对原料材料的忠诚仍然是一个重大挑战。尽管广泛的研究着重于检测和减少这些不准确性，但对生成的文本中幻觉的位置分布的关注较少，尤其是在长输出中。在这项工作中，我们使用长文档摘要作为关键案例研究，调查基于LLM的长期响应产生中幻觉发生的位置。着眼于长期感知长期响应产生的挑战性环境，我们发现了一种一致且与现象有关的现象：幻觉往往会不成比例地集中在产生的长期响应的后半部分中。为了理解这种偏见，我们探索了与注意力和解码长序列的动力学相关的潜在促成因素。此外，我们研究了减轻这种位置幻觉的方法，旨在提高忠诚，特别是在长期输出的结论中。

Title: Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

Authors: Xintong Wang, Yixiao Liu, Jingheng Pan, Liang Ding, Longyue Wang, Chris Biemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15297
Pdf URL: https://arxiv.org/pdf/2505.15297
Copy Paste: [[2505.15297]] Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites(https://arxiv.org/abs/2505.15297)
Keywords: language model, llm
Abstract: Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
摘要：在保留演讲者的原始意图的同时，对进攻性语言进行排毒，这是提高在线互动质量的具有挑战性但至关重要的目标。尽管大型语言模型（LLMS）在重写有毒内容方面表现出希望，但他们通常违约会过分礼貌地重写，扭曲情感语气和交流意图。这个问题在中文中尤其是敏锐的，在中文中，毒性通常通过表情符号，同音或话语背景隐含。我们提出了ToxireWriteCn，这是第一个明确设计用于保持情感极性的中国排毒数据集。该数据集包含1,556个精心注释的三胞胎，每个三胞胎包含有毒句子，一个与情感的无毒重写和有毒跨度的标记。它涵盖了五个现实世界中的情况：标准表达式，表情符号诱导的和均匀的毒性，以及单转弯和多转向对话。我们评估了17个LLM，包括在四个维度上具有变异架构的商业和开源模型：排毒精度，流利度，内容保存和情感极性。结果表明，虽然商业和MOE模型的整体表现最佳，但所有模型都在努力在更微妙或较重的环境中（例如表情符号，同型和基于对话的输入）中平衡安全性与情感忠诚度。我们发布ToxireWriteCN，以支持对中国人的可控，情感吸引排毒的未来研究。

Title: Emotional Supporters often Use Multiple Strategies in a Single Turn

Authors: Xin Bai, Guanyi Chen, Tingting He, Chenlian Zhou, Yu Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15316
Pdf URL: https://arxiv.org/pdf/2505.15316
Copy Paste: [[2505.15316]] Emotional Supporters often Use Multiple Strategies in a Single Turn(https://arxiv.org/abs/2505.15316)
Keywords: language model, llm
Abstract: Emotional Support Conversations (ESC) are crucial for providing empathy, validation, and actionable guidance to individuals in distress. However, existing definitions of the ESC task oversimplify the structure of supportive responses, typically modelling them as single strategy-utterance pairs. Through a detailed corpus analysis of the ESConv dataset, we identify a common yet previously overlooked phenomenon: emotional supporters often employ multiple strategies consecutively within a single turn. We formally redefine the ESC task to account for this, proposing a revised formulation that requires generating the full sequence of strategy-utterance pairs given a dialogue history. To facilitate this refined task, we introduce several modelling approaches, including supervised deep learning models and large language models. Our experiments show that, under this redefined task, state-of-the-art LLMs outperform both supervised models and human supporters. Notably, contrary to some earlier findings, we observe that LLMs frequently ask questions and provide suggestions, demonstrating more holistic support capabilities.
摘要：情感支持对话（ESC）对于为处于困境的人提供同理心，验证和可行的指导至关重要。但是，ESC任务的现有定义过度简化了支持响应的结构，通常将它们建模为单个策略 - 浓度对。通过对Esconv数据集的详细语料库分析，我们确定了一种常见但以前被忽视的现象：情感支持者通常在单个转弯中连续采用多种策略。我们正式重新定义了ESC任务以说明这一点，提出了修订的配方，该配方需要在对话历史上生成一系列策略 - 浓度对。为了促进这项精致任务，我们介绍了几种建模方法，包括监督的深度学习模型和大型语言模型。我们的实验表明，在重新定义任务下，最先进的LLMS的表现都优于监督模型和人类支持者。值得注意的是，与一些较早的发现相反，我们观察到LLMS经常提出问题并提供建议，并证明更多的整体支持能力。

Title: Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

Authors: Silvia Cappelletti, Tobia Poppi, Samuele Poppi, Zheng-Xin Yong, Diego Garcia-Olano, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15323
Pdf URL: https://arxiv.org/pdf/2505.15323
Copy Paste: [[2505.15323]] Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack(https://arxiv.org/abs/2505.15323)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Originally explored in AI safety, we repurpose prefilling to steer the model to respond with a clean, valid option, without modifying its parameters. Empirically, the FTP with prefilling strategy substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
摘要：大型语言模型（LLMS）越来越多地通过使用 *第一句概率 *（FTP）对多项选择性答案（MCQA）任务进行评估，该任务选择了其初始令牌可能性最高的答案选项。虽然有效，但FTP可能会很脆弱：模型可能会为无关的令牌（*错位*）分配高概率，或仅将有效令牌作为通用序言的一部分，而不是作为明确的答案选择（*误解*），破坏了符号评估的可靠性。我们提出了一个简单的解决方案：*预填充攻击*，一个结构化的自然语言前缀（例如，“*正确的选项为：*”）备用到模型输出。我们最初在AI安全中探索，我们重新利用预填充以引导模型以干净，有效的选项响应，而无需修改其参数。从经验上讲，具有预填充策略的FTP显着提高了广泛的LLM和MCQA基准的精度，校准和产出一致性。它表现优于标准FTP，并且经常与需要完整解码和外部分类器的开放式生成方法的性能相匹配，同时又要高效。我们的发现表明，预填充是一种简单，健壮且低成本的方法，可以增强在多项选择设置中基于FTP的评估的可靠性。
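The first-token probability (FTP) selection described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the `build_input` helper, and the toy logit values are hypothetical and are not taken from the paper or from any model. The sketch picks the answer option whose first token is most likely and also reports how much probability mass falls on valid option tokens, which is one simple way to see the misalignment the abstract mentions.

```python
import math

PREFILL = "The correct option is:"  # prefix quoted in the abstract


def build_input(prompt, prefill=PREFILL):
    """Hypothetical helper: append the prefill so the model continues after it."""
    return prompt + "\n" + prefill


def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    top = max(logits.values())
    exps = {tok: math.exp(val - top) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def first_token_choice(logits, options=("A", "B", "C", "D")):
    """Return the most likely option token and the total mass on all options."""
    probs = softmax(logits)
    option_probs = {opt: probs.get(opt, 0.0) for opt in options}
    choice = max(option_probs, key=option_probs.get)
    return choice, sum(option_probs.values())


# Made-up next-token logits, NOT measured from any model.
no_prefill_logits = {"The": 5.0, "A": 1.0, "B": 1.5, "C": 0.5, "D": 0.2}
with_prefill_logits = {"The": -2.0, "A": 1.0, "B": 4.0, "C": 0.5, "D": 0.2}

choice_plain, mass_plain = first_token_choice(no_prefill_logits)
choice_prefill, mass_prefill = first_token_choice(with_prefill_logits)
print(choice_plain, round(mass_plain, 3))
print(choice_prefill, round(mass_prefill, 3))
```

In the toy numbers, most of the probability mass without the prefill sits on a preamble token rather than on any option, while the prefilled case concentrates mass on the option tokens. Real logits would come from a model's next-token distribution after the text returned by `build_input`.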

Title: Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

Authors: Yuhao Zhang, Xiangnan Ma, Kaiqi Kou, Peizhuo Liu, Weiqiao Shan, Benyou Wang, Tong Xiao, Yuxin Huang, Zhengtao Yu, Jingbo Zhu
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.15333
Pdf URL: https://arxiv.org/pdf/2505.15333
Copy Paste: [[2505.15333]] Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation(https://arxiv.org/abs/2505.15333)
Keywords: language model, prompt
Abstract: The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features for various speech signals, called cross-modal (CM), and 2) learning alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the Voxpupil dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.
摘要：构建无文本语音到语音翻译（S2ST）模型的成功引起了很多关注。但是，S2ST仍然面临两个主要挑战：1）为各种语音信号提取语言特征，称为跨模式（CM）和2）以长序列（称为跨语言（CL））学习差异语言的学习对齐。我们建议单位语言克服两个建模挑战。单元语言可以被视为类似文本的表示格式，该格式使用$ n $ gram语言建模构建。我们实施多任务学习来利用单元语言指导语音建模过程。我们的最初结果表明，同时应用源和目标单元语言时发生了冲突。我们建议任务提示建模以减轻这一冲突。我们对Voxpupil数据集的四种语言进行实验。我们的方法表明，与强大的基线相比，取得了显着的改进，并实现了与经过文本训练的模型相当的性能。

Title: Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Authors: Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15337
Pdf URL: https://arxiv.org/pdf/2505.15337
Copy Paste: [[2505.15337]] Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors(https://arxiv.org/abs/2505.15337)
Keywords: language model, llm
Abstract: The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
摘要：滥用大型语言模型（LLM），例如学术窃，驱动了检测器的发展，以识别LLM生成的文本。为了绕过这些探测器，已经出现了释义攻击，目的是重写这些文本以逃避检测。尽管取得了成功，但现有的方法需要大量的数据和计算预算来培训专门的释义，并且当面对高级检测算法时，它们的攻击功效大大降低。为了解决这个问题，我们提出\ textbf {co} ntrastive \ textbf {p} araphrase \ textbf {a} ttack（copa），这是一种无训练的方法，可有效欺骗使用现成的llms的文本检测器。第一步是仔细制定鼓励LLM的说明，以产生更类似人类的文本。但是，我们观察到，LLM的固有统计偏差仍然可以导致一些携带某些类似机器的属性的生成的文本，这些属性可以由检测器捕获。为了克服这一点，Copa构建了类似辅助机器的单词分布，与LLM产生的类人样分布形成鲜明对比。通过在解码过程中从类似人类的分布中减去类似机器的模式，Copa能够产生句子，而文本检测器则不太明显。我们的理论分析表明了拟议攻击的优势。广泛的实验验证了Cop在各种情况下欺骗文本探测器方面的有效性。
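The idea of subtracting a machine-like distribution from a human-like one during decoding can be sketched as follows. The abstract does not give the exact formula, so the combination rule below, `(1 + lam) * log p_human - lam * log p_machine`, is an assumption borrowed from the generic contrastive-decoding pattern, not the paper's definition. The vocabulary, the logits, and the weight `lam` are likewise hypothetical toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]


def contrastive_distribution(human_logits, machine_logits, lam=0.5):
    """Down-weight tokens that the machine-like distribution favours.

    Assumed rule (not from the paper):
        score = (1 + lam) * log p_human - lam * log p_machine
    The scores are renormalised into a probability distribution.
    """
    human = log_softmax(human_logits)
    machine = log_softmax(machine_logits)
    scores = [(1 + lam) * h - lam * m for h, m in zip(human, machine)]
    return [math.exp(s) for s in log_softmax(scores)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy three-word vocabulary with made-up logits.
vocab = ["delve", "look", "dig"]
human_logits = [2.0, 1.8, 1.0]    # from a "write like a human" prompt
machine_logits = [3.0, 1.0, 0.5]  # from a "write like a machine" prompt

probs = contrastive_distribution(human_logits, machine_logits)
human_pick = vocab[argmax(human_logits)]
contrastive_pick = vocab[argmax(probs)]
print(human_pick, contrastive_pick)
```

With these toy values, the word most favoured by the machine-like distribution is no longer the top choice once the contrast is applied, which is the qualitative effect the abstract describes.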

Title: FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Authors: Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15347
Pdf URL: https://arxiv.org/pdf/2505.15347
Copy Paste: [[2505.15347]] FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management(https://arxiv.org/abs/2505.15347)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.
摘要：大型语言模型（LLM）越来越多地部署在多转交谈应用中，其中密钥值（KV）缓存的管理呈现出重要的瓶颈。与对话历史相关的KV缓存的线性增长构成了实质性的计算成本，现有的驱逐策略通常会通过反复压缩早期对话环境来降低性能，从而导致信息丢失和环境遗忘。本文介绍了KV高速缓存管理的Novel \ TextBf {多转移隔离机制}的小说\ TextBf {多转移隔离机制}，可以将其应用于没有训练的任何KV缓存压缩方法。 FlowKV的核心创新是一种多转移的隔离机制，可从过去的转弯中保留累积的压缩KV缓存。然后，将压缩仅对最新完整的转弯的新生成的KV对进行策略性应用，从而有效地防止了旧环境的重新压缩，从而减轻了灾难性的遗忘。我们的结果表明，FlotKV一致，显着优于基线策略，在保持指导遵循的准确性和用户偏好保留率为10.90 \％至75.40 \％，尤其是在以后的对话转弯中。
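The isolation mechanism described in this abstract, compressing only the newest turn and leaving earlier compressed turns untouched, can be sketched with a toy cache. This is an illustrative sketch, not the FlowKV implementation: `compress` is a hypothetical stand-in that keeps the most recent fraction of entries, whereas real compressors typically select entries by some importance score, and the class and function names are invented for this example.

```python
def compress(kv_pairs, keep_ratio=0.5):
    """Hypothetical stand-in compressor: keep the most recent fraction of entries."""
    keep = max(1, int(len(kv_pairs) * keep_ratio))
    return kv_pairs[-keep:]


class IsolatedKVCache:
    """Each finished turn is compressed once and then never touched again."""

    def __init__(self, keep_ratio=0.5):
        self.keep_ratio = keep_ratio
        self.compressed_turns = []  # one frozen list per completed turn

    def finish_turn(self, new_kv_pairs):
        # Only the newly generated pairs of the latest turn are compressed.
        self.compressed_turns.append(compress(new_kv_pairs, self.keep_ratio))

    def context(self):
        return [kv for turn in self.compressed_turns for kv in turn]


def recompress_everything(turns, keep_ratio=0.5):
    """Baseline for contrast: recompress the whole cache after every turn."""
    cache = []
    for new_kv_pairs in turns:
        cache = compress(cache + new_kv_pairs, keep_ratio)
    return cache


# Three toy turns with four placeholder KV entries each.
turns = [[f"t{t}_{i}" for i in range(4)] for t in (1, 2, 3)]

isolated = IsolatedKVCache()
for turn in turns:
    isolated.finish_turn(turn)

isolated_context = isolated.context()
baseline_context = recompress_everything(turns)
print(isolated_context)
print(baseline_context)
```

In this toy run the isolated cache still holds entries from the first turn, while the recompress-everything baseline has evicted all of them, mirroring the context forgetting the abstract attributes to repeated compression.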

Title: Revealing Language Model Trajectories via Kullback-Leibler Divergence

Authors: Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15353
Pdf URL: https://arxiv.org/pdf/2505.15353
Copy Paste: [[2505.15353]] Revealing Language Model Trajectories via Kullback-Leibler Divergence(https://arxiv.org/abs/2505.15353)
Keywords: language model
Abstract: A recently proposed method enables efficient estimation of the KL divergence between language models, including models with different architectures, by assigning coordinates based on log-likelihood vectors. To better understand the behavior of this metric, we systematically evaluate KL divergence across a wide range of conditions using publicly available language models. Our analysis covers comparisons between pretraining checkpoints, fine-tuned and base models, and layers via the logit lens. We find that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. Furthermore, we show that, in terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.
摘要：最近提出的方法可以通过基于log-ofikelihood矢量分配坐标来对语言模型之间的KL差异有效估计。为了更好地理解该指标的行为，我们使用公开可用的语言模型系统地评估了在各种条件的KL差异。我们的分析涵盖了预处理检查点，微调和基本模型以及通过Logit镜头层之间的比较。我们发现，通过KL Divergence测量的语言模型轨迹在跨层的训练和线状进行过程中表现出螺旋结构。此外，我们表明，就扩散指数而言，模型轨迹比体重空间中的模型轨迹更具限制。

Title: NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging

Authors: Weiming Zhang, Qingyao Li, Xinyi Dai, Jizheng Chen, Kounianhua Du, Weinan Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15356
Pdf URL: https://arxiv.org/pdf/2505.15356
Copy Paste: [[2505.15356]] NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging(https://arxiv.org/abs/2505.15356)
Keywords: language model, llm
Abstract: Debugging is a critical aspect of LLMs' coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
摘要：调试是LLM编码能力的关键方面。早期调试工作主要集中在代码级分析上，在解决需要更深入了解算法逻辑的复杂编程错误时，这些分析通常不足。大型语言模型（LLM）的最新进步已将注意力转向利用自然语言推理以增强与代码相关的任务。但是，两个基本问题仍然没有解决：哪种类型的自然语言格式最有效地调试任务？自然语言推理为调试过程带来哪些特定好处？在本文中，我们介绍了NL-Debugging，这是一个新颖的框架，该框架采用自然语言作为中间表示来改善代码调试。通过在自然语言层面上进行调试，我们证明了NL欺骗的表现优于传统的调试方法，并通过直接改进的执行反馈来实现更广泛的修改空间。我们的发现突出了自然语言推理提高自动代码调试并应对复杂编程挑战的潜力。

Title: X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

Authors: Peng Wang, Ruihan Tao, Qiguang Chen, Mengkang Hu, Libo Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15372
Pdf URL: https://arxiv.org/pdf/2505.15372
Copy Paste: [[2505.15372]] X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System(https://arxiv.org/abs/2505.15372)
Keywords: language model, gpt, llm, agent
Abstract: Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting significant academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenario in real-world applications.
摘要：最近，基于大型语言模型（LLM）的代理商在互动环境中取得了巨大的成功，引起了很大的学术和工业关注。尽管取得了这些进步，但当前的研究主要集中在英语场景上。实际上，全球有超过7,000种语言，所有这些语言都要求访问可比较的代理服务。然而，语言代理的发展仍然不足以满足多语言代理应用的不同要求。为了填补这一空白，我们介绍了X-WebagentBench，这是一种新型的多语言代理基准，在交互式网络环境中，评估了跨多种语言的语言代理的计划和相互作用性能，从而有助于全球代理情报的发展。此外，我们评估了各种LLM和跨语言对准方法的性能，从而检查了它们在增强剂中的有效性。我们的发现表明，即使是诸如GPT-4O之类的高级模型，当与跨语性技术结合使用时，也无法实现令人满意的结果。我们希望X-Webagentbench可以作为现实世界应用中多语言代理方案的宝贵基准。

Title: RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

Authors: Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Xuming Hu, Yi R. (May)Fung, Xinlei He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15386
Pdf URL: https://arxiv.org/pdf/2505.15386
Copy Paste: [[2505.15386]] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection(https://arxiv.org/abs/2505.15386)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.
摘要：大型语言模型（LLM）已经变得强大，但是幻觉仍然是其可信赖使用的至关重要的障碍。尽管以前的作品通过测量不确定性提高了幻觉检测的能力，但他们都缺乏解释幻觉发生的出处的能力，即发生哪一部分输入倾向于触发幻觉。关于及时攻击的最新作品表明，语义传播中存在不确定性，在这种语义传播中，注意机制逐渐将本地令牌信息融合到跨层的高级语义中。同时，由于语言生成的不确定性也出现在语言生成中，因为它基于概率选择了为世代的高级语义选择。基于这一点，我们提议通过这两个方面重新校准不确定性测量，这两个方面将可解释的不确定性得分访问了每个令牌和聚集体，并以困惑风格的对数平均形式作为总分数。实验表明，我们的方法在高级模型（平均AUC为0.833）上实现了各种QA数据集的最佳综合检测性能，并且我们的方法能够产生令牌级别的不确定性评分，以作为幻觉的解释。利用这些分数，我们初步发现了幻觉的混乱模式，并展示了其有希望的用法。
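The "Perplexity-style Log-Average" aggregation mentioned in this abstract can be illustrated with a short sketch. The abstract does not say how the per-token uncertainty scores are computed, so the scores below are made-up placeholders and the function names are hypothetical. The sketch only shows the aggregation shape: standard perplexity is the exponential of the mean negative log-probability, and the same log-average can be applied to any list of positive per-token scores.

```python
import math


def log_average(scores):
    """exp of the mean log, i.e. the geometric mean of positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def perplexity(token_probs):
    """Standard perplexity: exp(-(1/N) * sum(log p_i))."""
    return log_average([1.0 / p for p in token_probs])


def most_uncertain_token(tokens, scores):
    """Token-level view: which token carries the largest score."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return tokens[index]


# Made-up per-token uncertainty scores for a toy answer.
tokens = ["Paris", "is", "in", "Spain"]
token_uncertainty = [1.0, 1.0, 1.0, 16.0]

total_score = log_average(token_uncertainty)
flagged = most_uncertain_token(tokens, token_uncertainty)
ppl = perplexity([0.5, 0.5])
print(total_score, flagged, ppl)
```

Because the aggregate is a geometric mean, a single highly uncertain token raises the total score without dominating it as strongly as it would under an arithmetic mean, and the per-token scores remain available as an explanation of where the uncertainty lies.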

Title: Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Authors: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
Subjects: cs.CL, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15389
Pdf URL: https://arxiv.org/pdf/2505.15389
Copy Paste: [[2505.15389]] Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study(https://arxiv.org/abs/2505.15389)
Keywords: language model, llm, prompt
Abstract: Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
摘要：视觉模型的快速部署（VLM）扩大了安全风险，但大多数评估都依赖于人工图像。这项研究询问：面对普通用户共享的模因图像时，当前VLM的安全性如何？为了调查这个问题，我们介绍了MemesafetyBench，这是一种50,430-Insance基准配对真实的模因图像，并既有有害和良性说明。使用全面的安全分类法和基于LLM的指导生成，我们评估了单一和多转交互之间的多个VLM。我们研究了现实世界中的模因如何影响有害产出，对话环境的缓解影响以及模型量表与安全指标之间的关系。我们的发现表明，VLMS比合成图像或印刷图像更容易出现基于模因的有害提示的脆弱性。与仅文本输入相比，模因显着增加了有害反应并减少拒绝。尽管多转交互提供了部分缓解，但较高的脆弱性仍然存在。这些结果突出了对生态有效评估和更强安全机制的需求。

Title: An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations

Authors: Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei, Yutao Yue, Xinlei He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15392
Pdf URL: https://arxiv.org/pdf/2505.15392
Copy Paste: [[2505.15392]] An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations(https://arxiv.org/abs/2505.15392)
Keywords: language model, gpt, llm, chat
Abstract: The rise of Large Language Models (LLMs) like ChatGPT has advanced natural language processing, yet concerns about cognitive biases are growing. In this paper, we investigate the anchoring effect, a cognitive bias where the mind relies heavily on the first information as anchors to make affected judgments. We explore whether LLMs are affected by anchoring, the underlying mechanisms, and potential mitigation strategies. To facilitate studies at scale on the anchoring effect, we introduce a new dataset, SynAnchors. Combining refined evaluation metrics, we benchmark current widely used LLMs. Our findings show that LLMs' anchoring bias exists commonly with shallow-layer acting and is not eliminated by conventional strategies, while reasoning can offer some mitigation. This recontextualization via cognitive psychology urges that LLM evaluations focus not on standard benchmarks or over-optimized robustness tests, but on cognitive-bias-aware trustworthy evaluation.
摘要：大型语言模型（LLM）等大型语言的兴起具有先进的自然语言处理，但对认知偏见的担忧正在增长。在本文中，我们研究了锚定效应，这是一种认知偏见，在这种偏见中，大脑在很大程度上依赖第一个信息作为锚定以做出影响的判断。我们探索LLM是否受锚定，基本机制和潜在缓解策略的影响。为了促进对锚定效果的大规模研究，我们介绍了一个新的数据集，合成器。结合精制评估指标，我们基准了广泛使用的LLM。我们的发现表明，LLMS的锚定偏见通常存在于浅层表演中，并且不会被传统策略所消除，而推理可以提供一些缓解。通过认知心理学的这种重新定义敦促，LLM评估不集中于标准的基准或过度优化的鲁棒性测试，而是关注认知偏见的值得信赖的评估。

Title: How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

Authors: Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15404
Pdf URL: https://arxiv.org/pdf/2505.15404
Copy Paste: [[2505.15404]] How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study(https://arxiv.org/abs/2505.15404)
Keywords: prompt
Abstract: Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning processes can attain comparable safety performance, and that such processes are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data during safety fine-tuning is helpful to balance safety and over-refusal. Overall, we hope our empirical study could provide a more holistic picture on enhancing the safety of LRMs. The code and data used in our experiments are released in this https URL.
摘要：大型推理模型（LRMS）在数学和编程等推理密集型任务上取得了巨大的成功。但是，它们增强的推理功能并不一定会改善安全性能，在某些情况下，甚至可能会降解它。这提出了一个重要的研究问题：我们如何提高LRM的安全性？在本文中，我们介绍了一项有关如何通过监督的微调（SFT）提高LRM的安全性的全面实证研究。我们的调查始于意外的观察：直接从DeepSeek-R1中提取安全反应并不能显着提高安全性。我们分析了这一现象，并确定了有助于该现象的三种关键故障模式。然后，我们证明，在数据蒸馏过程中明确解决这些问题可能会导致大大改善。接下来，我们探讨是否需要长期且复杂的推理过程才能实现安全性。有趣的是，我们发现，简单地使用基于简短或基于模板的推理过程就可以达到可比的安全性能，并且对于模型而言，与更复杂的推理链相比，模型更容易学习。这些发现促使人们对推理在确保安全方面的作用进行了更深入的反思。最后，我们发现在安全微调期间混合数学推理数据有助于平衡安全性和过度倍增。总体而言，我们希望我们的实证研究能够为提高LRM的安全提供更全面的画面。我们实验中使用的代码和数据在此HTTPS URL中发布。

Title: Trends and Challenges in Authorship Analysis: A Review of ML, DL, and LLM Approaches

Authors: Nudrat Habib, Tosin Adewumi, Marcus Liwicki, Elisa Barney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15422
Pdf URL: https://arxiv.org/pdf/2505.15422
Copy Paste: [[2505.15422]] Trends and Challenges in Authorship Analysis: A Review of ML, DL, and LLM Approaches(https://arxiv.org/abs/2505.15422)
Keywords: llm
Abstract: Authorship analysis plays an important role in diverse domains, including forensic linguistics, academia, cybersecurity, and digital content authentication. This paper presents a systematic literature review on two key sub-tasks of authorship analysis: Author Attribution and Author Verification. The review explores SOTA methodologies, ranging from traditional ML approaches to DL models and LLMs, highlighting their evolution, strengths, and limitations, based on studies conducted from 2015 to 2024. Key contributions include a comprehensive analysis of methods, techniques, their corresponding feature extraction techniques, datasets used, and emerging challenges in authorship analysis. The study highlights critical research gaps, particularly in low-resource language processing, multilingual adaptation, cross-domain generalization, and AI-generated text detection. This review aims to help researchers by giving an overview of the latest trends and challenges in authorship analysis. It also points out possible areas for future study. The goal is to support the development of better, more reliable, and more accurate authorship analysis systems across diverse textual domains.
摘要：作者分析在不同领域，包括法医语言学，学术界，网络安全和数字内容身份验证中起着重要作用。本文介绍了有关作者身份分析的两个关键子任务的系统文献综述。作者归因和作者验证。 The review explores SOTA methodologies, ranging from traditional ML approaches to DL models and LLMs, highlighting their evolution, strengths, and limitations, based on studies conducted from 2015 to 2024. Key contributions include a comprehensive analysis of methods, techniques, their corresponding feature extraction techniques, datasets used, and emerging challenges in authorship analysis.该研究强调了关键的研究差距，尤其是在低资源语言处理，多语言适应性，跨域概括和AI生成的文本检测中。这篇评论旨在通过概述作者资格分析中的最新趋势和挑战来帮助研究人员。它还指出了未来研究的可能领域。目的是支持各种文本领域中更好，更可靠和准确的作者分析系统的发展。

Title: Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models

Authors: Yan-Shuo Liang, Wu-Jun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15424
Pdf URL: https://arxiv.org/pdf/2505.15424
Copy Paste: [[2505.15424]] Gated Integration of Low-Rank Adaptation for Continual Learning of Language Models(https://arxiv.org/abs/2505.15424)
Keywords: language model
Abstract: Continual learning (CL), which requires the model to learn multiple tasks sequentially, is crucial for language models (LMs). Recently, low-rank adaptation (LoRA), one of the most representative parameter-efficient fine-tuning (PEFT) methods, has gained increasing attention in CL of LMs. However, most existing CL methods based on LoRA typically expand a new LoRA branch to learn each new task and force the new and old LoRA branches to contribute equally to old tasks, potentially leading to forgetting. In this work, we propose a new method, called gated integration of low-rank adaptation (GainLoRA), for CL of LMs. GainLoRA expands a new LoRA branch for each new task and introduces gating modules to integrate the new and old LoRA branches. Furthermore, GainLoRA leverages the new gating module to minimize the contribution from the new LoRA branch to old tasks, effectively mitigating forgetting and improving the model's overall performance. Experimental results on CL benchmarks demonstrate that GainLoRA outperforms existing state-of-the-art methods.
摘要：持续学习（CL）要求模型依次学习多个任务，这对于语言模型（LMS）至关重要。最近，低排名适应性（LORA）是最具代表性的参数效率微调（PEFT）方法之一，它在LMS CL中的关注越来越多。但是，大多数基于洛拉的现有CL方法通常会扩展一个新的洛拉分支，以学习每个新任务，并迫使新的和旧的洛拉分支，以同等地贡献旧任务，可能导致忘记。在这项工作中，我们提出了一种新方法，称为LMS CL的低级别适应性（Gainlora）的门控积分。 Gainlora为每个新任务扩展了一个新的Lora分支，并引入了门控模块以整合新的洛拉分支。此外，Gainlora利用新的门控模块最大程度地减少了新洛拉分支对旧任务的贡献，从而有效地减轻了忘记并改善模型的整体性能。基准基准的实验结果表明，Gainlora的表现优于现有的最新方法。

Title: NeoN: A Tool for Automated Detection, Linguistic and LLM-Driven Analysis of Neologisms in Polish

Authors: Aleksandra Tomaszewska, Dariusz Czerski, Bartosz Żuk, Maciej Ogrodniczuk
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15426
Pdf URL: https://arxiv.org/pdf/2505.15426
Copy Paste: [[2505.15426]] NeoN: A Tool for Automated Detection, Linguistic and LLM-Driven Analysis of Neologisms in Polish(https://arxiv.org/abs/2505.15426)
Keywords: llm
Abstract: We present NeoN, a tool for detecting and analyzing Polish neologisms. Unlike traditional dictionary-based methods requiring extensive manual review, NeoN combines reference corpora, Polish-specific linguistic filters, an LLM-driven precision-boosting filter, and daily RSS monitoring in a multi-layered pipeline. The system uses context-aware lemmatization, frequency analysis, and orthographic normalization to extract candidate neologisms while consolidating inflectional variants. Researchers can verify candidates through an intuitive interface with visualizations and filtering controls. An integrated LLM module automatically generates definitions and categorizes neologisms by domain and sentiment. Evaluations show NeoN maintains high accuracy while significantly reducing manual effort, providing an accessible solution for tracking lexical innovation in Polish.
摘要：霓虹灯，一种用于检测和分析波兰新系统的工具。与需要大量手动审查的传统基于词典的方法不同，霓虹灯结合了参考文献，波兰特定的语言滤波器，LLM驱动的Precision-Boost-Boosting Filter和Daily RSS监视多层管道中的RSS监视。该系统使用上下文感知的障碍，频率分析和拼字法归一化来提取候选新系统，同时巩固拐点变体。研究人员可以通过可视化和过滤控件的直观界面来验证候选人。集成的LLM模块会自动生成定义，并按域和情感对新系统进行分类。评估表明，霓虹灯的精度很高，同时大大减少了手动努力，为跟踪波兰的词汇创新提供了可访问的解决方案。

Title: Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions

Authors: Zhiwen Li, Die Chen, Mingyuan Fan, Cen Chen, Yaliang Li, Yanhao Wang, Wenmeng Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15427
Pdf URL: https://arxiv.org/pdf/2505.15427
Copy Paste: [[2505.15427]] Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions(https://arxiv.org/abs/2505.15427)
Keywords: prompt
Abstract: The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.
摘要：扩散模型产生高保真图像的显着能力导致了它们广泛的采用。但是，人们对它们在工作（NSFW）内容（NSFW）的潜力并表现出社会偏见，阻碍了它们在现实世界应用中的实际使用，这也引起了人们的关注。为了应对这一挑战，先前的工作重点是采用安全过滤器来识别和排除有毒文本，或者对预先训练的预训练的扩散模型来消除敏感概念。不幸的是，现有的方法努力实现令人满意的性能，因为它们可以对正常模型输出产生重大影响，而在某些情况下仍无法防止产生有害内容。在本文中，我们提出了一种新型的自我发现方法，用于识别嵌入空间中的语义方向向量，以限制在安全区域内嵌入的文本。我们的方法规避了需要在输入文本中纠正单个单词的需要，并将整个文本提示转向嵌入空间中的安全区域，从而增强模型鲁棒性，以防止所有可能不安全的提示。此外，我们采用低级适应性（LORA）来进行语义方向矢量初始化，以减少对其他语义的模型性能的影响。此外，我们的方法也可以与现有方法集成以改善其社会责任。基准数据集的广泛实验表明，与几种最先进的基线相比，我们的方法可以有效地减少NSFW含量并减轻扩散模型产生的社会偏见。

Title: Likelihood Variance as Text Importance for Resampling Texts to Map Language Models

Authors: Momose Oyama, Ryo Kishino, Hiroaki Yamagiwa, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15428
Pdf URL: https://arxiv.org/pdf/2505.15428
Copy Paste: [[2505.15428]] Likelihood Variance as Text Importance for Resampling Texts to Map Language Models(https://arxiv.org/abs/2505.15428)
Keywords: language model
Abstract: We address the computational cost of constructing a model map, which embeds diverse language models into a common space for comparison via KL divergence. The map relies on log-likelihoods over a large text set, making the cost proportional to the number of texts. To reduce this cost, we propose a resampling method that selects important texts with weights proportional to the variance of log-likelihoods across models for each text. Our method significantly reduces the number of required texts while preserving the accuracy of KL divergence estimates. Experiments show that it achieves comparable performance to uniform sampling with about half as many texts, and also facilitates efficient incorporation of new models into an existing map. These results enable scalable and efficient construction of language model maps.
摘要：我们解决了构建模型图的计算成本，该图表将各种语言模型嵌入了通过KL Divergence进行比较的共同空间。该地图依赖于大型文本集上的日志类样，使成本与文本数量成正比。为了降低这一成本，我们提出了一种重新采样方法，该方法选择了与每个文本跨模型之间的日志样本差异成正比的重要文本。我们的方法大大减少了所需文本的数量，同时保留了KL差异估计的准确性。实验表明，它的性能与大约一半文本的统一抽样相当，并且还促进了将新模型有效地融合到现有地图中。这些结果可实现语言模型图的可扩展有效构建。
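
The resampling idea in the abstract above can be illustrated with a minimal Python sketch. It assumes a small matrix `loglik[m][t]` holding the log-likelihood of text `t` under model `m`, and draws texts with probability proportional to the variance of those values across models. The function names, the sampling with replacement, and the uniform fallback for zero variance are illustrative assumptions, not details taken from the paper.

```python
import random
from statistics import pvariance


def variance_weights(loglik):
    """Return one sampling probability per text.

    loglik[m][t] is the log-likelihood of text t under model m
    (hypothetical layout chosen for this sketch).
    """
    n_texts = len(loglik[0])
    # Variance of each text's log-likelihood across the models.
    variances = [pvariance([row[t] for row in loglik]) for t in range(n_texts)]
    total = sum(variances)
    if total == 0:
        # All models agree on every text: fall back to uniform sampling.
        return [1.0 / n_texts] * n_texts
    return [v / total for v in variances]


def resample_texts(loglik, k, seed=0):
    """Draw k text indices (with replacement) proportional to variance."""
    probs = variance_weights(loglik)
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=k)
```

Texts on which all models agree get weight zero here, so the sample concentrates on texts that actually separate the models. Any reweighting needed to keep downstream KL estimates unbiased is left out of this sketch.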

Title: Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Authors: Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, Dian Jiao, Dong Du, Dong Wang, Feng Zhang, Fengzong Lian, Guanghui Xu, Guanwei Zhang, Hai Wang, Haipeng Luo, Han Hu, Huilin Xu, Jiajia Wu, Jianchen Zhu, Jianfeng Yan, Jiaqi Zhu, Jihong Zhang, Jinbao Xue, Jun Xia, Junqiang Zheng, Kai Liu, Kai Zhang, Kai Zheng, Kejiao Li, Keyao Wang, Lan Jiang, Lixin Liu, Lulu Wu, Mengyuan Huang, Peijie Yu, Peiqi Wang, Qian Wang, Qianbiao Xiang, Qibin Liu, Qingfeng Sun, Richard Guo, Ruobing Xie, Saiyong Yang, Shaohua Chen, Shihui Hu, Shuai Li, Shuaipeng Li, Shuang Chen, Suncong Zheng, Tao Yang, Tian Zhang, Tinghao Yu, Weidong Han, Weijie Liu, Weijin Zhou, Weikang Wang, Wesleye Chen, Xiao Feng, Xiaoqin Ren, Xingwu Sun, Xiong Kuang, Xuemeng Huang, Xun Cao, Yanfeng Chen, Yang Du, Yang Zhen, Yangyu Tao, Yaping Deng, Yi Shen, Yigeng Hong, Yiqi Chen, Yiqing Huang, Yuchi Deng, Yue Mao, Yulong Wang, Yuyuan Zeng, Zenan Xu, Zhanhui Kang, Zhe Zhao, ZhenXiang Yan, Zheng Fang, Zhichao Hu, Zhongzhi Chen, Zhuoyu Li, Zongwei Li, Alex Yan, Ande Liang, Baitong Liu, Beiping Pan, Bin Xing, Binghong Wu, Bingxin Qu, Bolin Ni, Boyu Wu, Chen Li, Cheng Jiang, Cheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15431
Pdf URL: https://arxiv.org/pdf/2505.15431
Copy Paste: [[2505.15431]] Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought(https://arxiv.org/abs/2505.15431)
Keywords: language model, llm, chat, chain-of-thought
Abstract: As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.
摘要：随着大型语言模型（LLMS）迅速发展，我们引入了Hunyuan-Turbos，这是一种新型的大型混合变压器Mamba专家（MOE）模型的混合物。它协同结合了Mamba的长期处理效率与变形金刚的卓越背景理解。 Hunyuan-Turbos具有自适应的长期思想链（COT）机制，在简单查询的快速响应之间动态切换，而对复杂问题的深度“思考”模式，以优化计算资源。从结构上讲，该56B激活（560B总数）参数模型使用创新的AMF/MF块模式采用128层（MAMBA2，注意，FFN）。更快的MAMBA2可确保线性复杂性，分组的疑问注意力最小化KV缓存，而FFNS使用MOE结构。它在16T高质量的代币中进行了预训练，它支持256K上下文长度，并且是第一个由行业部署的大型Mamba模型。我们的全面培训策略通过监督的微调（3M指令），一种新型的自适应长期舒适的COT融合方法，进行迭代改进的多轮审议学习以及两阶段的大规模增强学习过程靶向STEM和一般指导，从而增强了功能。评估表现出强劲的表现：LMSYS CHATBOT AREAA的总前7名，得分为1356，表现优于Gemini-2.0-Flash-001（1352）和O4-MINI-MINI-2025-04-16（1345）（1345）等领先模型。在23个自动化基准中，Turbos平均达到77.9％。 Hunyuan-Turbos平衡了高性能和效率，比许多推理模型提供了较低的推理成本功能，为有效的大规模大规模预训练模型建立了新的范式。

Title: On the Generalization vs Fidelity Paradox in Knowledge Distillation

Authors: Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15442
Pdf URL: https://arxiv.org/pdf/2505.15442
Copy Paste: [[2505.15442]] On the Generalization vs Fidelity Paradox in Knowledge Distillation(https://arxiv.org/abs/2505.15442)
Keywords: language model
Abstract: Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to $10\%$, with a peak task specific gain of $22\%$, while providing only marginal benefits ($\sim 1.3\%$) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
摘要：知识蒸馏（KD）是将大型语言模型压缩到较小语言的同时，在保持性能的同时，将其压缩为关键技术。尽管最近有KD研究的牵引力，但其对较小语言模型（LMS）和驱动知识转移的机制的有效性仍未得到充实。在这项工作中，我们介绍了KD对KD的第一个大规模经验和统计分析，范围从0.5B到7b参数，在14个复杂的推理任务上，在零弹片设置中。我们的发现表明，KD可以将较小型号的平均性能提高到最高$ 10 \％$，而峰值任务的特定收益为$ 22 \％$，而仅提供较大型号的边际收益（$ \ sim 1.3 \％$）。令人惊讶的是，教师表现对学生成绩的影响很小，而教师任务专业知识影响KD的效率。一项相关研究表明，较小的LMS从KD中受益更多，而较大的LMS显示出收益减少。此外，我们发现了学生绩效的改善与推理忠诚度之间的错位，这表明尽管KD提高了准确性，但它并不总是维护老师的结构化决策过程。我们的消融研究进一步凸显了教师信号和对数平滑在影响学生蒸馏后表现的重要性。总体而言，我们的研究对KD进行了全面的经验和统计评估，在将知识从较大的LMS提炼到较小的LM时，强调了其收益和权衡。

Title: AdUE: Improving uncertainty estimation head for LoRA adapters in LLMs

Authors: Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin, Roman Alferov, Alexey Zaytsev
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2505.15443
Pdf URL: https://arxiv.org/pdf/2505.15443
Copy Paste: [[2505.15443]] AdUE: Improving uncertainty estimation head for LoRA adapters in LLMs(https://arxiv.org/abs/2505.15443)
Keywords: language model, llm
Abstract: Uncertainty estimation remains a critical challenge in adapting pre-trained language models to classification tasks, particularly under parameter-efficient fine-tuning approaches such as adapters. We introduce AdUE, an efficient post-hoc uncertainty estimation (UE) method, to enhance softmax-based estimates. Our approach (1) uses a differentiable approximation of the maximum function and (2) applies additional regularization through L2-SP, anchoring the fine-tuned head weights and regularizing the model. Evaluations on five NLP classification datasets across four language models (RoBERTa, ELECTRA, LLaMA-2, Qwen) demonstrate that our method consistently outperforms established baselines such as Mahalanobis distance and softmax response. Our approach is lightweight (no base-model changes) and produces better-calibrated confidence.
摘要：不确定性估计仍然是将预训练的语言模型调整为分类任务的关键挑战，尤其是在参数有效的微调方法（例如适配器）下。我们介绍了ADUE1，这是一种有效的事后不确定性估计（UE）方法，以增强基于软疗法的估计。我们的方法（1）使用最大函数的可区分近似值，（2）通过L2-SP应用额外的正则化，固定了微调的头部重量并正规化模型。对跨四种语言模型（Roberta，Electra，Llama-2，Qwen）进行五个NLP分类数据集的评估表明，我们的方法始终优于建立的基线，例如Mahalanobis距离和SoftMax响应。我们的方法是轻巧（无基本模型变化），并产生更好的信心。
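
The two ingredients named in the abstract above can be sketched in a few lines of Python. The abstract does not say which differentiable approximation of the maximum is used, so the temperature-scaled log-sum-exp below is an assumption, and the function names and default values are hypothetical. The L2-SP term follows its usual definition: a squared-distance penalty that keeps fine-tuned weights close to their starting point.

```python
import math


def smooth_max(values, temperature=0.1):
    """Differentiable upper bound on max(values).

    Computes temperature * logsumexp(values / temperature), which
    approaches max(values) as the temperature goes to zero. This is one
    common choice, assumed here for illustration.
    """
    m = max(values)  # subtract the max for numerical stability
    return m + temperature * math.log(
        sum(math.exp((v - m) / temperature) for v in values)
    )


def l2_sp_penalty(weights, anchor, strength=0.01):
    """L2-SP regularizer: squared distance from the anchor weights."""
    return strength * sum((w - a) ** 2 for w, a in zip(weights, anchor))
```

For example, `smooth_max([0.7, 0.2, 0.1], temperature=0.01)` is numerically very close to 0.7, the softmax-response confidence, while remaining differentiable in every input.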

Title: Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization

Authors: Yutao Zhu, Jiajie Jin, Hongjin Qian, Zheng Liu, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15444
Pdf URL: https://arxiv.org/pdf/2505.15444
Copy Paste: [[2505.15444]] Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization(https://arxiv.org/abs/2505.15444)
Keywords: llm, retrieval-augmented generation
Abstract: Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
摘要：现有的研究已经优化了各种子任务（例如查询理解和检索精炼）跨任务中的检索生成（RAG），但是将这些优化纳入统一的框架仍然具有挑战性。为了解决这个问题，这项工作提出了Rolerag，Rolerag是一个统一的抹布框架，通过特定角色的代币优化实现了有效的多任务处理。 Rolerag包含六个模块，每个模块都在抹布过程中处理特定的子任务。此外，我们引入了一个查询图来表示查询的分解，可以根据分解状态动态解析。所有模块均由相同的基础LLM驱动，这些LLM由单独优化的特定于任务的角色标记区别。该设计使Rolerag能够在单个LLM实例中动态激活不同的模块，从而简化部署并减少资源消耗。五个开放域问答数据集的实验结果证明了我们框架的有效性，可推广性和灵活性。

Title: Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

Authors: Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15456
Pdf URL: https://arxiv.org/pdf/2505.15456
Copy Paste: [[2505.15456]] Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment(https://arxiv.org/abs/2505.15456)
Keywords: language model, gpt, llm, prompt
Abstract: Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
摘要：个性化对齐对于使大型语言模型（LLM）有效地参与以用户为中心的对话至关重要。虽然最近的基于及时的及时优化方法提供了初步解决方案，但由于它们固有的静态和浅设计，它们在冷启动场景和长期个性化方面缺乏。在这项工作中，我们介绍了个性化对齐（RLPA）框架的强化学习，其中LLM与模拟用户模型相互作用以通过对话来迭代地推断和完善用户配置文件。培训过程以双级奖励结构为指导：概况奖励鼓励准确构建用户表示，而响应奖励激发了与推断概况一致的响应的产生。我们通过微调QWEN-2.5-3B-INSTRUCT进行实例化RLPA，从而导致Qwen-RLPA，从而在个性化的对话中实现了最先进的表现。经验评估表明，QWEN-RLPA始终优于促使和离线微调基线的表现，甚至超过了诸如Claude-3.5和GPT-4O之类的先进商业模型。进一步的分析强调，与最近以推理为重点的LLM相比，QWEN-RLPA在核对冲突的用户偏好，维持长期个性化并提供更有效的推断方面的鲁棒性。这些结果强调了动态概况推断的潜力，这是建立个性化对话系统的更有效范式。

Title: Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning

Authors: Yukun Zhao, Lingyong Yan, Zhenyang Li, Shuaiqiang Wang, Zhumin Chen, Zhaochun Ren, Dawei Yin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15467
Pdf URL: https://arxiv.org/pdf/2505.15467
Copy Paste: [[2505.15467]] Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning(https://arxiv.org/abs/2505.15467)
Keywords: language model, prompt
Abstract: Large language models have achieved remarkable success in various tasks. However, it is challenging for them to learn new tasks incrementally due to catastrophic forgetting. Existing approaches rely on experience replay, optimization constraints, or task differentiation, which encounter strict limitations in real-world scenarios. To address these issues, we propose Joint Flashback Adaptation. We first introduce flashbacks -- a limited number of prompts from old tasks -- when adapting to new tasks and constrain the deviations of the model outputs compared to the original one. We then interpolate latent tasks between flashbacks and new tasks to enable jointly learning relevant latent tasks, new tasks, and flashbacks, alleviating data sparsity in flashbacks and facilitating knowledge sharing for smooth adaptation. Our method requires only a limited number of flashbacks without access to the replay data and is task-agnostic. We conduct extensive experiments on state-of-the-art large language models across 1000+ instruction-following tasks, arithmetic reasoning tasks, and general reasoning tasks. The results demonstrate the superior performance of our method in improving generalization on new tasks and reducing forgetting in old tasks.
摘要：大型语言模型在各种任务中取得了巨大的成功。但是，由于灾难性的遗忘，他们逐步学习新任务是一项挑战。现有方法依赖于经验重播，优化约束或任务差异化，在现实世界中遇到严格的限制。为了解决这些问题，我们提出了联合闪回适应。我们首先引入闪回（从旧任务中的数量有限的提示）进行适应新任务时，并约束与原始输出相比的模型输出的偏差。然后，我们在闪回和新任务之间插值潜在任务，以使共同学习相关的潜在任务，新任务和闪回，减轻闪回中的数据稀疏，并促进知识共享以进行平滑适应。我们的方法仅需要有限数量的闪回，而无需访问重播数据，并且是任务不可能的。我们对1000多个指令遵守任务，算术推理任务和一般推理任务进行的最先进的大语言模型进行了广泛的实验。结果表明，我们方法在改善对新任务的概括并减少旧任务中的遗忘方面的卓越表现。

Title: CoLA: Collaborative Low-Rank Adaptation

Authors: Yiyun Zhou, Chang Yao, Jingyuan Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15471
Pdf URL: https://arxiv.org/pdf/2505.15471
Copy Paste: [[2505.15471]] CoLA: Collaborative Low-Rank Adaptation(https://arxiv.org/abs/2505.15471)
Keywords: language model, llm
Abstract: The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing return on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MOE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, and introduce three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices $A$ and $B$. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available at this https URL.
摘要：大语言模型（LLMS）的缩放定律揭示了幂律的关系，显示出随着模型量表的增加而表现的减少。虽然从头开始培训LLM是资源密集的，但针对特定任务进行了细微培训的模型已成为一种实用的选择。完整的微调（FFT）实现了强劲的性能；但是，它在计算上昂贵且效率低下。已经提出了像Lora这样的参数效率微调（PEFT）方法，通过冻结预训练的模型并添加轻巧的特定任务模块来解决这些挑战。尤其是洛拉（Lora）已被证明有效，但其应用于多任务场景受到任务之间的干扰的限制。最近的方法，例如Experts（MOE）和非对称Lora的混合物，旨在减轻这些问题，但由于其固定结构而导致的样本稀缺性和噪声干扰仍然困难。作为回应，我们提出了一种具有高效初始化计划的更灵活的洛拉体系结构，并提出了三种协作策略，以通过更好地利用矩阵$A$和$B$之间的定量关系来提高性能。我们的实验证明了可乐的有效性和鲁棒性，表现优于现有的PEFT方法，尤其是在低样本场景中。我们的数据和代码在此HTTPS URL上已完全公开可用。

Title: PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions

Authors: Song Dai, Yibo Yan, Jiamin Su, Dongfang Zihao, Yubo Gao, Yonghua Hei, Jungang Li, Junyan Zhang, Sicheng Tao, Zhuoran Gao, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15472
Pdf URL: https://arxiv.org/pdf/2505.15472
Copy Paste: [[2505.15472]] PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions(https://arxiv.org/abs/2505.15472)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
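Summary (translated): Multimodal large language models (MLLMs) have shown remarkable capabilities across diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning poses unique challenges: it requires grounding in physical conditions and interpreting multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or on problem-solving alone, and so overlook the key intermediate steps of variable identification and process formulation. To address these limitations, we introduce PhysicsArena, the first multimodal physics reasoning benchmark designed to evaluate MLLMs holistically along three key dimensions: variable identification, physical process formulation, and solution derivation. PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.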

Title: LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

Authors: Zhanyue Qin, Yue Ding, Deyuan Liu, Qingbin Liu, Junxian Cai, Xi Chen, Zhiying Tu, Dianhui Chu, Cuiyun Gao, Dianbo Sui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15475
Pdf URL: https://arxiv.org/pdf/2505.15475
Copy Paste: [[2505.15475]] LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models(https://arxiv.org/abs/2505.15475)
Keywords: language model, llm, prompt
Abstract: Nowadays, Large Language Models (LLMs) have attracted widespread attention due to their powerful performance. However, due to the unavoidable exposure to socially biased data during training, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets named GenBiasEval and GenHintEval. GenBiasEval is responsible for evaluating the degree of gender bias in LLMs, accompanied by an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). Meanwhile, GenHintEval is used to assess whether LLMs can provide responses consistent with prompts that contain gender hints, along with the accompanying evaluation metric UB-Score (UnBias Score). Besides, in order to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. This algorithm first ranks specific LLM blocks by their relevance to gender bias in descending order using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned using a carefully designed loss function. Numerous experiments have shown that our proposed LFTF algorithm can significantly mitigate gender bias in LLMs while maintaining their general capabilities.
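Summary (translated): Large language models (LLMs) have attracted widespread attention for their strong performance. However, because exposure to socially biased data during training is unavoidable, LLMs tend to exhibit social biases, particularly gender bias. To better explore and quantify the degree of gender bias in LLMs, we propose a pair of datasets, GenBiasEval and GenHintEval. GenBiasEval evaluates the degree of gender bias in LLMs and comes with an evaluation metric named AFGB-Score (Absolutely Fair Gender Bias Score). GenHintEval assesses whether LLMs can give responses consistent with prompts that contain gender hints, and comes with the metric UB-Score (UnBias Score). In addition, to mitigate gender bias in LLMs more effectively, we present the LFTF (Locating First and Then Fine-Tuning) algorithm. It first ranks specific LLM blocks by their relevance to gender bias, in descending order, using a metric called BMI (Block Mitigating Importance Score). Based on this ranking, the block most strongly associated with gender bias is then fine-tuned with a carefully designed loss function. Numerous experiments show that LFTF significantly mitigates gender bias in LLMs while preserving their general capabilities.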

Title: KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance

Authors: Qihuang Zhong, Liang Ding, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15480
Pdf URL: https://arxiv.org/pdf/2505.15480
Copy Paste: [[2505.15480]] KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance(https://arxiv.org/abs/2505.15480)
Keywords: language model, llm, hallucination
Abstract: Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that due to the conflicts between LLMs' internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs' performance. The core of KaFT is to adapt the training weight by assigning different rewards for different training samples according to conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. More analyses prove that KaFT effectively improves the model generalization and alleviates the hallucination.
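Summary (translated): Supervised fine-tuning (SFT) is a common way to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature shows that, because of conflicts between an LLM's internal knowledge and the context knowledge in the training data, vanilla SFT on the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then run a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with different levels of conflict contribute differently, and SFT on data with large conflicts leads to catastrophic performance drops; 2) compared with filtering out the conflict data directly, applying it appropriately is more beneficial. Motivated by this, we propose a simple yet effective Knowledge-aware Fine-tuning (KaFT) approach to boost LLM performance. The core of KaFT is to adapt the training weight by assigning different rewards to different training samples according to their conflict level. Extensive experiments show that KaFT brings consistent and significant improvements across four LLMs. Further analyses show that KaFT effectively improves model generalization and alleviates hallucination.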
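The KaFT abstract describes adapting the training weight of each sample according to its conflict level. A minimal sketch of that idea as a weighted mean of per-sample losses is given below. The three conflict levels, the weight values, and the function name `weighted_loss` are assumptions made for illustration; the abstract does not specify how the rewards are chosen.

```python
def weighted_loss(losses, conflict_levels, level_weights):
    """Weighted mean of per-sample losses, with weights looked up by conflict level."""
    weights = [level_weights[level] for level in conflict_levels]
    total = sum(weights)
    if total == 0:
        raise ValueError("all sample weights are zero")
    return sum(w * l for w, l in zip(weights, losses)) / total


# Hypothetical weights: high-conflict samples are down-weighted, not dropped,
# in line with the finding that filtering them out entirely is less beneficial.
level_weights = {"low": 1.0, "medium": 0.5, "high": 0.1}

losses = [2.0, 4.0, 10.0]            # per-sample losses (toy values)
levels = ["low", "medium", "high"]   # conflict level of each sample

loss = weighted_loss(losses, levels, level_weights)
uniform = weighted_loss(losses, levels, {"low": 1.0, "medium": 1.0, "high": 1.0})
```

Setting every weight to 1.0 recovers the plain mean used by vanilla SFT, so the weighting scheme can be compared against that baseline directly.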

Title: Collaborative Problem-Solving in an Optimization Game

Authors: Isidora Jeknic, Alex Duchnowski, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15490
Pdf URL: https://arxiv.org/pdf/2505.15490
Copy Paste: [[2505.15490]] Collaborative Problem-Solving in an Optimization Game(https://arxiv.org/abs/2505.15490)
Keywords: llm, prompt, agent
Abstract: Dialogue agents that support human users in solving complex tasks have received much attention recently. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which the agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Our best agent solves 45% of games optimally in self-play. It also demonstrates an ability to collaborate successfully with human users and generalize to unfamiliar graphs.
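Summary (translated): Dialogue agents that support human users in solving complex tasks have recently received much attention. Many such tasks are NP-hard optimization problems that require careful collaborative exploration of the solution space. We introduce a novel dialogue game in which agents collaboratively solve a two-player Traveling Salesman problem, along with an agent that combines LLM prompting with symbolic mechanisms for state tracking and grounding. Our best agent solves 45% of games optimally in self-play. It also shows an ability to collaborate successfully with human users and to generalize to unfamiliar graphs.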

Title: Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs

Authors: Federico Ranaldi, Andrea Zugarini, Leonardo Ranaldi, Fabio Massimo Zanzotto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15501
Pdf URL: https://arxiv.org/pdf/2505.15501
Copy Paste: [[2505.15501]] Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs(https://arxiv.org/abs/2505.15501)
Keywords: language model, llm, prompt
Abstract: We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
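Summary (translated): We introduce the concept of protoknowledge to formalize and measure how token sequences that encode Knowledge Graphs are internalized during pretraining and used at inference time by large language models (LLMs). LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they turn this memorization into reusable knowledge through generalization. We categorize protoknowledge into lexical, hierarchical, and topological forms, according to the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs) and analyze its general properties, such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying the prompting strategy according to input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool for exploring Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.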

Title: Multilingual Test-Time Scaling via Initial Thought Transfer

Authors: Prasoon Bajpai, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15508
Pdf URL: https://arxiv.org/pdf/2505.15508
Copy Paste: [[2505.15508]] Multilingual Test-Time Scaling via Initial Thought Transfer(https://arxiv.org/abs/2505.15508)
Keywords: prompt
Abstract: Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-LLama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning. Building on our findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to enhance test-time scaling across all languages, addressing inconsistencies in multilingual reasoning performance. MITT significantly boosts DeepSeek-R1-Distill-Qwen-7B's reasoning performance, especially for underrepresented languages.
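Summary (translated): Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-LLama-8B and DeepSeek-R1-Distill-Qwen-7B on both high- and low-resource Latin-script languages. Our findings show that the relative gains from test-time scaling vary significantly across languages. In addition, models frequently switch to English mid-reasoning, even under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from those in English but also have lower internal consistency across generations in their early reasoning. Building on these findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised, lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to improve test-time scaling across all languages and address inconsistencies in multilingual reasoning performance. MITT significantly boosts the reasoning performance of DeepSeek-R1-Distill-Qwen-7B, especially for underrepresented languages.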

Title: Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs

Authors: Lang Gao, Kaiyang Wan, Wei Liu, Chenxi Wang, Zirui Song, Zixiang Xu, Yanbo Wang, Veselin Stoyanov, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15524
Pdf URL: https://arxiv.org/pdf/2505.15524
Copy Paste: [[2505.15524]] Evaluate Bias without Manual Test Sets: A Concept Representation Perspective for LLMs(https://arxiv.org/abs/2505.15524)
Keywords: language model, llm
Abstract: Bias in Large Language Models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring the variation in representational similarity between the target concept and each of the reference concepts. Even without labeled data, BiasLens shows strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect using existing methods. For example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.
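Summary (translated): Bias in large language models (LLMs) significantly undermines their reliability and fairness. We focus on a common form of bias: when two reference concepts in the model's concept space, such as sentiment polarities (e.g., "positive" and "negative"), are asymmetrically correlated with a third, target concept, such as a reviewing aspect, the model exhibits unintended bias. For instance, the understanding of "food" should not skew toward any particular sentiment. Existing bias evaluation methods assess behavioral differences of LLMs by constructing labeled data for different social groups and measuring model responses across them, a process that requires substantial human effort and captures only a limited set of social concepts. To overcome these limitations, we propose BiasLens, a test-set-free bias analysis framework based on the structure of the model's vector space. BiasLens combines Concept Activation Vectors (CAVs) with Sparse Autoencoders (SAEs) to extract interpretable concept representations, and quantifies bias by measuring how the representational similarity between the target concept and each reference concept varies. Even without labeled data, BiasLens agrees strongly with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens reveals forms of bias that are difficult to detect with existing methods; for example, in simulated clinical scenarios, a patient's insurance status can cause the LLM to produce biased diagnostic assessments. Overall, BiasLens offers a scalable, interpretable, and efficient paradigm for bias discovery, paving the way for improving fairness and transparency in LLMs.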
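The BiasLens abstract quantifies bias through the variation in representational similarity between a target concept and each reference concept. One simple way to picture this is the gap between two cosine similarities, sketched below. The exact score used by BiasLens is not given in the abstract, so the function `similarity_gap`, the choice of cosine similarity, and the 2-dimensional toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_gap(target, ref_a, ref_b):
    """Absolute difference between the target's similarity to each reference concept."""
    return abs(cosine(target, ref_a) - cosine(target, ref_b))


# Toy concept vectors (hypothetical; real ones would come from model activations).
positive = [1.0, 0.0]
negative = [0.0, 1.0]
food_neutral = [1.0, 1.0]   # equally close to both sentiment directions
food_skewed = [1.0, 0.0]    # aligned with "positive" only

gap_neutral = similarity_gap(food_neutral, positive, negative)
gap_skewed = similarity_gap(food_skewed, positive, negative)
```

Under this sketch a gap near 0 means the target concept sits symmetrically between the two references, while a gap near 1 means it is aligned with only one of them.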

Title: Social Bias in Popular Question-Answering Benchmarks

Authors: Angelie Kraft, Judith Simon, Sonja Schimmler
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.15553
Pdf URL: https://arxiv.org/pdf/2505.15553
Copy Paste: [[2505.15553]] Social Bias in Popular Question-Answering Benchmarks(https://arxiv.org/abs/2505.15553)
Keywords: language model, llm
Abstract: Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the capabilities of large language models (LLMs) in retrieving and reproducing knowledge. However, we demonstrate that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way, potentially due to a lack of diversity of those involved in their creation. We perform a qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) how social bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most analyzed benchmark papers provided insufficient information regarding the stakeholders involved in benchmark creation, particularly the annotators. Notably, just one of the benchmark papers explicitly reported measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. More transparent and bias-aware QA and RC benchmark creation practices are needed to facilitate better scrutiny and incentivize the development of fairer LLMs.
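Summary (translated): Question-answering (QA) and reading comprehension (RC) benchmarks are essential for assessing the ability of large language models (LLMs) to retrieve and reproduce knowledge. However, we show that popular QA and RC benchmarks are biased and do not cover questions about different demographics or regions in a representative way, potentially because of a lack of diversity among those involved in their creation. We perform a qualitative content analysis of 30 benchmark papers and a quantitative analysis of 20 corresponding benchmark datasets to learn (1) who is involved in benchmark creation, (2) how social bias is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most of the analyzed benchmark papers gave insufficient information about the stakeholders involved in benchmark creation, particularly the annotators. Notably, only one of the benchmark papers explicitly reported measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. More transparent and bias-aware QA and RC benchmark creation practices are needed to enable better scrutiny and to incentivize the development of fairer LLMs.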

Title: DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion

Authors: Wendi Zhou, Ameer Saadat-Yazdi, Nadin Kökciyan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15554
Pdf URL: https://arxiv.org/pdf/2505.15554
Copy Paste: [[2505.15554]] DayDreamer at CQs-Gen 2025: Generating Critical Questions through Argument Scheme Completion(https://arxiv.org/abs/2505.15554)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Critical questions are essential resources to provoke critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach leverages large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton's argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template to first obtain structured arguments, and then generate relevant critical questions. Following this, we rank all the available critical questions by prompting LLMs to select the top 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance in the final test set, showing its potential to foster critical thinking given argumentative text and detect missing or uninformed claims. Code is available at this https URL (DayDreamer).
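Summary (translated): Critical questions are essential resources for provoking critical thinking when encountering an argumentative text. We present our system for the Critical Questions Generation (CQs-Gen) Shared Task at ArgMining 2025. Our approach uses large language models (LLMs) with chain-of-thought prompting to generate critical questions guided by Walton's argumentation schemes. For each input intervention, we conversationally prompt LLMs to instantiate the corresponding argument scheme template, first obtaining structured arguments and then generating relevant critical questions. We then rank all available critical questions by prompting LLMs to select the 3 most helpful questions based on the original intervention text. This combination of structured argumentation theory and step-by-step reasoning enables the generation of contextually relevant and diverse critical questions. Our pipeline achieves competitive performance on the final test set, showing its potential to foster critical thinking about argumentative text and to detect missing or uninformed claims. Code is available at this https URL (DayDreamer).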

Title: Do RAG Systems Suffer From Positional Bias?

Authors: Florin Cuconasu, Simone Filice, Guy Horowitz, Yoelle Maarek, Fabrizio Silvestri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.15561
Pdf URL: https://arxiv.org/pdf/2505.15561
Copy Paste: [[2505.15561]] Do RAG Systems Suffer From Positional Bias?(https://arxiv.org/abs/2505.15561)
Keywords: llm, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM's capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.
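Summary (translated): Retrieval Augmented Generation improves LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias, the tendency of LLMs to weight information differently depending on its position in the prompt, affects not only the LLM's ability to capitalize on relevant passages but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show that state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks: over 60% of queries contain at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of LLM positional bias, which related work often reports as very prominent in controlled settings, is actually marginal in real scenarios, because relevant and distracting passages are both penalized in turn. Indeed, our findings show that sophisticated strategies that rearrange passages according to LLM positional preferences perform no better than random shuffling.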

Title: From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning

Authors: David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, Mrinmaya Sachan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15607
Pdf URL: https://arxiv.org/pdf/2505.15607
Copy Paste: [[2505.15607]] From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning(https://arxiv.org/abs/2505.15607)
Keywords: language model, llm
Abstract: Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.
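Summary (translated): Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy, which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that quickly adapts LLMs into effective tutors using simulated student-tutor interactions, emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B-parameter tutor model without human annotations that reaches performance similar to larger proprietary models such as LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, which lets us trace the Pareto frontier between these two objectives. Our models preserve reasoning capabilities better than single-turn SFT baselines and can optionally improve interpretability through thinking tags that expose the model's instructional planning.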
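The abstract above mentions a controllable reward weighting that trades off pedagogical support against student solving accuracy and is used to trace a Pareto frontier. The sketch below shows one way such a trade-off could look: a linear blend controlled by a weight `lam`, plus a filter that keeps only non-dominated (pedagogy, accuracy) points. The linear form, the name `lam`, and the toy scores are assumptions; the abstract does not state the actual reward formula.

```python
def combined_reward(pedagogy, accuracy, lam):
    """Blend two reward terms; lam = 0 uses pedagogy only, lam = 1 accuracy only."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * pedagogy + lam * accuracy


def pareto_front(points):
    """Keep the points that no other point matches or beats on both axes.

    Each point is a (pedagogy, accuracy) pair and higher is better on both.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (pedagogy, accuracy) scores for models trained with different lam.
scores = [(0.9, 0.4), (0.7, 0.6), (0.5, 0.5), (0.3, 0.8)]
front = pareto_front(scores)
reward = combined_reward(pedagogy=1.0, accuracy=0.0, lam=0.25)
```

Sweeping `lam` over several values and plotting the resulting score pairs is a common way to visualise such a frontier; here the point (0.5, 0.5) drops out because (0.7, 0.6) is better on both objectives.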

Title: Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Authors: Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15623
Pdf URL: https://arxiv.org/pdf/2505.15623
Copy Paste: [[2505.15623]] Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning(https://arxiv.org/abs/2505.15623)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
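Summary (translated): Large language models (LLMs) show considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely on accuracy, which accounts only for the final answer. This study explores these pitfalls using a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.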

Title: Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions

Authors: David Thulke, Jakob Kemmler, Christian Dugast, Hermann Ney
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15633
Pdf URL: https://arxiv.org/pdf/2505.15633
Copy Paste: [[2505.15633]] Listen to the Context: Towards Faithful Large Language Models for Retrieval Augmented Generation on Climate Questions(https://arxiv.org/abs/2505.15633)
Keywords: language model, gpt, hallucination, retrieval augmented generation
Abstract: Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long and technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning impact the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which achieves an improvement in faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.
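Summary (translated): Large language models that use retrieval augmented generation have the potential to unlock valuable knowledge for researchers, policymakers, and the public by making long, technical climate-related documents more accessible. While this approach can help alleviate factual hallucinations by relying on retrieved passages as additional context, its effectiveness depends on whether the model's output remains faithful to these passages. To address this, we explore the automatic assessment of the faithfulness of different models in this setting. We then focus on ClimateGPT, a large language model specialised in climate science, to examine which factors in its instruction fine-tuning affect the model's faithfulness. By excluding unfaithful subsets of the model's training data, we develop ClimateGPT Faithful+, which improves faithfulness from 30% to 57% in supported atomic claims according to our automatic metric.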

Title: Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models

Authors: Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15634
Pdf URL: https://arxiv.org/pdf/2505.15634
Copy Paste: [[2505.15634]] Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models(https://arxiv.org/abs/2505.15634)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM's internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
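Summary (translated): Large language models (LLMs) can solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Extending CoT length, as in models such as DeepSeek-R1, significantly improves reasoning on complex problems but requires costly, high-quality long CoT data and fine-tuning. Inspired by the deep thinking paradigm of DeepSeek-R1, this work uses a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM's internal states during generation. Recognizing that many LLMs have no corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm that computes steering directions directly from the residual activations of an LLM, removing the need for an explicit SAE. Experimental results show that both our SAE-based and our subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.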
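The abstract above describes computing steering directions directly from an LLM's residual activations and using them to steer internal states during generation. The sketch below illustrates a common, generic form of activation steering, a difference of mean activations between two sets of examples, which is an assumption here and not necessarily the algorithm in the paper. The function names, the scaling factor `alpha`, and the 2-dimensional toy activations are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(pos_acts, neg_acts):
    """Difference of mean activations: mean(pos_acts) - mean(neg_acts)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy residual activations (hypothetical): "pos" from outputs with the desired
# reasoning behaviour, "neg" from outputs without it.
pos_acts = [[1.0, 2.0], [3.0, 4.0]]
neg_acts = [[0.0, 1.0], [2.0, 1.0]]

direction = steering_direction(pos_acts, neg_acts)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

In a real model the shift would be applied to the residual stream at a chosen layer on every generation step; which layer and which `alpha` work well would have to be found empirically.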

Title: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Authors: Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15656
Pdf URL: https://arxiv.org/pdf/2505.15656
Copy Paste: [[2505.15656]] Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!(https://arxiv.org/abs/2505.15656)
Keywords: language model, llm
Abstract: Fine-tuning open-source Large Language Models (LLMs) with proprietary data is now standard practice for downstream developers who want task-specific LLMs. Surprisingly, we reveal a new and concerning risk that comes with this practice: the creator of an open-source LLM can later extract the private downstream fine-tuning data through simple backdoor training, requiring only black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 widely used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% of the downstream fine-tuning data (queries) out of a total of 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find that it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data breach risk in fine-tuning, and we hope that more follow-up research will advance progress on addressing it. The code and data used in our experiments are released at this https URL.
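Summary (translated): Fine-tuning open-source large language models (LLMs) with proprietary data is now standard practice for downstream developers who want task-specific LLMs. Surprisingly, we reveal a new and concerning risk that comes with this practice: the creator of an open-source LLM can later extract the private downstream fine-tuning data through simple backdoor training, needing only black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 widely used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% of the downstream fine-tuning data (queries) out of a total of 5,000 samples can be perfectly extracted, and the success rate can rise to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find that it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data breach risk, and we hope that more follow-up research will advance progress on addressing it. The code and data used in our experiments are released at this https URL.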

Title: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Authors: Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.15670
Pdf URL: https://arxiv.org/pdf/2505.15670
Copy Paste: [[2505.15670]] Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model(https://arxiv.org/abs/2505.15670)
Keywords: language model, llm, agent
Abstract: Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that does not require speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
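Summary (translated): Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain limited to turn-based exchanges and lack real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture with continuous user inputs and codec agent outputs, using channel fusion to directly model simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model that does not require speech pretraining. Separate architectures for agent and user modeling make it easier to fine-tune the codec for better agent voices and halve the bitrate (0.6 kbps) compared with previous work. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. Because speech pretraining is skipped, the model requires significantly less speech data, which markedly simplifies building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.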

Title: UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models

Authors: Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Ningyu Zhang, Kun Wang, Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15674
Pdf URL: https://arxiv.org/pdf/2505.15674
Copy Paste: [[2505.15674]] UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models(https://arxiv.org/abs/2505.15674)
Keywords: language model, llm, prompt
Abstract: Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model's intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce the specified forgetting objective. Serving as a new research direction in which token learning induces unlearning targets, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, on the TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms the previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses the previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performance in the current unlearning domain.
摘要：大型语言模型需要迭代更新，以应对知识冲突和过时的信息（例如，不正确，私人或非法内容）等挑战。 Machine Unerning为从训练有素的模型中删除有针对性的知识提供了系统的方法，从而消除了敏感信息的影响。但是，基于主流微调的学业方法通常无法平衡效力和模型能力，经常导致大量知识删除灾难性模型崩溃。同时，在不修改模型的固有机制的情况下仅依赖上下文提示的文本中文化，遭受了有限的可推广性和努力实现真实学习的努力。在这项工作中，我们介绍了Unierase，这是一种新颖的学习范式，该范式使用可学习的参数后缀（未学习令牌）将语言模型转向有针对性的遗忘行为。 Unierase通过两个关键阶段运行：（i）优化阶段，该阶段通过令牌优化将所需的未学习输出绑定到模型的自动回归概率分布，然后是（ii）轻量级模型编辑阶段，该阶段激活了学识渊博的令牌以概率地诱导指定的遗忘目标。 Unierase是代币学习诱导学习目标的新研究方向，在虚拟和现实世界的知识设置下，在批处理，顺序且精确地学习的情况下实现了最先进的（SOTA）表现。值得注意的是，就tofu Benchmark而言，Unierase仅修改了LLM参数的3.66％左右，在模型能力方面的效果更好的模型能力上，以前忘记了SOTA基线的4.01倍以上。同样，保持更多能力的Unierase也超过了先前的SOTA，以效果超过35.96％，在当前不呈现的域中显示了双层表现。

Title: The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect

Authors: Cosimo Iaia, Bhavin Choksi, Emily Wiebers, Gemma Roig, Christian J. Fiebach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15682
Pdf URL: https://arxiv.org/pdf/2505.15682
Copy Paste: [[2505.15682]] The Representational Alignment between Humans and Language Models is implicitly driven by a Concreteness Effect(https://arxiv.org/abs/2505.15682)
Keywords: language model
Abstract: The nouns of our language refer to either concrete entities (like a table) or abstract concepts (like justice or love), and cognitive psychology has established that concreteness influences how words are processed. Accordingly, understanding how concreteness is represented in our mind and brain is a central question in psychology, neuroscience, and computational linguistics. While the advent of powerful language models has allowed for quantitative inquiries into the nature of semantic representations, it remains largely underexplored how they represent concreteness. Here, we used behavioral judgments to estimate semantic distances implicitly used by humans, for a set of carefully selected abstract and concrete nouns. Using Representational Similarity Analysis, we find that the implicit representational space of participants and the semantic representations of language models are significantly aligned. We also find that both representational spaces are implicitly aligned to an explicit representation of concreteness, which was obtained from our participants using an additional concreteness rating task. Importantly, using ablation experiments, we demonstrate that the human-to-model alignment is substantially driven by concreteness, but not by other important word characteristics established in psycholinguistics. These results indicate that humans and language models converge on the concreteness dimension, but not on other dimensions.
摘要：我们语言的名词是指混凝土实体（例如表）或抽象概念（例如正义或爱情），而认知心理学已经确定具体性会影响单词的处理方式。因此，了解我们的思想和大脑中的具体性是心理学，神经科学和计算语言学的核心问题。尽管强大的语言模型的出现允许对语义表示本质进行定量查询，但它在很大程度上仍然没有被忽视的方式。在这里，我们使用行为判断来估计人类隐式使用的语义距离，用于一组精心选择的抽象和具体名词。使用代表性相似性分析，我们发现参与者的隐式表示空间和语言模型的语义表示显着对齐。我们还发现，两个表示空间都与具体性的明确表示，这两个代表空间都是使用额外的具体评级任务从我们的参与者那里获得的。重要的是，使用消融实验，我们证明了人与模型的一致性基本上是由具体性驱动的，但不是由心理语言学中建立的其他重要单词特征驱动的。这些结果表明，人类和语言模型会在具体维度上汇合，而不是其他维度。
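
The Representational Similarity Analysis step named in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes each representational space is summarized by the list of pairwise distances between the same items (the upper triangle of a dissimilarity matrix) and compares the two lists with a Spearman rank correlation. The function names and the toy vectors are hypothetical, and in the paper the human distances come from behavioral judgments rather than from vectors.

```python
import math
from itertools import combinations

def pairwise_distances(vectors):
    """Upper triangle of the dissimilarity matrix, as a flat list."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rsa_score(distances_a, distances_b):
    """Spearman correlation between two flattened dissimilarity matrices."""
    return pearson(ranks(distances_a), ranks(distances_b))

# Toy data (hypothetical): four items in a "model" space, and a second
# space that is a uniformly scaled copy, so the distance ranks agree.
model_space = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
other_space = [[2.0 * v for v in vec] for vec in model_space]
alignment = rsa_score(pairwise_distances(model_space),
                      pairwise_distances(other_space))
```

Because only the rank order of the distances matters, two spaces can be aligned under this measure even when their absolute scales differ.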

Title: A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability

Authors: Zishuai Zhang, Hainan Zhang, Jiaying Zheng, Ziwei Wang, Yongxin Tong, Jin Dong, Zhiming Zheng
Subjects: cs.CL, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2505.15683
Pdf URL: https://arxiv.org/pdf/2505.15683
Copy Paste: [[2505.15683]] A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability(https://arxiv.org/abs/2505.15683)
Keywords: llm
Abstract: Private data is typically larger and of higher quality than public data, offering great potential to improve LLMs. However, its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based split learning model has emerged, offloading most model parameters to the server while retaining only the embedding and output layers on clients to ensure privacy. However, it still faces significant challenges in security, efficiency, and adaptability: 1) embedding gradients are vulnerable to attacks, leading to reverse engineering of private data; 2) the autoregressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FL-LLaMA, a secure, efficient, and adaptive federated split framework based on LLaMA2. First, we place some input and output blocks on the local client and inject Gaussian noise into forward-pass hidden states, enabling secure end-to-end propagation. Second, we employ client-batch and server-hierarchical strategies to achieve parallel training, along with attention-mask compression and KV cache mechanisms to accelerate inference, reducing communication costs effectively. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements and hardware limitations. Experiments on NLU, summarization, and conversational QA tasks show that FL-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 2x training speedups and 8x inference speedups. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FL-LLaMA in security and adaptability.
摘要：私人数据通常比公共数据更大，质量更高，具有提高LLM的巨大潜力。但是，其在数据孤岛之间的分散分布和LLM的高计算需求限制了其在联合环境中的部署。为了解决这个问题，已经出现了基于变压器的拆分学习模型，将大多数模型参数卸载到服务器上，同时仅保留客户端上的嵌入和输出层以确保隐私。但是，它在安全性，效率和适应性方面仍面临重大挑战：1）嵌入梯度容易受到攻击，从而导致私人数据的反向工程； 2）LLMS的自回旋性质意味着联邦分裂学习只能依次训练和推断，从而导致高度沟通开销； 3）固定分区点缺乏对下游任务的适应性。在本文中，我们介绍了基于Llama2的安全，高效和自适应的联合拆分框架。首先，我们将一些输入和输出块放在本地客户端上，然后将高斯噪声注入前通用的隐藏状态，从而实现安全的端到端传播。其次，我们采用客户端和服务器层次结构策略来实现并行培训，以及注意力掩盖压缩和KV缓存机制来加速推理，从而有效地降低了通信成本。第三，我们允许用户根据特定的任务要求和硬件限制，动态调整输入/输出块的分区点。 NLU，摘要和对话质量检查任务的实验表明，Fl-llama保持与集中式Llama2相当的性能，并达到高达2倍的火车速度和8倍的推理加速。对隐私攻击和不同分区点的进一步分析也证明了Fl-lalama在安全性和适应性方面的有效性。

Title: ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy

Authors: Gengyang Li, Yifeng Gao, Yuming Li, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15684
Pdf URL: https://arxiv.org/pdf/2505.15684
Copy Paste: [[2505.15684]] ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy(https://arxiv.org/abs/2505.15684)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), the excessive length of reasoning tokens increases latency and KV cache memory usage, and may even truncate final answers under context limits. We propose ThinkLess, an inference-efficient framework that terminates reasoning generation early and maintains output quality without modifying the model. Attention analysis reveals that answer tokens focus minimally on earlier reasoning steps and primarily attend to the reasoning terminator token, due to information migration under causal masking. Building on this insight, ThinkLess inserts the terminator token at earlier positions to skip redundant reasoning while preserving the underlying knowledge transfer. To prevent format disruption caused by early termination, ThinkLess employs a lightweight post-regulation mechanism, relying on the model's natural instruction-following ability to produce well-structured answers. Without fine-tuning or auxiliary data, ThinkLess achieves comparable accuracy to full-length CoT decoding while greatly reducing decoding time and memory consumption.
摘要：虽然促进大语模型（LLMS）的推理（COT）提示推理，但过度的推理令牌会增加延迟和KV缓存内存的使用情况，甚至可能在上下文限制下截断最终答案。我们提出了毫无意义的推理框架，该框架可以尽早终止推理生成并保持产出质量而不修改模型。提示分析表明，由于因果掩盖下的信息迁移，答案令牌最少关注早期的推理步骤，主要关注推理终结者令牌。在这种见识的基础上，毫无思想地插入终结者令牌，以在较早的职位上跳过冗余推理，同时保留基本知识转移。为了防止通过早期终止而产生的格式爆发，毫无疑问采用了轻巧的调节机制，依靠模型的自然指导遵循能力来产生结构良好的答案。没有微调或辅助数据，毫无疑问，可以达到与全长COT解码相当的精度，同时大大减少了解码时间和记忆消耗。

Title: Can Large Language Models be Effective Online Opinion Miners?

Authors: Ryang Heo, Yongsik Seo, Junseong Lee, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15695
Pdf URL: https://arxiv.org/pdf/2505.15695
Copy Paste: [[2505.15695]] Can Large Language Models be Effective Online Opinion Miners?(https://arxiv.org/abs/2505.15695)
Keywords: language model, llm
Abstract: The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such content poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each piece of content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.
摘要：用户生成的在线内容的激增为客户的喜好和市场趋势提供了丰富的见解。但是，这种内容的高度多样性，复杂和上下文的性质对传统意见采矿方法构成了重大挑战。为了解决这个问题，我们介绍了在线意见挖掘基准（OOMB），这是一种新颖的数据集和评估协议，旨在评估大型语言模型（LLMS）能够从各种和复杂的在线环境中有效地挖掘观点。 OOMB提供了元组注释（实体，功能，意见）注释和以意见为中心的全面摘要，该摘要突出了每个内容中的关键意见主题，从而可以评估模型的提取和抽象能力。通过我们提出的基准测试，我们对哪些方面仍然具有挑战性以及LLM具有适应性的何处进行全面分析，以探索他们是否可以在现实的在线场景中有效地充当矿工。这项研究奠定了基于LLM的意见采矿的基础，并讨论了该领域未来研究的方向。

Title: LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing

Authors: Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15702
Pdf URL: https://arxiv.org/pdf/2505.15702
Copy Paste: [[2505.15702]] LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing(https://arxiv.org/abs/2505.15702)
Keywords: language model
Abstract: Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model sequential editing as a constrained stochastic programming problem. Given the challenges posed by the cumulative preservation error constraint and the gradually revealed editing tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained programming into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotically optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released on this https URL.
摘要：大型语言模型通常包含事实不正确或过时的知识，从而为精确的知识更新提供了模型编辑方法。然而，由于不足的长期知识保存机制不足，当前的主流定位在顺序编辑过程中表现出逐步的性能下降。为了解决这个问题，我们将顺序编辑建模为受约束的随机编程。鉴于累积保存错误约束和逐渐显示的编辑任务所带来的挑战，提出了\ textbf {lyaplock}。它将排队理论和Lyapunov优化整合到将长期约束的编程分解为可拖动的逐步子问题，以进行有效的解决。这是具有严格理论保证的第一个模型编辑框架，在满足长期知识保护的限制的同时，实现了渐近的最佳编辑性能。实验结果表明，我们的框架尺度将顺序编辑能力超过10,000个编辑，同时稳定一般功能，并使SOTA基准的平均编辑功效提高了11.89 \％。此外，可以利用它来增强基线方法的性能。我们的代码在此HTTPS URL上发布。

Title: Advancing LLM Safe Alignment with Safety Representation Ranking

Authors: Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, Yisen Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15710
Pdf URL: https://arxiv.org/pdf/2505.15710
Copy Paste: [[2505.15710]] Advancing LLM Safe Alignment with Safety Representation Ranking(https://arxiv.org/abs/2505.15710)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of large language models (LLMs) has demonstrated milestone success in a variety of tasks, yet their potential for generating harmful content has raised significant safety concerns. Existing safety evaluation approaches typically operate directly on textual responses, overlooking the rich information embedded in the model's internal representations. In this paper, we propose Safety Representation Ranking (SRR), a listwise ranking framework that selects safe responses using hidden states from the LLM itself. SRR encodes both instructions and candidate completions using intermediate transformer representations and ranks candidates via a lightweight similarity-based scorer. Our approach directly leverages internal model states and supervision at the list level to capture subtle safety signals. Experiments across multiple benchmarks show that SRR significantly improves robustness to adversarial prompts. Our code will be available upon publication.
摘要：大型语言模型（LLM）的快速发展已在各种任务中取得了里程碑的成功，但是它们产生有害内容的潜力引起了重大的安全问题。现有的安全评估方法通常直接在文本响应上运行，从而忽略了模型内部表示中嵌入的丰富信息。在本文中，我们提出了安全表示排名（SRR），这是一个列表等级框架，使用LLM本身中隐藏状态选择安全响应。 SRR使用中间变压器表示编码指令和候选完成，并通过基于轻量级相似性的得分手对候选人进行排名。我们的方法直接利用内部模型状态和列表级别的监督来捕获微妙的安全信号。跨多个基准测试的实验表明，SRR显着提高了对对抗提示的鲁棒性。我们的代码将在出版后提供。
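
The idea of ranking candidate completions with a similarity-based scorer over hidden representations, as described in the abstract above, can be illustrated with a small sketch. This is an assumption-heavy toy, not SRR itself: the abstract describes a scorer trained with list-level supervision, whereas the code below uses plain untrained cosine similarity, and the vectors stand in for intermediate transformer representations that would come from the LLM. All names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(instruction_vec, candidate_vecs):
    """Return candidate indices sorted from highest to lowest score."""
    scores = [cosine(instruction_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy representations (hypothetical values, not real hidden states).
instruction = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order, scores = rank_candidates(instruction, candidates)
best_index = order[0]
```

In a trained ranker the similarity would be computed in a learned space, so that a high score tracks safety rather than mere closeness to the instruction.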

Title: TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games

Authors: Yuan Yuan, Muyu He, Muhammad Adil Shahid, Jiani Huang, Ziyang Li, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15712
Pdf URL: https://arxiv.org/pdf/2505.15712
Copy Paste: [[2505.15712]] TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games(https://arxiv.org/abs/2505.15712)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of the detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, the number of reasoning steps, and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs' deductive reasoning abilities in complex, narrative-rich environments.
摘要：本文介绍了Turnaboutllm，这是一个新颖的框架和数据集，用于评估大语言模型（LLMS）的演绎推理能力（LLMS），利用了侦探游戏ACE律师和Danganronpa的交互式游戏玩法。框架任务是在长期叙事环境中识别证词和证据之间的矛盾，这是由于其问题提出的较大的答案空间和各种推理类型而导致的挑战性任务。我们在数据集上评估了十二个最先进的LLMS，暗示了流行策略的限制，以增强演绎推理，例如广泛的思维和经过思考的促进链。结果还表明，上下文大小的影响，推理步骤的数量和答案空间大小对模型性能的影响。总体而言，Turnaboutllm对LLMS在复杂，叙事丰富的环境中的演绎推理能力提出了重大挑战。

Title: Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling

Authors: He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, Laizhong Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15715
Pdf URL: https://arxiv.org/pdf/2505.15715
Copy Paste: [[2505.15715]] Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling(https://arxiv.org/abs/2505.15715)
Keywords: language model, llm
Abstract: Large language models (LLMs) hold significant potential for mental health support, capable of generating empathetic responses and simulating therapeutic conversations. However, existing LLM-based approaches often lack the clinical grounding necessary for real-world psychological counseling, particularly in explicit diagnostic reasoning aligned with standards like the DSM/ICD and in incorporating diverse therapeutic modalities beyond basic empathy or single strategies. To address these critical limitations, we propose PsyLLM, the first large language model designed to systematically integrate both diagnostic and therapeutic reasoning for mental health counseling. To develop PsyLLM, we propose a novel automated data synthesis pipeline. This pipeline processes real-world mental health posts, generates multi-turn dialogue structures, and leverages LLMs guided by international diagnostic standards (e.g., DSM/ICD) and multiple therapeutic frameworks (e.g., CBT, ACT, psychodynamic) to simulate detailed clinical reasoning processes. Rigorous multi-dimensional filtering ensures the generation of high-quality, clinically aligned dialogue data. In addition, we introduce a new benchmark and evaluation protocol, assessing counseling quality across four key dimensions: comprehensiveness, professionalism, authenticity, and safety. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models on this benchmark.
摘要：大型语言模型（LLM）具有精神健康支持的巨大潜力，能够产生善解人意的反应并模拟治疗对话。但是，现有的基于LLM的方法通常缺乏现实心理咨询所必需的临床基础，特别是在明确的诊断推理与DSM/ICD等标准相一致的明确诊断推理中，并结合了基本移情或单一策略以外的各种治疗方式。为了解决这些关键局限性，我们提出了Psyllm，这是第一个旨在系统地整合诊断和治疗推理的大型语言模型，以进行心理健康咨询。为了开发Psesllm，我们提出了一种新型的自动数据合成管道。该管道处理现实世界中的心理健康职位，生成多转化的对话结构，并利用以国际诊断标准（例如DSM/ICD）和多个治疗框架（例如CBT，ACT，ACT，心理动力学）指导的LLMS来模拟详细的临床推理过程。严格的多维滤波可确保产生高质量的临床对话数据。此外，我们引入了一种新的基准和评估协议，评估了四个关键维度的咨询质量：全面，专业精神，真实性和安全性。我们的实验表明，Psyllm在此基准测试中的最先进基线模型明显优于最先进的基线模型。

Title: Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities

Authors: Xiaoyu Luo, Yiyi Chen, Johannes Bjerva, Qiongxiu Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15722
Pdf URL: https://arxiv.org/pdf/2505.15722
Copy Paste: [[2505.15722]] Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities(https://arxiv.org/abs/2505.15722)
Keywords: language model, llm
Abstract: We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation - ignoring their similarities - obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.
摘要：我们介绍了对多语言大语言模型（MLLM）中的记忆的首次全面研究，并使用跨不同模型量表，体系结构和记忆定义的模型分析了95种语言。随着MLLM越来越多地部署，了解他们的记忆行为已变得至关重要。然而，先前的工作主要集中在单语模型上，尽管培训语料库的固有性质固有的本质性质，但多语言记忆却没有被忽视。我们发现，现有的假设，即记忆与训练数据的可用性高度相关，无法完全解释MLLM中的记忆模式。我们假设孤立地对待语言 - 忽略它们的相似性 - 掩盖了记忆的真实模式。为了解决这个问题，我们提出了一种基于图的新型相关度量，该指标结合了语言相似性以分析跨语性记忆。我们的分析表明，在类似的语言中，那些培训令牌较少的人倾向于表现出更高的记忆，这种趋势只有在明确建模跨语性关系时才会出现。这些发现强调了一种语言意识到的观点在评估和减轻MLLM中的记忆漏洞方面的重要性。这也构成了经验证据，表明语言相似性既解释了MLLM中的记忆，又解释了跨语言可转移性，对多语言NLP具有广泛的影响。

Title: VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Authors: Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15727
Pdf URL: https://arxiv.org/pdf/2505.15727
Copy Paste: [[2505.15727]] VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models(https://arxiv.org/abs/2505.15727)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has accelerated the development of multi-modal models capable of vocal communication. Unlike text-based interactions, speech conveys rich and diverse information, including semantic content, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models predominantly focus on the quality of their textual responses, often overlooking critical aspects of vocal performance and lacking benchmarks with vocal-specific test instances. To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers 16 fundamental skills essential for effective vocal interaction. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech-based interaction systems. Code and evaluation instances are available at this https URL.
摘要：大型语言模型（LLM）的快速发展已加速了能够发声交流的多模式模型的发展。与基于文本的互动不同，语音传达了丰富而多样的信息，包括语义内容，声学变化，副语言提示和环境环境。但是，现有对语音交互模型的评估主要集中在其文本响应的质量上，通常忽略了声音表现的关键方面，并且缺乏具有特定于人声的测试实例的基准。为了解决这一差距，我们提出了Vocalbench，这是一个综合基准，旨在评估语音交流的能力。 Vocalbench包括9,400个跨四个关键维度精心策划的实例：语义质量，声学性能，对话能力和鲁棒性。它涵盖了有效的人声互动至关重要的16个基本技能。实验结果揭示了当前模型能力的显着差异，每个模型能力都表现出不同的优势和劣势，并提供了有价值的见解，以指导基于语音的交互系统的未来研究。该HTTPS URL可用代码和评估实例。

Title: DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

Authors: Gaurav Srivastava, Zhenyu Bi, Meng Lu, Xuan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15734
Pdf URL: https://arxiv.org/pdf/2505.15734
Copy Paste: [[2505.15734]] DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning(https://arxiv.org/abs/2505.15734)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.
摘要：大型语言模型（LLM）通过大规模数据集进行了广泛的培训，在推理方面有了显着改善。但是，仅依靠其他数据进行改进正在变得越来越不切实际，这突显了模型在没有外部监督的情况下自主增强推理的需求。在本文中，我们提出了辩论，训练，进化（DTE），这是一个新颖的无真实培训框架，它使用多代理辩论痕迹来发展单语言模型。我们还引入了一种新的促使策略反思 - 危机欺骗，以通过明确指示代理商批评和完善推理来提高辩论质量。对五个推理基准和六个开放式模型的五个推理基准进行了广泛的评估表明，我们的DTE框架可实现实质性改进，平均准确性增长率为8.92％，在具有挑战性的GSM-Plus数据集中。此外，我们观察到强大的跨域泛化，所有其他基准的平均准确性增长率为5.8％，这表明我们的方法捕获了一般的推理能力。

Title: Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention

Authors: Huanxuan Liao, Wen Hu, Yao Xu, Shizhu He, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15774
Pdf URL: https://arxiv.org/pdf/2505.15774
Copy Paste: [[2505.15774]] Beyond Hard and Soft: Hybrid Context Compression for Balancing Local and Global Information Retention(https://arxiv.org/abs/2505.15774)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) encounter significant challenges in long-sequence inference due to computational inefficiency and redundant processing, driving interest in context compression techniques. Existing methods often rely on token importance to perform hard local compression or encode context into latent representations for soft global compression. However, the uneven distribution of textual content relevance and the diversity of demands for user instructions mean these approaches frequently lead to the loss of potentially valuable information. To address this, we propose Hybrid Context Compression (HyCo$_2$) for LLMs, which integrates both global and local perspectives to guide context compression while retaining both the essential semantics and critical details for task completion. Specifically, we employ a hybrid adapter to refine global semantics with the global view, based on the observation that different adapters excel at different tasks. Then we incorporate a classification layer that assigns a retention probability to each context token based on the local view, determining whether it should be retained or discarded. To foster a balanced integration of global and local compression, we introduce auxiliary paraphrasing and completion pretraining before instruction tuning. This promotes a synergistic integration that emphasizes instruction-relevant information while preserving essential local details, ultimately balancing local and global information retention in context compression. Experiments show that our HyCo$_2$ method significantly enhances long-text reasoning while reducing token usage. It improves the performance of various LLM series by an average of 13.1% across seven knowledge-intensive QA benchmarks. Moreover, HyCo$_2$ matches the performance of uncompressed methods while reducing token consumption by 88.8%.
摘要：大型语言模型（LLMS）由于计算效率低下和冗余处理而在长期推断中遇到了重大挑战，引起了上下文压缩技术的兴趣。现有的方法通常依赖于代币的重要性来执行硬局部压缩或将上下文编码为潜在表示，以实现软整体压缩。但是，文本内容相关性的分布不均匀，用户说明的需求多样性意味着这些方法通常会导致潜在的有价值的信息丢失。为了解决这个问题，我们建议$ \ textbf {hy} $ brid $ \ textbf {co} $ ntext $ \ textbf {co} $ mpression（hyco $ _2 $）for llms集成了全局和本地视角，以指导上下文压缩，同时保留基本的详细信息和关键的详细信息，以完成任务的重要详细信息。具体而言，我们采用混合适配器以全球视图来完善全球语义，这是基于观察到不同任务的不同适配器出色的观察结果。然后，我们合并了一个分类层，该分类层根据本地视图为每个上下文令牌分配保留概率，以确定是否应保留或丢弃。为了促进全球和局部压缩的平衡整合，我们在教学调整之前介绍了辅助论术和完成预处理。这促进了协同的整合，该集成强调与教学相关的信息，同时保留基本的本地细节，最终在上下文压缩中平衡本地和全球信息保留。实验表明，我们的Hyco $ _2 $方法可以显着增强长文推理，同时减少令牌使用情况。它在七个知识密集的QA基准中，各种LLM系列的性能平均提高了13.1 \％。此外，Hyco $ _2 $与未压缩方法的性能相匹配，同时将令牌消费量减少88.8 \％。
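
The local-view step in the abstract above, where a classification layer assigns each context token a retention probability and the token is then kept or discarded, might look roughly like the sketch below. It is only an illustration under stated assumptions: the per-token logits are supplied by hand here (in the paper they would come from a trained classification layer), the sigmoid and the 0.5 threshold are choices made for this example, and the function names are hypothetical.

```python
import math

def retention_probs(logits):
    """Map per-token logits to retention probabilities with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def compress(tokens, logits, threshold=0.5):
    """Keep tokens whose retention probability reaches the threshold."""
    probs = retention_probs(logits)
    return [tok for tok, p in zip(tokens, probs) if p >= threshold]

# Toy context with hand-picked logits (hypothetical values).
tokens = ["The", "capital", "of", "France", "is", "Paris"]
logits = [-2.0, 1.5, -1.0, 2.0, -0.5, 3.0]
kept = compress(tokens, logits)
```

Raising the threshold trades more aggressive compression for a higher risk of dropping a detail the task needs.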

Title: ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning

Authors: Changtai Zhu, Siyin Wang, Ruijun Feng, Kai Song, Xipeng Qiu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.15776
Pdf URL: https://arxiv.org/pdf/2505.15776
Copy Paste: [[2505.15776]] ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.15776)
Keywords: language model
Abstract: Conversational search systems require effective handling of context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming these queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches suffer from two critical constraints: high dependency on costly external supervision from human annotations or large language models, and insufficient alignment between the rewriting model and downstream retrievers. We present ConvSearch-R1, the first self-driven framework that completely eliminates dependency on external rewrite supervision by leveraging reinforcement learning to optimize reformulation directly through retrieval signals. Our novel two-stage approach combines Self-Driven Policy Warm-Up to address the cold-start problem through retrieval-guided self-distillation, followed by Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity issue in conventional retrieval metrics. Extensive experiments on TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving over 10% improvement on the challenging TopiOCQA dataset while using smaller 3B parameter models without any external supervision.
摘要：会话搜索系统需要有效处理上下文依赖性查询，这些查询通常包含歧义，遗漏和核心。会话查询重新印象（CQR）通过将这些查询转换为适合现成的猎犬的独立形式，从而解决了这一挑战。但是，现有的CQR方法遭受了两个关键限制：对人类注释或大语言模型的昂贵外部监督的高度依赖，以及重写模型和下游猎犬之间的一致性不足。我们提出了Consearch-R1，这是第一个自动驱动的框架，它通过利用强化学习来通过检索信号直接优化重新制定，完全消除了对外部重写监督的依赖。我们的新颖的两阶段方法结合了自我驱动的政策热身，通过检索引导的自我鉴定来解决冷启动问题，随后是检索引导的强化学习以及一种专门设计的排名较高的奖励塑造机制，该机制解决了常规检索衡量指标中的稀疏问题。 TopioCQA和QRECC数据集的广泛实验表明，Consearch-R1的表现显着胜过先前的最先进方法，在没有任何外部监督的情况下使用较小的3B参数模型，在具有挑战性的TopioCQA数据集方面取得了超过10％的提高。

Title: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

Authors: Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, Xin Eric Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15778
Pdf URL: https://arxiv.org/pdf/2505.15778
Copy Paste: [[2505.15778]] Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space(https://arxiv.org/abs/2505.15778)
Keywords: llm, chain-of-thought
Abstract: Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at this https URL.
摘要：人类认知通常涉及通过抽象，流体概念进行思考，而不是严格使用离散的语言令牌。但是，当前的推理模型受到人类语言边界内的推理，处理代表语义空间中固定点的离散令牌嵌入。这种离散的限制限制了这种推理模型的表达能力和上力，通常会导致对推理路径的不完整探索，因为标准的经过三通方法（COT）方法依赖于每个步骤对一个令牌进行抽样。在这项工作中，我们引入了软思维，这是一种无训练的方法，通过在连续的概念空间中产生柔软的，抽象的概念令牌来模仿类似人类的“软”推理。这些概念令牌是由令牌嵌入的概率加权混合物创建的，该嵌入形成了连续的概念空间，可以使平稳的过渡和更丰富的表示超越传统的离散边界。从本质上讲，每个生成的概念代币都封装了相关离散令牌的多种含义，隐含地探索了各种推理路径，以有效地收敛到正确的答案。对不同数学和编码基准的经验评估始终显示出软思维的有效性和效率，将PASS@1的准确性提高了2.48点，同时将令牌用法同时降低了22.4％，高达22.4％。定性分析进一步表明，软思维输出仍然是高度易于解释和可读性的，突出了软思维的潜力，即打破基于语言的理由的固有瓶颈。代码可在此HTTPS URL上找到。
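
The abstract above describes a concept token as a probability-weighted mixture of token embeddings. A minimal sketch of that single operation follows, assuming a softmax over next-token logits supplies the weights; the three-token vocabulary, the two-dimensional embeddings, and the function names are all hypothetical and stand in for a real model's embedding table.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_token(logits, embeddings):
    """Weight each token embedding by its probability and sum them."""
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]

# Toy embedding table (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits give an equal blend of all three embeddings.
mixed = concept_token([0.0, 0.0, 0.0], embeddings)

# A sharply peaked distribution collapses to (almost) one discrete token.
peaked = concept_token([100.0, 0.0, 0.0], embeddings)
```

The peaked case shows why standard one-token-per-step decoding can be read as a special case: when nearly all probability mass sits on one token, the mixture is that token's embedding.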

Title: dKV-Cache: The Cache for Diffusion Language Models

Authors: Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15781
Pdf URL: https://arxiv.org/pdf/2505.15781
Copy Paste: [[2505.15781]] dKV-Cache: The Cache for Diffusion Language Models(https://arxiv.org/abs/2505.15781)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) have been seen as a promising competitor to autoregressive language models. However, DLMs have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache (dKV-Cache), for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants that cache keys and values step by step: (1) dKV-Cache-Decode, which provides almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference; and (2) dKV-Cache-Greedy, which caches aggressively with a reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between autoregressive models (ARs) and DLMs. We evaluate dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that a cache can also be used in DLMs, even in a training-free manner on top of current DLMs.
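Abstract (translated from Chinese): Diffusion Language Models (DLMs) have been regarded as a promising competitor to autoregressive language models, but they have long been held back by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention rule out the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants that cache keys and values step by step: (1) dKV-Cache-Decode, which provides almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs may under-use contextual information during inference; and (2) dKV-Cache-Greedy, which caches aggressively with a reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between ARs and DLMs. We evaluate dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments show that a cache can also be used in DLMs, even in a training-free manner on current DLMs.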
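
Illustrative sketch (not the authors' code): the abstract only says that key and value states are cached in a "delayed and conditioned" way, so the schedule below is an assumption. It models one plausible reading, in which a position's key/value states become reusable a fixed number of denoising steps after that position is decoded. The function name, the `decode_step` input, and the one-step default delay are hypothetical.

```python
def delayed_cache_schedule(decode_step, num_steps, delay=1):
    """Return, for each denoising step, the positions served from cache.

    decode_step[i] is the step at which position i is decoded (unmasked).
    Assumption: a position's key/value states are cached `delay` steps
    after it is decoded, rather than at the moment it is decoded.
    """
    schedule = []
    for step in range(num_steps):
        cached = [
            pos for pos, decoded_at in enumerate(decode_step)
            if step >= decoded_at + delay
        ]
        schedule.append(cached)
    return schedule

# Four positions decoded out of order, as a diffusion model might do.
schedule = delayed_cache_schedule([0, 2, 1, 3], num_steps=5)
for step, cached in enumerate(schedule):
    print(f"step {step}: cached positions {cached}")
```

Positions that are not yet in the cached list would still have their keys and values recomputed at that step; the sketch only tracks which positions are eligible for reuse.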

Title: Long-Form Information Alignment Evaluation Beyond Atomic Facts

Authors: Danna Zheng, Mirella Lapata, Jeff Z. Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15792
Pdf URL: https://arxiv.org/pdf/2505.15792
Copy Paste: [[2505.15792]] Long-Form Information Alignment Evaluation Beyond Atomic Facts(https://arxiv.org/abs/2505.15792)
Keywords: llm, hallucination
Abstract: Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at this https URL.
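Abstract (translated from Chinese): Information alignment evaluators are essential for many NLG evaluation tasks and for trustworthy LLM deployment, reducing hallucinations and strengthening user trust. Current fine-grained methods such as FactScore verify facts one at a time but ignore dependencies between facts, which leaves subtle vulnerabilities. In this work we introduce MontageLie, a challenging benchmark that builds deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We show that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a new framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by more than 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at this https URL.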
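
Illustrative sketch (not DoveScore itself): the abstract's point is that checking facts one by one can pass a narrative whose statements are all true but reordered. The toy below contrasts a per-fact check with an additional event-order check. Exact string matching, the example events, and the function names are assumptions made for the example.

```python
def facts_supported(source_events, claim_events):
    """Per-fact check: every claimed event appears somewhere in the source."""
    return all(event in source_events for event in claim_events)

def order_consistent(source_events, claim_events):
    """Order check: claimed events keep the same relative order as the source."""
    position = {event: i for i, event in enumerate(source_events)}
    indices = [position[e] for e in claim_events if e in position]
    return all(a < b for a, b in zip(indices, indices[1:]))

source = ["alarm rang", "guard left post", "vault opened"]
# A "montage": every statement is true, but the order implies a different story.
montage = ["guard left post", "alarm rang", "vault opened"]

per_fact_only = facts_supported(source, montage)
joint = facts_supported(source, montage) and order_consistent(source, montage)
```

The per-fact check accepts the montage because each statement is individually supported, while the joint check rejects it because two events are swapped.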

Title: Reverse Engineering Human Preferences with Reinforcement Learning

Authors: Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15795
Pdf URL: https://arxiv.org/pdf/2505.15795
Copy Paste: [[2505.15795]] Reverse Engineering Human Preferences with Reinforcement Learning(https://arxiv.org/abs/2505.15795)
Keywords: language model, llm
Abstract: The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework--known as LLM-as-a-judge--is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning--an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
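Abstract (translated from Chinese): The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, because LLM responses can be tuned to overfit the judge's preferences. Previous work shows that answers generated by a candidate-LLM can be edited post hoc to maximise the score a judge-LLM assigns to them. In this study we take a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models obtain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks that intervene directly on the model's response, our method is virtually undetectable. We also show that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also show that human preferences can be reverse engineered effectively by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.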

Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Authors: Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15801
Pdf URL: https://arxiv.org/pdf/2505.15801
Copy Paste: [[2505.15801]] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models(https://arxiv.org/abs/2505.15801)
Keywords: language model
Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
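Abstract (translated from Chinese): Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with a limited understanding of the accuracy of the verifiers used in RL. In this paper we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are built through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models, especially smaller-scale ones, still show considerable room for improvement on both VerifyBench and VerifyBench-Hard. In addition, we conduct a thorough and comprehensive analysis of the evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding improvements in verifier accuracy and in the reasoning capabilities of models trained via RL on reasoning tasks.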
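
Illustrative sketch (not the benchmark's code): a reference-based reward system compares a model's answer with a reference answer and emits a reward, and a benchmark of this kind scores how often that verdict matches human annotation. The rule-based verifier, its normalisation rules, and the sample records below are assumptions made for the example.

```python
from fractions import Fraction

def normalize(answer):
    """Canonicalise an answer: trim, lowercase, and parse numbers exactly."""
    text = answer.strip().lower().rstrip(".").replace(",", "")
    try:
        return Fraction(text)  # "0.5" and "1/2" both become Fraction(1, 2)
    except (ValueError, ZeroDivisionError):
        return text            # non-numeric answers are compared as text

def reference_reward(model_answer, reference):
    """Binary reward: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

def verifier_accuracy(records):
    """Fraction of records where the verifier agrees with the human label."""
    agree = sum(
        1 for answer, reference, human_says_correct in records
        if (reference_reward(answer, reference) == 1.0) == human_says_correct
    )
    return agree / len(records)

# Hypothetical annotated records: (model answer, reference, human label).
records = [
    ("1/2", "0.5", True),
    ("Paris.", "paris", True),
    ("0.3", "1/3", False),
    ("one half", "0.5", True),  # the rule-based verifier misses this one
]
accuracy = verifier_accuracy(records)
```

The last record shows the kind of disagreement such a benchmark is meant to surface: the answer is correct according to the human label, but the simple verifier cannot recognise it.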

Title: Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Authors: Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15805
Pdf URL: https://arxiv.org/pdf/2505.15805
Copy Paste: [[2505.15805]] Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering(https://arxiv.org/abs/2505.15805)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
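Abstract (translated from Chinese): As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks are still lacking. To address this, we introduce CoPriva, a new large-scale benchmark dataset that evaluates LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, the dataset includes explicit policies together with queries designed as direct attacks and as challenging indirect attacks that seek prohibited information. We evaluate 10 LLMs on the benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. The failure is particularly severe under indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis shows that although models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they show a partial ability to revise their outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.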

Title: The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Authors: Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15807
Pdf URL: https://arxiv.org/pdf/2505.15807
Copy Paste: [[2505.15807]] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation(https://arxiv.org/abs/2505.15807)
Keywords: language model, prompt
Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval augmentation. While this approach is promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.
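Abstract (translated from Chinese): Large language models can exploit in-context learning to access external knowledge beyond their training data through retrieval augmentation. Although this is promising, its inner workings remain unclear. In this work we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we use these insights to trace the sources of knowledge used during inference, paving the way towards safer and more transparent language models.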

Title: GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

Authors: Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinqlin Jia, Junxu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15810
Pdf URL: https://arxiv.org/pdf/2505.15810
Copy Paste: [[2505.15810]] GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents(https://arxiv.org/abs/2505.15810)
Keywords: chain-of-thought, agent
Abstract: Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state of the art in GUI agent grounding. The project repository is available at this https URL.
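Abstract (translated from Chinese): Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning before object grounding and thereby achieving substantial performance gains. In this paper we first run extensive analysis experiments on three key components of that training pipeline: input design, output evaluation, and policy update, each of which reveals distinct challenges that arise from blindly applying general-purpose RL without adapting it to GUI grounding tasks. Input design: current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: reward functions based on hit signals or box area let models exploit box size, leading to reward hacking and poor localization quality. Policy update: online RL tends to overfit easy examples because of biases in length and sample difficulty, leaving harder cases under-optimized. To address these issues we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation and reduces excessive reasoning during training. Second, we add a box size constraint to the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state of the art in GUI agent grounding. The project repository is available at this https URL.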
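
Illustrative sketch (not the paper's reward): the abstract states that hit-based rewards let models exploit box size and that a box size constraint mitigates this, but it does not define either function. The toy below assumes a hypothetical hit reward that pays out whenever the target's centre falls inside the predicted box, then adds an assumed area-ratio cap. Box format `(x1, y1, x2, y2)`, the `max_ratio` parameter, and all names are made up for the example.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def hit_reward(pred, target):
    """Assumed hit signal: 1.0 if the target's centre lies inside pred."""
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2
    inside = pred[0] <= cx <= pred[2] and pred[1] <= cy <= pred[3]
    return 1.0 if inside else 0.0

def size_constrained_reward(pred, target, max_ratio=2.0):
    """Hit reward that is withheld when pred is far larger than the target."""
    if area(target) == 0 or area(pred) > max_ratio * area(target):
        return 0.0
    return hit_reward(pred, target)

target = (100, 100, 140, 120)     # a small button
whole_screen = (0, 0, 1920, 1080)  # degenerate "cover everything" prediction
tight = (98, 99, 142, 121)         # a well-localised prediction

print(hit_reward(whole_screen, target))               # exploitable
print(size_constrained_reward(whole_screen, target))  # exploit blocked
print(size_constrained_reward(tight, target))         # good box still rewarded
```

Under the plain hit signal, predicting the whole screen earns full reward without localising anything; the size cap removes that shortcut while leaving a tight prediction rewarded.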

Title: Learning to Reason via Mixture-of-Thought for Logical Reasoning

Authors: Tong Zheng, Lichang Chen, Simeng Han, R. Thomas McCoy, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15817
Pdf URL: https://arxiv.org/pdf/2505.15817
Copy Paste: [[2505.15817]] Learning to Reason via Mixture-of-Thought for Logical Reasoning(https://arxiv.org/abs/2505.15817)
Keywords: llm, chain-of-thought
Abstract: Human beings naturally utilize multiple reasoning modalities to learn and solve logical problems, i.e., different representational formats such as natural language, code, and symbolic logic. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods explored modality selection or augmentation at inference time, the training process remains modality-blind, limiting synergy among modalities. To fill in this gap, we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter demonstrate that our MoT framework consistently and significantly outperforms strong LLM baselines with single-modality chain-of-thought approaches, achieving up to +11.7pp average accuracy gain. Further analyses show that our MoT framework benefits both training and inference stages; that it is particularly effective on harder logical reasoning problems; and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.
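Abstract (translated from Chinese): Human beings naturally use multiple reasoning modalities, that is, different representational formats such as natural language, code, and symbolic logic, to learn and solve logical problems. In contrast, most existing LLM-based approaches operate with a single reasoning modality during training, typically natural language. Although some methods have explored modality selection or augmentation at inference time, the training process remains modality-blind, which limits synergy among modalities. To fill this gap we propose Mixture-of-Thought (MoT), a framework that enables LLMs to reason across three complementary modalities: natural language, code, and a newly introduced symbolic modality, truth-table, which systematically enumerates logical cases and partially mitigates key failure modes in natural language reasoning. MoT adopts a two-phase design: (1) self-evolving MoT training, which jointly learns from filtered, self-generated rationales across modalities; and (2) MoT inference, which fully leverages the synergy of the three modalities to produce better predictions. Experiments on logical reasoning benchmarks including FOLIO and ProofWriter show that our MoT framework consistently and significantly outperforms strong LLM baselines that use single-modality chain-of-thought approaches, achieving an average accuracy gain of up to +11.7pp. Further analyses show that the MoT framework benefits both the training and inference stages, that it is particularly effective on harder logical reasoning problems, and that different modalities contribute complementary strengths, with truth-table reasoning helping to overcome key bottlenecks in natural language inference.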
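
Illustrative sketch (not the MoT implementation): the abstract describes the truth-table modality as systematically enumerating logical cases. The toy below does this by brute force for propositional logic, checking whether a conclusion holds in every assignment that satisfies the premises. Representing formulas as Python callables, the `entails` helper, and the rain/wet example are assumptions made for the example.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """True if the conclusion holds in every row where all premises hold."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample row
    return True

variables = ["rain", "wet"]
rain_implies_wet = lambda e: (not e["rain"]) or e["wet"]
is_raining = lambda e: e["rain"]
is_wet = lambda e: e["wet"]

# Modus ponens: rain -> wet, rain  |=  wet
valid = entails([rain_implies_wet, is_raining], is_wet, variables)

# Affirming the consequent: rain -> wet, wet  |=  rain  (a common fallacy)
fallacy = entails([rain_implies_wet, is_wet], is_raining, variables)
```

The second query fails because the row with `rain=False, wet=True` satisfies both premises but not the conclusion, which is the kind of case that explicit enumeration cannot overlook.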