2025-02-19

Title: Leveraging large language models for structured information extraction from pathology reports

Authors: Jeya Balaji Balasubramanian, Daniel Adams, Ioannis Roxanis, Amy Berrington de Gonzalez, Penny Coulson, Jonas S. Almeida, Montserrat García-Closas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12183
Pdf URL: https://arxiv.org/pdf/2502.12183
Copy Paste: [[2502.12183]] Leveraging large language models for structured information extraction from pathology reports(https://arxiv.org/abs/2502.12183)
Keywords: language model, gpt, llm, prompt
Abstract: Background: Structured information extraction from unstructured histopathology reports facilitates data accessibility for clinical research. Manual extraction by experts is time-consuming and expensive, limiting scalability. Large language models (LLMs) offer efficient automated extraction through zero-shot prompting, requiring only natural language instructions without labeled data or training. We evaluate LLMs' accuracy in extracting structured information from breast cancer histopathology reports, compared to manual extraction by a trained human annotator. Methods: We developed the Medical Report Information Extractor, a web application leveraging LLMs for automated extraction. We developed a gold standard extraction dataset to evaluate the human annotator alongside five LLMs including GPT-4o, a leading proprietary model, and the Llama 3 model family, which allows self-hosting for data privacy. Our assessment involved 111 histopathology reports from the Breast Cancer Now (BCN) Generations Study, extracting 51 pathology features specified in the study's data dictionary. Results: Evaluation against the gold standard dataset showed that both Llama 3.1 405B (94.7% accuracy) and GPT-4o (96.1%) achieved extraction accuracy comparable to the human annotator (95.4%; p = 0.146 and p = 0.106, respectively). While Llama 3.1 70B (91.6%) performed below human accuracy (p <0.001), its reduced computational requirements make it a viable option for self-hosting. Conclusion: We developed an open-source tool for structured information extraction that can be customized by non-programmers using natural language. Its modular design enables reuse for various extraction tasks, producing standardized, structured data from unstructured text reports to facilitate analytics through improved accessibility and interoperability.
摘要：背景：从非结构化组织病理学报告中提取的结构化信息促进了临床研究的数据可访问性。专家的手动提取是耗时且昂贵的，限制了可扩展性。大型语言模型（LLMS）通过零射击提示提供有效的自动提取，仅需要自然语言说明，而无需标记数据或培训。与受过训练的人类注释者手动提取相比，我们评估了LLMS从乳腺癌组织病理学报告中提取结构化信息的准确性。方法：我们开发了医学报告信息提取器，这是一种利用LLMS自动提取的Web应用程序。我们开发了一个黄金标准提取数据集，以评估人类注释者与五个LLM一起评估包括GPT-4O（领先的专有模型GPT-4O）和Llama 3模型家族，该家族允许自托管数据隐私。我们的评估涉及111个来自乳腺癌（BCN）世代研究的组织病理学报告，研究了研究词典中指定的51个病理特征。结果：针对黄金标准数据集的评估表明，Llama 3.1 405b（准确度为94.7％）和GPT-4O（96.1％）的提取精度可与人类注释相当（95.4％; P = 0.146; P = 0.146和P = 0.106）。尽管美洲驼3.1 70b（91.6％）的表现低于人类的准确性（p <0.001），但其计算要求降低使其成为自我托管的可行选择。结论：我们开发了一种用于结构化信息提取的开源工具，该工具可以由非程序员使用自然语言自定义。它的模块化设计可重用各种提取任务，从非结构化文本报告中产生标准化的结构化数据，从而通过提高可访问性和互操作性来促进分析。
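
Note: the evaluation in this abstract compares each extractor's output with a gold standard, feature by feature. A minimal sketch of such an accuracy computation follows. It assumes each report is a dict mapping feature names to values; the feature names, the exact-match rule, and the treatment of missing features are illustrative assumptions, not the study's data dictionary or scoring protocol.

```python
from typing import Dict, List

Report = Dict[str, str]


def extraction_accuracy(predicted: List[Report], gold: List[Report]) -> float:
    """Fraction of (report, feature) cells where the prediction equals the gold value."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same reports")
    correct = 0
    total = 0
    for pred, ref in zip(predicted, gold):
        for feature, ref_value in ref.items():
            total += 1
            # Assumption: a feature missing from the prediction counts as an error.
            if pred.get(feature) == ref_value:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical feature names and values, for illustration only.
gold = [
    {"tumour_grade": "2", "er_status": "positive"},
    {"tumour_grade": "3", "er_status": "negative"},
]
pred = [
    {"tumour_grade": "2", "er_status": "positive"},
    {"tumour_grade": "3", "er_status": "positive"},
]
accuracy = extraction_accuracy(pred, gold)
print(accuracy)  # 0.75 (3 of 4 cells match)
```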

Title: Large Language Models for Extrapolative Modeling of Manufacturing Processes

Authors: Kiarash Naghavi Khanghah, Anandkumar Patel, Rajiv Malhotra, Hongyi Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12185
Pdf URL: https://arxiv.org/pdf/2502.12185
Copy Paste: [[2502.12185]] Large Language Models for Extrapolative Modeling of Manufacturing Processes(https://arxiv.org/abs/2502.12185)
Keywords: language model, llm
Abstract: Conventional predictive modeling of parametric relationships in manufacturing processes is limited by the subjectivity of human expertise and intuition on the one hand and by the cost and time of experimental data generation on the other hand. This work addresses this issue by establishing a new Large Language Model (LLM) framework. The novelty lies in combining automatic extraction of process-relevant knowledge embedded in the literature with iterative model refinement based on a small amount of experimental data. This approach is evaluated on three distinct manufacturing processes that are based on machining, deformation, and additive principles. The results show that for the same small experimental data budget the models derived by our framework have unexpectedly high extrapolative performance, often surpassing the capabilities of conventional Machine Learning. Further, our approach eliminates manual generation of initial models or expertise-dependent interpretation of the literature. The results also reveal the importance of the nature of the knowledge extracted from the literature and the significance of both the knowledge extraction and model refinement components.
摘要：制造过程中参数关系的常规预测建模受到人类专业知识和直觉的主观性的限制，另一方面是实验数据生成的成本和时间。这项工作通过建立新的大型语言模型（LLM）框架来解决此问题。新颖性在于，基于少量的实验数据，将嵌入文献中嵌入的过程相关知识的自动提取与迭代模型改进结合在一起。对基于加工，变形和添加剂原理的三个不同的制造过程进行评估。结果表明，对于相同的小实验数据预算，我们的框架所产生的模型具有出乎意料的外反性性能，通常超过了传统机器学习的能力。此外，我们的方法消除了初始模型的手动生成或对文献的专业知识的解释。结果还揭示了从文献中提取的知识性质的重要性以及知识提取和模型改进成分的重要性。

Title: Hallucinations are inevitable but statistically negligible

Authors: Atsushi Suzuki, Yulan He, Feng Tian, Zhongyuan Wang
Subjects: cs.CL, cs.FL, cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2502.12187
Pdf URL: https://arxiv.org/pdf/2502.12187
Copy Paste: [[2502.12187]] Hallucinations are inevitable but statistically negligible(https://arxiv.org/abs/2502.12187)
Keywords: language model, hallucination
Abstract: Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, a recent study established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance in practical viewpoints has remained unclear. In contrast, we present a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.
摘要：幻觉是一种语言模型（LM）产生非事实内容的现象，对LMS的实际部署构成了重大挑战。虽然已经提出了许多经验方法来减轻幻觉，但最近的一项研究确立了一个可计算性的理论结果，表明任何LM都将不可避免地会在一套无限的输入上产生幻觉，而不管培训数据的质量和数量如何模型架构，培训和推理算法。尽管可计算性理论结果似乎似乎是悲观的，但在实际观点中的重要性尚不清楚。相反，我们从概率的角度提出了积极的理论结果。具体而言，我们证明幻觉可以在统计上可以忽略不计，只要培训数据的质量和数量就足够。有趣的是，我们的积极结果与可计算性理论结果并存，这意味着，尽管无法完全消除一组无限投入的幻觉，但可以通过改善算法和培训数据来降低它们的概率。通过通过信息理论的角度评估两个看似矛盾的结果，我们认为我们的概率理论阳性结果更好地反映了实际考虑，而不是可计算性理论负面结果。

Title: AI and the Law: Evaluating ChatGPT's Performance in Legal Classification

Authors: Pawel Weichbroth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12193
Pdf URL: https://arxiv.org/pdf/2502.12193
Copy Paste: [[2502.12193]] AI and the Law: Evaluating ChatGPT's Performance in Legal Classification(https://arxiv.org/abs/2502.12193)
Keywords: gpt, chat
Abstract: The use of ChatGPT to analyze and classify evidence in criminal proceedings has been a topic of ongoing discussion. However, to the best of our knowledge, this issue has not been studied in the context of the Polish language. This study addresses this research gap by evaluating the effectiveness of ChatGPT in classifying legal cases under the Polish Penal Code. The results show excellent binary classification accuracy, with all positive and negative cases correctly categorized. In addition, a qualitative evaluation confirms that the legal basis provided for each case, along with the relevant legal content, was appropriate. The results obtained suggest that ChatGPT can effectively analyze and classify evidence while applying the appropriate legal rules. In conclusion, ChatGPT has the potential to assist interested parties in the analysis of evidence and serve as a valuable legal resource for individuals with less experience or knowledge in this area.
摘要：在刑事诉讼中使用CHATGPT来分析和分类证据是一个持续讨论的话题。但是，据我们所知，这个问题尚未在波兰语言的背景下进行研究。这项研究通过评估Chatgpt在《波兰刑法》中分类法律案件中的有效性来解决这一研究差距。结果显示出极好的二元分类精度，所有正和负案例都正确分类。此外，定性评估证实，为每种情况提供的法律依据以及相关法律内容是适当的。获得的结果表明，Chatgpt可以在应用适当的法律规则的同时有效地分析和分类证据。总之，Chatgpt有可能协助感兴趣的方分析证据，并为在这一领域经验或知识较少的个人提供宝贵的法律资源。

Title: A Closer Look at System Prompt Robustness

Authors: Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12197
Pdf URL: https://arxiv.org/pdf/2502.12197
Copy Paste: [[2502.12197]] A Closer Look at System Prompt Robustness(https://arxiv.org/abs/2502.12197)
Keywords: gpt, llm, prompt, chat, agent
Abstract: System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness, and further study is warranted.
摘要：系统提示已成为指定LLM在聊天和代理设置中的行为的关键控制表面。开发人员依靠系统提示来指定重要的上下文，输出格式，个性，护栏，内容策略和安全对策，所有这些都需要模型才能坚持系统提示，尤其是在面对冲突或对抗性用户输入时。实际上，模型通常会忘记考虑相关的护栏或无法解决系统与用户之间的冲突需求。在这项工作中，我们根据从OpenAI的GPT商店和Huggingface的HuggingChat中收集的提示来创建现实的新评估和微调数据集，从而研究了改善系统的各种方法。我们的实验通过一组新的和现有的基准评估模型表明，通过现实的微调数据以及推理时间干预措施（例如无分类器指导）可以大大提高性能。最后，我们分析了OpenAI和DeepSeek最近发布的推理模型的结果，这些模型对我们研究的基准显示了令人兴奋但不均匀的改进。总体而言，当前技术无法确保系统迅速鲁棒性和进一步的研究。

Title: Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product

Authors: Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, Xingwei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12200
Pdf URL: https://arxiv.org/pdf/2502.12200
Copy Paste: [[2502.12200]] Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product(https://arxiv.org/abs/2502.12200)
Keywords: language model, prompt
Abstract: Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompts are needed to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving high efficiency and performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
摘要：及时调整（PT）提供了一种经济高效的替代方法，用于微调大规模的预训练语言模型（PLMS），在输入文本之前仅添加了柔和的提示令牌中的一些参数。但是，现有的PT方法面临两个重要的问题：（i）它们忽略了软及时令牌之间的内在语义关联，导致高离散性和有限的相互作用，从而降低了模型对复杂任务的理解和有效性。（ii）由于下游任务的复杂性，需要长时间的软提示来提高性能，但及时长度与内存使用和计算成本呈正相关。达到高效率和性能仍然是一个持续的挑战。为了解决这些问题，我们提出了一种新型的低参数促使调整（LAMP）方法，该方法促使分解和压缩外产品。具体而言，及时分解模块采用截短的SVD来减少训练参数，并显着降低软提示参数空间的维度。然后，它利用压缩的外产品模块来促进及时令牌之间的多次相互作用，探索其内在关联以增强知识表示。最后，LAMP使用平均合并来减少记忆使用和训练/推理时间。跨六个架构和八个数据集进行的广泛实验表明，灯在性能和效率方面优于基于PT的最先进和基于LORA的方法。

Title: BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack

Authors: Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, Baoyuan Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12202
Pdf URL: https://arxiv.org/pdf/2502.12202
Copy Paste: [[2502.12202]] BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack(https://arxiv.org/abs/2502.12202)
Keywords: language model
Abstract: Longer thought, better performance: large language models with deep reasoning capabilities, particularly o1-like models, have demonstrated remarkable performance by generating extensive thought processes during inference. This trade-off reveals a potential vulnerability: adversaries could compromise model performance by forcing immediate responses without thought processes. To this end, in this paper, we introduce a novel attack scenario targeting the long thought processes of o1-like models and propose BoT (Break CoT), which can selectively break intrinsic reasoning mechanisms through backdoor attacks. BoT constructs poisoned datasets with designed triggers and injects backdoors by either supervised fine-tuning or direct preference optimization. When triggered, the model directly generates answers without thought processes, while maintaining normal reasoning capabilities for clean inputs. Extensive experiments on open-source o1-like models, including recent DeepSeek-R1, demonstrate that BoT nearly achieves high attack success rates while maintaining clean accuracy, highlighting the critical safety risk in current models. Furthermore, the relationship between task difficulty and helpfulness reveals a potential application for good, enabling users to customize model behavior based on task complexity. Code is available at this https URL.
摘要：更长的思考，更好的表现：具有深层推理能力的大型语言模型，尤其是类似O1的模型，通过在推断期间产生广泛的思维过程表现出了出色的性能。这种权衡揭示了潜在的脆弱性：对手可以通过强迫无需思考过程的即时响应来损害模型绩效。为此，在本文中，我们介绍了一种新颖的攻击场景，以针对O1型模型的长时间思考过程并提出了Bot（Break Cot），该过程可以通过后门攻击选择性地破坏内在的推理机制。 Bot通过监督微调或直接偏好优化构建具有设计的触发器和注入后门的中毒数据集。当触发时，模型直接生成答案而无需思考过程，同时保持了正常的推理能力以进行清洁输入。包括最近的DeepSeek-R1在内的开源O1模型进行的广泛实验表明，机器人几乎取得了高攻击成功率，同时保持了清洁准确性，突出了当前模型的关键安全风险。此外，任务难度与帮助性之间的关系揭示了良好的潜在应用，使用户能够根据任务复杂性自定义模型行为。代码可在\ href {此https url} {this https url}中获得。

Title: Enhancing Frame Detection with Retrieval Augmented Generation

Authors: Papa Abdou Karim Karou Diallo, Amal Zouaq
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12210
Pdf URL: https://arxiv.org/pdf/2502.12210
Copy Paste: [[2502.12210]] Enhancing Frame Detection with Retrieval Augmented Generation(https://arxiv.org/abs/2502.12210)
Keywords: retrieval augmented generation, retrieval-augmented generation
Abstract: Recent advancements in Natural Language Processing have significantly improved the extraction of structured semantic representations from unstructured text, especially through Frame Semantic Role Labeling (FSRL). Despite this progress, the potential of Retrieval-Augmented Generation (RAG) models for frame detection remains under-explored. In this paper, we present the first RAG-based approach for frame detection called RCIF (Retrieve Candidates and Identify Frames). RCIF is also the first approach to operate without the need for explicit target span and comprises three main stages: (1) generation of frame embeddings from various representations; (2) retrieval of candidate frames given an input text; and (3) identification of the most suitable frames. We conducted extensive experiments across multiple configurations, including zero-shot, few-shot, and fine-tuning settings. Our results show that our retrieval component significantly reduces the complexity of the task by narrowing the search space, thus allowing the frame identifier to refine and complete the set of candidates. Our approach achieves state-of-the-art performance on FrameNet 1.5 and 1.7, demonstrating its robustness in scenarios where only raw text is provided. Furthermore, we leverage the structured representation obtained through this method as a proxy to enhance generalization across lexical variations in the task of translating natural language questions into SPARQL queries.
摘要：自然语言处理的最新进展显着改善了从非结构化文本中提取结构化语义表示形式，尤其是通过框架语义角色标签（FSRL）。尽管取得了这种进步，但检索型生成（RAG）模型的框架检测的潜力仍然不足。在本文中，我们介绍了第一个基于抹布的框架检测方法称为RCIF（检索候选者并识别帧）。 RCIF也是不需要明确目标跨度的第一种操作方法，并包括三个主要阶段：（1）从各种表示形式生成帧嵌入；（2）给定输入文本的候选框架的检索；（3）识别最合适的帧。我们跨多种配置进行了广泛的实验，包括零射，很少射击和微调设置。我们的结果表明，我们的检索组件通过缩小搜索空间可大大降低任务的复杂性，从而使框架标识符能够完善并完成一组候选者。我们的方法在Framenet 1.5和1.7上实现了最先进的性能，在仅提供原始文本的场景中证明了其稳健性。此外，我们利用通过此方法获得的结构化表示，作为代理来增强将自然语言问题转化为SPARQL查询的任务中跨词汇变化的概括。
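
Note: stage (2) of the pipeline in this abstract, retrieving candidate frames for an input text, can be pictured as nearest-neighbour search over frame embeddings. The sketch below is a minimal illustration under stated assumptions: cosine similarity as the ranking score, toy 3-dimensional vectors in place of learned embeddings, and a hypothetical function name `retrieve_candidates`. It does not model stage (3), the identification step, and is not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(
    text_vec: Sequence[float],
    frame_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k frame names whose embeddings are most similar to the text embedding."""
    ranked = sorted(
        frame_vecs,
        key=lambda name: cosine(text_vec, frame_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values); a real system would use a learned encoder.
frames = {
    "Commerce_buy": [1.0, 0.0, 0.0],
    "Motion": [0.0, 1.0, 0.0],
    "Commerce_sell": [0.9, 0.1, 0.0],
}
candidates = retrieve_candidates([1.0, 0.0, 0.0], frames, k=2)
print(candidates)  # ['Commerce_buy', 'Commerce_sell']
```

Narrowing the search to k candidates is what lets a later identifier choose among a handful of frames rather than the full inventory.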

Title: Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

Authors: Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12214
Pdf URL: https://arxiv.org/pdf/2502.12214
Copy Paste: [[2502.12214]] Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement(https://arxiv.org/abs/2502.12214)
Keywords: language model, llm
Abstract: Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches typically force each layer to assume multiple roles with a predetermined number of iterations, restricting efficiency and adaptability. In this work, we propose the Zero Token Transformer (ZTT), which features a head-tail decoupled parameter cycling method. We disentangle the first (head) and last (tail) layers from parameter cycling and iteratively refine only the intermediate layers. Furthermore, we introduce a Zero-Token Mechanism, an internal architectural component rather than an input token, to guide layer-specific computation. At each cycle, the model retrieves a zero token (with trainable key values) from a Zero-Token Pool, integrating it alongside regular tokens in the attention mechanism. The corresponding attention scores not only reflect each layer's computational importance but also enable dynamic early exits without sacrificing overall model accuracy. Our approach achieves superior performance under tight parameter budgets, effectively reduces computational overhead via early exits, and can be readily applied to fine-tune existing pre-trained models for enhanced efficiency and adaptability.
摘要：资源限制通常会限制大语言模型（LLMS）的参数计数，从而阻碍其性能。尽管现有方法采用参数共享来重用固定预算下的相同参数集，但这种方法通常迫使每层效果扮演多个角色，并具有预定数量的迭代次数，从而限制了效率和适应性。在这项工作中，我们提出了零令牌变压器（ZTT），该变压器（ZTT）具有尾巴解耦参数循环方法。我们将第一（头）和最后一层（尾部）层从参数循环中解散，并且仅迭代地完善中间层。此外，我们引入了一种零token机制，一种内部体系结构组件而不是输入令牌，以指导特定于层的计算。在每个循环中，该模型从零token池中检索一个零令牌（具有训练键值），将其与注意机制中的常规令牌一起集成在一起。相应的注意力分数不仅反映了每一层的计算重要性，而且还可以使动态早期退出而无需牺牲整体模型精度。我们的方法在紧张的参数预算下实现了卓越的性能，通过早期出口有效地减少了计算开销，并且可以轻松地应用于现有的预训练模型，以提高效率和适应性。

Title: InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context

Authors: Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Luckeciano C. Melo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12257
Pdf URL: https://arxiv.org/pdf/2502.12257
Copy Paste: [[2502.12257]] InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context(https://arxiv.org/abs/2502.12257)
Keywords: language model, chat, agent
Abstract: While large language models excel at following explicit instructions, they often struggle with ambiguous or incomplete user requests, defaulting to verbose, generic responses rather than seeking clarification. We introduce InfoQuest, a multi-turn chat benchmark designed to evaluate how dialogue agents handle hidden context in open-ended user requests. The benchmark presents intentionally ambiguous scenarios that require models to engage in information-seeking dialogue through clarifying questions before providing appropriate responses. Our evaluation of both open and closed-source models reveals that while proprietary models generally perform better, all current assistants struggle with effectively gathering critical information, often requiring multiple turns to infer user intent and frequently defaulting to generic responses without proper clarification. We provide a systematic methodology for generating diverse scenarios and evaluating models' information-seeking capabilities, offering insights into the current limitations of language models in handling ambiguous requests through multi-turn interactions.
摘要：尽管大型语言模型在遵循明确的说明方面表现出色，但他们通常会在用户要求的含糊或不完整的要求上挣扎，默认为冗长，通用响应而不是寻求澄清。我们介绍了InfoQuest，这是一个多转弯聊天基准测试，旨在评估对话代理在开放式用户请求中如何处理隐藏的上下文。该基准提出了有意的歧义场景，这些场景需要模型在提供适当的回答之前通过澄清问题进行信息寻求对话。我们对开放式和封闭源模型的评估表明，尽管专有模型通常表现更好，但所有当前的助手都在有效地收集关键信息的情况下努力，通常需要多个转弯来推断用户意图，并且经常默认为通用响应，而无需正确澄清。我们提供了一种系统的方法，用于生成各种场景和评估模型寻求信息的能力，从而对语言模型的当前局限性提供有关通过多转交互作用来处理模棱两可请求的当前局限性的见解。

Title: Story Grammar Semantic Matching for Literary Study

Authors: Abigail Swenor, Neil Coffee, Walter Scheirer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12276
Pdf URL: https://arxiv.org/pdf/2502.12276
Copy Paste: [[2502.12276]] Story Grammar Semantic Matching for Literary Study(https://arxiv.org/abs/2502.12276)
Keywords: language model
Abstract: In Natural Language Processing (NLP), semantic matching algorithms have traditionally relied on the feature of word co-occurrence to measure semantic similarity. While this feature approach has proven valuable in many contexts, its simplistic nature limits its analytical and explanatory power when used to understand literary texts. To address these limitations, we propose a more transparent approach that makes use of story structure and related elements. Using a BERT language model pipeline, we label prose and epic poetry with story element labels and perform semantic matching by only considering these labels as features. This new method, Story Grammar Semantic Matching, guides literary scholars to allusions and other semantic similarities across texts in a way that allows for characterizing patterns and literary technique.
摘要：在自然语言处理（NLP）中，传统上，语义匹配算法依赖于单词共存的特征来测量语义相似性。尽管这种特征方法在许多情况下已被证明具有价值，但其简单的性质限制了用来理解文学文本的分析和解释能力。为了解决这些局限性，我们提出了一种更透明的方法，该方法利用故事结构和相关元素。使用Bert语言模型管道，我们将散文和史诗诗用故事元素标签标记，并仅将这些标签视为特征来执行语义匹配。这种新方法，故事语法语义匹配，指导文学学者在文本之间引起典故和其他语义相似性，以表征模式和文学技巧。

Title: Evaluating Step-by-step Reasoning Traces: A Survey

Authors: Jinu Lee, Julia Hockenmaier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12289
Pdf URL: https://arxiv.org/pdf/2502.12289
Copy Paste: [[2502.12289]] Evaluating Step-by-step Reasoning Traces: A Survey(https://arxiv.org/abs/2502.12289)
Keywords: language model, llm
Abstract: Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) in complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, the evaluation criteria remain highly unstandardized, leading to fragmented efforts in developing metrics and meta-evaluation benchmarks. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (groundedness, validity, coherence, and utility). We then categorize metrics based on their implementations, survey which metrics are used for assessing each criterion, and explore whether evaluator models can transfer across different criteria. Finally, we identify key directions for future research.
摘要：逐步推理被广泛用于增强复杂问题中大语言模型（LLM）的推理能力。评估推理轨迹的质量对于理解和改善LLM推理至关重要。但是，评估标准仍然高度不合格，导致在制定指标和元评估基准方面的努力分散。为了解决这一差距，这项调查提供了逐步推理评估的全面概述，提出了具有四个顶级类别（接地，有效性，连贯性和实用程序）的评估标准的分类法。然后，我们根据指标的实现进行分类，调查哪些指标用于评估每个标准，并探索评估者模型是否可以跨不同标准传输。最后，我们确定了未来研究的关键方向。

Title: SMOL: Professionally translated parallel data for 115 under-represented languages

Authors: Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya, Djibrila Diane, Solo Farabado Cissé
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12301
Pdf URL: https://arxiv.org/pdf/2502.12301
Copy Paste: [[2502.12301]] SMOL: Professionally translated parallel data for 115 under-represented languages(https://arxiv.org/abs/2502.12301)
Keywords: language model, prompt
Abstract: We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which there exist no previous public resources, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph, sentence, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages.
摘要：我们开放源SMOL（最大总杠杆），这是一套培训数据，可解锁低资源语言（LRLS）的翻译。 SMOL已被翻译成115种资源不足的语言，其中包括许多以前没有公共资源的人，总计610万个翻译令牌。 SMOL包括两个子数据集，每个dataset都仔细地选择了其尺寸的最大影响：SMOL-sents，一组用于广泛的独特令牌覆盖范围的句子，以及Smol-doc，Smol-doc是文档级别的源，侧重于广泛的主题覆盖范围。他们加入已经发布的gatitos进行了段落，句子和代币级内容的三项。我们证明，使用SMOL提示或微调大型语言模型可产生强大的CHRF改进。除了翻译外，我们还为SMOL-DOC中所有文档提供了事实评级和理由，为大多数这些语言提供了第一个事实数据集。
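
Note: ChrF, the metric cited in this abstract, is a character n-gram F-score. The sketch below is a simplified version for intuition only. It assumes n-gram orders 1 to 6, beta = 2 (recall weighted more heavily than precision), whitespace removed before counting, and orders with no n-grams skipped; these choices are assumptions and may differ in detail from the tooling used in the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


identical = chrf("hello world", "hello world")
disjoint = chrf("abc", "xyz")
print(identical, disjoint)  # 1.0 0.0
```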

Title: Can Language Models Learn Typologically Implausible Languages?

Authors: Tianyang Xu, Tatsuki Kuribayashi, Yohei Oseki, Ryan Cotterell, Alex Warstadt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12317
Pdf URL: https://arxiv.org/pdf/2502.12317
Copy Paste: [[2502.12317]] Can Language Models Learn Typologically Implausible Languages?(https://arxiv.org/abs/2502.12317)
Keywords: language model
Abstract: Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. However, empirical evidence has been limited to experiments with highly simplified artificial languages, and whether these correlations arise from domain-general or language-specific biases remains a matter of debate. Language models (LMs) provide an opportunity to study artificial language learning at a large scale and with a high degree of naturalism. In this paper, we begin with an in-depth discussion of how LMs allow us to better determine the role of domain-general learning biases in language universals. We then assess learnability differences for LMs resulting from typologically plausible and implausible languages closely following the word-order universals identified by linguistic typologists. We conduct a symmetrical cross-lingual study training and testing LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages. Compared to similar work, our datasets are more naturalistic and fall closer to the boundary of plausibility. Our experiments show that these LMs are often slower to learn these subtly implausible languages, while ultimately achieving similar performance on some metrics regardless of typological plausibility. These findings lend credence to the conclusion that LMs do show some typologically-aligned learning preferences, and that the typological patterns may result from, at least to some degree, domain-general learning biases.
摘要：人类语言的语法特征表现出有趣的相关性，通常归因于人类学习偏见。但是，经验证据仅限于具有高度简化的人工语言的实验，以及这些相关性是否来自领域或特定语言的偏见仍然是辩论的问题。语言模型（LMS）为大规模研究人造语言学习提供了机会，并且具有高度的自然主义。在本文中，我们从深入的讨论开始，讨论LM如何使我们更好地确定领域学习偏见在语言普遍性中的作用。然后，我们评估遵循语言类型学家确定的单词顺序普遍性的类型合理和难以置信的语言而导致的LMS的可学习性差异。我们对一系列高度自然的但反事实的英语版本（头目）和日语（日语）语言进行对称的跨语性研究培训和测试LMS。与类似的工作相比，我们的数据集更自然，并且更接近合理性的边界。我们的实验表明，这些LM通常会慢慢学习这些微妙的语言，同时最终在某些指标上实现类似的性能，无论类型学上的合理性如何。这些发现给出了以下结论：LMS确实表现出一些类型的学习偏好，并且至少在某种程度上可能是域中总学习偏见。

Title: From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs

Authors: Kumari Nishu, Sachin Mehta, Samira Abnar, Mehrdad Farajtabar, Maxwell Horton, Mahyar Najibi, Moin Nabi, Minsik Cho, Devang Naik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12325
Pdf URL: https://arxiv.org/pdf/2502.12325
Copy Paste: [[2502.12325]] From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs(https://arxiv.org/abs/2502.12325)
Keywords: language model, llm
Abstract: Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only 10B tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only 1/9th of their fine-tuning cost.
摘要：培训大型语言模型（LLMS）针对不同的推理约束在计算上昂贵，这限制了对效率准确性权衡的控制。此外，一旦受过训练，这些模型通常会均匀地处理代币，无论其复杂性如何，都会导致静态和僵化的行为。在本文中，我们引入了训练后优化框架Dynamoe，该框架将预先训练的密度LLM适应了以最小的微调成本的代币 - 缺陷驱动的Experts模型。这种适应使模型动态，并具有灵敏度控制，以自定义效率和准确性之间的平衡。 Dynamoe具有令牌缺陷的路由器，可预测令牌的难度，并将其引导到适当的子网络或专家，使较大的专家能够处理更复杂的令牌和较小的专家来处理更简单的专家。我们的实验表明，Dynamoe可以通过单个微调步骤生成一系列现有训练有素的LLM的自适应模型变体，与基本模型的培训相比，仅利用$ 10B $代币，这是最低的成本。每个变体都在准确性和性能之间提供不同的权衡。与基线训练后优化框架相比，Flextron，尽管仅使用$ \ frac {1} {1} {9} {9} \ text {th} $，但我们的方法在下游任务中达到了相似的汇总精度。
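
Note: the routing idea in this abstract (harder tokens go to larger experts, with a sensitivity control) can be illustrated with a toy threshold router. Everything in the sketch is an assumption made for illustration: the difficulty score in [0, 1], the fixed thresholds, the additive `sensitivity` offset, and the expert names. The actual DynaMoE router is learned and is not described at this level in the abstract.

```python
from bisect import bisect_right
from typing import List, Sequence


def route(
    difficulty: float,
    thresholds: Sequence[float],
    experts: Sequence[str],
    sensitivity: float = 0.0,
) -> str:
    """Pick an expert for one token from its predicted difficulty.

    thresholds must be ascending, with len(experts) == len(thresholds) + 1.
    A positive sensitivity shifts tokens toward larger experts (assumed design).
    """
    if len(experts) != len(thresholds) + 1:
        raise ValueError("need exactly one more expert than thresholds")
    adjusted = min(1.0, max(0.0, difficulty + sensitivity))
    return experts[bisect_right(thresholds, adjusted)]


# Hypothetical configuration: three experts ordered from smallest to largest.
thresholds = [0.3, 0.7]
experts = ["small", "medium", "large"]

token_difficulties = [0.1, 0.5, 0.9]
default_routes: List[str] = [route(d, thresholds, experts) for d in token_difficulties]
accurate_routes: List[str] = [
    route(d, thresholds, experts, sensitivity=0.3) for d in token_difficulties
]
print(default_routes)   # ['small', 'medium', 'large']
print(accurate_routes)  # ['medium', 'large', 'large']
```

Raising `sensitivity` trades compute for accuracy by sending more tokens to larger experts; lowering it does the opposite.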

Title: LM Agents for Coordinating Multi-User Information Gathering

Authors: Harsh Jhamtani, Jacob Andreas, Benjamin Van Durme
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12328
Pdf URL: https://arxiv.org/pdf/2502.12328
Copy Paste: [[2502.12328]] LM Agents for Coordinating Multi-User Information Gathering(https://arxiv.org/abs/2502.12328)
Keywords: agent
Abstract: This paper introduces PeopleJoin, a benchmark for evaluating LM-mediated collaborative problem solving. Given a user request, PeopleJoin agents must identify teammates who might be able to assist, converse with these teammates to gather information, and finally compile a useful answer or summary for the original user. PeopleJoin comprises two evaluation domains: PeopleJoin-QA, focused on questions about tabular data, and PeopleJoin-DocCreation, focused on document creation tasks. The two domains are adapted from existing NLP benchmarks for database question answering and multi-document summarization; here, however, the information needed to complete these tasks is distributed across synthetic ``organizations'' of 2--20 users, simulating natural multi-user collaboration scenarios. We implemented several popular LM agent architectures, evaluating their accuracy and efficiency at completing tasks, and highlight new research questions that can be studied using PeopleJoin.

Title: ConFit v2: Improving Resume-Job Matching using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining

Authors: Xiao Yu, Ruize Xu, Chengyuan Xue, Jinzhong Zhang, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12361
Pdf URL: https://arxiv.org/pdf/2502.12361
Copy Paste: [[2502.12361]] ConFit v2: Improving Resume-Job Matching using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining(https://arxiv.org/abs/2502.12361)
Keywords: language model
Abstract: A reliable resume-job matching system helps a company recommend suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction labels in resume-job datasets are sparse. We introduce ConFit v2, an improvement over ConFit to tackle this sparsity problem. We propose two techniques to enhance the encoder's contrastive training process: augmenting job data with a hypothetical reference resume generated by a large language model; and creating high-quality hard negatives from unlabeled resume/job pairs using a novel hard-negative mining strategy. We evaluate ConFit v2 on two real-world datasets and demonstrate that it outperforms ConFit and prior methods (including BM25 and OpenAI text-embedding-003), achieving an average absolute improvement of 13.8% in recall and 17.5% in nDCG across job-ranking and resume-ranking tasks.
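Example (illustrative, not from the paper): the abstract reports gains in recall and nDCG. The sketch below shows how these two ranking metrics are commonly computed for a single query. The function names and the toy relevance list are assumptions made for illustration; the paper's exact cutoffs and evaluation protocol are not given in the abstract.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(relevances, k):
    """Fraction of all relevant items that appear in the top k."""
    total = sum(1 for r in relevances if r > 0)
    hits = sum(1 for r in relevances[:k] if r > 0)
    return hits / total if total else 0.0


# Hypothetical ranking for one job post: 1 = relevant resume, 0 = not relevant.
ranked = [0, 1, 0, 1, 0]
print(ndcg_at_k(ranked, 5), recall_at_k(ranked, 3))
```

A ranking that places every relevant item first scores an nDCG of 1.0; pushing relevant items down the list lowers the score.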

Title: Classifiers of Data Sharing Statements in Clinical Trial Records

Authors: Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12362
Pdf URL: https://arxiv.org/pdf/2502.12362
Copy Paste: [[2502.12362]] Classifiers of Data Sharing Statements in Clinical Trial Records(https://arxiv.org/abs/2502.12362)
Keywords: language model
Abstract: Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.

Title: Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation

Authors: Joy Mahapatra, Soumyajit Roy, Utpal Garain
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12372
Pdf URL: https://arxiv.org/pdf/2502.12372
Copy Paste: [[2502.12372]] Factual Inconsistency in Data-to-Text Generation Scales Exponentially with LLM Size: A Statistical Validation(https://arxiv.org/abs/2502.12372)
Keywords: language model, llm
Abstract: Monitoring factual inconsistency is essential for ensuring trustworthiness in data-to-text generation (D2T). While large language models (LLMs) have demonstrated exceptional performance across various D2T tasks, previous studies on scaling laws have primarily focused on generalization error through power law scaling to LLM size (i.e., the number of model parameters). However, no research has examined the impact of LLM size on factual inconsistency in D2T. In this paper, we investigate how factual inconsistency in D2T scales with LLM size by exploring two scaling laws: power law and exponential scaling. To rigorously evaluate and compare these scaling laws, we employ a statistical validation framework consisting of three key stages: predictive performance estimation, goodness-of-fit assessment, and comparative analysis. For a comprehensive empirical study, we analyze three popular LLM families across five D2T datasets, measuring factual inconsistency inversely using four state-of-the-art consistency metrics. Our findings, based on exhaustive empirical results and validated through our framework, reveal that, contrary to the widely assumed power law scaling, factual inconsistency in D2T follows an exponential scaling with LLM size.
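Example (illustrative, not from the paper): the abstract contrasts power-law scaling, y = a * N^b, with exponential scaling, y = a * exp(b * N). A minimal way to compare the two forms is to fit each by least squares in log space and compare the residuals. The sketch below does this on synthetic data; the function names and numbers are assumptions, and the paper's three-stage statistical validation framework is more involved than this comparison.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def log_sse(xs, log_ys):
    """Sum of squared residuals of a linear fit, measured in log space."""
    intercept, slope = fit_line(xs, log_ys)
    return sum((ly - (intercept + slope * x)) ** 2 for x, ly in zip(xs, log_ys))


def compare_scaling(sizes, values):
    """Return (power_law_sse, exponential_sse) for strictly positive data."""
    log_vals = [math.log(v) for v in values]
    power = log_sse([math.log(s) for s in sizes], log_vals)  # log y = log a + b log N
    expo = log_sse(list(sizes), log_vals)                    # log y = log a + b N
    return power, expo


# Synthetic data (not from the paper): model sizes in billions of parameters.
sizes = [0.5, 1, 3, 7, 13]
values = [0.4 * math.exp(-0.15 * s) for s in sizes]
power_sse, expo_sse = compare_scaling(sizes, values)
print(power_sse, expo_sse)
```

Because the synthetic values are generated from an exponential curve, the exponential fit leaves a near-zero residual while the power-law fit does not.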

Title: UltraGen: Extremely Fine-grained Controllable Generation via Attribute Reconstruction and Global Preference Optimization

Authors: Longfei Yun, Letian Peng, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12375
Pdf URL: https://arxiv.org/pdf/2502.12375
Copy Paste: [[2502.12375]] UltraGen: Extremely Fine-grained Controllable Generation via Attribute Reconstruction and Global Preference Optimization(https://arxiv.org/abs/2502.12375)
Keywords: llm
Abstract: Fine granularity is an essential requirement for controllable text generation, which has seen rapid growth with the ability of LLMs. However, existing methods focus mainly on a small set of attributes (e.g., 3 to 5), and their performance degrades significantly when the number of attributes increases to the next order of magnitude. To address this challenge, we propose a novel zero-shot approach for extremely fine-grained controllable generation (EFCG), built on auto-reconstruction (AR) and global preference optimization (GPO). In the AR phase, we leverage LLMs to extract soft attributes (e.g., Emphasis on simplicity and minimalism in design) from raw texts, and combine them with programmatically derived hard attributes (e.g., The text should be between 300 and 400 words) to construct large (around 45 attributes) multi-attribute requirements, which guide the fine-grained text reconstruction process under weak supervision. In the GPO phase, we apply direct preference optimization (DPO) to refine text generation under diverse attribute combinations, enabling efficient exploration of the global combination space. Additionally, we introduce an efficient attribute sampling strategy to identify and correct potentially erroneous attributes, further improving global optimization. Our framework significantly improves the constraint satisfaction rate (CSR) and text quality for EFCG by mitigating position bias and alleviating attention dilution.

Title: Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges

Authors: Bolei Ma, Yuting Li, Wei Zhou, Ziwei Gong, Yang Janet Liu, Katja Jasinskaja, Annemarie Friedrich, Julia Hirschberg, Frauke Kreuter, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12378
Pdf URL: https://arxiv.org/pdf/2502.12378
Copy Paste: [[2502.12378]] Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges(https://arxiv.org/abs/2502.12378)
Keywords: language model
Abstract: Understanding pragmatics, the use of language in context, is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatics phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.

Title: WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects

Authors: Daniel Deutsch, Eleftheria Briakou, Isaac Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12404
Pdf URL: https://arxiv.org/pdf/2502.12404
Copy Paste: [[2502.12404]] WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects(https://arxiv.org/abs/2502.12404)
Keywords: language model, llm
Abstract: As large language models (LLMs) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages and dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. The dataset covers four domains: literary, news, social, and speech. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. These results should be confirmed using a human-based evaluation, which we leave for future work.

Title: Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models

Authors: Jingyuan Yang, Bowen Yan, Rongjun Li, Ziyu Zhou, Xin Chen, Zhiyong Feng, Wei Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12411
Pdf URL: https://arxiv.org/pdf/2502.12411
Copy Paste: [[2502.12411]] Gradient Co-occurrence Analysis for Detecting Unsafe Prompts in Large Language Models(https://arxiv.org/abs/2502.12411)
Keywords: language model, llm, prompt, chat
Abstract: Unsafe prompts pose significant safety risks to large language models (LLMs). Existing methods for detecting unsafe prompts rely on data-driven fine-tuning to train guardrail models, necessitating significant data and computational resources. In contrast, recently proposed few-shot gradient-based methods require only a few safe and unsafe reference prompts. A gradient-based approach identifies unsafe prompts by analyzing consistent patterns of the gradients of safety-critical parameters in LLMs. Although effective, its restriction to directional similarity (cosine similarity) introduces ``directional bias'', limiting its capability to identify unsafe prompts. To overcome this limitation, we introduce GradCoo, a novel gradient co-occurrence analysis method that expands the scope of safety-critical parameter identification to include unsigned gradient similarity, thereby reducing the impact of ``directional bias'' and enhancing the accuracy of unsafe prompt detection. Comprehensive experiments on the widely-used benchmark datasets ToxicChat and XSTest demonstrate that our proposed method can achieve state-of-the-art (SOTA) performance compared to existing methods. Moreover, we confirm the generalizability of GradCoo in detecting unsafe prompts across a range of LLM base models with various sizes and origins.
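Example (illustrative, not from the paper): the abstract distinguishes directional (cosine) similarity from unsigned gradient similarity. The sketch below contrasts plain cosine similarity with one possible unsigned variant, the cosine of element-wise magnitudes. This variant is an assumption chosen for illustration; the abstract does not define GradCoo's co-occurrence measure, and the vectors are made-up stand-ins for gradients.

```python
import math


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def unsigned_cosine(u, v):
    """Cosine similarity of element-wise magnitudes (hypothetical unsigned variant)."""
    return cosine([abs(a) for a in u], [abs(b) for b in v])


# Hypothetical gradients: same parameters are active, but with opposite signs.
g_reference = [0.5, -1.0, 2.0]
g_prompt = [-0.5, 1.0, -2.0]
print(cosine(g_reference, g_prompt), unsigned_cosine(g_reference, g_prompt))
```

The signed measure treats the two vectors as maximally dissimilar, while the unsigned one registers that the same parameters carry large gradients in both.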

Title: Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models

Authors: Hanin Atwany, Abdul Waheed, Rita Singh, Monojit Choudhury, Bhiksha Raj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12414
Pdf URL: https://arxiv.org/pdf/2502.12414
Copy Paste: [[2502.12414]] Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models(https://arxiv.org/abs/2502.12414)
Keywords: hallucination
Abstract: Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of 20 ASR models reveals key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increase HER. (3) Distribution shift correlates strongly with HER ($\alpha = 0.91$). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
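Example (illustrative, not from the paper): word error rate is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance. It does not implement the paper's HER metric, which the abstract does not define, and the sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Invented sentences: one substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))
```

WER counts edits without regard to whether the changed words are a harmless slip or fluent fabricated content, which is one reason a separate hallucination measure can be informative.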

Title: Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models

Authors: Shuqi Liu, Han Wu, Bowei He, Xiongwei Han, Mingxuan Yuan, Linqin Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12420
Pdf URL: https://arxiv.org/pdf/2502.12420
Copy Paste: [[2502.12420]] Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models(https://arxiv.org/abs/2502.12420)
Keywords: language model
Abstract: Recent advances in large language models have led to numerous task-specialized fine-tuned variants, creating a need for efficient model merging techniques that preserve specialized capabilities while avoiding costly retraining. While existing task vector-based merging methods show promise, they typically apply uniform coefficients across all parameters, overlooking varying parameter importance both within and across tasks. We present Sens-Merging, a sensitivity-guided coefficient adjustment method that enhances existing model merging techniques by operating at both task-specific and cross-task levels. Our method analyzes parameter sensitivity within individual tasks and evaluates cross-task transferability to determine optimal merging coefficients. Extensive experiments on Mistral 7B and LLaMA2-7B/13B models demonstrate that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Notably, when combined with existing merging techniques, our method enables merged models to outperform specialized fine-tuned models, particularly in code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.

Title: Wi-Chat: Large Language Model Powered Wi-Fi Sensing

Authors: Haopeng Zhang, Yili Ren, Haohan Yuan, Jingzhe Zhang, Yitong Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12421
Pdf URL: https://arxiv.org/pdf/2502.12421
Copy Paste: [[2502.12421]] Wi-Chat: Large Language Model Powered Wi-Fi Sensing(https://arxiv.org/abs/2502.12421)
Keywords: language model, llm, prompt, chat
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, their potential to integrate physical model knowledge for real-world signal interpretation remains largely unexplored. In this work, we introduce Wi-Chat, the first LLM-powered Wi-Fi-based human activity recognition system. We demonstrate that LLMs can process raw Wi-Fi signals and infer human activities by incorporating Wi-Fi sensing principles into prompts. Our approach leverages physical model insights to guide LLMs in interpreting Channel State Information (CSI) data without traditional signal processing techniques. Through experiments on real-world Wi-Fi datasets, we show that LLMs exhibit strong reasoning capabilities, achieving zero-shot activity recognition. These findings highlight a new paradigm for Wi-Fi sensing, expanding LLM applications beyond conventional language tasks and enhancing the accessibility of wireless sensing for real-world deployments.

Title: Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

Authors: Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, Jordan Lee Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12436
Pdf URL: https://arxiv.org/pdf/2502.12436
Copy Paste: [[2502.12436]] Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL(https://arxiv.org/abs/2502.12436)
Keywords: language model, agent
Abstract: An increasingly prevalent socio-technical problem is people being taken in by offers that sound ``too good to be true'', where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. This requires extracting logical forms of proposed agreements in player communications and computing the relative rewards of the proposal using agents' value functions. Combined with text-based features, this can improve our deception detection. Our method detects human deception with high precision when compared to a Large Language Model approach that flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance of interrogating suspicious proposals.

Title: Multi-Attribute Steering of Language Models via Targeted Intervention

Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12446
Pdf URL: https://arxiv.org/pdf/2502.12446
Copy Paste: [[2502.12446]] Multi-Attribute Steering of Language Models via Targeted Intervention(https://arxiv.org/abs/2502.12446)
Keywords: language model, llm
Abstract: Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient finetuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).

Title: DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs

Authors: Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12455
Pdf URL: https://arxiv.org/pdf/2502.12455
Copy Paste: [[2502.12455]] DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs(https://arxiv.org/abs/2502.12455)
Keywords: language model, llm
Abstract: As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.

Title: An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding

Authors: Annamalai Senthilnathan, Kristjan Arumae, Mohammed Khalilia, Zhengzheng Xing, Aaron R. Colak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12458
Pdf URL: https://arxiv.org/pdf/2502.12458
Copy Paste: [[2502.12458]] An Empirical Evaluation of Encoder Architectures for Fast Real-Time Long Conversational Understanding(https://arxiv.org/abs/2502.12458)
Keywords: agent
Abstract: Analyzing long text data such as customer call transcripts is a cost-intensive and tedious task. Machine learning methods, namely Transformers, are leveraged to model agent-customer interactions. Unfortunately, Transformers adhere to fixed-length architectures and their self-attention mechanism scales quadratically with input length. Such limitations make it challenging to leverage traditional Transformers for long sequence tasks, such as conversational understanding, especially in real-time use cases. In this paper we explore and evaluate recently proposed efficient Transformer variants (e.g. Performer, Reformer) and a CNN-based architecture for real-time and near real-time long conversational understanding tasks. We show that CNN-based models are dynamic, ~2.6x faster to train, ~80% faster at inference, and ~72% more memory efficient than Transformers on average. Additionally, we evaluate the CNN model using the Long Range Arena benchmark to demonstrate competitiveness in general long document analysis.

Title: Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance

Authors: Guangxiang Zhao, Saier Hu, Xiaoqi Jian, Jinzhu Wu, Yuhan Wu, Change Jia, Lin Sun, Xiangzheng Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12459
Pdf URL: https://arxiv.org/pdf/2502.12459
Copy Paste: [[2502.12459]] Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance(https://arxiv.org/abs/2502.12459)
Keywords: language model, gpt, llm
Abstract: This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs, specifically focusing on minor perturbations in well-established benchmarks (e.g., slight changes in question format or distractor length). Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT-4 experiences a 25-point accuracy loss when question types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts. This work aligns with the ACL 2025 theme track on the Generalization of NLP models, proposing a "Generalization Stress Test" to assess performance shifts under controlled perturbations. The study calls for reevaluating benchmarks and developing more reliable evaluation methodologies to capture LLM generalization abilities better.

Title: Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs

Authors: Joon Park, Kyohei Atarashi, Koh Takeuchi, Hisashi Kashima
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12462
Pdf URL: https://arxiv.org/pdf/2502.12462
Copy Paste: [[2502.12462]] Emulating Retrieval Augmented Generation via Prompt Engineering for Enhanced Long Context Comprehension in LLMs(https://arxiv.org/abs/2502.12462)
Keywords: language model, llm, long context, prompt, retrieval augmented generation, chain-of-thought
Abstract: This paper addresses the challenge of comprehending very long contexts in Large Language Models (LLMs) by proposing a method that emulates Retrieval Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought (CoT) reasoning. While recent LLMs support over 100,000 tokens in a single prompt, simply enlarging context windows has not guaranteed robust multi-hop reasoning when key details are scattered across massive input. Our approach treats the model as both the retriever and the reasoner: it first tags relevant segments within a long passage, then employs a stepwise CoT workflow to integrate these pieces of evidence. This single-pass method thereby reduces reliance on an external retriever, yet maintains focus on crucial segments. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text. Compared to baseline (no retrieval) and naive RAG pipelines, our approach more accurately handles multi-fact questions such as object location tracking, counting, and indefinite knowledge. Furthermore, we analyze how prompt structure, including the order of question, relevant-text tags, and overall instructions, significantly affects performance. These findings underscore that optimized prompt engineering, combined with guided reasoning, can enhance LLMs' long-context comprehension and serve as a lightweight alternative to traditional retrieval pipelines.
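Example: a rough sketch of how a single-pass "tag, then reason" prompt of the kind described above might be assembled. The tag name `<relevant>`, the wording of the instructions, and the choice to state the question both before and after the passage are assumptions for illustration only; the abstract notes that ordering matters but does not give the actual prompt.

```python
def build_rag_emulation_prompt(context: str, question: str) -> str:
    """Build one prompt that asks the model to tag evidence first, then reason over it."""
    instructions = (
        "You will be given a long passage and a question.\n"
        "Step 1: Copy every sentence that is relevant to the question, "
        "wrapping each one in <relevant>...</relevant> tags.\n"
        "Step 2: Reason step by step using only the tagged sentences.\n"
        "Step 3: Give the final answer on a line starting with 'Answer:'."
    )
    # The question appears before and after the passage (a hypothetical ordering).
    return (
        f"{instructions}\n\n"
        f"Question: {question}\n\n"
        f"Passage:\n{context}\n\n"
        f"Question (repeated): {question}"
    )
```

The returned string would be sent to the model in one call, so no external retriever is involved.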

Title: SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Authors: Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12464
Pdf URL: https://arxiv.org/pdf/2502.12464
Copy Paste: [[2502.12464]] SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models(https://arxiv.org/abs/2502.12464)
Keywords: language model, llm, prompt
Abstract: Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
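Example: a minimal sketch of the routing idea, assuming the router emits a hardness score in [0, 1] and that a fixed threshold decides which guard model runs. The function names, the threshold, and the toy keyword-based guards are hypothetical stand-ins, not the authors' models.

```python
from typing import Callable, Tuple

Guard = Callable[[str], bool]  # returns True when the prompt is judged harmful


def route_guard(prompt: str,
                hardness: Callable[[str], float],
                small_guard: Guard,
                large_guard: Guard,
                threshold: float = 0.5) -> Tuple[bool, str]:
    """Use the large guard only when the router scores the prompt as hard."""
    if hardness(prompt) >= threshold:
        return large_guard(prompt), "large"
    return small_guard(prompt), "small"


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_hardness(prompt: str) -> float:
    return 0.9 if "hypothetically" in prompt.lower() else 0.1


def toy_small_guard(prompt: str) -> bool:
    return "bomb" in prompt.lower()


def toy_large_guard(prompt: str) -> bool:
    return any(word in prompt.lower() for word in ("bomb", "explosive"))
```

In this sketch the expensive model is called only for prompts the router flags, which is where the efficiency gain described in the abstract would come from.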

Title: Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking

Authors: Alireza S. Ziabari, Nona Ghazizadeh, Zhivar Sourati, Farzan Karimi-Malekabadi, Payam Piray, Morteza Dehghani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12470
Pdf URL: https://arxiv.org/pdf/2502.12470
Copy Paste: [[2502.12470]] Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking(https://arxiv.org/abs/2502.12470)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. While human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context, LLMs lack this dynamic flexibility. This rigidity can lead to brittle and unreliable performance when faced with tasks that deviate from their trained patterns. To address this, we create a dataset of 2,000 samples with valid System 1 and System 2 answers, explicitly align LLMs with these reasoning styles, and evaluate their performance across reasoning benchmarks. Our results reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks. A mechanistic analysis of model responses shows that System 1 models employ more definitive answers, whereas System 2 models demonstrate greater uncertainty. Interpolating between these extremes produces a monotonic transition in reasoning accuracy, preserving coherence. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.

Title: CoCo-CoLa: Evaluating Language Adherence in Multilingual LLMs

Authors: Elnaz Rahmati, Alireza S. Ziabari, Morteza Dehghani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12476
Pdf URL: https://arxiv.org/pdf/2502.12476
Copy Paste: [[2502.12476]] CoCo-CoLa: Evaluating Language Adherence in Multilingual LLMs(https://arxiv.org/abs/2502.12476)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) develop cross-lingual abilities despite being trained on limited parallel data. However, they often struggle to generate responses in the intended language, favoring high-resource languages such as English. In this work, we introduce CoCo-CoLa (Correct Concept - Correct Language), a novel metric to evaluate language adherence in multilingual LLMs. Using fine-tuning experiments on a closed-book QA task across seven languages, we analyze how training in one language affects others' performance. Our findings reveal that multilingual models share task knowledge across languages but exhibit biases in the selection of output language. We identify language-specific layers, showing that final layers play a crucial role in determining output language. Accordingly, we propose a partial training strategy that selectively fine-tunes key layers, improving language adherence while significantly reducing computational cost. Our method achieves comparable or superior performance to full fine-tuning, particularly for low-resource languages, offering a more efficient multilingual adaptation.

Title: Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning

Authors: Kimia Noorbakhsh, Joseph Chandler, Pantea Karimi, Mohammad Alizadeh, Hari Balakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12477
Pdf URL: https://arxiv.org/pdf/2502.12477
Copy Paste: [[2502.12477]] Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning(https://arxiv.org/abs/2502.12477)
Keywords: language model, llm, prompt
Abstract: Assessing and enhancing human learning through question-answering is vital, yet automating this process remains challenging. While large language models (LLMs) excel at summarization and query responses, their ability to generate meaningful questions for learners is underexplored. We propose Savaal, a scalable question-generation system with three objectives: (i) scalability, enabling question generation from hundreds of pages of text, (ii) depth of understanding, producing questions beyond factual recall to test conceptual reasoning, and (iii) domain-independence, automatically generating questions across diverse knowledge areas. Instead of providing an LLM with large documents as context, Savaal improves results with a three-stage processing pipeline. Our evaluation with 76 human experts on 71 papers and PhD dissertations shows that Savaal generates questions that better test depth of understanding by 6.5X for dissertations and 1.5X for papers compared to a direct-prompting LLM baseline. Notably, as document length increases, Savaal's advantages in higher question quality and lower cost become more pronounced.

Title: MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition

Authors: Yang Yang, Xunde Dong, Yupeng Qiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12478
Pdf URL: https://arxiv.org/pdf/2502.12478
Copy Paste: [[2502.12478]] MSE-Adapter: A Lightweight Plugin Endowing LLMs with the Capability to Perform Multimodal Sentiment Analysis and Emotion Recognition(https://arxiv.org/abs/2502.12478)
Keywords: language model, llm, chat
Abstract: Current Multimodal Sentiment Analysis (MSA) and Emotion Recognition in Conversations (ERC) methods based on pre-trained language models exhibit two primary limitations: 1) Once trained for MSA and ERC tasks, these pre-trained language models lose their original generalized capabilities. 2) They demand considerable computational resources. As the size of pre-trained language models continues to grow, training larger multimodal sentiment analysis models using previous approaches could result in unnecessary computational cost. In response to this challenge, we propose the Multimodal Sentiment Analysis and Emotion Recognition Adapter (MSE-Adapter), a lightweight and adaptable plugin. This plugin enables a large language model (LLM) to carry out MSA or ERC tasks with minimal computational overhead (it introduces only approximately 2.6M to 2.8M trainable parameters on top of the 6/7B models), while preserving the intrinsic capabilities of the LLM. In the MSE-Adapter, the Text-Guide-Mixer (TGM) module is introduced to establish explicit connections between non-textual and textual modalities through the Hadamard product. This allows non-textual modalities to better align with textual modalities at the feature level, promoting the generation of higher-quality pseudo tokens. Extensive experiments were conducted on four public English and Chinese datasets using consumer-grade GPUs and open-source LLMs (Qwen-1.8B, ChatGLM3-6B-base, and LLaMA2-7B) as the backbone. The results demonstrate the effectiveness of the proposed plugin. The code will be released on GitHub after a blind review.
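Example: the Hadamard product mentioned above is simply an element-wise product of two equally sized feature vectors. The sketch below shows only that operation on plain Python lists; how the TGM module projects, normalizes, or otherwise processes features around it is not stated in this listing, and the function name is hypothetical.

```python
from typing import List


def hadamard_mix(text_feat: List[float], nontext_feat: List[float]) -> List[float]:
    """Element-wise (Hadamard) product of a text feature and a non-text feature."""
    if len(text_feat) != len(nontext_feat):
        raise ValueError("feature vectors must have the same dimension")
    return [t * n for t, n in zip(text_feat, nontext_feat)]
```

Because each output dimension is the product of the matching text and non-text dimensions, the text feature acts as a per-dimension gate on the audio or visual feature, which is one way to read the "explicit connections" the abstract mentions.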

Title: The Knowledge Microscope: Features as Better Analytical Lenses than Neurons

Authors: Yuheng Chen, Pengfei Cao, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12483
Pdf URL: https://arxiv.org/pdf/2502.12483
Copy Paste: [[2502.12483]] The Knowledge Microscope: Features as Better Analytical Lenses than Neurons(https://arxiv.org/abs/2502.12483)
Keywords: language model
Abstract: Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be available.

Title: Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Authors: Isaac Lim, Shaun Khoo, Watson Chua, Goh Jiayi, Jessica Foo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12485
Pdf URL: https://arxiv.org/pdf/2502.12485
Copy Paste: [[2502.12485]] Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study(https://arxiv.org/abs/2502.12485)
Keywords: language model, llm
Abstract: To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies on primarily English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.

Title: EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

Authors: Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12486
Pdf URL: https://arxiv.org/pdf/2502.12486
Copy Paste: [[2502.12486]] EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning(https://arxiv.org/abs/2502.12486)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning, an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL) using process rewards and iterative self-play, without supervised fine-tuning (SFT) as a preliminary step. Experiments across social and physical domains demonstrate EPO's capacity for long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications.

Title: Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge

Authors: Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12501
Pdf URL: https://arxiv.org/pdf/2502.12501
Copy Paste: [[2502.12501]] Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge(https://arxiv.org/abs/2502.12501)
Keywords: llm, chain-of-thought
Abstract: LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning's inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which is insufficient to address the limitation in CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that the CoTs generated by our method are more comprehensive and of higher quality, and that evaluation accuracy improves as inference scales.

Title: Efficient OpAmp Adaptation for Zoom Attention to Golden Contexts

Authors: Haoyuan Wu, Rui Ming, Haisheng Zheng, Zhuolun He, Bei Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12502
Pdf URL: https://arxiv.org/pdf/2502.12502
Copy Paste: [[2502.12502]] Efficient OpAmp Adaptation for Zoom Attention to Golden Contexts(https://arxiv.org/abs/2502.12502)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have shown significant promise in question-answering (QA) tasks, particularly in retrieval-augmented generation (RAG) scenarios and long-context applications. However, their performance is hindered by noisy reference documents, which often distract from essential information. Despite fine-tuning efforts, Transformer-based architectures struggle to prioritize relevant content. This is evidenced by their tendency to allocate disproportionate attention to irrelevant or later-positioned documents. Recent work proposes the differential attention mechanism to address this issue, but this mechanism is limited by an unsuitable common-mode rejection ratio (CMRR) and high computational costs. Inspired by the operational amplifier (OpAmp), we propose the OpAmp adaptation to address these challenges, which we implement efficiently with adapters. By integrating the adapter into pre-trained Transformer blocks, our approach enhances focus on the golden context without costly training from scratch. Empirical evaluations on noisy-context benchmarks reveal that our Qwen2.5-OpAmp-72B model, trained with our OpAmp adaptation, surpasses the performance of state-of-the-art LLMs, including DeepSeek-V3 and GPT-4o.

Title: LegalCore: A Dataset for Legal Documents Event Coreference Resolution

Authors: Kangda Wei, Xi Shi, Jonathan Tong, Sai Ramana Reddy, Anandhavelu Natarajan, Rajiv Jain, Aparna Garimella, Ruihong Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12509
Pdf URL: https://arxiv.org/pdf/2502.12509
Copy Paste: [[2502.12509]] LegalCore: A Dataset for Legal Documents Event Coreference Resolution(https://arxiv.org/abs/2502.12509)
Keywords: language model, llm
Abstract: Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.

Title: Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review

Authors: Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12510
Pdf URL: https://arxiv.org/pdf/2502.12510
Copy Paste: [[2502.12510]] Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review(https://arxiv.org/abs/2502.12510)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process (papers, reviews, and rebuttals) across several quality aspects, including contribution, soundness, presentation, tone, and completeness. By applying targeted perturbations and examining their effects on both LLM-as-Reviewer and LLM-as-Meta-Reviewer, we investigate how aspect-based manipulations, such as omitting methodological details from papers or altering reviewer conclusions, can introduce significant biases in the review process. We identify several potential vulnerabilities: review conclusions that recommend a strong reject may significantly influence meta-reviews, negative or misleading reviews may be wrongly interpreted as thorough, and incomplete or hostile rebuttals can unexpectedly lead to higher acceptance rates. Statistical tests show that these biases persist under various Chain-of-Thought prompting strategies, highlighting the lack of robust critical evaluation in current LLMs. Our framework offers a practical methodology for diagnosing these vulnerabilities, thereby contributing to the development of more reliable and robust automated reviewing systems.

Title: Can LLMs Extract Frame-Semantic Arguments?

Authors: Jacob Devasier, Rishabh Mediratta, Chengkai Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12516
Pdf URL: https://arxiv.org/pdf/2502.12516
Copy Paste: [[2502.12516]] Can LLMs Extract Frame-Semantic Arguments?(https://arxiv.org/abs/2502.12516)
Keywords: language model, llm
Abstract: Frame-semantic parsing is a critical task in natural language understanding, yet the ability of large language models (LLMs) to extract frame-semantic arguments remains underexplored. This paper presents a comprehensive evaluation of LLMs on frame-semantic argument identification, analyzing the impact of input representation formats, model architectures, and generalization to unseen and out-of-domain samples. Our experiments, spanning models from 0.5B to 78B parameters, reveal that JSON-based representations significantly enhance performance, and while larger models generally perform better, smaller models can achieve competitive results through fine-tuning. We also introduce a novel approach to frame identification leveraging predicted frame elements, achieving state-of-the-art performance on ambiguous targets. Despite strong generalization capabilities, our analysis finds that LLMs still struggle with out-of-domain data.

Title: Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

Authors: Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12530
Pdf URL: https://arxiv.org/pdf/2502.12530
Copy Paste: [[2502.12530]] Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards(https://arxiv.org/abs/2502.12530)
Keywords: llm, agent
Abstract: As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain their policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow matching model. This model has a specially designed structure with a hidden layer merged with an LLM to harness the linguistic cues of explanations into generating appropriate rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while saving on expensive human feedback; it thus enables effective explanations and even improves the accuracy of the decisions in original tasks.

Title: How does a Language-Specific Tokenizer affect LLMs?

Authors: Jean Seo, Jaeyoon Kim, SungJoo Byun, Hyopil Shin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12560
Pdf URL: https://arxiv.org/pdf/2502.12560
Copy Paste: [[2502.12560]] How does a Language-Specific Tokenizer affect LLMs?(https://arxiv.org/abs/2502.12560)
Keywords: language model, llm
Abstract: The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
摘要：直觉上，特定语言引导者的必要性对于有效的自然语言处理至关重要，但缺乏关于其意义和潜在原因的经验分析。这项研究探讨了通过韩语的案例研究，探讨了特定语言的引导者如何影响主要用英语文本数据训练的大语模型的行为。该研究在两个主要阶段展开：（1）开发韩国特定的扩展令牌和（2）实验，将模型与基本令牌和扩展令牌进行比较，并通过各种次要的标记预测任务进行了比较。我们的深入分析表明，扩展的令牌剂会在发电过程中降低对不正确预测的信心，并减少复杂任务中的跨透明拷贝，这表明产生较少的荒谬输出的趋势。因此，扩展的令牌器在生成过程中提供了稳定性，可能会导致下游任务的更高性能。
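The two quantities named in the abstract above, cross-entropy on next-token prediction and confidence in incorrect predictions, can be illustrated with a minimal sketch. The function names and the toy numbers are assumptions made for illustration; the paper's actual evaluation code is not shown in this listing.

```python
import math

def cross_entropy(prob_of_target):
    """Mean negative log-probability the model assigned to each correct next token."""
    return -sum(math.log(p) for p in prob_of_target) / len(prob_of_target)

def mean_confidence_when_wrong(predictions):
    """Average top-1 probability over the steps where the top-1 token was wrong.

    `predictions` is a list of (top1_probability, was_correct) pairs.
    """
    wrong = [conf for conf, correct in predictions if not correct]
    return sum(wrong) / len(wrong) if wrong else 0.0

# Hypothetical probabilities assigned to the correct next token at two steps.
ce = cross_entropy([0.5, 0.25])

# Hypothetical (confidence, correct?) pairs for three generation steps.
wrong_conf = mean_confidence_when_wrong([(0.9, True), (0.6, False), (0.4, False)])

print(f"cross-entropy: {ce:.4f}, mean confidence when wrong: {wrong_conf:.2f}")
```

Lower values of both numbers would correspond to the behavior the abstract attributes to the extended tokenizer.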

Title: SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings

Authors: Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng
Subjects: cs.CL, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2502.12562
Pdf URL: https://arxiv.org/pdf/2502.12562
Copy Paste: [[2502.12562]] SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings(https://arxiv.org/abs/2502.12562)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have serious security vulnerabilities. While safety alignment using multimodal datasets consisting of text and data of additional modalities can effectively enhance MLLMs' security, it is costly to construct these datasets. Existing low-resource security alignment methods, including textual alignment, have been found to struggle with the security risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of additional modality through gradient updates to expand textual datasets. This enables multimodal safety alignment training even when only textual data is available. Extensive experiments on image, video, and audio-based MLLMs demonstrate that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds. SEA significantly improves the security of MLLMs when faced with threats from additional modalities. To assess the security risks introduced by video and audio, we also introduce a new benchmark called VA-SafetyBench. High attack success rates across multiple MLLMs validate its challenge. Our code and data will be available at this https URL.
摘要：多模式大型语言模型（MLLM）具有严重的安全性，使用由文本和其他模式的数据组成的多模式数据集的HTTP URL安全对准可以有效地增强MLLM的安全性，构造这些数据集是昂贵的。现有的低资源安全对准方法（包括文本一致性）已被发现与其他模式相比的安全风险挣扎。为了解决这个问题，我们提出了合成嵌入增强安全对齐（SEA）的嵌入，该渐变更新优化了其他模态的嵌入，以扩展文本数据集。即使只有文本数据，这也可以实现多模式的安全对准训练。基于图像，视频和基于音频的MLLM的广泛实验表明，SEA可以在24秒内合成单个RTX3090 GPU上的高质量嵌入。当面临其他方式的威胁时，SEA可显着提高MLLM的安全性。为了评估视频和音频引入的安全风险，我们还引入了一个名为VA-SafetyBench的新基准。多个MLLM的高攻击成功率验证了其挑战。我们的代码和数据将在此HTTPS URL上找到。

Title: Evaluating Language Models on Grooming Risk Estimation Using Fuzzy Theory

Authors: Geetanjali Bihani, Tatiana Ringenberg, Julia Rayz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12563
Pdf URL: https://arxiv.org/pdf/2502.12563
Copy Paste: [[2502.12563]] Evaluating Language Models on Grooming Risk Estimation Using Fuzzy Theory(https://arxiv.org/abs/2502.12563)
Keywords: language model
Abstract: Encoding implicit language presents a challenge for language models, especially in high-risk domains where maintaining high precision is important. Automated detection of online child grooming is one such critical domain, where predators manipulate victims using a combination of explicit and implicit language to convey harmful intentions. While recent studies have shown the potential of Transformer language models like SBERT for preemptive grooming detection, they primarily depend on surface-level features and approximate real victim grooming processes using vigilante and law enforcement conversations. The question of whether these features and approximations are reasonable has not been addressed thus far. In this paper, we address this gap and study whether SBERT can effectively discern varying degrees of grooming risk inherent in conversations, and evaluate its results across different participant groups. Our analysis reveals that while fine-tuning aids language models in learning to assign grooming scores, they show high variance in predictions, especially for contexts containing higher degrees of grooming risk. These errors appear in cases that 1) utilize indirect speech pathways to manipulate victims and 2) lack sexually explicit content. This finding underscores the necessity for robust modeling of indirect speech acts by language models, particularly those employed by predators.
摘要：编码隐式语言对语言模型提出了挑战，尤其是在保持高精度很重要的高风险领域。自动检测在线儿童修饰是一个关键领域，在这种领域中，掠食者使用明确和隐性语言的组合来操纵受害者，以传达有害意图。尽管最近的研究表明，诸如Sbert的Transformer语言模型的潜力用于先发制人的修饰检测，但它们主要取决于表面级特征，并使用Vigilante和执法对话近似真正的受害者修饰过程。到目前为止，尚未解决这些特征和近似值是否合理的问题。在本文中，我们解决了这一差距，并研究了Sbert是否可以有效地识别对话中固有的修饰风险的不同程度，并在不同参与者组中评估其结果。我们的分析表明，虽然微调辅助语言模型在学习分配美容分数时，它们的预测差异很大，尤其是对于包含较高修饰风险程度的上下文。这些错误出现在1）利用间接语音途径来操纵受害者的情况下； 2）缺乏性别明确的内容。这一发现强调了通过语言模型（尤其是掠食者使用的语言模型）对间接语音行为进行健壮建模的必要性。

Title: Self Iterative Label Refinement via Robust Unlabeled Learning

Authors: Hikaru Asano, Tadashi Kozuno, Yukino Baba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12565
Pdf URL: https://arxiv.org/pdf/2502.12565
Copy Paste: [[2502.12565]] Self Iterative Label Refinement via Robust Unlabeled Learning(https://arxiv.org/abs/2502.12565)
Keywords: language model, gpt, llm
Abstract: Recent advances in large language models (LLMs) have yielded impressive performance on various tasks, yet they often depend on high-quality feedback that can be costly. Self-refinement methods attempt to leverage LLMs' internal evaluation mechanisms with minimal human supervision; however, these approaches frequently suffer from inherent biases and overconfidence, especially in domains where the models lack sufficient internal knowledge, resulting in performance degradation. As an initial step toward enhancing self-refinement for broader applications, we introduce an iterative refinement pipeline that employs the Unlabeled-Unlabeled learning framework to improve LLM-generated pseudo-labels for classification tasks. By exploiting two unlabeled datasets with differing positive class ratios, our approach iteratively denoises and refines the initial pseudo-labels, thereby mitigating the adverse effects of internal biases with minimal human supervision. Evaluations on diverse datasets, including low-resource language corpora, patent classifications, and protein structure categorizations, demonstrate that our method consistently outperforms both the initial LLM's classification performance and the self-refinement approaches by cutting-edge models (e.g., GPT-4o and DeepSeek-R1).
摘要：大型语言模型（LLM）的最新进展在各种任务上都产生了令人印象深刻的表现，但它们通常取决于高质量的反馈，这可能是昂贵的。自我注册方法试图利用LLMS的内部评估机制，并以最少的人为监督；但是，这些方法经常遭受固有的偏见和过度自信，尤其是在模型缺乏足够内部知识的领域，从而导致性能降级。作为增强更广泛应用程序的自我投资的第一步，我们介绍了一条迭代的改进管道，该管道采用了未标记的未标记的学习框架来改善LLM生成的伪标记进行分类任务。通过利用两个未标记的数据集的正相比率不同，我们的方法迭代地降解并完善了初始伪标签，从而减轻了内部偏见的不利影响，并以最小的人类监督。对各种数据集的评估，包括低资源语言语料库，专利分类和蛋白质结构分类，这表明我们的方法始终优于初始LLM的分类性能和通过切割模型（例如GPT-4O和DeepSeek）和DeepSeek进行的自我翻新方法-r1）。

Title: A Cognitive Writing Perspective for Constrained Long-Form Text Generation

Authors: Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12568
Pdf URL: https://arxiv.org/pdf/2502.12568
Copy Paste: [[2502.12568]] A Cognitive Writing Perspective for Constrained Long-Form Text Generation(https://arxiv.org/abs/2502.12568)
Keywords: language model, gpt, llm, agent
Abstract: Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements: CogWriter (this https URL).
摘要：像人类一样，大型语言模型（LLM）很难产生高质量的长篇文本，这些文本遵守一条通行证的严格要求。根据认知写作理论，成功的人写作是一个复杂的认知过程，涉及迭代计划，翻译，审查和监测，这是不足为奇的。在这些认知原则的激励下，我们旨在通过Cogwriter为LLM提供类似人类的认知写作能力，Cogwriter是一个新颖的无培训框架，将LLM限制的长期文本生成转变为系统的认知写作范式。我们的框架由两个关键模块组成：（1）执行层次规划以分解任务的计划代理，以及（2）多生成代理并行执行这些计划。该系统通过持续监视和审查机制保持质量，该机制可根据指定要求评估输出并触发必要的修订。 Cogwriter在Longgenbench上展示了出色的性能，Longgenbench是复杂约束长期文本生成的基准。即使使用QWEN-2.5-14B作为骨干，Cogwriter在复杂的指令完成精度中也超过了GPT-4O，同时可靠地生成超过10,000个单词的文本。我们希望这种认知科学启发的方法为LLM写作进步提供了一个范式：\ href {this HTTPS url} {cogwriter}。

Title: A Fuzzy Evaluation of Sentence Encoders on Grooming Risk Classification

Authors: Geetanjali Bihani, Julia Rayz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12576
Pdf URL: https://arxiv.org/pdf/2502.12576
Copy Paste: [[2502.12576]] A Fuzzy Evaluation of Sentence Encoders on Grooming Risk Classification(https://arxiv.org/abs/2502.12576)
Keywords: chat
Abstract: With the advent of social media, children are becoming increasingly vulnerable to the risk of grooming in online settings. Detecting grooming instances in an online conversation poses a significant challenge as the interactions are not necessarily sexually explicit, since the predators take time to build trust and a relationship with their victim. Moreover, predators evade detection using indirect and coded language. While previous studies have fine-tuned Transformers to automatically identify grooming in chat conversations, they overlook the impact of coded and indirect language on model predictions, and how these align with human perceptions of grooming. In this paper, we address this gap and evaluate bi-encoders on the task of classifying different degrees of grooming risk in chat contexts, for three different participant groups, i.e. law enforcement officers, real victims, and decoys. Using a fuzzy-theoretic framework, we map human assessments of grooming behaviors to estimate the actual degree of grooming risk. Our analysis reveals that fine-tuned models fail to tag instances where the predator uses indirect speech pathways and coded language to evade detection. Further, we find that such instances are characterized by a higher presence of out-of-vocabulary (OOV) words in samples, causing the model to misclassify. Our findings highlight the need for more robust models to identify coded language from noisy chat inputs in grooming contexts.
摘要：随着社交媒体的出现，儿童越来越容易受到在线环境中修饰的风险。在线对话中检测修饰实例提出了重大挑战，因为互动不一定是性明确的，因为掠食者需要时间来建立信任并与受害者建立关系。此外，捕食者使用间接和编码语言逃避检测。虽然先前的研究具有微调的变压器，可以自动识别聊天对话中的修饰，但它们忽略了编码和间接语言对模型预测的影响，以及它们与人类对美容的看法如何保持一致。在本文中，我们解决了这一差距，并评估了对聊天环境中不同程度的修饰风险分类的任务，即三个不同的参与者组，即执法人员，真正的受害者和诱饵。使用模糊的理论框架，我们绘制了对美容行为的人体评估，以估计修饰风险的实际程度。我们的分析表明，微调模型无法在捕食者使用间接语音途径和编码语言逃避检测的情况下标记实例。此外，我们发现此类实例的特征是样本中较高的唱歌外（OOV）单词的存在，从而导致模型错误分类。我们的发现突出了需要更强大的模型，以在修饰上下文中从嘈杂的聊天输入中识别编码语言。

Title: LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Authors: Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12583
Pdf URL: https://arxiv.org/pdf/2502.12583
Copy Paste: [[2502.12583]] LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data(https://arxiv.org/abs/2502.12583)
Keywords: language model, llm, prompt
Abstract: Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
摘要：尽管长篇文化大语模型（LLM）的发展不断增长，但与忠实有关的问题阻碍了依赖于综合数据的以数据为中心的方法，这限制了它们在增强在长期文化推理和质疑等任务上的模型绩效方面的有效性回答（QA）。由于缺乏验证，没有归因的推理以及潜在的知识冲突而导致的错误信息通常会加剧这些挑战。我们提出了Longfaith，这是一种综合忠实的长篇小说推理指令数据集的新型管道。通过整合地面真理和基于引用的推理提示，我们消除了分心并提高了推理链的准确性，从而减轻了对昂贵验证过程的需求。我们开放两个合成的数据集，LongFaith-SFT和Longfaith-Po，它们系统地解决了忠实的多个维度，包括经过验证的推理，归因和上下文基础。对多跳的推理数据集和Longbench进行的广泛实验表明，这些数据集中微调的模型可显着提高性能。我们的消融研究突出了Longfaith管道的可伸缩性和适应性，展示了其在开发长篇小说LLM中的广泛适用性。

Title: PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

Authors: Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12594
Pdf URL: https://arxiv.org/pdf/2502.12594
Copy Paste: [[2502.12594]] PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery(https://arxiv.org/abs/2502.12594)
Keywords: language model, llm
Abstract: Model pruning is an effective approach for compressing large language models. However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some instruction data irrelevant to model capability recovery may introduce negative effects. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions where model capabilities are most severely compromised within a certain recovery data budget. Our approach first applies manifold learning and spectral clustering to group recovery data in the semantic space, revealing capability-specific instruction sets. We then adaptively allocate the data budget to different clusters based on the degrees of model capability degradation. In each cluster, we prioritize data samples where model performance has declined dramatically. To mitigate potential negative transfer, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data.
摘要：修剪模型是压缩大语言模型的有效方法。但是，此过程通常会导致模型能力的重大降解。虽然通常采用培训后技术（例如教学调音）来恢复模型性能，但现有方法通常忽略模型能力的不均匀恶化和造成高计算成本。此外，某些与模型能力恢复无关的指令数据可能会带来负面影响。要解决这些挑战，我们提出\ textbf {p} ost训练d \ textbf {a} ta \ textbf {s}选举方法\ \ textbf {e} fficient pruned pruned pruned大语言模型\ textbf \ textbf {r} {Paser}）。 Paser的目的是确定模型功能在特定恢复数据预算中最严重损害的指令。我们的方法首先将多种学习和光谱聚类应用于语义空间中的组恢复数据，从而揭示了特定于功能的指令集。然后，我们根据模型能力降解的程度将数据预算自适应分配给不同的集群。在每个群集中，我们将模型性能急剧下降的数据样本确定优先级。为了减轻潜在的负转移，我们还检测到并过滤出冲突或无关紧要的恢复数据。广泛的实验表明，PASER显着胜过常规基线，有效地恢复了修剪的LLM的一般能力，而仅利用4 \％-20 \％的原始训练后数据。

Title: Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion

Authors: Mingyang Wang, Alisa Stoll, Lukas Lange, Heike Adel, Hinrich Schütze, Jannik Strötgen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12598
Pdf URL: https://arxiv.org/pdf/2502.12598
Copy Paste: [[2502.12598]] Bring Your Own Knowledge: A Survey of Methods for LLM Knowledge Expansion(https://arxiv.org/abs/2502.12598)
Keywords: language model, llm
Abstract: Adapting large language models (LLMs) to new and diverse knowledge is essential for their lasting effectiveness in real-world applications. This survey provides an overview of state-of-the-art methods for expanding the knowledge of LLMs, focusing on integrating various knowledge types, including factual information, domain expertise, language proficiency, and user preferences. We explore techniques, such as continual learning, model editing, and retrieval-based explicit adaptation, while discussing challenges like knowledge consistency and scalability. Designed as a guide for researchers and practitioners, this survey sheds light on opportunities for advancing LLMs as adaptable and robust knowledge systems.
摘要：将大型语言模型（LLM）调整为新的和多样化的知识对于它们在现实世界应用中的持久有效性至关重要。这项调查提供了最先进的方法概述，以扩展LLM的知识，重点是整合各种知识类型，包括事实信息，领域专业知识，语言水平和用户偏好。我们探索技术，例如持续学习，模型编辑和基于检索的明确适应，同时讨论了知识一致性和可扩展性等挑战。该调查是为研究人员和从业人员的指南而设计的，阐明了将LLMS推进到适应能力和强大的知识系统的机会。

Title: COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation

Authors: Sean Wang, Yicheng Jiang, Yuxin Tang, Lu Cheng, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12601
Pdf URL: https://arxiv.org/pdf/2502.12601
Copy Paste: [[2502.12601]] COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation(https://arxiv.org/abs/2502.12601)
Keywords: language model, llm
Abstract: Uncertainty Quantification (UQ) for Natural Language Generation (NLG) is crucial for assessing the performance of Large Language Models (LLMs), as it reveals confidence in predictions, identifies failure modes, and gauges output reliability. Conformal Prediction (CP), a model-agnostic method that generates prediction sets with a specified error rate, has been adopted for UQ in classification tasks, where the size of the prediction set indicates the model's uncertainty. However, when adapting CP to NLG, the sampling-based method for generating candidate outputs cannot guarantee the inclusion of the ground truth, limiting its applicability across a wide range of error rates. To address this, we propose COPU, a method that explicitly adds the ground truth to the candidate outputs and uses logit scores to measure nonconformity. Our experiments with six LLMs on four NLG tasks show that COPU outperforms baseline methods in calibrating error rates and empirical cover rates, offering accurate UQ across a wide range of user-specified error rates.
摘要：自然语言产生（NLG）的不确定性定量（UQ）对于评估大语模型（LLMS）的性能至关重要，因为它揭示了对预测，确定故障模式和测量值的产出可靠性的信心。 Condomal预测（CP）是一种模型 - 敏捷方法，生成具有指定错误率的预测集，在分类任务中已采用指定的错误率，其中预测集的大小表示模型的不确定性。但是，当将CP适应NLG时，基于采样的候选输出方法无法保证包含地面真相，从而限制了其在较大的错误率范围内的适用性。为了解决这个问题，我们提出了\ oureMethod，该方法将地面真理显式添加到候选输出中，并使用logit评分来衡量不合格。我们在四个NLG任务上使用六个LLM的实验表明，\ ourthod在校准错误率和经验覆盖率方面优于基线方法，在广泛的用户指定错误率范围内提供了准确的UQ。
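The conformal prediction idea mentioned in the abstract above (a prediction set with a specified error rate, built from nonconformity scores) can be sketched with generic split conformal prediction. This is a minimal illustration of the standard technique, not the COPU method itself; the function names, the scores, and the candidate answers are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold for a target error rate alpha."""
    n = len(calibration_scores)
    # Finite-sample-corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: accept every candidate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_scores, threshold):
    """Keep candidates whose nonconformity score does not exceed the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]

# Hypothetical nonconformity scores of the true answers on a calibration set
# (lower means the model conforms better to the ground truth).
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.80]
q_hat = conformal_threshold(calibration, alpha=0.25)

# Hypothetical nonconformity scores for candidate outputs of a new prompt.
answers = {"Paris": 0.08, "Lyon": 0.50, "Rome": 0.91}
print(q_hat, prediction_set(answers, q_hat))
```

A larger prediction set signals higher uncertainty. The guarantee only holds if the correct output can appear among the candidates, which is the gap the abstract says sampling-based candidate generation leaves open.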

Title: Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection

Authors: Jiatao Li, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12611
Pdf URL: https://arxiv.org/pdf/2502.12611
Copy Paste: [[2502.12611]] Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection(https://arxiv.org/abs/2502.12611)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes (gender, CEFR proficiency, academic field, and language environment) impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.
摘要：大型语言模型（LLM）的兴起需要准确的AI生成的文本检测。但是，当前的方法在很大程度上忽略了作者特征的影响。我们研究了社会语言属性性别，CEFR能力，学术领域和语言环境影响的最新AI文本探测器。使用来自不同llms的人类著名文本和平行AI生成的文本的Icnale语料库，我们采用多因素ANOVA和加权最小二乘（WLS）进行了严格的评估。我们的结果表明了巨大的偏见：CEFR的水平和语言环境始终影响探测器的准确性，而性别和学术领域则显示依赖于探测器的效果。这些发现凸显了对社会意识的AI文本检测的关键需求，以免对特定的人口统计组进行不公平的惩罚。我们提供了新颖的经验证据，一个健壮的统计框架以及可行的见解，以在现实世界中，外域内环境中开发更公平和可靠的检测系统。这项工作为对缓解偏见，包容性评估基准和对社会负责的LLM探测器的未来研究铺平了道路。

Title: Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions

Authors: Leonardo Ranaldi, Marco Valentino, Alexander Polonsky, Andrè Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12616
Pdf URL: https://arxiv.org/pdf/2502.12616
Copy Paste: [[2502.12616]] Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions(https://arxiv.org/abs/2502.12616)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches possess the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e., MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).
摘要：但是，通过将复杂的任务分解为中间推理步骤，这些链（COT）（COT）代表了大语模型（LLMS）推理的共同策略。但是，通过COT产生的解释容易受到负面影响其稳健性和忠诚的内容偏见。为了减轻现有局限性，最近的工作使用逻辑形式主义与外部符号求解器相结合。但是，完全象征性的方法具有需要从自然语言到形式语言的完整翻译的瓶颈，这一过程会影响效率和灵活性。为了实现权衡，本文研究了将内容与逻辑推理相关的方法，而无需完整的形式化。特别是，我们介绍了类星体（用于准符号抽象的推理），这是COT的一种变体，它指导LLMS通过准符号符号解释以较高的抽象作用。我们的框架利用LLM的能力仅对相关变量和谓词进行形式化，从而使符号元素与自然语言共存。我们展示了Quasar对秘密学习的影响以及构建演示以提高较小模型的推理能力。我们的实验表明，准符号抽象可以通过高达8％的精度改善基于COT的方法，从而提高自然语言（即MMLU-REDUX）和符号推理任务（即GSM-Symbolic）的自然语言（即MMLU-REDUX）和象征性推理任务的鲁棒性和一致性。

Title: One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction

Authors: Ben Liu, Jihan Zhang, Fangquan Lin, Xu Jia, Min Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12633
Pdf URL: https://arxiv.org/pdf/2502.12633
Copy Paste: [[2502.12633]] One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction(https://arxiv.org/abs/2502.12633)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have been increasingly employed in various intelligent educational systems, simulating human tutors to facilitate effective human-machine interaction. However, previous studies often overlook the significance of recognizing and adapting to individual learner characteristics. Such adaptation is crucial for enhancing student engagement and learning efficiency, particularly in mathematics instruction, where diverse learning styles require personalized strategies to promote comprehension and enthusiasm. In this paper, we propose a PersonAlized Conversational tutoring agEnt (PACE) for mathematics instruction. PACE simulates students' learning styles based on the Felder and Silverman learning style model, aligning with each student's persona. In this way, PACE can effectively assess the personality of students, allowing it to develop individualized teaching strategies that resonate with their unique learning styles. To further enhance students' comprehension, PACE employs the Socratic teaching method to provide instant feedback and encourage deep thinking. By constructing personalized teaching data and training models, PACE demonstrates the ability to identify and adapt to the unique needs of each student, significantly improving the overall learning experience and outcomes. Moreover, we establish multi-aspect evaluation criteria and conduct extensive analysis to assess the performance of personalized teaching. Experimental results demonstrate the superiority of our model in personalizing the educational experience and motivating students compared to existing methods.
摘要：大型语言模型（LLM）越来越多地在各种智能教育系统中使用，模拟了人类导师以促进有效的人机相互作用。但是，以前的研究通常忽略了识别和适应个人学习者特征的重要性。这种适应对于提高学生的参与和学习效率至关重要，尤其是在数学教学中，多样化的学习方式需要个性化的策略来促进理解和热情。在本文中，我们提出了a \ textbf {p} erson \ textbf {a} lized \ textbf {c}用于数学指令的onvessitation tuterational补习ag ag \ textbf {e} nt（pace）。 PACE模拟了基于Felder和Silverman学习风格模型的学生的学习风格，并与每个学生的角色保持一致。这样，我们的速度可以有效地评估学生的个性，从而制定个性化的教学策略，以引起其独特的学习风格。为了进一步增强学生的理解，PACE采用了苏格拉底教学方法来提供即时反馈并鼓励深思熟虑。通过构建个性化的教学数据和培训模型，PACE展示了识别和适应每个学生独特需求的能力，从而大大改善了整体学习经验和成果。此外，我们建立了多种评估标准并进行广泛的分析以评估个性化教学的绩效。实验结果表明，与现有方法相比，与现有方法相比，我们模型在个性化教育经验和激励学生方面具有优势。

Title: R.R.: Unveiling LLM Training Privacy through Recollection and Ranking

Authors: Wenlong Meng, Zhenyuan Guo, Lenan Wu, Chen Gong, Wenyan Liu, Weixian Li, Chengkun Wei, Wenzhi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12658
Pdf URL: https://arxiv.org/pdf/2502.12658
Copy Paste: [[2502.12658]] R.R.: Unveiling LLM Training Privacy through Recollection and Ranking(https://arxiv.org/abs/2502.12658)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) pose significant privacy risks, potentially leaking training data due to implicit memorization. Existing privacy attacks primarily focus on membership inference attacks (MIAs) or data extraction attacks, but reconstructing specific personally identifiable information (PII) in an LLM's training data remains challenging. In this paper, we propose R.R. (Recollect and Rank), a novel two-step privacy stealing attack that enables attackers to reconstruct PII entities from scrubbed training data where the PII entities have been masked. In the first stage, we introduce a prompt paradigm named recollection, which instructs the LLM to repeat a masked text but fill in masks. Then we can use PII identifiers to extract recollected PII candidates. In the second stage, we design a new criterion to score each PII candidate and rank them. Motivated by membership inference, we leverage the reference model as a calibration to our criterion. Experiments across three popular PII datasets demonstrate that R.R. achieves better PII identical performance compared to baselines. These results highlight the vulnerability of LLMs to PII leakage even when training data has been scrubbed. We release the replication package of R.R. at a link.
摘要：大型语言模型（LLMS）构成了很大的隐私风险，由于隐式记忆，可能会泄漏培训数据。现有的隐私攻击主要集中于会员推理攻击（MIA）或数据提取攻击，但在LLM的培训数据中重建特定的个人身份信息（PII）仍然具有挑战性。在本文中，我们提出了R.R.（回忆和排名），这是一种新颖的两步隐私窃取攻击，使攻击者能够从掩盖PII实体的擦洗训练数据中重建PII实体。在第一阶段，我们引入了一个名为Recollection的提示范式，该范式指示LLM重复蒙版文本，但要填充口罩。然后，我们可以使用PII标识符来提取回忆的PII候选者。在第二阶段，我们设计了一个新标准，以评分每个PII候选人并对其进行排名。在会员推理的推动下，我们利用参考模型作为对我们标准的校准。在三个流行的PII数据集中进行的实验表明，与基准相比，R.R.取得更好的PII性能。这些结果突出了LLMS对PII泄漏的脆弱性，即使训练数据已被擦洗。我们在链接中发布R.R.的重复包。

Title: Demystifying Multilingual Chain-of-Thought in Process Reward Modeling

Authors: Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12663
Pdf URL: https://arxiv.org/pdf/2502.12663
Copy Paste: [[2502.12663]] Demystifying Multilingual Chain-of-Thought in Process Reward Modeling(https://arxiv.org/abs/2502.12663)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
摘要：大型语言模型（LLM）旨在执行各种任务。为了提高他们解决需要多步推理的复杂问题的能力，最近的研究利用了过程奖励建模，可以在加强学习的推理过程的每个步骤（RL）的每个步骤中提供精细的反馈，但它主要集中在英语上。在本文中，我们应对将流程奖励模型（PRM）扩展到多语言设置的关键挑战。为了实现这一目标，我们在跨越七种语言的数据集上训练多语言PRM，这是从英语翻译而来的。通过对11种语言中两种广泛使用的推理基准测试的全面评估，我们证明了多语言PRM不仅提高了平均准确性，还可以减少早期推理错误。此外，我们的结果突出了多语言PRM对培训语言数量和英语数据数量的敏感性，同时还发现了更多候选响应和可训练的参数所带来的好处。这项工作为在复杂的多个步骤推理任务中的强大多语言应用程序开辟了有希望的途径。此外，我们发布了代码，以促进该线路的研究。

Title: A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization

Authors: Junhui He, Junna Xing, Nan Wang, Rui Xu, Shangyu Wu, Peng Zhou, Qiang Liu, Chun Jason Xue, Qingan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12665
Pdf URL: https://arxiv.org/pdf/2502.12665
Copy Paste: [[2502.12665]] A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization(https://arxiv.org/abs/2502.12665)
Keywords: language model, llm, long context
Abstract: Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to CPU and retrieving necessary tokens on demand during inference. However, these methods still suffer from unsatisfactory accuracy degradation and extra retrieval overhead. To address these limitations, this paper proposes A$^2$ATS, a novel retrieval-based KV cache reduction method. A$^2$ATS aims to obtain an accurate approximation of attention scores by applying the vector quantization technique to key states, thereby enabling efficient and precise retrieval of the top-K tokens. First, we propose Windowed Rotary Position Embedding, which decouples the positional dependency from query and key states after position embedding. Then, we propose query-aware vector quantization that optimizes the objective of attention score approximation directly. Finally, we design the heterogeneous inference architecture for KV cache offloading, enabling long context serving with larger batch sizes. Experimental results demonstrate that A$^2$ATS can achieve a lower performance degradation with similar or lower overhead compared to existing methods, thereby increasing long context serving throughput by up to $2.7 \times$.
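The retrieval idea in the A$^2$ATS abstract above (approximate attention scores from vector-quantized keys, then fetch only the top-K tokens) can be illustrated with a small Python sketch. This is a simplified stand-in, not the paper's method: it uses plain nearest-centroid quantization with a fixed, hand-picked codebook, and it does not reproduce Windowed Rotary Position Embedding or the query-aware quantization objective. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def assign_codes(keys: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each key to the index of its nearest centroid (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda c: sq_dist(key, codebook[c]))
        for key in keys
    ]


def approx_top_k(query: Vector, codes: Sequence[int],
                 codebook: Sequence[Vector], k: int) -> List[int]:
    """Rank tokens by query-centroid scores instead of query-key scores.

    Only len(codebook) dot products are computed, however many tokens exist.
    """
    centroid_scores = [dot(query, c) for c in codebook]
    approx = [centroid_scores[code] for code in codes]
    return sorted(range(len(codes)), key=lambda i: approx[i], reverse=True)[:k]


# Toy data (assumed for illustration): four 2-d key states, two centroids.
codebook = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.9, 0.1], [0.1, 0.8], [1.1, -0.1], [0.0, 1.2]]
codes = assign_codes(keys, codebook)

query = [0.0, 1.0]
top = approx_top_k(query, codes, codebook, k=2)

# Exact top-2 by full query-key dot products, for comparison.
exact = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)[:2]
```

In this toy case the approximate and exact top-2 sets coincide. With a coarser codebook they can diverge, which is the accuracy loss that the abstract says the method aims to reduce.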

Title: Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe, Mitsuki Sakamoto, Eiji Uchibe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12668
Pdf URL: https://arxiv.org/pdf/2502.12668
Copy Paste: [[2502.12668]] Evaluation of Best-of-N Sampling Strategies for Language Model Alignment(https://arxiv.org/abs/2502.12668)
Keywords: language model, llm
Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Since the reward model is an imperfect proxy for the true objective, an excessive focus on optimizing its value can lead to a compromise of its performance on the true objective. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization to the objective, and shows empirically that it outperforms BoN sampling by mitigating reward hacking (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN based on a heuristic and do not analyze why such a regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of BoN sampling on regularization strategies. Using the regularization strategies corresponds to robust optimization, which maximizes the worst case over a set of possible perturbations in the proxy reward. Although the theoretical guarantees are not directly applicable to RBoN, RBoN corresponds to a practical implementation. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN in proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic's hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true proxy reward. In addition, we propose another simple RBoN method, the Sentence Length Regularized BoN, which performs better in the experiments than the previous methods.
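To make the contrast between plain BoN and a regularized variant concrete, here is a minimal Python sketch. The function names, the toy reward, and the linear length penalty with weight `beta` are illustrative assumptions. The abstract does not give the exact form of Sentence Length Regularized BoN, so this should not be read as the paper's formulation.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Plain BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=reward)


def length_regularized_best_of_n(
    candidates: Sequence[str],
    reward: Callable[[str], float],
    beta: float = 0.1,
) -> str:
    """Hypothetical regularized BoN: proxy reward minus a length penalty.

    The penalty (beta * number of whitespace-separated words) is an assumed
    form chosen only for illustration.
    """
    return max(candidates, key=lambda y: reward(y) - beta * len(y.split()))


def toy_reward(text: str) -> float:
    """Toy proxy reward that can be 'hacked' by repeating the word 'great'."""
    return float(text.count("great"))


candidates = [
    "great answer",
    "great great great great great great",
    "a short reply",
]

# Plain BoN follows the proxy reward and picks the degenerate repetition.
bon_choice = best_of_n(candidates, toy_reward)

# With a strong enough penalty, the regularized variant prefers the short answer.
reg_choice = length_regularized_best_of_n(candidates, toy_reward, beta=1.5)
```

The toy reward is deliberately easy to exploit, so the example shows only the mechanism (a penalty term changing which candidate wins), not any empirical claim about real reward models.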

Title: Baichuan-M1: Pushing the Medical Capability of Large Language Models

Authors: Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, Zhengyun Zhao, Da Pan, Fan Yang, Fei Kou, Fei Li, Fuzhong Chen, Guosheng Dong, Han Liu, Hongda Zhang, Jin He, Jinjie Yang, Kangxi Wu, Kegeng Wu, Lei Su, Linlin Niu, Linzhuang Sun, Mang Wang, Pengcheng Fan, Qianli Shen, Rihui Xin, Shunya Dang, Songchi Zhou, Weipeng Chen, Wenjing Luo, Xin Chen, Xin Men, Xionghai Lin, Xuezhen Dong, Yan Zhang, Yifei Duan, Yuyan Zhou, Zhi Ma, Zhiying Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12671
Pdf URL: https://arxiv.org/pdf/2502.12671
Copy Paste: [[2502.12671]] Baichuan-M1: Pushing the Medical Capability of Large Language Models(https://arxiv.org/abs/2502.12671)
Keywords: language model, llm
Abstract: The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.

Title: Multi-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming

Authors: Arash Lagzian, Srinivas Anumasa, Dianbo Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12700
Pdf URL: https://arxiv.org/pdf/2502.12700
Copy Paste: [[2502.12700]] Multi-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming(https://arxiv.org/abs/2502.12700)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) demonstrate remarkable proficiency in generating accurate and fluent text. However, they often struggle with diversity and novelty, leading to repetitive or overly deterministic responses. These limitations stem from constraints in training data, including gaps in specific knowledge domains, outdated information, and an over-reliance on textual sources. Such shortcomings reduce their effectiveness in tasks requiring creativity, multi-perspective reasoning, and exploratory thinking, such as LLM-based AI scientist agents and creative artist agents. To address this challenge, we introduce an inference-time multi-view brainstorming method, a novel approach that enriches input prompts with diverse perspectives derived from both textual and visual sources, which we refer to as "Multi-Novelty". By incorporating additional contextual information as diverse starting points for chains of thought, this method enhances the variety and creativity of generated outputs. Importantly, our approach is model-agnostic, requiring no architectural modifications and being compatible with both open-source and proprietary LLMs.

Title: "I know myself better, but not really greatly": Using LLMs to Detect and Explain LLM-Generated Texts

Authors: Jiazhou Ji, Jie Guo, Weidong Qiu, Zheng Huang, Yang Xu, Xinru Lu, Xiaoyu Jiang, Ruizhe Li, Shujun Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12743
Pdf URL: https://arxiv.org/pdf/2502.12743
Copy Paste: [[2502.12743]] "I know myself better, but not really greatly": Using LLMs to Detect and Explain LLM-Generated Texts(https://arxiv.org/abs/2502.12743)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in generating human-like texts, but the potential misuse of such LLM-generated texts raises the need to distinguish between human-generated and LLM-generated content. This paper explores the detection and explanation capabilities of LLM-based detectors of LLM-generated texts, in the context of a binary classification task (human-generated texts vs LLM-generated texts) and a ternary classification task (human-generated texts, LLM-generated texts, and undecided). By evaluating on six closed- and open-source LLMs of different sizes, our findings reveal that while self-detection consistently outperforms cross-detection, i.e., LLMs can detect texts generated by themselves more accurately than those generated by other LLMs, the performance of self-detection is still far from ideal, indicating that further improvements are needed. We also show that extending the binary to the ternary classification task with a new class "Undecided" can enhance both detection accuracy and explanation quality, with improvements being statistically significant and consistent across all LLMs. Finally, we conducted comprehensive qualitative and quantitative analyses of the explanation errors, which are categorized into three types: reliance on inaccurate features (the most frequent error), hallucinations, and incorrect reasoning. These findings, together with our human-annotated dataset, emphasize the need for further research into improving both self-detection and self-explanation, particularly to address overfitting issues that may hinder generalization.

Title: Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation

Authors: Yong Zhang, Bingyuan Zhang, Zhitao Li, Ming Li, Ning Cheng, Minchuan Chen, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12744
Pdf URL: https://arxiv.org/pdf/2502.12744
Copy Paste: [[2502.12744]] Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation(https://arxiv.org/abs/2502.12744)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: The rapid advancement of large language models (LLMs) has significantly enhanced their reasoning abilities, enabling increasingly complex tasks. However, these capabilities often diminish in smaller, more computationally efficient models like GPT-2. Recent research shows that reasoning distillation can help small models acquire reasoning capabilities, but most existing methods focus primarily on improving teacher-generated reasoning paths. Our observations reveal that small models can generate high-quality reasoning paths during sampling, even without chain-of-thought prompting, though these paths are often latent due to their low probability under standard decoding strategies. To address this, we propose Self-Enhanced Reasoning Training (SERT), which activates and leverages latent reasoning capabilities in small models through self-training on filtered, self-generated reasoning paths under zero-shot conditions. Experiments using OpenAI's GPT-3.5 as the teacher model and GPT-2 models as the student models demonstrate that SERT enhances the reasoning abilities of small models, improving their performance in reasoning distillation.

Title: MediaMind: Revolutionizing Media Monitoring using Agentification

Authors: Ahmet Gunduz, Kamer Ali Yuksel, Hassan Sawaf
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12745
Pdf URL: https://arxiv.org/pdf/2502.12745
Copy Paste: [[2502.12745]] MediaMind: Revolutionizing Media Monitoring using Agentification(https://arxiv.org/abs/2502.12745)
Keywords: agent
Abstract: In an era of rapid technological advancements, agentification of software tools has emerged as a critical innovation, enabling systems to function autonomously and adaptively. This paper introduces MediaMind as a case study to demonstrate the agentification process, highlighting how existing software can be transformed into intelligent agents capable of independent decision-making and dynamic interaction. Developed by aiXplain, MediaMind leverages agent-based architecture to autonomously monitor, analyze, and provide insights from multilingual media content in real time. The focus of this paper is on the technical methodologies and design principles behind agentifying MediaMind, showcasing how agentification enhances adaptability, efficiency, and responsiveness. Through detailed case studies and practical examples, we illustrate how the agentification of MediaMind empowers organizations to streamline workflows, optimize decision-making, and respond to evolving trends. This work underscores the broader potential of agentification to revolutionize software tools across various domains.

Title: Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models

Authors: Kamer Ali Yuksel, Ahmet Gunduz, Abdul Baseet Anees, Hassan Sawaf
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2502.12755
Pdf URL: https://arxiv.org/pdf/2502.12755
Copy Paste: [[2502.12755]] Efficient Machine Translation Corpus Generation: Integrating Human-in-the-Loop Post-Editing with Large Language Models(https://arxiv.org/abs/2502.12755)
Keywords: language model, llm
Abstract: This paper introduces an advanced methodology for machine translation (MT) corpus generation, integrating semi-automated, human-in-the-loop post-editing with large language models (LLMs) to enhance efficiency and translation quality. Building upon previous work that utilized real-time training of a custom MT quality estimation metric, this system incorporates novel LLM features such as Enhanced Translation Synthesis and Assisted Annotation Analysis, which improve initial translation hypotheses and quality assessments, respectively. Additionally, the system employs LLM-Driven Pseudo Labeling and a Translation Recommendation System to reduce human annotator workload in specific contexts. These improvements not only retain the original benefits of cost reduction and enhanced post-edit quality but also open new avenues for leveraging cutting-edge LLM advancements. The project's source code is available for community use, promoting collaborative developments in the field. The demo video can be accessed here.

Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Authors: Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12767
Pdf URL: https://arxiv.org/pdf/2502.12767
Copy Paste: [[2502.12767]] R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs(https://arxiv.org/abs/2502.12767)
Keywords: language model, llm, hallucination, agent
Abstract: Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks are often rigid, struggling to adapt to KG or task changes. They also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from the KG, which significantly enhances reliability. Experiments across multiple KG-based reasoning tasks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of the LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability while reducing inference cost. However, it also leads to a higher abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning. It reduces reliance on high-capacity LLMs while ensuring trustworthy inference.

Title: How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild

Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12769
Pdf URL: https://arxiv.org/pdf/2502.12769
Copy Paste: [[2502.12769]] How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild(https://arxiv.org/abs/2502.12769)
Keywords: language model, llm, hallucination, prompt
Abstract: In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit larger hallucination rates than larger models.
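The distinction the abstract above draws between the raw count of hallucinated tokens and the length-normalized rate can be shown with a short Python sketch. It assumes each response comes with per-token binary labels (1 = hallucinated, 0 = not). That labeling scheme, the function names, and the toy numbers are assumptions for illustration; the paper's exact rate definition may differ.

```python
from typing import Sequence


def hallucinated_count(token_labels: Sequence[int]) -> int:
    """Absolute number of tokens flagged as hallucinated (label 1)."""
    return sum(token_labels)


def hallucination_rate(token_labels: Sequence[int]) -> float:
    """Length-normalized rate: flagged tokens divided by all tokens."""
    if not token_labels:
        return 0.0
    return sum(token_labels) / len(token_labels)


# Toy responses (assumed data): a long one and a short one.
long_response = [1] * 8 + [0] * 32   # 40 tokens, 8 flagged
short_response = [1] * 3 + [0] * 7   # 10 tokens, 3 flagged

long_count = hallucinated_count(long_response)
short_count = hallucinated_count(short_response)
long_rate = hallucination_rate(long_response)
short_rate = hallucination_rate(short_response)
```

Here the longer response contains more hallucinated tokens in absolute terms (8 versus 3) but has the lower rate (0.2 versus 0.3), which is why a per-token rate is the fairer quantity to compare across languages whose responses differ in length.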

Title: Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach

Authors: Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2502.12771
Pdf URL: https://arxiv.org/pdf/2502.12771
Copy Paste: [[2502.12771]] Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach(https://arxiv.org/abs/2502.12771)
Keywords: language model
Abstract: Self-supervised language and audio models effectively predict brain responses to speech. However, traditional prediction models rely on linear mappings from unimodal features, despite the complex integration of auditory signals with linguistic and semantic information across widespread brain networks during speech comprehension. Here, we introduce a nonlinear, multimodal prediction model that combines audio and linguistic features from pre-trained models (e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in prediction performance (unnormalized and normalized correlation) over traditional unimodal linear models, as well as a 7.7% and 14.4% improvement, respectively, over prior state-of-the-art models. These improvements represent a major step towards future robust in-silico testing and improved decoding performance. They also reveal how auditory and semantic information are fused in motor, somatosensory, and higher-level semantic regions, aligning with existing neurolinguistic theories. Overall, our work highlights the often neglected potential of nonlinear and multimodal approaches to brain modeling, paving the way for future studies to embrace these strategies in naturalistic neurolinguistics research.

Title: Commonsense Reasoning in Arab Culture

Authors: Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12788
Pdf URL: https://arxiv.org/pdf/2502.12788
Copy Paste: [[2502.12788]] Commonsense Reasoning in Arab Culture(https://arxiv.org/abs/2502.12788)
Keywords: language model, gpt
Abstract: Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce \datasetname, a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. \datasetname spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.

Title: Towards Text-Image Interleaved Retrieval

Authors: Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
Subjects: cs.CL, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2502.12799
Pdf URL: https://arxiv.org/pdf/2502.12799
Copy Paste: [[2502.12799]] Towards Text-Image Interleaved Retrieval(https://arxiv.org/abs/2502.12799)
Keywords: language model, llm
Abstract: Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline using an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.

Title: Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models

Authors: Adnan Ahmad, Stefan Hillmann, Sebastian Möller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12813
Pdf URL: https://arxiv.org/pdf/2502.12813
Copy Paste: [[2502.12813]] Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models(https://arxiv.org/abs/2502.12813)
Keywords: language model, gpt, llm
Abstract: In this study, we explore the application of Large Language Models (LLMs) for generating synthetic users and simulating user conversations with a task-oriented dialogue system, and we present detailed results and their analysis. We propose a comprehensive, novel user simulation technique that uses LLMs to create diverse user profiles, set goals, engage in multi-turn dialogues, and evaluate conversation success. We employ two proprietary LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a heterogeneous base of user profiles, characterized by varied demographics, multiple user goals, different conversational styles, initial knowledge levels, interests, and conversational objectives. We perform a detailed analysis of the user profiles generated by LLMs to assess the diversity, consistency, and potential biases inherent in these LLM-generated user simulations. We find that GPT-o1 generates a more heterogeneous user distribution across most user attributes, while GPT-4o generates more skewed user attributes. The generated set of user profiles is then used to simulate dialogue sessions by interacting with a task-oriented dialogue system.

Title: Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models

Authors: Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12821
Pdf URL: https://arxiv.org/pdf/2502.12821
Copy Paste: [[2502.12821]] Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models(https://arxiv.org/abs/2502.12821)
Keywords: language model, llm, prompt
Abstract: Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.

Title: Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

Authors: Rubing Lu, João Sedoc, Arun Sundararajan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12825
Pdf URL: https://arxiv.org/pdf/2502.12825
Copy Paste: [[2502.12825]] Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models(https://arxiv.org/abs/2502.12825)
Keywords: language model, gpt, llm
Abstract: When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.

Title: KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

Authors: Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12829
Pdf URL: https://arxiv.org/pdf/2502.12829
Copy Paste: [[2502.12829]] KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan(https://arxiv.org/abs/2502.12829)
Keywords: language model, gpt, llm
Abstract: Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.

Title: Subword models struggle with word learning, but surprisal hides it

Authors: Bastian Bunzeck, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12835
Pdf URL: https://arxiv.org/pdf/2502.12835
Copy Paste: [[2502.12835]] Subword models struggle with word learning, but surprisal hides it(https://arxiv.org/abs/2502.12835)
Keywords: language model
Abstract: We study word learning in subword and character language models with the psycholinguistic lexical decision task. While subword LMs struggle to discern words and non-words with high accuracy, character LMs solve this task easily and consistently. Furthermore, when comparing word learning and syntactic learning, both processes are separable in character LMs, where word learning precedes syntactic learning, whereas these processes are simultaneous in subword LMs. This raises questions about the adequacy of subword LMs for modeling language acquisition and positions character LMs as a viable alternative.
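Illustration: a lexical decision task can be scored as the fraction of word/non-word pairs in which the model rates the real word as more word-like. The sketch below assumes a caller-supplied `score` function (higher means more word-like); the toy scorer and the example pairs are hypothetical stand-ins, not the authors' models or stimuli.

```python
def lexical_decision_accuracy(pairs, score):
    """Fraction of (word, nonword) pairs where the word gets the higher score.

    `score` is any callable mapping a string to a number; for a language
    model it could be, for instance, a negated surprisal (an assumption here).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = sum(1 for word, nonword in pairs if score(word) > score(nonword))
    return correct / len(pairs)


def toy_score(text):
    """Toy stand-in for a model: penalize letters that are rare in English."""
    return -sum(ch in "xqz" for ch in text)


pairs = [("house", "hxuse"), ("table", "tqble"), ("zebra", "sebra")]
print(lexical_decision_accuracy(pairs, toy_score))
```

The third pair shows how a scorer that relies on surface statistics can misjudge a real word, which is the kind of failure the task is meant to expose.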

Title: An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation

Authors: Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12836
Pdf URL: https://arxiv.org/pdf/2502.12836
Copy Paste: [[2502.12836]] An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation(https://arxiv.org/abs/2502.12836)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large language models (LLMs) are revolutionizing healthcare by improving diagnosis, patient care, and decision support through interactive communication. More recently, they have been applied to analyzing physiological time-series such as wearable data for health insight extraction. Existing methods embed raw numerical sequences directly into prompts, which exceeds token limits and increases computational costs. Additionally, some studies integrated features extracted from time-series in textual prompts or applied multimodal approaches. However, these methods often produce generic and unreliable outputs due to LLMs' limited analytical rigor and inefficiency in interpreting continuous waveforms. In this paper, we develop an LLM-powered agent for physiological time-series analysis that aims to bridge the gap in integrating LLMs with well-established analytical tools. Built on OpenCHA, an open-source LLM-powered framework, our agent features an orchestrator that integrates user interaction, data sources, and analytical tools to generate accurate health insights. To evaluate its effectiveness, we implement a case study on heart rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study. The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o, with ECG serving as the gold standard for HR estimation. Results demonstrate that our agent significantly outperforms benchmark models by achieving lower error rates and more reliable HR estimations. The agent implementation is publicly available on GitHub.

Title: MeMo: Towards Language Models with Associative Memory Mechanisms

Authors: Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12851
Pdf URL: https://arxiv.org/pdf/2502.12851
Copy Paste: [[2502.12851]] MeMo: Towards Language Models with Associative Memory Mechanisms(https://arxiv.org/abs/2502.12851)
Keywords: language model
Abstract: Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.

Title: MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Authors: Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12852
Pdf URL: https://arxiv.org/pdf/2502.12852
Copy Paste: [[2502.12852]] MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching(https://arxiv.org/abs/2502.12852)
Keywords: language model, gpt
Abstract: Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.

Title: S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Authors: Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12853
Pdf URL: https://arxiv.org/pdf/2502.12853
Copy Paste: [[2502.12853]] S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning(https://arxiv.org/abs/2502.12853)
Keywords: llm
Abstract: Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at this https URL.

Title: Rejected Dialects: Biases Against African American Language in Reward Models

Authors: Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2502.12858
Pdf URL: https://arxiv.org/pdf/2502.12858
Copy Paste: [[2502.12858]] Rejected Dialects: Biases Against African American Language in Reward Models(https://arxiv.org/abs/2502.12858)
Keywords: language model, llm, prompt
Abstract: Preference alignment via reward models helps build safe, helpful, and reliable large language models (LLMs). However, subjectivity in preference judgments and the lack of representative sampling in preference data collection can introduce new biases, hindering reward models' fairness and equity. In this work, we introduce a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) through several experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and both machine-translated and human-written AAL corpora. We show that reward models are less aligned with human preferences when processing AAL texts vs. WME ones (-4% accuracy on average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and steer conversations toward WME, even when prompted with AAL texts. Our findings provide a targeted analysis of anti-AAL biases at a relatively understudied stage in LLM development, highlighting representational harms and ethical questions about the desired behavior of LLMs concerning AAL.

Title: PAFT: Prompt-Agnostic Fine-Tuning

Authors: Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12859
Pdf URL: https://arxiv.org/pdf/2502.12859
Copy Paste: [[2502.12859]] PAFT: Prompt-Agnostic Fine-Tuning(https://arxiv.org/abs/2502.12859)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) adapt well to downstream tasks after fine-tuning, this adaptability often compromises prompt robustness, as even minor prompt variations can significantly degrade performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach that dynamically adjusts prompts during fine-tuning. This encourages the model to learn underlying task principles rather than overfitting to specific prompt formulations. PAFT operates in two stages: First, a diverse set of meaningful, synthetic candidate prompts is constructed. Second, during fine-tuning, prompts are randomly sampled from this set to create dynamic training inputs. Extensive experiments across diverse datasets and LLMs demonstrate that models trained with PAFT exhibit strong robustness and generalization across a wide range of prompts, including unseen ones. This enhanced robustness improves both model performance and inference speed while maintaining training efficiency. Ablation studies further confirm the effectiveness of PAFT.
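Illustration: the second stage described in the abstract (randomly sampling a prompt from a candidate set for each training input) can be sketched as follows. This is a minimal sketch of the sampling idea only, not the authors' implementation: the templates, function names, and toy examples are hypothetical, and the actual fine-tuning step is omitted.

```python
import random

# Hypothetical candidate prompts; the abstract says PAFT constructs a diverse
# synthetic set in its first stage, which is not reproduced here.
CANDIDATE_PROMPTS = [
    "Answer the question: {question}",
    "Q: {question}\nA:",
    "Please respond to the following.\n{question}",
]


def build_training_inputs(examples, prompts, rng):
    """Pair each (question, answer) with a freshly sampled prompt template."""
    batch = []
    for question, answer in examples:
        template = rng.choice(prompts)
        batch.append((template.format(question=question), answer))
    return batch


def sample_epochs(examples, prompts, num_epochs, seed=0):
    """Yield one re-prompted copy of the dataset per epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        yield build_training_inputs(examples, prompts, rng)


examples = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
for epoch, batch in enumerate(sample_epochs(examples, CANDIDATE_PROMPTS, num_epochs=2)):
    for text, target in batch:
        print(epoch, repr(text), "->", target)
```

Because the template is resampled every epoch, the same example is seen under different wordings, which is the property the abstract credits for reducing overfitting to one prompt formulation.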

Title: How desirable is alignment between LLMs and linguistically diverse human users?

Authors: Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12884
Pdf URL: https://arxiv.org/pdf/2502.12884
Copy Paste: [[2502.12884]] How desirable is alignment between LLMs and linguistically diverse human users?(https://arxiv.org/abs/2502.12884)
Keywords: language model, llm
Abstract: We discuss how desirable it is that Large Language Models (LLMs) be able to adapt or align their language behavior with users who may be diverse in their language use. User diversity may arise from, among other factors, i) age differences, ii) gender characteristics, and/or iii) multilingual experience, along with associated differences in language processing and use. We consider potential consequences for usability, communication, and LLM development.

Title: Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?

Authors: Georg Rehm, Annika Grützner-Zahn, Fabio Barth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12886
Pdf URL: https://arxiv.org/pdf/2502.12886
Copy Paste: [[2502.12886]] Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?(https://arxiv.org/abs/2502.12886)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate unprecedented capabilities and define the state of the art for almost all natural language processing (NLP) tasks and also for essentially all Language Technology (LT) applications. LLMs can only be trained for languages for which a sufficient amount of pre-training data is available, effectively excluding many languages that are typically characterised as under-resourced. However, there is both circumstantial and empirical evidence that multilingual LLMs, which have been trained using data sets that cover multiple languages (including under-resourced ones), do exhibit strong capabilities for some of these under-resourced languages. Eventually, this approach may have the potential to be a technological off-ramp for those under-resourced languages for which "native" LLMs, and LLM-based technologies, cannot be developed due to a lack of training data. This paper, which concentrates on European languages, examines this idea, analyses the current situation in terms of technology support and summarises related work. The article concludes by focusing on the key open questions that need to be answered for the approach to be put into practice in a systematic way.

Title: H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking

Authors: Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12893
Pdf URL: https://arxiv.org/pdf/2502.12893
Copy Paste: [[2502.12893]] H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking(https://arxiv.org/abs/2502.12893)
Keywords: prompt, chain-of-thought
Abstract: Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks, using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline (dropping from 98% to below 2%) and, in some instances, even transform initially cautious tones into ones that are willing to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.

Title: Multilingual European Language Models: Benchmarking Approaches and Challenges

Authors: Fabio Barth, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12895
Pdf URL: https://arxiv.org/pdf/2502.12895
Copy Paste: [[2502.12895]] Multilingual European Language Models: Benchmarking Approaches and Challenges(https://arxiv.org/abs/2502.12895)
Keywords: language model, llm, chat
Abstract: The breakthrough of generative large language models (LLMs) that can solve different tasks through chat interaction has led to a significant increase in the use of general benchmarks to assess the quality or performance of these models beyond individual applications. There is also a need for better methods to evaluate and also to compare models due to the ever increasing number of new models published. However, most of the established benchmarks revolve around the English language. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We analyse seven multilingual benchmarks and identify four major challenges. Furthermore, we discuss potential solutions to enhance translation quality and mitigate cultural biases, including human-in-the-loop verification and iterative translation ranking. Our analysis highlights the need for culturally aware and rigorously validated benchmarks to assess the reasoning and question-answering capabilities of multilingual LLMs accurately.

Title: None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks

Authors: Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12896
Pdf URL: https://arxiv.org/pdf/2502.12896
Copy Paste: [[2502.12896]] None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks(https://arxiv.org/abs/2502.12896)
Keywords: llm
Abstract: In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorizing) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs' answers.
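Illustration: the abstract does not spell out the variation procedure. One plausible reading of the title, shown here purely as an assumption, is that the text of the correct option is replaced by a "None of the others" label, so the right choice can only be found by ruling out the remaining options. The function name and the example question are hypothetical.

```python
def apply_none_of_the_others(options, correct_index, label="None of the others"):
    """Return a copy of `options` with the correct option replaced by `label`.

    Assumed variation (not confirmed by the abstract): the correct answer's
    text disappears, and the label in the same position becomes correct.
    """
    if not 0 <= correct_index < len(options):
        raise ValueError("correct_index is out of range")
    varied = list(options)
    varied[correct_index] = label
    return varied, correct_index


question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Mars", "Jupiter"]
varied, answer = apply_none_of_the_others(options, correct_index=1)
print(question)
for letter, text in zip("ABCD", varied):
    print(f"{letter}. {text}")
print("Correct:", "ABCD"[answer])
```

A model that merely recalls "Mercury" finds no matching option under this transformation, which is consistent with the accuracy drops the abstract reports.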

Title: Soundwave: Less is More for Speech-Text Alignment in LLMs

Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2502.12900
Pdf URL: https://arxiv.org/pdf/2502.12900
Copy Paste: [[2502.12900]] Soundwave: Less is More for Speech-Text Alignment in LLMs(https://arxiv.org/abs/2502.12900)
Keywords: language model, llm
Abstract: Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at this https URL.

Title: Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

Authors: Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12904
Pdf URL: https://arxiv.org/pdf/2502.12904
Copy Paste: [[2502.12904]] Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements(https://arxiv.org/abs/2502.12904)
Keywords: llm, agent
Abstract: We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to defend against internet fraud and phishing in dynamic, real-world scenarios. Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job postings, social media, and news, categorized into 5 major fraud types. Unlike previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to assess LLMs' resistance to fraud at different stages, including credibility building, urgency creation, and emotional manipulation. Furthermore, we evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM provides general decision-making assistance, and 2. Role-play, where the model assumes a specific persona, widely used in real-world agent-based interactions. Our evaluation reveals the significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings. Additionally, we observe a substantial performance gap between Chinese and English, underscoring the need for improved multilingual fraud detection capabilities.

Title: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

Authors: Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2502.12911
Pdf URL: https://arxiv.org/pdf/2502.12911
Copy Paste: [[2502.12911]] Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation(https://arxiv.org/abs/2502.12911)
Keywords: llm, agent
Abstract: Generating SQL from user queries is a long-standing challenge, where the accuracy of initial schema linking significantly impacts subsequent SQL generation performance. However, current schema linking models still struggle with missing relevant schema elements or an excess of redundant ones. A crucial reason for this is that the commonly used metrics, recall and precision, fail to capture missing relevant elements and thus cannot reflect actual schema linking performance. Motivated by this, we propose an enhanced schema linking metric by introducing a restricted missing indicator. Accordingly, we introduce the Knapsack optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent designed to prevent relevant schema elements from being missed while minimizing the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy that first identifies the optimal table linking and subsequently links columns within the selected table to reduce the linking candidate space. In each linking process, it utilizes a knapsack optimization approach to link potentially relevant elements while accounting for a limited tolerance of potentially redundant ones. With this optimization, KaSLA-1.6B achieves superior schema linking results compared to large-scale LLMs, including deepseek-v3 with a state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider and BIRD benchmarks verify that KaSLA can significantly improve the SQL generation performance of SOTA text-to-SQL models by substituting their schema linking processes.
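Illustration: the knapsack framing can be pictured as choosing schema elements that maximize total relevance while their combined redundancy cost stays within a tolerance. The sketch below is a textbook 0/1 knapsack solved by dynamic programming, not KaSLA itself; the column names, integer relevance values, redundancy weights, and capacity are all hypothetical.

```python
def knapsack_select(items, capacity):
    """Solve a 0/1 knapsack; return (best_value, chosen_names).

    items: list of (name, value, weight) with non-negative integer weights.
    capacity: non-negative integer budget on the total weight.
    """
    n = len(items)
    # best[i][c] = best total value using the first i items within budget c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, value, weight) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip item i
            if weight <= c:
                candidate = best[i - 1][c - weight] + value  # option 2: take it
                if candidate > best[i][c]:
                    best[i][c] = candidate
    # Walk back through the table to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, _, weight = items[i - 1]
            chosen.append(name)
            c -= weight
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical columns: (name, relevance score, redundancy cost).
columns = [
    ("orders.id", 9, 1),
    ("orders.total", 8, 2),
    ("orders.note", 2, 3),
    ("orders.date", 6, 2),
]
print(knapsack_select(columns, capacity=4))
```

Raising the capacity admits more borderline columns, which mirrors the trade-off the abstract describes between missing relevant elements and including redundant ones.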

Title: Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison

Authors: George-Kirollos Saad, Scott Sanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12921
Pdf URL: https://arxiv.org/pdf/2502.12921
Copy Paste: [[2502.12921]] Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison(https://arxiv.org/abs/2502.12921)
Keywords: language model, llm, prompt
Abstract: Query-driven recommendation with unknown items poses a challenge for users to understand why certain items are appropriate for their needs. Query-driven Contrastive Summarization (QCS) is a methodology designed to address this issue by leveraging language-based item descriptions to clarify contrasts between them. However, existing state-of-the-art contrastive summarization methods such as STRUM-LLM fall short of this goal. To overcome these limitations, we introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs debate-style prompting to generate focused and contrastive summarizations of item aspects relevant to a query. Leveraging modern large language models (LLMs) as powerful tools for generating debates, Q-STRUM Debate provides enhanced contrastive summaries. Experiments across three datasets demonstrate that Q-STRUM Debate yields significant performance improvements over existing methods on key contrastive summarization criteria, thus introducing a novel and performant debate prompting methodology for QCS.

Title: On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation

Authors: Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12923
Pdf URL: https://arxiv.org/pdf/2502.12923
Copy Paste: [[2502.12923]] On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation(https://arxiv.org/abs/2502.12923)
Keywords: language model, llm, prompt
Abstract: This paper investigates whether Large Language Models (LLMs), fine-tuned on synthetic but domain-representative data, can perform the twofold task of (i) slot and intent detection and (ii) natural language response generation for a smart home assistant, while running solely on resource-limited, CPU-only edge hardware. We fine-tune LLMs to produce both JSON action calls and text responses. Our experiments show that 16-bit and 8-bit quantized variants preserve high accuracy on slot and intent detection and maintain strong semantic coherence in generated text, whereas the 4-bit model, although it retains generative fluency, suffers a noticeable drop in device-service classification accuracy. Further evaluations on noisy human (non-synthetic) prompts and out-of-domain intents confirm the models' generalization ability, obtaining around 80-86% accuracy. While the average inference time of 5-6 seconds per query is acceptable for one-shot commands but suboptimal for multi-turn dialogue, our results affirm that an on-device LLM can effectively unify command interpretation and flexible response generation for home automation without relying on specialized hardware.

Title: Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data

Authors: Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12924
Pdf URL: https://arxiv.org/pdf/2502.12924
Copy Paste: [[2502.12924]] Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data(https://arxiv.org/abs/2502.12924)
Keywords: language model, llm
Abstract: Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.

Title: SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems

Authors: Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12927
Pdf URL: https://arxiv.org/pdf/2502.12927
Copy Paste: [[2502.12927]] SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems(https://arxiv.org/abs/2502.12927)
Keywords: language model, llm, agent
Abstract: Providing high-quality feedback is crucial for student success but is constrained by time, cost, and limited data availability. We introduce Synthetic Educational Feedback Loops (SEFL), a novel framework designed to deliver immediate, on-demand feedback at scale without relying on extensive, real-world student data. In SEFL, two large language models (LLMs) operate in teacher-student roles to simulate assignment completion and formative feedback, generating abundant synthetic pairs of student work and corresponding critiques. We then fine-tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high-quality, goal-oriented feedback. Unlike personalized tutoring approaches that offer multi-turn, individualized instruction, SEFL specifically focuses on replicating the teacher-to-student feedback loop for diverse assignments. Through both LLM-as-a-judge and human evaluations, we demonstrate that SEFL-tuned models outperform their non-tuned counterparts in feedback quality, clarity, and timeliness. These findings reveal SEFL's potential to transform feedback processes for higher education and beyond, offering an ethical and scalable alternative to conventional manual feedback cycles.

Title: Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts

Authors: Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12928
Pdf URL: https://arxiv.org/pdf/2502.12928
Copy Paste: [[2502.12928]] Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts(https://arxiv.org/abs/2502.12928)
Keywords: language model, llm
Abstract: Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., being inactivated). We argue that this could restrict the efficient exploration of model representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts and arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.

Title: Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages

Authors: Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12932
Pdf URL: https://arxiv.org/pdf/2502.12932
Copy Paste: [[2502.12932]] Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages(https://arxiv.org/abs/2502.12932)
Keywords: language model, llm
Abstract: Quantifying reasoning capability in low-resource languages remains a challenge in NLP due to data scarcity and limited access to annotators. While LLM-assisted dataset construction has proven useful for medium- and high-resource languages, its effectiveness in low-resource languages, particularly for commonsense reasoning, is still unclear. In this paper, we compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset. We focus on Javanese and Sundanese, two major local languages in Indonesia, and evaluate the effectiveness of open-weight and closed-weight LLMs in assisting dataset creation through extensive manual validation. To assess the utility of synthetic data, we fine-tune language models on classification and generation tasks using this data and evaluate performance on a human-written test set. Our findings indicate that LLM-assisted data creation outperforms machine translation.

Title: LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation

Authors: Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.12945
Pdf URL: https://arxiv.org/pdf/2502.12945
Copy Paste: [[2502.12945]] LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation(https://arxiv.org/abs/2502.12945)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Popular micro-videos, dominant on platforms like TikTok and YouTube, hold significant commercial value. The rise of high-quality AI-generated content has spurred interest in AI-driven micro-video creation. However, despite the advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek in text generation and reasoning, their potential to assist the creation of popular micro-videos remains largely unexplored. In this paper, we conduct an empirical study on LLM-assisted popular micro-video generation (LLMPopcorn). Specifically, we investigate the following research questions: (i) How can LLMs be effectively utilized to assist popular micro-video generation? (ii) To what extent can prompt-based enhancements optimize the LLM-generated content for higher popularity? (iii) How well do various LLMs and video generators perform in the popular micro-video generation task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3 enable micro-video generation to achieve popularity comparable to human-created content. Prompt enhancements further boost popularity, and benchmarking highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and HunyuanVideo lead in video generation. This pioneering work advances AI-assisted micro-video creation, uncovering new research opportunities. We will release the code and datasets to support future studies.

Title: Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models

Authors: Gyeongman Kim, Gyouk Chu, Eunho Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12947
Pdf URL: https://arxiv.org/pdf/2502.12947
Copy Paste: [[2502.12947]] Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models(https://arxiv.org/abs/2502.12947)
Keywords: language model
Abstract: With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.

Title: Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text

Authors: Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12953
Pdf URL: https://arxiv.org/pdf/2502.12953
Copy Paste: [[2502.12953]] Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text(https://arxiv.org/abs/2502.12953)
Keywords: language model
Abstract: Masked language modeling has become a widely adopted unsupervised technique to pre-train language models. However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at this https URL.
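Illustration (not from the paper): the TIACBM abstract above does not state the formula behind its cyclic decaying masking ratio, so the following is only one plausible shape for such a schedule. It assumes the ratio restarts at a peak at the start of each cycle, falls linearly to a floor within the cycle (hard to easy), and that the peak itself shrinks geometrically from one cycle to the next. The function name `masking_ratio` and every default value are hypothetical.

```python
def masking_ratio(
    step: int,
    cycle_len: int = 100,
    floor: float = 0.15,
    peak: float = 0.45,
    decay: float = 0.5,
) -> float:
    """Hypothetical cyclic, decaying masking ratio (anti-curriculum: hard to easy)."""
    cycle, pos = divmod(step, cycle_len)
    # The peak of each cycle decays geometrically towards the floor.
    cycle_peak = floor + (peak - floor) * (decay ** cycle)
    # Within a cycle, the ratio falls linearly from the cycle's peak to the floor.
    frac = pos / cycle_len
    return cycle_peak - (cycle_peak - floor) * frac


# Print the ratio at the start and midpoint of the first three cycles.
for s in (0, 50, 100, 150, 200, 250):
    print(s, round(masking_ratio(s), 4))
```

A high ratio masks more tokens and makes the objective harder, so starting each cycle at the peak is what gives the schedule its hard-to-easy ordering.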

Title: AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages

Authors: Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12959
Pdf URL: https://arxiv.org/pdf/2502.12959
Copy Paste: [[2502.12959]] AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages(https://arxiv.org/abs/2502.12959)
Keywords: language model
Abstract: Realignment techniques are often employed to enhance cross-lingual transfer in multilingual language models; still, they can sometimes degrade performance in languages that differ significantly from the fine-tuned source language. This paper introduces AlignFreeze, a method that freezes either the layers' lower half or upper half during realignment. Through controlled experiments on 4 tasks, 3 models, and in 35 languages, we find that realignment affects all the layers but can be the most detrimental to the lower ones. Freezing the lower layers can prevent performance degradation. Particularly, AlignFreeze improves Part-of-Speech (PoS) tagging performance in languages where full realignment fails: with XLM-R, it provides improvements of more than one standard deviation in accuracy in seven more languages than full realignment.
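Illustration (not from the paper): the layer split that the AlignFreeze abstract above describes can be expressed as a small helper that decides which transformer layers stay fixed during realignment. This is a framework-free sketch. The function name `frozen_layers` is hypothetical, and the handling of an odd layer count (the extra layer goes to the upper half) is an assumption the abstract does not address.

```python
from typing import List


def frozen_layers(num_layers: int, half: str) -> List[int]:
    """Return the 0-based indices of layers to freeze during realignment."""
    mid = num_layers // 2  # assumption: with an odd count, the upper half is larger
    if half == "lower":
        return list(range(mid))
    if half == "upper":
        return list(range(mid, num_layers))
    raise ValueError("half must be 'lower' or 'upper'")


# For a 12-layer encoder, freezing the lower half leaves layers 6-11 trainable.
print(frozen_layers(12, "lower"))
print(frozen_layers(12, "upper"))
```

In a training framework, the parameters of the returned layers would be excluded from gradient updates during the realignment step only.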

Title: Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

Authors: Xiaoju Ye, Zhichun Wang, Jingyuan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12962
Pdf URL: https://arxiv.org/pdf/2502.12962
Copy Paste: [[2502.12962]] Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing(https://arxiv.org/abs/2502.12962)
Keywords: language model, llm
Abstract: Limited by the context window size of Large Language Models (LLMs), handling various tasks with input tokens exceeding the upper limit has been challenging, whether it is a simple direct retrieval task or a complex multi-hop reasoning task. Although various methods have been proposed to enhance the long-context processing capabilities of LLMs, they either incur substantial post-training costs, or require additional tool modules (e.g., RAG), or have not shown significant improvement in realistic tasks. Our work observes the correlation between the attention distribution and generated answers across each layer, and establishes through experiments that attention allocation aligns with retrieval-augmented capabilities. Drawing on the above insights, we propose a novel method, InfiniRetri, that leverages the LLM's own attention information to enable accurate retrieval across inputs of infinite length. Our evaluations indicate that InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model, surpassing other methods or larger models and setting a new state-of-the-art (SOTA). Moreover, our method achieves significant performance improvements on real-world benchmarks, with a maximum 288% improvement. In addition, InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead in long texts. In summary, our comprehensive studies show InfiniRetri's potential for practical applications and create a paradigm for retrieving information using LLMs' own capabilities under infinite-length tokens. Code will be released in link.

Title: Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

Authors: Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12964
Pdf URL: https://arxiv.org/pdf/2502.12964
Copy Paste: [[2502.12964]] Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs(https://arxiv.org/abs/2502.12964)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. Prior research has associated hallucinations with model uncertainty, leveraging this relationship for hallucination detection and mitigation. In this paper, we challenge the underlying assumption that all hallucinations are associated with uncertainty. Using knowledge detection and uncertainty measurement methods, we demonstrate that models can hallucinate with high certainty even when they have the correct knowledge. We further show that high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and challenge existing mitigation methods. Our findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at this https URL.

Title: Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Authors: Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12970
Pdf URL: https://arxiv.org/pdf/2502.12970
Copy Paste: [[2502.12970]] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking(https://arxiv.org/abs/2502.12970)
Keywords: language model, llm
Abstract: The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable advancement and exceptional performance across diverse domains. However, leveraging these reasoning capabilities to enhance LLM safety against adversarial attacks and jailbreak queries remains largely unexplored. To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLMs' generation process, unlocking a safety-aware reasoning mechanism. This approach enables self-evaluation at each reasoning step to create safety pivot tokens as indicators of the response's safety status. Furthermore, in order to improve the learning efficiency of pivot token prediction, we propose Contrastive Pivot Optimization (CPO), which enhances the model's ability to perceive the safety status of dialogues. Through this mechanism, LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their defense capabilities against jailbreak attacks. Extensive experimental results demonstrate that R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.

Title: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12982
Pdf URL: https://arxiv.org/pdf/2502.12982
Copy Paste: [[2502.12982]] Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs(https://arxiv.org/abs/2502.12982)
Keywords: language model, gpt, llm
Abstract: Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that the Sailor2 models (Apache 2.0 license) will drive language development in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.

Title: Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs

Authors: Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.12988
Pdf URL: https://arxiv.org/pdf/2502.12988
Copy Paste: [[2502.12988]] Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs(https://arxiv.org/abs/2502.12988)
Keywords: language model, llm
Abstract: Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information, or using limited role-play dialogue datasets to capture a character's responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun's internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope that this work inspires future research on deep character persona simulation with LLMs.

Title: B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12992
Pdf URL: https://arxiv.org/pdf/2502.12992
Copy Paste: [[2502.12992]] B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability(https://arxiv.org/abs/2502.12992)
Keywords: language model
Abstract: Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural models. Meanwhile, B-cos networks have been introduced to improve model explainability through architectural and computational adaptations, but their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous B-cos methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we provide practical guidelines for effectively building B-cos LMs based on our findings. Our code is available at this https URL.

Title: Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge

Authors: Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2502.13010
Pdf URL: https://arxiv.org/pdf/2502.13010
Copy Paste: [[2502.13010]] Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge(https://arxiv.org/abs/2502.13010)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence from sources such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.

Title: Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation

Authors: Sha Li, Naren Ramakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13019
Pdf URL: https://arxiv.org/pdf/2502.13019
Copy Paste: [[2502.13019]] Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation(https://arxiv.org/abs/2502.13019)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Despite the remarkable capabilities of Large Language Models (LLMs) in various NLP tasks, they remain vulnerable to hallucinations due to their limited parametric knowledge and lack of domain-specific expertise. Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating external document retrieval to augment the knowledge base of LLMs. In this approach, RAG retrieves document chunks from an external corpus in response to a query, which are then used as context for the downstream language model to generate an answer. However, these retrieved knowledge sources often include irrelevant or erroneous information, undermining the effectiveness of RAG in downstream tasks. To overcome this limitation, we introduce a compact, efficient, and pluggable module designed to refine external knowledge sources before feeding them to the generator. The module reconstructs retrieved content by extracting the most relevant and supportive information and reorganising it into a concise, query-specific format. Through a three-stage training paradigm - comprising supervised fine-tuning, contrastive multi-task learning, and reinforcement learning-based alignment - it prioritises critical knowledge and aligns it with the generator's preferences. This method enables LLMs to produce outputs that are more accurate, reliable, and contextually appropriate.

Title: HPSS: Heuristic Prompting Strategy Search for LLM Evaluators

Authors: Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13031
Pdf URL: https://arxiv.org/pdf/2502.13031
Copy Paste: [[2502.13031]] HPSS: Heuristic Prompting Strategy Search for LLM Evaluators(https://arxiv.org/abs/2502.13031)
Keywords: language model, llm, prompt
Abstract: Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods.

Title: Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction

Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13044
Pdf URL: https://arxiv.org/pdf/2502.13044
Copy Paste: [[2502.13044]] Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction(https://arxiv.org/abs/2502.13044)
Keywords: language model, llm, prompt
Abstract: Aspect sentiment quadruple prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores slightly below those obtained with state-of-the-art fine-tuned models but exceeding previously reported zero- and few-shot performance. In the 40-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.

Title: AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks

Authors: Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, Shengyu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13053
Pdf URL: https://arxiv.org/pdf/2502.13053
Copy Paste: [[2502.13053]] AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks(https://arxiv.org/abs/2502.13053)
Keywords: llm, agent
Abstract: As researchers continuously optimize AI agents to perform tasks more effectively within operating systems, they often neglect the critical need to enable these agents to identify "impostors" within the system. Through an analysis of the agents' operating environment, we identified a potential threat: attackers can disguise their attack methods as environmental elements, injecting active disturbances into the agents' execution process, thereby disrupting their decision-making. We define this type of attack as Active Environment Injection Attack (AEIA). Based on this, we propose AEIA-MN, an active environment injection attack scheme that exploits interaction vulnerabilities in the mobile operating system to evaluate the robustness of MLLM-based agents against such threats. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, which achieves a maximum attack success rate of 93% in the AndroidWorld benchmark.

Title: SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Authors: Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13059
Pdf URL: https://arxiv.org/pdf/2502.13059
Copy Paste: [[2502.13059]] SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models(https://arxiv.org/abs/2502.13059)
Keywords: language model, llm
Abstract: The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality of MLLMs when answering short natural language questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.

Title: Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection

Authors: Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13061
Pdf URL: https://arxiv.org/pdf/2502.13061
Copy Paste: [[2502.13061]] Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection(https://arxiv.org/abs/2502.13061)
Keywords: gpt, agent
Abstract: Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While large multimodal models have shown strong generalization across various tasks, they exhibit poor generalization to hateful meme detection due to the dynamic nature of memes tied to emerging social trends and breaking news. Recent work further highlights the limitations of conventional supervised fine-tuning for large multimodal models in this context. To address these challenges, we propose Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization. Experimental results on six widely used meme classification datasets demonstrate that LMM-RGCL achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B. Furthermore, our method effectively generalizes to out-of-domain memes under low-resource settings, surpassing models like GPT-4o.

Title: Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13063
Pdf URL: https://arxiv.org/pdf/2502.13063
Copy Paste: [[2502.13063]] Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity(https://arxiv.org/abs/2502.13063)
Keywords: language model
Abstract: A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
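The capacity argument in this abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not the authors' method: it simply divides the raw bit budget of one vector by the bits needed to name one token, assuming every token is drawn uniformly from the vocabulary. The function name and the example values (a 4096-dimensional vector and a vocabulary of 128,256 tokens) are hypothetical and chosen only for illustration.

```python
import math


def token_capacity(dim: int, bits_per_value: int, vocab_size: int) -> float:
    """Naive upper bound on how many tokens one vector could encode.

    Assumption (not from the paper): each token costs log2(vocab_size)
    bits, i.e. tokens are treated as uniformly random symbols.
    """
    total_bits = dim * bits_per_value        # raw storage in the vector
    bits_per_token = math.log2(vocab_size)   # cost of one uniform token
    return total_bits / bits_per_token


# Hypothetical example: a 4096-dimensional vector at 16-bit precision.
capacity = token_capacity(dim=4096, bits_per_value=16, vocab_size=128_256)
print(f"naive capacity: {capacity:.0f} tokens")
```

Natural text is far more predictable than uniformly random tokens, so the bits actually required per token are lower than this bound assumes. That is consistent with the abstract's observation that the limit tracks the cross-entropy of the sequence rather than its length.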

Title: KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits

Authors: Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13076
Pdf URL: https://arxiv.org/pdf/2502.13076
Copy Paste: [[2502.13076]] KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits(https://arxiv.org/abs/2502.13076)
Keywords: language model, prompt
Abstract: Patent analysis relies heavily on concise and interpretable document representations, referred to as patent portraits. Keyphrases, both present and absent, are ideal candidates for patent portraits due to their brevity, representativeness, and clarity. In this paper, we introduce KAPPA, an integrated framework designed to construct keyphrase-based patent portraits and enhance patent analysis. KAPPA operates in two phases: patent portrait construction and portrait-based analysis. To ensure effective portrait construction, we propose a semantic-calibrated keyphrase generation paradigm that integrates pre-trained language models with a prompt-based hierarchical decoding strategy to leverage the multi-level structural characteristics of patents. For portrait-based analysis, we develop a comprehensive framework that employs keyphrase-based patent portraits to enable efficient and accurate patent analysis. In extensive experiments on keyphrase generation benchmark datasets, the proposed model achieves significant improvements compared to state-of-the-art baselines. Further experiments conducted on real-world patent applications demonstrate that our keyphrase-based portraits effectively capture domain-specific knowledge and enrich semantic representation for patent analysis tasks.

Title: Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

Authors: Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13092
Pdf URL: https://arxiv.org/pdf/2502.13092
Copy Paste: [[2502.13092]] Text2World: Benchmarking Large Language Models for Symbolic World Model Generation(https://arxiv.org/abs/2502.13092)
Keywords: language model, llm, agent
Abstract: Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at this https URL.

Title: STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Authors: Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13119
Pdf URL: https://arxiv.org/pdf/2502.13119
Copy Paste: [[2502.13119]] STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models(https://arxiv.org/abs/2502.13119)
Keywords: language model, llm, prompt
Abstract: How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into 58 distinct elements, focusing on the logic of supply and demand, each grounded in up to 10 distinct domains, 5 perspectives, and 3 types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on 27 LLMs, ranging from small open-source models to the current state of the art. We examined each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.

Title: Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context

Authors: Marion Bartl, Thomas Brendan Murphy, Susan Leavy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.13120
Pdf URL: https://arxiv.org/pdf/2502.13120
Copy Paste: [[2502.13120]] Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context(https://arxiv.org/abs/2502.13120)
Keywords: language model, llm
Abstract: Gender-inclusive language is often used with the aim of ensuring that all individuals, regardless of gender, can be associated with certain concepts. While psycholinguistic studies have examined its effects in relation to human cognition, it remains unclear how Large Language Models (LLMs) process gender-inclusive language. Given that commercial LLMs are gaining an increasingly strong foothold in everyday applications, it is crucial to examine whether LLMs in fact interpret gender-inclusive language neutrally, because the language they generate has the potential to influence the language of their users. This study examines whether LLM-generated coreferent terms align with a given gender expression or reflect model biases. Adapting psycholinguistic methods from French to English and German, we find that in English, LLMs generally maintain the antecedent's gender but exhibit underlying masculine bias. In German, this bias is much stronger, overriding all tested gender-neutralization strategies.

Title: RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises

Authors: Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13125
Pdf URL: https://arxiv.org/pdf/2502.13125
Copy Paste: [[2502.13125]] RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises(https://arxiv.org/abs/2502.13125)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 series on RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses of evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.

Title: Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

Authors: Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.13127
Pdf URL: https://arxiv.org/pdf/2502.13127
Copy Paste: [[2502.13127]] Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning(https://arxiv.org/abs/2502.13127)
Keywords: language model, gpt, llm, long context, chain-of-thought, agent
Abstract: Recent advances in Large Language Models (LLMs) have enabled them to process increasingly long sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-driven Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI's reasoning capabilities by assessing GPT-4o-mini with PAI on the Loong benchmark, where it outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's financial subset.

Title: UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models

Authors: Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.13141
Pdf URL: https://arxiv.org/pdf/2502.13141
Copy Paste: [[2502.13141]] UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models(https://arxiv.org/abs/2502.13141)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are vulnerable to attacks like prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: Can we determine if a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.