2025-06-18

Title: ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries

Authors: Zhou Chen, Xiao Wang, Yuanhong Liao, Ming Lin, Yuqi Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13796
Pdf URL: https://arxiv.org/pdf/2506.13796
Copy Paste: [[2506.13796]] ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries(https://arxiv.org/abs/2506.13796)
Keywords: language model, llm, chat
Abstract: As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.

Title: Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles

Authors: Antara Raaghavi Bhattacharya, Isabel Papadimitriou, Kathryn Davidson, David Alvarez-Melis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13886
Pdf URL: https://arxiv.org/pdf/2506.13886
Copy Paste: [[2506.13886]] Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles(https://arxiv.org/abs/2506.13886)
Keywords: language model, llm
Abstract: Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc, as in "twenty + three"). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.

Title: VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training

Authors: Jipeng Zhang, Kehao Miao, Renjie Pi, Zhaowei Wang, Runtao Liu, Rui Pan, Tong Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.13888
Pdf URL: https://arxiv.org/pdf/2506.13888
Copy Paste: [[2506.13888]] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training(https://arxiv.org/abs/2506.13888)
Keywords: language model, hallucination, chain-of-thought
Abstract: Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.

Title: EmoNews: A Spoken Dialogue System for Expressive News Conversations

Authors: Ryuki Matsuura, Shikhar Bharadwaj, Jiarui Liu, Dhatchi Kunde Govindarajan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13894
Pdf URL: https://arxiv.org/pdf/2506.13894
Copy Paste: [[2506.13894]] EmoNews: A Spoken Dialogue System for Expressive News Conversations(https://arxiv.org/abs/2506.13894)
Keywords: language model, llm, prompt
Abstract: We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at this https URL

Title: Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Authors: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.13901
Pdf URL: https://arxiv.org/pdf/2506.13901
Copy Paste: [[2506.13901]] Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations(https://arxiv.org/abs/2506.13901)
Keywords: language model, llm, prompt
Abstract: Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
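To make the clustering idea concrete, here is a minimal sketch of one of the indices the abstract names, the Davies-Bouldin score, computed for two groups of activation vectors. It is not the paper's AQI formula: how AQI combines the indices and pools layers is not stated in the abstract, and the 2-D points below are made-up stand-ins for real "safe" and "unsafe" activations. Lower scores mean tighter, better-separated clusters.

```python
import math


def centroid(points):
    """Mean vector of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin_two_clusters(safe, unsafe):
    """Davies-Bouldin score for exactly two clusters.

    With two clusters the score reduces to
    (scatter_safe + scatter_unsafe) / distance_between_centroids,
    where scatter is the mean distance of points to their own centroid.
    """
    c_safe, c_unsafe = centroid(safe), centroid(unsafe)
    scatter_safe = sum(math.dist(p, c_safe) for p in safe) / len(safe)
    scatter_unsafe = sum(math.dist(p, c_unsafe) for p in unsafe) / len(unsafe)
    return (scatter_safe + scatter_unsafe) / math.dist(c_safe, c_unsafe)


# Hypothetical toy activations (assumption: 2-D for readability).
safe_acts = [(0.0, 0.0), (0.0, 2.0)]
unsafe_far = [(10.0, 0.0), (10.0, 2.0)]   # well separated from safe_acts
unsafe_near = [(1.0, 0.0), (1.0, 2.0)]    # almost overlapping safe_acts

separated = davies_bouldin_two_clusters(safe_acts, unsafe_far)
overlapping = davies_bouldin_two_clusters(safe_acts, unsafe_near)
print(separated, overlapping)
```

In this toy setup the well-separated pair scores lower than the overlapping pair, which is the kind of signal a latent-geometry diagnostic would look for even when the generated text appears compliant.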

Title: ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection

Authors: Shang-Chi Tsai, Seiya Kawano, Angel Garcia Contreras, Koichiro Yoshino, Yun-Nung Chen
Subjects: cs.CL, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.13956
Pdf URL: https://arxiv.org/pdf/2506.13956
Copy Paste: [[2506.13956]] ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection(https://arxiv.org/abs/2506.13956)
Keywords: language model
Abstract: When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot's action selection capabilities, achieving state-of-the-art performance.

Title: Are manual annotations necessary for statutory interpretations retrieval?

Authors: Aleksander Smywiński-Pohl, Tomer Libal, Adam Kaczmarczyk, Magdalena Król
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13965
Pdf URL: https://arxiv.org/pdf/2506.13965
Copy Paste: [[2506.13965]] Are manual annotations necessary for statutory interpretations retrieval?(https://arxiv.org/abs/2506.13965)
Keywords: language model, llm, prompt
Abstract: One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allows legal professionals to use such interpretations as precedents, and laymen to better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and needs to be repeated for each such concept, which prompted recent research in trying to automate this process. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First of all, we check what the optimal number of annotations per legal concept is. Second, we check whether we can draw the sentences for annotation randomly or whether there is a gain in the performance of the model when only the best candidates are annotated. As the last question, we check what the outcome of automating the annotation process with the help of an LLM is.

Title: AI shares emotion with humans across languages and cultures

Authors: Xiuwen Wu, Hao Wang, Zhiang Yan, Xiaohan Tang, Pengfei Xu, Wai-Ting Siok, Ping Li, Jia-Hong Gao, Bingjiang Lyu, Lang Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.13978
Pdf URL: https://arxiv.org/pdf/2506.13978
Copy Paste: [[2506.13978]] AI shares emotion with humans across languages and cultures(https://arxiv.org/abs/2506.13978)
Keywords: language model, llm
Abstract: Effective and safe human-machine collaboration requires the regulated and meaningful exchange of emotions between humans and artificial intelligence (AI). Current AI systems based on large language models (LLMs) can provide feedback that makes people feel heard. Yet it remains unclear whether LLMs represent emotion in language as humans do, or whether and how the emotional tone of their output can be controlled. We assess human-AI emotional alignment across linguistic-cultural groups and model-families, using interpretable LLM features translated from concept-sets for over twenty nuanced emotion categories (including six basic emotions). Our analyses reveal that LLM-derived emotion spaces are structurally congruent with human perception, underpinned by the fundamental affective dimensions of valence and arousal. Furthermore, these emotion-related features also accurately predict large-scale behavioural data on word ratings along these two core dimensions, reflecting both universal and language-specific patterns. Finally, by leveraging steering vectors derived solely from human-centric emotion concepts, we show that model expressions can be stably and naturally modulated across distinct emotion categories, which provides causal evidence that human emotion concepts can be used to systematically induce LLMs to produce corresponding affective states when conveying content. These findings suggest AI not only shares emotional representations with humans but its affective outputs can be precisely guided using psychologically grounded emotion concepts.

Title: Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14012
Pdf URL: https://arxiv.org/pdf/2506.14012
Copy Paste: [[2506.14012]] Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text(https://arxiv.org/abs/2506.14012)
Keywords: language model, llm, prompt
Abstract: Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. While degradation is evident when foreign tokens disrupt English text (even under linguistic constraints), embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.

Title: MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

Authors: Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14028
Pdf URL: https://arxiv.org/pdf/2506.14028
Copy Paste: [[2506.14028]] MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation(https://arxiv.org/abs/2506.14028)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.

Title: Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications

Authors: David Kogan, Max Schumacher, Sam Nguyen, Masanori Suzuki, Melissa Smith, Chloe Sophia Bellows, Jared Bernstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14046
Pdf URL: https://arxiv.org/pdf/2506.14046
Copy Paste: [[2506.14046]] Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications(https://arxiv.org/abs/2506.14046)
Keywords: language model, llm
Abstract: There is an unmet need to evaluate the language difficulty of short, conversational passages of text, particularly for training and filtering Large Language Models (LLMs). We introduce Ace-CEFR, a dataset of English conversational text passages expert-annotated with their corresponding level of text difficulty. We experiment with several models on Ace-CEFR, including Transformer-based models and LLMs. We show that models trained on Ace-CEFR can measure text difficulty more accurately than human experts and have latency appropriate to production environments. Finally, we release the Ace-CEFR dataset to the public for research and development.

Title: Abstract Meaning Representation for Hospital Discharge Summarization

Authors: Paul Landes, Sitara Rao, Aaron Jeremy Chaise, Barbara Di Eugenio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14101
Pdf URL: https://arxiv.org/pdf/2506.14101
Copy Paste: [[2506.14101]] Abstract Meaning Representation for Hospital Discharge Summarization(https://arxiv.org/abs/2506.14101)
Keywords: language model, llm, hallucination
Abstract: The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. This is particularly important with regard to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit). Automatically generating these summaries would free physicians to care for patients and reduce documentation burden. The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization. Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital. We provide our method, generated discharge summary output examples, source code and trained models.

Title: Essential-Web v1.0: 24T tokens of organized web data

Authors: Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.14111
Pdf URL: https://arxiv.org/pdf/2506.14111
Copy Paste: [[2506.14111]] Essential-Web v1.0: 24T tokens of organized web data(https://arxiv.org/abs/2506.14111)
Keywords: language model
Abstract: Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: this https URL

Title: Sampling from Your Language Model One Byte at a Time

Authors: Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Subjects: cs.CL, cs.FL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.14123
Pdf URL: https://arxiv.org/pdf/2506.14123
Copy Paste: [[2506.14123]] Sampling from Your Language Model One Byte at a Time(https://arxiv.org/abs/2506.14123)
Keywords: language model, prompt
Abstract: Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally, mismatched tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatched vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.
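The Prompt Boundary Problem can be illustrated with a toy example. The sketch below uses a greedy longest-match tokenizer over a four-entry vocabulary as a stand-in for a real BPE tokenizer (both the tokenizer and the vocabulary are assumptions made for illustration, not the paper's method). It shows that a prompt ending in a space is tokenized into a sequence that is not a prefix of the tokenization of the full text, so the model would be asked to continue from a token sequence it rarely saw during training.

```python
# Hypothetical toy vocabulary; real BPE vocabularies have tens of thousands of entries.
VOCAB = {"hello", " world", "world", " "}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer (a simplified stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def is_token_prefix(prefix, full):
    """True if the token list `prefix` is a prefix of the token list `full`."""
    return full[:len(prefix)] == prefix


full_tokens = tokenize("hello world")      # ['hello', ' world']
clean_prompt = tokenize("hello")           # ['hello']
spaced_prompt = tokenize("hello ")         # ['hello', ' ']

print(is_token_prefix(clean_prompt, full_tokens))   # prompt without trailing space
print(is_token_prefix(spaced_prompt, full_tokens))  # prompt with trailing space
```

Because the space in `"hello "` is consumed as its own token, the natural next token `" world"` can no longer be produced, which is why a byte-level or character-level view of the same model sidesteps the issue.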

Title: DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization

Authors: Chengyu Huang, Tanya Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14157
Pdf URL: https://arxiv.org/pdf/2506.14157
Copy Paste: [[2506.14157]] DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization(https://arxiv.org/abs/2506.14157)
Keywords: llm
Abstract: Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.
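As a rough illustration of pairing by a distance-calibrated margin, the sketch below scores every ordered pair among N candidate responses and keeps the best one. The abstract does not give the DCRM formula, so everything here is an assumption: `toy_dcrm` divides the reward margin by a word-level Jaccard distance, and the rewards are made-up numbers standing in for a reward model's output. It is only meant to show the intuition of preferring pairs with a large desired difference and few incidental differences.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """Word-level Jaccard distance (an assumed, simple text distance)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def toy_dcrm(preferred, dispreferred, rewards, eps=1e-6):
    """Hypothetical stand-in score: reward margin per unit of distance."""
    margin = rewards[preferred] - rewards[dispreferred]
    return margin / (jaccard_distance(preferred, dispreferred) + eps)


def best_pair(rewards):
    """Search all ordered pairs of distinct responses and return the
    (preferred, dispreferred) pair with the highest toy score."""
    candidates = [
        (p, n) for p, n in permutations(rewards, 2) if rewards[p] > rewards[n]
    ]
    return max(candidates, key=lambda pair: toy_dcrm(pair[0], pair[1], rewards))


# Made-up responses and rewards for a single prompt.
rewards = {
    "the cat sat on the mat": 0.9,
    "the cat sat on the rug": 0.3,
    "dogs bark loudly at night": 0.1,
}

chosen, rejected = best_pair(rewards)
print(chosen, "|", rejected)
```

Here the pair of near-identical sentences wins over the pair with the largest raw reward gap, because the unrelated sentence differs from the preferred one in many ways that have nothing to do with quality.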

Title: S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

Authors: Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14158
Pdf URL: https://arxiv.org/pdf/2506.14158
Copy Paste: [[2506.14158]] S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models(https://arxiv.org/abs/2506.14158)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
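
The draft-then-verify loop behind speculative sampling can be sketched as follows. This is a generic greedy variant, assuming deterministic toy "models" (`toy_draft`, `toy_target`, both hypothetical); it does not implement S$^4$C's multi-head drafting or continuous verification tree, and a real system would verify all drafted tokens in one parallel forward pass rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int = 4) -> List[int]:
    """One greedy draft-then-verify step; returns the newly accepted tokens."""
    # Drafting phase: the cheap model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # Verification phase: keep the longest prefix the target model agrees with.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    # Every drafted token was accepted, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


new_tokens = speculative_step([0], toy_draft, toy_target, k=4)  # [1, 2, 3, 4]
```

In this greedy form the output matches what the target model alone would produce; the saving comes from accepting several tokens per verification step.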

Title: MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind

Authors: Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, Yupeng Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14161
Pdf URL: https://arxiv.org/pdf/2506.14161
Copy Paste: [[2506.14161]] MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind(https://arxiv.org/abs/2506.14161)
Keywords: language model, llm
Abstract: Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

Title: GRAM: A Generative Foundation Reward Model for Reward Generalization

Authors: Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14175
Pdf URL: https://arxiv.org/pdf/2506.14175
Copy Paste: [[2506.14175]] GRAM: A Generative Foundation Reward Model for Reward Generalization(https://arxiv.org/abs/2506.14175)
Keywords: language model, llm
Abstract: In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
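
One way to see why label smoothing acts as a regularizer on a pairwise ranking loss is the standard identity below, checked numerically. It assumes a Bradley-Terry style loss on the reward margin and a smoothing weight `eps`; the function names are illustrative and the paper's exact derivation may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def smoothed_pairwise_loss(margin: float, eps: float) -> float:
    """Pairwise loss with label smoothing: the preference label is flipped with weight eps."""
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


def regularized_pairwise_loss(margin: float, eps: float) -> float:
    """Plain pairwise loss plus a linear penalty on the reward margin."""
    return -log_sigmoid(margin) + eps * margin


# Because log sigmoid(m) - log sigmoid(-m) = m, the two forms coincide.
max_gap = max(
    abs(smoothed_pairwise_loss(m, 0.1) - regularized_pairwise_loss(m, 0.1))
    for m in (-3.0, -0.5, 0.0, 1.2, 4.0)
)
```

The penalty term `eps * margin` discourages the reward margin from growing without bound, which is the sense in which the smoothed objective is a regularized ranking loss in this sketch.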

Title: MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment

Authors: Junghwan Kim, Kieun Park, Sohee Park, Hyunggug Kim, Bongwon Suh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14199
Pdf URL: https://arxiv.org/pdf/2506.14199
Copy Paste: [[2506.14199]] MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment(https://arxiv.org/abs/2506.14199)
Keywords: language model, llm, agent
Abstract: Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur's Court, generated by various LLMs, and compared it to traditional metrics. MAS-LitEval outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.

Title: ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations

Authors: Brihi Joshi, Keyu He, Sahana Ramnath, Sadra Sabouri, Kaitlyn Zhou, Souti Chattopadhyay, Swabha Swayamdipta, Xiang Ren
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.14200
Pdf URL: https://arxiv.org/pdf/2506.14200
Copy Paste: [[2506.14200]] ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations(https://arxiv.org/abs/2506.14200)
Keywords: language model, gpt
Abstract: Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.

Title: Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation

Authors: Jongho Kim, Romain Storaï, Seung-won Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14203
Pdf URL: https://arxiv.org/pdf/2506.14203
Copy Paste: [[2506.14203]] Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation(https://arxiv.org/abs/2506.14203)
Keywords: language model
Abstract: In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from a patient's circumlocution involves the two challenges of term failure and error: (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is the presence of terms perturbed by semantic paraphasia, which are not exactly related to the target item and hinder the identification process. To address each, we propose robustifying the model against semantically paraphasic errors and enhancing the model with unseen terms through gradient-based selective augmentation. Specifically, the gradient value controls augmented data quality amid semantic errors, while the gradient variance guides the inclusion of unseen but relevant terms. Due to limited domain-specific datasets, we evaluate the model on the Tip-of-the-Tongue dataset as an intermediary task and then apply our findings to real patient data from AphasiaBank. Our results demonstrate strong performance against baselines, aiding anomia patients by addressing the outlined challenges.

Title: AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

Authors: Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14205
Pdf URL: https://arxiv.org/pdf/2506.14205
Copy Paste: [[2506.14205]] AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents(https://arxiv.org/abs/2506.14205)
Keywords: llm, agent
Abstract: We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at this https URL

Title: Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation

Authors: Sina Abdidizaji, Md Kowsher, Niloofar Yousefi, Ivan Garibay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14211
Pdf URL: https://arxiv.org/pdf/2506.14211
Copy Paste: [[2506.14211]] Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation(https://arxiv.org/abs/2506.14211)
Keywords: language model
Abstract: In the era of digitalization, as individuals increasingly rely on digital platforms for communication and news consumption, various actors employ linguistic strategies to influence public perception. While models have become proficient at detecting explicit patterns, which typically appear in texts as single remarks referred to as utterances, such as social media posts, malicious actors have shifted toward utilizing implicit influential verbal patterns embedded within conversations. These verbal patterns aim to mentally penetrate the victim's mind in order to influence them, enabling the actor to obtain the desired information through implicit means. This paper presents an improved approach for detecting such implicit influential patterns. Furthermore, the proposed model is capable of identifying the specific locations of these influential elements within a conversation. To achieve this, the existing dataset was augmented using the reasoning capabilities of state-of-the-art language models. Our designed framework resulted in a 6% improvement in the detection of implicit influential patterns in conversations. Moreover, this approach improved the multi-label classification tasks related to both the techniques used for influence and the vulnerability of victims by 33% and 43%, respectively.

Title: Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team

Authors: Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14234
Pdf URL: https://arxiv.org/pdf/2506.14234
Copy Paste: [[2506.14234]] Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team(https://arxiv.org/abs/2506.14234)
Keywords: language model, llm, agent
Abstract: Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at this https URL.

Title: Re-Initialization Token Learning for Tool-Augmented Large Language Models

Authors: Chenghao Li, Liu Liu, Baosheng Yu, Jiayan Qiu, Yibing Zhan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14248
Pdf URL: https://arxiv.org/pdf/2506.14248
Copy Paste: [[2506.14248]] Re-Initialization Token Learning for Tool-Augmented Large Language Models(https://arxiv.org/abs/2506.14248)
Keywords: language model, gpt, llm
Abstract: Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning and plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction, similar to word generation. However, this approach fails to account for the relationship between tool and word tokens, limiting adaptability within pre-trained LLMs. To address this issue, we propose a novel token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization, thereby enhancing model performance. We begin by constructing prior token embeddings for each tool based on the tool's name or description, which are used to initialize and regularize the learnable tool token embeddings. This ensures the learned embeddings are well-aligned with the word token space, improving tool call accuracy. We evaluate the method on tasks such as numerical reasoning, knowledge-based question answering, and embodied plan generation using GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets. The results demonstrate clear improvements over recent baselines, including CoT, REACT, ICL, and ToolkenGPT, indicating that our approach effectively augments LLMs with tools through relevant tokens across diverse domains.
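
The initialize-and-regularize idea can be sketched in a few lines. The abstract does not say how the prior embedding is built from a tool's name or description, so averaging word vectors and using a squared L2 penalty are assumptions made here for illustration; `prior_embedding`, `l2_penalty`, and the tiny `word_vectors` table are all hypothetical.

```python
from typing import Dict, List, Sequence


def prior_embedding(words: Sequence[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Assumed prior: the mean of the word vectors of the tool's name or description."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        raise ValueError("no known words to build a prior from")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def l2_penalty(tool_vec: Sequence[float], prior_vec: Sequence[float], weight: float) -> float:
    """Assumed regularizer: weight * squared distance between tool embedding and prior."""
    return weight * sum((t - p) ** 2 for t, p in zip(tool_vec, prior_vec))


# Toy 2-dimensional word embedding table (made-up values).
word_vectors = {"add": [1.0, 0.0], "numbers": [0.0, 1.0]}

prior = prior_embedding(["add", "numbers"], word_vectors)
tool_embedding = list(prior)  # initialization: start the learnable embedding at the prior
penalty_at_init = l2_penalty(tool_embedding, prior, weight=0.1)

drifted = [1.5, 0.5]  # pretend training moved the embedding away from the prior
penalty_after_drift = l2_penalty(drifted, prior, weight=0.1)
```

During training, a penalty of this kind would be added to the task loss so that the tool embedding stays close to the word embedding space it was initialized from.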

Title: From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents

Authors: Seongbo Jang, Minjin Jeon, Jaehoon Lee, Seonghyeon Lee, Dongha Lee, Hwanjo Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14285
Pdf URL: https://arxiv.org/pdf/2506.14285
Copy Paste: [[2506.14285]] From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents(https://arxiv.org/abs/2506.14285)
Keywords: language model, llm, prompt, chat, agent
Abstract: While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored. To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses. Additionally, we construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event-driven dialogues. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals. Experimental results show that Timer outperforms prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. We publicly release our data, model, and code.

Title: Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

Authors: Xueyang Feng, Jingsen Zhang, Jiakai Tang, Wei Li, Guohao Cai, Xu Chen, Quanyu Dai, Yue Zhu, Zhenhua Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14302
Pdf URL: https://arxiv.org/pdf/2506.14302
Copy Paste: [[2506.14302]] Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent(https://arxiv.org/abs/2506.14302)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm, ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations. Experimental results show that ECPO significantly enhances CRA's interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods.

Title: Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics

Authors: Silvia Casola, Yang Janet Liu, Siyao Peng, Oliver Kraus, Albert Gatt, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14335
Pdf URL: https://arxiv.org/pdf/2506.14335
Copy Paste: [[2506.14335]] Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics(https://arxiv.org/abs/2506.14335)
Keywords: llm
Abstract: Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
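
A toy example shows how the choice of reference can flip a system ranking under an n-gram overlap metric. The function below is a simplified unigram recall in the spirit of ROUGE-1, not the official ROUGE implementation, and the sentences are invented for illustration.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams (with multiplicity) covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values())


system_a = "the cat sat on the mat"
system_b = "a feline rested on a rug"

reference_1 = "the cat sat on a mat"
reference_2 = "a feline rested on the rug"

# Scores against the first reference: system A wins (5/6 vs 2/6).
a1 = unigram_recall(system_a, reference_1)
b1 = unigram_recall(system_b, reference_1)

# Scores against the second reference: system B wins (2/6 vs 5/6).
a2 = unigram_recall(system_a, reference_2)
b2 = unigram_recall(system_b, reference_2)
```

Both references describe the same scene, yet each one rewards whichever system happens to share its wording, which is the kind of instability the abstract reports for lexical overlap metrics.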

Title: A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis

Authors: Bruno Martins, Piotr Szymański, Piotr Gramacki
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.14345
Pdf URL: https://arxiv.org/pdf/2506.14345
Copy Paste: [[2506.14345]] A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis(https://arxiv.org/abs/2506.14345)
Keywords: language model, llm
Abstract: The emergence of Large Language Models (LLMs) has transformed information access, with current LLMs also powering deep research systems that can generate comprehensive report-style answers, through planned iterative search, retrieval, and reasoning. Still, current deep research systems lack the geo-temporal capabilities that are essential for answering context-rich questions involving geographic and/or temporal constraints, frequently occurring in domains like public health, environmental science, or socio-economic analysis. This paper reports our vision towards next generation systems, identifying important technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. We argue for augmenting retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Our vision outlines a path towards more advanced and geo-temporally aware deep research systems, of potential impact to the future of AI-driven information access.

Title: ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection

Authors: Lucile Favero, Daniel Frases, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.14371
Pdf URL: https://arxiv.org/pdf/2506.14371
Copy Paste: [[2506.14371]] ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection(https://arxiv.org/abs/2506.14371)
Keywords: language model, llm, chat
Abstract: The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

Title: Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding

Authors: Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14397
Pdf URL: https://arxiv.org/pdf/2506.14397
Copy Paste: [[2506.14397]] Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding(https://arxiv.org/abs/2506.14397)
Keywords: language model, llm
Abstract: Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models' negation understanding.

Title: ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge

Authors: Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14407
Pdf URL: https://arxiv.org/pdf/2506.14407
Copy Paste: [[2506.14407]] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge(https://arxiv.org/abs/2506.14407)
Keywords: gpt, prompt
Abstract: Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques -- like prompting or multi-hop retrieval -- that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our codes are available at this http URL.
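For readers unfamiliar with the nDCG@10 metric quoted above, the following is a minimal sketch of how it is commonly computed. It assumes linear gain, and it assumes the ranked list passed in contains every relevant document, since the ideal ordering is derived from that same list. The function names are illustrative and not taken from the ImpliRet code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

With a single positive document per query, the score reduces to 1 / log2(rank + 1) when the positive lands in the top 10 and to 0 otherwise, so a low average means positives are usually ranked deep or missed entirely.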

Title: LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs

Authors: Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14429
Pdf URL: https://arxiv.org/pdf/2506.14429
Copy Paste: [[2506.14429]] LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs(https://arxiv.org/abs/2506.14429)
Keywords: llm, long context
Abstract: Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
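The NTK-based RoPE extrapolation mentioned above is usually implemented by enlarging the rotary base instead of rescaling positions. The sketch below shows the commonly used NTK-aware rule, base' = base * s^(d / (d - 2)). The default base of 10000 and any particular head dimension are assumptions for illustration, not LongLLaDA's reported configuration.

```python
def rope_inv_freqs(head_dim, base=10000.0):
    """Per-pair rotary frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base adjustment for a context-extension factor `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))
```

Under this rule the highest-frequency pair is left untouched and the lowest-frequency pair is slowed by exactly `scale`, so local position resolution is preserved while the longest wavelength stretches to cover the extended window.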

Title: How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

Authors: Jiayin Wang, Zhiquang Guo, Weizhi Ma, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14448
Pdf URL: https://arxiv.org/pdf/2506.14448
Copy Paste: [[2506.14448]] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison(https://arxiv.org/abs/2506.14448)
Keywords: language model, llm
Abstract: As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.

Title: LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data

Authors: Eyal German, Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.14474
Pdf URL: https://arxiv.org/pdf/2506.14474
Copy Paste: [[2506.14474]] LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data(https://arxiv.org/abs/2506.14474)
Keywords: language model, llm
Abstract: Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
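The core idea above, substituting synonyms for selected high-entropy words, can be sketched as follows. This is only an illustration: the per-token entropy scores are assumed to come from some language model, the `synonyms` table is a hypothetical lookup, and the top-k selection rule is an assumption, not the paper's exact criterion.

```python
def watermark(tokens, entropies, synonyms, k):
    """Replace the k highest-entropy tokens that have a synonym available.

    tokens:    list[str], the text split into words
    entropies: list[float], one (assumed) entropy score per token
    synonyms:  dict[str, str], hypothetical lowercase word -> substitute
    Returns the watermarked tokens and the sorted indices that were changed.
    """
    candidates = [i for i, t in enumerate(tokens) if t.lower() in synonyms]
    chosen = sorted(candidates, key=lambda i: entropies[i], reverse=True)[:k]
    out = list(tokens)
    for i in chosen:
        out[i] = synonyms[tokens[i].lower()]
    return out, sorted(chosen)
```

Recording the changed indices gives the data owner a key for later checking whether a model has memorized the substituted words.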

Title: LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops

Authors: Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo, Dingkang Yang, Zhaoyu Chen, Wenqiang Zhang
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2506.14493
Pdf URL: https://arxiv.org/pdf/2506.14493
Copy Paste: [[2506.14493]] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops(https://arxiv.org/abs/2506.14493)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant vulnerabilities in MLLMs, posing challenges for their reliable deployment. The code will be released publicly following the paper's acceptance.

Title: M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models

Authors: Can Zheng, Jiguang He, Chung G. Kang, Guofa Cai, Zitong Yu, Merouane Debbah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14532
Pdf URL: https://arxiv.org/pdf/2506.14532
Copy Paste: [[2506.14532]] M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models(https://arxiv.org/abs/2506.14532)
Keywords: language model, gpt, llm
Abstract: This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.

Title: AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Authors: Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.14562
Pdf URL: https://arxiv.org/pdf/2506.14562
Copy Paste: [[2506.14562]] AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs(https://arxiv.org/abs/2506.14562)
Keywords: language model, llm
Abstract: Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
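A minimal sketch of the module-wise assignment described above: estimate a power-law exponent (alpha) for each module's eigenvalue spectrum with a Hill estimator, then give heavier-tailed modules (smaller alpha) weaker decay. The Hill estimator is a standard choice in the HT-SR literature, but the linear mapping from alpha to decay strength and the function names here are assumptions for illustration, not AlphaDecay's exact rule.

```python
import math

def hill_alpha(eigenvalues, k):
    """Hill estimate of the power-law exponent from the k largest eigenvalues.

    Assumes positive eigenvalues and that the top k are not all tied
    with the (k+1)-th largest value.
    """
    lam = sorted(eigenvalues, reverse=True)
    if not 0 < k < len(lam):
        raise ValueError("k must satisfy 0 < k < len(eigenvalues)")
    threshold = lam[k]  # (k+1)-th largest eigenvalue
    return 1.0 + k / sum(math.log(x / threshold) for x in lam[:k])

def assign_decay(alphas, wd_min, wd_max):
    """Map each module's alpha linearly onto [wd_min, wd_max].

    Smallest alpha (heaviest tail) gets wd_min; largest alpha gets wd_max.
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    span = hi - lo
    if span == 0:
        return {name: (wd_min + wd_max) / 2 for name in alphas}
    return {name: wd_min + (a - lo) / span * (wd_max - wd_min)
            for name, a in alphas.items()}
```

In a real training loop the resulting dictionary would populate per-parameter-group weight decay values in the optimizer, recomputed periodically as the spectra evolve.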

Title: GenerationPrograms: Fine-grained Attribution with Executable Programs

Authors: David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14580
Pdf URL: https://arxiv.org/pdf/2506.14580
Copy Paste: [[2506.14580]] GenerationPrograms: Fine-grained Attribution with Executable Programs(https://arxiv.org/abs/2506.14580)
Keywords: language model, llm, agent
Abstract: Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.

Title: Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees

Authors: Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud
Subjects: cs.CL, cs.AR, cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2506.14606
Pdf URL: https://arxiv.org/pdf/2506.14606
Copy Paste: [[2506.14606]] Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees(https://arxiv.org/abs/2506.14606)
Keywords: language model, llm
Abstract: The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.

Title: When Does Meaning Backfire? Investigating the Role of AMRs in NLI

Authors: Junghyun Min, Xiulin Yang, Shira Wein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14613
Pdf URL: https://arxiv.org/pdf/2506.14613
Copy Paste: [[2506.14613]] When Does Meaning Backfire? Investigating the Role of AMRs in NLI(https://arxiv.org/abs/2506.14613)
Keywords: language model, gpt, prompt
Abstract: Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in GPT-4o. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.

Title: Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Authors: Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14625
Pdf URL: https://arxiv.org/pdf/2506.14625
Copy Paste: [[2506.14625]] Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models(https://arxiv.org/abs/2506.14625)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
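The two quantities named above, a reliability-weighted collective probability and the JS divergence of each model from that consensus, can be illustrated with a short sketch. It treats each moral acceptability score as a Bernoulli probability and uses a simple weighted mean. Both choices are assumptions for illustration, and the paper's actual aggregation model may differ.

```python
import math

def aggregate(scores, reliabilities):
    """Reliability-weighted mean of per-model acceptability scores in [0, 1]."""
    total = sum(reliabilities[m] for m in scores)
    return sum(scores[m] * reliabilities[m] for m in scores) / total

def js_divergence_bernoulli(p, q):
    """Jensen-Shannon divergence (in bits) between Bernoulli(p) and Bernoulli(q)."""
    def kl(a, b):
        # KL(Bernoulli(a) || Bernoulli(b)); terms with zero mass contribute 0.
        total = 0.0
        for x, y in ((a, b), (1.0 - a, 1.0 - b)):
            if x > 0:
                total += x * math.log2(x / y)
        return total
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A model whose divergence from the consensus exceeds some chosen threshold would then be a candidate for the realignment step, and the same divergence could serve as the quantity to minimize during that step.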

Title: AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

Authors: Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessika Daikeler
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.14634
Pdf URL: https://arxiv.org/pdf/2506.14634
Copy Paste: [[2506.14634]] AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation(https://arxiv.org/abs/2506.14634)
Keywords: language model, llm, prompt
Abstract: The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

Title: Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Authors: Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.14641
Pdf URL: https://arxiv.org/pdf/2506.14641
Copy Paste: [[2506.14641]] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot(https://arxiv.org/abs/2506.14641)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

Title: Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments

Authors: . Pazzaglia, V. Vendetti, L. D. Comencini, F. Deriu, V. Modugno
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.14645
Pdf URL: https://arxiv.org/pdf/2506.14645
Copy Paste: [[2506.14645]] Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments(https://arxiv.org/abs/2506.14645)
Keywords: language model, llm
Abstract: The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses. The model's outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns. The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks.

Title: GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors

Authors: Hengyuan Zhang, Xinrong Chen, Yingmin Qiu, Xiao Liang, Ziyue Li, Guanyu Wang, Weiping Li, Tong Mo, Wenyue Li, Hayden Kwok-Hay So, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14646
Pdf URL: https://arxiv.org/pdf/2506.14646
Copy Paste: [[2506.14646]] GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors(https://arxiv.org/abs/2506.14646)
Keywords: language model
Abstract: Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations still hinder the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained, layer-wise strategy for allocating expert numbers and ranks with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at this https URL.

Title: Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality

Authors: Yuto Harada, Yusuke Yamauchi, Yusuke Oda, Yohei Oseki, Yusuke Miyao, Yu Takagi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14681
Pdf URL: https://arxiv.org/pdf/2506.14681
Copy Paste: [[2506.14681]] Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality(https://arxiv.org/abs/2506.14681)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness--often surpassing superficial similarity between trained data and benchmark--and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.
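
For readers unfamiliar with the metric mentioned in this abstract, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. The function name `perplexity` and its input format are assumptions made for illustration; the abstract does not specify how the authors compute or aggregate it.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities.

    Illustrative helper (hypothetical, not from the paper): perplexity is
    the exponential of the mean negative log-likelihood per token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options.
example = perplexity([math.log(0.25)] * 4)
```

Lower values mean the model finds the text more predictable.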

Title: Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers

Authors: Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.14702
Pdf URL: https://arxiv.org/pdf/2506.14702
Copy Paste: [[2506.14702]] Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers(https://arxiv.org/abs/2506.14702)
Keywords: prompt
Abstract: One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: "Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.

Title: Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

Authors: Ring Team, Bin Hu, Cai Chen, Deng Zhao, Ding Liu, Dingnan Jin, Feng Zhu, Hao Dai, Hongzhi Luan, Jia Guo, Jiaming Liu, Jiewei Wu, Jun Mei, Jun Zhou, Junbo Zhao, Junwu Xiong, Kaihong Zhang, Kuan Xu, Lei Liang, Liang Jiang, Liangcheng Fu, Longfei Zheng, Qiang Gao, Qing Cui, Quan Wan, Shaomian Zheng, Shuaicheng Li, Tongkai Yang, Wang Ren, Xiaodong Yan, Xiaopei Wan, Xiaoyun Feng, Xin Zhao, Xinxing Yang, Xinyu Kong, Xuemin Yang, Yang Li, Yingting Wu, Yongkang Liu, Zhankai Xu, Zhenduo Zhang, Zhenglei Zhou, Zhenyu Huang, Zhiqiang Zhang, Zihao Wang, Zujie Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14731
Pdf URL: https://arxiv.org/pdf/2506.14731
Copy Paste: [[2506.14731]] Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs(https://arxiv.org/abs/2506.14731)
Keywords: language model, llm
Abstract: We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via an algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code.

Title: Reasoning with Exploration: An Entropy Perspective

Authors: Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.14758
Pdf URL: https://arxiv.org/pdf/2506.14758
Copy Paste: [[2506.14758]] Reasoning with Exploration: An Entropy Perspective(https://arxiv.org/abs/2506.14758)
Keywords: language model
Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
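
The abstract describes the change only as augmenting the advantage function with an entropy-based term and does not give its exact form. The sketch below is one plausible reading, not the authors' implementation. The names `token_entropy` and `entropy_shaped_advantages`, the hyperparameters `alpha` and `kappa`, and the cap on the bonus are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_shaped_advantages(advantages, token_probs, alpha=0.1, kappa=2.0):
    """Add an entropy-based bonus to each per-token advantage.

    alpha and kappa are hypothetical hyperparameters. The bonus is capped
    at |A| / kappa, so with kappa > 1 it cannot flip the sign of the
    original advantage. In a real RL trainer the entropy would be treated
    as a constant (no gradient flows through it); this sketch works on
    plain floats, so that detail does not arise here.
    """
    shaped = []
    for adv, probs in zip(advantages, token_probs):
        bonus = min(alpha * token_entropy(probs), abs(adv) / kappa)
        shaped.append(adv + bonus)
    return shaped

# Two tokens: one sampled from a uniform 4-way distribution (high entropy),
# one from a deterministic distribution (zero entropy).
shaped = entropy_shaped_advantages(
    [1.0, -1.0],
    [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0]],
)
```

Under these assumptions, the high-entropy token receives a small positive boost while the zero-entropy token keeps its original advantage.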

Title: From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.14761
Pdf URL: https://arxiv.org/pdf/2506.14761
Copy Paste: [[2506.14761]] From Bytes to Ideas: Language Modeling with Autoregressive U-Nets(https://arxiv.org/abs/2506.14761)
Keywords: language model
Abstract: Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.

Title: A Variational Framework for Improving Naturalness in Generative Spoken Language Models

Authors: Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.14767
Pdf URL: https://arxiv.org/pdf/2506.14767
Copy Paste: [[2506.14767]] A Variational Framework for Improving Naturalness in Generative Spoken Language Models(https://arxiv.org/abs/2506.14767)
Keywords: language model
Abstract: The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at this https URL.