2025-10-08

Title: Collaborative and Proactive Management of Task-Oriented Conversations

Authors: Arezoo Saedi, Afsaneh Fatemi, Mohammad Ali Nematbakhsh, Sophie Rosset, Anne Vilnat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05110
Pdf URL: https://arxiv.org/pdf/2510.05110
Copy Paste: [[2510.05110]] Collaborative and Proactive Management of Task-Oriented Conversations(https://arxiv.org/abs/2510.05110)
Keywords: language model, llm
Abstract: Task oriented dialogue systems (TOD) complete particular tasks based on user preferences across natural language interactions. Considering the impressive performance of large language models (LLMs) in natural language processing (NLP) tasks, most of the latest TODs are centered on LLMs. While proactive planning is crucial for task completion, many existing TODs overlook effective goal-aware planning. This paper creates a model for managing task-oriented conversations, conceptualized centered on the information state approach to dialogue management. The created model incorporated constructive intermediate information in planning. Initially, predefined slots and text part informational components are created to model user preferences. Investigating intermediate information, critical circumstances are identified. Informational components corresponding to these circumstances are created. Possible configurations for these informational components lead to limited information states. Then, dialogue moves, which indicate movement between these information states and the procedures that must be performed in the movements, are created. Eventually, the update strategy is constructed. The created model is implemented leveraging in-context learning of LLMs. In this model, database queries are created centered on indicated predefined slots and the order of retrieved entities is indicated centered on text part. This mechanism enables passing the whole corresponding entities to the preferences in the order of congruency. Evaluations exploiting the complete test conversations of MultiWOZ, with no more than a domain in a conversation, illustrate maximal inform and success, and improvement compared with previous methods.
摘要：根据自然语言互动的用户偏好，以任务为导向的对话系统（TOD）完成特定任务。考虑到自然语言处理（NLP）任务中大型语言模型（LLM）的令人印象深刻的表现，大多数最新TOD都以LLM为中心。尽管积极的计划对于完成任务至关重要，但许多现有的TOD忽略了有效的目标感知计划。本文创建了一个用于管理面向任务的对话的模型，该模型以信息状态对话管理方法为中心。创建的模型将建设性的中间信息包含在计划中。最初，创建了预定义的插槽和文本部分信息组件来建模用户偏好。调查中间信息，确定关键情况。创建了与这些情况相对应的信息组件。这些信息组件的可能配置导致信息状态有限。然后，将创建对话移动，这表明这些信息状态与动作中必须执行的过程之间的运动。最终，构建了更新策略。创建的模型是实现了利用LLM的文化学习。在此模型中，创建数据库查询以指示的预定插槽为中心，并指示检索实体的顺序以文本部分为中心。该机制使整个相应的实体按一致顺序传递给偏好。利用多沃兹的完整测试对话的评估，不超过对话中的域，说明了与以前的方法相比的最大知识和成功以及改进。

Title: Hallucination is Inevitable for LLMs with the Open World Assumption

Authors: Bowen Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05116
Pdf URL: https://arxiv.org/pdf/2510.05116
Copy Paste: [[2510.05116]] Hallucination is Inevitable for LLMs with the Open World Assumption(https://arxiv.org/abs/2510.05116)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) exhibit impressive linguistic competence but also produce inaccurate or fabricated outputs, often called "hallucinations". Engineering approaches usually regard hallucination as a defect to be minimized, while formal analyses have argued for its theoretical inevitability. Yet both perspectives remain incomplete when considering the conditions required for artificial general intelligence (AGI). This paper reframes "hallucination" as a manifestation of the generalization problem. Under the Closed World assumption, where training and test distributions are consistent, hallucinations may be mitigated. Under the Open World assumption, however, where the environment is unbounded, hallucinations become inevitable. This paper further develops a classification of hallucination, distinguishing cases that may be corrected from those that appear unavoidable under open-world conditions. On this basis, it suggests that "hallucination" should be approached not merely as an engineering defect but as a structural feature to be tolerated and made compatible with human intelligence.
摘要：大型语言模型 (LLM) 表现出令人印象深刻的语言能力，但也会产生不准确或捏造的输出，通常称为“幻觉”。工程方法通常将幻觉视为需要最小化的缺陷，而正式分析则主张其理论上的必然性。然而，在考虑通用人工智能（AGI）所需的条件时，这两种观点仍然不完整。本文将“幻觉”重新定义为泛化问题的一种表现。在封闭世界假设下，训练和测试分布一致，幻觉可能会减轻。然而，在开放世界假设下，环境是无限的，幻觉就不可避免。本文进一步发展了幻觉的分类，将可以纠正的情况与在开放世界条件下似乎不可避免的情况区分开来。在此基础上，它表明“幻觉”不应仅仅被视为一种工程缺陷，而应被视为一种可以容忍并与人类智能兼容的结构特征。

Title: Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models

Authors: Durgesh Nandini, Rebekka Koch, Mirco Schoenfeld
Subjects: cs.CL, cs.CE, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05121
Pdf URL: https://arxiv.org/pdf/2510.05121
Copy Paste: [[2510.05121]] Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models(https://arxiv.org/abs/2510.05121)
Keywords: language model, llm, prompt
Abstract: This study investigates the effectiveness of Large Language Models (LLMs) for the extraction of structured knowledge in the form of Subject-Predicate-Object triples. We apply the setup for the domain of Economics application. The findings can be applied to a wide range of scenarios, including the creation of economic trade knowledge graphs from natural language legal trade agreement texts. As a use case, we apply the model to regional trade agreement texts to extract trade-related information triples. In particular, we explore the zero-shot, one-shot and few-shot prompting techniques, incorporating positive and negative examples, and evaluate their performance based on quantitative and qualitative metrics. Specifically, we used Llama 3.1 model to process the unstructured regional trade agreement texts and extract triples. We discuss key insights, challenges, and potential future directions, emphasizing the significance of language models in economic applications.
摘要：本研究调查了大型语言模型 (LLM) 以主谓宾三元组形式提取结构化知识的有效性。我们将设置应用于经济学应用领域。研究结果可应用于广泛的场景，包括从自然语言合法贸易协议文本创建经济贸易知识图谱。作为一个用例，我们将该模型应用于区域贸易协定文本，以提取与贸易相关的信息三元组。特别是，我们探索了零样本、单样本和少样本提示技术，结合正面和负面例子，并根据定量和定性指标评估其性能。具体来说，我们使用 Llama 3.1 模型来处理非结构化区域贸易协定文本并提取三元组。我们讨论了关键见解、挑战和潜在的未来方向，强调了语言模型在经济应用中的重要性。

Title: MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation

Authors: Mingjin Li, Yu Liu, Huayi Liu, Xiang Ye, Chao Jiang, Hongguang Zhang
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2510.05124
Pdf URL: https://arxiv.org/pdf/2510.05124
Copy Paste: [[2510.05124]] MADS: Multi-Agent Dialogue Simulation for Diverse Persuasion Data Generation(https://arxiv.org/abs/2510.05124)
Keywords: llm, prompt, agent
Abstract: We propose MADS (Multi-Agent Dialogue Simulation), a scalable framework for generating persuasive multi-turn dialogues via agent self-play. MADS employs three coordinated agents: User Agents simulating diverse persona-driven behaviors, a Dialog Agent executing task-oriented persuasion strategies, and an Optimization Agent evaluating and refining dialogue outcomes. We further validate its effectiveness through users' Chain-of-Attitude (CoA) modeling and dedicated LLMs' persuasion assessment. This approach enables low-cost generation of training data without human annotation, addressing key industry challenges such as lack of user data, cold-start evaluation difficulties, and prompt inefficiency. Applied to a real-world marketing scenario, MADS significantly improved the persuasion capacity of small LLMs, increasing the organic traffic conversion rate by 22.4% (from 1.83% to 2.24%), demonstrating clear business value.
摘要：我们提出了 MADS（多智能体对话模拟），这是一个可扩展的框架，用于通过智能体自我博弈生成有说服力的多轮对话。MADS 采用三个协调的智能体：模拟不同角色驱动行为的用户智能体、执行面向任务的说服策略的对话智能体以及评估和完善对话结果的优化智能体。我们通过用户的态度链 (CoA) 建模和专门的 LLM 说服评估进一步验证其有效性。这种方法无需人工注释即可低成本生成训练数据，解决了缺乏用户数据、冷启动评估困难和提示效率低下等关键行业挑战。应用于真实的营销场景中，MADS 显著提高了小型 LLM 的说服能力，将自然流量转化率提高了 22.4%（从 1.83% 到 2.24%），展现出明显的商业价值。
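The 22.4% figure in the MADS abstract is a relative lift, not a gain in percentage points. A minimal sketch of the arithmetic follows, using only the two conversion rates quoted in the abstract; the function name `relative_lift` is illustrative and not taken from the paper.

```python
def relative_lift(before: float, after: float) -> float:
    """Relative change of `after` over `before`, expressed as a fraction."""
    return (after - before) / before


# Conversion rates quoted in the abstract (1.83% and 2.24%).
before, after = 0.0183, 0.0224

lift = relative_lift(before, after)       # relative improvement
abs_points = (after - before) * 100       # absolute gain in percentage points

print(f"relative lift: {lift:.1%}")
print(f"absolute gain: {abs_points:.2f} percentage points")
```

The distinction matters when comparing against other reported results: the absolute change is well under one percentage point, while the relative change is the headline number.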

Title: Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

Authors: Reza Shirkavand, Xiaokai Wei, Chen Wang, Zheng Hui, Heng Huang, Michelle Gong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05125
Pdf URL: https://arxiv.org/pdf/2510.05125
Copy Paste: [[2510.05125]] Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation(https://arxiv.org/abs/2510.05125)
Keywords: language model, llm
Abstract: While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
摘要：尽管协作过滤提供了预测的准确性和效率，并且大型语言模型（LLMS）可实现表达和可推广的推理，但现代推荐系统必须将这些优势融合在一起。诸如自然语言查询和透明解释之类的用户期望不断增长，进一步强调了对统一方法的需求。但是，这样做是不平凡的。协作信号通常是象征性的，但在语义上是不透明的，而LLMS在语义上是丰富的，但仅在仅接受文本输入培训时就很难模拟隐式用户偏好。本文介绍了Experts语言模型（IDIOMOE）的item-ID +口语混合物，该模型将项目的相互作用历史视为语言空间中的本地方言，从而使协作信号能够以与自然语言相同的方式理解。通过将验证的LLM的每个块的饲料前向网络分为单独的文本专家和具有令牌型门控的项目专家，我们的方法避免了文本和目录方式之间的破坏性干扰。 Idiomoe在公众和专有数据集中表现出强烈的建议性能，同时保留了对预算模型的文本理解。

Title: Improving Metacognition and Uncertainty Communication in Language Models

Authors: Mark Steyvers, Catarina Belem, Padhraic Smyth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05126
Pdf URL: https://arxiv.org/pdf/2510.05126
Copy Paste: [[2510.05126]] Improving Metacognition and Uncertainty Communication in Language Models(https://arxiv.org/abs/2510.05126)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. While prior work shows that LLMs maintain internal uncertainty signals, their explicit verbalized confidence is typically miscalibrated and poorly discriminates between correct and incorrect answers. Across two types of LLMs, we investigate whether supervised finetuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We finetune the LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to have correct. We assess generalization to unseen domains, including medical and legal reasoning. Results show that finetuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains, while leaving accuracy unchanged. However, improvements are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. In contrast, multitask finetuning on both forms of metacognition yields broader gains, producing lower calibration error and stronger discrimination in out-of-domain evaluations. These results show that while uncertainty communication in LLMs is trainable and generalizable, different metacognitive skills do not naturally reinforce one another and must be developed together through multitask training.
摘要：大型语言模型 (LLM) 越来越多地用于决策环境中，但当它们在不表示低置信度的情况下给出答案时，用户可能会在不知不觉中对错误的输出采取行动。虽然之前的研究表明 LLM 保留了内部不确定性信号，但它们显式表达的置信度通常校准不良，并且很难区分正确和错误的答案。在两种类型的 LLM 中，我们研究了监督微调是否可以提高模型传达不确定性的能力，以及这种改进是否可以跨任务和领域推广。我们在涵盖常识、数学和开放式问答的数据集上对 LLM 进行微调，并评估两个元认知任务：(1) 单问题置信度估计，其中模型为其答案分配数字确定性；(2) 成对置信度比较，其中模型选择两个答案中哪一个更有可能是正确的。我们评估对未见领域的泛化，包括医学和法律推理。结果表明，微调可以改善域内和跨域的校准（所述置信度和准确性之间的一致性）和区分度（正确与错误响应的置信度更高），同时保持准确性不变。然而，改进是特定于任务的：单问题校准的训练不会转移到成对比较，反之亦然。相比之下，对两种形式的元认知进行多任务微调会产生更广泛的收益，在域外评估中产生更低的校准误差和更强的区分度。这些结果表明，虽然 LLM 中的不确定性沟通是可训练和可推广的，但不同的元认知技能不会自然地相互增强，必须通过多任务训练共同发展。
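The abstract above separates calibration (stated confidence matches accuracy) from discrimination (correct answers receive higher confidence than incorrect ones). The sketch below shows one common way to quantify each: expected calibration error with equal-width bins, and an AUROC-style pairwise score. These are standard measures chosen for illustration; the abstract does not state which exact metrics or bin counts the authors used, so `n_bins=10` and both function names are assumptions.

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    n = len(confs)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def discrimination_auc(confs, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, y in zip(confs, correct) if y]
    neg = [c for c, y in zip(confs, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: a model that says 0.9 for everything but is right once in four.
overconfident_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
flat_auc = discrimination_auc([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(overconfident_ece, flat_auc)
```

A model can score well on one measure and poorly on the other, which is consistent with the abstract's finding that the two metacognitive skills do not transfer automatically.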

Title: Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models

Authors: Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
Subjects: cs.CL, cs.CV, eess.AS
Abstract URL: https://arxiv.org/abs/2510.05128
Pdf URL: https://arxiv.org/pdf/2510.05128
Copy Paste: [[2510.05128]] Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models(https://arxiv.org/abs/2510.05128)
Keywords: language model
Abstract: Current methods for automated assessment of cognitive-linguistic impairment via picture description often neglect the visual narrative path - the sequence and locations of elements a speaker described in the picture. Analyses of spatio-semantic features capture this path using content information units (CIUs), but manual tagging or dictionary-based mapping is labor-intensive. This study proposes a BERT-based pipeline, fine-tuned with binary cross-entropy and pairwise ranking loss, for automated CIU extraction and ordering from the Cookie Theft picture description. Evaluated by 5-fold cross-validation, it achieves 93% median precision, 96% median recall in CIU detection, and 24% sequence error rates. The proposed method extracts features that exhibit strong Pearson correlations with ground truth, surpassing the dictionary-based baseline in external validation. These features also perform comparably to those derived from manual annotations in evaluating group differences via ANCOVA. The pipeline is shown to effectively characterize visual narrative paths for cognitive impairment assessment, with the implementation and models open-sourced to the public.
摘要：目前通过图片描述自动评估认知语言障碍的方法常常忽略视觉叙事路径——说话者在图片中描述的元素的顺序和位置。空间语义特征的分析使用内容信息单元（CIU）捕获这条路径，但手动标记或基于字典的映射是劳动密集型的。本研究提出了一种基于 BERT 的管道，通过二元交叉熵和成对排名损失进行微调，用于从 Cookie Theft 图片描述中自动提取 CIU 和排序。通过5倍交叉验证评估，其在CIU检测中实现了93%的中位精度、96%的中位召回率和24%的序列错误率。所提出的方法提取的特征与真实情况表现出很强的皮尔逊相关性，在外部验证中超越了基于字典的基线。在通过 ANCOVA 评估组差异时，这些特征的表现也与源自手动注释的特征相当。该管道被证明可以有效地表征认知障碍评估的视觉叙事路径，并且其实施和模型向公众开源。

Title: Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models

Authors: Qingshu Xu, Hong Jiao, Tianyi Zhou, Ming Li, Nan Zhang, Sydney Peters, Yanbin Fu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05129
Pdf URL: https://arxiv.org/pdf/2510.05129
Copy Paste: [[2510.05129]] Automated Alignment of Math Items to Content Standards in Large-Scale Assessments Using Language Models(https://arxiv.org/abs/2510.05129)
Keywords: language model
Abstract: Accurate alignment of items to content standards is critical for valid score interpretation in large-scale assessments. This study evaluates three automated paradigms for aligning items with four domain and nineteen skill labels. First, we extracted embeddings and trained multiple classical supervised machine learning models, and further investigated the impact of dimensionality reduction on model performance. Second, we fine-tuned eight models, BERT and its variants, for both domain and skill alignment. Third, we explored ensemble learning with majority voting and stacking with multiple meta-models. The DeBERTa-v3-base achieved the highest weighted-average F1 score of 0.950 for domain alignment while the RoBERTa-large yielded the highest F1 score of 0.869 for skill alignment. Ensemble models did not surpass the best-performing language models. Dimension reduction enhanced linear classifiers based on embeddings but did not perform better than language models. This study demonstrated different methods in automated item alignment to content standards.
摘要：项目与内容标准的准确对齐对于大规模评估中有效的分数解释至关重要。本研究评估了三种自动化范例，用于将项目与四个领域和十九个技能标签对齐。首先，我们提取嵌入并训练多个经典的监督机器学习模型，并进一步研究降维对模型性能的影响。其次，我们针对领域和技能调整对八个 BERT 模型及其变体进行了微调。第三，我们探索了多数投票和多个元模型堆叠的集成学习。 DeBERTa-v3-base 在领域对齐方面取得了最高的加权平均 F1 分数 0.950，而 RoBERTa-large 在技能对齐方面取得了最高的 F1 分数 0.869。集成模型并没有超越表现最好的语言模型。降维增强了基于嵌入的线性分类器，但性能并不比语言模型更好。这项研究展示了自动项目与内容标准对齐的不同方法。
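Two ingredients named in the item-alignment abstract, weighted-average F1 and majority-vote ensembling, can be sketched in a few lines. This is a generic illustration, not the authors' code: the labels and predictions are made up, and ties in the vote are resolved in favour of the label seen first (the earliest model in the list), which is an assumption of this sketch.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights proportional to each label's support."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (tp + fn) / total * f1  # tp + fn is the label's support
    return score


def majority_vote(predictions):
    """predictions: one list of labels per model, all of equal length."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical domain labels for four items and three models.
truth = ["algebra", "algebra", "geometry", "geometry"]
model_a = ["algebra", "geometry", "geometry", "geometry"]
model_b = ["algebra", "algebra", "geometry", "algebra"]
model_c = ["geometry", "algebra", "geometry", "geometry"]

ensemble = majority_vote([model_a, model_b, model_c])
print(weighted_f1(truth, model_a), weighted_f1(truth, ensemble))
```

In this toy case the ensemble repairs each model's single error, but as the abstract reports, that need not happen when one model is already much stronger than the rest.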

Title: Submodular Context Partitioning and Compression for In-Context Learning-short paper

Authors: Shaoyi Zheng, Canyu Zhang, Tianyi Zhou, Shengjie Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05130
Pdf URL: https://arxiv.org/pdf/2510.05130
Copy Paste: [[2510.05130]] Submodular Context Partitioning and Compression for In-Context Learning-short paper(https://arxiv.org/abs/2510.05130)
Keywords: language model, llm
Abstract: In-context learning (ICL) enables efficient few-shot learning in large language models (LLMs) without training, but suffers from the quadratic input complexity of transformers, limiting the maximum number of exemplars. While various efficient ICL approaches partition the context into blocks to process (e.g., ensembling, compression, cross-attention), they often ignore the information redundancy or under-representation caused by different partition strategies, leading to suboptimal performance. To tackle this problem, we propose Sub-CP, a block-aware context selection framework that leverages submodular objectives to control block diversity. Sub-CP supports a flexible spectrum of selection strategies, allowing each block to range from globally diverse to locally coherent. This allows fine-grained control over semantic structure while enabling precomputation. Extensive experiments across diverse tasks on multiple datasets show that Sub-CP consistently improves performance across model scales.
摘要：上下文学习（ICL）无需训练即可在大型语言模型（LLM）中实现高效的小样本学习，但受到变压器二次输入复杂性的影响，限制了样本的最大数量。虽然各种有效的 ICL 方法将上下文划分为块进行处理（例如，集成、压缩、交叉注意力），但它们经常忽略由不同划分策略引起的信息冗余或表示不足，从而导致性能不佳。为了解决这个问题，我们提出了 Sub-CP，一种块感知上下文选择框架，利用子模块目标来控制块多样性。 Sub-CP 支持灵活的选择策略，允许每个块的范围从全局多样化到局部一致。这允许对语义结构进行细粒度控制，同时启用预计算。在多个数据集上的不同任务中进行的广泛实验表明，Sub-CP 能够持续提高跨模型规模的性能。
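The Sub-CP abstract says block contents are chosen with submodular objectives that control diversity, but it does not state the objective itself. As a generic illustration of the idea, the sketch below runs the standard greedy algorithm on a facility-location function over a similarity matrix. The objective, the function names, and the toy matrix are all assumptions, not the paper's method.

```python
def facility_location(selected, sim):
    """f(S) = sum over all items of their best similarity to any selected item.

    Monotone and submodular when similarities are non-negative.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, k):
    """Pick k exemplars, each time adding the one with the largest marginal gain."""
    selected = []
    candidates = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        base = facility_location(selected, sim)
        best = max(
            sorted(candidates),
            key=lambda j: facility_location(selected + [j], sim) - base,
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical exemplar similarities: items 0 and 1 form one cluster, 2 and 3 another.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
picked = greedy_select(sim, 2)
print(picked)
```

Because a second item from an already-covered cluster adds little to the objective, the greedy picks spread across clusters; a locally coherent block would instead call for an objective that rewards within-block similarity.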

Title: Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery

Authors: Bowen Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05131
Pdf URL: https://arxiv.org/pdf/2510.05131
Copy Paste: [[2510.05131]] Rationale-Augmented Retrieval with Constrained LLM Re-Ranking for Task Discovery(https://arxiv.org/abs/2510.05131)
Keywords: language model, llm
Abstract: Head Start programs utilizing GoEngage face significant challenges when new or rotating staff attempt to locate appropriate Tasks (modules) on the platform homepage. These difficulties arise from domain-specific jargon (e.g., IFPA, DRDP), system-specific nomenclature (e.g., Application Pool), and the inherent limitations of lexical search in handling typos and varied word ordering. We propose a pragmatic hybrid semantic search system that synergistically combines lightweight typo-tolerant lexical retrieval, embedding-based vector similarity, and constrained large language model (LLM) re-ranking. Our approach leverages the organization's existing Task Repository and Knowledge Base infrastructure while ensuring trustworthiness through low false-positive rates, evolvability to accommodate terminological changes, and economic efficiency via intelligent caching, shortlist generation, and graceful degradation mechanisms. We provide a comprehensive framework detailing required resources, a phased implementation strategy with concrete milestones, an offline evaluation protocol utilizing curated test cases (Hit@K, Precision@K, Recall@K, MRR), and an online measurement methodology incorporating query success metrics, zero-result rates, and dwell-time proxies.
摘要：当新员工或旋转的员工试图在平台主页上找到适当的任务（模块）时，使用Goengage的主启动程序面临重大挑战。这些困难来自特定领域的术语（例如IFPA，DRDP），系统特定的命名法（例如，应用程序池）以及处理错别字和各种单词顺序的词汇搜索的固有局限性。我们提出了一个务实的混合语义搜索系统，该系统协同结合了轻巧的错别字耐受性词汇检索，基于嵌入的矢量相似性，并约束了大型语言模型（LLM）。我们的方法利用组织的现有任务存储库和知识基础基础架构，同时通过低阳性速率，可进化性来确保可信赖性，以适应术语变化以及通过智能缓存，候选名单生成和优雅的降级机制来适应术语变化以及经济效率。我们提供了一个全面的框架，详细介绍了所需的资源，具有混凝土里程碑的分阶段实施策略，使用策划测试用例的离线评估协议（hit@k，precision@k，recke@k，k，mrr）以及包含查询成功度量的查询成功指标，零分辨率率和零用用量的在线测量方法。
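The offline protocol in the abstract above names Hit@K, Precision@K, Recall@K, and MRR. Their usual definitions are sketched below for a ranked list of Task identifiers; the identifiers and test queries are hypothetical, and the abstract does not say how the authors handle edge cases such as queries with no relevant Task.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(d in relevant for d in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(d in relevant for d in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries: list of (ranked, relevant) pairs, one per test case."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Hypothetical curated test cases: (system ranking, set of correct Tasks).
cases = [
    (["task_3", "task_1", "task_7"], {"task_1", "task_9"}),
    (["task_2", "task_5", "task_8"], {"task_2"}),
]
print(mean_reciprocal_rank(cases))
```

Hit@K suits the stated goal of helping staff find one appropriate Task, while Precision@K tracks the false-positive concern the abstract raises.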

Title: Training Large Language Models To Reason In Parallel With Global Forking Tokens

Authors: Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05132
Pdf URL: https://arxiv.org/pdf/2510.05132
Copy Paste: [[2510.05132]] Training Large Language Models To Reason In Parallel With Global Forking Tokens(https://arxiv.org/abs/2510.05132)
Keywords: language model, llm
Abstract: Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
摘要：尽管LLM通过缩放并行测试时间计算表现出了提高的性能，但这样做依赖于生成既多样化又准确的推理路径。对于具有挑战性的问题，触发多种而正确推理模式的分叉令牌通常在采样树中深处。因此，鼓励多样性（例如温度缩放）的共同策略遇到了多样性和准确性之间的权衡恶化。在这一挑战的推动下，我们将平行推理视为一个次要预测问题，并使用我们的全球分叉代币与独特的推理痕迹之间的自我监视的双分配匹配，将基于集合的全球损失纳入监督的微调（SFT）。我们观察到，尽管以多种推理痕迹幼稚的微调崩溃了这些独特的推理模式，但我们提出的方法设置了监督的微调（SSFT），但仍保留这些模式并产生新兴的全球分叉代币。多个推理基准的实验表明，我们的SSFT在Pass@1和Cons@K指标下始终优于SFT。
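The SSFT abstract describes a set-based loss built on bipartite matching between global forking tokens and distinct reasoning traces. The matching step can be pictured as an assignment problem over a cost matrix. The sketch below assumes that `cost[i][j]` is the loss of trace `j` when generation is conditioned on forking token `i`, and it solves the assignment by brute force over permutations, which is only practical for a handful of tokens. Both the cost definition and the solver are illustrative assumptions, not the paper's implementation.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return (assignment, total) minimising sum of cost[i][assignment[i]].

    Brute force over all permutations; a stand-in for a proper assignment
    solver, suitable only for small square matrices.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Hypothetical losses: rows are forking tokens, columns are reasoning traces.
cost = [
    [0.2, 1.5, 1.1],
    [1.4, 0.3, 0.9],
    [1.0, 1.2, 0.1],
]
assignment, set_loss = min_cost_matching(cost)
print(assignment, set_loss)
```

Training only on the matched pairs gives each token a distinct trace to specialise on, which is one way to read the abstract's claim that the method keeps reasoning modes from collapsing.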

Title: Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

Authors: Y. Du, G. Wu, G. Tang, W. Wang, Q. Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05133
Pdf URL: https://arxiv.org/pdf/2510.05133
Copy Paste: [[2510.05133]] Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios(https://arxiv.org/abs/2510.05133)
Keywords: language model
Abstract: Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0-50%. Our key findings include: models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.
摘要：大语言模型生成的合成数据已成为现代 NLP 训练管道不可或缺的一部分，从引导推理能力到扩充指令跟随数据集。尽管最近的工作展示了保持较高外部数据比例的成功应用，但对合成数据比例如何影响不同规模模型行为的系统理解仍然有限。本文介绍了一项受控的实证研究，考察在不同的合成与外部数据比例下训练时的模型性能、校准和输出特征。我们使用 Pythia 模型套件（410M-12B 参数），在五个不同的任务中，在一到三次训练迭代后评估模型，其合成数据比例为 0-50%。我们的主要发现包括：模型在合成数据不超过 20% 时保持稳定的性能，但超过 30% 后性能下降加速；较大的模型（6.9B-12B）比较小的模型（410M-1.4B）对合成数据表现出更强的鲁棒性；校准退化先于准确性损失，提供了预警信号；任务特征很重要，在合成数据训练下，推理任务比检索任务退化得更快。重要的是，我们发现当前的最佳实践，例如 STaR 和 Self-Instruct 系统所采用的、保持 80% 以上外部数据的做法，在我们的实验确定的安全范围内运行良好。我们根据模型规模和任务要求为从业人员提供有关合成数据预算的实用指导，以及与同期工作（包括 Shumailov 等人的模型崩溃发现）的详细比较。

Title: Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

Authors: Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05135
Pdf URL: https://arxiv.org/pdf/2510.05135
Copy Paste: [[2510.05135]] Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment(https://arxiv.org/abs/2510.05135)
Keywords: language model, llm
Abstract: Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing which is personalized to each individual's creative judgments. We use the Torrance Test of Creative Thinking (TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotated by expert humans across various subjective dimensions like Originality, to test our hypothesis. We show that our method enables models across various sizes to learn the nuanced creative judgments of different individuals by showing improvements over the baseline supervised finetuning (SFT) method across various evaluation metrics like Pearson correlation, Cohen's and F1 values. Our method is especially useful in subjective evaluations where not all the annotators agree with each other.
摘要：现代大型语言模型（LLMS）擅长于评估数学推理和事实准确性等客观任务，但是当面对评估创造力的细微，主观性质时，它们会摇摇欲坠。在这项工作中，我们提出了一个新颖的好奇心驱动的LLM-AS-A-A-Gudge来评估创意写作，该创意写作对每个人的创造性判断都有个人化。我们使用Chakrabarty等人引入的创意思维的Torrance测试（TTCW）。（2024年），其中有专家人类在各种主观方面等专家提供的故事，以检验我们的假设。我们表明，我们的方法可以通过在各种评估指标（例如Pearson相关性，Cohen's and F1值）上显示出对基线监督芬特（SFT）方法的改进，从而使各种尺寸的模型能够学习不同个人的细微创意判断。我们的方法在主观评估中特别有用，在主观评估中，并非所有注释者都彼此同意。

Title: Linguistic Characteristics of AI-Generated Text: A Survey

Authors: Luka Terčon, Kaja Dobrovoljc
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05136
Pdf URL: https://arxiv.org/pdf/2510.05136
Copy Paste: [[2510.05136]] Linguistic Characteristics of AI-Generated Text: A Survey(https://arxiv.org/abs/2510.05136)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made, however a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most-often reported findings is the observation that AI-generated text is more likely to contain a more formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.
摘要：大型语言模型 (LLM) 正在巩固其作为自动生成文本的有效工具在现代世界的地位。它们的使用很快在教育、医疗保健和科学研究等领域变得普遍。人们越来越需要研究人工智能生成的文本中存在的语言特征，因为此类文本的不断增加对语料库语言学、计算语言学和自然语言处理等各个学科产生了深远的影响。已经进行了许多观察，但是需要对迄今为止的发现进行更广泛的综合，以便更好地理解该主题。本调查文件旨在提供现有研究的综合。我们从几个维度对现有作品进行分类，包括语言描述的层次、包含的模型、分析的体裁、分析的语言以及提示的方法。此外，同样的方案还用于展示迄今为止的发现并揭示研究人员遵循的当前趋势。最常报道的发现之一是，人工智能生成的文本更有可能包含更正式和客观的风格，这通过名词、限定词和副词的出现增加以及对形容词和副词的依赖减少来表明。人工智能生成的文本也更有可能具有较低的词汇多样性、较小的词汇量和重复的文本。然而，目前的研究仍然主要集中在英语数据上，并且主要集中在 GPT 模型系列生成的文本上，这凸显了更广泛的跨语言和跨模型调查的必要性。在大多数情况下，作者也未能解决提示敏感性问题，为未来在文本生成阶段采用多种提示措辞的研究留下了很大的空间。

Title: Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Authors: Maojia Song, Renhang Liu, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Soujanya Poria, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05137
Pdf URL: https://arxiv.org/pdf/2510.05137
Copy Paste: [[2510.05137]] Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics(https://arxiv.org/abs/2510.05137)
Keywords: retrieval-augmented generation, agent
Abstract: RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.
摘要：越来越多地评估了抹布（检索出的生成）系统和Web代理，对多跳深搜索任务进行了评估，但是当前的实践受到了两个主要局限性。首先，大多数基准测试漏了问题文本中的推理路径，允许模型遵循表面提示，而不是自动发现推理链。其次，评估通常会降低为单个通过率，这将各种行为崩溃为一个分数，并掩盖了失败是由于搜索不足，知识使用不足还是不适当的拒绝造成的失败。为了解决这些问题，我们提出了Websetextive，这是无备用多跳的问题的基准，与受控的Wikipedia Sandbox配对，可确保模型动作的完全可追溯性，以及将搜索充足性，知识利用和拒绝行为分开的整体评估框架。我们对25种最先进模型的评估揭示了所有体系结构中的系统弱点：尽管有足够的证据，但在缺乏证据时表现出了几乎不公平的拒绝。这些模式暴露了一个基本差距：当今的系统在执行给定推理路径的执行方面表现出色，但在发现它们时失败了。我们开发了一个代理工作流程，即EvidenCeloop，该工作流明确针对基准标识的挑战，并结合了验证循环和系统性证据跟踪，以提高搜索和合成能力。该基线表明，WebDetactive的诊断框架可以指导具体的建筑改进，从而确立我们的基准作为开发真正自主推理系统而不是模式跟随代理的关键工具。

Title: LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation

Authors: Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05138
Pdf URL: https://arxiv.org/pdf/2510.05138
Copy Paste: [[2510.05138]] LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation(https://arxiv.org/abs/2510.05138)
Keywords: llm, agent
Abstract: The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.

Title: NLD-LLM: A systematic framework for evaluating small language transformer models on natural language description

Authors: Hamed Jelodar, Mohammad Meymani, Parisa Hamedi, Tochukwu Emmanuel Nwankwo, Samita Bai, Roozbeh Razavi-Far, Ali A. Ghorbani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05139
Pdf URL: https://arxiv.org/pdf/2510.05139
Copy Paste: [[2510.05139]] NLD-LLM: A systematic framework for evaluating small language transformer models on natural language description(https://arxiv.org/abs/2510.05139)
Keywords: language model, llm, prompt
Abstract: Natural Language Description (NLD) is a Natural Language Processing (NLP) task that requires models to generate structured and meaningful outputs from natural language inputs. In this work, we propose NLD-LLM, a systematic NLP framework for evaluating how well language models generate accurate and concise source code descriptions. This framework incorporates a diverse set of transformer models, including Qwen, DeepSeek, Phi, LLaMA, and Mistral, spanning various sizes, architectures, and training approaches. Central to NLD-LLM is a comprehensive prompt design strategy that includes standardized formatting, clear task guidance, and NLD prompting, ensuring fair and consistent evaluation. Additionally, we apply an iterative refinement process to improve output quality and assess each model's adaptability. Using semantic and structural metrics, our analysis demonstrates that prompt engineering significantly impacts model effectiveness, with smaller models often performing competitively when supported by well-crafted prompts.

Title: To model human linguistic prediction, make LLMs less superhuman

Authors: Byung-Doh Oh, Tal Linzen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05141
Pdf URL: https://arxiv.org/pdf/2510.05141
Copy Paste: [[2510.05141]] To model human linguistic prediction, make LLMs less superhuman(https://arxiv.org/abs/2510.05141)
Keywords: language model, llm
Abstract: When people listen to or read a sentence, they actively make predictions about upcoming words: words that are less predictable are generally read more slowly than predictable ones. The success of large language models (LLMs), which, like humans, make predictions about upcoming words, has motivated exploring the use of these models as cognitive models of human linguistic prediction. Surprisingly, in the last few years, as language models have become better at predicting the next word, their ability to predict human reading behavior has declined. This is because LLMs are able to predict upcoming words much better than people can, leading them to predict lower processing difficulty in reading than observed in human experiments; in other words, mainstream LLMs are 'superhuman' as models of language comprehension. In this position paper, we argue that LLMs' superhumanness is primarily driven by two factors: compared to humans, LLMs have much stronger long-term memory for facts and training examples, and they have much better short-term memory for previous words in the text. We advocate for creating models that have human-like long-term and short-term memory, and outline some possible directions for achieving this goal. Finally, we argue that currently available human data is insufficient to measure progress towards this goal, and outline human experiments that can address this gap.

Title: Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models

Authors: Xin Wang, Anshu Raj, Matthew Luebbe, Haiming Wen, Shuozhi Xu, Kun Lu
Subjects: cs.CL, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2510.05142
Pdf URL: https://arxiv.org/pdf/2510.05142
Copy Paste: [[2510.05142]] Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models(https://arxiv.org/abs/2510.05142)
Keywords: language model
Abstract: Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, thereby posing challenges for building comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores around 0.96. Compared with single-pass extraction without source tracking, our approach improved F1 scores in the microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced the number of missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
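The miss rates quoted in the abstract follow directly from the reported counts (49 and 13 missed out of 396 materials). A minimal sketch of that arithmetic, where the function name `miss_rate` is only an illustrative helper and not part of the authors' pipeline:

```python
def miss_rate(missed: int, total: int) -> float:
    """Return the share of materials that were missed, as a percentage."""
    return 100.0 * missed / total


# Counts taken from the abstract: 396 materials across 100 articles.
single_pass = miss_rate(49, 396)   # single-pass extraction, no source tracking
multi_stage = miss_rate(13, 396)   # multi-stage pipeline with source tracking

print(f"{single_pass:.1f}% -> {multi_stage:.1f}%")
```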

Title: Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs

Authors: Qi Li, Runpeng Yu, Haiquan Lu, Xinchao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05148
Pdf URL: https://arxiv.org/pdf/2510.05148
Copy Paste: [[2510.05148]] Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs(https://arxiv.org/abs/2510.05148)
Keywords: language model, llm
Abstract: Discrete Diffusion Large Language Models (dLLMs) have recently emerged as a competitive paradigm for non-autoregressive language modeling. Their distinctive decoding mechanism enables faster inference speed and strong performance in code generation and mathematical tasks. In this work, we show that the decoding mechanism of dLLMs not only enhances model utility but also can be used as a powerful tool for model attribution. A key challenge in this problem lies in the diversity of attribution scenarios, including distinguishing between different models as well as between different checkpoints or backups of the same model. To ensure broad applicability, we identify two fundamental problems: what information to extract from the decoding trajectory, and how to utilize it effectively. We first observe that relying directly on per-step model confidence yields poor performance. This is mainly due to the bidirectional decoding nature of dLLMs: each newly decoded token influences the confidence of other decoded tokens, making model confidence highly redundant and washing out structural signal regarding decoding order or dependencies. To overcome this, we propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors. Furthermore, to make full use of the extracted structural information during attribution, we propose Gaussian-Trajectory Attribution (GTA), where we fit a cell-wise Gaussian distribution at each decoding position for each target model, and define the likelihood of a trajectory as the attribution score: if a trajectory exhibits higher log-likelihood under the distribution of a specific model, it is more likely to have been generated by that model. Extensive experiments under different settings validate the utility of our methods.
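The attribution step described in the abstract (fit a Gaussian per cell for each candidate model, then score a trajectory by its log-likelihood) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each decoding trajectory has already been reduced to a fixed-length vector of numbers (a flattened stand-in for the Directed Decoding Map), and the names `fit_cellwise_gaussian`, `log_likelihood`, `attribute`, and the toy data are all hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_cellwise_gaussian(trajectories, min_var=1e-6):
    """Fit an independent (mean, variance) pair for every cell position."""
    columns = zip(*trajectories)  # one column per cell
    return [(mean(col), max(pvariance(col), min_var)) for col in columns]


def log_likelihood(trajectory, params):
    """Sum of per-cell Gaussian log-densities for one trajectory."""
    total = 0.0
    for x, (mu, var) in zip(trajectory, params):
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def attribute(trajectory, fitted_models):
    """Pick the candidate model with the highest log-likelihood."""
    return max(fitted_models, key=lambda name: log_likelihood(trajectory, fitted_models[name]))


# Toy feature vectors (made up for illustration only).
fitted = {
    "model_a": fit_cellwise_gaussian([[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.0, 0.1, 0.2]]),
    "model_b": fit_cellwise_gaussian([[0.9, 0.8, 0.7], [1.0, 0.9, 0.8], [0.8, 0.7, 0.6]]),
}
predicted = attribute([0.15, 0.25, 0.35], fitted)
print(predicted)
```

The variance floor `min_var` is a design choice of this sketch; it keeps the log-density finite when a cell has no spread in the fitting data.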

Title: Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Authors: Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, Eng Siong Chng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05150
Pdf URL: https://arxiv.org/pdf/2510.05150
Copy Paste: [[2510.05150]] Chronological Thinking in Full-Duplex Spoken Dialogue Language Models(https://arxiv.org/abs/2510.05150)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction, and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, and is purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

Title: Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA

Authors: Prudence Djagba, Abdelkader Y. Saley
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05151
Pdf URL: https://arxiv.org/pdf/2510.05151
Copy Paste: [[2510.05151]] Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA(https://arxiv.org/abs/2510.05151)
Keywords: language model, llm
Abstract: This research explores the strengths and weaknesses of domain-adapted Large Language Models (LLMs) in the context of financial natural language processing (NLP). The analysis centers on FinMA, a model created within the PIXIU framework, which is evaluated for its performance in specialized financial tasks. Recognizing the critical demands of accuracy, reliability, and domain adaptation in financial applications, this study examines FinMA's model architecture, its instruction tuning process utilizing the Financial Instruction Tuning (FIT) dataset, and its evaluation under the FLARE benchmark. Findings indicate that FinMA performs well in sentiment analysis and classification, but faces notable challenges in tasks involving numerical reasoning, entity recognition, and summarization. This work aims to advance the understanding of how financial LLMs can be effectively designed and evaluated to assist in finance-related decision-making processes.

Title: A Single Character can Make or Break Your LLM Evals

Authors: Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05152
Pdf URL: https://arxiv.org/pdf/2510.05152
Copy Paste: [[2510.05152]] A Single Character can Make or Break Your LLM Evals(https://arxiv.org/abs/2510.05152)
Keywords: language model, llm, prompt
Abstract: Common large language model (LLM) evaluations rely on demonstration examples to steer models' responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma? new line? semi-colon? hashtag? etc.? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by only modifying the single character separating examples. We find that LLMs' brittleness pervades topics and model families, and doesn't improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.

Title: Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Authors: Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05154
Pdf URL: https://arxiv.org/pdf/2510.05154
Copy Paste: [[2510.05154]] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs(https://arxiv.org/abs/2510.05154)
Keywords: llm
Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown as a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgements compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

Title: A novel hallucination classification framework

Authors: Maksym Zavhorodnii, Dmytro Dehtiarov, Anna Konovalenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05189
Pdf URL: https://arxiv.org/pdf/2510.05189
Copy Paste: [[2510.05189]] A novel hallucination classification framework(https://arxiv.org/abs/2510.05189)
Keywords: language model, llm, hallucination, prompt
Abstract: This work introduces a novel methodology for the automatic detection of hallucinations generated during large language model (LLM) inference. The proposed approach is based on a systematic taxonomy and controlled reproduction of diverse hallucination types through prompt engineering. A dedicated hallucination dataset is subsequently mapped into a vector space using an embedding model and analyzed with unsupervised learning techniques in a reduced-dimensional representation of hallucinations with veridical responses. Quantitative evaluation of inter-centroid distances reveals a consistent correlation between the severity of informational distortion in hallucinations and their spatial divergence from the cluster of correct outputs. These findings provide theoretical and empirical evidence that even simple classification algorithms can reliably distinguish hallucinations from accurate responses within a single LLM, thereby offering a lightweight yet effective framework for improving model reliability.

Title: Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Authors: Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05251
Pdf URL: https://arxiv.org/pdf/2510.05251
Copy Paste: [[2510.05251]] Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning(https://arxiv.org/abs/2510.05251)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
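The explore-at-the-beginning, exploit-at-the-end idea can be illustrated with a temperature that decays over the generation steps. The sketch below is only an assumption-laden toy: the abstract does not state the shape of the schedule or its endpoints, so the linear decay, the values `t_start=1.2` and `t_end=0.6`, and the function names are all hypothetical.

```python
import math
import random


def annealed_temperature(step, total_steps, t_start=1.2, t_end=0.6):
    """Linearly interpolate from t_start (first token) to t_end (last token)."""
    if total_steps <= 1:
        return t_end
    frac = min(step, total_steps - 1) / (total_steps - 1)
    return t_start + frac * (t_end - t_start)


def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, 0.1]  # stand-in for a model's next-token logits
temps = [annealed_temperature(t, 8) for t in range(8)]
tokens = [sample_token(toy_logits, temp, rng) for temp in temps]
print([round(t, 2) for t in temps])
```

Early steps use a higher temperature and so spread probability mass more widely; later steps sharpen the distribution toward the most likely tokens.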

Title: Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Authors: Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, JinYeong Bak, Keisuke Sakaguchi, Tanmoy Chakraborty, Yuki Arase, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05291
Pdf URL: https://arxiv.org/pdf/2510.05291
Copy Paste: [[2510.05291]] Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages(https://arxiv.org/abs/2510.05291)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear if such biases also manifest in different non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with the specific Asian or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show a struggle by LLMs at cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data. We further observe that different LLM families hold their distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.

Title: RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Authors: Yining She, Daniel W. Peterson, Marianne Menglin Liu, Vikas Upadhyay, Mohammad Hossein Chaghazardi, Eunsuk Kang, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05310
Pdf URL: https://arxiv.org/pdf/2510.05310
Copy Paste: [[2510.05310]] RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts(https://arxiv.org/abs/2510.05310)
Keywords: language model, gpt, llm, prompt
Abstract: With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested only bring minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.

Title: WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Authors: Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05336
Pdf URL: https://arxiv.org/pdf/2510.05336
Copy Paste: [[2510.05336]] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives(https://arxiv.org/abs/2510.05336)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at this https URL.

Title: Residualized Similarity for Faithfully Explainable Authorship Verification

Authors: Peter Zeng, Pegah Alipoormolabashi, Jihu Mun, Gourab Dey, Nikita Soni, Niranjan Balasubramanian, Owen Rambow, H. Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05362
Pdf URL: https://arxiv.org/pdf/2510.05362
Copy Paste: [[2510.05362]] Residualized Similarity for Faithfully Explainable Authorship Verification(https://arxiv.org/abs/2510.05362)
Keywords: llm
Abstract: Responsible use of Authorship Verification (AV) systems requires not only high accuracy but also interpretable solutions. More importantly, for systems used to make decisions with real-world consequences, the model's prediction must be explainable using interpretable features that can be traced to the original texts. Neural methods achieve high accuracies, but their representations lack direct interpretability. Furthermore, LLM predictions cannot be explained faithfully -- if there is an explanation given for a prediction, it doesn't represent the reasoning process behind the model's prediction. In this paper, we introduce Residualized Similarity (RS), a novel method that supplements systems using interpretable features with a neural network to improve their performance while maintaining interpretability. Authorship verification is fundamentally a similarity task, where the goal is to measure how alike two documents are. The key idea is to use the neural network to predict a similarity residual, i.e. the error in the similarity predicted by the interpretable system. Our evaluation across four datasets shows that not only can we match the performance of state-of-the-art authorship verification models, but we can also show how and to what degree the final prediction is faithful and interpretable.
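The residual idea in the abstract above can be illustrated with a minimal sketch. It assumes the interpretable system scores a document pair by cosine similarity over hand-crafted features and that a separate model supplies a correction; the feature vectors, the fixed residual value, and the clipping to [-1, 1] are all hypothetical choices for illustration, not details taken from the paper.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def residualized_similarity(interpretable_sim, predicted_residual):
    """Add a learned correction to the interpretable score (clipping is an assumption)."""
    return max(-1.0, min(1.0, interpretable_sim + predicted_residual))

# Hypothetical interpretable feature vectors for two documents.
doc_a = [0.2, 0.5, 0.1]
doc_b = [0.1, 0.6, 0.2]
base = cosine(doc_a, doc_b)

# Placeholder value; in the described setup a neural network would predict this.
residual = -0.05
final = residualized_similarity(base, residual)

# Rough share of the final score that comes from the interpretable system.
interpretable_share = abs(base) / (abs(base) + abs(residual))
```

Because the final score is the interpretable score plus a correction, the size of the correction gives one simple way to express how much of a prediction is traceable to the interpretable features.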

Title: The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures

Authors: Alexander M. Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05364
Pdf URL: https://arxiv.org/pdf/2510.05364
Copy Paste: [[2510.05364]] The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures(https://arxiv.org/abs/2510.05364)
Keywords: language model
Abstract: Transformers have dominated sequence processing tasks for the past seven years -- most notably language modeling. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged.

Title: Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

Authors: Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05381
Pdf URL: https://arxiv.org/pdf/2510.05381
Copy Paste: [[2510.05381]] Context Length Alone Hurts LLM Performance Despite Perfect Retrieval(https://arxiv.org/abs/2510.05381)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) often fail to scale their performance on long-context tasks in line with the context lengths they support. This gap is commonly attributed to retrieval failures -- the models' inability to identify relevant information in the long inputs. Accordingly, recent efforts often focus on evaluating and improving LLMs' retrieval performance: if retrieval is perfect, a model should, in principle, perform just as well on a long input as it does on a short one -- or should it? This paper presents findings that the answer to this question may be negative. Our systematic experiments across 5 open- and closed-source LLMs on math, question answering, and coding tasks reveal that, even when models can perfectly retrieve all relevant information, their performance still degrades substantially (13.9%--85%) as input length increases, even though the length remains well within the models' claimed limits. This failure occurs even when the irrelevant tokens are replaced with minimally distracting whitespace, and, more surprisingly, when they are all masked and the models are forced to attend only to the relevant tokens. A similar performance drop is observed when all relevant evidence is placed immediately before the question. Our findings reveal a previously-unrealized limitation: the sheer length of the input alone can hurt LLM performance, independent of retrieval quality and without any distraction. They motivate our simple, model-agnostic mitigation strategy that transforms a long-context task into a short-context one by prompting the model to recite the retrieved evidence before attempting to solve the problem. On RULER, we observe a consistent improvement of GPT-4o up to 4% on an already strong baseline.
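The recite-then-solve mitigation described above might look roughly like the two-step sketch below. The prompt wording, the `recite_then_solve` helper, and the `fake_llm` stub are assumptions made for illustration; the abstract does not give the exact prompts, and a real setup would pass an actual model client as `llm`.

```python
from typing import Callable, List

def recite_then_solve(llm: Callable[[str], str], long_context: str, question: str) -> str:
    """Turn a long-context task into a short-context one in two calls."""
    # Step 1: ask the model to copy out only the evidence it needs.
    recite_prompt = (
        f"{long_context}\n\nQuestion: {question}\n"
        "Recite the passages from the text above that are relevant to the question."
    )
    evidence = llm(recite_prompt)
    # Step 2: solve from the recited evidence alone, which is a much shorter input.
    solve_prompt = f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    return llm(solve_prompt)

# Stand-in for a real model call so the sketch runs offline (hypothetical).
calls: List[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "x = 3" if "Recite" in prompt else "3"

long_context = "filler " * 1000 + "x = 3\n"
answer = recite_then_solve(fake_llm, long_context, "What is x?")
```

The second prompt contains only the recited evidence and the question, so its length no longer depends on the size of the original input.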

Title: Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care

Authors: Junyi Fan, Li Sun, Negin Ashrafi, Kamiar Alaei, Maryam Pishgar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05410
Pdf URL: https://arxiv.org/pdf/2510.05410
Copy Paste: [[2510.05410]] Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care(https://arxiv.org/abs/2510.05410)
Keywords: language model, gpt
Abstract: Nursing documentation in intensive care units (ICUs) provides essential clinical intelligence but often suffers from inconsistent terminology, informal styles, and lack of standardization, challenges that are particularly critical in heart failure care. This study applies Direct Preference Optimization (DPO) to adapt Mistral-7B, a locally deployable language model, using 8,838 heart failure nursing notes from the MIMIC-III database and 21,210 preference pairs derived from expert-verified GPT outputs, model generations, and original notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert qualitative assessments demonstrates that DPO markedly enhances documentation quality. Specifically, BLEU increased by 84% (0.173 to 0.318), BERTScore improved by 7.6% (0.828 to 0.891), and expert ratings rose across accuracy (+14.4 points), completeness (+14.5 points), logical consistency (+14.1 points), readability (+11.1 points), and structural clarity (+6.0 points). These results indicate that DPO can align lightweight clinical language models with expert standards, supporting privacy-preserving, AI-assisted documentation within electronic health record systems to reduce administrative burden and improve ICU patient safety.
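For readers unfamiliar with DPO, the per-pair objective can be sketched as follows. This is the standard DPO loss on sequence log-probabilities, not code from the study; the function name, the example log-probabilities, and `beta=0.1` are assumptions chosen for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen note over the rejected one,
    # relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# With identical policy and reference models the loss is log(2).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)

# Hypothetical log-probabilities where the policy already favors the chosen note.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below log(2) as soon as the policy ranks the preferred output above the rejected one more strongly than the reference model does.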

Title: A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

Authors: Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Haifeng Wang, Minghui Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05414
Pdf URL: https://arxiv.org/pdf/2510.05414
Copy Paste: [[2510.05414]] A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis(https://arxiv.org/abs/2510.05414)
Keywords: language model, gpt, llm, chat, agent
Abstract: Large language models (LLMs) have recently been used to empower autonomous agents in engineering, significantly improving automation and efficiency in labor-intensive workflows. However, their potential remains underexplored in structural engineering, particularly for finite element modeling tasks requiring geometric modeling, complex reasoning, and domain knowledge. To bridge this gap, this paper develops an LLM-based multi-agent system to automate finite element modeling of 2D frames. The system decomposes structural analysis into subtasks, each managed by a specialized agent powered by the lightweight Llama-3.3 70B Instruct model. The workflow begins with a Problem Analysis Agent, which extracts geometry, boundary, and material parameters from the user input. Next, a Geometry Agent incrementally derives node coordinates and element connectivity by applying expert-defined rules. These structured outputs are converted into executable OpenSeesPy code by a Translation Agent and refined by a Model Validation Agent through consistency checks. Then, a Load Agent applies load conditions to the assembled structural model. Experimental evaluations on 20 benchmark problems demonstrate that the system achieves accuracy over 80% in most cases across 10 repeated trials, outperforming Gemini-2.5 Pro and ChatGPT-4o models.

Title: Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Authors: Yoo Yongmin, Zhang Xu, Cao Longbing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05431
Pdf URL: https://arxiv.org/pdf/2510.05431
Copy Paste: [[2510.05431]] Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification(https://arxiv.org/abs/2510.05431)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly generate natural language rationales to enhance interpretability, but these often contain logical errors, label mismatches, and domain-specific misalignments. Directly using such rationales as supervision risks propagating noise and undermining training stability. To address this challenge, we introduce Self-Filtered Distillation, a framework specifically tailored for patent classification, which treats LLM-generated rationales as trust signals rather than ground-truth supervision. The framework employs selective distillation guided by three unsupervised trust metrics: (1) Self-Consistency, which measures the stability of LLM-generated rationales across multiple generations; (2) Class Entailment Alignment, which assesses semantic coherence with patent-specific class definitions; and (3) LLM Agreement Scoring, which validates rationale-label plausibility. These metrics are integrated into a unified trust score that primarily weights training samples while optionally filtering out extremely low-trust cases, enabling reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used benchmark for patent classification, show that our method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability, establishing a reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.
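A minimal sketch of trust-weighted training with optional filtering, as described above, is given below. Combining the three metrics by a plain average, the `min_trust` cutoff of 0.2, and the example numbers are all assumptions; the abstract does not state how the unified score is formed.

```python
def trust_score(self_consistency, entailment, agreement):
    """Unified trust score; a simple average is assumed here."""
    return (self_consistency + entailment + agreement) / 3.0

def trust_weighted_loss(per_sample_losses, trust_scores, min_trust=0.2):
    """Weight each sample's loss by its trust and drop extremely low-trust samples."""
    kept = [(loss, t) for loss, t in zip(per_sample_losses, trust_scores) if t >= min_trust]
    total_weight = sum(t for _, t in kept)
    if total_weight == 0:
        return 0.0
    return sum(loss * t for loss, t in kept) / total_weight

# Hypothetical per-sample losses and trust scores for three training examples.
losses = [1.0, 2.0, 4.0]
trusts = [0.9, 0.5, 0.1]  # the last sample falls below min_trust and is filtered out
batch_loss = trust_weighted_loss(losses, trusts)
```

In this sketch the high-loss third sample contributes nothing because its trust is below the cutoff, while the remaining samples are weighted in proportion to their trust.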

Title: SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?

Authors: Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05444
Pdf URL: https://arxiv.org/pdf/2510.05444
Copy Paste: [[2510.05444]] SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?(https://arxiv.org/abs/2510.05444)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human-LLM conversations on two interactive tasks -- math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman's $\rho$ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.
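Since the headline result above is a Spearman correlation between simulator-based and human assistant ratings, a small self-contained sketch of that statistic may help. The rating lists are made-up numbers, and the helper names are hypothetical; ties are handled with average ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical ratings of five assistants by humans and by a user simulator.
human_ratings = [4.5, 3.0, 2.0, 5.0, 1.0]
simulator_ratings = [4.0, 3.5, 1.5, 4.8, 2.0]
rho = spearman_rho(human_ratings, simulator_ratings)
```

Only the ordering of the assistants matters here, which is why the statistic suits comparing two raters that may use different scales.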

Title: AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

Authors: Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, Yanfang Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05445
Pdf URL: https://arxiv.org/pdf/2510.05445
Copy Paste: [[2510.05445]] AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering(https://arxiv.org/abs/2510.05445)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) and agent-based frameworks have advanced rapidly, enabling diverse applications. Yet, with the proliferation of models and agentic strategies, practitioners face substantial uncertainty in selecting the best configuration for a downstream task. Prior studies show that different agents and backbones exhibit complementary strengths, and that larger models are not always superior, underscoring the need for adaptive routing mechanisms. Existing approaches to agent routing, however, often emphasize cost efficiency while overlooking the fine-grained contextual and relational structure inherent in QA tasks. In this paper, we propose AgentRouter, a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals. Specifically, we convert each QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, and then train a heterogeneous graph neural network (GNN) to propagate information across node types and produce task-aware routing distributions over agents. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents. Extensive experiments demonstrate that our framework consistently outperforms single-agent and ensemble baselines, while generalizing across benchmarks and LLM backbones. These results highlight the effectiveness and robustness of graph-supervised multi-agent routing for question answering.
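The final step described above, turning a routing distribution over agents into one answer by weighted aggregation, can be sketched as follows. The per-agent scores stand in for the output of the trained GNN, which is not reproduced here; the score values, answers, and function names are hypothetical.

```python
import math
from collections import defaultdict

def softmax(scores):
    """Convert raw per-agent scores into a routing distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(answers, routing_probs):
    """Pick the answer with the largest total routing weight."""
    votes = defaultdict(float)
    for answer, prob in zip(answers, routing_probs):
        votes[answer] += prob
    return max(votes, key=votes.get)

# Hypothetical scores for three agents on one question (a GNN would produce these).
agent_scores = [1.0, 0.8, 0.8]
agent_answers = ["B", "A", "A"]
routing = softmax(agent_scores)
final_answer = aggregate(agent_answers, routing)
```

In this example the top-scored agent answers "B", but the two slightly lower-scored agents agree on "A" and together carry more routing weight, so the aggregate differs from simply trusting the single best agent.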

Title: SocialNLI: A Dialogue-Centric Social Inference Dataset

Authors: Akhil Deo, Kate Sanders, Benjamin Van Durme
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05458
Pdf URL: https://arxiv.org/pdf/2510.05458
Copy Paste: [[2510.05458]] SocialNLI: A Dialogue-Centric Social Inference Dataset(https://arxiv.org/abs/2510.05458)
Keywords: llm
Abstract: Making theory-of-mind inferences from human dialogue is a strong indicator of a model's underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) -- the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.

Title: Language Model as Planner and Formalizer under Constraints

Authors: Cassie Huang, Stuti Mohan, Ziyi Yang, Stefanie Tellex, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05486
Pdf URL: https://arxiv.org/pdf/2510.05486
Copy Paste: [[2510.05486]] Language Model as Planner and Formalizer under Constraints(https://arxiv.org/abs/2510.05486)
Keywords: language model, llm
Abstract: LLMs have been widely used in planning, either as planners to generate action sequences end-to-end, or as formalizers to represent the planning domain and problem in a formal language that can derive plans deterministically. However, both lines of work rely on standard benchmarks that only include generic and simplistic environmental specifications, leading to potential overestimation of the planning ability of LLMs and safety concerns in downstream tasks. We bridge this gap by augmenting widely used planning benchmarks with manually annotated, fine-grained, and rich natural language constraints spanning four formally defined categories. Over 4 state-of-the-art reasoning LLMs, 3 formal languages, 5 methods, and 4 datasets, we show that the introduction of constraints not only consistently halves performance, but also significantly challenges robustness to problem complexity and lexical shift.

Title: LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

Authors: Zhoutong Fu, Yihan Cao, Yi-Lin Chen, Aman Lunia, Liming Dong, Neha Saraf, Ruijie Jiang, Yun Dai, Qingquan Song, Tan Wang, Guoyao Li, Derek Koh, Haichao Wei, Zhipeng Wang, Aman Gupta, Chengming Jiang, Jianqiang Shen, Liangjie Hong, Wenjing Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05490
Pdf URL: https://arxiv.org/pdf/2510.05490
Copy Paste: [[2510.05490]] LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation(https://arxiv.org/abs/2510.05490)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved strong performance across a wide range of natural language processing tasks. However, deploying LLMs at scale for domain-specific applications, such as job-person fit and explanation in job seeking platforms, introduces distinct challenges. At LinkedIn, the job-person fit task requires analyzing a candidate's public profile against job requirements to produce both a fit assessment and a detailed explanation. Directly applying open-source or finetuned LLMs to this task often fails to yield high-quality, actionable feedback due to the complexity of the domain and the need for structured outputs. Moreover, the large size of these models leads to high inference latency and limits scalability, making them unsuitable for online use. To address these challenges, we introduce LANTERN, a novel LLM knowledge distillation framework tailored specifically for job-person fit tasks. LANTERN involves modeling over multiple objectives, an encoder model for classification purpose, and a decoder model for explanation purpose. To better distill the knowledge from a strong black-box teacher model to multiple downstream models, LANTERN incorporates multi-level knowledge distillation that integrates both data and logit-level insights. In addition to introducing the knowledge distillation framework, we share our insights on post-training techniques and prompt engineering, both of which are crucial for successfully adapting LLMs to domain-specific downstream tasks. Extensive experimental results demonstrate that LANTERN significantly improves task-specific metrics for both job-person fit and explanation. Online evaluations further confirm its effectiveness, showing measurable gains in job seeker engagement, including a 0.24% increase in apply rate and a 0.28% increase in qualified applications.

Title: Prototype-Based Dynamic Steering for Large Language Models

Authors: Ceyhun Efe Kayan, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05498
Pdf URL: https://arxiv.org/pdf/2510.05498
Copy Paste: [[2510.05498]] Prototype-Based Dynamic Steering for Large Language Models(https://arxiv.org/abs/2510.05498)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Despite impressive breadth, LLMs still rely on explicit reasoning instructions or static, one-size-fits-all steering methods, leaving a gap for adaptive, instruction-free reasoning amplification. We present Prototype-Based Dynamic Steering (PDS), a test-time method that amplifies large language model (LLM) reasoning without adding or altering instructions. We introduce "reasoning prototypes" by clustering activation differences between Chain-of-Thought (CoT) and neutral prompts. At inference, an input's hidden state is projected onto these prototypes to form an instance-specific steering vector. Evaluated on GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently improves accuracy without fine-tuning or prompt engineering. Notably, the gains persist even when CoT is explicitly suppressed to improve cost-efficiency, indicating that the intervention strengthens latent reasoning processes rather than inducing a superficial behavioral shift. These results position dynamic, prototype-guided steering as a lightweight alternative to training-time approaches for enhancing LLM reasoning.
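One way to read the projection step above is sketched below. It assumes the prototypes have already been obtained (for example as cluster centroids of CoT-minus-neutral activation differences) and forms the steering vector as the sum of the hidden state's projections onto each unit-normalized prototype. That specific formula, the scaling factor `alpha`, and the toy vectors are assumptions; the abstract does not spell out the exact construction.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def steering_vector(hidden, prototypes):
    """Sum of the hidden state's projections onto each prototype direction."""
    vec = [0.0] * len(hidden)
    for proto in prototypes:
        unit = normalize(proto)
        coeff = dot(hidden, unit)  # how strongly this input aligns with the prototype
        vec = [s + coeff * u for s, u in zip(vec, unit)]
    return vec

def apply_steering(hidden, prototypes, alpha=0.5):
    """Add the instance-specific steering vector to the hidden state."""
    steer = steering_vector(hidden, prototypes)
    return [h + alpha * s for h, s in zip(hidden, steer)]

# Toy 3-dimensional hidden state and two hypothetical reasoning prototypes.
hidden_state = [1.0, 2.0, 0.0]
reasoning_prototypes = [[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
steered = apply_steering(hidden_state, reasoning_prototypes)
```

Because the coefficients depend on the input's own hidden state, two different inputs receive different steering vectors from the same set of prototypes, which is what distinguishes this from a single static steering direction.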

Title: CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Authors: Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, Ruiming Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05520
Pdf URL: https://arxiv.org/pdf/2510.05520
Copy Paste: [[2510.05520]] CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension(https://arxiv.org/abs/2510.05520)
Keywords: language model, llm, agent
Abstract: Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget's Constructivist Theory, illuminating three traits of the agentic memory -- structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.

Title: KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

Authors: Kuangshi Ai, Jonathan A. Karr Jr, Meng Jiang, Nitesh V. Chawla, Chaoli Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.05524
Pdf URL: https://arxiv.org/pdf/2510.05524
Copy Paste: [[2510.05524]] KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance(https://arxiv.org/abs/2510.05524)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning.

Title: H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

Authors: Harshil Vejendla
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05529
Pdf URL: https://arxiv.org/pdf/2510.05529
Copy Paste: [[2510.05529]] H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference(https://arxiv.org/abs/2510.05529)
Keywords: language model, llm
Abstract: Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.

Title: On the Role of Difficult Prompts in Self-Play Preference Optimization

Authors: Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05534
Pdf URL: https://arxiv.org/pdf/2510.05534
Copy Paste: [[2510.05534]] On the Role of Difficult Prompts in Self-Play Preference Optimization(https://arxiv.org/abs/2510.05534)
Keywords: language model, llm, prompt
Abstract: Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses of a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially inferior self-play optimization performance in comparison to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as the model capacity increases, suggesting that difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
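The difficulty proxy and the selective removal step described in the abstract can be sketched as follows. The reward values, prompt names, and the `fraction` parameter are hypothetical; the abstract does not state which portion of prompts the authors remove or how rewards are scaled, so this only illustrates the general idea.

```python
from statistics import mean
from typing import Dict, List


def difficulty_ranking(rewards: Dict[str, List[float]]) -> List[str]:
    """Order prompts from hardest (lowest mean reward) to easiest."""
    return sorted(rewards, key=lambda p: mean(rewards[p]))


def drop_hardest(rewards: Dict[str, List[float]], fraction: float) -> List[str]:
    """Return the prompts kept after removing the hardest `fraction` of them."""
    ranked = difficulty_ranking(rewards)
    n_drop = int(len(ranked) * fraction)
    return ranked[n_drop:]


# Hypothetical reward-model scores for N = 4 sampled responses per prompt.
sampled_rewards = {
    "prompt_a": [0.9, 0.8, 0.7, 0.8],
    "prompt_b": [0.1, 0.2, 0.0, 0.1],
    "prompt_c": [0.5, 0.6, 0.4, 0.5],
    "prompt_d": [0.3, 0.2, 0.4, 0.3],
}
kept = drop_hardest(sampled_rewards, fraction=0.25)
```

Here `prompt_b` has the lowest mean reward, so it is treated as the hardest prompt and is the one removed at `fraction=0.25`.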

Title: Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

Authors: Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05544
Pdf URL: https://arxiv.org/pdf/2510.05544
Copy Paste: [[2510.05544]] Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM(https://arxiv.org/abs/2510.05544)
Keywords: language model, llm
Abstract: Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.
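The claim that one uniform tolerance leads to different ranks per layer can be illustrated with a small sketch. It assumes the singular values of each layer are already available and measures the plain relative Frobenius truncation error of the weights. The paper's bound is activation-based, so the error measure, the layer names, and the numbers below are stand-ins rather than the PGSVD procedure.

```python
import math
from typing import Dict, List


def rank_for_tolerance(singular_values: List[float], tol: float) -> int:
    """Smallest rank whose relative Frobenius truncation error is <= tol."""
    total = sum(s * s for s in singular_values)
    tail = total
    for rank, s in enumerate(sorted(singular_values, reverse=True), start=1):
        tail -= s * s  # energy left in the discarded singular values
        if math.sqrt(max(tail, 0.0) / total) <= tol:
            return rank
    return len(singular_values)


# Hypothetical singular-value spectra for two layers.
layers: Dict[str, List[float]] = {
    "attn_proj": [10.0, 1.0, 0.5, 0.1],  # fast decay
    "mlp_up": [4.0, 3.0, 2.0, 1.0],      # slow decay
}
ranks = {name: rank_for_tolerance(sv, tol=0.2) for name, sv in layers.items()}
```

With the same tolerance of 0.2, the fast-decaying spectrum is cut to rank 1 while the slow-decaying one keeps rank 3, which is the heterogeneous outcome the abstract refers to.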

Title: Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Authors: Chengzhi Liu, Yuzhe Yang, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yannan Xie, Peng Qi, Xin Eric Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05571
Pdf URL: https://arxiv.org/pdf/2510.05571
Copy Paste: [[2510.05571]] Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations(https://arxiv.org/abs/2510.05571)
Keywords: agent
Abstract: The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: "there is no way to improve it when you cannot evaluate it right". To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.

Title: Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

Authors: Dong Yan, Gaochen Wu, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05577
Pdf URL: https://arxiv.org/pdf/2510.05577
Copy Paste: [[2510.05577]] Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs(https://arxiv.org/abs/2510.05577)
Keywords: llm, agent
Abstract: Recent advancements in language agents have led to significant improvements in multi-hop reasoning tasks. However, existing approaches often struggle with handling open-domain problems, which require massive information retrieval due to their reliance on a fixed sequence of actions. To address this, we propose Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive strategies for information exploration in open-domain multi-hop reasoning tasks. Our approach begins by identifying key entities relevant to the problem, which serve as the initial nodes in the reasoning process. From these initial nodes, we then generate reasoning child nodes with the process being refined through a combination of historical error analysis and real-time feedback, which allows the framework to dynamically adjust and optimize its reasoning strategies. By integrating depth-first search with an innovative node generation technique, our framework adapts based on both prior error paths and concurrently generated nodes at the same hierarchical level. This dynamic strategy effectively expands the search space while ensuring the reasoning process systematically converges toward accurate solutions. Experimental results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and 7.25% respectively, highlighting its versatility and potential to enhance language agents in multi-hop reasoning tasks.

Title: A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

Authors: Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05608
Pdf URL: https://arxiv.org/pdf/2510.05608
Copy Paste: [[2510.05608]] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks(https://arxiv.org/abs/2510.05608)
Keywords: language model, llm, agent
Abstract: Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent's planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.

Title: MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

Authors: Wei-Chieh Huang, Cornelia Caragea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05611
Pdf URL: https://arxiv.org/pdf/2510.05611
Copy Paste: [[2510.05611]] MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction(https://arxiv.org/abs/2510.05611)
Keywords: language model, llm, agent
Abstract: Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce, as it infers latent attributes from multimodal data. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data and gaps in vision-text understanding. In this work, we introduce MADIAVE, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences. Through a series of debate rounds, agents verify and update each other's responses, thereby improving inference performance and robustness. Experiments on the ImplicitAVE dataset demonstrate that even a few rounds of debate significantly boost accuracy, especially for attributes with initially low performance. We systematically evaluate various debate configurations, including identical or different MLLM agents, and analyze how debate rounds affect convergence dynamics. Our findings highlight the potential of multi-agent debate strategies to address the limitations of single-agent approaches and offer a scalable solution for implicit AVE in multimodal e-commerce.

Title: Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models

Authors: Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05678
Pdf URL: https://arxiv.org/pdf/2510.05678
Copy Paste: [[2510.05678]] Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models(https://arxiv.org/abs/2510.05678)
Keywords: language model, llm, prompt
Abstract: While large language models (LLMs) exhibit strong multilingual abilities, their reliance on English as latent representations creates a translation barrier, where reasoning implicitly depends on internal translation into English. When this process fails, performance in non-English languages deteriorates sharply, limiting the inclusiveness of LLM-based applications. Existing cross-lingual in-context learning (X-ICL) methods primarily leverage monolingual demonstrations, often failing to mitigate this barrier and instead reinforcing it. In this work, we introduce code-switching in-context learning (CSICL), a simple yet effective prompting strategy that progressively transitions from a target language to English within demonstrations and instruction to facilitate their latent reasoning in English. By explicitly scaffolding the reasoning process through controlled code-switching, CSICL acts as an implicit linguistic bridge that enhances cross-lingual alignment and reduces reliance on the translation barrier. We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains. Our results demonstrate that CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target and unseen languages, respectively. The improvement is even more pronounced in low-resource settings, with gains of 14.7% in target and 5.3% in unseen languages. These findings establish code-switching as a principled and robust approach for overcoming the translation barrier during inference, moving LLMs toward more equitable and effective multilingual systems.

Title: DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

Authors: Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05691
Pdf URL: https://arxiv.org/pdf/2510.05691
Copy Paste: [[2510.05691]] DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision(https://arxiv.org/abs/2510.05691)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrates strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly 6×, providing an efficient solution for process-supervised RAG training. The code is available at this https URL.

Title: Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

Authors: Liza Fretel, Baptiste Cecconi, Laura Debisschop
Subjects: cs.CL, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2510.05744
Pdf URL: https://arxiv.org/pdf/2510.05744
Copy Paste: [[2510.05744]] Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities(https://arxiv.org/abs/2510.05744)
Keywords: language model, llm
Abstract: This ongoing work focuses on the development of a methodology for generating a multi-source mapping of astronomical observation facilities. To compare two entities, we compute scores with adaptable criteria and Natural Language Processing (NLP) techniques (Bag-of-Words approaches, sequential approaches, and surface approaches) to map entities extracted from eight semantic artifacts, including Wikidata and astronomy-oriented resources. We utilize every property available, such as labels, definitions, descriptions, external identifiers, and more domain-specific properties, such as the observation wavebands, spacecraft launch dates, funding agencies, etc. Finally, we use a Large Language Model (LLM) to accept or reject a mapping suggestion and provide a justification, ensuring the plausibility and FAIRness of the validated synonym pairs. The resulting mapping is composed of multi-source synonym sets providing only one standardized label per entity. Those mappings will be used to feed our Name Resolver API and will be integrated into the International Virtual Observatory Alliance (IVOA) Vocabularies and the OntoPortal-Astro platform.
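A minimal sketch of scoring a candidate pair of facility names with a bag-of-words measure and a surface measure is given below. The function names, the equal weighting, and the example labels are assumptions for illustration. The abstract does not describe the authors' actual criteria or how they combine scores, and the LLM validation step is not modeled here.

```python
from difflib import SequenceMatcher


def token_jaccard(a: str, b: str) -> float:
    """Bag-of-words overlap: shared tokens divided by all distinct tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def surface_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a: str, b: str, w_tokens: float = 0.5) -> float:
    """Weighted blend of the two scores; the weight is an arbitrary choice."""
    return w_tokens * token_jaccard(a, b) + (1 - w_tokens) * surface_ratio(a, b)


# Hypothetical labels as they might appear in two different sources.
same = match_score("Hubble Space Telescope", "hubble space telescope")
reordered = match_score("Hubble Space Telescope", "Space Telescope Hubble")
different = match_score("Hubble Space Telescope", "Very Large Array")
```

The token score is insensitive to word order while the surface score is not, so blending the two lets a reordered label still rank above an unrelated one.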

Title: Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

Authors: Rikuto Kotoge, Yuichi Sasaki
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2510.05799
Pdf URL: https://arxiv.org/pdf/2510.05799
Copy Paste: [[2510.05799]] Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech(https://arxiv.org/abs/2510.05799)
Keywords: language model, llm
Abstract: Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO that eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.

Title: EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Authors: Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05837
Pdf URL: https://arxiv.org/pdf/2510.05837
Copy Paste: [[2510.05837]] EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget(https://arxiv.org/abs/2510.05837)
Keywords: language model, llm
Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop of repeatedly sampling and rewarding dominant modes, which further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
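The sample-then-forget idea can be shown on a toy categorical policy. This is only an analogy under stated assumptions: EEPO applies an unlearning step to the model itself, whereas the hypothetical `suppress_sampled` helper below merely scales down explicit probabilities, and the `factor` value and response names are invented for illustration.

```python
from typing import Dict, Iterable


def suppress_sampled(probs: Dict[str, float], sampled: Iterable[str],
                     factor: float = 0.1) -> Dict[str, float]:
    """Scale down the probability of already-sampled responses, then renormalize."""
    seen = set(sampled)
    scaled = {r: (p * factor if r in seen else p) for r, p in probs.items()}
    total = sum(scaled.values())
    return {r: p / total for r, p in scaled.items()}


# Hypothetical policy dominated by one response mode.
policy = {"mode_a": 0.7, "mode_b": 0.2, "mode_c": 0.1}

# Stage one sampled the dominant mode; stage two draws from the suppressed policy.
stage_two = suppress_sampled(policy, sampled=["mode_a"])
```

After suppression the dominant mode no longer holds most of the probability mass, so second-stage samples are more likely to come from the other modes.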

Title: Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer

Authors: Maxence Lasbordes, Sinoué Gad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05846
Pdf URL: https://arxiv.org/pdf/2510.05846
Copy Paste: [[2510.05846]] Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer(https://arxiv.org/abs/2510.05846)
Keywords: language model, llm
Abstract: The landscape of Large Language Models (LLMs) remains predominantly English-centric, resulting in a significant performance gap for other major languages, such as French, especially in the context of Small Language Models (SLMs). Existing multilingual models demonstrate considerably lower performance in French compared to English, and research on efficient adaptation methods for French remains limited. To address this, we introduce Luth, a family of French-specialized SLMs: through targeted post-training on curated, high-quality French data, our models outperform all open-source counterparts of comparable size on multiple French benchmarks while retaining their original English capabilities. We further show that strategic model merging enhances performance in both languages, establishing Luth as a new state of the art for French SLMs and a robust baseline for future French-language research.

Title: DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Authors: Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05858
Pdf URL: https://arxiv.org/pdf/2510.05858
Copy Paste: [[2510.05858]] DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization(https://arxiv.org/abs/2510.05858)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains or conversational data that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.

Title: Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies

Authors: Luka Nenadic, David Rodriguez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05860
Pdf URL: https://arxiv.org/pdf/2510.05860
Copy Paste: [[2510.05860]] Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies(https://arxiv.org/abs/2510.05860)
Keywords: gpt, llm
Abstract: It has become increasingly challenging for firms to comply with a plethora of novel digital regulations. This is especially true for smaller businesses that often lack both the resources and know-how to draft complex legal documents. Instead of seeking costly legal advice from attorneys, firms may turn to cheaper alternative legal service providers such as automated contract generators. While these services have a long-standing presence, there is little empirical evidence on their prevalence and output quality. We address this gap in the context of a 2023 Swiss privacy law revision. To enable a systematic evaluation, we create and annotate a multilingual benchmark dataset that captures key compliance obligations under Swiss and EU privacy law. Using this dataset, we validate a novel GPT-5-based method for large-scale compliance assessment of privacy policies, allowing us to measure the impact of the revision. We observe compliance increases indicating an effect of the revision. Generators, explicitly referenced by 18% of local websites, are associated with substantially higher levels of compliance, with increases of up to 15 percentage points compared to privacy policies without generator use. These findings contribute to three debates: the potential of LLMs for cross-lingual legal analysis, the Brussels Effect of EU regulations, and, crucially, the role of automated tools in improving compliance and contractual quality.

Title: Revisiting Long-context Modeling from Context Denoising Perspective

Authors: Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05862
Pdf URL: https://arxiv.org/pdf/2510.05862
Copy Paste: [[2510.05862]] Revisiting Long-context Modeling from Context Denoising Perspective(https://arxiv.org/abs/2510.05862)
Keywords: gpt
Abstract: Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, that can mislead model attention. In this paper, we conduct a fine-grained analysis of the context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify the noise information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model can achieve performance (50.92) comparable to GPT-4o (51.00).
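For readers unfamiliar with integrated gradients, the standard attribution method can be sketched numerically on a small function. This is the generic technique with a midpoint Riemann sum, not the paper's IG score: the abstract does not say how the score is computed over context tokens, and the function `f`, its gradient, and the zero baseline are hypothetical choices.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(v: Sequence[float]) -> float:
    """Toy model: one interaction term and one linear term."""
    return v[0] * v[1] + 2.0 * v[2]


def grad_f(v: Sequence[float]) -> List[float]:
    """Analytic gradient of f."""
    return [v[1], v[0], 2.0]


x = [3.0, 4.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, the completeness property of integrated gradients, which is what makes the per-input scores usable for comparing how much each input contributes.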
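Abstract (translated from Chinese): Long-context models (LCMs) have shown great potential in processing long sequences, enabling many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research shows that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens, which can mislead model attention. In this paper, we conduct a fine-grained analysis of context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify noise information within the context. Our findings show that even simple mitigation of the detected context noise can substantially increase the model's attention to critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention to critical tokens while strengthening their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the advantages of CDT. Notably, when trained with CDT, an open-source 8B model achieves performance (50.92) comparable to GPT-4o (51.00).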
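The IG score named in this abstract builds on integrated gradients attribution. The abstract does not specify how the score is computed, so the following is only a minimal sketch of the general technique: it attributes the output of a toy model `tanh(w · x)` to its inputs. The toy model, the weights, and the all-zero baseline are illustrative assumptions, not details from the paper.

```python
import math

def toy_model(x, w):
    """Hypothetical scalar model: tanh of a weighted sum."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def toy_gradient(x, w):
    """Analytic gradient of toy_model with respect to x."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    d = 1.0 - math.tanh(s) ** 2
    return [d * wi for wi in w]

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(toy_gradient(point, w)):
            totals[i] += g
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0, -2.0, 0.5]
w = [0.3, 0.1, -0.4]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w)
```

A useful sanity check is the completeness property: the attributions should sum (approximately) to the difference between the model output at the input and at the baseline.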

Title: Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

Authors: Faeze Ghorbanpour, Alexander Fraser
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.05864
Pdf URL: https://arxiv.org/pdf/2510.05864
Copy Paste: [[2510.05864]] Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input(https://arxiv.org/abs/2510.05864)
Keywords: language model, llm, long context, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit. These findings provide the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.
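Abstract (translated from Chinese): Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. Although their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, and with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when the content is very sparse or dominant; recall decreases as context length increases; harmful sentences at the beginning are generally detected more reliably; and explicit content is recognized more readily than implicit content. These findings offer the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.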
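The experimental factors listed above (position and prevalence of harmful content within a longer context) can be sketched as a small prompt-construction helper. This is only an illustration under stated assumptions: the function name `build_context`, the placeholder sentences, and the choice to measure prevalence in sentences rather than tokens are all hypothetical and not taken from the paper.

```python
def build_context(benign, harmful, prevalence, position):
    """Mix placeholder 'harmful' sentences into a benign context.

    prevalence is the fraction of sentences that are harmful (an assumption;
    the paper reports prevalence as a fraction of the prompt).
    """
    if position not in ("beginning", "middle", "end"):
        raise ValueError("position must be 'beginning', 'middle', or 'end'")
    total = len(benign)
    n_harmful = min(max(1, round(prevalence * total)), len(harmful))
    kept = benign[: total - n_harmful]   # keep overall length constant
    injected = harmful[:n_harmful]
    if position == "beginning":
        return injected + kept
    if position == "end":
        return kept + injected
    mid = len(kept) // 2
    return kept[:mid] + injected + kept[mid:]

benign_sentences = [f"benign sentence {i}" for i in range(20)]
harmful_sentences = [f"<HARMFUL_{i}>" for i in range(10)]
context = build_context(benign_sentences, harmful_sentences, 0.25, "middle")
```

With 20 sentences and a prevalence of 0.25, five placeholders are injected, and the overall sentence count stays fixed so that context length and prevalence can be varied independently.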

Title: The fragility of "cultural tendencies" in LLMs

Authors: Kun Sun, Rong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05869
Pdf URL: https://arxiv.org/pdf/2510.05869
Copy Paste: [[2510.05869]] The fragility of "cultural tendencies" in LLMs(https://arxiv.org/abs/2510.05869)
Keywords: language model, gpt, llm, prompt
Abstract: In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large language models (LLMs), when prompted in different languages, display culturally specific tendencies. They report that the two models (i.e., GPT and ERNIE) respond in more interdependent and holistic ways when prompted in Chinese, and more independent and analytic ways when prompted in English. LSZ attribute these differences to deep-seated cultural patterns in the models, claiming that prompt language alone can induce substantial cultural shifts. While we acknowledge the empirical patterns they observed, we find their experiments, methods, and interpretations problematic. In this paper, we critically re-evaluate the methodology, theoretical framing, and conclusions of LSZ. We argue that the reported "cultural tendencies" are not stable traits but fragile artifacts of specific models and task design. To test this, we conducted targeted replications using a broader set of LLMs and a larger number of test items. Our results show that prompt language has minimal effect on outputs, challenging LSZ's claim that these models encode grounded cultural beliefs.
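Abstract (translated from Chinese): In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large language models (LLMs) display culturally specific tendencies when prompted in different languages. They report that the two models (i.e., GPT and ERNIE) respond in more interdependent and holistic ways when prompted in Chinese, and in more independent and analytic ways when prompted in English. LSZ attribute these differences to deep-seated cultural patterns in the models, claiming that prompt language alone can induce substantial cultural shifts. While we acknowledge the empirical patterns they observed, we find their experiments, methods, and interpretations problematic. In this paper, we critically re-evaluate the methodology, theoretical framing, and conclusions of LSZ. We argue that the reported "cultural tendencies" are not stable traits but fragile artifacts of specific models and task design. To test this, we conducted targeted replications using a broader set of LLMs and a larger number of test items. Our results show that prompt language has minimal effect on outputs, challenging LSZ's claim that these models encode grounded cultural beliefs.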

Title: Prompt reinforcing for long-term planning of large language models

Authors: Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.05921
Pdf URL: https://arxiv.org/pdf/2510.05921
Copy Paste: [[2510.05921]] Prompt reinforcing for long-term planning of large language models(https://arxiv.org/abs/2510.05921)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
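Abstract (translated from Chinese): Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior work on dialogue systems has shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning by modifying only the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our method shows significant improvement on multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research on reinforcement-learning-inspired, parameter-free optimisation methods.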

Title: Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens

Authors: Mai AlKhamissi, Yunze Xiao, Badr AlKhamissi, Mona Diab
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.05931
Pdf URL: https://arxiv.org/pdf/2510.05931
Copy Paste: [[2510.05931]] Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens(https://arxiv.org/abs/2510.05931)
Keywords: language model
Abstract: Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture the responses of the models to complex cultural situations.
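Abstract (translated from Chinese): Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts, which emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than in isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture how models respond to complex cultural situations.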

Title: EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Authors: Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05942
Pdf URL: https://arxiv.org/pdf/2510.05942
Copy Paste: [[2510.05942]] EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models(https://arxiv.org/abs/2510.05942)
Keywords: language model, llm, chain-of-thought
Abstract: We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
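Abstract (translated from Chinese): We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r of approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.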
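The alignment figures above are Pearson correlations between model scores and survey responses. As a minimal sketch of that computation, the example below implements Pearson's r in plain Python and reproduces the reported regional gap; the score lists are made-up illustrative values, not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-topic moral scores (model vs. survey), for illustration only.
model_scores = [0.9, 0.2, -0.4, 0.6, -0.8]
survey_scores = [0.8, 0.1, -0.5, 0.7, -0.6]
alignment = pearson_r(model_scores, survey_scores)

# Regional gap reported in the abstract: 0.82 (Western) vs. 0.61 (non-Western).
regional_gap = round(0.82 - 0.61, 2)
```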

Title: Probing the Difficulty Perception Mechanism of Large Language Models

Authors: Sunbowen Lee, Qingyu Yin, Chak Tou Leong, Jialiang Zhang, Yicheng Gong, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05969
Pdf URL: https://arxiv.org/pdf/2510.05969
Copy Paste: [[2510.05969]] Probing the Difficulty Perception Mechanism of Large Language Models(https://arxiv.org/abs/2510.05969)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed on complex reasoning tasks, yet little is known about their ability to internally evaluate problem difficulty, which is an essential capability for adaptive reasoning and efficient resource allocation. In this work, we investigate whether LLMs implicitly encode problem difficulty in their internal representations. Using a linear probe on the final-token representations of LLMs, we demonstrate that the difficulty level of math problems can be linearly modeled. We further locate the specific attention heads of the final Transformer layer: these attention heads have opposite activation patterns for simple and difficult problems, thus achieving perception of difficulty. Our ablation experiments prove the accuracy of the location. Crucially, our experiments provide practical support for using LLMs as automatic difficulty annotators, potentially substantially reducing reliance on costly human labeling in benchmark construction and curriculum learning. We also uncover that there is a significant difference in entropy and difficulty perception at the token level. Our study reveals that difficulty perception in LLMs is not only present but also structurally organized, offering new theoretical insights and practical directions for future research.
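Abstract (translated from Chinese): Large language models (LLMs) are increasingly deployed on complex reasoning tasks, yet little is known about their ability to internally evaluate problem difficulty, an essential capability for adaptive reasoning and efficient resource allocation. In this work, we investigate whether LLMs implicitly encode problem difficulty in their internal representations. Using a linear probe on the final-token representations of LLMs, we show that the difficulty level of math problems can be linearly modeled. We further locate specific attention heads in the final Transformer layer: these heads have opposite activation patterns for simple and difficult problems, thereby achieving perception of difficulty. Our ablation experiments confirm the accuracy of this localization. Crucially, our experiments provide practical support for using LLMs as automatic difficulty annotators, potentially reducing substantially the reliance on costly human labeling in benchmark construction and curriculum learning. We also find a significant difference between entropy and difficulty perception at the token level. Our study shows that difficulty perception in LLMs is not only present but also structurally organized, offering new theoretical insights and practical directions for future research.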
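The linear-probe idea above can be illustrated on synthetic data. The following is a minimal sketch, assuming 4-dimensional stand-ins for final-token representations and a difficulty score that is linear in them plus noise; the dimensions, weights, and the gradient-descent fit are illustrative assumptions, not the paper's setup.

```python
import random

def predict(weights, features):
    """Linear probe prediction: dot product of weights and a feature vector."""
    return sum(w * h for w, h in zip(weights, features))

def train_linear_probe(features, targets, lr=0.05, epochs=500):
    """Fit a bias-free linear probe with full-batch gradient descent on MSE."""
    dim, n = len(features[0]), len(features)
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for h, y in zip(features, targets):
            err = predict(weights, h) - y
            for i in range(dim):
                grad[i] += 2.0 * err * h[i] / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def mean_squared_error(weights, features, targets):
    return sum((predict(weights, h) - y) ** 2
               for h, y in zip(features, targets)) / len(features)

rng = random.Random(0)  # fixed seed for reproducibility
true_weights = [0.8, -0.5, 0.3, 0.0]  # hypothetical "difficulty direction"
reps = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
difficulty = [predict(true_weights, h) + rng.gauss(0, 0.1) for h in reps]

probe = train_linear_probe(reps, difficulty)
probe_mse = mean_squared_error(probe, reps, difficulty)
```

If difficulty is linearly decodable, the probe recovers a direction close to `true_weights` and its error approaches the noise floor; a poor fit would suggest the property is not linearly encoded in the chosen representations.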

Title: LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Authors: Periklis Mantenoglou, Rishi Hazra, Pedro Zuidberg Dos Martires, Luc De Raedt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05972
Pdf URL: https://arxiv.org/pdf/2510.05972
Copy Paste: [[2510.05972]] LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language(https://arxiv.org/abs/2510.05972)
Keywords: language model, gpt, llm
Abstract: Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon -- a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.
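Abstract (translated from Chinese): Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. To deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon, a natural language-based (Lexi) constrained (Con) planning benchmark consisting of a suite of environments that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility: the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This makes LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments show that the performance of state-of-the-art LLMs, including reasoning models such as GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.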
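To make "temporal constraints on the states" concrete, the sketch below checks a few constraint types over the sequence of states visited by a plan. The abstract does not list which constraint types LexiCon uses; the three operators here follow the style of PDDL3 state-trajectory constraints and are an assumption, as are the toy facts in the example.

```python
def always(trajectory, holds):
    """The condition holds in every state of the trajectory."""
    return all(holds(state) for state in trajectory)

def sometime(trajectory, holds):
    """The condition holds in at least one state of the trajectory."""
    return any(holds(state) for state in trajectory)

def at_most_once(trajectory, holds):
    """The condition is true during at most one contiguous interval."""
    intervals, previous = 0, False
    for state in trajectory:
        current = holds(state)
        if current and not previous:
            intervals += 1
        previous = current
    return intervals <= 1

# Hypothetical trajectory: each state is a set of true facts.
trajectory = [
    frozenset({"door_closed", "robot_in_hall"}),
    frozenset({"door_open", "robot_in_hall"}),
    frozenset({"door_open", "robot_in_lab"}),
    frozenset({"door_closed", "robot_in_lab"}),
]

never_in_vault = always(trajectory, lambda s: "robot_in_vault" not in s)
reaches_lab = sometime(trajectory, lambda s: "robot_in_lab" in s)
door_opened_once = at_most_once(trajectory, lambda s: "door_open" in s)
door_closed_once = at_most_once(trajectory, lambda s: "door_closed" in s)
```

A plan that reaches the goal but violates any such check would count as a failure in a constrained setting, which is what makes these tasks harder than their unconstrained counterparts.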

Title: Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

Authors: Timothy Pistotti, Jason Brown, Michael Witbrock
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06001
Pdf URL: https://arxiv.org/pdf/2510.06001
Copy Paste: [[2510.06001]] Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments(https://arxiv.org/abs/2510.06001)
Keywords: language model, gpt, llm
Abstract: Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.
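Abstract (translated from Chinese): Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions about the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to show that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.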
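The two metrics contrasted above can be sketched with a few lines of arithmetic. This is a hedged illustration: the definitions below (wh-effect as the surprisal difference between the with-filler and without-filler variants at a critical region, and DiD as the difference between two such effects) reflect a common reading of this line of work, and the probabilities are hypothetical values chosen for easy arithmetic, not model outputs.

```python
import math

def surprisal(probability):
    """Surprisal in bits of a region the model assigns this probability."""
    return -math.log2(probability)

def wh_effect(p_with_filler, p_without_filler):
    """Surprisal change at the critical region caused by adding a wh-filler."""
    return surprisal(p_with_filler) - surprisal(p_without_filler)

# Hypothetical probabilities at the critical region.
# Sentence contains a gap: a filler should make the gap LESS surprising.
effect_gap = wh_effect(p_with_filler=0.25, p_without_filler=0.0625)
# Sentence has no gap: a filler should make the filled position MORE surprising.
effect_no_gap = wh_effect(p_with_filler=0.125, p_without_filler=0.5)

# Difference-in-differences collapses both effects into a single number.
did = effect_no_gap - effect_gap
```

Inspecting `effect_gap` and `effect_no_gap` separately shows which condition drives a result, whereas the single `did` value hides that breakdown; this is one way to read the paper's argument about diagnostic transparency.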

Title: MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation

Authors: Qin Dong, Yuntian Tang, Heming Jia, Yunhang Shen, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Shaohui Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06005
Pdf URL: https://arxiv.org/pdf/2510.06005
Copy Paste: [[2510.06005]] MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation(https://arxiv.org/abs/2510.06005)
Keywords: language model
Abstract: Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models, which augments the transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA's reliance on a single down-projection matrix ($A$) creates a representational bottleneck, as this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift to focus on enriching the feature adaptation to improve the downstream task adaptation ability. We propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is asymmetrically shared across layers to ensure parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative improvement of 1.84%) with comparable learnable parameters of 0.52%.
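Abstract (translated from Chinese): Low-Rank Adaptation (LoRA) has become a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models; it augments the transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA's reliance on a single down-projection matrix ($A$) creates a representational bottleneck, because this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift toward enriching feature adaptation to improve downstream task adaptation. We propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a multi-$A$, single-$B$ structure in which the multi-$A$ expert ensemble is asymmetrically shared across layers to ensure parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming standard LoRA by 1.08 points (a relative improvement of 1.84%) with a comparable share of learnable parameters (0.52%).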
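The MMLU figures in this abstract can be cross-checked with simple arithmetic. The sketch below derives the implied LoRA baseline from the reported MASA accuracy and absolute gain, then recomputes the relative improvement; only the two reported numbers are taken from the abstract, and the variable names are illustrative.

```python
masa_accuracy = 59.62   # reported average MMLU accuracy (%)
absolute_gain = 1.08    # reported gain over standard LoRA (points)

# Implied LoRA baseline and the relative improvement over it.
lora_accuracy = masa_accuracy - absolute_gain
relative_gain_pct = 100.0 * absolute_gain / lora_accuracy
```

The implied baseline is 58.54%, and the relative gain works out to roughly 1.84%, consistent with the figure quoted in the abstract.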

Title: Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Authors: Timothy Pistotti, Jason Brown, Michael Witbrock
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06018
Pdf URL: https://arxiv.org/pdf/2510.06018
Copy Paste: [[2510.06018]] Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance(https://arxiv.org/abs/2510.06018)
Keywords: language model, gpt, llm
Abstract: Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically-informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.
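Abstract (translated from Chinese): Recent studies using Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 performs notably better on these refined PG stimuli than on the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competence.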

Title: CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

Authors: Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06039
Pdf URL: https://arxiv.org/pdf/2510.06039
Copy Paste: [[2510.06039]] CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs(https://arxiv.org/abs/2510.06039)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.
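Abstract (translated from Chinese): Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily because of the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing LLM benchmarks partially assess Chinese LLMs, they remain predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking the structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.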

Title: ASPO: Asymmetric Importance Sampling Policy Optimization

Authors: Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06062
Pdf URL: https://arxiv.org/pdf/2510.06062
Copy Paste: [[2510.06062]] ASPO: Asymmetric Importance Sampling Policy Optimization(https://arxiv.org/abs/2510.06062)
Keywords: language model, llm
Abstract: Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The authors release the code and models of ASPO.
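Abstract (translated from Chinese): Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying tokens that already have high probability. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative-advantage tokens. ASPO further adopts a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks show that, compared with strong GRPO-based baselines, ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The authors release the code and models of ASPO.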

Title: Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability

Authors: Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06084
Pdf URL: https://arxiv.org/pdf/2510.06084
Copy Paste: [[2510.06084]] Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability(https://arxiv.org/abs/2510.06084)
Keywords: language model
Abstract: Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. We characterize three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and document across three model families how current post-training can reduce these properties. In particular, we disambiguate between two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from >40 data sources and spanning >90 tasks requiring models to steer to and match diverse distributions ranging from varied human preferences to numerical distributions and more. We find that while current post-training techniques help elicit underlying capabilities and knowledge, they hurt models' ability to flexibly steer in-context. To mitigate these issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.
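Abstract (translated from Chinese): Language model post-training has improved instruction-following and performance on many downstream tasks, but it also carries an often-overlooked cost on tasks with many possible valid answers. We characterize three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and we document across three model families how current post-training can reduce these properties. In particular, we distinguish two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data-generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from >40 data sources and spanning >90 tasks that require models to steer to and match diverse distributions, ranging from varied human preferences to numerical distributions and more. We find that although current post-training techniques help elicit underlying capabilities and knowledge, they hurt models' ability to steer flexibly in context. To mitigate these issues, we propose Spectrum Tuning, a post-training method that uses Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.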

Title: The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Authors: Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06101
Pdf URL: https://arxiv.org/pdf/2510.06101
Copy Paste: [[2510.06101]] The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models(https://arxiv.org/abs/2510.06101)
Keywords: language model, llm
Abstract: Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet little work has examined how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation beyond intuition.
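Abstract (translated from Chinese): Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet there is little work on how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then steadily increases in a sharper-than-log-linear fashion. Having identified this trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions about their respective learning phases. We learn that, across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in the training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation.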

Title: Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2510.06107
Pdf URL: https://arxiv.org/pdf/2510.06107
Copy Paste: [[2510.06107]] Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models(https://arxiv.org/abs/2510.06107)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model's layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.

Title: Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer

Authors: Muhammad Dehan Al Kautsar, Fajri Koto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06128
Pdf URL: https://arxiv.org/pdf/2510.06128
Copy Paste: [[2510.06128]] Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer(https://arxiv.org/abs/2510.06128)
Keywords: language model
Abstract: Tokenization defines the foundation of multilingual language models by determining how words are represented and shared across languages. However, existing methods often fail to support effective cross-lingual transfer because semantically equivalent words are assigned distinct embeddings. For example, "I eat rice" in English and "Ina cin shinkafa" in Hausa are typically mapped to different vocabulary indices, preventing shared representations and limiting cross-lingual generalization. We introduce parallel tokenizers. This new framework trains tokenizers monolingually and then aligns their vocabularies exhaustively using bilingual dictionaries or word-to-word translation, ensuring consistent indices for semantically equivalent words. This alignment enforces a shared semantic space across languages while naturally improving fertility balance. To assess their effectiveness, we pretrain a transformer encoder from scratch on thirteen low-resource languages and evaluate it on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity. Across all tasks, models trained with parallel tokenizers outperform conventional multilingual baselines, confirming that rethinking tokenization is essential for advancing multilingual representation learning--especially in low-resource settings.
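
The vocabulary alignment described in the abstract above can be illustrated with a toy sketch. It assumes a pivot vocabulary (English here), a word-to-word dictionary, and a list of target-language words; the function name `align_vocab`, the fallback of assigning fresh indices to unmatched words, and the placeholder word `xyz` are assumptions made for illustration, not details taken from the paper.

```python
def align_vocab(pivot_vocab, bilingual_dict, target_words):
    """Give each target word the index of its pivot-language translation.

    pivot_vocab:    dict mapping pivot-language word -> index
    bilingual_dict: dict mapping pivot-language word -> target-language word
    target_words:   iterable of target-language words to place in the vocabulary
    Words without a dictionary match get fresh indices (an assumed fallback).
    """
    reverse = {tgt: src for src, tgt in bilingual_dict.items()}
    next_free = max(pivot_vocab.values()) + 1
    aligned = {}
    for word in target_words:
        src = reverse.get(word)
        if src is not None and src in pivot_vocab:
            aligned[word] = pivot_vocab[src]   # shared index with the translation
        else:
            aligned[word] = next_free          # no translation found: new index
            next_free += 1
    return aligned


# Toy data built from the example sentence in the abstract.
en_vocab = {"i": 0, "eat": 1, "rice": 2}
en_to_ha = {"i": "ina", "eat": "cin", "rice": "shinkafa"}
ha_vocab = align_vocab(en_vocab, en_to_ha, ["ina", "cin", "shinkafa", "xyz"])
```

With this mapping, "rice" and "shinkafa" look up the same embedding row, which is the property the abstract attributes to parallel tokenizers.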

Title: CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits

Authors: Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06133
Pdf URL: https://arxiv.org/pdf/2510.06133
Copy Paste: [[2510.06133]] CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits(https://arxiv.org/abs/2510.06133)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising steps, achieving parallel decoding by denoising only high-confidence positions at each step. However, existing approaches often repetitively remask tokens due to initially low confidence scores, leading to redundant iterations and limiting overall acceleration. Through the analysis of dLLM decoding traces, we observe that the model often determines the final prediction for a token several steps before the decoding step. To leverage this historical information and avoid redundant steps, we introduce the concept of Trace Credit, which quantifies each token's convergence potential by accumulating historical logits. Furthermore, we propose CreditDecoding, a training-free parallel decoding algorithm that accelerates the confidence convergence of correct but underconfident tokens by fusing current logits with Trace Credit. This process significantly reduces redundant iterations and enhances decoding robustness. On eight benchmarks, CreditDecoding achieves a 5.48 times speedup and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times speedup with a 0.15 performance improvement over LLaDA-MoE-Instruct. Importantly, CreditDecoding scales effectively to long sequences and is orthogonal to mainstream inference optimizations, making it a readily integrable and versatile solution.
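
A rough sketch of the idea of accumulating historical logits and fusing them with the current ones follows. The abstract does not give the update or fusion formulas, so the exponential decay, the additive fusion, and the parameters `decay` and `alpha` are all assumptions chosen for illustration; this is not the CreditDecoding algorithm itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_credit(credit, logits, decay=0.9):
    """Accumulate per-token logits across steps (assumed decay-weighted sum)."""
    return [decay * c + l for c, l in zip(credit, logits)]


def fused_confidence(logits, credit, alpha=0.5):
    """Add scaled credit to the current logits (assumed fusion rule),
    then return the top token index and its probability."""
    fused = [l + alpha * c for l, c in zip(logits, credit)]
    probs = softmax(fused)
    top = max(range(len(probs)), key=probs.__getitem__)
    return top, probs[top]


# One masked position, three candidate tokens, three earlier denoising steps.
history = [[1.0, 0.2, 0.0], [1.2, 0.3, 0.1], [1.1, 0.2, 0.0]]
credit = [0.0, 0.0, 0.0]
for step_logits in history:
    credit = update_credit(credit, step_logits)

current = [1.3, 0.4, 0.1]
plain_top, plain_conf = fused_confidence(current, [0.0, 0.0, 0.0])
fused_top, fused_conf = fused_confidence(current, credit)
```

In this toy case the token that was consistently preferred in earlier steps keeps its rank but reaches a higher confidence once the credit is added, so a confidence threshold would unmask it sooner.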

Title: RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets

Authors: Jan Cegin, Branislav Pecher, Ivan Srba, Jakub Simko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06143
Pdf URL: https://arxiv.org/pdf/2510.06143
Copy Paste: [[2510.06143]] RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets(https://arxiv.org/abs/2510.06143)
Keywords: llm
Abstract: LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator's outputs (optimal vs. proxy metric selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
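
The round-robin scoring described in the abstract above can be sketched as follows. The loop in `rose_scores` follows the abstract (train on one generator's outputs, evaluate on every other generator's outputs, take the mean); the word-voting classifier and the three tiny datasets are stand-ins invented for illustration and say nothing about the small model or data used in the paper.

```python
from collections import Counter, defaultdict
from statistics import mean


def train(data):
    """Toy 'small model': count which labels co-occur with each word."""
    counts = defaultdict(Counter)
    for text, label in data:
        for word in text.split():
            counts[word][label] += 1
    return counts


def predict(model, text):
    """Sum label votes over known words; None if no word is known."""
    votes = Counter()
    for word in text.split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def evaluate(model, data):
    """Accuracy of the toy model on (text, label) pairs."""
    return sum(predict(model, t) == y for t, y in data) / len(data)


def rose_scores(datasets):
    """For each generator: train on its data, test on all other generators' data."""
    scores = {}
    for name, data in datasets.items():
        model = train(data)
        scores[name] = mean(
            evaluate(model, other_data)
            for other, other_data in datasets.items()
            if other != name
        )
    return scores


# Hypothetical synthetic datasets from three candidate generators.
datasets = {
    "gen_a": [("good film", "pos"), ("bad film", "neg")],
    "gen_b": [("good movie", "pos"), ("bad movie", "neg")],
    "gen_c": [("nice day", "pos"), ("awful day", "neg")],
}
scores = rose_scores(datasets)
best = max(scores, key=scores.get)
```

No human-labelled test set appears anywhere in the loop: each generator is judged only by how well a model trained on its outputs transfers to the other generators' outputs.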

Title: VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Authors: Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06175
Pdf URL: https://arxiv.org/pdf/2510.06175
Copy Paste: [[2510.06175]] VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization(https://arxiv.org/abs/2510.06175)
Keywords: language model, llm
Abstract: The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $2.7\times$ speedup in large-batch self-attention computation and $8.3\times$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
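
To see why a Hadamard transformation can suppress outliers, the sketch below applies an orthonormal fast Walsh-Hadamard transform to a toy key vector with one large channel. It covers only the Hadamard step, not the smoothing transformation, the codebook, or the CUDA kernel mentioned in the abstract; the vector values and function name are made up for illustration.

```python
import math


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    len(vec) must be a power of two. The transform is a rotation, so it
    preserves the vector's norm while spreading energy across coordinates.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1 / math.sqrt(n)
    return [x * scale for x in out]


# Toy key vector: one channel (index 2) is a large outlier.
key = [0.1, -0.2, 8.0, 0.05, 0.0, 0.15, -0.1, 0.2]
rotated = hadamard_transform(key)

peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)
norm_before = math.sqrt(sum(x * x for x in key))
norm_after = math.sqrt(sum(x * x for x in rotated))
```

The outlier's magnitude is shared among all eight coordinates after the rotation, so the largest entry shrinks while the norm stays the same, which leaves a value range that a small codebook can cover more evenly.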

Title: Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Authors: Yoav Gur-Arieh, Mor Geva, Atticus Geiger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.06182
Pdf URL: https://arxiv.org/pdf/2510.06182
Copy Paste: [[2510.06182]] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context(https://arxiv.org/abs/2510.06182)
Keywords: language model
Abstract: A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

Title: RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Authors: Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Xue Liu, Irwin King, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.06186
Pdf URL: https://arxiv.org/pdf/2510.06186
Copy Paste: [[2510.06186]] RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback(https://arxiv.org/abs/2510.06186)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic workflows of scientific research development. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.

Title: Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Authors: Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.06198
Pdf URL: https://arxiv.org/pdf/2510.06198
Copy Paste: [[2510.06198]] Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction(https://arxiv.org/abs/2510.06198)
Keywords: llm
Abstract: This paper introduces a framework for relation extraction (RE) that enhances both accuracy and explainability. The framework has two key components: (i) a reasoning mechanism that formulates relation extraction as a series of text-processing steps inspired by cognitive science, and (ii) an optimization process driven by reinforcement learning (RL) with a novel reward function designed to improve both task accuracy and explanation quality. We call our approach CogRE. Our framework addresses the lack of supervision for language-based explanations in traditional RE by promoting outputs that include important relation keywords. These keywords are drawn from a high-quality dictionary that is automatically constructed using an LLM. We evaluate our approach for the task of one-shot RE using two LLMs and two RE datasets. Our experiments show that CogRE improves explanation quality by addressing two common failure patterns in one-shot RE: poor attention focus and limited one-shot learning capability. For example, our cognitive-structured reasoning with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using our reward further improves performance by +23.46% (absolute). Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).