2025-10-28

Title: Policy Optimization Prefers The Path of Least Resistance

Authors: Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21853
Pdf URL: https://arxiv.org/pdf/2510.21853
Copy Paste: [[2510.21853]] Policy Optimization Prefers The Path of Least Resistance(https://arxiv.org/abs/2510.21853)
Keywords: language model, chain-of-thought
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct answer-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the more complex format with explicit reasoning is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by the optimization process that requires the KL-regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.

Title: Language Ranker: A Lightweight Ranking framework for LLM Decoding

Authors: Chenheng Zhang, Tianqi Du, Jizhe Zhang, Mingqing Xiao, Yifei Wang, Yisen Wang, Zhouchen Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21883
Pdf URL: https://arxiv.org/pdf/2510.21883
Copy Paste: [[2510.21883]] Language Ranker: A Lightweight Ranking framework for LLM Decoding(https://arxiv.org/abs/2510.21883)
Keywords: language model, llm
Abstract: Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling the computation of inference time with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only <0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.

Title: Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks

Authors: Avinash Patil
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21884
Pdf URL: https://arxiv.org/pdf/2510.21884
Copy Paste: [[2510.21884]] Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks(https://arxiv.org/abs/2510.21884)
Keywords: language model, llm
Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets (WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS) and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
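Illustration: the feature-coverage idea in the abstract above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The function names, the tokenization, the example features, and the edit-distance threshold are all assumptions made for illustration. Coverage here is the fraction of ranked lexical features that appear in a rationale, either exactly or within a small Levenshtein distance.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def coverage(rationale: str, features: list[str], max_edits: int = 0) -> float:
    """Fraction of features matched by some rationale token.

    max_edits=0 is exact string matching; a larger value allows
    near-matches such as inflected forms (hypothetical threshold).
    """
    if not features:
        return 0.0
    tokens = re.findall(r"[a-z0-9']+", rationale.lower())
    hits = sum(
        1 for f in features
        if any(levenshtein(f.lower(), t) <= max_edits for t in tokens)
    )
    return hits / len(features)


# Hypothetical rationale and supporting features for a sentiment example.
rationale = "The review praises the acting and calls the plot wonderfully paced."
supporting = ["praise", "wonderful", "boring"]

exact_cov = coverage(rationale, supporting, max_edits=0)
fuzzy_cov = coverage(rationale, supporting, max_edits=2)
```

In this toy case exact matching finds none of the features, while a threshold of two edits recovers "praises" and "wonderfully". A fixed threshold can produce spurious matches on short words, so a length-relative threshold may be preferable in practice.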

Title: Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning

Authors: Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21885
Pdf URL: https://arxiv.org/pdf/2510.21885
Copy Paste: [[2510.21885]] Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning(https://arxiv.org/abs/2510.21885)
Keywords: language model
Abstract: Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.

Title: Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation

Authors: Dhrupad Bhardwaj, Julia Kempe, Tim G. J. Rudner
Subjects: cs.CL, cs.AI, cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21891
Pdf URL: https://arxiv.org/pdf/2510.21891
Copy Paste: [[2510.21891]] Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation(https://arxiv.org/abs/2510.21891)
Keywords: language model, llm, prompt
Abstract: To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere -- and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater embedding dispersion -- reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples -- offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.
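Illustration: one simple way to picture "angular dispersion of the embeddings on the unit sphere" is the mean pairwise angle between normalized response embeddings. The sketch below assumes that definition purely for illustration. The abstract does not specify the estimator the authors use, and the function names and toy vectors are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def angular_dispersion(embeddings: list[list[float]]) -> float:
    """Mean pairwise angle (radians) between unit-normalized embeddings.

    An assumed stand-in for a semantic-isotropy score: larger values
    mean the sampled responses point in more varied directions.
    """
    unit = [normalize(v) for v in embeddings]
    angles = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding error
            angles.append(math.acos(cos))
    if not angles:
        raise ValueError("need at least two embeddings")
    return sum(angles) / len(angles)


# Toy 3-d "embeddings" of sampled responses (hypothetical values).
consistent = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.98, 0.0, 0.06]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

low = angular_dispersion(consistent)
high = angular_dispersion(scattered)
```

Under the abstract's finding, the scattered set would be flagged as more likely to contain nonfactual content than the consistent set. In real use the vectors would come from a text-embedding model applied to several sampled long-form responses.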

Title: Understanding Network Behaviors through Natural Language Question-Answering

Authors: Mingzhe Xing, Chang Tian, Jianan Zhang, Lichen Pan, Peipei Liu, Zhaoteng Yan, Yinliang Yue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21894
Pdf URL: https://arxiv.org/pdf/2510.21894
Copy Paste: [[2510.21894]] Understanding Network Behaviors through Natural Language Question-Answering(https://arxiv.org/abs/2510.21894)
Keywords: language model, llm
Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, these approaches suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLMs' long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs matched with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.

Title: Deep Literature Survey Automation with an Iterative Workflow

Authors: Hongbo Zhang, Han Cui, Yidong Wang, Yijian Tian, Qi Guo, Cunxiang Wang, Jian Wu, Chiyu Song, Yue Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21900
Pdf URL: https://arxiv.org/pdf/2510.21900
Copy Paste: [[2510.21900]] Deep Literature Survey Automation with an Iterative Workflow(https://arxiv.org/abs/2510.21900)
Keywords: agent
Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that our framework substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at this https URL\_Autosurveyv2.

Title: Model-Aware Tokenizer Transfer

Authors: Mykola Haltiuk, Aleksander Smywiński-Pohl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21954
Pdf URL: https://arxiv.org/pdf/2510.21954
Copy Paste: [[2510.21954]] Model-Aware Tokenizer Transfer(https://arxiv.org/abs/2510.21954)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.

Title: A Stylometric Application of Large Language Models

Authors: Harrison F. Stropkay, Jiayi Chen, Mohammad J. Latifi, Daniel N. Rockmore, Jeremy R. Manning
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2510.21958
Pdf URL: https://arxiv.org/pdf/2510.21958
Copy Paste: [[2510.21958]] A Stylometric Application of Large Language Models(https://arxiv.org/abs/2510.21958)
Keywords: language model, gpt, llm
Abstract: We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one author, will predict held-out text from that author more accurately than held-out text from other authors. We suggest that, in this way, a model trained on one author's works embodies the unique writing style of that author. We first demonstrate our approach on books written by eight different (known) authors. We also use this approach to confirm R. P. Thompson's authorship of the well-studied 15th book of the Oz series, originally attributed to F. L. Baum.

Title: Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Authors: Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21983
Pdf URL: https://arxiv.org/pdf/2510.21983
Copy Paste: [[2510.21983]] Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks(https://arxiv.org/abs/2510.21983)
Keywords: language model, llm, prompt
Abstract: Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.

Title: Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models

Authors: Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22014
Pdf URL: https://arxiv.org/pdf/2510.22014
Copy Paste: [[2510.22014]] Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models(https://arxiv.org/abs/2510.22014)
Keywords: language model, prompt
Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
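Illustration: the three quantities listed in the abstract above can be read as simple vector operations on hidden activations. The sketch below is an assumed reading, not the authors' code. How the refusal direction is estimated, which layer the activations come from, and all names and toy values here are hypothetical.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def refusal_stats(h_prompt, h_with_suffix, refusal_dir):
    """Return (base_activation, push_away, orthogonal_shift).

    base_activation: projection of the suffix-free prompt activation
                     onto the unit refusal direction (property 1).
    push_away:       how far the suffix moves the activation against
                     the refusal direction, positive = away (property 2).
    orthogonal_shift: norm of the part of the shift that is orthogonal
                     to the refusal direction (property 3).
    """
    norm = math.sqrt(dot(refusal_dir, refusal_dir))
    r_hat = [x / norm for x in refusal_dir]

    base_activation = dot(h_prompt, r_hat)

    delta = [b - a for a, b in zip(h_prompt, h_with_suffix)]
    along = dot(delta, r_hat)
    push_away = -along

    ortho = [d - along * r for d, r in zip(delta, r_hat)]
    orthogonal_shift = math.sqrt(dot(ortho, ortho))
    return base_activation, push_away, orthogonal_shift


# Toy 3-d activations (hypothetical values).
stats = refusal_stats(
    h_prompt=[3.0, 1.0, 0.0],
    h_with_suffix=[1.0, 1.0, 2.0],
    refusal_dir=[2.0, 0.0, 0.0],
)
```

In this toy example the prompt alone activates the refusal direction with strength 3, the suffix pushes 2 units away from it, and it also moves the activation 2 units in an orthogonal direction.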

Title: Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Authors: Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22028
Pdf URL: https://arxiv.org/pdf/2510.22028
Copy Paste: [[2510.22028]] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics(https://arxiv.org/abs/2510.22028)
Keywords: llm
Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.

Title: Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models

Authors: Benjamin Reichman, Adar Avsian, Larry Heck
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22042
Pdf URL: https://arxiv.org/pdf/2510.22042
Copy Paste: [[2510.22042]] Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models(https://arxiv.org/abs/2510.22042)
Keywords: language model, llm
Abstract: This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. The paper identifies a low-dimensional emotional manifold and shows that emotional representations are directionally encoded, distributed across layers, and aligned with interpretable dimensions. These structures are stable across depth and generalize to eight real-world emotion datasets spanning five languages. Cross-domain alignment yields low error and strong linear probe performance, indicating a universal emotional subspace. Within this space, internal emotion perception can be steered while preserving semantics using a learned intervention module, with especially strong control for basic emotions across languages. These findings reveal a consistent and manipulable affective geometry in LLMs and offer insight into how they internalize and process emotion.

Title: Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds

Authors: Atij Mahesh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22084
Pdf URL: https://arxiv.org/pdf/2510.22084
Copy Paste: [[2510.22084]] Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds(https://arxiv.org/abs/2510.22084)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) still produce gender-stereotyped language even in occupation-neutral contexts that reflect deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022). However, the comparative efficacy and learning dynamics remain little understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task. This task requires generating sentences that contain at least one agentic and one communal descriptor for each of the twenty Winogender-derived occupations. We quantify trade-offs between control strength and naturalness with evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal key contrasts among the methods: SFT achieves 99.87 ± 0.15% compliance and high lexical diversity, while DPO, despite similar training stability, fails at 4.53 ± 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning fundamentally differs: it cannot satisfy compositional constraints, as binary preference signals encode ranking, not logical conjunctions. Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
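Illustration: the compositional constraint described in the abstract above is a logical conjunction, which a compliance checker can express directly. The sketch below is a minimal example under stated assumptions. The two word lists are hypothetical placeholders, not the paper's lexicons, and the function names are invented for illustration.

```python
import re

# Hypothetical descriptor lexicons; the paper's actual lists are not given here.
AGENTIC = {"assertive", "decisive", "ambitious", "confident"}
COMMUNAL = {"caring", "supportive", "warm", "compassionate"}


def is_compliant(sentence: str) -> bool:
    """True only if the sentence has BOTH an agentic AND a communal descriptor."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & AGENTIC) and bool(words & COMMUNAL)


def compliance_rate(sentences: list[str]) -> float:
    """Share of sentences that satisfy the conjunction."""
    if not sentences:
        return 0.0
    return sum(is_compliant(s) for s in sentences) / len(sentences)


samples = [
    "The engineer was decisive and caring.",   # both kinds present
    "The engineer was decisive.",              # agentic only
    "The nurse was warm and supportive.",      # communal only
    "The paramedic was confident yet compassionate.",
]
rate = compliance_rate(samples)
```

A sentence with only one kind of descriptor fails the check no matter how strong that descriptor is, which is the conjunction structure the abstract says ranking-style preference signals do not encode.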

Title: Generalization or Memorization: Dynamic Decoding for Mode Steering

Authors: Xuanming Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22099
Pdf URL: https://arxiv.org/pdf/2510.22099
Copy Paste: [[2510.22099]] Generalization or Memorization: Dynamic Decoding for Mode Steering(https://arxiv.org/abs/2510.22099)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit a troubling duality, capable of both remarkable generalization and brittle, verbatim memorization of their training data. This unpredictability undermines their reliability in high-stakes applications. In this work, we propose a unified framework to understand, identify, and control these distinct reasoning modes. First, we introduce a theoretical model based on the Information Bottleneck (IB) principle, formalizing generalization as the learning of a compressed, task-relevant representation and memorization as a failure to compress. Building on this theory, we develop Dynamic Mode Steering (DMS), a novel inference-time algorithm which comprises two components: (1) a lightweight, causally-grounded linear probe that identifies the model's instantaneous reliance on memorization, and (2) a dynamic activation steering mechanism that nudges the model's computation towards pre-identified generalization circuits. We frame DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning and faithfulness tasks demonstrate that DMS significantly improves logical consistency and factual accuracy, thereby offering a principled approach to enhancing LLM reliability.
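
The two DMS components described in the abstract (a linear probe and a probe-driven activation nudge) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `probe_score` and `steer`, the sigmoid read-out, the `strength` parameter, and the use of plain Python lists in place of model activations are all hypothetical.

```python
import math

def probe_score(hidden, weights, bias=0.0):
    """Linear probe: sigmoid(w . h + b), read here as estimated reliance on memorization."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def steer(hidden, weights, direction, strength=1.0, bias=0.0):
    """Add a steering direction to the hidden state, scaled by the probe score.

    A high probe score (assumed to signal memorization) yields a strong nudge;
    a low score leaves the hidden state almost unchanged.
    """
    score = probe_score(hidden, weights, bias)
    return [h + strength * score * d for h, d in zip(hidden, direction)]

# Toy 2-dimensional example with made-up probe weights and steering direction.
steered = steer([0.0, 0.0], weights=[1.0, 0.0], direction=[0.0, 2.0])
print(steered)
```

Scaling the nudge by the probe output is one simple way to make the intervention adaptive; the paper may gate or scale the steering differently.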

Title: Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

Authors: Billy Dickson, Zoran Tiganj
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22109
Pdf URL: https://arxiv.org/pdf/2510.22109
Copy Paste: [[2510.22109]] Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows(https://arxiv.org/abs/2510.22109)
Keywords: language model
Abstract: Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory.
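
To make the idea of logarithmic input compression concrete, here is a minimal sketch. The abstract does not specify the compression scheme, so everything below is an assumption: the function name `log_compress`, the geometric bin widths, the averaging within bins, and the use of scalars as stand-ins for token embeddings.

```python
def log_compress(history, base=2):
    """Compress a sequence (oldest first) into bins that widen geometrically into the past.

    The most recent item keeps its own slot; older items share progressively
    wider bins, each summarized by its mean.
    """
    compressed = []
    end = len(history)  # exclusive end index of the next bin, walking backwards
    width = 1
    while end > 0:
        start = max(0, end - width)
        chunk = history[start:end]
        compressed.append(sum(chunk) / len(chunk))
        end = start
        width *= base
    compressed.reverse()  # restore oldest-first order
    return compressed

# 15 inputs collapse into 4 slots covering 8, 4, 2 and 1 items respectively.
print(log_compress(list(range(1, 16))))
```

With this binning the number of slots grows roughly with the logarithm of the history length, which is the property that lets a fixed-size window cover a much longer past.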

Title: OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue

Authors: Tianhong Gao, Jundong Shen, Bei Shi, Jiapeng Wang, Ying Ju, Junfeng Yao, Jiao Ran, Yong Zhang, Lin Dong, Huiyu Yu, Tingting Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22143
Pdf URL: https://arxiv.org/pdf/2510.22143
Copy Paste: [[2510.22143]] OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue(https://arxiv.org/abs/2510.22143)
Keywords: hallucination, retrieval-augmented generation
Abstract: Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkable improvements in automation and efficiency. However, notable limitations remain: these systems are prone to hallucinations and often generate rigid, mechanical responses, which can introduce business risks and undermine user experience, especially in Web-based customer service interactions in RAG scenarios. In this paper, we introduce OlaMind, a human-like and hallucination-safe customer service framework for retrieval-augmented dialogue. Specifically, it first leverages a Learn-to-Think stage to learn the reasoning processes and response strategies from human experts, and then employs a Learn-to-Respond stage to perform cold-start supervised fine-tuning (SFT) combined with reinforcement learning (RL) for basic-to-hard self-refinement. Our method significantly enhances human-likeness and naturalness while effectively mitigating hallucinations and critical business risks. We have conducted large-scale online A/B experiments in an industry-level social customer service setting, and extensive experimental results show that OlaMind achieves significant cumulative relative improvements, with intelligent resolution rates of +28.92%/+18.42% and human takeover rates of -6.08%/-7.12% in community-support/livestream-interaction scenarios, respectively, which highlights its consistent effectiveness across diverse real-world applications. The code and data will be publicly available.

Title: DETECT: Determining Ease and Textual Clarity of German Text Simplifications

Authors: Maria Korobeynikova, Alessia Battisti, Lukas Fischer, Yingqiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22212
Pdf URL: https://arxiv.org/pdf/2510.22212
Copy Paste: [[2510.22212]] DETECT: Determining Ease and Textual Clarity of German Text Simplifications(https://arxiv.org/abs/2510.22212)
Keywords: language model, llm
Abstract: Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.

Title: Estimating the Error of Large Language Models at Pairwise Text Comparison

Authors: Tianyi Li
Subjects: cs.CL, cs.AI, math.PR
Abstract URL: https://arxiv.org/abs/2510.22219
Pdf URL: https://arxiv.org/pdf/2510.22219
Copy Paste: [[2510.22219]] Estimating the Error of Large Language Models at Pairwise Text Comparison(https://arxiv.org/abs/2510.22219)
Keywords: language model, gpt, llm, prompt, chat
Abstract: We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the two measured positional bias terms are similar and close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs' error at this task.
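
The Copeland counting step mentioned in the abstract can be sketched as follows. This shows only the general technique (score each item by pairwise wins minus losses, then sort); the function name `copeland_ranking` and the toy `prefer` judge are hypothetical, and the paper's error-rate estimation built on top of the ranking is not reproduced here.

```python
from itertools import combinations

def copeland_ranking(items, prefer):
    """Rank items by Copeland score: pairwise wins minus pairwise losses.

    `prefer(a, b)` must return whichever of a or b is preferred.
    """
    scores = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        winner = prefer(a, b)
        loser = b if winner == a else a
        scores[winner] += 1
        scores[loser] -= 1
    # Highest score first; ties keep their input order because sorted() is stable.
    ranking = sorted(items, key=lambda item: -scores[item])
    return ranking, scores

# Toy judge standing in for an LLM: it simply prefers the longer text.
texts = ["a", "abcd", "ab", "abc"]
ranking, scores = copeland_ranking(texts, lambda a, b: a if len(a) >= len(b) else b)
print(ranking, scores)
```

Ranking n texts this way needs n(n-1)/2 comparisons, which gives a sense of why pairwise comparison with an LLM judge scales poorly as the number of texts grows.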

Title: You Don't Need Prompt Engineering Anymore: The Prompting Inversion

Authors: Imran Khan (Independent Researcher)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22251
Pdf URL: https://arxiv.org/pdf/2510.22251
Copy Paste: [[2510.22251]] You Don't Need Prompt Engineering Anymore: The Prompting Inversion(https://arxiv.org/abs/2510.22251)
Keywords: gpt, llm, prompt, chain-of-thought
Abstract: Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce "Sculpting," a constrained, rule-based prompting method designed to improve upon standard CoT by reducing errors from semantic ambiguity and flawed common sense. We evaluate three prompting strategies (Zero Shot, standard CoT, and Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5) using the GSM8K mathematical reasoning benchmark (1,317 problems). Our findings reveal a "Prompting Inversion": Sculpting provides advantages on gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00% vs. 96.36% for CoT on full benchmark). We trace this to a "Guardrail-to-Handcuff" transition where constraints preventing common-sense errors in mid-tier models induce hyper-literalism in advanced models. Our detailed error analysis demonstrates that optimal prompting strategies must co-evolve with model capabilities, suggesting simpler prompts for more capable models.

Title: SteerX: Disentangled Steering for LLM Personalization

Authors: Xiaoyan Zhao, Ming Yan, Yilun Qiu, Haoting Ni, Yang Zhang, Fuli Feng, Hong Cheng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22256
Pdf URL: https://arxiv.org/pdf/2510.22256
Copy Paste: [[2510.22256]] SteerX: Disentangled Steering for LLM Personalization(https://arxiv.org/abs/2510.22256)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.

Title: From Slides to Chatbots: Enhancing Large Language Models with University Course Materials

Authors: Tu Anh Dinh, Philipp Nicolas Schumacher, Jan Niehues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22272
Pdf URL: https://arxiv.org/pdf/2510.22272
Copy Paste: [[2510.22272]] From Slides to Chatbots: Enhancing Large Language Models with University Course Materials(https://arxiv.org/abs/2510.22272)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs still struggle to answer questions accurately within university-level computer science courses. In this work, we investigate how incorporating university course materials can enhance LLM performance in this setting. A key challenge lies in leveraging diverse course materials such as lecture slides and transcripts, which differ substantially from typical textual corpora: slides also contain visual elements like images and formulas, while transcripts contain spoken, less structured language. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture slides, we further explore a multi-modal RAG approach, where we present the retrieved content to the generator in image form. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT. Moreover, incorporating slides as images in the multi-modal setting significantly improves performance over text-only retrieval. These findings highlight practical strategies for developing AI assistants that better support learning and teaching, and we hope they inspire similar efforts in other educational contexts.

Title: Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER

Authors: Andrei Baroian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22285
Pdf URL: https://arxiv.org/pdf/2510.22285
Copy Paste: [[2510.22285]] Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER(https://arxiv.org/abs/2510.22285)
Keywords: gpt, llm, prompt
Abstract: We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 ≈ 87.1%), albeit at higher cost. We find that the LLMs achieve higher accuracy on simplified tasks that restrict classification to two labels.
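
For readers unfamiliar with the "standard NER metrics" referred to above, the following sketch computes entity-level precision, recall, and F1 under exact-match scoring. It is a generic illustration: the function name `entity_f1`, the `(start, end, type)` span representation, and the toy spans are assumptions, and the paper's exact evaluation protocol may differ.

```python
def entity_f1(gold, predicted):
    """Entity-level precision, recall and F1 with exact span-and-type matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy spans as (start_token, end_token, entity_type); the labels follow CADEC's types.
gold = [(0, 1, "Drug"), (5, 7, "ADR"), (9, 10, "Symptom")]
predicted = [(0, 1, "Drug"), (5, 7, "ADR"), (9, 10, "Finding"), (12, 13, "Disease")]
precision, recall, f1 = entity_f1(gold, predicted)
print(precision, recall, f1)
```

Note that the third prediction has the right span but the wrong type, so under exact matching it counts as both a false positive and a missed gold entity.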

Title: Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling

Authors: Antal van den Bosch, Ainhoa Risco Patón, Teun Buijse, Peter Berck, Maarten van Gompel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22317
Pdf URL: https://arxiv.org/pdf/2510.22317
Copy Paste: [[2510.22317]] Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling(https://arxiv.org/abs/2510.22317)
Keywords: language model, gpt
Abstract: We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
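
A minimal sketch can convey what memory-based next-token prediction looks like in practice. It is not OLIFANT: the class name `MemoryLM`, the exact-match lookup with back-off to shorter contexts (a crude stand-in for approximate k-nearest-neighbor classification), and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class MemoryLM:
    """Stores every (context, next token) pair seen in training and predicts by lookup."""

    def __init__(self, max_context=3):
        self.max_context = max_context
        self.memory = defaultdict(Counter)  # context tuple -> counts of next tokens

    def train(self, tokens):
        for i in range(1, len(tokens)):
            for n in range(1, self.max_context + 1):
                if i - n < 0:
                    break
                self.memory[tuple(tokens[i - n:i])][tokens[i]] += 1

    def predict(self, context):
        """Return the most frequent continuation of the longest stored context suffix."""
        context = tuple(context[-self.max_context:])
        while context:
            if context in self.memory:
                return self.memory[context].most_common(1)[0][0]
            context = context[1:]  # back off: drop the oldest token
        return None

lm = MemoryLM()
lm.train("the cat sat on the mat and the cat slept".split())
print(lm.predict(["on", "the"]), lm.predict(["the"]))
```

Training here is a single pass that only fills a lookup table, and every prediction can be traced back to the stored examples that produced it, which is the sense in which such models run on CPUs and are transparent.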

Title: FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation

Authors: Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.22344
Pdf URL: https://arxiv.org/pdf/2510.22344
Copy Paste: [[2510.22344]] FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation(https://arxiv.org/abs/2510.22344)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
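
The control flow of the refinement cycle described above (retrieve, audit the evidence against a checklist, turn gaps into sub-queries, repeat) can be sketched with toy components. This is an assumption-laden illustration, not FAIR-RAG itself: `retrieve`, `assess_evidence`, and `iterative_refinement` are hypothetical names, the keyword checklist replaces the LLM-driven assessment, and word overlap replaces a real retriever.

```python
def retrieve(query, corpus):
    """Toy retriever: return documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def assess_evidence(checklist, evidence):
    """Toy evidence audit: split the checklist into confirmed items and gaps."""
    text = " ".join(evidence).lower()
    confirmed = [item for item in checklist if item in text]
    gaps = [item for item in checklist if item not in text]
    return confirmed, gaps

def iterative_refinement(query, checklist, corpus, max_rounds=3):
    """Retrieve, audit, and re-query on the gaps until the checklist is covered."""
    evidence, sub_queries = [], [query]
    for round_number in range(1, max_rounds + 1):
        for sub_query in sub_queries:
            for doc in retrieve(sub_query, corpus):
                if doc not in evidence:
                    evidence.append(doc)
        confirmed, gaps = assess_evidence(checklist, evidence)
        if not gaps:
            break
        sub_queries = gaps  # each gap becomes a targeted sub-query
    return evidence, gaps, round_number

corpus = [
    "marie curie was born in warsaw",
    "warsaw is the capital of poland",
    "paris is the capital of france",
]
evidence, gaps, rounds = iterative_refinement(
    "curie birthplace country", ["warsaw", "poland"], corpus
)
print(evidence, gaps, rounds)
```

In this two-hop toy question the first round finds the birthplace, the audit flags "poland" as a gap, and the second round retrieves the missing document, after which the loop stops.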

Title: Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models

Authors: Fiaz Ahmad, Nisar Hussain, Amna Qasim, Momina Hafeez, Muhammad Usman, Grigori Sidorov, Alexander Gelbukh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22356
Pdf URL: https://arxiv.org/pdf/2510.22356
Copy Paste: [[2510.22356]] Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models(https://arxiv.org/abs/2510.22356)
Keywords: language model
Abstract: Ironic identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.

Title: GigaEmbeddings: Efficient Russian Language Embedding Model

Authors: Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22369
Pdf URL: https://arxiv.org/pdf/2510.22369
Copy Paste: [[2510.22369]] GigaEmbeddings: Efficient Russian Language Embedding Model(https://arxiv.org/abs/2510.22369)
Keywords: llm, chat
Abstract: We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of the decoder-only LLM designed specifically for Russian language (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training in web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with a larger number of parameters.

Title: VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Authors: Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.22373
Pdf URL: https://arxiv.org/pdf/2510.22373
Copy Paste: [[2510.22373]] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations(https://arxiv.org/abs/2510.22373)
Keywords: language model, gpt, llm
Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at this https URL.
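
The percentage figures quoted in the abstract follow from the raw numbers by ordinary relative change, which the short check below reproduces. Only the four values come from the abstract; the helper name `relative_change` is hypothetical.

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

mae_change = relative_change(0.551, 0.442)   # GPT-5 MAE -> VisJudge MAE
corr_change = relative_change(0.429, 0.681)  # GPT-5 correlation -> VisJudge correlation
print(f"MAE change: {mae_change:+.1%}")
print(f"Correlation change: {corr_change:+.1%}")
```

The results round to a 19.8% reduction in MAE and a 58.7% increase in correlation, matching the figures reported above.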

Title: Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Authors: Federica Gamba, Aman Sinha, Timothee Mickus, Raul Vazquez, Patanjali Bhamidipati, Claudio Savelli, Ahana Chattopadhyay, Laura A. Zanella, Yash Kankanampati, Binesh Arakkal Remesh, Aryan Ashok Chandramania, Rohit Agarwal, Chuyuan Li, Ioana Buhnila, Radhika Mamidi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22395
Pdf URL: https://arxiv.org/pdf/2510.22395
Copy Paste: [[2510.22395]] Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection(https://arxiv.org/abs/2510.22395)
Keywords: language model, llm, hallucination
Abstract: We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.

Title: CHOIR: Collaborative Harmonization fOr Inference Robustness

Authors: Xiangjue Dong, Cong Wang, Maria Teleki, Millennium Bismay, James Caverlee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22475
Pdf URL: https://arxiv.org/pdf/2510.22475
Copy Paste: [[2510.22475]] CHOIR: Collaborative Harmonization fOr Inference Robustness(https://arxiv.org/abs/2510.22475)
Keywords: language model, llm
Abstract: Persona-assigned Large Language Models (LLMs) can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as simple pronoun changes, can alter reasoning trajectories, leading to divergent sets of correct answers. Instead of treating these variations as biases to be mitigated, we explore their potential as a constructive resource to improve reasoning robustness. We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes multiple persona-conditioned reasoning signals into a unified prediction. CHOIR orchestrates a collaborative decoding process among counterfactual personas, dynamically balancing agreement and divergence in their reasoning paths. Experiments on various reasoning benchmarks demonstrate that CHOIR consistently enhances performance across demographics, model architectures, scales, and tasks - without additional training. Improvements reach up to 26.4% for individual demographic groups and 19.2% on average across five demographics. It remains effective even when base personas are suboptimal. By reframing persona variation as a constructive signal, CHOIR provides a scalable and generalizable approach to more reliable LLM reasoning.

Title: Frustratingly Easy Task-aware Pruning for Large Language Models

Authors: Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22489
Pdf URL: https://arxiv.org/pdf/2510.22489
Copy Paste: [[2510.22489]] Frustratingly Easy Task-aware Pruning for Large Language Models(https://arxiv.org/abs/2510.22489)
Keywords: language model, llm
Abstract: Pruning offers a practical way to reduce the resources required to run large language models (LLMs), preserving their capabilities while controlling training and inference costs. Research on LLM pruning often ranks the importance of LLM parameters using their magnitudes and calibration-data activations and removes (or masks) the less important ones, thereby reducing the model's size. However, these approaches primarily focus on preserving the LLM's ability to generate fluent sentences, while neglecting performance on specific domains and tasks. In this paper, we propose a simple yet effective pruning approach for LLMs that preserves task-specific capabilities while shrinking their parameter space. We first analyze how conventional pruning minimizes loss perturbation under general-domain calibration and extend this formulation by incorporating task-specific feature distributions into the importance computation of existing pruning algorithms. Thus, our framework computes separate importance scores using both general and task-specific calibration data, partitions parameters into shared and exclusive groups based on activation-norm differences, and then fuses their scores to guide the pruning process. This design enables our method to integrate seamlessly with various foundation pruning techniques and preserve the LLM's specialized abilities under compression. Experiments on widely used benchmarks demonstrate that our approach is effective and consistently outperforms the baselines at identical pruning ratios across different settings.

Title: Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection

Authors: Noshitha Padma Pratyusha Juttu, Sahithi Singireddy, Sravani Gona, Sujal Timilsina
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22531
Pdf URL: https://arxiv.org/pdf/2510.22531
Copy Paste: [[2510.22531]] Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection(https://arxiv.org/abs/2510.22531)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a systematic evaluation of fine-tuning, parameter-efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents, a key application in legal NLP. We fine-tune BERT and DistilBERT, apply 4-bit Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision-recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost. These findings highlight practical design trade-offs for efficient and domain-adapted LLMs, contributing open baselines for fine-tuning research in legal text processing.

Title: LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Authors: Ziyuan He, Yuxuan Wang, Jiaqi Li, Kexin Liang, Muhan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22548
Pdf URL: https://arxiv.org/pdf/2510.22548
Copy Paste: [[2510.22548]] LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?(https://arxiv.org/abs/2510.22548)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) have recently been equipped with increasingly long context windows, yet their long-context understanding on long-dependency tasks remains fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that have rarely been benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long-context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, spanning the domains of law, finance, games, and code. Accordingly, we carefully design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances of varying diversity and complexity through a scalable data curation pipeline to meet further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite their extensive context windows, popular LLMs can understand only a much shorter context than they claim, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.

Title: SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

Authors: Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22556
Pdf URL: https://arxiv.org/pdf/2510.22556
Copy Paste: [[2510.22556]] SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size(https://arxiv.org/abs/2510.22556)
Keywords: language model, llm
Abstract: The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.

Title: A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback

Authors: Zhifeng Wang, Xinyue Zheng, Chunyan Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22559
Pdf URL: https://arxiv.org/pdf/2510.22559
Copy Paste: [[2510.22559]] A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback(https://arxiv.org/abs/2510.22559)
Keywords: language model, llm, agent
Abstract: As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback in isolation rather than as a closed loop. This leads to coarse or opaque student models, assumption-bound adaptivity that ignores diagnostic posteriors, and generic, non-actionable feedback. To address these limitations, this paper presents an end-to-end personalized learning agent, EduLoop-Agent, which integrates a Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation Computerized Adaptive Testing strategy (BECAT), and large language models (LLMs). The NCD module provides fine-grained estimates of students' mastery at the knowledge-point level; BECAT dynamically selects subsequent items to maximize relevance and learning efficiency; and LLMs convert diagnostic signals into structured, actionable feedback. Together, these components form a closed-loop framework of "Diagnosis-Recommendation-Feedback." Experiments on the ASSISTments dataset show that the NCD module achieves strong performance on response prediction while yielding interpretable mastery assessments. The adaptive recommendation strategy improves item relevance and personalization, and the LLM-based feedback offers targeted study guidance aligned with identified weaknesses. Overall, the results indicate that the proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.

Title: Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems

Authors: Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22581
Pdf URL: https://arxiv.org/pdf/2510.22581
Copy Paste: [[2510.22581]] Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems(https://arxiv.org/abs/2510.22581)
Keywords: language model, llm
Abstract: The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have the potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from mainstream ITS development and provide comprehensive state-of-the-art evaluation practices, highlighting associated challenges through real-world case studies from careful and caring AIED research. Finally, building on insights from previous interdisciplinary AIED research, we propose three practical, feasible, and theoretically grounded research directions, rooted in learning science principles and aimed at establishing fair, unified, and scalable evaluation methodologies for ITSs.

Title: AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment

Authors: Dario Loi, Elena Maria Muià, Federico Siciliano, Giovanni Trappolini, Vincenzo Crisà, Peter Kruger, Fabrizio Silvestri
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22593
Pdf URL: https://arxiv.org/pdf/2510.22593
Copy Paste: [[2510.22593]] AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment(https://arxiv.org/abs/2510.22593)
Keywords: language model, llm
Abstract: We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks including MMLU-Pro and GPQA (78% and 63%, respectively), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.

Title: Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance

Authors: Mahyar Abbasian, Ramesh Jain
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.22602
Pdf URL: https://arxiv.org/pdf/2510.22602
Copy Paste: [[2510.22602]] Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance(https://arxiv.org/abs/2510.22602)
Keywords: agent
Abstract: Building on decades of success in digital infrastructure and biomedical innovation, we propose the Personal Care Utility (PCU) - a cybernetic system for lifelong health guidance. PCU is conceived as a global, AI-powered utility that continuously orchestrates multimodal data, knowledge, and services to assist individuals and populations alike. Drawing on multimodal agents, event-centric modeling, and contextual inference, it offers three essential capabilities: (1) trusted health information tailored to the individual, (2) proactive health navigation and behavior guidance, and (3) ongoing interpretation of recovery and treatment response after medical events. Unlike conventional episodic care, PCU functions as an ambient, adaptive companion - observing, interpreting, and guiding health in real time across daily life. By integrating personal sensing, experiential computing, and population-level analytics, PCU promises not only improved outcomes for individuals but also a new substrate for public health and scientific discovery. We describe the architecture, design principles, and implementation challenges of this emerging paradigm.

Title: Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

Authors: Ambalika Guha, Sajal Saha, Debanjan Ballav, Soumi Mitra, Hritwick Chakraborty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22629
Pdf URL: https://arxiv.org/pdf/2510.22629
Copy Paste: [[2510.22629]] Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal(https://arxiv.org/abs/2510.22629)
Keywords: language model
Abstract: Preserving linguistic diversity is necessary because every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is part of a project that aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by combining traditional linguistic methodology with AI. This bridge between linguistic research and technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.

Title: Rule-Based Explanations for Retrieval-Augmented LLM Systems

Authors: Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22689
Pdf URL: https://arxiv.org/pdf/2510.22689
Copy Paste: [[2510.22689]] Rule-Based Explanations for Retrieval-Augmented LLM Systems(https://arxiv.org/abs/2510.22689)
Keywords: language model, llm, retrieval-augmented generation
Abstract: If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., "if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first." To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions' value and efficiency.
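The brute-force baseline described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `brute_force_rules` and `toy_answer` are hypothetical names, the stub stands in for a real RAG-backed LLM call, and the sketch only derives rules about a single source being present or absent.

```python
from itertools import combinations


def brute_force_rules(sources, answer_fn):
    """Probe every subset of sources and derive single-source if-then rules.

    answer_fn is a stand-in (assumption) for a call to a RAG-backed LLM that
    takes the set of retrieved sources and returns the system's output.
    """
    # Query the system once per subset: 2^n probes for n sources.
    outputs = {}
    for size in range(len(sources) + 1):
        for subset in combinations(sources, size):
            key = frozenset(subset)
            outputs[key] = answer_fn(key)

    rules = []
    for source in sources:
        with_source = {out for key, out in outputs.items() if source in key}
        without_source = {out for key, out in outputs.items() if source not in key}
        # A rule holds only if the output is identical across all matching probes.
        if len(with_source) == 1:
            rules.append((f"{source} present", next(iter(with_source))))
        if len(without_source) == 1:
            rules.append((f"{source} absent", next(iter(without_source))))
    return rules


def toy_answer(retrieved):
    # Hypothetical stub: the answer depends only on one source.
    return "Oxford" if "THE ranking" in retrieved else "MIT"


rules = brute_force_rules(["THE ranking", "QS ranking", "blog post"], toy_answer)
for condition, output in rules:
    print(f"if {condition}, then output = {output}")
```

The number of probes doubles with every additional source, which is the cost the abstract's pruning optimizations aim to reduce.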

Title: SALSA: Single-pass Autoregressive LLM Structured Classification

Authors: Ruslan Berdichevsky, Shai Nahum-Gefen, Elad Ben Zaken
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22691
Pdf URL: https://arxiv.org/pdf/2510.22691
Copy Paste: [[2510.22691]] SALSA: Single-pass Autoregressive LLM Structured Classification(https://arxiv.org/abs/2510.22691)
Keywords: language model, llm, prompt
Abstract: Despite their impressive generalization capabilities, instruction-tuned Large Language Models often underperform on text classification benchmarks. We introduce SALSA, a coherent pipeline that combines structured prompting, class-to-token mapping, and parameter-efficient fine-tuning, thereby avoiding cold-start training. Each class label is mapped to a distinct output token, and prompts are constructed to elicit a single-token response. During inference, the model's output is projected only onto the logits of the relevant class tokens, enabling efficient and accurate classification in a single forward pass. SALSA achieves state-of-the-art results across diverse benchmarks, demonstrating its robustness and scalability for LLM-based classification applications.
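The inference step described above, projecting the model's output onto the logits of the class tokens only, can be illustrated with a small sketch. It assumes the next-token logits are already available as a plain list and that the class-to-token mapping is given; the function name, the toy vocabulary, and the token ids are hypothetical and not taken from the paper.

```python
import math


def classify_from_logits(vocab_logits, class_token_ids):
    """Restrict next-token logits to the class tokens and pick the best label.

    vocab_logits: logits over the full vocabulary from one forward pass.
    class_token_ids: mapping from class label to its dedicated token id
    (both are illustrative assumptions here).
    """
    labels = list(class_token_ids)
    selected = [vocab_logits[class_token_ids[label]] for label in labels]
    # Numerically stable softmax over the class-token logits only.
    peak = max(selected)
    exps = [math.exp(value - peak) for value in selected]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy 6-token vocabulary: token 5 has the highest logit overall, but it is
# not a class token, so it cannot be predicted.
logits = [0.1, 2.0, -1.0, 0.5, 1.2, 3.0]
class_tokens = {"negative": 1, "neutral": 3, "positive": 4}
label, probs = classify_from_logits(logits, class_tokens)
print(label, probs)
```

Because every non-class token is ignored, the prediction is always a valid label, which is what makes single-pass classification possible without parsing generated text.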

Title: $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker

Authors: Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Jiaxin Mao
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.22733
Pdf URL: https://arxiv.org/pdf/2510.22733
Copy Paste: [[2510.22733]] $\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker(https://arxiv.org/abs/2510.22733)
Keywords: llm, prompt
Abstract: Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework, $\text{E}^2\text{Rank}$, which stands for Efficient Embedding-based Ranking (and also Embedding-to-Rank). It extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, $\text{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
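The unified ranking function mentioned above, cosine similarity between a query embedding and the document embeddings, can be sketched in a few lines. The vectors below are toy values chosen for illustration; in the paper the query-side embedding comes from a listwise prompt encoded by the trained model, which this sketch does not reproduce. The names `cosine` and `rerank` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank(query_embedding, doc_embeddings):
    """Sort candidate documents by cosine similarity to the query embedding."""
    scored = [
        (doc_id, cosine(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional embeddings (assumption: real embeddings come from the model).
docs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
listwise_query = [0.8, 0.6]
ranking = rerank(listwise_query, docs)
print([doc_id for doc_id, _ in ranking])
```

Since document embeddings can be computed once and cached, reranking under this scheme costs one extra encoding of the enriched query plus a set of dot products.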

Title: Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study

Authors: Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22747
Pdf URL: https://arxiv.org/pdf/2510.22747
Copy Paste: [[2510.22747]] Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study(https://arxiv.org/abs/2510.22747)
Keywords: language model, llm
Abstract: Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks, while updating under 1% of model parameters. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.

Title: Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models

Authors: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22752
Pdf URL: https://arxiv.org/pdf/2510.22752
Copy Paste: [[2510.22752]] Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models(https://arxiv.org/abs/2510.22752)
Keywords: language model, llm, prompt
Abstract: In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.

Title: EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

Authors: Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao, Junyi Ao, Yuhao Zhang, Benyou Wang, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22758
Pdf URL: https://arxiv.org/pdf/2510.22758
Copy Paste: [[2510.22758]] EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models(https://arxiv.org/abs/2510.22758)
Keywords: language model, prompt
Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.

Title: Iterative Layer Pruning for Efficient Translation Inference

Authors: Yasmin Moslem, Muhammad Hazim Al Farouq, John D. Kelleher
Subjects: cs.CL, cs.PF
Abstract URL: https://arxiv.org/abs/2510.22763
Pdf URL: https://arxiv.org/pdf/2510.22763
Copy Paste: [[2510.22763]] Iterative Layer Pruning for Efficient Translation Inference(https://arxiv.org/abs/2510.22763)
Keywords: language model, llm
Abstract: Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their intensive computational requirements. In this paper, we address this challenge and present our submissions to the Model Compression track at the Conference on Machine Translation (WMT 2025). In our experiments, we investigate iterative layer pruning guided by layer importance analysis. We evaluate this method using the Aya-Expanse-8B model for translation from Czech to German, and from English to Egyptian Arabic. Our approach achieves substantial reductions in model size and inference time, while maintaining the translation quality of the baseline models.
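The abstract for "Iterative Layer Pruning for Efficient Translation Inference" does not spell out the pruning loop, so the following is only a minimal sketch of one plausible greedy variant: at each iteration, tentatively remove every remaining layer, re-score, and drop the layer whose removal hurts least. The names `iterative_layer_prune`, `toy_score`, and the `CONTRIBUTION` table are hypothetical and are not taken from the paper; a real setup would score with a translation metric on held-out data.

```python
from typing import Callable, List, Sequence


def iterative_layer_prune(
    layers: Sequence[str],
    score: Callable[[Sequence[str]], float],
    n_remove: int,
) -> List[str]:
    """Greedily drop `n_remove` layers, one per iteration.

    At each step every remaining layer is tentatively removed and the model is
    re-scored; the layer whose removal leaves the highest score is treated as
    the least important one and is pruned.
    """
    kept = list(layers)
    for _ in range(min(n_remove, len(kept))):
        least_important = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        del kept[least_important]
    return kept


# Toy stand-in for a translation-quality metric: each layer contributes a
# fixed amount of "quality" (made-up numbers, not results from the paper).
CONTRIBUTION = {"L0": 5.0, "L1": 0.2, "L2": 3.0, "L3": 0.1, "L4": 4.0}


def toy_score(active_layers: Sequence[str]) -> float:
    return sum(CONTRIBUTION[name] for name in active_layers)


pruned = iterative_layer_prune(list(CONTRIBUTION), toy_score, n_remove=2)
print(pruned)  # the two lowest-contribution layers are removed
```

Re-scoring after every removal is what makes the procedure iterative: a layer's importance is measured against the current, already-pruned model rather than the original one.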

Title: MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion

Authors: Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22768
Pdf URL: https://arxiv.org/pdf/2510.22768
Copy Paste: [[2510.22768]] MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion(https://arxiv.org/abs/2510.22768)
Keywords: language model
Abstract: As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how these models function as persuadees: how and why they can be influenced by persuasive multimodal inputs. Understanding both their susceptibility to persuasion and the effectiveness of different persuasive strategies is crucial, as overly persuadable models may adopt misleading beliefs, override user preferences, or generate unethical or unsafe outputs when exposed to manipulative messages. We introduce MMPersuade, a unified framework for systematically studying multimodal persuasion dynamics in LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts, and (ii) an evaluation framework that quantifies both persuasion effectiveness and model susceptibility via third-party agreement scoring and self-estimated token probabilities on conversation histories. Our study of six leading LVLMs as persuadees yields three key insights: (i) multimodal inputs substantially increase persuasion effectiveness (and model susceptibility) compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences decrease susceptibility, yet multimodal information maintains its persuasive advantage; and (iii) different strategies vary in effectiveness across contexts, with reciprocity being most potent in commercial and subjective contexts, and credibility and logic prevailing in adversarial contexts. By jointly analyzing persuasion effectiveness and susceptibility, MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.

Title: Scalable Supervising Software Agents with Patch Reasoner

Authors: Junjielong Xu, Boyin Tan, Xiaoyuan Liu, Chao Peng, Pengfei Gao, Pinjia He
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.22775
Pdf URL: https://arxiv.org/pdf/2510.22775
Copy Paste: [[2510.22775]] Scalable Supervising Software Agents with Patch Reasoner(https://arxiv.org/abs/2510.22775)
Keywords: language model, agent
Abstract: While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.

Title: VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions

Authors: Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, Hyunwook Lee, Hyunmin Song, Sungahn Ko, Taehwan Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22798
Pdf URL: https://arxiv.org/pdf/2510.22798
Copy Paste: [[2510.22798]] VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions(https://arxiv.org/abs/2510.22798)
Keywords: language model, prompt
Abstract: Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME, a Vision-Language Model for Evaluating Handwritten Mathematics Expressions, designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.

Title: Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP

Authors: Poli Nemkova, Amrit Adhikari, Matthew Pearson, Vamsi Krishna Sadu, Mark V. Albert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22823
Pdf URL: https://arxiv.org/pdf/2510.22823
Copy Paste: [[2510.22823]] Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP(https://arxiv.org/abs/2510.22823)
Keywords: language model, gpt, llm, prompt
Abstract: Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.

Title: Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Authors: Haowei Hua (1), Hong Jiao (2), Xinyi Wang (3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22830
Pdf URL: https://arxiv.org/pdf/2510.22830
Copy Paste: [[2510.22830]] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays(https://arxiv.org/abs/2510.22830)
Keywords: language model, prompt
Abstract: BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
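For readers unfamiliar with the QWK figure quoted in the essay-scoring abstract, here is a minimal pure-Python sketch of quadratic weighted kappa, the standard agreement statistic between two sets of integer scores. The function name and the toy ratings are illustrative assumptions; the abstract does not say which implementation the authors used.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    min_rating: int,
    max_rating: int,
) -> float:
    """Agreement between two integer ratings, penalising large disagreements more."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = max_rating - min_rating + 1
    if k < 2:
        raise ValueError("need at least two rating categories")
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms give the chance-level (expected) matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        # Both raters gave one identical score everywhere; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


human = [1, 2, 3, 4, 4, 2]
model = [1, 2, 3, 3, 4, 2]
print(quadratic_weighted_kappa(human, model, min_rating=1, max_rating=4))
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.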

Title: Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning

Authors: Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22844
Pdf URL: https://arxiv.org/pdf/2510.22844
Copy Paste: [[2510.22844]] Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning(https://arxiv.org/abs/2510.22844)
Keywords: language model, llm, prompt
Abstract: Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way discourse talk naturally organizes into interwoven topical strands that evolve over time. While threading has been widely studied in asynchronous text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. At the same time, large language models (LLMs) show promise for automating discourse analysis but often struggle with long-context tasks that depend on tracing these conversational links. In this paper, we investigate whether explicit thread linkages can improve LLM-based coding of relational moves in group talk. We contribute a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. We then test how threading influences performance on downstream coding of conversational analysis frameworks, that capture core collaborative actions such as agreeing, building, and eliciting. Our results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. We also discuss practical trade-offs in time and cost, emphasizing where human-AI hybrid approaches can yield the best value. Together, this work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions.

Title: Once Upon an Input: Reasoning via Per-Instance Program Synthesis

Authors: Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22849
Pdf URL: https://arxiv.org/pdf/2510.22849
Copy Paste: [[2510.22849]] Once Upon an Input: Reasoning via Per-Instance Program Synthesis(https://arxiv.org/abs/2510.22849)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance-level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
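The PIPS abstract reports gains in harmonic mean accuracy across benchmarks. As a small illustration of why that aggregate is stricter than a plain average, the sketch below compares the two on made-up per-benchmark accuracies (the numbers are hypothetical, not results from the paper).

```python
from statistics import harmonic_mean, mean

# Hypothetical per-benchmark accuracies: strong on two tasks, weak on one.
accuracies = [0.9, 0.9, 0.1]

arithmetic = mean(accuracies)
harmonic = harmonic_mean(accuracies)

# The harmonic mean is dominated by the weakest benchmark, so a method cannot
# score well on it by excelling on a few tasks while failing others.
print(f"arithmetic mean: {arithmetic:.3f}")
print(f"harmonic mean:   {harmonic:.3f}")
```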

Title: Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

Authors: Linyang He, Tianjun Zhong, Richard Antonello, Gavin Mischler, Micah Goldblum, Nima Mesgarani
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2510.22860
Pdf URL: https://arxiv.org/pdf/2510.22860
Copy Paste: [[2510.22860]] Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement(https://arxiv.org/abs/2510.22860)
Keywords: language model, llm
Abstract: Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly "entangled," mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.
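The residual disentanglement abstract describes regressing lower-level representations out of higher-level ones. The sketch below shows that core step in the simplest possible setting: one scalar feature per level and ordinary least squares with an intercept. The paper works with full embeddings and four levels; the scalar setup, the function names, and the toy data here are simplifying assumptions.

```python
from typing import List, Sequence


def residualize(target: Sequence[float], predictor: Sequence[float]) -> List[float]:
    """Remove from `target` the part linearly predictable from `predictor`."""
    n = len(target)
    if n != len(predictor) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(predictor) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    if sxx == 0.0:
        raise ValueError("predictor is constant; nothing to regress out")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(predictor, target)]


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


# Toy data: the "higher-level" signal mixes the lower-level one with extra structure.
lower_level = [0.0, 1.0, 2.0, 3.0, 4.0]
higher_level = [2.0 * x + extra for x, extra in zip(lower_level, [1.0, -1.0, 0.0, 1.0, -1.0])]

residual = residualize(higher_level, lower_level)
print(covariance(residual, lower_level))  # numerically zero: the shared part is gone
```

Applying this step level by level (each level regressed on the levels below it) is one way to obtain components that are close to orthogonal, which is the property the abstract emphasises.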

Title: Interpreting and Mitigating Unwanted Uncertainty in LLMs

Authors: Tiasa Singha Roy, Ayush Rajesh Jhaveri, Ilias Triantafyllopoulos
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22866
Pdf URL: https://arxiv.org/pdf/2510.22866
Copy Paste: [[2510.22866]] Interpreting and Mitigating Unwanted Uncertainty in LLMs(https://arxiv.org/abs/2510.22866)
Keywords: language model, llm, prompt
Abstract: Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for downstream tasks, we observe trade-offs with flip behavior. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.

Title: A Comprehensive Dataset for Human vs. AI Generated Text Detection

Authors: Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22874
Pdf URL: https://arxiv.org/pdf/2510.22874
Copy Paste: [[2510.22874]] A Comprehensive Dataset for Human vs. AI Generated Text Detection(https://arxiv.org/abs/2510.22874)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts and full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: this https URL.

Title: Batch Speculative Decoding Done Right

Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22876
Pdf URL: https://arxiv.org/pdf/2510.22876
Copy Paste: [[2510.22876]] Batch Speculative Decoding Done Right(https://arxiv.org/abs/2510.22876)
Keywords: llm
Abstract: Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence, the fundamental requirement that speculative decoding must produce identical token sequences to standard autoregressive generation. These violations occur precisely due to improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding EQSPEC that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3x throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at this https URL.
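To make the output-equivalence requirement from the batch speculative decoding abstract concrete, here is a minimal single-sequence sketch of greedy speculative decoding. It is not EQSPEC or EXSPEC: the "models" are toy deterministic functions, verification runs sequentially rather than in one parallel pass, and all names (`speculative_decode`, `toy_target`, `toy_draft`) are hypothetical. It only illustrates the accept-or-correct rule and that the result matches plain greedy decoding with the target.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: Sequence[int], max_new: int) -> List[int]:
    """Reference: standard autoregressive greedy decoding with the target model."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: Sequence[int], max_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = list(prompt)
    goal = len(prompt) + max_new
    while len(seq) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each proposed position (a real system does this
        #    in one parallel forward pass; it is sequential here for clarity).
        for i, token in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if token != expected:
                # Keep the matching prefix, then substitute the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every draft token was accepted; take one bonus target token if room.
            seq.extend(proposal)
            if len(seq) < goal:
                seq.append(target(seq))
    return seq


def toy_target(prefix: Sequence[int]) -> int:
    return (prefix[-1] * 3 + len(prefix)) % 7


def toy_draft(prefix: Sequence[int]) -> int:
    # Agrees with the target except at every third position.
    guess = toy_target(prefix)
    return (guess + 1) % 7 if len(prefix) % 3 == 0 else guess


reference = greedy_decode(toy_target, [2], max_new=10)
speculative = speculative_decode(toy_target, toy_draft, [2], max_new=10, k=4)
print(reference == speculative)
```

Because the number of tokens accepted per step depends on where the draft first disagrees, different sequences advance by different amounts; in a batch, that is the raggedness the abstract identifies as the source of equivalence violations.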

Title: Language Server CLI Empowers Language Agents with Process Rewards

Authors: Yifan Zhang, Lanser Contributors
Subjects: cs.CL, cs.AI, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.22907
Pdf URL: https://arxiv.org/pdf/2510.22907
Copy Paste: [[2510.22907]] Language Server CLI Empowers Language Agents with Process Rewards(https://arxiv.org/abs/2510.22907)
Keywords: language model, agent
Abstract: Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent's planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: this https URL
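The Lanser-CLI abstract mentions deterministic Analysis Bundles with stable content hashes but does not give the format. As a rough sketch of the general idea only, the snippet below normalizes a bundle and hashes its canonical JSON form so that key order and diagnostic order do not affect the hash. The bundle schema (`diagnostics`, `file`, `line`, `message`, `server`) and both function names are hypothetical and should not be read as Lanser-CLI's actual layout or API.

```python
import hashlib
import json
from typing import Any, Dict


def normalize_bundle(bundle: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with diagnostics in a fixed order (hypothetical schema)."""
    normalized = dict(bundle)
    normalized["diagnostics"] = sorted(
        bundle.get("diagnostics", []),
        key=lambda d: (d["file"], d["line"], d["message"]),
    )
    return normalized


def content_hash(bundle: Dict[str, Any]) -> str:
    """SHA-256 over canonical JSON: sorted keys, no insignificant whitespace."""
    canonical = json.dumps(
        normalize_bundle(bundle),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same facts, different key order and diagnostic order.
bundle_a = {
    "server": "example-ls 1.0",
    "diagnostics": [
        {"file": "a.py", "line": 3, "message": "unused import"},
        {"file": "b.py", "line": 1, "message": "undefined name"},
    ],
}
bundle_b = {
    "diagnostics": [
        {"message": "undefined name", "line": 1, "file": "b.py"},
        {"line": 3, "file": "a.py", "message": "unused import"},
    ],
    "server": "example-ls 1.0",
}

print(content_hash(bundle_a) == content_hash(bundle_b))
```

A hash that is invariant to incidental ordering is what allows a recorded bundle to be compared or replayed later against the same frozen snapshot.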

Title: Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Authors: Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22954
Pdf URL: https://arxiv.org/pdf/2510.22954
Copy Paste: [[2510.22954]] Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)(https://arxiv.org/abs/2510.22954)
Keywords: language model, prompt, chat
Abstract: Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that further break down into 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, Infinity-Chat presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.

Title: Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts

Authors: Anwesan Pal, Karen Hovsepian, Tinghao Guo, Mengnan Zhao, Somendra Tripathi, Nikos Kanakaris, George Mihaila, Sumit Nigam
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.22956
Pdf URL: https://arxiv.org/pdf/2510.22956
Copy Paste: [[2510.22956]] Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts(https://arxiv.org/abs/2510.22956)
Keywords: language model, llm, long context, prompt, retrieval-augmented generation
Abstract: Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA -- and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline -- up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at this https URL.

Title: MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

Authors: Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22967
Pdf URL: https://arxiv.org/pdf/2510.22967
Copy Paste: [[2510.22967]] MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs(https://arxiv.org/abs/2510.22967)
Keywords: language model, llm, agent
Abstract: The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.

Title: Measuring Teaching with LLMs

Authors: Michael Hardy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22968
Pdf URL: https://arxiv.org/pdf/2510.22968
Copy Paste: [[2510.22968]] Measuring Teaching with LLMs(https://arxiv.org/abs/2510.22968)
Keywords: language model, llm
Abstract: Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance with expert human ratings above 0.65 and surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models (those better aligned with human judgments) attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.

Title: Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures

Authors: Shenran Wang, Timothy Tin-Long Tse, Jian Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23006
Pdf URL: https://arxiv.org/pdf/2510.23006
Copy Paste: [[2510.23006]] Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures(https://arxiv.org/abs/2510.23006)
Keywords: language model, llm
Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.

Title: LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models

Authors: Sammriddh Gupta, Sonit Singh, Aditya Joshi, Mira Kim
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.23011
Pdf URL: https://arxiv.org/pdf/2510.23011
Copy Paste: [[2510.23011]] LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models(https://arxiv.org/abs/2510.23011)
Keywords: language model, agent
Abstract: Language educators strive to create a rich experience for learners, while they may be restricted in the extent of feedback and practice they can provide. We present the design and development of LangLingual, a conversational agent built using the LangChain framework and powered by Large Language Models. The system is specifically designed to provide real-time, grammar-focused feedback, generate context-aware language exercises and track learner proficiency over time. The paper discusses the architecture, implementation and evaluation of LangLingual in detail. The results indicate strong usability, positive learning outcomes and encouraging learner engagement.

Title: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23038
Pdf URL: https://arxiv.org/pdf/2510.23038
Copy Paste: [[2510.23038]] Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning(https://arxiv.org/abs/2510.23038)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.

Title: Knocking-Heads Attention

Authors: Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23052
Pdf URL: https://arxiv.org/pdf/2510.23052
Copy Paste: [[2510.23052]] Knocking-Heads Attention(https://arxiv.org/abs/2510.23052)
Keywords: language model
Abstract: Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms, whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA), simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other, facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.

Title: Quality-Aware Translation Tagging in Multilingual RAG system

Authors: Hoyeon Moon, Byeolhee Kim, Nikhil Verma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23070
Pdf URL: https://arxiv.org/pdf/2510.23070
Copy Paste: [[2510.23070]] Quality-Aware Translation Tagging in Multilingual RAG system(https://arxiv.org/abs/2510.23070)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality degrades response generation performance. Existing approaches either assume sufficient translation quality or utilize the rewriting method, which introduces factual distortion and hallucinations. To mitigate these problems, we propose Quality-Aware Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation quality along three dimensions (semantic equivalence, grammatical accuracy, and naturalness and fluency) and attaches these scores as metadata without altering the original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs ranging from 2.4B to 14B parameters, covering two low-resource languages (Korean and Finnish) and one high-resource language (Chinese). QTT-RAG outperforms the baselines by preserving factual integrity while enabling generator models to make informed decisions based on translation reliability. This approach allows for effective usage of cross-lingual documents in low-resource settings with limited native language documents, offering a practical and robust solution across multilingual domains.
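Note: the idea of attaching translation-quality scores as metadata while leaving the passage untouched can be pictured with a small sketch. Everything below is an illustrative assumption rather than the QTT-RAG implementation: the `TranslationQuality` class, the `tag_passage` helper, the bracketed header format, and the numeric scale are all hypothetical, and the abstract does not say how the scores are computed or serialized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationQuality:
    """Hypothetical container for the three dimensions named in the abstract."""
    semantic_equivalence: float
    grammatical_accuracy: float
    naturalness_fluency: float


def tag_passage(translated_text: str, quality: TranslationQuality) -> str:
    """Prepend a metadata header; the translated text itself is not modified."""
    header = (
        "[translation_quality "
        f"semantic_equivalence={quality.semantic_equivalence:.1f} "
        f"grammatical_accuracy={quality.grammatical_accuracy:.1f} "
        f"naturalness_fluency={quality.naturalness_fluency:.1f}]"
    )
    return header + "\n" + translated_text


# Usage: the scores here are made-up placeholders, not results from the paper.
passage = "Helsinki is the capital of Finland."
tagged = tag_passage(passage, TranslationQuality(4.5, 4.0, 3.5))
```

The point of this layout is that a downstream generator sees both the unaltered passage and a reliability signal, in contrast to a rewriting step that would change the passage itself.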

Title: A Survey on LLM Mid-training

Authors: Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23081
Pdf URL: https://arxiv.org/pdf/2510.23081
Copy Paste: [[2510.23081]] A Survey on LLM Mid-training(https://arxiv.org/abs/2510.23081)
Keywords: language model, llm
Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.

Title: MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

Authors: Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23090
Pdf URL: https://arxiv.org/pdf/2510.23090
Copy Paste: [[2510.23090]] MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models(https://arxiv.org/abs/2510.23090)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.

Title: Leveraging Hierarchical Organization for Medical Multi-document Summarization

Authors: Yi-Li Hsu, Katelyn X. Mei, Lucy Lu Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.23104
Pdf URL: https://arxiv.org/pdf/2510.23104
Copy Paste: [[2510.23104]] Leveraging Hierarchical Organization for Medical Multi-document Summarization(https://arxiv.org/abs/2510.23104)
Keywords: language model, gpt, llm
Abstract: Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.

Title: Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation

Authors: Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23123
Pdf URL: https://arxiv.org/pdf/2510.23123
Copy Paste: [[2510.23123]] Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation(https://arxiv.org/abs/2510.23123)
Keywords: language model, llm
Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a low-dimensional output space, with the dimensionality determined by the LoRA rank. In standard LoRA, all input tokens share the same weights and undergo an identical input-output projection. This limits LoRA's ability to capture token-specific information due to the inherent semantic differences among tokens. To address this limitation, we propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts LoRA weights according to the input token, thereby learning token-wise input-output projections in an end-to-end manner. Formally, the weights of TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA weights but achieves more granular adaptation by learning token-wise LoRA weights (i.e., token-wise input-output projections). Extensive experiments across multiple models and datasets demonstrate that TopLoRA consistently outperforms LoRA and its variants. The code is available at this https URL.
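Note: the $B\Sigma_X A$ form in the abstract above can be illustrated with a minimal pure-Python sketch. The abstract does not say how $\Sigma_X$ is generated from the token, so the generator matrix `G` and the exponential used below are assumptions made only for illustration, and the function names are hypothetical. The sketch shows the structural point: the rank stays at $r$, but the $r$ diagonal scales depend on the input token.

```python
import math


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_delta(x, A, B):
    """Standard LoRA update for one token: B @ (A @ x)."""
    return matvec(B, matvec(A, x))


def toplora_delta(x, A, B, G):
    """Token-wise update: B @ diag(sigma(x)) @ (A @ x).

    A is r x d_in and B is d_out x r, as in standard LoRA.
    G (r x d_in) is a hypothetical generator for the diagonal of Sigma_X;
    sigma(x) = exp(G @ x) is an assumed choice, not taken from the paper.
    """
    z = matvec(A, x)                              # project into the rank-r space
    sigma = [math.exp(g) for g in matvec(G, x)]   # token-dependent diagonal entries
    scaled = [s * zi for s, zi in zip(sigma, z)]  # apply diag(sigma) elementwise
    return matvec(B, scaled)


# Toy shapes: d_in = 3, r = 2, d_out = 4.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
G_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
G_token = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0]
```

With `G_zero` every diagonal entry is 1, so the update reduces to standard LoRA; with a non-zero generator the same `A` and `B` yield a different projection for each token.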

Title: ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix

Authors: Zile Yang, Ling Li, Na Di, Jinlong Pang, Yao Zhou, Hao Cheng, Bo Han, Jiaheng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23160
Pdf URL: https://arxiv.org/pdf/2510.23160
Copy Paste: [[2510.23160]] ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix(https://arxiv.org/abs/2510.23160)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.

Title: Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs

Authors: Hang Lei, Shengyi Zong, Zhaoyan Li, Ziren Zhou, Hao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23163
Pdf URL: https://arxiv.org/pdf/2510.23163
Copy Paste: [[2510.23163]] Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs(https://arxiv.org/abs/2510.23163)
Keywords: language model, llm
Abstract: The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.

Title: SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations

Authors: Shuai Huang, Wenxuan Zhao, Jun Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23182
Pdf URL: https://arxiv.org/pdf/2510.23182
Copy Paste: [[2510.23182]] SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations(https://arxiv.org/abs/2510.23182)
Keywords: language model, llm, chain-of-thought, agent
Abstract: As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at this https URL.

Title: DREaM: Drug-Drug Relation Extraction via Transfer Learning Method

Authors: Ali Fata, Hossein Rahmani, Parinaz Soltanzadeh, Amirhossein Derakhshan, Behrouz Minaei Bidgoli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23189
Pdf URL: https://arxiv.org/pdf/2510.23189
Copy Paste: [[2510.23189]] DREaM: Drug-Drug Relation Extraction via Transfer Learning Method(https://arxiv.org/abs/2510.23189)
Keywords: language model, llm
Abstract: Relation extraction between drugs plays a crucial role in identifying drug-drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, along with the development of large medical text databases, has enabled the low-cost extraction of such relations compared to other approaches that typically require expert knowledge. However, to the best of our knowledge, few datasets specifically designed for drug-drug relation extraction are currently available. Therefore, employing transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREaM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model. Quantitative results indicate that the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.

Title: Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports

Authors: Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23217
Pdf URL: https://arxiv.org/pdf/2510.23217
Copy Paste: [[2510.23217]] Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports(https://arxiv.org/abs/2510.23217)
Keywords: language model, hallucination
Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations.

Title: Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding

Authors: Mohammed Aljafari, Ismail Alturki, Ahmed Mori, Yehya Kadumi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23271
Pdf URL: https://arxiv.org/pdf/2510.23271
Copy Paste: [[2510.23271]] Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding(https://arxiv.org/abs/2510.23271)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive collection of authentic Arabic sources significantly expanded by digitizing historical manuscripts via a proprietary Arabic OCR engine, the model incorporates seminal scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside thousands of academic theses and peer-reviewed research papers. Conditioned through a deep linguistic engineering framework, Mubeen masters not just the meaning but the eloquence of Arabic, enabling precise understanding across classical texts, contemporary writing, and regional dialects with focus on comprehending user intent and delivering accurate, contextually relevant responses. Unlike other Arabic models relying on translated English data that often fail in intent detection or retrieval-augmented generation (RAG), Mubeen uses native Arabic sources to ensure cultural authenticity and accuracy. Its core innovation is the Practical Closure Architecture, designed to solve the "Utility Gap Crisis" where factually correct answers fail to resolve users' core needs, forcing them into frustrating cycles of re-prompting. By prioritizing clarity and decisive guidance, Mubeen transforms from an information repository into a decisive guide, aligning with Saudi Vision 2030. The model's architecture combines deep heritage specialization with multi-disciplinary expert modules, enabling robust performance across both cultural preservation and general knowledge domains.

Title: Code Aesthetics with Agentic Reward Feedback

Authors: Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23272
Pdf URL: https://arxiv.org/pdf/2510.23272
Copy Paste: [[2510.23272]] Code Aesthetics with Agentic Reward Feedback(https://arxiv.org/abs/2510.23272)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.

Title: A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

Authors: Thai-Binh Nguyen, Katerina Zmolikova, Pingchuan Ma, Ngoc Quan Pham, Christian Fuegen, Alexander Waibel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23276
Pdf URL: https://arxiv.org/pdf/2510.23276
Copy Paste: [[2510.23276]] A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results(https://arxiv.org/abs/2510.23276)
Keywords: chat
Abstract: We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields a substantial 50% improvement, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for MCoRec.

Title: DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model

Authors: Yuanzhen Xie, Liu Ye, Jiqun Chu, Mochi Gao, Hehuan Liu, Yunzhi Tan, Bo Hu, Zang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23284
Pdf URL: https://arxiv.org/pdf/2510.23284
Copy Paste: [[2510.23284]] DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model(https://arxiv.org/abs/2510.23284)
Keywords: gpt, chat, agent
Abstract: Text-to-SQL tasks have seen notable improvements since the release of ChatGPT, and agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks, including adaptive data repair, which automatically finds and fixes errors in the training dataset, and error data augmentation, where we specifically diffuse and enhance erroneous data predicted by the initially trained models. Meanwhile, we propose a Multi-Model collaboration training schema that trains multiple models with different augmented data, enabling them to possess distinct capabilities and complement each other, because the capability of a single fine-tuned model has been found to be very limited. Furthermore, we utilize an ensemble strategy to integrate the capabilities of multiple models to solve a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. The experimental results and ablation study demonstrate the effectiveness of the data-centric pipeline and the Multi-Model (MM) interactive iterative strategies, achieving first place among lightweight text-to-SQL models (within 70B).

Title: Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models

Authors: Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23334
Pdf URL: https://arxiv.org/pdf/2510.23334
Copy Paste: [[2510.23334]] Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models(https://arxiv.org/abs/2510.23334)
Keywords: language model, llm
Abstract: LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We hypothesize that for many alignment tasks, the initial tokens of a response are disproportionately more critical. To leverage this principle, we introduce AdaSearch, a novel blockwise search strategy. It adaptively allocates a fixed computational budget using a sampling schedule, focusing search effort on these critical tokens. We apply AdaSearch to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our comprehensive evaluation across eight LLMs demonstrates that AdaSearch outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates improve by over 10% for harmlessness generation, controlled sentiment generation, and for mathematical reasoning tasks relative to Best-of-N.
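The idea of spending a fixed sampling budget unevenly across blocks, with more effort on the earliest tokens, can be illustrated with a small sketch. The abstract does not specify AdaSearch's actual schedule, so the geometric decay, the function name `allocate_budget`, and its parameters are all assumptions made only for illustration.

```python
def allocate_budget(total_samples: int, num_blocks: int, decay: float = 0.5) -> list[int]:
    """Split a fixed sample budget across blocks, front-loading early blocks.

    Hypothetical schedule (not taken from the paper): every block gets one
    sample, and the remainder is shared in proportion to decay**block_index.
    """
    if num_blocks <= 0 or total_samples < num_blocks:
        raise ValueError("need at least one sample per block")
    weights = [decay ** i for i in range(num_blocks)]
    scale = sum(weights)
    remaining = total_samples - num_blocks
    # Proportional share of the remainder, rounded down.
    extra = [int(remaining * w / scale) for w in weights]
    # Hand any samples lost to rounding to the earliest blocks.
    leftover = remaining - sum(extra)
    for i in range(leftover):
        extra[i % num_blocks] += 1
    return [1 + e for e in extra]


if __name__ == "__main__":
    print(allocate_budget(16, 4))
```

A uniform Best-of-N style allocation would instead give every block the same count; the sketch only shows how the same total can be redistributed toward the start of the response.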

Title: BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning

Authors: Siyuan Zheng, Pai Liu, Xi Chen, Jizheng Dong, Sihan Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23337
Pdf URL: https://arxiv.org/pdf/2510.23337
Copy Paste: [[2510.23337]] BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning(https://arxiv.org/abs/2510.23337)
Keywords: language model, gpt, llm, prompt
Abstract: Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.

Title: LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data

Authors: Teng Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23341
Pdf URL: https://arxiv.org/pdf/2510.23341
Copy Paste: [[2510.23341]] LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data(https://arxiv.org/abs/2510.23341)
Keywords: language model, llm
Abstract: The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching techniques or resource-intensive large language models (LLMs). While recent tools leverage LLMs to generate KGs, their computational demands limit accessibility for low-resource environments. Our paper introduces LightKGG, a novel framework that enables efficient KG extraction from textual data using small-scale language models (SLMs) through two key technical innovations: (1) Context-integrated Graph extraction integrates contextual information with nodes and edges into a unified graph structure, reducing the reliance on complex semantic processing while maintaining more key information; (2) Topology-enhanced relationship inference leverages the inherent topology of the extracted graph to efficiently infer relationships, enabling relationship discovery without relying on complex language understanding capabilities of LLMs. By enabling accurate KG construction with minimal hardware requirements, this work bridges the gap between automated knowledge extraction and practical deployment scenarios while introducing scientifically rigorous methods for optimizing SLM efficiency in structured NLP tasks.

Title: How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes

Authors: Sheri Osborn, Rohit Valecha, H. Raghav Rao, Dan Sass, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23358
Pdf URL: https://arxiv.org/pdf/2510.23358
Copy Paste: [[2510.23358]] How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes(https://arxiv.org/abs/2510.23358)
Keywords: language model, llm, prompt
Abstract: Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.

Title: Detecting Religious Language in Climate Discourse

Authors: Evy Beijen, Pien Pieterse, Yusuf Çelik, Willem Th. van Peursen, Sandjai Bhulai, Meike Morren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23395
Pdf URL: https://arxiv.org/pdf/2510.23395
Copy Paste: [[2510.23395]] Detecting Religious Language in Climate Discourse(https://arxiv.org/abs/2510.23395)
Keywords: language model, llm
Abstract: Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence. The results show that the rule-based method consistently labels more sentences as religious than LLMs. These findings highlight not only the methodological challenges of computationally detecting religious language but also the broader tension over whether religious language should be defined by vocabulary alone or by contextual meaning. This study contributes to digital methods in religious studies by demonstrating both the potential and the limitations of approaches for analyzing how the sacred persists in climate discourse.

Title: EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting

Authors: Musleh Alharthi, Kaleel Mahmood, Sarosh Patel, Ausif Mahmood
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23396
Pdf URL: https://arxiv.org/pdf/2510.23396
Copy Paste: [[2510.23396]] EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting(https://arxiv.org/abs/2510.23396)
Keywords: language model, llm
Abstract: The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Series Forecasting (TSF), where superior performance has been shown. However, a recent important paper questioned their effectiveness by demonstrating that a simple single layer linear model outperforms Transformer-based models. This was soon shown to be not as valid, by a better transformer-based model termed PatchTST. More recently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow-up paper challenged this by demonstrating that removing the LLM component or replacing it with a basic attention layer in fact yields better performance. One of the challenges in forecasting is the fact that TSF data favors the more recent past, and is sometimes subject to unpredictable events. Based upon these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines the state-of-the-art (SOTA) models including xLSTM, enhanced Linear, PatchTST, and minGRU, among others. This set of complementary and diverse models for TSF is integrated in a Transformer-based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.

Title: BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents

Authors: Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23458
Pdf URL: https://arxiv.org/pdf/2510.23458
Copy Paste: [[2510.23458]] BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents(https://arxiv.org/abs/2510.23458)
Keywords: llm, agent
Abstract: Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed budget TTS methods.
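The retry-until-confident loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `run_agent` callable, the threshold of 0.8, the attempt cap, and the fallback to the most confident answer seen are all assumptions.

```python
from typing import Callable, Tuple


def answer_with_retries(
    run_agent: Callable[[int], Tuple[str, float]],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> Tuple[str, float, int]:
    """Re-run a (hypothetical) agent until its verbalized confidence is high enough.

    run_agent(attempt) returns (answer, confidence in [0, 1]).
    Returns (answer, confidence, attempts_used).
    """
    best_answer, best_conf = "", -1.0
    for attempt in range(1, max_attempts + 1):
        answer, conf = run_agent(attempt)
        # Remember the most confident answer as a fallback.
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        # Stop early once the confidence is satisfactory, saving further rollouts.
        if conf >= threshold:
            return answer, conf, attempt
    return best_answer, best_conf, max_attempts


if __name__ == "__main__":
    scripted = {1: ("Paris?", 0.3), 2: ("Paris", 0.9), 3: ("Paris", 0.95)}
    print(answer_with_retries(lambda attempt: scripted[attempt]))
```

Stopping as soon as the threshold is met is what would let such a scheme use fewer tokens than a fixed budget that always runs every attempt.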

Title: Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts

Authors: Nikesh Gyawali, Doina Caragea, Alex Vasenkov, Cornelia Caragea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23464
Pdf URL: https://arxiv.org/pdf/2510.23464
Copy Paste: [[2510.23464]] Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts(https://arxiv.org/abs/2510.23464)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required a large, expensive labeled dataset, making sentence-level stance detection towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance in the financial domain without requiring extensive labeled data.

Title: MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Authors: Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23477
Pdf URL: https://arxiv.org/pdf/2510.23477
Copy Paste: [[2510.23477]] MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring(https://arxiv.org/abs/2510.23477)
Keywords: language model, llm, prompt
Abstract: Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

Title: IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering

Authors: Jieyong Kim, Maryam Amirizaniani, Soojin Yoon, Dongha Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23536
Pdf URL: https://arxiv.org/pdf/2510.23536
Copy Paste: [[2510.23536]] IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering(https://arxiv.org/abs/2510.23536)
Keywords: language model, llm
Abstract: Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.

Title: LimRank: Less is More for Reasoning-Intensive Information Reranking

Authors: Tingyu Song, Yilun Zhao, Siyue Zhang, Chen Zhao, Arman Cohan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.23544
Pdf URL: https://arxiv.org/pdf/2510.23544
Copy Paste: [[2510.23544]] LimRank: Less is More for Reasoning-Intensive Information Reranking(https://arxiv.org/abs/2510.23544)
Keywords: llm, retrieval-augmented generation
Abstract: Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.

Title: Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models

Authors: Luis Ramos, Hiram Calvo, Olga Kolesnikova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23585
Pdf URL: https://arxiv.org/pdf/2510.23585
Copy Paste: [[2510.23585]] Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models(https://arxiv.org/abs/2510.23585)
Keywords: llm
Abstract: The identification of hope speech has become a promising NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. This proposal evaluates traditional machine learning models and fine-tuned transformers on a hope speech dataset previously split into train, development, and test sets. On the development set, a linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM with RBF kernel reached 0.77, and Naïve Bayes hit 0.75. Transformer models delivered better results: the best model achieved weighted precision of 0.82, weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80 accuracy. These results suggest that while optimally configured traditional machine learning models remain agile, transformer architectures detect some subtle semantics of hope to achieve higher precision and recall in hope speech detection, suggesting that larger transformers and LLMs could perform better on small datasets.

Title: Think Twice: Branch-and-Rethink Reasoning Reward Model

Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23596
Pdf URL: https://arxiv.org/pdf/2510.23596
Copy Paste: [[2510.23596]] Think Twice: Branch-and-Rethink Reasoning Reward Model(https://arxiv.org/abs/2510.23596)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.