2025-09-23

Title: On LLM-Based Scientific Inductive Reasoning Beyond Equations

Authors: Brian S. Lin, Jiaxin Yuan, Zihan Zhou, Shouli Wang, Shuo Wang, Cunliang Kong, Qi Shi, Yuxuan Li, Liner Yang, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16226
Pdf URL: https://arxiv.org/pdf/2509.16226
Copy Paste: [[2509.16226]] On LLM-Based Scientific Inductive Reasoning Beyond Equations(https://arxiv.org/abs/2509.16226)
Keywords: language model, llm
Abstract: As large language models (LLMs) increasingly exhibit human-like capabilities, a fundamental question emerges: How can we enable LLMs to learn the underlying patterns from limited examples in entirely novel environments and apply them effectively? This question is central to the ability of LLMs in inductive reasoning. Existing research on LLM-based inductive reasoning can be broadly categorized based on whether the underlying rules are expressible via explicit mathematical equations. However, many recent studies in the beyond-equations category have emphasized rule design without grounding them in specific scenarios. Inspired by the parallels between inductive reasoning and human scientific discovery, we propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations and introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings. Our experimental results show that current LLMs still struggle with this task, underscoring its difficulty and the need for further advancement in this area.

Title: Gender and Political Bias in Large Language Models: A Demonstration Platform

Authors: Wenjie Lin, Hange Liu, Xutao Mao, Yingying Zhuang, Jingwei Shi, Xudong Han, Tianyu Shi, Jinrui Yang
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16264
Pdf URL: https://arxiv.org/pdf/2509.16264
Copy Paste: [[2509.16264]] Gender and Political Bias in Large Language Models: A Demonstration Platform(https://arxiv.org/abs/2509.16264)
Keywords: language model, llm
Abstract: We present ParlAI Vote, an interactive system for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This platform connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. The system unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.

Title: Language Modeling with Learned Meta-Tokens

Authors: Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16278
Pdf URL: https://arxiv.org/pdf/2509.16278
Copy Paste: [[2509.16278]] Language Modeling with Learned Meta-Tokens(https://arxiv.org/abs/2509.16278)
Keywords: language model, gpt
Abstract: While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention in addition to causal multi-head attention, and study the impact of these tokens on a suite of synthetic tasks. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens and our meta-attention mechanism achieves strong performance on these tasks after fine-tuning. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding. This enables them to operate as trainable, content-based landmarks, implicitly compressing preceding context and "caching" it in the meta-token. At inference-time, the meta-token points to relevant context, facilitating length generalization up to 2$\times$ its context window, even after extension with YaRN. We provide further evidence of these behaviors by visualizing model internals to study the residual stream, and assessing the compression quality by information-theoretic analysis on the rate-distortion tradeoff. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into the nature of their behavior towards length generalization.

Title: Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap

Authors: Andrew Zhu, Chris Callison-Burch
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.16325
Pdf URL: https://arxiv.org/pdf/2509.16325
Copy Paste: [[2509.16325]] Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap(https://arxiv.org/abs/2509.16325)
Keywords: llm, chat, agent
Abstract: Imagine AI assistants that enhance conversations without interrupting them: quietly providing relevant information during a medical consultation, seamlessly preparing materials as teachers discuss lesson plans, or unobtrusively scheduling meetings as colleagues debate calendars. While modern conversational LLM agents directly assist human users with tasks through a chat interface, we study this alternative paradigm for interacting with LLM agents, which we call "overhearing agents." Rather than demanding the user's attention, overhearing agents continuously monitor ambient activity and intervene only when they can provide contextual assistance. In this paper, we present the first analysis of overhearing LLM agents as a distinct paradigm in human-AI interaction and establish a taxonomy of overhearing agent interactions and tasks grounded in a survey of works on prior LLM-powered agents and exploratory HCI studies. Based on this taxonomy, we create a list of best practices for researchers and developers building overhearing agent systems. Finally, we outline the remaining research gaps and reveal opportunities for future research in the overhearing paradigm.

Title: HARE: an entity and relation centric evaluation framework for histopathology reports

Authors: Yunsoo Kim, Michal W. S. Ong, Alex Shavick, Honghan Wu, Adam P. Levine
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.16326
Pdf URL: https://arxiv.org/pdf/2509.16326
Copy Paste: [[2509.16326]] HARE: an entity and relation centric evaluation framework for histopathology reports(https://arxiv.org/abs/2509.16326)
Keywords: language model
Abstract: Medical domain automated text generation is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially in instances where domain-specific metrics are lacking, e.g. histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity and relation centric framework, composed of a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a novel metric, which prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To develop the HARE benchmark, we annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model to develop HARE-NER and HARE-RE which achieved the highest overall F1-score (0.915) among the tested models. The proposed HARE metric outperformed traditional metrics including ROUGE and Meteor, as well as radiology metrics such as RadGraph-XL, with the highest correlation and the best regression to expert evaluations (higher than the second best method, GREEN, a large language model based radiology report evaluator, by Pearson $r = 0.168$, Spearman $\rho = 0.161$, Kendall $\tau = 0.123$, $R^2 = 0.176$, $RMSE = 0.018$). We release HARE, datasets, and the models at this https URL to foster advancements in histopathology report generation, providing a robust framework for improving the quality of reports.
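Note on the agreement statistics named in the abstract above (Pearson $r$, Spearman $\rho$, Kendall $\tau$): the following is a minimal sketch of how a metric's scores might be compared against expert ratings. It is not the authors' evaluation code. The score lists are hypothetical, and the rank helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; assumes no ties (an illustrative simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical data: automatic metric scores and expert ratings for five reports.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
expert_scores = [1, 3, 2, 5, 4]
print(pearson(metric_scores, expert_scores))
print(spearman(metric_scores, expert_scores))
print(kendall_tau(metric_scores, expert_scores))
```

Pearson measures linear agreement, while the two rank-based statistics only ask whether the metric orders reports the same way the experts do.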

Title: RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering

Authors: Weikang Qiu, Tinglin Huang, Ryan Rullo, Yucheng Kuang, Ali Maatouk, S. Raquel Ramos, Rex Ying
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16360
Pdf URL: https://arxiv.org/pdf/2509.16360
Copy Paste: [[2509.16360]] RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering(https://arxiv.org/abs/2509.16360)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Large Language Models (LLMs) hold promise in addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning abilities, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses, specifically, their ability to answer public health problems clearly and simply to people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid grade level and professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies: standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted variant. Token-adapted GRPO achieves the best results, advancing the development of more practical and user-friendly public health agents. These results represent a step toward building more practical agents for public health.
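Note on the Flesch-Kincaid grade level mentioned in the abstract above: the standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it with a crude vowel-group syllable counter. That counter and the tokenization are illustrative assumptions, not the benchmark's implementation.

```python
import re

def fk_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts (standard published constants)."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

def count_syllables(word):
    """Rough heuristic (assumption): one syllable per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade_text(text):
    """Estimate the grade level of a text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)

# Hypothetical answers to the same public health question.
plain = "Wash your hands. Stay home."
dense = "Prophylactic immunization substantially mitigates epidemiological vulnerability."
print(fk_grade_text(plain), fk_grade_text(dense))
```

Shorter sentences and shorter words both lower the score, which is why the plain answer rates as easier to read than the dense one.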

Title: Whisper-UT: A Unified Translation Framework for Speech and Text

Authors: Cihan Xiao, Matthew Wiesner, Debashish Chakraborty, Reno Kriz, Keith Cunningham, Kenton Murray, Kevin Duh, Luis Tavarez-Arce, Paul McNamee, Sanjeev Khudanpur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16375
Pdf URL: https://arxiv.org/pdf/2509.16375
Copy Paste: [[2509.16375]] Whisper-UT: A Unified Translation Framework for Speech and Text(https://arxiv.org/abs/2509.16375)
Keywords: prompt
Abstract: Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.

Title: Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans

Authors: Deuksin Kwon, Kaleen Shrestha, Bin Han, Elena Hayoung Lee, Gale Lucas
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.16394
Pdf URL: https://arxiv.org/pdf/2509.16394
Copy Paste: [[2509.16394]] Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans(https://arxiv.org/abs/2509.16394)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large Language Models (LLMs) are increasingly deployed in socially complex, interaction-driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT-4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude-3.7-Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.

Title: 'Rich Dad, Poor Lad': How do Large Language Models Contextualize Socioeconomic Factors in College Admission ?

Authors: Huy Nghiem, Phuong-Anh Nguyen-Le, John Prindle, Rachel Rudinger, Hal Daumé III
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.16400
Pdf URL: https://arxiv.org/pdf/2509.16400
Copy Paste: [[2509.16400]] 'Rich Dad, Poor Lad': How do Large Language Models Contextualize Socioeconomic Factors in College Admission ?(https://arxiv.org/abs/2509.16400)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly involved in high-stakes domains, yet how they reason about socially sensitive decisions remains underexplored. We present a large-scale audit of LLMs' treatment of socioeconomic status (SES) in college admissions decisions using a novel dual-process framework inspired by cognitive science. Leveraging a synthetic dataset of 30,000 applicant profiles grounded in real-world correlations, we prompt 4 open-source LLMs (Qwen 2, Mistral v0.3, Gemma 2, Llama 3.1) under 2 modes: a fast, decision-only setup (System 1) and a slower, explanation-based setup (System 2). Results from 5 million prompts reveal that LLMs consistently favor low-SES applicants -- even when controlling for academic performance -- and that System 2 amplifies this tendency by explicitly invoking SES as compensatory justification, highlighting both their potential and volatility as decision-makers. We then propose DPAF, a dual-process audit framework to probe LLMs' reasoning behaviors in sensitive applications.

Title: Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Authors: Richard Diehl Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16413
Pdf URL: https://arxiv.org/pdf/2509.16413
Copy Paste: [[2509.16413]] Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research(https://arxiv.org/abs/2509.16413)
Keywords: language model
Abstract: Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis.

Title: Evaluating CxG Generalisation in LLMs via Construction-Based NLI Fine Tuning

Authors: Tom Mackintosh, Harish Tayyar Madabushi, Claire Bonial
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16422
Pdf URL: https://arxiv.org/pdf/2509.16422
Copy Paste: [[2509.16422]] Evaluating CxG Generalisation in LLMs via Construction-Based NLI Fine Tuning(https://arxiv.org/abs/2509.16422)
Keywords: language model, llm
Abstract: We probe large language models' ability to learn deep form-meaning mappings as defined by construction grammars. We introduce the ConTest-NLI benchmark of 80k sentences covering eight English constructions from highly lexicalized to highly schematic. Our pipeline generates diverse synthetic NLI triples via templating and the application of a model-in-the-loop filter. This provides aspects of human validation to ensure challenge and label reliability. Zero-shot tests on leading LLMs reveal a 24% drop in accuracy between naturalistic (88%) and adversarial data (64%), with schematic patterns proving hardest. Fine-tuning on a subset of ConTest-NLI yields up to 9% improvement, yet our results highlight persistent abstraction gaps in current LLMs and offer a scalable framework for evaluating construction-informed learning.

Title: Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations

Authors: Yunzhe Wang, Gale M. Lucas, Burcin Becerik-Gerber, Volkan Ustun
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.16457
Pdf URL: https://arxiv.org/pdf/2509.16457
Copy Paste: [[2509.16457]] Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations(https://arxiv.org/abs/2509.16457)
Keywords: llm, agent
Abstract: Language-driven generative agents have enabled large-scale social simulations with transformative uses, from interpersonal training to aiding global policy-making. However, recent studies indicate that generative agent behaviors often deviate from expert expectations and real-world data--a phenomenon we term the Behavior-Realism Gap. To address this, we introduce a theoretical framework called Persona-Environment Behavioral Alignment (PEBA), formulated as a distribution matching problem grounded in Lewin's behavior equation stating that behavior is a function of the person and their environment. Leveraging PEBA, we propose PersonaEvolve (PEvo), an LLM-based optimization algorithm that iteratively refines agent personas, implicitly aligning their collective behaviors with realistic expert benchmarks within a specified environmental context. We validate PEvo in an active shooter incident simulation we developed, achieving an 84% average reduction in distributional divergence compared to no steering and a 34% improvement over explicit instruction baselines. Results also show PEvo-refined personas generalize to novel, related simulation scenarios. Our method greatly enhances behavioral realism and reliability in high-stakes social simulations. More broadly, the PEBA-PEvo framework provides a principled approach to developing trustworthy LLM-driven social simulations.
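Note on "distributional divergence" in the abstract above: the abstract does not say which divergence is used, so the sketch below assumes Jensen-Shannon divergence between a simulated behavior distribution and an expert reference. The behavior categories and counts are hypothetical and only illustrate how a percentage reduction could be computed.

```python
from math import log2

def normalize(counts):
    """Turn raw behavior counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl(p, q):
    """KL divergence in bits; terms with p[k] == 0 contribute nothing."""
    return sum(p[k] * log2(p[k] / q[k]) for k in p if p[k] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (0 = identical, 1 = disjoint, using log base 2)."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical behavior counts for an evacuation-style scenario.
expert = normalize({"run": 60, "hide": 30, "fight": 10})
unsteered = normalize({"run": 20, "hide": 30, "fight": 50})
refined = normalize({"run": 55, "hide": 32, "fight": 13})

before = js_divergence(unsteered, expert)
after = js_divergence(refined, expert)
print(f"relative reduction in divergence: {1 - after / before:.0%}")
```

Under this reading, persona refinement is judged by how far it moves the agents' aggregate behavior distribution toward the reference, not by any single agent's actions.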

Title: Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models

Authors: Mina Arzaghi, Alireza Dehghanpour Farashah, Florian Carichon, Golnoosh Farnadi
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16462
Pdf URL: https://arxiv.org/pdf/2509.16462
Copy Paste: [[2509.16462]] Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models(https://arxiv.org/abs/2509.16462)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. While prior studies have questioned whether intrinsic bias in LLMs affects fairness at the downstream task level, this work empirically investigates the connection. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation (CDA). We examine this relationship through real-world financial classification tasks, including salary prediction, employment status, and creditworthiness assessment. Using three open-source LLMs, we evaluate models both as frozen embedding extractors and as fine-tuned classifiers. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy. Our framework offers practical guidance on where mitigation efforts can be most effective and highlights the importance of applying early-stage mitigation before downstream deployment.
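Note on demographic parity, the downstream fairness metric named in the abstract above: a common formulation is the gap in positive-prediction rates between groups. The sketch below computes that gap on hypothetical predictions. How the paper derives its "up to 82%" figure is not stated in the abstract, so the relative-improvement line is an assumption.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions (1s) among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate across groups (0 = parity)."""
    rates = [positive_rate(predictions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)

# Hypothetical creditworthiness predictions before and after mitigation.
groups = ["f", "f", "f", "f", "m", "m", "m", "m"]
before = [1, 0, 0, 0, 1, 1, 1, 0]
after = [1, 1, 0, 0, 1, 1, 1, 0]

gap_before = demographic_parity_difference(before, groups)
gap_after = demographic_parity_difference(after, groups)
# Assumed definition of "improvement": relative shrinkage of the gap.
print(f"gap {gap_before:.2f} -> {gap_after:.2f}, "
      f"relative improvement {(gap_before - gap_after) / gap_before:.0%}")
```

A gap of zero means each group receives positive predictions at the same rate. It says nothing about accuracy, which is why the abstract reports accuracy separately.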

Title: Computational Analysis of Conversation Dynamics through Participant Responsivity

Authors: Margaret Hughes, Brandon Roy, Elinor Poole-Dayan, Deb Roy, Jad Kabbara
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.16464
Pdf URL: https://arxiv.org/pdf/2509.16464
Copy Paste: [[2509.16464]] Computational Analysis of Conversation Dynamics through Participant Responsivity(https://arxiv.org/abs/2509.16464)
Keywords: language model, llm
Abstract: Growing literature explores toxicity and polarization in discourse, with comparatively less work on characterizing what makes dialogue prosocial and constructive. We explore conversational discourse and investigate a method for characterizing its quality built upon the notion of "responsivity" -- whether one person's conversational turn is responding to a preceding turn. We develop and evaluate methods for quantifying responsivity -- first through semantic similarity of speaker turns, and second by leveraging state-of-the-art large language models (LLMs) to identify the relation between two speaker turns. We evaluate both methods against a ground truth set of human-annotated conversations. Furthermore, selecting the better performing LLM-based approach, we characterize the nature of the response -- whether it responded to that preceding turn in a substantive way or not. We view these responsivity links as a fundamental aspect of dialogue but note that conversations can exhibit significantly different responsivity structures. Accordingly, we then develop conversation-level derived metrics to address various aspects of conversational discourse. We use these derived metrics to explore other conversations and show that they support meaningful characterizations and differentiations across a diverse collection of conversations.
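Note on the first responsivity method in the abstract above (semantic similarity of speaker turns): the sketch below links each turn to the most similar earlier turn by a different speaker. It swaps in bag-of-words cosine similarity as a stand-in for a real sentence-embedding model. The 0.3 threshold and the example conversation are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector; a stand-in (assumption) for a sentence embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def responsivity_links(turns, threshold=0.3):
    """Return (turn_index, responded_to_index) pairs.

    `turns` is a list of (speaker, text). A turn is linked to the most similar
    earlier turn from another speaker when similarity reaches `threshold`.
    """
    links = []
    for i in range(1, len(turns)):
        best_j, best_sim = None, threshold
        for j in range(i):
            if turns[j][0] == turns[i][0]:
                continue  # a speaker does not respond to their own turn
            sim = cosine(bow(turns[i][1]), bow(turns[j][1]))
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            links.append((i, best_j))
    return links

# Hypothetical three-turn conversation.
conversation = [
    ("A", "I think the bus schedule is confusing"),
    ("B", "The bus schedule confuses me too"),
    ("A", "Anyway, what about lunch"),
]
print(responsivity_links(conversation))
```

Conversation-level metrics of the kind the abstract describes could then be derived from the resulting links, for example the fraction of turns that respond to anything at all.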

Title: The Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia

Authors: Zixun Chen, Petr Babkin, Akshat Gupta, Gopala Anumanchipalli, Xiaomo Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16487
Pdf URL: https://arxiv.org/pdf/2509.16487
Copy Paste: [[2509.16487]] The Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia(https://arxiv.org/abs/2509.16487)
Keywords: language model, llm
Abstract: Dialogue is one of the landmark abilities of large language models (LLMs). Despite its ubiquity, few studies actually distinguish specific ingredients underpinning dialogue behavior emerging during post-training. We employ a comprehensive suite of model-based metrics, each targeting a distinct fine-grained aspect of dialogue, motivated by linguistic theory. We evaluate how the performance of pre-trained Pythia models changes with respect to each of those dimensions, depending on model size and as a result of supervised fine-tuning on conversational datasets. We observe only a mild impact of raw model size on most metrics, whereas fine-tuning quickly saturates the scores for all but the smallest models tested. Somewhat contrary to our expectations, many metrics show very similar trends, especially if they are all rooted in the same evaluator model, which raises the question of their reliability in measuring a specific dimension. To that end, we conduct additional analyses of score distributions, metric correlations, and term frequencies in generated responses to help explain our observations.

Title: Can an Individual Manipulate the Collective Decisions of Multi-Agents?

Authors: Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16494
Pdf URL: https://arxiv.org/pdf/2509.16494
Copy Paste: [[2509.16494]] Can an Individual Manipulate the Collective Decisions of Multi-Agents?(https://arxiv.org/abs/2509.16494)
Keywords: language model, llm, agent
Abstract: Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
摘要：单个大型语言模型（LLM）在各个领域（例如医疗保健和法律）都表现出了重要的功能。最近的研究还表明，协调的多代理系统通过协作表现出增强的决策和推理能力。但是，由于单个LLM的脆弱性以及难以访问多代理系统中所有代理的困难，因此出现了一个关键问题：如果攻击者只知道一个代理，他们仍然可以生成能够误导集体决定的对抗性样本吗？为了探索这个问题，我们将其作为一个具有不完整信息的游戏进行表达，攻击者只知道一个目标代理，并且缺乏系统中其他代理的知识。通过此公式，我们提出了M-Spoiler，该框架模拟了多代理系统中的代理相互作用以生成对抗样本。然后，这些样本用于操纵目标系统中的目标代理，从而误导了系统的协作决策过程。更具体地说，M-Spoiler引入了一种顽固的剂，该剂通过模拟目标系统中代理的潜在顽固反应来积极帮助优化对抗样本。这增强了生成的对抗样品在误导系统中的有效性。通过跨各种任务的广泛实验，我们的发现证实了多代理系统中各个代理的知识所带来的风险，并证明了我们框架的有效性。我们还探索了几种防御机制，这表明我们提出的攻击框架仍然比基线更有效，强调了对防御策略进行进一步研究的必要性。

Title: AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans

Authors: Wei Xie, Shuoyoucheng Ma, Zhenhua Wang, Enze Wang, Kai Chen, Xiaobing Sun, Baosheng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16530
Pdf URL: https://arxiv.org/pdf/2509.16530
Copy Paste: [[2509.16530]] AIPsychoBench: Understanding the Psychometric Differences between LLMs and Humans(https://arxiv.org/abs/2509.16530)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. However, the uninterpretability of large-scale neural networks raises concerns about the reliability of LLMs. Studies have attempted to assess the psychometric properties of LLMs by borrowing concepts from human psychology to enhance their interpretability, but they fail to account for the fundamental differences between LLMs and humans. This results in high rejection rates when human scales are reused directly. Furthermore, these scales do not support measuring how the psychological properties of LLMs vary across languages. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLMs. It uses a lightweight role-playing prompt to bypass LLM alignment, improving the average effective response rate from 70.12% to 90.40%. Meanwhile, the average biases are only 3.3% (positive) and 2.1% (negative), which are significantly lower than the biases of 9.8% and 6.9%, respectively, caused by traditional jailbreak prompts. Furthermore, among the total of 112 psychometric subcategories, the score deviations for seven languages compared to English ranged from 5% to 20.2% in 43 subcategories, providing the first comprehensive evidence of the linguistic impact on the psychometrics of LLMs.
摘要：具有数百十亿个参数的大型语言模型（LLM）通过从大量的互联网规模数据中学习来表现出类似人类的智能。但是，大规模神经网络的无法解释性引起了人们对LLM可靠性的担忧。研究试图通过借用人类心理学的概念来增强其可解释性来评估LLM的心理测量特性，但他们无法解决LLMS和人类之间的基本差异。当直接重复人类量表时，这会导致高排斥率。此外，这些量表不支持不同语言中LLM心理特性变化的测量。本文介绍了AipsyChobench，这是一种量身定制的专门基准测试，该基准用于评估LLM的心理特性。它使用轻巧的角色扮演提示来绕过LLM对齐，将平均有效响应率从70.12％提高到90.40％。同时，平均偏见仅为3.3％（正）和2.1％（负），这显着低于传统越狱提示引起的9.8％和6.9％的偏见。此外，在总共112个心理测量子类别中，七种语言的得分差为43个子类别的5％至20.2％，提供了对语言对LLM心理图的语言影响的第一个综合证据。

Title: Challenging the Evaluator: LLM Sycophancy Under User Rebuttal

Authors: Sungwon Kim, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16533
Pdf URL: https://arxiv.org/pdf/2509.16533
Copy Paste: [[2509.16533]] Challenging the Evaluator: LLM Sycophancy Under User Rebuttal(https://arxiv.org/abs/2509.16533)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) often exhibit sycophancy, distorting responses to align with user beliefs, notably by readily agreeing with user counterarguments. Paradoxically, LLMs are increasingly adopted as successful evaluative agents for tasks such as grading and adjudicating claims. This research investigates that tension: why do LLMs show sycophancy when challenged in subsequent conversational turns, yet perform well when evaluating conflicting arguments presented simultaneously? We empirically tested these contrasting scenarios by varying key interaction patterns. We find that state-of-the-art models: (1) are more likely to endorse a user's counterargument when framed as a follow-up from a user, rather than when both responses are presented simultaneously for evaluation; (2) show increased susceptibility to persuasion when the user's rebuttal includes detailed reasoning, even when the conclusion of the reasoning is incorrect; and (3) are more readily swayed by casually phrased feedback than by formal critiques, even when the casual input lacks justification. Our results highlight the risk of relying on LLMs for judgment tasks without accounting for conversational framing.
摘要：大型语言模型（LLMS）经常表现出粘粘性，使对用户信念保持一致的响应扭曲，特别是通过易于同意用户反驳。矛盾的是，LLM越来越多地作为成功的评估代理，以诸如评分和裁决主张等任务。这项研究调查了这种张力：在随后的对话转弯中挑战时，LLMS为什么在评估同时提出冲突的论点时表现出色？我们通过改变关键相互作用模式来经验测试了这些对比方案。我们发现最新的模型：（1）在用户的后续行动中，而不是同时提出两个响应以进行评估时，更有可能认可用户的反驳；（2）表明，即使推理的结论不正确，用户的反驳包括详细的推理，对说服力的敏感性增加了；（3）即使休闲输入缺乏理由，也比正式的批评更容易被随意的反馈摇摆。我们的结果强调了依靠LLM来执行判断任务的风险，而无需考虑会话框架。

Title: InteGround: On the Evaluation of Verification and Retrieval Planning in Integrative Grounding

Authors: Cheng Jiayang, Qianqian Zhuang, Haoran Li, Chunkit Chan, Xin Liu, Lin Qiu, Yangqiu Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16534
Pdf URL: https://arxiv.org/pdf/2509.16534
Copy Paste: [[2509.16534]] InteGround: On the Evaluation of Verification and Retrieval Planning in Integrative Grounding(https://arxiv.org/abs/2509.16534)
Keywords: language model, llm
Abstract: Grounding large language models (LLMs) in external knowledge sources is a promising method for faithful prediction. While existing grounding approaches work well for simple queries, many real-world information needs require synthesizing multiple pieces of evidence. We introduce "integrative grounding" -- the challenge of retrieving and verifying multiple inter-dependent pieces of evidence to support a hypothesis query. To systematically study this problem, we repurpose data from four domains for evaluating integrative grounding capabilities. Our investigation reveals two critical findings: First, in groundedness verification, while LLMs are robust to redundant evidence, they tend to rationalize using internal knowledge when information is incomplete. Second, in examining retrieval planning strategies, we find that undirected planning can degrade performance through noise introduction, while premise abduction emerges as a promising approach due to its logical constraints. Additionally, LLMs' zero-shot self-reflection capabilities consistently improve grounding quality. These insights provide valuable direction for developing more effective integrative grounding systems.
摘要：在外部知识来源中基础大型语言模型（LLM）是忠实预测的有前途的方法。尽管现有的接地方法可用于简单查询，但许多实际信息需要综合多个证据。我们介绍“综合基础” - 检索和验证多个相互依赖的证据以支持假设查询的挑战。为了系统地研究此问题，我们从四个领域中重新利用数据，以评估综合接地能力。我们的调查揭示了两个关键发现：首先，在扎根验证中，尽管LLM对冗余证据是强大的，但在信息不完整时，它们倾向于使用内部知识合理化。其次，在检查检索计划策略时，我们发现无方向的计划可以通过噪声引入来降低绩效，而前提绑架由于其逻辑限制而成为一种有前途的方法。此外，LLMS的零射击自我反射能力始终提高接地质量。这些见解为开发更有效的集成接地系统提供了宝贵的方向。

Title: ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions

Authors: Yue Huang, Zhengzhe Jiang, Xiaonan Luo, Kehan Guo, Haomin Zhuang, Yujun Zhou, Zhengqing Yuan, Xiaoqi Sun, Jules Schleinitz, Yanbo Wang, Shuhao Zhang, Mihir Surve, Nitesh V Chawla, Olaf Wiest, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16543
Pdf URL: https://arxiv.org/pdf/2509.16543
Copy Paste: [[2509.16543]] ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions(https://arxiv.org/abs/2509.16543)
Keywords: language model, llm
Abstract: Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning and distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
摘要：通过化学智能授权大型语言模型（LLM）仍然是一个挑战，这是因为高质量，特定领域的指令 - 响应数据集以及现有的合成数据生成管道的未对准具有固有的层次结构和规则的化学信息结构。为了解决这个问题，我们提出了化学货币，该框架通过两个阶段的过程综合化学基础的指令 - 响应对：任务控制的指令生成和工具吸引的响应构建。化学疗法可以使生成任务的可控多样性和难度水平，并通过工具计划和蒸馏以及基于工具的自我修复机制来确保响应精度。基于以下方式评估化疗的有效性：1）生成的指导数据的高质量，证明了较高的多样性和与化学限制的强烈一致性； 2）更有效地揭示化学中LLM弱点的可靠生成的评估任务； 3）当使用生成的指令数据进行微调时，LLM化学能力的显着改善。因此，我们的工作代表了朝着LLMS中可扩展和可验证的化学智能迈出的关键步骤。

Title: Rethinking the Role of Text Complexity in Language Model Pretraining

Authors: Dan John Velasco, Matthew Theodore Roque
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16551
Pdf URL: https://arxiv.org/pdf/2509.16551
Copy Paste: [[2509.16551]] Rethinking the Role of Text Complexity in Language Model Pretraining(https://arxiv.org/abs/2509.16551)
Keywords: language model
Abstract: Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity is less explored. Text complexity refers to how hard a text is to read, and is typically estimated from surface cues such as sentence length, word choice, and sentence structure. We reduce surface-level complexity--shorter sentences, simpler words, simpler structure--while keeping core text content close to constant, and ask: (1) How does complexity affect language modeling across model sizes? (2) Can useful representations be learned from simpler text alone? (3) How does pretraining text complexity influence downstream language understanding? To answer these questions, we simplify human-written texts using a large language model, then pretrain causal models (28M-500M) from scratch on both original and simplified data, and evaluate them in finetuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity--smaller models degrade far less on simpler texts--while text complexity has little impact on finetuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking.
摘要：已知提高预训练的数据质量和大小可以提高下游性能，但文本复杂性的作用较少。文本复杂性是指文本读取的难度，通常是从句子长度，单词选择和句子结构等表面提示中估算的。我们降低了表面级的复杂性 - 更简单的句子，更简单的单词，更简单的结构 - 同时保持核心文本内容接近常数，并询问：（1）复杂性如何影响跨模型大小的语言建模？（2）可以单独从更简单的文本中学到有用的表示形式吗？（3）预处理文本复杂性如何影响下游语言的理解？为了回答这些问题，我们使用大型语言模型简化了人工写的文本，然后在原始数据和简化数据上从头开始预处理因果模型（28m-500m），并在填充和零射击设置中对其进行评估。我们发现，困惑性对模型容量和文本复杂性之间的相互作用敏感 - 杂项模型在更简单的文本上的降低较少 - 而文本复杂性几乎没有对芬太尼评估的影响，而零摄像的评估表明，简单的文本对语言知识任务有益于更复杂的文本，而更复杂的文本则有利于教会的知识和实体知识。

Title: MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs

Authors: Jun Rong Brian Chong, Yixuan Tang, Anthony K.H. Tung
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2509.16564
Pdf URL: https://arxiv.org/pdf/2509.16564
Copy Paste: [[2509.16564]] MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs(https://arxiv.org/abs/2509.16564)
Keywords: language model, gpt, llm, agent
Abstract: Misinformation evolves as it spreads, shifting in language, framing, and moral emphasis to adapt to new audiences. However, current misinformation detection approaches implicitly assume that misinformation is static. We introduce MPCG, a multi-round, persona-conditioned framework that simulates how claims are iteratively reinterpreted by agents with distinct ideological perspectives. Our approach uses an uncensored large language model (LLM) to generate persona-specific claims across multiple rounds, conditioning each generation on outputs from the previous round, enabling the study of misinformation evolution. We evaluate the generated claims through human and LLM-based annotations, cognitive effort metrics (readability, perplexity), emotion evocation metrics (sentiment analysis, morality), clustering, feasibility, and downstream classification. Results show strong agreement between human and GPT-4o-mini annotations, with higher divergence in fluency judgments. Generated claims require greater cognitive effort than the original claims and consistently reflect persona-aligned emotional and moral framing. Clustering and cosine similarity analyses confirm semantic drift across rounds while preserving topical coherence. Feasibility results show a 77% feasibility rate, confirming suitability for downstream tasks. Classification results reveal that commonly used misinformation detectors experience macro-F1 performance drops of up to 49.7%. The code is available at this https URL
摘要：错误的信息随着语言，框架和道德强调以适应新受众的传播而演变。但是，当前的错误信息检测方法隐式假定错误信息是静态的。我们介绍了MPCG，这是一个多轮角色条件框架，该框架模拟了具有不同意识形态观点的代理商对主张的迭代重新解释。我们的方法使用未经审查的大型语言模型（LLM）在多个回合中生成特定于角色的主张，从而使每一轮的产量从上一轮的输出中调节，从而使错误信息进化的研究。我们通过基于人类和LLM的注释，认知努力指标（可读性，困惑），情感唤起指标（情感分析，道德），聚类，可行性和下游分类来评估产生的主张。结果表明，人类和GPT-4O-MINI注释之间有很强的一致性，流利性判断的差异较高。产生的主张比原始主张需要更大的认知努力，并始终反映与角色一致的情感和道德框架。聚类和余弦相似性分析证实了跨回合的语义漂移，同时保持局部连贯性。可行性结果表明可行性率为77％，确认了对下游任务的适用性。分类结果表明，常用的错误信息检测器经历了宏F1性能下降高达49.7％。该代码可在此HTTPS URL上找到

Title: From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations

Authors: Benlu Wang, Iris Xia, Yifan Zhang, Junda Wang, Feiyun Ouyang, Shuo Han, Arman Cohan, Hong Yu, Zonghai Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16584
Pdf URL: https://arxiv.org/pdf/2509.16584
Copy Paste: [[2509.16584]] From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations(https://arxiv.org/abs/2509.16584)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
摘要：大型语言模型（LLM）在医疗基准上表现出了有希望的表现；但是，他们执行医学计算的能力，这是临床决策的关键方面，仍然没有得到充实的评估。现有的基准通常仅以广泛的数值耐受性评估最终答案，忽略了系统的推理故障，并可能导致严重的临床错误判断。在这项工作中，我们重新审视医学计算评估，更加专注于临床可信度。首先，我们清洁和重组MEDCALC-BENCHEND数据集，并提出一条新的逐步评估管道，该管道独立评估公式选择，实体提取和算术计算。在这个颗粒状框架下，GPT-4O的准确性从62.7％下降到43.6％，揭示了通过先前评估掩盖的错误。其次，我们引入了一个自动错误分析框架，该框架为每个故障模式生成结构化归因。人类评估证实了其与专家判断的一致性，从而实现了可扩展和可解释的诊断。最后，我们提出了一个模块化的代理管道MEDRAC，该管道结合了检索功能的生成和基于Python的代码执行。没有任何微调，MEDRAC将不同LLM的准确性从16.35％提高到53.19％。我们的工作强调了当前基准实践的局限性，并提出了一种更临床忠实的方法。通过实现透明且可转移的推理评估，我们更加接近使基于LLM的系统对现实世界医学应用的信任。
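
The contrast between final-answer scoring and step-by-step scoring described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's pipeline: the `StepResult` record, the function names, and the 5% relative tolerance are all hypothetical, chosen only to show how a case can pass a tolerant final-answer check while failing one of the three steps (formula selection, entity extraction, arithmetic computation).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    """Hypothetical per-case record: one flag per step named in the abstract."""
    formula_ok: bool
    entities_ok: bool
    arithmetic_ok: bool


def final_answer_accuracy(predictions: List[float], references: List[float],
                          rel_tol: float = 0.05) -> float:
    """Share of cases whose final answer is within rel_tol of the reference.

    The 5% tolerance is an assumption made for illustration only.
    """
    if not references:
        return 0.0
    hits = sum(abs(p - r) <= rel_tol * abs(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


def stepwise_accuracy(results: List[StepResult]) -> float:
    """Share of cases in which every step is judged correct."""
    if not results:
        return 0.0
    passed = sum(r.formula_ok and r.entities_ok and r.arithmetic_ok
                 for r in results)
    return passed / len(results)


def first_failure(result: StepResult) -> Optional[str]:
    """Name the earliest failing step, or None if all steps passed."""
    if not result.formula_ok:
        return "formula_selection"
    if not result.entities_ok:
        return "entity_extraction"
    if not result.arithmetic_ok:
        return "arithmetic"
    return None
```

In this sketch a prediction of 10.2 against a reference of 10.0 counts as correct under the tolerant final-answer check, yet the same case is marked wrong by `stepwise_accuracy` if the formula was chosen incorrectly, which is the kind of masked error the abstract reports.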

Title: Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

Authors: Qiongqiong Wang, Hardik Bhupendra Sailor, Tianchi Liu, Wenyu Zhang, Muhammad Huzaifah, Nattadaporn Lertcheva, Shuo Sun, Nancy F. Chen, Jinyang Wu, AiTi Aw
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16589
Pdf URL: https://arxiv.org/pdf/2509.16589
Copy Paste: [[2509.16589]] Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data(https://arxiv.org/abs/2509.16589)
Keywords: llm
Abstract: Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open and closed-source models and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
摘要：在转录和翻译等任务中，最近的语音词表现出了令人印象深刻的表现，但是它们在理解对社会和情商至关重要的语音方面方面仍然有限。我们提出了CP基础，这是一个基准，用于评估语境副语言推理上语言内容与情感和韵律等非语言线索的整合。基准包括两个策划的问题回答（QA）数据集，需要语言和同情心理解。我们从开放式和封闭式模型中评估最新的语音插件，并对不同问题类型进行全面分析。在温度调整下进一步分析了前两个模型，以了解其对这项任务的影响。我们的基准揭示了现有评估中的一个关键差距，并提供了建立更多背景感和情感智能语音能力的LLM的见解。

Title: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature

Authors: Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16591
Pdf URL: https://arxiv.org/pdf/2509.16591
Copy Paste: [[2509.16591]] From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature(https://arxiv.org/abs/2509.16591)
Keywords: llm
Abstract: Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average, which normalizes advantages at the token level, jointly accounting for sequence length as in token-mean loss while preserving non-biased treatment. We then develop Differential Advantage Redistribution, which leverages entropy and importance ratios to modulate rewards, adjusting updates for tokens with clear signals. For clipping loss, we design Asymmetric Adaptive Clipping, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through systematic investigation of the relationship between entropy and training dynamics, we embed token-level treatment into every stage to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code can be found at this https URL.
摘要：强化学习已成为增强LLM中推理的基本技术。但是，现有的算法对所有令牌都采用统一的优化，而忽略了它们在推理过程中的不同角色。为了解决此限制，我们引入了异质自适应策略优化（HAPO），这是一种全面的令牌感知算法，基于令牌熵，动态适应优化。对于推出采样，我们提出了自适应温度采样，该采样可实时调整采样温度，从而促进高渗透令牌的探索，同时在低透镜的代币中保持连贯性。为了计算优势，我们引入了令牌级别的平均水平，该平均水平在令牌水平上的优势归一化，共同考虑了序列长度，如令牌均值损失，同时保留了无偏见的治疗。然后，我们开发出差异优势的重新分配，以利用熵和重要性比率调节带有明确信号的令牌调整调整的更新。为了削减损失，我们设计了不对称的自适应剪辑，从而使嘈杂的低渗透令牌的攻击性概率降低，同时可以探索高渗透令牌。通过熵和训练动力学之间的系统研究，我们将令牌级处理嵌入到每个阶段，以实现细粒度的控制。广泛的实验表明，HAPO始终在多个模型尺度上胜过DAPO。我们的代码可以在此HTTPS URL中找到。
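
The idea of adjusting sampling temperature by token entropy can be sketched in a few lines of Python. This is an illustrative guess at the general mechanism, not HAPO's actual rule: the linear interpolation, the bounds `t_low` and `t_high`, and the function names are all assumptions introduced here.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_temperature(logits: List[float],
                         t_low: float = 0.7,
                         t_high: float = 1.3) -> float:
    """Map the next-token entropy to a temperature in [t_low, t_high].

    Assumed rule: interpolate linearly on entropy normalized by its maximum,
    log(vocabulary size). High entropy gives a higher temperature (more
    exploration); low entropy gives a lower one (more coherence).
    """
    max_entropy = math.log(len(logits))
    ratio = entropy(softmax(logits)) / max_entropy if max_entropy > 0 else 0.0
    return t_low + (t_high - t_low) * ratio
```

With these assumed bounds, a uniform distribution maps to the upper temperature and a sharply peaked one stays close to the lower temperature, which matches the qualitative behaviour the abstract describes.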

Title: Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

Authors: Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16596
Pdf URL: https://arxiv.org/pdf/2509.16596
Copy Paste: [[2509.16596]] Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels(https://arxiv.org/abs/2509.16596)
Keywords: language model, llm
Abstract: Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model's knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.
摘要：大型语言模型（LLMS）在训练前获得了大量的世界知识，这是由训练后技术（例如监督微调（SFT））进一步塑造的。但是，SFT对模型知识的影响仍然没有被忽视，从而限制了我们在微调模型中控制知识变化行为的能力。为了解决这一差距，我们评估了来自Llama-2和Llama-3家族的五个LLM的封闭式问题答案（CBQA）。令人惊讶的是，在1,920个样品上进行了微调的模型比仅在240个样本上进行微调的模型差14％。此外，在微调数据中改变知识掌握水平会导致性能波动超过12％。为了研究这些效果，我们分析了令牌和参数水平的模型行为。我们的分析表明，在SFT期间，多达90％的参数更新不会导致知识增强。恢复这些更新可以改善CBQA任务的性能，具体取决于微调数据的特征。这些见解为制定微调策略提供了实用的指导，从而更有效地增强了模型知识。

Title: MCP: A Control-Theoretic Orchestration Framework for Synergistic Efficiency and Interpretability in Multimodal Large Language Models

Authors: Luyan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16597
Pdf URL: https://arxiv.org/pdf/2509.16597
Copy Paste: [[2509.16597]] MCP: A Control-Theoretic Orchestration Framework for Synergistic Efficiency and Interpretability in Multimodal Large Language Models(https://arxiv.org/abs/2509.16597)
Keywords: language model
Abstract: Aiming at the problems of computational inefficiency and insufficient interpretability faced by large models in complex tasks such as multi-round reasoning and multi-modal collaboration, this study proposes a three-layer collaboration framework based on model-controller-task adaptation (MCP). By decoupling large model functions into reasoning, generation and retrieval modules, and combining reinforcement learning-driven dynamic routing algorithms and task adaptation mechanisms, the systematic integration of control theory and large model dynamic reasoning is achieved for the first time. Experiments show that the MCP framework improves the performance of cross-modal benchmarking tasks, such as GLUE, COCO, ScienceQA, etc., by 15-30% compared with the baseline model, improves the reasoning efficiency by 40%, and generates the interpretable intermediate results through the Presenter layer, obtaining 90% of the manual interpretability scores, which provides a brand-new technological path to solve the bottleneck of the practical application of the large model.
摘要：针对大型模型在多轮推理、多模态协作等复杂任务中面临的计算效率低和可解释性不足的问题，本研究提出了一个基于模型-控制器-任务适配（MCP）的三层协作框架。通过将大型模型功能解耦为推理、生成和检索模块，并结合强化学习驱动的动态路由算法与任务适配机制，首次实现了控制理论与大型模型动态推理的系统整合。实验表明，与基线模型相比，MCP框架将GLUE、COCO、ScienceQA等跨模态基准任务的性能提高了15-30％，将推理效率提高了40％，并通过Presenter层生成可解释的中间结果，获得了90％的人工可解释性评分，为解决大型模型实际应用的瓶颈提供了全新的技术路径。

Title: PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality

Authors: Byeongho Yu, Changhun Lee, Jungyu Jin, Eunhyeok Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16598
Pdf URL: https://arxiv.org/pdf/2509.16598
Copy Paste: [[2509.16598]] PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality(https://arxiv.org/abs/2509.16598)
Keywords: language model, llm, hallucination
Abstract: To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.
摘要：为了减轻大语言模型中的幻觉问题，Dola利用与先验的对比度相同的模型来利用早期出口逻辑。但是，我们发现这些早期的出口逻辑往往是平坦的，幅度较低，并且无法反映有意义的对比。为了解决这个问题，我们提出了PruneCD，这是一种新型的对比解码方法，该方法通过修剪而不是早期出口来构建业余模型。这种设计导致更具信息和良好的逻辑，从而实现了更有效的对比解码。通过定性和定量分析，我们证明了Prunecd始终通过最小的推理开销来改善事实，从而提供了一种强大而实用的方法来减轻LLMS的幻觉。
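
For readers unfamiliar with contrastive decoding, the following Python sketch shows the generic scoring rule that methods of this family build on: the expert's log-probability minus the amateur's, restricted to tokens the expert finds plausible. The abstract does not state PruneCD's exact formula, so this is the commonly used form rather than the paper's method; the amateur logits are simply passed in (in PruneCD they would come from a layer-pruned copy of the model), and `alpha` and the function names are assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(expert_logits: List[float],
                       amateur_logits: List[float],
                       alpha: float = 0.1) -> List[float]:
    """Score each token by expert log-prob minus amateur log-prob.

    Tokens whose expert probability is below alpha times the expert's top
    probability are masked to -inf (a plausibility constraint), so the
    contrast cannot promote tokens the expert itself considers unlikely.
    """
    expert_lp = log_softmax(expert_logits)
    amateur_lp = log_softmax(amateur_logits)
    cutoff = max(expert_lp) + math.log(alpha)
    return [e - a if e >= cutoff else float("-inf")
            for e, a in zip(expert_lp, amateur_lp)]


def pick_token(expert_logits: List[float],
               amateur_logits: List[float],
               alpha: float = 0.1) -> int:
    """Greedy choice: index of the highest contrastive score."""
    scores = contrastive_scores(expert_logits, amateur_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)
```

The sketch also hints at why flat amateur logits are a problem: if the amateur assigns nearly equal log-probabilities to every token, subtracting them shifts all scores by about the same amount and the contrast carries little information.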

Title: LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts

Authors: Junhao Chen, Jingbo Sun, Xiang Li, Haidong Xin, Yuhao Xue, Yibin Xu, Hao Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16610
Pdf URL: https://arxiv.org/pdf/2509.16610
Copy Paste: [[2509.16610]] LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts(https://arxiv.org/abs/2509.16610)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory-based evaluation platform that measures LLMs' decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs' strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at this https URL.
摘要：随着大型语言模型（LLM）跨越各种任务的发展，对单个指标以外的全面评估的需求变得越来越重要。为了充分评估LLM智能，检查其互动动力和战略行为至关重要。我们提出了LLMSpark，这是一个基于游戏理论的评估平台，该平台在经典游戏理论设置中测量了LLMS的决策策略和社交行为，提供了一个多机构的环境来探索战略深度。我们的系统使用排行榜排名和评分机制跨评估了15个领先的LLM（商业和开源）。较高的分数反映了更强的推理和战略能力，揭示了模型之间不同的行为模式和绩效差异。这项工作介绍了评估LLMS战略智能，丰富现有基准并扩大其在互动性游戏理论方案中的评估的新观点。基准和排名在此HTTPS URL上公开可用。

Title: Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

Authors: Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16660
Pdf URL: https://arxiv.org/pdf/2509.16660
Copy Paste: [[2509.16660]] Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation(https://arxiv.org/abs/2509.16660)
Keywords: language model
Abstract: Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but these methods suffer from instability, context dependence, and often compromise the model's core language abilities. To address these shortcomings, we investigate three key questions: the stability of neuron-level toxicity indicators, the advantages of structural (layer-wise) representations, and the interpretability of mechanisms driving toxic generation. Through extensive experiments on Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. Moreover, we observe conceptual limitations in prior works that conflate toxicity detection experts and generation experts within neuron-based interventions. To mitigate this, we propose a novel principled intervention technique, EigenShift, based on eigen-decomposition of the language model's final output layer. This method selectively targets generation-aligned components, enabling precise toxicity suppression without impairing linguistic competence. Our method requires no additional training or fine-tuning, incurs minimal computational cost, and is grounded in rigorous theoretical analysis.
摘要：大型语言模型在各种任务中表现出了令人印象深刻的流利性，但是它们产生有毒内容的趋势仍然是AI安全和公共信任的关键挑战。现有的毒性缓解方法主要操纵单个神经元激活，但是这些方法遭受了不稳定性，上下文依赖性，并且经常损害模型的核心语言能力。为了解决这些缺点，我们研究了三个关键问题：神经元级毒性指标的稳定性，结构（层）表示的优势以及推动有毒产生的机制的解释性。通过对拼图和有毒数据集进行的大量实验，我们表明，汇总层的特征比单个神经元提供了更强的信号。此外，我们观察到概念性的局限性在先前的作品中，将毒性检测专家和基于神经元的干预措施中的发电专家混为一谈。为了减轻这种情况，我们提出了一种基于语言模型最终输出层的特征分类的新颖的原则干预技术，即欧文·升。该方法有选择地靶向与生成一致的成分，从而在不损害语言能力的情况下可以精确的毒性抑制。我们的方法不需要额外的培训或微调，会产生最低的计算成本，并且以严格的理论分析为基础。

Title: Robust Native Language Identification through Agentic Decomposition

Authors: Ahmet Yavuz Uluslu, Tannon Kew, Tilia Ellendorff, Gerold Schneider, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16666
Pdf URL: https://arxiv.org/pdf/2509.16666
Copy Paste: [[2509.16666]] Robust Native Language Identification through Agentic Decomposition(https://arxiv.org/abs/2509.16666)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) often achieve high performance in native language identification (NLI) benchmarks by leveraging superficial contextual clues such as names, locations, and cultural stereotypes, rather than the underlying linguistic patterns indicative of native language (L1) influence. To improve robustness, previous work has instructed LLMs to disregard such clues. In this work, we demonstrate that such a strategy is unreliable and model predictions can be easily altered by misleading hints. To address this problem, we introduce an agentic NLI pipeline inspired by forensic linguistics, where specialized agents accumulate and categorize diverse linguistic evidence before an independent final overall assessment. In this final assessment, a goal-aware coordinating agent synthesizes all evidence to make the NLI prediction. On two benchmark datasets, our approach significantly enhances NLI robustness against misleading contextual clues and performance consistency compared to standard prompting methods.
摘要：大型语言模型（LLMS）通常通过利用诸如名称，位置和文化刻板印象之类的表面上下文线索，而不是指示母语（L1）影响的基本语言模式来实现本地语言识别（NLI）基准高表现。为了提高鲁棒性，以前的工作指示LLM无视此类线索。在这项工作中，我们证明了这种策略是不可靠的，并且模型预测可以通过误导性提示轻松改变。为了解决这个问题，我们引入了一个受法医语言学启发的代理NLI管道，在独立的最终总体评估之前，专门的代理人会积累并分类多种语言证据。在这项最终评估中，目标感知协调代理综合了所有证据以做出NLI预测。在两个基准数据集上，与标准提示方法相比，我们的方法可显着提高NLI的鲁棒性，以抗误导性上下文线索和性能一致性。

Title: Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

Authors: Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, Lihua Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16679
Pdf URL: https://arxiv.org/pdf/2509.16679
Copy Paste: [[2509.16679]] Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle(https://arxiv.org/abs/2509.16679)
Keywords: language model, llm
Abstract: In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
摘要：近年来，以增强学习为中心（RL）的培训方法显着提高了大语言模型（LLMS）的推理和对齐性能，尤其是在理解人类意图，按照用户指示以及增强推论力量的方面。尽管现有的调查提供了RL增强LLM的概述，但它们的范围通常受到限制，但未能提供有关RL在LLMS的整个生命周期中运作的全面摘要。我们系统地回顾了理论和实践进步，其中RL赋予了LLM的能力，尤其是具有可验证的奖励（RLVR）的加强学习。首先，我们简要介绍了RL的基本理论。其次，我们彻底详细介绍了在LLM生命周期的各个阶段的RL的应用策略，包括预训练，对齐微调和加强推理。特别是，我们强调，在加强推理阶段中的RL方法是将模型推理推进其限制的关键驱动力。接下来，我们整理了当前用于RL微调的现有数据集和评估基准，涵盖了人类通知的数据集，AI辅助偏好数据和程序验证风格的语料库。随后，我们回顾了可用的主流开源工具和培训框架，为随后的研究提供了明确的实际参考。最后，我们分析了RL增强LLM的未来挑战和趋势。这项调查旨在向研究人员和从业人员展示RL和LLMS交集的最新发展和前沿趋势，目的是促进LLM的演变，这些LLM的演变更聪明，可推广和安全。

Title: EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

Authors: Zhengge Cai, Haowen Hou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16686
Pdf URL: https://arxiv.org/pdf/2509.16686
Copy Paste: [[2509.16686]] EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs(https://arxiv.org/abs/2509.16686)
Keywords: language model, llm
Abstract: Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose Embedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
摘要：降低键值（KV）缓存大小是朝着实现大语言模型（LLMS）有效推断的关键步骤，尤其是在潜伏期和内存约束下。虽然多头注意力（MHA）具有强大的代表力，但它会引起大量的内存开销。关于多头潜在注意力（MLA）的最新工作通过将KV表示为共享的潜在空间来减轻这种情况，从而在性能和缓存效率之间取得了更好的权衡。尽管MLA已经实现了显着的KV缓存降低，但进一步压缩的范围仍然有限，而不会损失绩效。在本文中，我们提出了\ textbf {嵌入门控的多头潜在注意（EG-MLA）}，这是MLA的新型扩展，该扩展进一步降低了KV缓存大小，同时增强了表示表现力。 EG-MLA引入了在潜在空间中应用的令牌特异性嵌入门控机制，从而可以对压缩的KV向量进行细粒度调制，并具有最小的其他计算。与MHA相比，EG-MLA的KV缓存大小降低了91.6％，并且性能降解可降低。相对于MLA，EG-MLA始终提高各种推理基准的任务准确性，同时可实现多达59.9％的额外存储器节省。我们的理论分析强调了嵌入门控如何引起隐式高阶相互作用，而经验评估表明了模型量表和压缩方案的强大概括。值得注意的是，我们成功地将EG-MLA扩展到超过10亿个参数，证明了其对大规模LLM部署的实际生存能力。这些结果将EG-MLA确定为一种记忆和计算有效的注意机制，可在现代LLM中进行可扩展的高性能推断。
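
To make the cache-size comparison in this abstract concrete, the following is a minimal sketch of the arithmetic only, not the authors' implementation. It assumes standard MHA caches one key and one value vector per head per token, while a latent-attention scheme caches a single compressed vector per token. The dimensions (`n_heads`, `head_dim`, `latent_dim`) and the function names are hypothetical and are not taken from the paper.

```python
def mha_cache_elems(n_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer for standard MHA (keys + values)."""
    return 2 * n_heads * head_dim


def latent_cache_elems(latent_dim: int) -> int:
    """Cached elements per token per layer when only one latent vector is stored."""
    return latent_dim


def reduction_percent(baseline: int, compressed: int) -> float:
    """Relative cache-size reduction of `compressed` versus `baseline`, in percent."""
    return 100.0 * (1.0 - compressed / baseline)


# Hypothetical configuration, chosen only for illustration.
n_heads, head_dim, latent_dim = 16, 64, 256

baseline = mha_cache_elems(n_heads, head_dim)   # 2 * 16 * 64 = 2048
compressed = latent_cache_elems(latent_dim)     # 256
saving = reduction_percent(baseline, compressed)
print(f"MHA: {baseline} elems/token/layer, latent: {compressed}, saving: {saving:.1f}%")
```

With these illustrative numbers the saving is 87.5%. The figures reported in the abstract depend on the authors' actual dimensions, which are not given here.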

Title: Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models

Authors: Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16696
Pdf URL: https://arxiv.org/pdf/2509.16696
Copy Paste: [[2509.16696]] Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models(https://arxiv.org/abs/2509.16696)
Keywords: language model, llm
Abstract: Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on uncertainty estimation in Large Language Models (LLMs). Our experiments show that Contrastive Search, which mitigates repetition, yields better uncertainty estimates on average across a range of preference-aligned LLMs. In contrast, the benefits of these strategies sometimes diverge when the model is only post-trained with supervised fine-tuning, i.e. without explicit alignment.
摘要：解码策略操纵语言模型输出的概率分布，因此可能会影响发电质量及其不确定性。在这项研究中，我们研究了解码策略对大语言模型（LLMS）中不确定性估计的影响。我们的实验表明，减轻重复的对比搜索可以在一系列优先偏好的LLMS中平均得出更好的不确定性估计。相比之下，当模型仅通过受监督的微调进行训练时，即没有明确的一致性，这些策略的好处有时会有所不同。
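
As a small illustration of how a decoding setting reshapes the output distribution and hence its uncertainty, the sketch below applies temperature scaling to a toy set of logits and measures Shannon entropy. Temperature scaling is used here only because it is simple to show; it is not the Contrastive Search method studied in the paper, and the logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits after dividing by a temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"temperature={t}: entropy={entropy(probs):.3f} nats")
```

Lower temperatures sharpen the distribution and reduce entropy, while higher temperatures flatten it. An uncertainty estimate computed from token probabilities therefore depends on the decoding configuration as well as on the model.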

Title: OPEN-THEATRE: An Open-Source Toolkit for LLM-based Interactive Drama

Authors: Tianyang Xu, Hongqiu Wu, Weiqi Wu, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16713
Pdf URL: https://arxiv.org/pdf/2509.16713
Copy Paste: [[2509.16713]] OPEN-THEATRE: An Open-Source Toolkit for LLM-based Interactive Drama(https://arxiv.org/abs/2509.16713)
Keywords: llm, agent
Abstract: LLM-based Interactive Drama introduces a novel dialogue scenario in which the player immerses themselves in a character and engages in a dramatic story by interacting with LLM agents. Although this emerging area holds significant promise, it remains largely underexplored due to the lack of a well-designed playground in which to develop a complete drama. This poses a significant barrier for researchers seeking to replicate, extend, and study such systems. Hence, we present Open-Theatre, the first open-source toolkit for experiencing and customizing LLM-based interactive drama. It refines prior work with an efficient multi-agent architecture and a hierarchical retrieval-based memory system, designed to enhance narrative coherence and realistic long-term behavior in complex interactions. In addition, we provide a highly configurable pipeline, making it easy for researchers to develop and optimize new approaches.
摘要：基于LLM的互动戏剧介绍了一种新颖的对话场景，其中玩家会沉浸在角色中，并通过与LLM代理商进行互动，从而参与戏剧性的故事。尽管这个新兴地区具有巨大的希望，但由于缺乏精心设计的游乐场来发展完整的戏剧性，它仍然在很大程度上没有被忽视。这为研究人员复制，扩展和研究此类系统带来了重大障碍。因此，我们提出了开放式剧院，这是第一个用于体验和定制基于LLM的交互式戏剧的开源工具包。它通过有效的多代理体系结构和基于层次检索的内存系统来完善先前的工作，旨在增强叙事连贯性和复杂相互作用中的现实长期行为。此外，我们还提供一条高度可配置的管道，使研究人员可以轻松开发和优化新方法。

Title: Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling

Authors: Haoran Li, Zhiming Su, Junyan Yao, Enwei Zhang, Yang Ji, Yan Chen, Kan Zhou, Chao Feng, Jiao Ran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16717
Pdf URL: https://arxiv.org/pdf/2509.16717
Copy Paste: [[2509.16717]] Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling(https://arxiv.org/abs/2509.16717)
Keywords: prompt
Abstract: Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training data set. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms models trained on data generated by prompting or vanilla supervised fine-tuning (SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search-enhanced recommendation pipeline of Douyin's dual-column scenario, online A/B testing shows that the proposed model increased click-through rate (CTR) by 1.45%, raised the Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.
摘要：合成数据在嵌入模型中被广泛采用，以确保跨越难度，长度和语言等维度的培训数据分布的多样性。但是，现有的基于及时的综合方法难以捕获特定于域的数据分布，尤其是在数据筛选域，并且常常忽略了细粒度的相关性多样性。在本文中，我们提出了一个中文简短的视频数据集，其中包含4级相关性注释，从而填充了关键的资源空白。此外，我们提出了一个半监督的合成数据管道，其中两个经过协作训练的模型生成具有可控相关性标签的域自适应短视频数据。我们的方法通过将样本综合为代表性不足的中间相关性标签来增强相关性级别的多样性，从而导致更平衡且语义丰富的培训数据集。广泛的离线实验表明，对我们的合成数据训练的嵌入模型优于使用基于提示或香草监督微调（SFT）生成的数据的数据。此外，我们证明，将更多样化的细粒相关性水平纳入训练数据可以增强模型对微妙的语义区别的敏感性，从而突出了细粒度相关性监督在嵌入学习中的价值。在搜索增强的建议管道中，Douyin的双柱方案（通过在线A/B测试）中，提议的模型将点击率（CTR）提高了1.45％，将强相关比率（SRR）的比例提高了4.9％，并将图像用户渗透率（IUPR）提高了0.1054％。

Title: Time to Revist Exact Match

Authors: Auss Abbood, Zaiqiao Meng, Nigel Collier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16720
Pdf URL: https://arxiv.org/pdf/2509.16720
Copy Paste: [[2509.16720]] Time to Revist Exact Match(https://arxiv.org/abs/2509.16720)
Keywords: language model
Abstract: Temporal question answering is an established method for evaluating temporal reasoning in large language models. Expected answers are often numeric (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), which cannot distinguish small from large errors. In this investigative work, we frame temporal question answering as a numerical estimation task to assess the shortcomings of EM. We introduce TempAnswerQA, a benchmark distilled from Test of Time and TempTabQA, where all questions require a numerical, temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled. Models with low EM still have low sMAPE (both ~20%), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground truth data with MASE reshuffles model rankings compared to EM, revealing gaps in models' understanding of temporal domain knowledge, especially when trained with synthetic data. Lastly, the models' most frequent error is to deviate by only ±1 from the ground truth. sMAPE and MASE, unlike EM, adequately weight these errors. Our findings underscore the need for specialised metrics for temporal QA tasks. Code and data are available on this https URL.
摘要：时间问题回答是一种评估大语言模型中时间推理的既定方法。预期的答案通常是数字（例如，日期或持续时间），但是模型响应的评估就像常规文本一样具有匹配（EM），无法将小错误与大错误区分开。在这项调查工作中，我们将回答作为数字估计任务的时间问题构架，以评估EM的缺点。我们介绍了tempanswerqa，这是一种从时间和Temptabqa测试中提取的基准，其中所有问题都需要一个数值的时间答案，使我们能够评估EM以外的模型。我们使用预测指标对称的平均绝对百分比误差（SMAPE）和平均绝对缩放误差（MASE）。使用Smape，我们发现错误大小和EM被解耦。较低的EM模型仍然具有低Smape（均约为20％），尽管EM高，但有些模型仍具有高Smape。与EM相比，通过Mase Rehuffles模型排名的地面真相数据的偏差来扩展错误，这揭示了模型对时间域知识的理解中的差距，尤其是在接受合成数据培训时。最后，模型最常见的错误是从地面真相中仅偏离$ \ pm1 $。与EM不同，Smape和Mase会充分加重这些错误。我们的发现强调了针对时间质量检查任务的专业指标的需求。代码和数据可在此HTTPS URL上找到。
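
The decoupling of EM from error size described in this abstract can be shown with a short sketch. It uses the common sMAPE definition (absolute error divided by the mean of the absolute prediction and absolute target, averaged, in percent). For MASE it assumes the scale is the mean absolute deviation of the gold answers from their mean, which is one reading of "the deviation of the ground truth data" and may differ from the paper's exact formulation. The answers and predictions are invented for illustration.

```python
def exact_match(preds, golds):
    """Fraction of predictions that equal the gold answer exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def smape(preds, golds):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    total = 0.0
    for p, g in zip(preds, golds):
        denom = (abs(p) + abs(g)) / 2
        total += 0.0 if denom == 0 else abs(p - g) / denom
    return 100.0 * total / len(golds)


def mase(preds, golds):
    """MAE scaled by the mean absolute deviation of the gold answers (assumed scale)."""
    n = len(golds)
    mae = sum(abs(p - g) for p, g in zip(preds, golds)) / n
    mean_gold = sum(golds) / n
    scale = sum(abs(g - mean_gold) for g in golds) / n
    return mae / scale


# Hypothetical gold answers: two years and two durations.
golds = [1990, 2005, 12, 30]
model_a = [1991, 2006, 13, 31]    # always off by one, never an exact match
model_b = [1990, 2005, 12, 300]   # mostly exact, one very large error

for name, preds in (("A", model_a), ("B", model_b)):
    print(name, exact_match(preds, golds),
          round(smape(preds, golds), 2), round(mase(preds, golds), 4))
```

Model A scores zero on EM yet has a small sMAPE, while model B has a higher EM but a much larger sMAPE because of its single large miss. This is the pattern the abstract describes: EM alone does not reflect how far off the numeric answers are.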

Title: A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse

Authors: Xiaohan Ding, Kaike Ping, Buse Çarık, Eugenia Rho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16722
Pdf URL: https://arxiv.org/pdf/2509.16722
Copy Paste: [[2509.16722]] A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse(https://arxiv.org/abs/2509.16722)
Keywords: gpt
Abstract: Understanding causal language in informal discourse is a core yet underexplored challenge in NLP. Existing datasets largely focus on explicit causality in structured text, providing limited support for detecting implicit causal expressions, particularly those found in informal, user-generated social media posts. We introduce CausalTalk, a multi-level dataset of five years of Reddit posts (2020-2024) discussing public health related to the COVID-19 pandemic, among which 10120 posts are annotated across four causal tasks: (1) binary causal classification, (2) explicit vs. implicit causality, (3) cause-effect span extraction, and (4) causal gist generation. Annotations comprise both gold-standard labels created by domain experts and silver-standard labels generated by GPT-4o and verified by human annotators. CausalTalk bridges fine-grained causal detection and gist-based reasoning over informal text. It enables benchmarking across both discriminative and generative models, and provides a rich resource for studying causal reasoning in social media contexts.
摘要：在非正式话语中了解因果语言是NLP中的核心但毫无争议的挑战。现有的数据集主要集中在结构化文本中的明确因果关系上，为检测隐式因果表达提供了有限的支持，尤其是在非正式，用户生成的社交媒体帖子中发现的因果关系。 We introduce CausalTalk, a multi-level dataset of five years of Reddit posts (2020-2024) discussing public health related to the COVID-19 pandemic, among which 10120 posts are annotated across four causal tasks: (1) binary causal classification, (2) explicit vs. implicit causality, (3) cause-effect span extraction, and (4) causal gist generation.注释包括由域专家创建的黄金标准标签和由GPT-4O生成的银标签，并由人类注释者进行了验证。 Causaltalk桥梁对非正式文本的细粒性因果检测和基于要点的推理。它可以在歧视和生成模型中进行基准测试，并为在社交媒体环境中研究因果推理提供了丰富的资源。

Title: The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology

Authors: Fagun Patel, Duc Q. Nguyen, Sang T. Truong, Jody Vaynshtok, Sanmi Koyejo, Nick Haber
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.16765
Pdf URL: https://arxiv.org/pdf/2509.16765
Copy Paste: [[2509.16765]] The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology(https://arxiv.org/abs/2509.16765)
Keywords: language model, prompt, chain-of-thought
Abstract: According to the U.S. National Institutes of Health, more than 3.4 million children experience speech disorders that require clinical intervention. The number of speech-language pathologists (SLPs) is roughly 20 times smaller than the number of affected children, highlighting a significant gap in children's care and a pressing need for technological support that improves the productivity of SLPs. State-of-the-art multimodal language models (MLMs) show promise for supporting SLPs, but their use remains underexplored, largely due to a limited understanding of their performance in high-stakes clinical settings. To address this gap, we collaborate with domain experts to develop a taxonomy of real-world use cases of MLMs in speech-language pathology. Building on this taxonomy, we introduce the first comprehensive benchmark for evaluating MLMs across five core use cases, each containing 1,000 manually annotated data points. This benchmark includes robustness and sensitivity tests under various settings, including background noise, speaker gender, and accent. Our evaluation of 15 state-of-the-art MLMs reveals that no single model consistently outperforms others across all tasks. Notably, we find systematic disparities, with models performing better on male speakers, and observe that chain-of-thought prompting can degrade performance on classification tasks with large label spaces and narrow decision boundaries. Furthermore, we study fine-tuning MLMs on domain-specific data, achieving improvements of over 30% compared to base models. These findings highlight both the potential and limitations of current MLMs for speech-language pathology applications, underscoring the need for further research and targeted development.
摘要：根据美国国立卫生研究院的说法，超过340万儿童经历了需要临床干预的言语障碍。言语病理学家（SLP）的数量比受影响儿童的人数少20倍，突出了儿童护理的显着差距，并迫切需要对技术支持的需求，从而提高了SLP的生产力。最先进的多模式模型（MLMS）显示出支持SLP的希望，但由于对他们在高风险临床环境中的表现有限的了解，它们的使用仍未得到充分兴奋。为了解决这一差距，我们与领域专家合作，在语音语言病理学中开发了MLMS现实世界中用例的分类法。在此分类法的基础上，我们介绍了第一个综合基准，用于评估五个核心用例中的传销，每个基准包含1,000个手动注释的数据点。该基准包括在各种设置下的鲁棒性和灵敏度测试，包括背景噪声，扬声器性别和口音。我们对15个最先进的MLMS的评估表明，在所有任务中，没有任何单一模型始终超过其他模型。值得注意的是，我们发现系统的差异，模型在男性扬声器上的表现更好，并观察到，经过三通的提示可以在具有较大标签空间和狭窄决策边界的分类任务上降低性能。此外，我们研究了针对域特异性数据的微调MLM，与基本模型相比，提高了30％以上。这些发现突出了当前的MLM在语音语言病理应用中的潜在和局限性，从而强调了进一步研究和有针对性发展的需求。

Title: Domain-Adaptive Pre-Training for Arabic Aspect-Based Sentiment Analysis: A Comparative Study of Domain Adaptation and Fine-Tuning Strategies

Authors: Salha Alyami, Amani Jamal, Areej Alhothali
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16788
Pdf URL: https://arxiv.org/pdf/2509.16788
Copy Paste: [[2509.16788]] Domain-Adaptive Pre-Training for Arabic Aspect-Based Sentiment Analysis: A Comparative Study of Domain Adaptation and Fine-Tuning Strategies(https://arxiv.org/abs/2509.16788)
Keywords: language model
Abstract: Aspect-based sentiment analysis (ABSA) in natural language processing enables organizations to understand customer opinions on specific product aspects. While deep learning models are widely used for English ABSA, their application in Arabic is limited due to the scarcity of labeled data. Researchers have attempted to tackle this issue by using pre-trained contextualized language models such as BERT. However, these models are often based on fact-based data, which can introduce bias in domain-specific tasks like ABSA. To our knowledge, no studies have applied adaptive pre-training with Arabic contextualized models for ABSA. This research proposes a novel approach using domain-adaptive pre-training for aspect-sentiment classification (ASC) and opinion target expression (OTE) extraction. We examine fine-tuning strategies - feature extraction, full fine-tuning, and adapter-based methods - to enhance performance and efficiency, utilizing multiple adaptation corpora and contextualized models. Our results show that in-domain adaptive pre-training yields modest improvements. Adapter-based fine-tuning is a computationally efficient method that achieves competitive results. However, error analyses reveal issues with model predictions and dataset labeling. In ASC, common problems include incorrect sentiment labeling, misinterpretation of contrastive markers, positivity bias for early terms, and challenges with conflicting opinions and subword tokenization. For OTE, issues involve mislabeling targets, confusion over syntactic roles, difficulty with multi-word expressions, and reliance on shallow heuristics. These findings underscore the need for syntax- and semantics-aware models, such as graph convolutional networks, to more effectively capture long-distance relations and complex aspect-based opinion alignments.
摘要：自然语言处理中的基于方面的情感分析（ABSA）使组织能够了解客户对特定产品方面的看法。尽管深度学习模型被广泛用于英语ABSA，但由于标记的数据缺乏，它们在阿拉伯语中的应用受到限制。研究人员试图通过使用预先训练的上下文化语言模型（例如BERT）来解决此问题。但是，这些模型通常基于基于事实的数据，该数据可以在特定于ABSA（例如ABSA）中引入偏见。据我们所知，没有研究对ABSA进行自适应预训练。这项研究提出了一种新的方法，该方法使用域自适应预训练进行方面验证分类（ASC）和意见目标表达（OTE）提取。我们利用多个适应性语料库和上下文化的模型来研究微调策略 - 提取功能，完整的微调和基于适配器的方法 - 以提高性能和效率。我们的结果表明，内域自适应预训练会产生适度的改进。基于适配器的微调是一种计算有效的方法，可实现竞争结果。但是，错误分析揭示了模型预测和数据集标签的问题。在ASC中，常见问题包括不正确的情感标签，对对比标记的误解，早期术语的积极性偏见以及与意见相互矛盾的挑战和子单词令牌化。对于OTE而言，问题涉及标签的目标错误，对句法角色的困惑，多字表达的困难以及对浅启发式方法的依赖。这些发现强调了对语法和语义感知模型（例如图形卷积网络）的需求，以更有效地捕获长距离关系和复杂的基于方面的意见一致性。

Title: KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis

Authors: Kozhin muhealddin Awlla, Hadi Veisi, Abdulhady Abas Abdullah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16804
Pdf URL: https://arxiv.org/pdf/2509.16804
Copy Paste: [[2509.16804]] KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis(https://arxiv.org/abs/2509.16804)
Keywords: language model
Abstract: This paper advances the study of sentiment analysis for the Central Kurdish language by integrating Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques. Kurdish is a low-resource language with a high level of linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but the emergence of new language models, specifically BERT, offers hope for improvement. BERT's stronger word embedding capabilities benefit this study by helping to capture the nuanced semantics and contextual intricacies of the Kurdish language, thus setting a new benchmark for sentiment analysis in low-resource languages.
摘要：本文通过将来自变压器（BERT）的双向编码器表示纳入自然语言处理技术的双向编码器来增强了库尔德语言中心语言的情感分析。库尔德语是一种低资源的语言，具有最低的计算资源的语言多样性，使情感分析变得有些挑战。此前，这是使用传统单词嵌入模型（例如Word2Vec）完成的，但是随着新语言模型的出现，特别是Bert，人们对改进有希望。伯特（Bert）的嵌入能力更好，这是在这项研究中借助的，有助于捕获细微的语义库以及所研究的语言的上下文复杂性，库尔德语，从而为低资源语言的情感分析树立了新的基准。

Title: Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition-Informed Approach to Quantifying Identity Fusion from Text

Authors: Devin R. Wright, Jisun An, Yong-Yeol Ahn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16813
Pdf URL: https://arxiv.org/pdf/2509.16813
Copy Paste: [[2509.16813]] Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition-Informed Approach to Quantifying Identity Fusion from Text(https://arxiv.org/abs/2509.16813)
Keywords: language model, llm
Abstract: Quantifying identity fusion -- the psychological merging of self with another entity or abstract target (e.g., a religious group, political party, ideology, value, brand, belief, etc.) -- is vital for understanding a wide range of group-based human behaviors. We introduce the Cognitive Linguistic Identity Fusion Score (CLIFS), a novel metric that integrates cognitive linguistics with large language models (LLMs), which builds on implicit metaphor detection. Unlike traditional pictorial and verbal scales, which require controlled surveys or direct field contact, CLIFS delivers fully automated, scalable assessments while maintaining strong alignment with the established verbal measure. In benchmarks, CLIFS outperforms both existing automated approaches and human annotation. As a proof of concept, we apply CLIFS to violence risk assessment to demonstrate that it can improve violence risk assessment by more than 240%. Building on our identification of a new NLP task and early success, we underscore the need to develop larger, more diverse datasets that encompass additional fusion-target domains and cultural backgrounds to enhance generalizability and further advance this emerging area. CLIFS models and code are public at this https URL.
摘要：量化身份融合 - 自我与另一个实体或抽象目标的心理融合（例如宗教团体，政党，意识形态，价值，品牌，信仰等） - 对于理解广泛的基于群体的人类行为至关重要。我们介绍了认知语言身份融合得分（CLIFS），这是一种将认知语言学与大语言模型（LLMS）相结合的新型指标，该语言模型（LLMS）建立在隐式隐喻检测的基础上。与需要受控调查或直接野外接触的传统绘画和言语量表不同，CLIF提供了完全自动化的可扩展评估，同时保持与已建立的口头措施的牢固保持一致性。在基准中，CLIF的表现优于现有的自动化方法和人类注释。作为概念证明，我们将CLIF应用于暴力风险评估，以证明它可以将暴力风险评估提高240％以上。在我们确定新的NLP任务和早期成功的基础上，我们强调了开发更大，更多样化的数据集的需求，这些数据集涵盖了其他融合目标领域和文化背景，以增强普遍性并进一步推进这一新兴领域。 CLIFS模型和代码在此HTTPS URL上是公开的。

Title: Can GRPO Boost Complex Multimodal Table Understanding?

Authors: Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, Qiufeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16889
Pdf URL: https://arxiv.org/pdf/2509.16889
Copy Paste: [[2509.16889]] Can GRPO Boost Complex Multimodal Table Understanding?(https://arxiv.org/abs/2509.16889)
Keywords: gpt, prompt
Abstract: Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised finetuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggles with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) Warm-up that prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards of residual steps based on the hint-guided question. Extensive experiments demonstrate that Table-R1 markedly boosts the model's table reasoning performance on both held-in and held-out datasets, outperforming SFT and GRPO by a large margin. Notably, Qwen2-VL-7B with Table-R1 surpasses larger table-specific understanding models (e.g., Table-LLaVA 13B), even achieving performance comparable to the closed-source model GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
摘要：现有的表格理解方法由于复杂的表结构和复杂的逻辑推理而面临挑战。尽管受到监督的Finetuning（SFT）主导了现有的研究，但诸如小组相对政策优化（GRPO）之类的强化学习（RL）已表现出希望，但在较低的初始政策准确性和在表格环境中挣扎的粗糙奖励。在本文中，我们介绍了Table-R1，这是一个三个阶段的RL框架，通过以下三阶段的框架增强了多模式的理解：（1）促使初始感知和推理能力的热身，（2）感知一致性GRPO（PA-GRPO）（PA-GRPO）使用连续的树式相似性（TEDS）恢复（TEDS）恢复（TEDS）恢复（TEDS）和（3）（HC-GRPO），它根据提示引入的问题利用残留步骤的细粒度奖励。广泛的实验表明，Table-R1可以在持有的数据集和固定数据集上显然提高模型的表推理性能，从而超过SFT和GRPO。值得注意的是，带有Table-R1的QWEN2-VL-7B超过了更大的特定表理解模型（例如，表格13b），甚至可以在Hold-In-In数据集上实现与封闭源模型GPT-4O相当的性能，从而证明了表R1每个阶段的功效，证明了Table-R1的每个阶段在表R1级的初始化效果上，并奖励了巨大的sparsity，并奖励了强大的跨度效果。

Title: CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification

Authors: Nawar Turk, Daniele Comitogianni, Leila Kosseim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16903
Pdf URL: https://arxiv.org/pdf/2509.16903
Copy Paste: [[2509.16903]] CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification(https://arxiv.org/abs/2509.16903)
Keywords: language model, llm, prompt
Abstract: We present our submission to Task 3 (Discourse Relation Classification) of the DISRPT 2025 shared task. Task 3 introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks, posing significant multilingual and cross-formalism challenges. We first benchmark the task by fine-tuning multilingual BERT-based models (mBERT, XLM-RoBERTa-Base, and XLM-RoBERTa-Large) with two argument-ordering strategies and progressive unfreezing ratios to establish strong baselines. We then evaluate prompt-based large language models (namely Claude Opus 4.0) in zero-shot and few-shot settings to understand how LLMs respond to the newly proposed unified labels. Finally, we introduce HiDAC, a Hierarchical Dual-Adapter Contrastive learning model. Results show that while larger transformer models achieve higher accuracy, the improvements are modest, and that unfreezing the top 75% of encoder layers yields performance comparable to full fine-tuning while training far fewer parameters. Prompt-based models lag significantly behind fine-tuned transformers, and HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter-efficient than full fine-tuning.
摘要：我们将提交给DISP 2025共享任务的任务3（话语关系分类）的提交。任务3引入了39个语言中的17种语言和6种语言框架的统一的17个话语关系标签，提出了重大的多语言和跨形式挑战。我们首先通过微调基于BERT的模型（Mbert，XLM-Roberta-Base和XLM-Roberta-Large）来基准这项任务，并具有两种参数排序的策略和渐进的未释放比率，以建立强大的基准。然后，我们在零击和几乎没有拍摄的设置中评估基于及时的大语言模型（即Claude Opus 4.0），以了解LLM对新提出的统一标签的反应。最后，我们介绍了HIDAC，这是一种分层双适配对比学习模型。结果表明，尽管较大的变压器模型达到了更高的精度，但改进是适度的，并且在训练较少的参数较少的同时，将顶部75％的编码器层产生的性能与完整的微调相当。基于及时的模型显着落后于微调的变压器，Hidac达到了最高的总体准确性（67.5％），同时比完整的微调保持了参数效率高。

Title: CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Authors: Wenhao Zhuang, Yuan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16914
Pdf URL: https://arxiv.org/pdf/2509.16914
Copy Paste: [[2509.16914]] CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages(https://arxiv.org/abs/2509.16914)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
摘要：大型语言模型（LLMS）在各种NLP任务中展示了出色的零拍功能，从而显着提高了用户体验和效率。但是，此优势主要限于资源丰富的语言。对于各种各样的低资源语言，支持仍然不足，培训语料库的稀缺性被认为是主要原因。我们构建和开源可爱的中文，Uyghur，藏语，英文数据集，由两套25GB的四语语言Corpora（一组平行和一个非平行）组成，是通过机器翻译获得的。可爱的包含两种资源丰富的语言（中文和英语）以及两种低资源语言（Uyghur和藏语）。在构建可爱的人类评估之前，在中文 - 藏族之间的机器翻译质量验证了中文 - 英语翻译的质量。 Cute代表了迄今为止Uyghur和藏语语言最大的开源语料库，我们在研究Corpus Paralleleist在跨语言转移学习中的作用时，展示了其在增强LLMS处理低资源语言的能力方面的有效性。可爱的语料库和相关模型可公开提供研究社区。

Title: K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling

Authors: Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16929
Pdf URL: https://arxiv.org/pdf/2509.16929
Copy Paste: [[2509.16929]] K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling(https://arxiv.org/abs/2509.16929)
Keywords: language model
Abstract: Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, K-DeCore, which operates with a fixed number of tunable parameters. Unlike prior methods, K-DeCore introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, K-DeCore integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model's generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of K-DeCore over existing continual learning methods across multiple metrics, leveraging various backbone large language models.
摘要：持续的结构化知识推理（CSKR）着重于处理顺序任务的训练模型，在该模型中，每个任务都涉及将自然语言问题转化为基于结构化知识的结构化查询。现有的一般持续学习方法在应用于此任务时面临重大挑战，包括对随着任务的增加，由于参数增长而导致的异质结构化知识的泛化和效率低下的推理。为了解决这些限制，我们提出了一个新颖的CSKR框架K-DeCore，该框架以固定数量的可调参数运行。与先前的方法不同，K-DeCore引入了一种知识解耦机制，将推理过程分解为特定于任务和任务无关的阶段，从而有效地弥合了各种任务的差距。K-DeCore以此为基础，集成了一个双重记忆合并机制，以实现不同的阶段，并引入了结构引导的伪DATA合成策略，以进一步增强模型的泛化能力。在四个基准数据集上进行的广泛实验证明了K-DeCore的优越性，而不是多个指标的现有学习方法的优势，利用各种骨干大语言模型。

Title: AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

Authors: Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16952
Pdf URL: https://arxiv.org/pdf/2509.16952
Copy Paste: [[2509.16952]] AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation(https://arxiv.org/abs/2509.16952)
Keywords: language model, llm, agent
Abstract: The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language model (LLM)-based agents can automate question answering (QA) workflows for scientific papers, a comprehensive and realistic benchmark to evaluate their capabilities is still lacking. Moreover, training an interactive agent for this specific task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,948 papers and 1,246 questions, that encompasses multi-task, multi-modal and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
摘要：学术论文的数量越来越大，研究人员越来越难以提及关键信息。尽管大型语言模型（LLM）的代理能够自动化问题答案（QA）的科学论文工作流程，但仍然缺乏评估其功能的全面和现实的基准。此外，高质量互动轨迹的短缺阻碍了培训针对此特定任务的交互式代理。在这项工作中，我们提出了AIRQA，这是一个人工智能领域（AI）中的人类宣传的综合论文QA数据集，其中包含13,948篇论文和1,246个问题，其中包含多任务，多模式和实例评估。此外，我们提出了提取器，这是指导数据综合的自动化框架。借助三种基于LLM的代理，提取器可以在不干预的情况下执行示例生成和轨迹收集。对多个开源模型的评估表明，大多数模型在AIRQA上表现不佳，证明了我们数据集的质量。广泛的实验证实，提取器一致地提高了小型模型的多转化工具使用能力，从而使它们能够实现与较大型号相当的性能。

Title: Preference Distillation via Value based Reinforcement Learning

Authors: Minchan Kwon, Junwon Ko, Kangil Kim, Junmo Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16965
Pdf URL: https://arxiv.org/pdf/2509.16965
Copy Paste: [[2509.16965]] Preference Distillation via Value based Reinforcement Learning(https://arxiv.org/abs/2509.16965)
Keywords: language model
Abstract: Direct Preference Optimization (DPO) is a powerful paradigm to align language models with human preferences using pairwise comparisons. However, its binary win-or-loss supervision often proves insufficient for training small models with limited capacity. Prior works attempt to distill information from large teacher models using behavior cloning or KL divergence. These methods often focus on mimicking current behavior and overlook distilling reward modeling. To address this issue, we propose Teacher Value-based Knowledge Distillation (TVKD), which introduces an auxiliary reward from the value function of the teacher model to provide a soft guide. This auxiliary reward is formulated to satisfy potential-based reward shaping, ensuring that the global reward structure and optimal policy of DPO are preserved. TVKD can be integrated into the standard DPO training framework and does not require additional rollouts. Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.
摘要：直接偏好优化（DPO）是使用成对比较将语言模型与人类偏好相结合的强大范式。但是，它的二进制获胜或损坏监督通常证明不足以培训容量有限的小型模型。先前的工作试图使用行为克隆或KL差异从大型教师模型中提取信息。这些方法通常集中在模仿当前行为并忽略蒸馏奖励建模。为了解决这个问题，我们提出基于教师价值的知识蒸馏（TVKD），该方法从教师模型的价值函数中引入了辅助奖励，以提供软指南。该辅助奖励的制定是为了满足潜在的奖励成型，以确保保留DPO的全球奖励结构和最佳政策。 TVKD可以集成到标准DPO培训框架中，不需要其他推出。我们的实验结果表明，TVKD始终提高各种基准和模型大小的性能。
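
Note: the abstract does not spell out the shaping term, so the following is only the generic potential-based shaping identity, included to show why such an auxiliary reward can leave the optimal policy unchanged. The symbols \Phi (a potential over states), \gamma (discount), and s_t, a_t are illustrative assumptions, not notation taken from the TVKD paper.

```latex
% Generic potential-based reward shaping (illustrative notation only).
\begin{align}
r'(s_t, a_t, s_{t+1}) &= r(s_t, a_t, s_{t+1}) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t), \\
\sum_{t=0}^{T-1} \gamma^{t}\, r'(s_t, a_t, s_{t+1})
  &= \sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t, s_{t+1})
     + \gamma^{T}\,\Phi(s_T) - \Phi(s_0).
\end{align}
```

The shaping terms telescope, so the shaped return differs from the original return only by a quantity that depends on the start and end states. When the end-state term vanishes (for example, a terminal potential of zero), every trajectory from the same start state is shifted by the same constant, and the ranking of policies is preserved.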

Title: Advancing Speech Understanding in Speech-Aware Language Models with GRPO

Authors: Avishai Elmakies, Hagai Aronowitz, Nimrod Shabtay, Eli Schwartz, Ron Hoory, Avihu Dekel
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.16990
Pdf URL: https://arxiv.org/pdf/2509.16990
Copy Paste: [[2509.16990]] Advancing Speech Understanding in Speech-Aware Language Models with GRPO(https://arxiv.org/abs/2509.16990)
Keywords: language model, llm
Abstract: In this paper, we introduce a Group Relative Policy Optimization (GRPO)-based method for training Speech-Aware Large Language Models (SALLMs) on open-format speech understanding tasks, such as Spoken Question Answering and Automatic Speech Translation. SALLMs have proven highly effective for speech understanding tasks. GRPO has recently gained traction for its efficiency in training LLMs, and prior work has explored its application to SALLMs, primarily in multiple-choice tasks. Building on this, we focus on open-format tasks that better reflect the generative abilities of the models. Our approach leverages GRPO with BLEU as the reward signal to optimize SALLMs, and we demonstrate empirically that it surpasses standard SFT across several key metrics. Finally, we explore the potential of incorporating off-policy samples within GRPO for these tasks, highlighting avenues for further improvement and further research.
摘要：在本文中，我们介绍了一个基于培训语音感知的大语模型（SALLMS）的基于小组相对政策优化（GRPO）的方法，以理解诸如口头问题答案和自动语音翻译之类的开放格式语音理解任务。 Sallms已证明对语音理解任务非常有效。 GRPO最近因其在培训LLM方面的效率而获得了吸引力，并且先前的工作探索了其在SALLMS的应用，主要是在多项选择任务中。在此基础上，我们专注于更好地反映模型生成能力的开放式任务。我们的方法以BLEU为奖励信号来利用GRPO来优化SALLMS，我们从经验上证明了它超过了几个关键指标的标准SFT。最后，我们探讨了将这些任务纳入GRPO中的非政策样品的潜力，从而强调了进一步改进和进一步研究的途径。
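
Note: a minimal sketch of the group-relative advantage computation that GRPO is generally described as using, here fed with BLEU-like rewards. The function name, the reward values, and the choice of population standard deviation are assumptions for illustration; the abstract does not give the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by the
    group standard deviation (population std is an assumption of this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical BLEU scores (scaled to [0, 1]) for four sampled outputs of one prompt.
bleu_rewards = [0.10, 0.30, 0.50, 0.30]
advantages = group_relative_advantages(bleu_rewards)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.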

Title: The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs

Authors: Hinata Tezuka, Naoya Inoue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17030
Pdf URL: https://arxiv.org/pdf/2509.17030
Copy Paste: [[2509.17030]] The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs(https://arxiv.org/abs/2509.17030)
Keywords: llm
Abstract: Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of such transformation and the underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate The Transfer Neurons Hypothesis: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.
摘要：最近的研究提出了一个基于解码器的LLM中多语言输入的处理框架：早期层将输入转换为以英语为中心和语言的表示；中层在以英语为中心的潜在空间内执行推理；最终层通过将这些表示形式转换为特定语言的潜在空间来生成输出。但是，这种转化和潜在机制的内部动力学仍然没有被逐渐解散。为了深入了解该框架，我们提出并经验验证了转移神经元假设：MLP模块中的某些神经元负责在语言特定的潜在空间和共享语义潜在空间之间传递表示。此外，我们表明，在最近的研究中确定的，特定语言神经元的一个功能是促进潜在空间之间的运动。最后，我们表明转移神经元对于多语言LLM中的推理至关重要。

Title: Modeling Bottom-up Information Quality during Language Processing

Authors: Cui Ding, Yanning Yin, Lena A. Jäger, Ethan Gotlieb Wilcox
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17047
Pdf URL: https://arxiv.org/pdf/2509.17047
Copy Paste: [[2509.17047]] Modeling Bottom-up Information Quality during Language Processing(https://arxiv.org/abs/2509.17047)
Keywords: language model
Abstract: Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing -- noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization for the "quality" of bottom-up information as the mutual information (MI) between visual information and word identity. We formalize this prediction in a mathematical model of reading as a Bayesian update. Second, we test our operationalization by comparing participants' reading times in conditions where words' information quality has been reduced, either by occluding their top or bottom half, with full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words. We use these data to estimate the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In English and Chinese, the upper half contains more information about word identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
摘要：当代理论模型语言处理是集成自上而下的期望和自下而上的输入。这种模型的一个主要预测是，自下而上输入的质量调节了处理的易度性 - 嘈杂的输入应该导致困难和努力的理解。我们在阅读领域中测试了这一预测。首先，我们为自下而上信息的“质量”提出了一个信息理论操作，作为视觉信息和单词身份之间的相互信息（MI）。我们在阅读的数学模型中将这一预测形式化为贝叶斯更新。其次，我们通过比较参与者的阅读时间在单词“信息质量降低的条件下，通过用完整的单词遮挡他们的上半场或下半部分）来测试我们的操作。我们以英语和中文收集数据。然后，我们使用多模式模型来估计视觉输入和单词之间的相互信息。我们使用这些数据来估计减少信息质量对阅读时间的特定效果。最后，我们比较信息如何跨视觉形式分布。在英语和中文中，上半部分比下半部分包含有关单词身份的更多信息。但是，不对称性在英语中更为明显，这种模式反映在阅读时间中。
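
Note: for reference, the standard definition of mutual information that the "quality" measure above builds on. The symbols V (visual input) and W (word identity) are illustrative assumptions; the abstract does not specify the estimator used with the multimodal language models.

```latex
% Standard mutual information between visual input V and word identity W
% (illustrative notation only).
\begin{align}
I(V; W) &= \sum_{v, w} p(v, w)\, \log \frac{p(v, w)}{p(v)\, p(w)} \\
        &= H(W) - H(W \mid V).
\end{align}
```

Under this reading, occluding part of a word leaves more uncertainty about its identity given the visual input, so H(W | V) rises and I(V; W) falls.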

Title: TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies?

Authors: Yiwei Liu, Emma Jane Pretty, Jiahao Huang, Saku Sugawara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17054
Pdf URL: https://arxiv.org/pdf/2509.17054
Copy Paste: [[2509.17054]] TactfulToM: Do LLMs Have the Theory of Mind Ability to Understand White Lies?(https://arxiv.org/abs/2509.17054)
Keywords: language model, llm
Abstract: While recent studies explore Large Language Models' (LLMs) performance on Theory of Mind (ToM) reasoning tasks, research on ToM abilities that require more nuanced social context, such as white lies, is limited. We introduce TactfulToM, a novel English benchmark designed to evaluate LLMs' ability to understand white lies within real-life conversations and reason about prosocial motivations behind them, particularly when they are used to spare others' feelings and maintain social harmony. Our benchmark is generated through a multi-stage human-in-the-loop pipeline where LLMs expand manually designed seed stories into conversations to maintain the information asymmetry between participants necessary for authentic white lies. We show that TactfulToM is challenging for state-of-the-art models, which perform substantially below humans, revealing shortcomings in their ability to fully comprehend the ToM reasoning that enables true understanding of white lies.
摘要：尽管最近的研究探讨了大型语言模型（LLMS）在心理理论（TOM）推理任务上的表现，但对需要更多细微差别社会环境的Tom能力的研究是有限的，例如白人谎言。我们介绍了TactfulTom，这是一种新颖的英语基准，旨在评估LLMS在现实生活中的对话中理解白人谎言的能力以及有关亲社会动机背后的谎言的能力，尤其是当它们被用来避免他人的感受并保持社会和谐时。我们的基准是通过多阶段的人类在循环管道中生成的，在该管道中，LLM将手动设计的种子故事扩展到对话中，以维持正宗白色谎言所需的参与者之间的信息不对称性。我们表明，TactfulTom对于最先进的模型而言是充满挑战的，该模型的表现基本低于人类，这表明了他们完全理解对白人谎言真正理解的TOM推理的能力的缺点。

Title: SFT-TA: Supervised Fine-Tuned Agents in Multi-Agent LLMs for Automated Inductive Thematic Analysis

Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Joseph Skrovan, Mehak Beri, Hitakshi Modi, Andrew Well, Liu Leqi, Mia Markey, Ying Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17167
Pdf URL: https://arxiv.org/pdf/2509.17167
Copy Paste: [[2509.17167]] SFT-TA: Supervised Fine-Tuned Agents in Multi-Agent LLMs for Automated Inductive Thematic Analysis(https://arxiv.org/abs/2509.17167)
Keywords: gpt, llm, agent
Abstract: Thematic Analysis (TA) is a widely used qualitative method that provides a structured yet flexible framework for identifying and reporting patterns in clinical interview transcripts. However, manual thematic analysis is time-consuming and limits scalability. Recent advances in LLMs offer a pathway to automate thematic analysis, but alignment with human results remains limited. To address these limitations, we propose SFT-TA, an automated thematic analysis framework that embeds supervised fine-tuned (SFT) agents within a multi-agent system. Our framework outperforms existing frameworks and the gpt-4o baseline in alignment with human reference themes. We observed that SFT agents alone may underperform, but achieve better results than the baseline when embedded within a multi-agent system. Our results highlight that embedding SFT agents in specific roles within a multi-agent system is a promising pathway to improve alignment with desired outputs for thematic analysis.
摘要：主题分析（TA）是一种广泛使用的定性方法，它提供了一个结构化但灵活的框架，用于在临床访谈成绩单中识别和报告模式。但是，手动主题分析是耗时的，并且限制了可扩展性。 LLMS的最新进展为自动化主题分析提供了途径，但与人类结果的一致性仍然有限。为了解决这些限制，我们提出了SFT-TA，这是一个自动化的主题分析框架，该框架将监督的微调（SFT）代理嵌入多代理系统中。我们的框架优于现有框架和与人类参考主题保持一致的GPT-4O基线。我们观察到，单独的SFT代理可能表现不佳，但是在嵌入多代理系统中的基线比基线更好。我们的结果强调，将SFT代理嵌入多代理系统中的特定角色是一种有前途的途径，可以通过主题分析的所需输出来改善对齐方式。

Title: FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17177
Pdf URL: https://arxiv.org/pdf/2509.17177
Copy Paste: [[2509.17177]] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions(https://arxiv.org/abs/2509.17177)
Keywords: language model
Abstract: We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: this https URL
摘要：我们对当前的大型推理模型（LRMS）进行了无污染的评估，并进行了一些初步发现。我们还发布了罗马，这是旨在从视觉线索中测试推理的视觉语言模型的评估基准。我们将链接附加到本网站上的基准，评估数据和其他更新：此HTTPS URL

Title: Attention Consistency for LLMs Explanation

Authors: Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17178
Pdf URL: https://arxiv.org/pdf/2509.17178
Copy Paste: [[2509.17178]] Attention Consistency for LLMs Explanation(https://arxiv.org/abs/2509.17178)
Keywords: language model, llm
Abstract: Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high computational cost. To address these limitations, we propose the Multi-Layer Attention Consistency Score (MACS), a novel, lightweight, and easily deployable heuristic for estimating the importance of input tokens in decoder-based models. MACS measures contributions of input tokens based on the consistency of maximal attention. Empirical evaluations demonstrate that MACS achieves a favorable trade-off between interpretability quality and computational efficiency, showing faithfulness comparable to complex techniques with a 22% decrease in VRAM usage and 30% reduction in latency.
摘要：了解大语言模型（LLM）的决策过程对于他们值得信赖的发展和部署至关重要。但是，当前的可解释性方法通常面临诸如低分辨率和高计算成本之类的挑战。为了解决这些局限性，我们提出了多层注意力一致性得分（MACS），这是一种新颖，轻巧且易于部署的启发式，以估计基于解码器模型中输入令牌的重要性。 MACS根据最大关注的一致性来衡量输入令牌的贡献。经验评估表明，MACS在可解释性质量和计算效率之间取得了良好的权衡，表明忠诚与复杂技术相当，VRAM使用率下降22％，延迟降低30％。

Title: LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

Authors: Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17183
Pdf URL: https://arxiv.org/pdf/2509.17183
Copy Paste: [[2509.17183]] LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization(https://arxiv.org/abs/2509.17183)
Keywords: language model, llm
Abstract: Alignment plays a crucial role in Large Language Models (LLMs) in aligning with human preferences on a specific task/domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously acquired knowledge when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned knowledge. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of knowledge acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches. The codes and datasets will be released on GitHub.
摘要：对齐在大语言模型（LLM）中起着至关重要的作用，与对特定任务/领域的人类偏好保持一致。传统的一致性方法遭受灾难性遗忘的困扰，在适应新的偏好或领域时，模型将失去以前获得的知识。我们介绍了LifeAlign，这是一个新颖的终身对齐框架，使LLM可以在不忘记以前学习的知识的情况下保持在顺序学习任务之间的一致人类偏好对齐。我们的方法包括两个关键的创新。首先，我们提出了一种集中的偏好优化策略，该策略将LLM与新的偏好保持一致，同时阻止从先前任务中获得的知识侵蚀。其次，我们开发了一种短期的内存整合机制，该机制将固定的短期偏好表示形式合并为稳定的长期记忆，使用固有维度降低，从而有效存储并检索各种域之间的对齐方式。我们评估跨越不同域和偏好类型的多个顺序对齐任务的LifeArign。实验结果表明，与现有的终身学习方法相比，我们的方法在保持偏好对齐质量和知识保留方面取得了出色的表现。代码和数据集将在GitHub上发布。

Title: Evolution of Concepts in Language Model Pre-Training

Authors: Xuyang Ge, Wentao Shu, Jiaxing Wu, Yunhua Zhou, Zhengfu He, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17196
Pdf URL: https://arxiv.org/pdf/2509.17196
Copy Paste: [[2509.17196]] Evolution of Concepts in Language Model Pre-Training(https://arxiv.org/abs/2509.17196)
Keywords: language model
Abstract: Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
摘要：语言模型通过预训练获得广泛的功能。但是，预训练过程仍然是黑匣子。在这项工作中，我们使用称为CrossCoders的稀疏字典学习方法跟踪跨训练快照的线性解释特征演变。我们发现，大多数功能开始围绕特定点形成，而在以后的训练阶段中出现了更复杂的模式。特征归因分析揭示了特征演化与下游性能之间的因果关系。我们的功能级观察与以前关于变压器两阶段学习过程的发现高度一致，我们将其称为统计学习阶段和特征学习阶段。我们的工作打开了在语言模型学习动态过程中跟踪细粒度表示进展的可能性。

Title: Prompt-Based Simplification for Plain Language using Spanish Language Models

Authors: Lourdes Moreno, Jesus M. Sanchez-Gomez, Marco Antonio Sanchez-Escudero, Paloma Martínez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17209
Pdf URL: https://arxiv.org/pdf/2509.17209
Copy Paste: [[2509.17209]] Prompt-Based Simplification for Plain Language using Spanish Language Models(https://arxiv.org/abs/2509.17209)
Keywords: language model, prompt, chat
Abstract: This paper describes the participation of HULAT-UC3M in CLEARS 2025 Subtask 1: Adaptation of Text to Plain Language (PL) in Spanish. We explored strategies based on models trained on Spanish texts, including a zero-shot configuration using prompt engineering and a fine-tuned version with Low-Rank Adaptation (LoRA). Different strategies were evaluated on representative internal subsets of the training data, using the official task metrics, cosine similarity (SIM) and the Fernández-Huerta readability index (FH), to guide the selection of the optimal model and prompt combination. The final system was selected for its balanced and consistent performance, combining normalization steps, the RigoChat-7B-v2 model, and a dedicated PL-oriented prompt. It ranked first in semantic similarity (SIM = 0.75) but fourth in readability (FH = 69.72). We also discuss key challenges related to training data heterogeneity and the limitations of current evaluation metrics in capturing both linguistic clarity and content preservation.
摘要：本文介绍了Hulat-UC3M在Clears 2025子任务1中的参与：在西班牙语中适应文本对普通语言（PL）。我们根据对西班牙文本训练的模型进行了探索，包括使用及时工程的零拍配置和具有低级改编（LORA）的微调版本。使用官方任务指标，余弦相似性（SIM）和Fernández-Huerta可读性指数（FH）对培训数据的代表性内部子集进行了不同的策略，以指导最佳模型和及时组合的选择。选择最终系统以保持平衡和一致的性能，结合归一化步骤，Rigochat-7b-V2模型和专用PL的提示。但是，它在语义相似性（SIM = 0.75）中排名第一，但是，可读性第四（FH = 69.72）。我们还讨论了与培训数据异质性有关的关键挑战以及当前评估指标在捕获语言清晰度和内容保存方面的局限性。

Title: Extending Automatic Machine Translation Evaluation to Book-Length Documents

Authors: Kuang-Da Wang, Shuoyang Ding, Chao-Han Huck Yang, Ping-Chun Hsieh, Wen-Chih Peng, Vitaly Lavrukhin, Boris Ginsburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17249
Pdf URL: https://arxiv.org/pdf/2509.17249
Copy Paste: [[2509.17249]] Extending Automatic Machine Translation Evaluation to Book-Length Documents(https://arxiv.org/abs/2509.17249)
Keywords: language model, llm, prompt
Abstract: Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token number restrictions in metrics, and rigid sentence boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes, while being comparable to evaluations performed with groundtruth sentence alignments. Additionally, we apply our scheme to book-length texts and newly demonstrate that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.
摘要：尽管大型语言模型（LLMS）证明了卓越的翻译性能和长期文化功能，但由于数据集限制，指标的标记数限制以及刚性句子边界要求，评估方法仍然限制在句子级评估中。我们介绍了Segale，这是一种评估方案，该方案将现有的自动指标扩展到长期文档翻译，通过将文档视为连续文本并应用句子细分和对齐方式。我们的方法启用了以前无法实现的文档级评估，处理文档级提示生成的任意长度的翻译，同时考虑到不足/过度翻译和各种句子边界。实验表明，我们的计划明显胜过现有的长期文档评估方案，同时与与地面句子一致性进行的评估相媲美。此外，我们将计划应用于书本文本，并新证明，许多开放式LLMS无法在其报告的最大上下文长度上有效地翻译文档。

Title: Probabilistic Token Alignment for Large Language Model Fusion

Authors: Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17276
Pdf URL: https://arxiv.org/pdf/2509.17276
Copy Paste: [[2509.17276]] Probabilistic Token Alignment for Large Language Model Fusion(https://arxiv.org/abs/2509.17276)
Keywords: language model, llm
Abstract: Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LLMs with different architectures into a more powerful model. However, a key challenge in existing model fusion is its dependence on manually predefined vocabulary alignment, which may not generalize well across diverse contexts, leading to performance degradation in several evaluations. To solve this, we draw inspiration from distribution learning and propose the probabilistic token alignment method as a general and soft mapping for alignment, named PTA-LLM. Our approach innovatively reformulates token alignment into a classic mathematical problem: optimal transport, seamlessly leveraging distribution-aware learning to facilitate more coherent model fusion. Apart from its inherent generality, PTA-LLM exhibits interpretability from a distributional perspective, offering insights into the essence of the token alignment. Empirical results demonstrate that probabilistic token alignment enhances the target model's performance across multiple capabilities. Our code is available at this https URL.
摘要：从头开始培训大型语言模型（LLMS）可以产生具有独特功能和优势的模型，但它是昂贵的，并且通常会导致冗余功能。一个更具成本效益的替代方法是将现有的预培训的LLM与不同的体系结构融合为更强大的模型。但是，现有模型融合中的一个主要挑战是它们对手动预定义词汇一致性的依赖，这可能无法很好地跨越各种环境，从而导致多项评估的性能下降。为了解决这个问题，我们从分布学习中汲取灵感，并提出概率令牌对准方法作为一般和软映射的一般和软映射，称为PTA-LLM。我们创新的方法将令牌对齐构成一个经典的数学问题：最佳运输，无缝利用分布感知学习以促进更连贯的模型融合。除了其固有的一般性外，PTA-LLM从分布的角度表现出可解释性，从而提供了对令牌一致性本质的见解。经验结果表明，概率令牌比对增强了目标模型在多个功能上的性能。我们的代码在此HTTPS URL上可用。

Title: Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling

Authors: Sydney Anuyah, Mehedi Mahmud Kaushik, Krishna Dwarampudi, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17289
Pdf URL: https://arxiv.org/pdf/2509.17289
Copy Paste: [[2509.17289]] Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling(https://arxiv.org/abs/2509.17289)
Keywords: language model, prompt, chain-of-thought
Abstract: We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute a dataset of over 150,000 knowledge triples, which is open source. We also contribute a training corpus of 7248 rows for sentence complexity, 190 rows of gold human annotations for co-reference resolution using open source lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at this https URL
摘要：我们介绍了Code-KG，这是一种开源的，端到端的管道，用于通过将强大的核心分辨率与句法句子分解相结合来提取句子级知识图。使用我们的模型，我们贡献了一个超过150,000个知识三元的数据集，这是开源的。我们还为句子复杂性贡献了一个7248行的培训语料库，使用PubMed的开源肺癌摘要，190行金人类注释共同参考，900行句子转换政策的金色人类注释，以及398个金人类注释三分。我们系统地选择了五个复杂性类别的最佳提示模型对，这表明混合链的链条和很少的发射提示可在简化的句子简化时产生高达99.8％的精确匹配精度。关于关系提取（RE），我们的管道在叛军上达到了65.8％的宏F1，对先前的最新状态的8点增益，在WebNLG2上获得了75.7％的Micro-F1，同时在Wiki-nre和Carb上匹配或超过性能。消融研究表明，整合核心和分解会使罕见关系的回忆增加超过20％。代码和数据集可在此HTTPS URL上找到

Title: Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection

Authors: Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17292
Pdf URL: https://arxiv.org/pdf/2509.17292
Copy Paste: [[2509.17292]] Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection(https://arxiv.org/abs/2509.17292)
Keywords: language model, llm
Abstract: Cognitive distortions have been closely linked to mental health disorders, yet their automatic detection remains challenging due to contextual ambiguity, co-occurrence, and semantic overlap. We propose a novel framework that combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Each utterance is decomposed into Emotion, Logic, and Behavior (ELB) components, which are processed by LLMs to infer multiple distortion instances, each with a predicted type, expression, and model-assigned salience score. These instances are integrated via a Multi-View Gated Attention mechanism for final classification. Experiments on Korean (KoACD) and English (Therapist QA) datasets demonstrate that incorporating ELB and LLM-inferred salience scores improves classification performance, especially for distortions with high interpretive ambiguity. Our results suggest a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP.
摘要：认知扭曲与心理健康障碍密切相关，但是由于上下文歧义，共同出现和语义重叠，它们的自动检测仍然具有挑战性。我们提出了一个新颖的框架，该框架将大型语言模型（LLMS）与多种现实学习（MIL）结构相结合，以增强可解释性和表达水平的推理。每个话语都被分解为情感，逻辑和行为（ELB）组件，这些组件是由LLMS处理的，以推断多个失真实例，每个实例都具有预测类型，表达和模型分配的显着性评分。这些实例是通过多视图的门控注意机制整合的，以进行最终分类。关于韩语（KOACD）和英语（治疗师质量检查）数据集的实验表明，纳入ELB和LLM提取的显着分数可以提高分类性能，尤其是对于具有高解释性歧义的扭曲而言。我们的结果提出了一种在心理健康NLP中进行精神上且可概括的推理的方法。

Title: Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text

Authors: Dan John Velasco, Matthew Theodore Roque
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17317
Pdf URL: https://arxiv.org/pdf/2509.17317
Copy Paste: [[2509.17317]] Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text(https://arxiv.org/abs/2509.17317)
Keywords: gpt, llm
Abstract: Most languages lack sufficient data for large-scale monolingual pretraining, creating a "data wall." Multilingual pretraining helps but is limited by language imbalance and the "curse of multilinguality." An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil--two typologically distant, lower-resource languages--and pretraining GPT-2 models (124M-774M) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
摘要：大多数语言都缺乏足够的数据来进行大规模的单语言审核，从而创建了“数据墙”。多语言预测有帮助，但受到语言失衡和“多语言的诅咒”的限制。另一种选择是用机器翻译（MT）翻译高资源文本，该文本提出了三个问题：（1）MT衍生的数据量表如何具有模型容量？（2）源端转换（例如，用LLM简化英语）可以改善对本地文本的概括吗？（3）在不断受到有限的天然文本培训时，在MT衍生的数据中预处理的模型如何适应？我们通过将英语翻译成印尼语和泰米尔语（两种类型远处，低资源的语言），并从RAW和LLM拟光的英语中列出了gpt-2型号（124m-774m），调查了这些问题。我们评估了本机文本上的跨凝性损失，以及句法探针和下游任务的准确性。我们的结果表明，（1）MT预言的模型受益于缩放；（2）源端简化会危害对天然文本的概括；（3）在天然文本上调整MT预言的模型通常比本地数据较少的本机模型产生更好的性能。但是，需要文化细微差别的任务（例如毒性检测）需要更多地接触天然数据。

Title: AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning

Authors: Yujie Feng, Jian Li, Xiaoyu Dong, Pengfei Xu, Xiaohui Zhou, Yujia Zhang, Zexin LU, Yasha Wang, Alan Zhao, Xu Chu, Xiao-Ming Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17348
Pdf URL: https://arxiv.org/pdf/2509.17348
Copy Paste: [[2509.17348]] AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning(https://arxiv.org/abs/2509.17348)
Keywords: language model, llm
Abstract: Continual learning (CL) is essential for deploying large language models (LLMs) in dynamic real-world environments without the need for costly retraining. Recent model merging-based methods have attracted significant attention, but they still struggle to effectively manage the trade-off between learning new knowledge and preventing forgetting, a challenge largely stemming from a suboptimal number of merges and merging frequency. In this paper, we introduce Adaptive Iterative Model Merging (AimMerging), a novel CL framework that utilizes learning and forgetting signals from the training trajectory to dynamically monitor the model's training status. Guided by dynamic monitoring, the training trajectory-guided merge controller adaptively determines the timing and frequency of iterative fusion, while the rehearsal-based knowledge fusion module computes the merging weights and executes the fusion. Comprehensive experiments on three CL benchmarks with various model sizes (from 770M to 13B) demonstrate that AimMerging achieves significant performance improvements over existing state-of-the-art methods, with an average relative improvement of 80% and 59% on FWT and BWT, respectively. The source code is provided for reproducibility.
摘要：持续学习（CL）对于在动态的现实环境中部署大型语言模型（LLM）至关重要，而无需昂贵的再训练。最近的基于模型合并的方法引起了极大的关注，但是他们仍然很难有效地管理学习新知识和防止遗忘之间的权衡，这在很大程度上是由于合并的数量和合并频率所引起的。在本文中，我们引入了自适应迭代模型合并（Aimmerging），这是一个新颖的CL框架，利用学习和忘记训练轨迹的信号来动态监视模型的训练状态。在动态监测的指导下，训练轨迹引导的合并控制器可以自适应地确定迭代融合的时间和频率，而基于排练的知识融合模块计算合并权重并执行融合。对三个具有各种模型尺寸（从770m到13B）的CLEN测试的全面实验表明，Aimmerging对现有最新方法的性能得到了显着改善，而FWT和BWT的平均相对相对改善分别为80％和59％。提供源代码可重复性。

Title: Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs

Authors: Haoyang Chen, Kumiko Tanaka-Ishii
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17367
Pdf URL: https://arxiv.org/pdf/2509.17367
Copy Paste: [[2509.17367]] Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs(https://arxiv.org/abs/2509.17367)
Keywords: gpt, llm
Abstract: We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps' exponent $\beta$ (vocabulary growth), Taylor's exponent $\alpha$ (word-frequency fluctuation scaling), compression rate $r$ (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower $\beta$) and higher term consistency (higher $\alpha$) than general texts. Within the legal domain, statutory codes have the lowest $\beta$ and highest $\alpha$, reflecting strict drafting conventions, while cases and deeds show higher $\beta$ and lower $\alpha$. In contrast, GPT-generated text shows statistics more closely aligned with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
摘要：我们使用无标度指标对跨域的文本复杂性进行了比较分析。我们通过堆的指数$ \ beta $（词汇增长），泰勒的指数$ \ alpha $（单词频率波动比例），压缩率$ r $（冗余）和熵来量化语言复杂性。我们的语料库跨越了三个领域：法律文件（法规，案件，契据）作为专业领域，一般自然语言文本（文学，维基百科）和AI生成的（GPT）文本。我们发现，法律文本的词汇增长较慢（$ \ beta $较低）和比一般文本较高的词汇增长（$ \ beta $）和更高的期限一致性（$ \ alpha $更高）。在法律领域内，法定代码的$ \ beta $和最高$ \ alpha $的最低，反映了严格的起草约定，而案件和契约显示出更高的$ \ beta $和较低的$ \ alpha $。相比之下，GPT生成的文本显示了统计数据与一般语言模式更加一致。这些结果表明，法律文本表现出特定于领域的结构和复杂性，当前的生成模型并未完全复制。
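
Note on Heaps' exponent: Heaps' law models vocabulary size as V(n) ≈ K·n^β, where n is the number of tokens read so far. The following is a minimal sketch of one common way to estimate β, an ordinary least-squares fit in log-log space. The function name `heaps_exponent` and the fitting procedure are illustrative assumptions, not the estimation method used in the paper.

```python
import math


def heaps_exponent(tokens):
    """Estimate beta in V(n) ~ K * n**beta with a least-squares fit in log-log space.

    `tokens` is a sequence of already-tokenized words (at least two tokens).
    This is an illustrative estimator, not the paper's procedure.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")

    seen = set()
    log_n, log_v = [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        log_n.append(math.log(n))          # log of text length so far
        log_v.append(math.log(len(seen)))  # log of vocabulary size so far

    mean_x = sum(log_n) / len(log_n)
    mean_y = sum(log_v) / len(log_v)
    sxx = sum((x - mean_x) ** 2 for x in log_n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_n, log_v))
    return sxy / sxx  # slope of the log-log fit
```

On a text where every token is new the slope is 1, and on a text that repeats a single token it is 0; a lower β therefore corresponds to slower vocabulary growth, which is the direction the abstract reports for legal texts.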

Title: Robustness of Neurosymbolic Reasoners on First-Order Logic Problems

Authors: Hannah Bansal, Kemal Kurniawan, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17377
Pdf URL: https://arxiv.org/pdf/2509.17377
Copy Paste: [[2509.17377]] Robustness of Neurosymbolic Reasoners on First-Order Logic Problems(https://arxiv.org/abs/2509.17377)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent trends in NLP aim to improve reasoning capabilities in Large Language Models (LLMs), with a key focus on generalization and robustness to variations in tasks. Counterfactual task variants introduce minimal but semantically meaningful changes to otherwise valid first-order logic (FOL) problem instances, such as altering a single predicate or swapping the roles of constants, to probe whether a reasoning system can maintain logical consistency under perturbation. Previous studies showed that LLMs become brittle on counterfactual variations, suggesting that they often rely on spurious surface patterns to generate responses. In this work, we explore whether a neurosymbolic (NS) approach that integrates an LLM and a symbolic logical solver could mitigate this problem. Experiments across LLMs of varying sizes show that NS methods are more robust but perform worse overall than purely neural methods. We then propose NSCoT, which combines an NS method with Chain-of-Thought (CoT) prompting, and demonstrate that while it improves performance, NSCoT still lags behind standard CoT. Our analysis opens research directions for future work.
摘要：NLP的最新趋势旨在提高大语言模型（LLMS）的推理能力，关注任务变化的概括和鲁棒性。反事实任务变体引入了最小但具有语义上有意义的更改，以其他有效的一阶逻辑（fol）问题实例更改常数的单个谓词或交换角色以探测推理系统在扰动下是否可以维持逻辑一致性。先前的研究表明，LLM在反事实变化上变得脆弱，表明它们通常依靠伪造的表面模式来产生响应。在这项工作中，我们探讨了整合LLM和符号逻辑求解器的神经符号（NS）方法是否可以减轻此问题。跨不同尺寸的LLM的实验表明，NS方法更健壮，但总体上的性能比纯粹的神经方法更差。然后，我们提出的NSCOT结合了NS方法和经营链（COT）提示并证明，尽管它提高了性能，但NSCOT仍然落后于标准COT。我们的分析为未来的工作打开了研究方向。

Title: FinDebate: Multi-Agent Collaborative Intelligence for Financial Analysis

Authors: Tianshi Cai, Guanxu Li, Nijia Han, Ce Huang, Zimu Wang, Changyu Zeng, Yuqi Wang, Jingshi Zhou, Haiyang Zhang, Qi Chen, Yushan Pan, Shuihua Wang, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17395
Pdf URL: https://arxiv.org/pdf/2509.17395
Copy Paste: [[2509.17395]] FinDebate: Multi-Agent Collaborative Intelligence for Financial Analysis(https://arxiv.org/abs/2509.17395)
Keywords: llm, retrieval-augmented generation, agent
Abstract: We introduce FinDebate, a multi-agent framework for financial analysis, integrating collaborative debate with domain-specific Retrieval-Augmented Generation (RAG). Five specialized agents, covering earnings, market, sentiment, valuation, and risk, run in parallel to synthesize evidence into multi-dimensional insights. To mitigate overconfidence and improve reliability, we introduce a safe debate protocol that enables agents to challenge and refine initial conclusions while preserving coherent recommendations. Experimental results, based on both LLM-based and human evaluations, demonstrate the framework's efficacy in producing high-quality analysis with calibrated confidence levels and actionable investment strategies across multiple time horizons.
摘要：我们介绍了Findebate，这是一个用于财务分析的多代理框架，将协作辩论与特定于领域的检索生成一代（RAG）相结合。五个专业特工涵盖收入，市场，情感，估值和风险，与将证据合成为多维见解。为了减轻过度自信并提高可靠性，我们引入了一个安全的辩论协议，该协议使代理商能够挑战和完善初始结论，同时保留连贯的建议。基于基于LLM的和人类评估的实验结果证明了该框架在产生高质量分析的功效，并在多个时间范围内使用校准的置信水平和可操作的投资策略。

Title: EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17396
Pdf URL: https://arxiv.org/pdf/2509.17396
Copy Paste: [[2509.17396]] EpiCache: Episodic KV Cache Management for Long Conversational Question Answering(https://arxiv.org/abs/2509.17396)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
摘要：大型语言模型（LLM）的最新进展具有扩展的上下文长度，使助手能够维持长长的历史，以获得连贯的个性化响应。但是，这种能力取决于键值（KV）缓存，其内存随对话长度线性增长，并在严格的资源约束下迅速占主导地位。 KV缓存压缩是减少此开销的积极研究线，该研究旨在限制缓存大小的同时保持准确性。然而，现有方法面临两个主要局限性：（i）在全文预击之前驱逐条目会导致无界的峰值内存，并且（ii）与查询有关的驱逐将缓存缩小到单个查询，从而导致多转向对话的准确性下降。我们介绍了Epicache，这是一个在固定的内存预算下进行长时间会话问题回答（LongConvQA）的无培训KV缓存管理框架。 Epicache界限通过较大的预填充填充范围，并通过情节性KV压缩保存与主题相关的上下文，该上下文将对话历史记录到连贯的情节中，并应用了特定于情节的KV缓存驱逐。我们进一步设计了一种自适应层的预算分配策略，该策略可以衡量每层驱逐的敏感性，并相应地跨层分配记忆预算。在三个LongConvQA基准测试中，Epicache在4-6倍压缩下的准确性高达40％，维持近乎满足的KV准确性，并将潜伏期和内存降低2.4倍和3.5倍，从而在严格的资源约束下实现有效的多型电流互动。

Title: DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

Authors: Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17399
Pdf URL: https://arxiv.org/pdf/2509.17399
Copy Paste: [[2509.17399]] DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context(https://arxiv.org/abs/2509.17399)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment \citep{ryan-etal-2024-unintended, alkhamissi-etal-2024-investigating} and produce biased generations \cite{naous-etal-2024-beer} due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and the unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises $\sim$8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic regions. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available at this https URL, the project webpage is at this https URL, and our codebase with model outputs can be found at this https URL.
摘要：大型语言模型（LLM）广泛用于各种任务和应用中。但是，尽管具有广泛的功能，但它们被证明缺乏文化对齐\ citep {ryan-etal-2024--2024-nistered，Alkhamissi-etal-2024-Indevestation}，由于缺乏文化知识，因此产生了有偏见的世代\ cite {naous-etal-etal-2024-beer}。由于缺乏适当的评估指标和文化扎根的数据集，对LLM的文化意识和对齐方式的评估尤其具有挑战性，这代表了区域和次区域级别的文化巨大复杂性。现有的针对文化特定项目的数据集（CSI）主要关注区域层面的概念，并可能包含误报。为了解决这个问题，我们介绍了一个新颖的CSI数据集，用于印度文化，属于17个文化方面。该数据集包括36个子区域的$ \ sim $ 8K文化概念。为了衡量LLM在文化文本适应任务上的文化能力，我们使用来自不同社会人口统计学区域的CSIS，LLM作为法官和人类评估来评估适应性。此外，我们进行定量分析，证明了所有被考虑的LLM的选择性次区域覆盖范围和表面级适应。我们的数据集可在此处找到：\ href {此https url} {此https url}，project webpage \ footnote {\ href {this https url} {this https url}}，并在此处可以找到带有模型的代码库，以下是以下基础。 URL}。

Title: Vision Language Models Are Not (Yet) Spelling Correctors

Authors: Junhong Liang, Bojun Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.17418
Pdf URL: https://arxiv.org/pdf/2509.17418
Copy Paste: [[2509.17418]] Vision Language Models Are Not (Yet) Spelling Correctors(https://arxiv.org/abs/2509.17418)
Keywords: language model, gpt
Abstract: Spelling correction from visual input poses unique challenges for vision language models (VLMs), as it requires not only detecting but also correcting textual errors directly within images. We present ReViCo (Real Visual Correction), the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English. ReViCo contains naturally occurring errors collected from real-world image data and supports fine-grained evaluation at both image and token levels. Through comprehensive experiments on representative cascaded (Qwen) and native (InternVL) open-source models, as well as closed-source systems (GPT-4o, Claude), we show that current VLMs fall significantly short of human performance, particularly in correction. To address these limitations, we explore two solution paradigms: a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which yield consistent performance gains. Our analysis highlights fundamental limitations of existing architectures and provides actionable insights for advancing multimodal spelling correction.
摘要：来自视觉输入的拼写校正对视觉语言模型（VLM）构成了独特的挑战，因为它不仅需要检测到图像中直接纠正文本错误。我们提出了Revico（实际视觉校正），这是第一个系统地评估VLMS对跨汉语和英语的现实视觉拼写校正的基准。 Revico包含从现实世界图像数据中收集的自然存在的错误，并支持图像和令牌级别的细粒度评估。通过对代表性级联（QWEN）和天然（InternVL）开源模型以及封闭源系统（GPT-4O，Claude）进行的全面实验，我们表明当前的VLMS在人类绩效方面大大缺乏，尤其是在校正中。为了解决这些局限性，我们探讨了两个解决方案范式：联合OCR校正管道和背景信息增强的方法，这两者都会产生一致的性能增长。我们的分析强调了现有体系结构的基本局限性，并为推进多模式拼写校正提供了可行的见解。

Title: RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios

Authors: Fei Zhao, Chengqiang Lu, Yufan Shen, Qimeng Wang, Yicheng Qian, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Zhen Wu, Shangyu Xing, Xinyu Dai
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.17421
Pdf URL: https://arxiv.org/pdf/2509.17421
Copy Paste: [[2509.17421]] RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios(https://arxiv.org/abs/2509.17421)
Keywords: llm
Abstract: While various multimodal multi-image evaluation datasets have emerged, they are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.
摘要：尽管已经出现了各种多模式的多模式评估数据集，但是这些数据集主要基于英语，并且尚未有一个中文的多图像数据集。为了填补这一空白，我们引入了RealBench，这是第一个中国多模式的多模式数据集，其中包含9393个样本和69910张图像。 RealBench通过合并真实的用户生成的内容来区分自己，从而确保与现实世界应用程序高度相关。此外，数据集涵盖了各种场景，图像分辨率和图像结构，进一步增加了多图像理解的难度。最终，我们使用21种不同尺寸的多模式LLM对RealBench进行了全面评估，包括支持多图像输入以及开源视觉和视频模型的封闭式模型。实验结果表明，即使是最强大的封闭源模型，在处理多图像中国方案时仍会面临挑战。此外，在开源视觉/视频模型和封闭源模型之间，平均仍有明显的性能差距约为71.8％。这些结果表明，RealBench为进一步探索中国背景下的多图像理解能力提供了重要的研究基础。

Title: QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Authors: Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17428
Pdf URL: https://arxiv.org/pdf/2509.17428
Copy Paste: [[2509.17428]] QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models(https://arxiv.org/abs/2509.17428)
Keywords: language model, llm
Abstract: The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at this https URL.
摘要：大型语言模型（LLMS）有效部署的需求引起了人们对量化的兴趣，从而降低了推理成本和参数有效的微调（PEFT），从而降低了培训开销。这促进了量化感知的PEFT的发展，以产生准确而有效的量化模型。在这种情况下，减少微调之前的量化误差对于实现高模型精度至关重要。但是，现有的依赖低级适应性的方法具有有限的代表性能力。最近，基于傅立叶相关的转换（FT）适配器比低级别适配器具有更大的代表性功能，但是将它们直接集成到量化模型中通常会导致降低误差降低和计算开销增加。为了克服这些局限性，我们提出了QWHA，该方法通过使用WALSH-HADAMARD变换（WHT）作为转换内核，将基于FT的适配器整合到量化模型中，并结合了自适应参数选择和价值改进的新型适配器初始化方案。我们证明QWHA在促进微调的同时有效地减轻了量化误差，并且其设计大大降低了计算成本。实验结果表明，QWHA始终以低位量化精度优于基准，并在现有基于FT的适配器上实现了显着的训练速度。该代码可在此HTTPS URL上找到。
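
Note on the Walsh-Hadamard Transform: the abstract names the WHT as the transform kernel of the adapter. The sketch below shows only the textbook fast WHT (unnormalized, natural Hadamard ordering) in plain Python, to illustrate why the kernel is cheap: it needs only additions and subtractions, in O(n log n) steps. It is not the QWHA adapter or its initialization scheme, and the function name `fwht` is a hypothetical choice for this illustration.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard ordering).

    The input length must be a power of two. Returns a new list and leaves
    the input unchanged. Only additions and subtractions are used.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart within each block of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

Applying the unnormalized transform twice returns the input scaled by n, so the inverse is the same routine followed by division by n.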

Title: MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses

Authors: Tong Chen, Zimu Wang, Yiyi Miao, Haoran Luo, Yuanfei Sun, Wei Wang, Zhengyong Jiang, Procheta Sen, Jionglong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17436
Pdf URL: https://arxiv.org/pdf/2509.17436
Copy Paste: [[2509.17436]] MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses(https://arxiv.org/abs/2509.17436)
Keywords: language model, llm
Abstract: Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at this https URL.
摘要：随着越来越多的人在线寻求医疗信息，医学事实检查变得越来越关键。但是，现有数据集主要集中在人类生成的内容上，从而使大型语言模型（LLMS）生成的内容的验证相对尚未探索。为了解决这一差距，我们引入了Medfact，这是LLM生成的医学内容的第一个基于证据的中国医学检查数据集。它由1,321个问题和7,409个主张组成，反映了现实世界中医学场景的复杂性。我们在秘密学习（ICL）和微调设置中进行了全面的实验，展示了当前LLM在此任务上的能力和挑战，并伴随着深入的错误分析，以指出未来研究的关键方向。我们的数据集可在此HTTPS URL上公开获得。

Title: GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, Yu Rong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17437
Pdf URL: https://arxiv.org/pdf/2509.17437
Copy Paste: [[2509.17437]] GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning(https://arxiv.org/abs/2509.17437)
Keywords: language model, llm
Abstract: Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
摘要：强化学习（RL）的最新进步增强了大语言模型（LLMS）的推理能力，但是对多模式LLMS（MLLM）的影响有限。尤其是在视力密集型任务（如几何推理）中，Mllms经常幻觉，导致推理不准确。我们将其归因于MLLM中的感知瓶颈，这限制了推理培训的好处。为了量化这一点，我们设计了一个地理知觉问题避开（GEOPQA）基准，以基本的几何概念和空间关系为目标。对GeoPQA的实验显示了视觉感知中MLLM的显着缺陷，这限制了RL奖励信号以进行有效的训练。为了解决这个瓶颈，我们首先提高了几何结构的视觉感知，然后促进推理能力，提出了一个两阶段的RL训练框架。与直接的推理训练方法相比，我们的两阶段训练应用于QWEN2.5-VL-3B-INSTRUCT，我们的两阶段训练将几何推理提高了9.7％，几何问题解决方法解决了9.1％。我们的方法还推广到其他视力密集型领域，例如人物理解，突出了感知基础在有效的MLLM推理中的重要性。

Title: Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system

Authors: Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, Eiji Aramaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17444
Pdf URL: https://arxiv.org/pdf/2509.17444
Copy Paste: [[2509.17444]] Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system(https://arxiv.org/abs/2509.17444)
Keywords: gpt, llm
Abstract: This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. While robust evaluation frameworks are crucial for the safe development of medical LLMs, resources in Japanese remain limited, often relying on translated multiple-choice questions. Our research addresses this gap by first establishing a performance baseline, applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate both a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we employ an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria, identifying "contextual gaps" where content is misaligned with Japan's clinical guidelines, healthcare systems, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches and a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification indicates that while the majority of scenarios are applicable, a substantial portion of the rubric criteria requires localization. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localized adaptation, a J-HealthBench, to ensure the reliable and safe evaluation of medical LLMs in Japan.
摘要：这项研究调查了HealthBench是一种大规模的，基于标语的医疗基准在日本背景下的适用性。尽管强大的评估框架对于医疗LLM的安全开发至关重要，但日本的资源仍然有限，通常依赖于翻译的多项选择问题。我们的研究通过首先建立绩效基线来解决这一差距，并应用了HealthBench的5,000个场景的机器翻译版本，以评估高性能的多语言模型（GPT-4.1）和日本本地的开源模型（LLM-JP-3.1）。其次，我们采用LLM-AS-A-Audge方法来系统地对基准的情景和标准进行分类，确定“上下文差距”，其中内容与日本的临床准则，医疗保健系统或文化规范未对齐。我们的发现表明，由于标语不匹配和日本本地模型的显着失败，GPT-4.1的性能下降了，这缺乏所需的临床完整性。此外，我们的分类表明，尽管大多数方案都是适用的，但大部分标准需要本地化。这项工作强调了直接基准翻译的局限性，并强调了迫切需要进行上下文感知的局部适应，即J-Healthbench，以确保对日本医学LLM的可靠和安全评估。

Title: Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks

Authors: Chaodong Tong, Qi Zhang, Lei Jiang, Yanbing Liu, Nannan Sun, Wei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17445
Pdf URL: https://arxiv.org/pdf/2509.17445
Copy Paste: [[2509.17445]] Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks(https://arxiv.org/abs/2509.17445)
Keywords: language model, llm, hallucination
Abstract: Reliable question answering with large language models (LLMs) is challenged by hallucinations, fluent but factually incorrect outputs arising from epistemic uncertainty. Existing entropy-based semantic-level uncertainty estimation methods are limited by sampling noise and unstable clustering of variable-length answers. We propose Semantic Reformulation Entropy (SRE), which improves uncertainty estimation in two ways. First, input-side semantic reformulations produce faithful paraphrases, expand the estimation space, and reduce biases from superficial decoder tendencies. Second, progressive, energy-based hybrid clustering stabilizes semantic grouping. Experiments on SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more robust and generalizable hallucination detection. These results demonstrate that combining input diversification with multi-signal clustering substantially enhances semantic-level uncertainty estimation.
摘要：通过大型语言模型（LLM）回答可靠的问题，这是由于幻觉，流利但事实不正确的产出而引起的，这是由认知不确定性引起的。现有的基于熵的语义级别不确定性估计方法受到采样噪声和可变长度答案的不稳定聚类的限制。我们提出了语义重新印度熵（SRE），该熵通过两种方式提高了不确定性估计。首先，输入方语义重新纠正产生忠实的解释，扩大估计空间并减少浅表解码器倾向的偏见。其次，渐进式的基于能量的混合聚类稳定了语义分组。对小队和Triviaqa的实验表明，SRE的表现优于强基础，提供了更健壮和可推广的幻觉检测。这些结果表明，将输入多样化与多信号聚类结合起来大大提高了语义级别的不确定性估计。

Title: SLAyiNG: Towards Queer Language Processing

Authors: Leonor Veloso, Lea Hirlimann, Philipp Wicke, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17449
Pdf URL: https://arxiv.org/pdf/2509.17449
Copy Paste: [[2509.17449]] SLAyiNG: Towards Queer Language Processing(https://arxiv.org/abs/2509.17449)
Keywords: llm
Abstract: Knowledge of slang is a desirable feature of LLMs in the context of user interaction, as slang often reflects an individual's social identity. Several works on informal language processing have defined and curated benchmarks for tasks such as detection and identification of slang. In this paper, we focus on queer slang. Queer slang can be mistakenly flagged as hate speech or can evoke negative responses from LLMs during user interaction. Research efforts so far have not focused explicitly on queer slang. In particular, detection and processing of queer slang have not been thoroughly evaluated due to the lack of a high-quality annotated benchmark. To address this gap, we curate SLAyiNG, the first dataset containing annotated queer slang derived from subtitles, social media posts, and podcasts, reflecting real-world usage. We describe our data curation process, including the collection of slang terms and definitions, scraping sources for examples that reflect usage of these terms, and our ongoing annotation process. As preliminary results, we calculate inter-annotator agreement for human annotators and OpenAI's model o3-mini, evaluating performance on the task of sense disambiguation. Reaching an average Krippendorff's alpha of 0.746, we argue that state-of-the-art reasoning models can serve as tools for pre-filtering, but the complex and often sensitive nature of queer language data requires expert and community-driven annotation efforts.
摘要：在用户互动的背景下，对语的知识是LLM的理想特征，因为语通常反映了个人的社会身份。关于非正式语言处理的几项作品已定义和策划的基准，用于诸如s语的检测和识别之类的任务。在本文中，我们专注于奇怪的语。酷儿s语可能被错误地标记为仇恨言论，也可以在用户互动期间唤起LLM的负面响应。到目前为止，研究工作尚未明确地集中在酷儿语上。特别是，由于缺乏高质量的注释基准，尚未对酷儿语的检测和处理进行彻底评估。为了解决这一差距，我们策划了杀戮，这是第一个包含从字幕，社交媒体帖子和播客的带注释的酷儿s语的数据集，反映了现实世界中的用法。我们描述了我们的数据策划过程，包括收集语术语和定义，刮擦来源的示例来反映这些术语的使用以及我们正在进行的注释过程。作为初步结果，我们计算人类注释者和OpenAI模型O3-Mini的通道间一致性，评估了理智歧义任务的绩效。我们认为，达到克里潘多夫的平均α为0.746，我们认为最先进的推理模型可以作为预过滤的工具，但是酷儿语言数据的复杂且通常是敏感的性质需要专家和社区驱动的注释工作。

Title: PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents

Authors: Namyoung Kim, Kai Tzu-iunn Ong, Yeonjun Hwang, Minseok Kang, Iiseo Jihn, Gayoung Kim, Minju Kim, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17459
Pdf URL: https://arxiv.org/pdf/2509.17459
Copy Paste: [[2509.17459]] PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents(https://arxiv.org/abs/2509.17459)
Keywords: language model, llm, agent
Abstract: Dialogue agents based on large language models (LLMs) have shown promising performance in proactive dialogue, which requires effective strategy planning. However, existing approaches to strategy planning for proactive dialogue face several limitations: limited strategy coverage, preference bias in planning, and reliance on costly additional training. To address these, we propose PRINCIPLES: a synthetic strategy memory for proactive dialogue agents. PRINCIPLES is derived through offline self-play simulations and serves as reusable knowledge that guides strategy planning during inference, eliminating the need for additional training and data annotation. We evaluate PRINCIPLES in both emotional support and persuasion domains, demonstrating consistent improvements over strong baselines. Furthermore, PRINCIPLES maintains its robustness across extended and more diverse evaluation settings. See our project page at this https URL.
摘要：基于大语言模型（LLM）的对话代理在主动对话中表现出了有希望的表现，这需要有效的策略计划。但是，现有的积极对话战略规划方法面临着几个局限性：策略覆盖范围有限，计划中的偏好偏见以及依赖昂贵的额外培训。为了解决这些问题，我们提出了原则：主动对话代理的合成策略记忆。原理是通过离线自我播放模拟得出的，并用作可重复使用的知识，可以指导推理过程中的策略计划，从而消除了对额外的培训和数据注释的需求。我们评估情感支持和说服力领域的原则，表明对强基础的一致改进。此外，原则在扩展和更多样化的评估环境中保持其稳健性。在此HTTPS URL上查看我们的项目页面。

Title: Diagnosing Model Editing via Knowledge Spectrum

Authors: Tsung-Hsuan Pan, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17482
Pdf URL: https://arxiv.org/pdf/2509.17482
Copy Paste: [[2509.17482]] Diagnosing Model Editing via Knowledge Spectrum(https://arxiv.org/abs/2509.17482)
Keywords: language model
Abstract: Model editing, the process of efficiently modifying factual knowledge in pre-trained language models, is critical for maintaining their accuracy and relevance. However, existing editing methods often introduce unintended side effects, degrading model performance in unpredictable ways. While much research has focused on improving editing algorithms, the role of the target knowledge's intrinsic properties remains a significant, underexplored factor. This paper addresses this gap by first proposing the "Knowledge Spectrum," a systematic framework for categorizing knowledge based on its real-world popularity, the model's pre-edit familiarity, and the linguistic structure of the eliciting question. Our empirical analysis reveals that these characteristics are strong predictors of editing success and stability. Informed by these findings, we introduce the "Knowledge-Diagnostic Framework," an adaptive strategy that tailors editing intensity to the diagnosed difficulty of a knowledge item. We demonstrate that this framework significantly improves success rates for challenging edits while optimizing computational resources. Our work provides a more comprehensive understanding of the factors governing model editing.
摘要：模型编辑是在预训练的语言模型中有效修改事实知识的过程，对于保持其准确性和相关性至关重要。但是，现有的编辑方法通常会以不可预测的方式引入意想不到的副作用，使模型性能降低。尽管许多研究重点是改善编辑算法，但目标知识的内在属性的作用仍然是一个重要的，毫无疑问的因素。本文通过首先提出``知识范围''来解决这一差距，这是一个系统的框架，用于基于其现实世界中的知名度，模型的编辑前的熟悉度以及引起问题的语言结构对知识进行分类。我们的经验分析表明，这些特征是编辑成功和稳定性的有力预测指标。在这些发现的情况下，我们介绍了``知识诊断框架''，这是一种适应性策略，该策略量身定制了对知识项目的诊断难度的编辑强度。我们证明，该框架可显着提高挑战性编辑的成功率，同时优化计算资源。我们的工作提供了对管理模型编辑因素的更全面的理解。

Title: AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation

Authors: Lvzhou Luo, Yixuan Cao, Ping Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17486
Pdf URL: https://arxiv.org/pdf/2509.17486
Copy Paste: [[2509.17486]] AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation(https://arxiv.org/abs/2509.17486)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation improves the factual accuracy of Large Language Models (LLMs) by incorporating external context, but often suffers from irrelevant retrieved content that hinders effectiveness. Context compression addresses this issue by filtering out irrelevant information from context before LLM generation. However, existing methods struggle to adaptively adjust compression rates for different contexts, maintain low latency, and integrate information across multiple documents. To overcome these limitations, we introduce AttnComp, an adaptive, efficient and context-aware compression framework. By leveraging the attention mechanism of LLMs to identify relevant information, AttnComp employs a Top-P compression algorithm to retain the minimal set of documents whose cumulative attention weights exceed a predefined threshold. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content, enabling users to gauge response reliability. Experiments demonstrate that AttnComp outperforms existing compression methods and uncompressed baselines, achieving higher accuracy with substantial compression rates and lower latency.
摘要：检索演示的一代通过纳入外部环境来提高大语言模型（LLM）的事实准确性，但通常会遭受无关紧要的检索内容，从而阻碍了有效性。上下文压缩通过在LLM生成之前从上下文中滤除无关的信息来解决此问题。但是，现有方法难以自适应调整不同上下文的压缩率，保持低延迟并整合多个文档的信息。为了克服这些局限性，我们介绍了ATTNCOMP，这是一个自适应，高效和上下文感知的压缩框架。通过利用LLMS的注意机制识别相关信息，ATTNCOMP采用了顶级压缩算法来保留最小文档集的累积注意权重超过预定义阈值的文档。除压缩外，ATTNCOMP还通过评估检索到的内容的整体相关性来估算响应信心，从而使用户能够衡量响应可靠性。实验表明，ATTNCOMP优于现有的压缩方法和未压缩的基准，通过实质性的压缩率和较低的潜伏度达到了更高的精度。
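
The Top-P retention rule described in the AttnComp abstract can be illustrated with a minimal sketch. It assumes each retrieved document already has a single non-negative relevance score (here a plain list, `doc_scores`); how AttnComp actually derives those scores from attention is not specified in the abstract, and the function name `top_p_select` and the threshold value are hypothetical.

```python
def top_p_select(doc_scores, p=0.9):
    """Return indices of the smallest set of documents whose normalized
    scores sum to more than p, taking highest-scoring documents first.

    doc_scores: assumed per-document relevance scores (non-negative floats).
    """
    total = sum(doc_scores)
    if total <= 0:
        return []  # nothing looks relevant, keep no documents
    # Rank document indices by score, highest first.
    ranked = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += doc_scores[i] / total
        if cumulative > p:
            break  # threshold exceeded: the retained set is now minimal
    return sorted(kept)  # restore original document order


scores = [0.05, 0.50, 0.10, 0.30, 0.05]  # illustrative values only
selected = top_p_select(scores, p=0.75)
print(selected)
```

Because the number of retained documents depends on how concentrated the scores are, the compression rate adapts to each query instead of being fixed in advance.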

Title: MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM

Authors: Woongkyu Lee, Junhee Cho, Jungwook Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17489
Pdf URL: https://arxiv.org/pdf/2509.17489
Copy Paste: [[2509.17489]] MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM(https://arxiv.org/abs/2509.17489)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have advanced code generation from single-function tasks to competitive-programming problems, but existing multi-agent solutions either rely on costly large-scale (>30B) models or collapse when downsized to small open-source models. We present MapCoder-Lite, which upgrades a single 7B model into four role-specialised agents (retriever, planner, coder, and debugger) using only rank-32, role-specific LoRA adapters (<3% extra parameters). Three lightweight techniques make this possible: (i) trajectory distillation from strong LLMs fixes format fragility in retrieval and debugging, (ii) supervisor-guided correction strengthens planning and coding agents, and (iii) agent-wise LoRA fine-tuning delivers memory-efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder-Lite more than doubles xCodeEval accuracy (from 13.2% to 28.3%), eliminates all format failures, and closes to within six points of a 32B baseline while cutting GPU memory and token-generation time by 4×. These results demonstrate that careful agent-wise fine-tuning unleashes high-quality multi-agent coding on a small language model.
摘要：大型语言模型（LLM）已将代码生成从单函数任务推进到竞赛编程问题，但现有的多智能体方案要么依赖昂贵的大规模（>30B）模型，要么在缩小到小型开源模型时失效。我们提出了MapCoder-Lite，它仅使用秩为32、角色专属的LoRA适配器（额外参数<3%），将单个7B模型升级为四个角色专门化的智能体：检索器、规划器、编码器和调试器。三种轻量级技术使之成为可能：（i）来自强LLM的轨迹蒸馏修复了检索和调试中的格式脆弱性，（ii）监督者引导的纠正增强了规划和编码智能体，（iii）逐智能体的LoRA微调实现了内存高效的专门化。在xCodeEval、APPS和CodeContests上的全面评估表明，MapCoder-Lite将xCodeEval准确率提高了一倍以上（从13.2%提升到28.3%），消除了所有格式错误，与32B基线的差距缩小到六个点以内，同时将GPU内存和token生成时间减少到四分之一。这些结果表明，细致的逐智能体微调能够在小型语言模型上释放高质量的多智能体编码能力。
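
To give a feel for why rank-32 adapters stay small, the following back-of-the-envelope sketch counts LoRA parameters. Everything except the rank (32), the number of roles (4), and the 7B base size is an assumption made for illustration: the hidden size, the layer count, and the choice of which weight matrices are adapted are not given in the abstract, so the resulting percentage is not the paper's figure.

```python
def lora_params(d_in, d_out, rank):
    # LoRA adds two low-rank factors per adapted matrix:
    # A with shape (rank, d_in) and B with shape (d_out, rank).
    return rank * (d_in + d_out)


HIDDEN = 4096                 # assumed hidden size
LAYERS = 32                   # assumed number of transformer layers
ADAPTED_PER_LAYER = 4         # assumed: four square attention projections per layer
BASE_PARAMS = 7_000_000_000   # 7B base model (from the abstract)
ROLES = 4                     # retriever, planner, coder, debugger (from the abstract)
RANK = 32                     # adapter rank (from the abstract)

per_role = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
total_extra = ROLES * per_role
fraction = total_extra / BASE_PARAMS
print(f"per role: {per_role:,}  all roles: {total_extra:,}  extra: {100 * fraction:.2f}%")
```

Under these assumed dimensions the four adapters together remain a small fraction of the base model; adapting more matrices per layer would raise the figure.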

Title: Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages

Authors: Wenhao Zhuang, Yuan Sun, Xiaobing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17493
Pdf URL: https://arxiv.org/pdf/2509.17493
Copy Paste: [[2509.17493]] Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages(https://arxiv.org/abs/2509.17493)
Keywords: language model, llm
Abstract: As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to effectively extend to low-resource languages, particularly those utilizing non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, there is currently no comprehensive framework for integrating transliteration into LLM training and deployment. Taking a pragmatic approach, this paper innovatively combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: Reduces storage requirements for low-resource language content, achieving up to 50% reduction in file size and 50-80% reduction in token count. 2) Accuracy: Guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: Eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: The framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model's capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at this https URL.
摘要：由于大型语言模型（LLM）接受了越来越多样化和广泛的多语言语料库的培训，因此它们表现出跨语性转移功能。但是，这些功能通常无法有效扩展到低资源语言，尤其是使用非拉丁脚本的语言。在将低资源语言译成拉丁文脚本的同时，目前缺乏将音译融合到LLMS培训和部署中的全面框架。采用务实的方法，本文创新了角色音译与霍夫曼编码的编码，以设计一个完整的音译框架。我们提出的框架提供了以下优点：1）压缩：减少低资源语言内容的存储要求，可减少50％的文件大小和减少50-80％的降低令牌计数。 2）准确性：保证100％从音译文本回到源语言的100％无损转换。 3）效率：消除对低资源语言的词汇扩展的需求，从而提高培训和推理效率。 4）可伸缩性：该框架可以扩展到其他低资源语言。我们在多个下游任务中验证框架的有效性，包括文本分类，机器阅读理解和机器翻译。实验结果表明，我们的方法显着增强了模型处理低资源语言的能力，同时维持高资源语言的性能。我们的数据和代码可在此HTTPS URL上公开获取。
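
The reversibility property claimed above follows from Huffman codes being prefix-free. The sketch below is a toy illustration only, not the paper's scheme: it assigns each source character a binary Huffman code spelled with two Latin letters (`a` and `b`, an arbitrary choice), so frequent characters get shorter Latin strings and decoding is exact. The function names and the sample string are hypothetical.

```python
import heapq
from collections import Counter


def build_codes(text, alphabet="ab"):
    """Build binary Huffman codes for the characters of `text`,
    written with two Latin letters instead of 0/1."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): alphabet[0]}
    # Heap items: (weight, unique tiebreak, {char: code suffix built so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        # Prepend one letter to every code in each merged subtree.
        merged = {ch: alphabet[0] + code for ch, code in left.items()}
        merged.update({ch: alphabet[1] + code for ch, code in right.items()})
        heapq.heappush(heap, (w0 + w1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(encoded, codes):
    # Prefix-free codes allow greedy left-to-right decoding.
    inverse = {code: ch for ch, code in codes.items()}
    out, buffer = [], ""
    for letter in encoded:
        buffer += letter
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)


sample = "абракадабра"  # arbitrary non-Latin sample string
codes = build_codes(sample)
latin = encode(sample, codes)
print(latin, decode(latin, codes) == sample)
```

A practical system would use a larger Latin alphabet and corpus-level frequencies; the two-letter alphabet here only keeps the example short.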

Title: CorefInst: Leveraging LLMs for Multilingual Coreference Resolution

Authors: Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17505
Pdf URL: https://arxiv.org/pdf/2509.17505
Copy Paste: [[2509.17505]] CorefInst: Leveraging LLMs for Multilingual Coreference Resolution(https://arxiv.org/abs/2509.17505)
Keywords: language model, llm
Abstract: Coreference Resolution (CR) is a crucial yet challenging task in natural language understanding, often constrained by task-specific architectures and encoder-based language models that demand extensive training and lack adaptability. This study introduces the first multilingual CR methodology which leverages decoder-only LLMs to handle both overt and zero mentions. The article explores how to model the CR task for LLMs via five different instruction sets using a controlled inference method. The approach is evaluated across three LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when instruction-tuned with a suitable instruction set, can surpass state-of-the-art task-specific architectures. Specifically, our best model, a fully fine-tuned Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model (i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages in the CorefUD v1.2 dataset collection.
摘要：核心分辨率（CR）是自然语言理解中的一项至关重要但又具有挑战性的任务，通常受到特定于任务的体系结构和基于编码器的语言模型的约束，这些模型需要广泛的培训和缺乏适应性。这项研究介绍了第一个利用仅解码器LLM的多语言CR方法来处理公开和零提及。本文探讨了如何使用受控推理方法通过五个不同的指令集对LLM的CR任务进行建模。该方法在三个LLM上进行了评估； Llama 3.1，Gemma 2和Mistral 0.3。结果表明，在使用合适的指令集进行指令时，LLM可以超越特定于任务的架构。具体而言，我们的最佳模型是多语言CR的完全微调的Llama 3.1，在Corefud v1.2数据集收集中的所有语言中，平均所有语言中的所有语言平均要优于领先的多语言CR模型（即Corpipe 24单阶段变体）。

Title: Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A.Subramanian, Alvin Chan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17552
Pdf URL: https://arxiv.org/pdf/2509.17552
Copy Paste: [[2509.17552]] Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning(https://arxiv.org/abs/2509.17552)
Keywords: language model, llm
Abstract: The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
摘要：大型语言模型（LLM）的显着性能可以通过测试时间计算来增强，该计算依赖于外部工具甚至其他深度学习模型。但是，现有的将非文本模式表示形式集成到LLM的方法通常需要其他昂贵的监督培训，从而限制了对新领域和方式的即时适应。在这项工作中，我们探讨了以无培训方式将非文本基础模型（FMS）整合到基于文本的LLM中的可行性。我们建议在概念概念上提出内部文化表示学习（ICRL），以允许LLMS自适应地利用很少学习的非文本模式表示。与传统的文本格学习不同，ICRL用FM表示代替文本输入，从而使LLM能够执行多模式推断而无需微调。我们在分子结构域中的一系列任务上评估ICRL，研究了三个核心研究问题：（i）如何以无训练的方式将FM表示形式映射到LLMS，（ii）哪些因素影响ICRL性能，以及（iii）哪些机制是ICRL有效性的基础。据我们所知，ICRL是第一个将非文本模式表示形式集成到基于文本的LLM中的第一个无培训框架，为适应性，多模式的概括提供了有希望的方向。

Title: Specification-Aware Machine Translation and Evaluation for Purpose Alignment

Authors: Yoko Kayano, Saku Sugawara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17559
Pdf URL: https://arxiv.org/pdf/2509.17559
Copy Paste: [[2509.17559]] Specification-Aware Machine Translation and Evaluation for Purpose Alignment(https://arxiv.org/abs/2509.17559)
Keywords: language model, llm, prompt
Abstract: In professional settings, translation is guided by communicative goals and client needs, often formalized as specifications. While existing evaluation frameworks acknowledge the importance of such specifications, these specifications are often treated only implicitly in machine translation (MT) research. Drawing on translation studies, we provide a theoretical rationale for why specifications matter in professional translation, as well as a practical guide to implementing specification-aware MT and evaluation. Building on this foundation, we apply our framework to the translation of investor relations texts from 33 publicly listed companies. In our experiment, we compare five translation types, including official human translations and prompt-based outputs from large language models (LLMs), using expert error analysis, user preference rankings, and an automatic metric. The results show that LLM translations guided by specifications consistently outperformed official human translations in human evaluations, highlighting a gap between perceived and expected quality. These findings demonstrate that integrating specifications into MT workflows, with human oversight, can improve translation quality in ways aligned with professional practice.
摘要：在专业环境中，翻译以交流目标和客户需求为指导，通常是正式的。尽管现有的评估框架承认此类规格的重要性，但这些规格通常仅在机器翻译（MT）研究中隐含地处理。利用翻译研究，我们为为什么规范在专业翻译中至关重要，以及实施规范意识和评估的实用指南提供了理论上的理由。在此基金会的基础上，我们将框架应用于来自33家公开上市公司的投资者关系文本的翻译。在我们的实验中，我们比较了五种翻译类型，包括使用专家错误分析，用户偏好排名和自动指标的官方人类翻译和基于迅速的输出。结果表明，以规格为指导的LLM翻译在人类评估中始终优于官方人类翻译，突出了感知质量和期望质量之间的差距。这些发现表明，将规格与人类的监督相结合到MT工作流程，可以以与专业实践一致的方式提高翻译质量。

Title: Asking a Language Model for Diverse Responses

Authors: Sergey Troshin, Irina Saparina, Antske Fokkens, Vlad Niculae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17570
Pdf URL: https://arxiv.org/pdf/2509.17570
Copy Paste: [[2509.17570]] Asking a Language Model for Diverse Responses(https://arxiv.org/abs/2509.17570)
Keywords: language model
Abstract: Large language models increasingly rely on explicit reasoning chains and can produce multiple plausible responses for a given context. We study the candidate sampler that produces the set of plausible responses, contrasting ancestral (parallel) sampling with two alternatives: enumeration, which asks the model to produce $n$ candidates in one pass, and iterative sampling, which proposes candidates sequentially while conditioning on the currently generated response set. Under matched budgets, we compare these samplers on quality, lexical and computation flow diversity, and efficiency. Our empirical results demonstrate that enumeration and iterative strategies result in higher diversity at comparable quality. Our findings highlight the potential of simple non-independent sampling strategies to improve response diversity without sacrificing generation quality.
摘要：大型语言模型越来越依赖于明确的推理链，可以在给定情况下产生多个合理的响应。我们研究了产生一组合理响应的候选抽样器，将祖先（平行）采样与两种替代方案进行了对比：枚举：该枚举要求该模型在一次通过中产生$ n $候选者，并迭代采样，并提出候选者在当前生成的响应集对候选者中进行顺序调节。在匹配的预算下，我们将这些采样器与质量，词汇和计算流量多样性以及效率进行比较。我们的经验结果表明，枚举和迭代策略在质量可比的情况下导致更高的多样性。我们的发现突出了简单的非独立抽样策略的潜力，即不牺牲发电质量而改善响应多样性。
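
The three candidate samplers contrasted in this abstract differ mainly in how the model is called. The sketch below shows the call patterns only, assuming a hypothetical `model(request) -> str` callable; the prompt wording is invented for illustration, and `toy_model` is a stand-in stub rather than a real language model.

```python
import random


def parallel_sampling(model, prompt, n):
    # Ancestral (parallel) sampling: n independent calls with the same prompt.
    return [model(prompt) for _ in range(n)]


def enumeration_sampling(model, prompt, n):
    # Enumeration: one call that asks for n candidates in a single pass.
    request = f"{prompt}\nGive {n} different answers, one per line."
    lines = [line for line in model(request).splitlines() if line.strip()]
    return lines[:n]


def iterative_sampling(model, prompt, n):
    # Iterative: each call is conditioned on the responses generated so far.
    responses = []
    for _ in range(n):
        if responses:
            seen = "\n".join(f"- {r}" for r in responses)
            request = f"{prompt}\nAlready proposed:\n{seen}\nPropose a different answer."
        else:
            request = prompt
        responses.append(model(request))
    return responses


# Stand-in stub so the sketch runs without a real model (hypothetical behaviour).
POOL = ["red", "green", "blue", "yellow"]
_rng = random.Random(0)


def toy_model(request):
    if "one per line" in request:
        return "\n".join(POOL)
    if "Already proposed" in request:
        for answer in POOL:
            if f"- {answer}" not in request:
                return answer
        return POOL[0]
    return _rng.choice(POOL)


prompt = "Name a colour."
print(parallel_sampling(toy_model, prompt, 3))
print(enumeration_sampling(toy_model, prompt, 3))
print(iterative_sampling(toy_model, prompt, 3))
```

Only the parallel sampler draws candidates independently, which is why it can repeat itself; the other two let later candidates depend on earlier ones.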

Title: MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

Authors: Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17628
Pdf URL: https://arxiv.org/pdf/2509.17628
Copy Paste: [[2509.17628]] MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents(https://arxiv.org/abs/2509.17628)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models' abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models' robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at this https URL.
摘要：大型语言模型（LLM）在单个领域内的提问（QA）任务上表现出色。但是，它们在复杂，多阶段的场景中的推理和协调能力仍然没有被逐渐消失。现有基准通常专注于孤立的任务或狭窄的域，忽略了模型的多阶段协作和优化能力，而无需明确的外部指导。为了弥合这一差距，我们提出了\ textbf {mscore}，这是一种新型基准，其中包括126696域特异性质量质量质量标准，涵盖了汽车，药物，电子，电子和能源领域的方案。该数据集是使用结构化的三相管道创建的：动态采样，迭代问题 - 答案生成和多层质量评估，以确保数据质量。根据阶段的覆盖范围和复杂性，将任务进一步分为三个难度水平。使用MSCORE，我们对各种最先进的LLM代理进行了全面评估。商业模型在所有任务和场景中都表现最好，但是在简单和复杂的任务之间，Rouge分数的显着差距仍然存在。我们还测试了模型的鲁棒性，发现它们的性能受到嘈杂数据的负面影响。 MSCORE为社区提供了一个宝贵的新资源，以评估和改善LLM代理商的多阶段推理。该代码和数据可在此HTTPS URL上找到。

Title: AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Authors: Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee
Subjects: cs.CL, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2509.17641
Pdf URL: https://arxiv.org/pdf/2509.17641
Copy Paste: [[2509.17641]] AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?(https://arxiv.org/abs/2509.17641)
Keywords: language model, llm
Abstract: Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at this https URL.
摘要：即使没有直接听到的声音，人类也可以轻松地理解听觉，例如音高，响度或声音源性关联，并利用听觉常识。相比之下，语言模型通常缺乏这种能力，从而限制了它们在多模式相互作用中的有效性。作为解决此差距的第一步，我们提出了AuditorityBench ++，这是评估仅文本设置中听觉知识和推理的综合基准。基准包括从基本听觉比较到上下文基础推理的任务，从而可以对模型处理和整合听觉概念的细粒度分析。此外，我们介绍了Air-Cot，这是一种新型的听觉想象推理方法，该方法在推理过程中通过特殊令牌和知识注入生成和集成了听觉信息。对最近的LLM和多模式LLM的广泛实验表明，空气盘通常优于现成的模型，并且具有听觉知识增强的模型。该项目页面可在此HTTPS URL上找到。

Title: PG-CE: A Progressive Generation Dataset with Constraint Enhancement for Controllable Text Generation

Authors: Yan Zhuang, Yuan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17669
Pdf URL: https://arxiv.org/pdf/2509.17669
Copy Paste: [[2509.17669]] PG-CE: A Progressive Generation Dataset with Constraint Enhancement for Controllable Text Generation(https://arxiv.org/abs/2509.17669)
Keywords: language model, llm
Abstract: With the rapid development of Large Language Models (LLMs), Controllable Text Generation (CTG) has become a critical technology for enhancing system reliability and user experience. Addressing the limitations of traditional methods, this paper proposes the PG-CE (Progressive Generation with Constraint Enhancement) approach, which decomposes CTG tasks into three steps: type prediction, constraint construction, and guided generation. This method employs constraint generation models to dynamically build multi-dimensional constraints including tone, expression style, and thematic focus to guide output. Experiments demonstrate that PG-CE significantly improves generation quality across multiple scenarios while maintaining text controllability, thematic relevance, and response practicality. The research developed a dataset containing 90,000 constraint-text pairs (with an 8:2 ratio between daily and other topics), effectively reflecting real-world application requirements.
摘要：随着大型语言模型（LLM）的快速发展，可控文本生成（CTG）已成为增强系统可靠性和用户体验的关键技术。解决传统方法的局限性，本文提出了PG-CE（具有约束增强的渐进生成）方法，该方法将CTG任务分解为三个步骤：类型预测，约束结构和指导生成。该方法采用约束生成模型来动态构建多维约束，包括音调，表达方式和主题焦点来指导输出。实验表明，PG-CE在维持文本可控性，主题相关性和响应实用性的同时显着提高了多种情况的发电质量。该研究开发了一个包含90,000个约束文本对的数据集（每日和其他主题之间的比率为8：2），有效地反映了现实世界的应用要求。

Title: Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications

Authors: Selva Taş, Mahmut El Huseyni, Özay Ezerceli, Reyhan Bayraktar, Fatma Betül Terzioğlu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17671
Pdf URL: https://arxiv.org/pdf/2509.17671
Copy Paste: [[2509.17671]] Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications(https://arxiv.org/abs/2509.17671)
Keywords: language model, llm, long context, hallucination, retrieval-augmented generation
Abstract: The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. Building on the LettuceDetect framework, we formulate hallucination detection as a token-level classification task and fine-tune three distinct encoder architectures: a Turkish-specific ModernBERT, TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks. The models maintain computational efficiency while supporting long contexts up to 8,192 tokens, making them suitable for real-time deployment. Comparative analysis reveals that while state-of-the-art LLMs demonstrate high recall, they suffer from low precision due to over-generation of hallucinated content, underscoring the necessity of specialized detection mechanisms. By releasing our models and translated dataset, this work addresses a critical gap in multilingual NLP and establishes a foundation for developing more reliable and trustworthy AI applications for Turkish and other languages.
摘要：大型语言模型（LLM）的广泛采用受到了幻觉的趋势的阻碍，从而产生了合理但实际上不正确的信息。虽然检索增强的一代（RAG）系统试图通过基于外部知识的响应来解决这个问题，但幻觉仍然是一个持续的挑战，尤其是对于像土耳其这样的形态复杂，低资源的语言。本文介绍了Turk-LetTuceStect，这是专门为土耳其破布应用设计的第一套幻觉检测模型。在LetTuceStect框架的基础上，我们将幻觉检测作为令牌级别的分类任务和微调三个不同的编码器建筑：土耳其特异性的现代伯特，Turkembed4ST和多语言Eurobert。对这些模型进行了对Ragtruth基准数据集的机器翻译版本的培训，该版本包含17,790个实例，跨问答，数据到文本生成和摘要任务。我们的实验结果表明，现代基于现代的模型在完整的测试集上达到了0.7266的F1分数，在结构化任务上的性能尤其强劲。这些模型维持计算效率，同时支持高达8,192个令牌的长篇小说，使其适合实时部署。比较分析表明，尽管最先进的LLMS表现出很高的召回，但由于幻觉含量过度生成，它们的精度较低，强调了专业检测机制的必要性。通过发布我们的模型并翻译数据集，这项工作解决了多语言NLP的关键差距，并为为土耳其语和其他语言开发更可靠和可信赖的AI应用程序建立了基础。
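
Since the abstract frames hallucination detection as token-level classification and contrasts recall with precision, a small sketch of how precision, recall, and F1 relate may help. It assumes binary per-token labels (1 = hallucinated, 0 = supported); the label sequences are made up, and the abstract does not state at which granularity the reported F1-score of 0.7266 was computed.

```python
def token_prf(gold, pred):
    """Precision, recall and F1 for the positive (hallucinated = 1) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [0, 0, 1, 1, 0, 1]          # illustrative gold token labels
balanced = [0, 1, 1, 1, 0, 0]      # a detector with one miss and one false alarm
flag_all = [1, 1, 1, 1, 1, 1]      # a detector that flags every token

print(token_prf(gold, balanced))
print(token_prf(gold, flag_all))
```

The second detector reaches perfect recall but only 50% precision, the same high-recall, low-precision pattern the abstract attributes to general-purpose LLMs.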

Title: When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

Authors: Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17680
Pdf URL: https://arxiv.org/pdf/2509.17680
Copy Paste: [[2509.17680]] When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables(https://arxiv.org/abs/2509.17680)
Keywords: language model, llm
Abstract: Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.
摘要：表问题回答（TableQA）是自然语言处理（NLP）的基本任务。大语言模型（LLM）的强大推理能力在这一领域带来了重大进步。但是，由于现实世界中的应用程序涉及日益复杂的问题和更大的表，因此引入了大量嘈杂的数据，这严重降低了推理性能。为了应对这一挑战，我们专注于提高两个核心功能：相关性过滤，它标识并保留了与推理和桌子修剪真正相关的信息，从而降低了表尺寸的同时保留基本内容。基于这些原则，我们提出了Enotab，这是一个双重denoising框架，用于复杂问题和大规模表。具体而言，我们首先通过将问题分解为最小的语义单元，并根据一致性和可用性标准来回答推理来实现基于证据的问题。然后，我们提出了证据树引导的表denoising，该表构建了明确透明的表修剪路径，以逐步删除无关的数据。在每个修剪步骤中，我们都会观察表的中间状态，并应用后阶节点回滚机制来处理异常表状态，最终产生一个高度可靠的子桌子来最终答案推理。最后，广泛的实验表明，Enotab通过复杂的问题和大规模表达到了TableQA任务上的出色表现，从而确认了其有效性。

Title: Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues

Authors: Dongxu Lu, Johan Jeuring, Albert Gatt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17694
Pdf URL: https://arxiv.org/pdf/2509.17694
Copy Paste: [[2509.17694]] Evaluating LLM-Generated Versus Human-Authored Responses in Role-Play Dialogues(https://arxiv.org/abs/2509.17694)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation ($N=38$) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where Gemini 2.0 Flash achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.
摘要：通过长形式，知识接收的角色扮演对话评估大语言模型（LLM）仍然具有挑战性。这项研究比较了通过人类评估（$ n = 38 $）和自动化的LLM-AS-A-A-a-a-Gudge评估，比较了LLM生成的和人为的反应。人类评估表明，LLM生成的响应质量在跨回合，尤其是自然性，上下文维护和整体质量方面显着降解，而人为作者的反应逐渐改善。与这一发现一致，参与者还表明，人们对人类作者的对话保持持续的偏爱。这些人类的判断是通过我们自动化的LLM-AS-A-A-A-Gudge评估来验证的，Gemini 2.0 Flash与人类评估者在零拍的成对偏好和随机6-Shot构造等级方面达到了强烈的对准，这证实了LLM和人类反应之间的扩大质量差距，并随着时间的推移而扩大。我们的工作贡献了一个多转弯基准，以揭示知识吸引的角色扮演对话中LLM退化，并提供了经过验证的混合评估框架，以指导LLM在培训模拟中的可靠集成。

Title: Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs

Authors: Mariam Mahran, Katharina Simbeck
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17701
Pdf URL: https://arxiv.org/pdf/2509.17701
Copy Paste: [[2509.17701]] Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs(https://arxiv.org/abs/2509.17701)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent gap, with English solutions consistently rated highest, and Arabic often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
摘要：大型语言模型（LLM）越来越多地用于教育支持，但它们的响应质量因互动语言而异。本文提出了一条自动多语言管道，用于生成，解决和评估与德语K-10课程一致的数学问题。我们产生了628次数学练习，并将其转化为英语，德语和阿拉伯语。提示了三个商业LLM（GPT-4O-Mini，Gemini 2.5 Flash和Qwen-Plus），以每种语言的逐步解决方案生成。包括Claude 3.5 Haiku在内的LLM法官持有的面板，使用比较框架评估了解决方案质量。结果表明，英语解决方案始终如一，阿拉伯语通常排名较低。这些发现突出了持续存在的语言偏见，以及对教育中更公平的多语言AI系统的需求。

Title: Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics

Authors: Kavin R V, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17737
Pdf URL: https://arxiv.org/pdf/2509.17737
Copy Paste: [[2509.17737]] Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics(https://arxiv.org/abs/2509.17737)
Keywords: language model
Abstract: Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA), as well as a biomedical domain-specific benchmark (BC5CDR) using BioBERT. Our findings demonstrate that representing tokens compositionally via ASG achieves extreme compression in embedding parameters (0.4-0.5%) while maintaining >95% task performance relative to the base model, even in generative tasks, and that the approach extends to both cross-lingual transfer and domain-specific settings. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a simple yet concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling compact yet semantically rich models.
摘要：标准语言模型对每个令牌采用独特的整体嵌入，可能限制其捕获单词含义的多方面性质的能力。我们研究是否可以通过积累各种语义方面的组成结构来更有效地表示令牌。为了探讨这一点，我们提出了一种利用产品量化（PQ）的新方法（ASG）的汇总语义分组（ASG）。我们将ASG应用于标准变压器体系结构（Mbert，XLM-R，MT5），并在不同任务（NLI，NER，QA）以及使用Biobert的生物医学领域特异性基准（BC5CDR）中评估了这种代表性方案。我们的发现表明，通过ASG代表代币的组合在嵌入参数（0.4---0.5 \％）中实现极端压缩，同时保持$> $> $ 95 \％的任务性能相对于基本模型，即使在生成任务中，并且扩展到交叉lingual传递和域特异性设置。这些结果证明了可以有效地将令牌建模为共享语义构建块的组合的原则。 ASG提供了一种简单而具体的方法来实现这一目标，展示了构图表示如何捕获语言丰富性的同时，同时可以促进紧凑而富有语义丰富的模型。
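
The idea of composing token embeddings from shared building blocks can be sketched in the style of Product Quantization: the embedding is split into groups, each group has a small shared codebook, and a token is stored as one code index per group. This is a minimal illustration under assumed sizes, not the ASG implementation; the codebooks here are random, whereas PQ would normally learn them (for example by clustering sub-vectors of pretrained embeddings), and all names and numbers are hypothetical.

```python
import random


def make_codebooks(num_groups, codes_per_group, sub_dim, seed=0):
    """One shared codebook per group: codes_per_group vectors of length sub_dim."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(sub_dim)]
             for _ in range(codes_per_group)]
            for _ in range(num_groups)]


def embed(token_codes, codebooks):
    """Build a token embedding by concatenating one shared sub-vector per group."""
    vector = []
    for group, code in enumerate(token_codes):
        vector.extend(codebooks[group][code])
    return vector


def param_ratio(vocab, dim, num_groups, codes_per_group):
    """Codebook floats divided by the floats of a dense vocab x dim table.
    The integer code indices stored per token are not counted here."""
    sub_dim = dim // num_groups
    shared = num_groups * codes_per_group * sub_dim
    return shared / (vocab * dim)


codebooks = make_codebooks(num_groups=4, codes_per_group=3, sub_dim=2)
token_codes = {"cat": (0, 2, 1, 1), "dog": (0, 2, 1, 0)}  # made-up assignments
cat_vec = embed(token_codes["cat"], codebooks)
dog_vec = embed(token_codes["dog"], codebooks)
print(len(cat_vec), cat_vec[:6] == dog_vec[:6])

# Illustrative sizes only; not the configuration used in the paper.
ratio = param_ratio(vocab=120_000, dim=768, num_groups=16, codes_per_group=256)
print(f"{100 * ratio:.2f}% of the dense embedding parameters")
```

Tokens that share codes in a group share those embedding components exactly, which is what lets the total parameter count depend on the codebook size rather than the vocabulary size.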

Title: Qwen3-Omni Technical Report

Authors: Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
Subjects: cs.CL, cs.AI, cs.CV, eess.AS
Abstract URL: https://arxiv.org/abs/2509.17765
Pdf URL: https://arxiv.org/pdf/2509.17765
Copy Paste: [[2509.17765]] Qwen3-Omni Technical Report(https://arxiv.org/abs/2509.17765)
Keywords: gpt, hallucination
Abstract: We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
摘要：我们提出了QWEN3-OMNI，这是一种单模型模型，该模型首次保持跨文本，图像，音频和视频的最先进性能，而没有任何相对于单模式对应物的降级。 QWEN3-OMNI匹配QWEN系列中同一大小的单模模型的性能，并且特别在音频任务上擅长。在36个音频和视听基准中，QWEN3-OMNI在32个基准和22个基准的开源SOTA上实现了22个基准SOTA，超过了强大的闭合源模型，例如Gemini-2.5-Pro，Seed-Asr和GPT-4O-4O-Transcribe。 Qwen3-Omni采用了一个思想家 - 谈话师的MUE架构，该体系结构在文本，图像，音频和视频中统一感知和发电，产生流利的文本和自然的实时演讲。它支持119种语言的文本互动，19种语言的语音理解以及10种语言的语音生成。为了减少流媒体合成中的第一包延迟，使用多编码书方案可以自动加入对离散的语音编解码器进行预测。利用这些代码簿的代表性能力，我们用轻量级的因果转换替换了计算密集的块范围扩散，从而从第一个编解码器框架启用了流。在冷启动设置中，Qwen3-omni达到了234毫秒的理论端到端第一包延迟。为了进一步加强多模式推理，我们介绍了一个思维模型，该模型明确地理解了任何模式的输入。由于目前的研究界缺乏通用音频字幕模型，因此我们对QWEN3-OMNI-30B-A3B进行了微调，以获取QWEN3-OMNI-30B-A3B接收器，该QWEN3-OMNI-30B-A3B接收器为任意音频输入提供了详细的，低悬浮的字幕。 QWEN3-OMNI-30B-A3B，QWEN3-OMNI-30B-A3B思考和QWEN3-OMNI-30B-A3B-CAPTIONER根据Apache 2.0许可证公开发布。

Title: A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

Authors: Ziyi Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17766
Pdf URL: https://arxiv.org/pdf/2509.17766
Copy Paste: [[2509.17766]] A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue(https://arxiv.org/abs/2509.17766)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes "State Reconstruction" and "History Remind" mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.
摘要：大型语言模型（LLMS）与信息忘记和长途对话中的信息遗忘和效率低下的斗争。为了解决这个问题，我们提出了一种无培训的及时工程方法，即国家更新的多转向对话策略。它利用“状态重建”和“历史提醒”机制来有效地管理对话历史。我们的策略在多个多跳QA数据集中表现出强大的性能。例如，在HOTPOTQA数据集上，它将核心信息过滤分数提高了32.6％，导致下游QA得分增加了14.1％，同时也将推理时间降低了73.1％，而代币消耗量则增加了59.4％。消融研究证实了这两个组成部分的关键作用。我们的工作为在长期互动中优化LLM提供了有效的解决方案，为开发更健壮的代理提供了新的见解。

Title: One Agent to Serve All: a Lite-Adaptive Stylized AI Assistant for Millions of Multi-Style Official Accounts

Authors: Xingyu Fan, Feifei Li, Wenhui Que, Hailong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17788
Pdf URL: https://arxiv.org/pdf/2509.17788
Copy Paste: [[2509.17788]] One Agent to Serve All: a Lite-Adaptive Stylized AI Assistant for Millions of Multi-Style Official Accounts(https://arxiv.org/abs/2509.17788)
Keywords: prompt, chain-of-thought, agent
Abstract: Conversational agents deployed in industrial-scale official account platforms must generate responses that are both contextually grounded and stylistically aligned, requirements that existing methods struggle to meet. Chain-of-thought (CoT) prompting induces significant latency due to multi-turn reasoning; per-account fine-tuning is computationally prohibitive; and long prompt-based methods degrade the model's ability to grasp injected context and style. In this paper, we propose WeStar, a lite-adaptive framework for stylized contextual question answering that scales to millions of official accounts. WeStar combines context-grounded generation via RAG with style-aware generation using Parametric RAG (PRAG), where LoRA modules are dynamically activated per style cluster. Our contributions are fourfold: (1) We introduce WeStar, a unified framework capable of serving large volumes of official accounts with minimal overhead. (2) We propose a multi-dimensional, cluster-based parameter sharing scheme that enables compact style representation while preserving stylistic diversity. (3) We develop a style-enhanced Direct Preference Optimization (SeDPO) method to optimize each style cluster's parameters for improved generation quality. (4) Experiments on a large-scale industrial dataset validate the effectiveness and efficiency of WeStar, underscoring its practical value in real-world deployment.
摘要：在工业规模的官方帐户平台中部署的对话代理必须产生响应，这些响应既是上下文基础，又是风格上的一致性，现有方法难以满足。经过思考链（COT）促使由于多转弯推理引起的显着潜伏期；人口微调在计算上是过度的；基于较长的及时及时的方法降低了模型掌握注入上下文和样式的能力。在本文中，我们提出了Westar，这是一个轻巧的自适应框架，用于定型的上下文问题，将其扩展到数百万个官方帐户。韦斯塔尔（Westar）使用参数抹布（布拉格）结合了通过抹布的上下文生成与样式吸引生成的生成，其中洛拉模块是每个样式群集动态激活的。我们的贡献是四倍：（1）我们介绍了Westar，这是一个统一的框架，能够为大量的官方帐户提供最少的开销。（2）我们提出了一种基于群集的参数共享方案，该方案可以实现紧凑的样式表示，同时保持风格多样性。（3）我们开发了一种样式增强的直接优先优化（SEDPO）方法，以优化每个样式群集的参数以提高生成质量。（4）对大规模工业数据集进行的实验验证了韦斯塔尔的有效性和效率，强调了其在现实部署中的准确价值。

Title: Learning to vary: Teaching LMs to reproduce human linguistic variability in next-word prediction

Authors: Tobias Groot, Salo Lacunes, Evgenia Ilia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17794
Pdf URL: https://arxiv.org/pdf/2509.17794
Copy Paste: [[2509.17794]] Learning to vary: Teaching LMs to reproduce human linguistic variability in next-word prediction(https://arxiv.org/abs/2509.17794)
Keywords: language model, gpt
Abstract: Natural language generation (NLG) tasks are often subject to inherent variability; e.g., predicting the next word given a context has multiple valid responses, evident when asking multiple humans to complete the task. While having language models (LMs) that are aligned pluralistically, so that they are able to reproduce well the inherent diversity in perspectives of an entire population of interest, is clearly beneficial, Ilia and Aziz (2024) show that LMs do not reproduce this type of linguistic variability well. They speculate this inability might stem from the lack of consistent training of LMs with data reflecting this type of inherent variability. As such, we investigate whether training LMs on multiple plausible word continuations per context can improve their ability to reproduce human linguistic variability for next-word prediction. We employ fine-tuning techniques for pre-trained and instruction-tuned models and demonstrate their potential when fine-tuning GPT-2 and Mistral-7B-IT, using Provo Corpus. Our evaluation, which measures divergence among empirically estimated human and model next-word distributions across contexts before and after fine-tuning, shows that our multi-label fine-tuning improves the LMs' ability to reproduce linguistic variability, both for contexts that admit higher and lower variability.
摘要：自然语言生成（NLG）任务通常会遭受固有的可变性； \ emph {例如，预测给定上下文的下一个单词具有多个有效响应，在要求多个人完成任务时很明显。尽管具有多元化对齐的语言模型（LMS），以便他们能够很好地再现整个感兴趣的观点的固有多样性，这显然是有益的，但\ citet {ilia2024predict}表明，LMS并不能很好地再现这种语言可变性。他们推测这种无能可能是由于缺乏对LMS的一致培训，而数据反映了这种固有的可变性。因此，我们调查了对每个上下文的多个合理词连续训练LMS是否可以提高其重现人类语言可变性对下一词预测的能力。我们为预训练和指导调节的模型采用微调技术；并在使用Provo语料库微调GPT-2和Mismtral-7b-IT时证明其潜力。我们的评估衡量了在微调之前和之后跨凭经验估计的人类和模型的下一词分布之间的差异，表明我们的多标签微调提高了LMS再现语言变异性的能力；这两种情况都可以接受较高和较低的可变性。

Title: Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?

Authors: Michal Novák, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17796
Pdf URL: https://arxiv.org/pdf/2509.17796
Copy Paste: [[2509.17796]] Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?(https://arxiv.org/abs/2509.17796)
Keywords: language model, llm
Abstract: The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference. A key innovation of this year's task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation. The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD - a harmonized multilingual collection of 22 datasets in 17 languages. In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems still kept the lead, LLMs showed clear potential, suggesting they may soon challenge established approaches in future editions.
摘要：本文概述了关于多语言核心分辨率共享任务的第四版，该任务是Codi-Crac 2025研讨会的一部分。与以前的版本一样，参与者受到挑战，要开发系统，以确定提及并根据身份核心重新聚集。对今年任务的一个关键创新是引入专用的大语言模型（LLM）曲目，其特征是简化的明文格式，旨在比原始的Conll-U代表更适合LLM。该任务还使用Corefud版本1.3（用17种语言的22个数据集的1.3版）使用了另外两种语言的三个新数据集扩展了其覆盖范围。总共参加了九个系统，其中包括四种基于LLM的方法（两种微调和两种使用少射击适应）。尽管传统系统仍保持领先地位，但LLMS表现出明显的潜力，这表明它们可能很快挑战了未来版本中建立的方法。

Title: Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora

Authors: Robert Litschko, Verena Blaschke, Diana Burkhardt, Barbara Plank, Diego Frassinelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17855
Pdf URL: https://arxiv.org/pdf/2509.17855
Copy Paste: [[2509.17855]] Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora(https://arxiv.org/abs/2509.17855)
Keywords: language model, llm
Abstract: Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.
摘要：由于缺乏标准拼字法，方言表现出很大程度的变化。同时，大型语言模型（LLM）处理方言的能力在很大程度上仍在研究中。为了解决这一差距，我们将巴伐利亚用作案例研究，并通过检查他们在不同词性各个部分的识别和翻译方言术语的识别和翻译方言术语方面来研究LLM的词汇方言理解能力。为此，我们介绍了Diolemma，这是一个新的注释框架，用于创建单语言数据的方言变化词典，并使用它来编译由100K人类通知的德国 - 巴拉维安单词对组成的地面真相数据集。我们评估了九个最先进的LLM可以判断巴伐利亚语作为方言翻译，变种或无关的德国引理形式。我们的结果表明，LLM在名词和词汇相似的单词对上的表现最佳，并且在区分直接翻译和变形的变体方面最挣扎。有趣的是，以示例的形式提供其他上下文可以提高翻译性能，但降低了他们识别方言变体的能力。这项研究强调了LLM在处理拼字方言变化方面的局限性，并强调了将来需要将LLMS适应方言的工作的必要性。

Title: CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution

Authors: Milan Straka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17858
Pdf URL: https://arxiv.org/pdf/2509.17858
Copy Paste: [[2509.17858]] CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution(https://arxiv.org/abs/2509.17858)
Keywords: llm
Abstract: We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. This fourth iteration of the shared task introduces a new LLM track alongside the original unconstrained track, features reduced development and test sets to lower computational requirements, and includes additional datasets. CorPipe 25 represents a complete reimplementation of our previous systems, migrating from TensorFlow to PyTorch. Our system significantly outperforms all other submissions in both the LLM and unconstrained tracks by a substantial margin of 8 percentage points. The source code and trained models are publicly available at this https URL.
摘要：我们介绍Corpipe 25，这是CRAC 2025的获胜条目，共享多语言核心分辨率。共享任务的第四次迭代引入了一个新的LLM轨道，以及原始的无约束轨道，具有减少的开发和测试集以降低计算要求，并包括其他数据集。 Corpipe 25代表了我们以前的系统的完整重新实现，该系统从Tensorflow迁移到Pytorch。我们的系统在LLM和不受限制的轨道中都大大优于所有其他提交的内容，其幅度为8个百分点。源代码和训练有素的模型可在此HTTPS URL上公开使用。

Title: How Persuasive is Your Context?

Authors: Tu Nguyen, Kevin Du, Alexander Miserlis Hoyle, Ryan Cotterell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.17879
Pdf URL: https://arxiv.org/pdf/2509.17879
Copy Paste: [[2509.17879]] How Persuasive is Your Context?(https://arxiv.org/abs/2509.17879)
Keywords: language model
Abstract: Two central capabilities of language models (LMs) are: (i) drawing on prior knowledge about entities, which allows them to answer queries such as "What's the official language of Austria?", and (ii) adapting to new information provided in context, e.g., "Pretend the official language of Austria is Tagalog.", that is prepended to the question. In this article, we introduce targeted persuasion score (TPS), designed to quantify how persuasive a given context is to an LM, where persuasion is operationalized as the ability of the context to alter the LM's answer to the question. In contrast to evaluating persuasiveness only by inspecting the greedily decoded answer under the model, TPS provides a more fine-grained view of model behavior. Based on the Wasserstein distance, TPS measures how much a context shifts a model's original answer distribution toward a target distribution. Empirically, through a series of experiments, we show that TPS captures a more nuanced notion of persuasiveness than previously proposed metrics.
摘要：语言模型（LMS）的两个核心功能是：（i）利用有关实体的先验知识，这使他们可以回答诸如“奥地利的官方语言？”以及（ii）适应上下文中提供的新信息，例如假装奥地利的官方语言是Tagalog是Tagalog。在本文中，我们介绍了针对性的说服得分（TPS），旨在量化给定上下文的说服力如何，其中说服力将说服力作为上下文改变了LM对问题的答案的能力。与仅通过检查模型下的贪婪解码答案来评估说服力相反，TPS提供了模型行为的更细粒度的观点。根据Wasserstein距离，TPS衡量了上下文将模型的原始答案分布转移到目标分布的程度。从经验上讲，通过一系列实验，我们表明，与以前提出的指标相比，TPS捕获了更细微的说服力概念。
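
To make the idea of "shifting an answer distribution toward a target" concrete, here is a minimal sketch under strong assumptions. It is not the TPS definition from the paper, which the abstract does not spell out. It assumes a 0/1 ground cost between distinct answers, in which case the Wasserstein-1 distance equals the total variation distance, and it scores a context by how much it reduces the distance to the target distribution. The function names and the probabilities are made up for illustration.

```python
def wasserstein_01(p, q):
    """Wasserstein-1 distance under a 0/1 ground cost (equals total variation)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in keys)

def shift_toward_target(prior, with_context, target):
    """Reduction in distance to the target caused by adding the context.

    Positive values mean the context moved the answer distribution
    closer to the target; negative values mean it moved it away.
    """
    return wasserstein_01(prior, target) - wasserstein_01(with_context, target)

# Hypothetical answer distributions for "What's the official language of Austria?"
prior = {"German": 0.90, "Tagalog": 0.02, "English": 0.08}
with_context = {"German": 0.30, "Tagalog": 0.65, "English": 0.05}
target = {"Tagalog": 1.0}

score = shift_toward_target(prior, with_context, target)
print(round(score, 2))
```

Unlike checking only the greedily decoded answer, a distribution-level score of this kind also registers partial movement, for example when the target answer gains probability mass without yet becoming the top choice.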

Title: Training-free Truthfulness Detection via Value Vectors in LLMs

Authors: Runheng Liu, Heyan Huang, Xingchen Xiao, Zhijing Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17932
Pdf URL: https://arxiv.org/pdf/2509.17932
Copy Paste: [[2509.17932]] Training-free Truthfulness Detection via Value Vectors in LLMs(https://arxiv.org/abs/2509.17932)
Keywords: language model, llm
Abstract: Large language models often generate factually incorrect outputs, motivating efforts to detect the truthfulness of their content. Most existing approaches rely on training probes over internal activations, but these methods suffer from scalability and generalization issues. A recent training-free method, NoVo, addresses this challenge by exploiting statistical patterns from the model itself. However, it focuses exclusively on attention mechanisms, potentially overlooking the MLP module, a core component of Transformer models known to support factual recall. In this paper, we show that certain value vectors within MLP modules exhibit truthfulness-related statistical patterns. Building on this insight, we propose TruthV, a simple and interpretable training-free method that detects content truthfulness by leveraging these value vectors. On the NoVo benchmark, TruthV significantly outperforms both NoVo and log-likelihood baselines, demonstrating that MLP modules, despite being neglected in prior training-free efforts, encode rich and useful signals for truthfulness detection. These findings offer new insights into how truthfulness is internally represented in LLMs and motivate further research on scalable and interpretable truthfulness detection.
摘要：大型语言模型通常会产生事实不正确的产出，激励努力检测其内容的真实性。大多数现有方法都依赖于内部激活的培训探针，但是这些方法遭受了可伸缩性和概括性问题的影响。 Novo最近的一种无培训方法通过利用模型本身利用统计模式来应对这一挑战。但是，它仅专注于注意机制，可能忽视已知的用于支持事实回忆的变压器模型的MLP模块。在本文中，我们表明MLP模块中的某些值向量显示出与真实性相关的统计模式。在这种见解的基础上，我们提出了Trutv，这是一种简单且可解释的无培训方法，它通过利用这些价值向量来检测内容真实性。在NOVO基准测试中，TruthV显着胜过Novo和Log-ofikelihood Baselines，这表明MLP模块dite在先前的无培训工作中被忽略了 - 对真实性检测的丰富和有用的信号被忽略。这些发现提供了有关LLM内部代表的真实性的新见解，并激发了对可扩展和可解释的真实性检测的进一步研究。

Title: D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Authors: Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.17938
Pdf URL: https://arxiv.org/pdf/2509.17938
Copy Paste: [[2509.17938]] D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models(https://arxiv.org/abs/2509.17938)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.
摘要：大型语言模型（LLM）的安全性和一致性对于其负责任的部署至关重要。当前的评估方法主要集中在识别和预防公开的有害产出上。但是，他们通常无法解决更阴险的失败模式：在以恶意或欺骗性的内部推理运行时产生良性产出的模型。这种脆弱性通常是由复杂的系统提示触发的，允许模型绕过常规的安全过滤器，从而带来了一个明显的，不受欢迎的风险。为了解决这一差距，我们介绍了欺骗性推理曝光套件（D-Rex），这是一个新型数据集，旨在评估模型的内部推理过程与其最终输出之间的差异。 D-Rex是通过竞争性的红线练习来构建的，参与者精心制作的对抗系统提示引起这种欺骗性行为。 D-Rex中的每个样本都包含对抗系统提示，最终用户的测试查询，模型看似无害的响应，并且至关重要的是，该模型的内部想法链，揭示了潜在的恶意意图。我们的基准促进了一项新的基本评估任务：欺骗性一致性的检测。我们证明，D-Rex对现有模型和安全机制提出了一个重大挑战，强调了迫切需要仔细研究LLM的内部过程的新技术，而不仅仅是它们的最终输出。

Title: HICode: Hierarchical Inductive Coding with LLMs

Authors: Mian Zhong, Pristina Wang, Anjalie Field
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.17946
Pdf URL: https://arxiv.org/pdf/2509.17946
Copy Paste: [[2509.17946]] HICode: Hierarchical Inductive Coding with LLMs(https://arxiv.org/abs/2509.17946)
Keywords: llm
Abstract: Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode's potential for facilitating nuanced analyses in large-scale data.
摘要：尽管有许多用于细粒语料库分析的应用，但研究人员仍继续依靠不扩展的手动标签，或者很难控制的统计工具（例如主题建模）。我们建议LLM有可能扩展研究人员通常手动进行大型文本语料库的细微分析。为此，受定性研究方法的启发，我们开发了Hicode，这是一条两部分的管道，该管道首先是从分析数据中直接产生标签，然后分层将它们簇簇以表面出现的主题。我们通过测量人类构建的主题对齐，并通过自动化和人类评估来证明其稳健性，从而在三个不同的数据集中验证了这种方法。最后，我们对与美国正在进行的阿片类药物危机有关的诉讼文件进行了案例研究，揭示了制药公司采用的积极营销策略，并证明了Hicode在大规模数据中促进细微差异的潜力。

Title: Variation in Verification: Understanding Verification Dynamics in Large Language Models

Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.17995
Pdf URL: https://arxiv.org/pdf/2509.17995
Copy Paste: [[2509.17995]] Variation in Verification: Understanding Verification Dynamics in Large Language Models(https://arxiv.org/abs/2509.17995)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions - problem difficulty, generator capability, and verifier generation capability - with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than those of strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
摘要：最近的进步表明，缩放测试时间计算使大型语言模型（LLMS）能够解决跨不同领域的日益复杂的问题。测试时间缩放（TTS）的一个有效范式涉及产生多个解决方案候选者的LLM发电机，LLM验证者评估这些候选者的正确性而没有参考答案。在本文中，我们研究了生成验证仪，该验证者通过产生思考链（COT）推理进行验证，然后进行二进制判决。我们通过使用14个开源模型（2B至72B参数范围）和GPT -4O进行了经验研究，通过对数学推理，知识和自然语言推理任务进行12个基准的经验研究，系统地分析了三个维度的验证动态 - 问题难度，生成器能力和验证者的产生能力。我们的实验揭示了有关验证有效性的三个关键发现：（1）简单问题使验证者可以更可靠地证明正确的响应；（2）弱发电机产生的错误比强生成器更容易检测；（3）验证能力通常与验证者自身解决问题的能力相关，但是这种关系随问题难度而变化。这些发现揭示了在TTS应用程序中优化基本验证策略的机会。首先，考虑到相同的验证者，一些弱发电机在验证后TTS性能中几乎可以匹配更强的发电机（例如，GEMMA2-9B至GEMMA2-27B性能差距缩小75.5％）。其次，我们确定强大验证者对弱者的优势有限的案例，因为两者都无法提供有意义的验证收益，这表明仅验证者缩放量无法克服基本验证挑战。

Title: RadEval: A framework for radiology text evaluation

Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18030
Pdf URL: https://arxiv.org/pdf/2509.18030
Copy Paste: [[2509.18030]] RadEval: A framework for radiology text evaluation(https://arxiv.org/abs/2509.18030)
Keywords: llm
Abstract: We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
摘要：我们介绍了Radeval，这是一个统一的开源框架，用于评估放射学文本。 Radeval合并了各种各样的指标，从经典的N-Gram重叠（BLEU，ROUGE）和上下文测量（BERTSCORE）到基于临床概念的分数（F1CHEXBERT，F1RADGRAPH，RATIECORE，RATIE CORE，SRR-BERT，SRR-BERT，PERISENTITYFOL1）和基于LLM LLM的评估者（绿色）。我们完善并标准化实现，扩展绿色，以更轻巧的模型支持多个成像模式，并预处理域特异性放射学编码器，显示出强烈的零射击检索性能。我们还发布了具有超过450个临床上显着的错误标签的丰富注释的专家数据集，并显示了不同的指标与放射科医生判断的相关性。最后，Radeval提供了统计测试工具和基线模型评估，跨多个可公开可用的数据集，从而促进了放射学报告生成中的可重复性和强大的基准测试。

Title: The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

Authors: Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, Maarten Sap
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.18052
Pdf URL: https://arxiv.org/pdf/2509.18052
Copy Paste: [[2509.18052]] The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies(https://arxiv.org/abs/2509.18052)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work, violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about "AI societies."
摘要：大型语言模型（LLM）越来越多地用于社会模拟，在这种模拟中，预计代理人将重现类似人类的集体行为。但是，我们发现许多最近的研究采用了实验设计，这些实验设计系统地破坏了其主张的有效性。从对40多篇论文的调查中，我们确定了六个反复的方法论缺陷：代理通常是同质的（轮廓），相互作用是不存在或施加的（相互作用），记忆被丢弃（内存），提示紧密控制的结果（最小值）可以推断实验假设（Uniblesixhessississessiss（Unive）（统一）和实现（实际上）的效果（实际上是实现），而不是实现了效果。例如，GPT-4O和QWEN-3正确地推断了53.1％的案件中的基本社会实验，当给出了先前的工作 - 不认识原则的指令时。我们将这六个要求形式化为PIMMUR原则，并认为它们是基于LLM的社会模拟的必要条件。为了证明它们的影响，我们使用一个框架来重新进行了五项代表性研究，该框架强制执行PIMMUR，并发现所报道的社会现象经常在更严格的条件下出现。我们的工作为基于LLM的多代理研究建立了方法论标准，并为对“ AI社会”的更可靠和可重复的主张奠定了基础。

Title: ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning

Authors: Jan-Felix Klein, Lars Ohnemus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18063
Pdf URL: https://arxiv.org/pdf/2509.18063
Copy Paste: [[2509.18063]] ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning(https://arxiv.org/abs/2509.18063)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large Language Models (LLMs) show strong reasoning abilities but rely on internalized knowledge that is often insufficient, outdated, or incorrect when trying to answer a question that requires specific domain knowledge. Knowledge Graphs (KGs) provide structured external knowledge, yet their complexity and multi-hop reasoning requirements make integration challenging. We present ARK-V1, a simple KG-agent that iteratively explores graphs to answer natural language queries. We evaluate several state-of-the-art LLMs, without fine-tuning, as backbones for ARK-V1 on the CoLoTa dataset, which requires both KG-based and commonsense reasoning over long-tail entities. ARK-V1 achieves substantially higher conditional accuracies than Chain-of-Thought baselines, and larger backbone models show a clear trend toward better coverage, correctness, and stability.
摘要：大型语言模型（LLMS）表现出强大的推理能力，但依靠内在知识，这些知识通常在尝试回答需要特定领域知识的问题时通常不足，过时或不正确。知识图（kgs）提供了结构化的外部知识，但是它们的复杂性和多跳的推理要求使集成具有挑战性。我们提出了ARK-V1，这是一个简单的KG代理，迭代地探索了回答自然语言查询的图表。我们评估了几个未经微调的最先进的LLM作为COLOTA数据集上ARK-V1的骨架，这需要基于kg的长尾实体的基于kg和常识性推理。 ARK-V1的条件准确性大大要高于基本链基线，并且较大的骨干模型显示出朝着更好的覆盖，正确性和稳定性的明显趋势。

Title: SEQR: Secure and Efficient QR-based LoRA Routing

Authors: William Fleshman, Benjamin Van Durme
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18093
Pdf URL: https://arxiv.org/pdf/2509.18093
Copy Paste: [[2509.18093]] SEQR: Secure and Efficient QR-based LoRA Routing(https://arxiv.org/abs/2509.18093)
Keywords: language model
Abstract: Low-Rank Adaptation (LoRA) has become a standard technique for parameter-efficient fine-tuning of large language models, enabling large libraries of LoRAs, each for a specific task or domain. Efficiently selecting the correct LoRA adapter for a given input remains a challenge, particularly in secure environments where supervised training of routers may raise privacy concerns. Motivated by previous approaches, we formalize the goal of unsupervised LoRA routing in terms of activation norm maximization, providing a theoretical framework for analysis. We demonstrate the discriminative power of activation norms and introduce SEQR, an unsupervised LoRA routing algorithm designed to maximize efficiency while providing strict routing guarantees. SEQR provably identifies the norm-maximizing adapter with significantly greater efficiency, making it a highly scalable and effective solution for dynamic LoRA composition. We validate our results through experiments that demonstrate improved multi-task performance and efficiency.
摘要：低级适应性（LORA）已成为大型语言模型参数有效微调的标准技术，使洛拉斯的大型库可以用于特定的任务或域。有效地选择正确的洛拉适配器作为给定输入仍然是一个挑战，尤其是在监督路由器的安全环境中，可能会引起隐私问题。在先前的方法中，我们将无监督的洛拉路由的目标正式化，这是激活规范最大化的目标，提供了一个理论框架的分析框架。我们演示了激活规范的歧视能力，并引入了SEQR，这是一种无监督的LORA路由算法，旨在最大化效率，同时提供严格的路由保证。 SEQR可证明具有明显更高的效率来识别标准最大化的适配器，从而使其成为动态LORA组成的高度可扩展性和有效解决方案。我们通过实验证明了提高多任务性能和效率的实验来验证我们的结果。
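
The routing objective named in the abstract, picking the adapter whose activation norm is largest for a given input, can be sketched naively as follows. This is only the brute-force baseline under stated assumptions: it takes the activation to be the LoRA update `B(Ax)`, evaluates every adapter in full, and uses made-up rank-1 adapters. It does not reproduce SEQR's QR-based procedure or its efficiency and security guarantees, and the names `activation_norm` and `route` are hypothetical.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def activation_norm(adapter, x):
    """Euclidean norm of the LoRA update B(Ax) for input x (assumed activation)."""
    A, B = adapter
    h = matvec(B, matvec(A, x))
    return math.sqrt(sum(v * v for v in h))

def route(adapters, x):
    """Return the name of the adapter with the largest activation norm."""
    return max(adapters, key=lambda name: activation_norm(adapters[name], x))

# Hypothetical rank-1 adapters over a 3-dimensional hidden state.
adapters = {
    "math": ([[1.0, 0.0, 0.0]], [[1.0], [0.0], [0.0]]),
    "code": ([[0.0, 1.0, 0.0]], [[0.0], [1.0], [0.0]]),
}

x = [0.1, 2.0, 0.0]  # an input that mostly excites the second direction
chosen = route(adapters, x)
print(chosen)
```

Because no router is trained, nothing about the adapters' training data needs to be exposed to a supervised routing model; the cost of this naive version grows linearly with the number of adapters, which is the inefficiency the abstract says SEQR is designed to reduce.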