2025-09-25

Title: Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias

Authors: Sirui Wu, Daijin Yang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.19314
Pdf URL: https://arxiv.org/pdf/2509.19314
Copy Paste: [[2509.19314]] Automated Item Neutralization for Non-Cognitive Scales: A Large Language Model Approach to Reducing Social-Desirability Bias(https://arxiv.org/abs/2509.19314)
Keywords: language model, gpt, llm
Abstract: This study evaluates item neutralization assisted by a large language model (LLM) to reduce social desirability bias in personality assessment. GPT-o3 was used to rewrite the International Personality Item Pool Big Five Measure (IPIP-BFM-50), and 203 participants completed either the original or neutralized form along with the Marlowe-Crowne Social Desirability Scale. The results showed preserved reliability and a five-factor structure, with gains in Conscientiousness and declines in Agreeableness and Openness. The correlations with social desirability decreased for several items, but inconsistently. Configural invariance held, though metric and scalar invariance failed. Findings support AI neutralization as a potential but imperfect bias-reduction method.
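Note: the item-level comparison described above (how strongly an item correlates with social desirability in each form) can be illustrated with a minimal sketch. The data below are hypothetical toy values, not results from the paper, and the function name `pearson_r` is an assumption introduced for illustration. Because participants completed either the original or the neutralized form, each form is paired with its own group's social desirability totals.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): Likert responses to one item
# and social desirability totals, one list pair per participant group.
item_original = [5, 4, 5, 3, 4, 2]
sd_original_group = [20, 17, 22, 12, 15, 9]
item_neutral = [3, 4, 2, 3, 4, 3]
sd_neutral_group = [21, 16, 19, 13, 18, 10]

r_original = pearson_r(item_original, sd_original_group)
r_neutral = pearson_r(item_neutral, sd_neutral_group)
print(f"original form r = {r_original:.2f}, neutralized form r = {r_neutral:.2f}")
```

A smaller absolute correlation for the neutralized item would be read as weaker association with social desirability, although the abstract reports that such decreases were inconsistent across items.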

Title: FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering

Authors: Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Yugang Jia, Jong Ha Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19319
Pdf URL: https://arxiv.org/pdf/2509.19319
Copy Paste: [[2509.19319]] FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering(https://arxiv.org/abs/2509.19319)
Keywords: llm, agent
Abstract: The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, requiring LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (this https URL) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
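Note: as a rough illustration of what a "direct FHIR API call" looks like, the sketch below builds a FHIR REST search URL without sending any request. The base URL, patient id, and helper name `build_fhir_search_url` are hypothetical; the `patient`, `code`, and `_count` search parameters follow the general FHIR search convention for Observation resources, and nothing here is taken from the benchmark's own tooling.

```python
from urllib.parse import urlencode


def build_fhir_search_url(base_url, resource_type, params):
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/{resource_type}?{query}"


# Hypothetical server and patient id, used for illustration only.
url = build_fhir_search_url(
    "https://fhir.example.org/baseR4",
    "Observation",
    {
        "patient": "123",                   # restrict to one patient
        "code": "http://loinc.org|8867-4",  # token search: system|code
        "_count": "10",                     # page size
    },
)
print(url)
```

An agent using this retrieval strategy would typically issue several such searches across resource types and then reason over the returned bundles, which is where the abstract locates much of the difficulty.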

Title: Readme_AI: Dynamic Context Construction for Large Language Models

Authors: Millie Vyas, Timothy Blattner, Alden Dima
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19322
Pdf URL: https://arxiv.org/pdf/2509.19322
Copy Paste: [[2509.19322]] Readme_AI: Dynamic Context Construction for Large Language Models(https://arxiv.org/abs/2509.19322)
Keywords: language model, llm, hallucination
Abstract: Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user's specific query. Providing query-specific context significantly improves the usefulness of their responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates a file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provide inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog's developer. Our primary contribution is an extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: this https URL.

Title: How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs

Authors: Jian Ouyang, Arman T, Ge Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19325
Pdf URL: https://arxiv.org/pdf/2509.19325
Copy Paste: [[2509.19325]] How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs(https://arxiv.org/abs/2509.19325)
Keywords: language model, gpt, llm
Abstract: This paper investigates the impact of incorrect data on the performance and safety of large language models (LLMs), specifically gpt-4o, during supervised fine-tuning (SFT). Although LLMs are becoming increasingly vital across broad domains like finance, coding, law, and health, fine-tuning on incorrect data can lead to "emergent misalignment," producing harmful or deceptive outputs unrelated to the intended task. We evaluate gpt-4o models fine-tuned with varying ratios (10% to 90% correct) of both obviously and subtly incorrect data across four domains: coding, finance, health, and legal. Our findings show that even modest amounts of incorrect data (10-25%) dramatically degrade domain performance but not moral alignment. A clear threshold of at least 50% correct data is needed for models to consistently recover strong performance, though they rarely match the robustness and safety of the base model, which exhibits near-perfect alignment and zero dangerous completions out-of-the-box. This research emphasizes that the cost of incorrect data is heavy, highlighting the critical need for extremely high-quality data curation or, alternatively, leveraging robust base models without unnecessary fine-tuning for high-stakes applications.

Title: Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers

Authors: Ruochi Li, Haoxuan Zhang, Edward Gehringer, Ting Xiao, Junhua Ding, Haihua Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19326
Pdf URL: https://arxiv.org/pdf/2509.19326
Copy Paste: [[2509.19326]] Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers(https://arxiv.org/abs/2509.19326)
Keywords: language model, gpt, llm, prompt
Abstract: The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS across multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work. As an illustrative example, GPT-4o generates 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, LLMs consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at this https URL.

Title: A systematic review of trial-matching pipelines using large language models

Authors: Braxton A. Morrison (1), Madhumita Sushil (1), Jacob S. Young (1) ((1) University of California, San Francisco)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19327
Pdf URL: https://arxiv.org/pdf/2509.19327
Copy Paste: [[2509.19327]] A systematic review of trial-matching pipelines using large language models(https://arxiv.org/abs/2509.19327)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology. However, manual matching is labor-intensive and error-prone, leading to recruitment delays. Pipelines incorporating large language models (LLMs) offer a promising solution. We conducted a systematic review of studies published between 2020 and 2025 from three academic databases and one preprint server, identifying LLM-based approaches to clinical trial matching. Of 126 unique articles, 31 met inclusion criteria. Reviewed studies focused on matching patient-to-criterion only (n=4), patient-to-trial only (n=10), trial-to-patient only (n=2), binary eligibility classification only (n=1) or combined tasks (n=14). Sixteen used synthetic data; fourteen used real patient data; one used both. Variability in datasets and evaluation metrics limited cross-study comparability. In studies with direct comparisons, the GPT-4 model consistently outperformed other models, even fine-tuned ones, in matching and eligibility extraction, albeit at higher cost. Promising strategies included zero-shot prompting with proprietary LLMs like the GPT-4o model, advanced retrieval methods, and fine-tuning smaller, open-source models for data privacy when incorporation of large models into hospital infrastructure is infeasible. Key challenges include accessing sufficiently large real-world data sets, and deployment-associated challenges such as reducing cost, mitigating risk of hallucinations, data leakage, and bias. This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations. Standardized metrics, more realistic test sets, and attention to cost-efficiency and fairness will be critical for broader deployment.

Title: How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment

Authors: Julie Jung, Max Lu, Sina Chole Benker, Dogus Darici
Subjects: cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2509.19329
Pdf URL: https://arxiv.org/pdf/2509.19329
Copy Paste: [[2509.19329]] How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment(https://arxiv.org/abs/2509.19329)
Keywords: language model, llm, prompt
Abstract: We examined how model size, temperature, and prompt style affect the alignment of Large Language Models (LLMs) within a model, between models, and with humans in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. The study highlights the importance of checking alignment across multiple levels.

Title: Quantifying Compositionality of Classic and State-of-the-Art Embeddings

Authors: Zhijin Guo (1 and 2), Chenhao Xue (1), Zhaozhen Xu (2), Hongbo Bo (2), Yuxuan Ye (2), Janet B. Pierrehumbert (1), Martha Lewis (3) ((1) University of Oxford, (2) University of Bristol, (3) University of Amsterdam)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19332
Pdf URL: https://arxiv.org/pdf/2509.19332
Copy Paste: [[2509.19332]] Quantifying Compositionality of Classic and State-of-the-Art Embeddings(https://arxiv.org/abs/2509.19332)
Keywords: language model
Abstract: For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings when this is justified. Even if we don't know what a "pelp" is, we can use our knowledge of numbers to understand that "ten pelps" makes more pelps than "two pelps". Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. SOTA generative transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. We evaluate sentence, knowledge graph, and word embeddings and track compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at this https URL.
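Note: step (ii) above can be illustrated with a deliberately simplified sketch. It assumes that per-attribute vectors are already available and that composition is a plain sum; the paper's actual procedure for fitting and reconstructing embeddings is not given in the abstract, and all vectors and names below are hypothetical toy values.

```python
from math import sqrt


def add_vectors(vectors):
    """Additive composition: element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def l2_distance(a, b):
    """Euclidean distance between two vectors (lower is better)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def retrieve_nearest(query, candidates):
    """Return the candidate name whose embedding is most cosine-similar."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Hypothetical attribute vectors and observed embeddings (toy values).
attribute_vectors = {"ten": [1.0, 0.0, 2.0], "pelps": [0.0, 1.0, 1.0]}
observed = {"ten pelps": [1.1, 0.9, 3.2], "two pelps": [0.2, 1.0, 1.3]}

reconstructed = add_vectors([attribute_vectors["ten"], attribute_vectors["pelps"]])
print("L2:", l2_distance(reconstructed, observed["ten pelps"]))
print("cosine:", cosine_similarity(reconstructed, observed["ten pelps"]))
print("retrieved:", retrieve_nearest(reconstructed, observed))
```

Retrieval accuracy would then be the fraction of held-out combinations for which the reconstructed vector retrieves the correct observed embedding.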

Title: Pluralistic Off-policy Evaluation and Alignment

Authors: Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, Lina Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19333
Pdf URL: https://arxiv.org/pdf/2509.19333
Copy Paste: [[2509.19333]] Pluralistic Off-policy Evaluation and Alignment(https://arxiv.org/abs/2509.19333)
Keywords: llm
Abstract: Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose the Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models' general capabilities on downstream tasks.
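Note: for readers unfamiliar with inverse propensity scoring, the sketch below shows the textbook IPS estimator applied separately to two reward signals. This is an assumption-laden illustration, not POPE's decomposed estimator, whose exact form the abstract does not give; the logged values and the function name `ips_estimate` are hypothetical.

```python
def ips_estimate(rewards, target_probs, logging_probs):
    """Textbook IPS: mean of (pi_target / pi_logging) * reward over logged samples."""
    n = len(rewards)
    if n == 0 or n != len(target_probs) or n != len(logging_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for reward, p_target, p_logging in zip(rewards, target_probs, logging_probs):
        if p_logging <= 0:
            raise ValueError("logging propensities must be positive")
        total += (p_target / p_logging) * reward
    return total / n


# Hypothetical logged interactions (toy values): probability of each logged
# response under the logging policy and under the evaluated LLM.
logging_probs = [0.4, 0.3, 0.5]
target_probs = [0.2, 0.9, 0.5]
relevance_rewards = [1.0, 2.0, 0.0]   # e.g. upvote-style utility signal
diversity_rewards = [0.3, 0.1, 0.6]   # e.g. coverage-style diversity signal

relevance_value = ips_estimate(relevance_rewards, target_probs, logging_probs)
diversity_value = ips_estimate(diversity_rewards, target_probs, logging_probs)
print(relevance_value, diversity_value)
```

Evaluating the two components separately mirrors the abstract's description at a high level; how POPE combines them into one reward is left to the paper.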

Title: Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation

Authors: Qingsong Wang, Tao Wu, Wang Lin, Yueying Feng, Gongsheng Yuan, Chang Yao, Jingyuan Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19336
Pdf URL: https://arxiv.org/pdf/2509.19336
Copy Paste: [[2509.19336]] Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation(https://arxiv.org/abs/2509.19336)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated strong performance in open-ended generation tasks. However, they often struggle to adapt content to users with differing cognitive capacities, leading to a phenomenon we term cognitive misalignment. This issue arises in two forms: knowledge-level misalignment, where content is too complex or too simplistic relative to user understanding, and presentation-style misalignment, where the structure or tone hinders effective comprehension. To address these challenges, we propose the Cognitive-Level Alignment Framework (CLAF), a general-purpose generation framework that aligns both knowledge complexity and presentation style with user cognition. CLAF integrates a capability-aware retrieval module based on a hierarchical knowledge graph and a style optimization module guided by Bloom's taxonomy and preference learning. Additionally, a knowledge-controllable generation component ensures consistency and relevance throughout the output. To support training and evaluation, we construct SCALE, a cognitively annotated dataset containing responses at multiple comprehension levels per query. Empirical results show that CLAF enhances the adaptability and informativeness of LLM outputs across a range of user profiles, offering a robust solution to cognitive-level alignment in real-world applications.

Title: Performance of Large Language Models in Answering Critical Care Medicine Questions

Authors: Mahmoud Alwakeel, Aditya Nagori, An-Kwok Ian Wong, Neal Chaisson, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19344
Pdf URL: https://arxiv.org/pdf/2509.19344
Copy Paste: [[2509.19344]] Performance of Large Language Models in Answering Critical Care Medicine Questions(https://arxiv.org/abs/2509.19344)
Keywords: language model
Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.

Title: SCORE: A Semantic Evaluation Framework for Generative Document Parsing

Authors: Renyu Li, Antonio Jimeno Yepes, Yao You, Kamil Pluciński, Maximilian Operlejn, Crag Wolfe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19345
Pdf URL: https://arxiv.org/pdf/2509.19345
Copy Paste: [[2509.19345]] SCORE: A Semantic Evaluation Framework for Generative Document Parsing(https://arxiv.org/abs/2509.19345)
Keywords: hallucination
Abstract: Multi-modal generative document parsing systems challenge traditional evaluation: unlike deterministic OCR or layout models, they often produce semantically correct yet structurally divergent outputs. Conventional metrics (CER, WER, IoU, or TEDS) misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE (Structural and COntent Robust Evaluation), an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks. Together, these dimensions enable evaluation that embraces representational diversity while enforcing semantic rigor. Across 1,114 pages spanning a holistic benchmark and a field dataset, SCORE consistently revealed cross-dataset performance patterns missed by standard metrics. In 2-5% of pages with ambiguous table structures, traditional metrics penalized systems by 12-25% on average, leading to distorted rankings. SCORE corrected these cases, recovering equivalence between alternative but valid interpretations. Moreover, by normalizing generative outputs into a format-agnostic representation, SCORE reproduces traditional scores (e.g., table F1 up to 0.93) without requiring object-detection pipelines, demonstrating that generative parsing alone suffices for comprehensive evaluation. By exposing how interpretive diversity impacts evaluation outcomes and providing multi-dimensional, interpretable diagnostics, SCORE establishes foundational principles for semantically grounded, fair, and practical benchmarking of modern document parsing systems.
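Note: to make the criticism of conventional metrics concrete, the sketch below computes the standard character error rate (CER) from Levenshtein distance and applies it to two renderings of the same table header. This is the conventional metric only; SCORE's adjusted edit distance is not specified in the abstract and is not reproduced here. The example strings are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Two hypothetical serializations of the same header row: same content,
# different structural convention (pipe-separated vs. tab-separated).
reference = "Name | Age"
hypothesis = "Name\tAge"
print("edit distance:", levenshtein(reference, hypothesis))
print("CER:", cer(reference, hypothesis))
```

Both strings carry the same two cells, yet plain CER reports a non-zero error, which is the kind of penalty on valid interpretations the abstract describes.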

Title: Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches

Authors: Maryam Mahdi Alhusseini, Mohammad-Reza Feizi-Derakhshi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19346
Pdf URL: https://arxiv.org/pdf/2509.19346
Copy Paste: [[2509.19346]] Benchmarking ChatGPT and DeepSeek in April 2025: A Novel Dual Perspective Sentiment Analysis Using Lexicon-Based and Deep Learning Approaches(https://arxiv.org/abs/2509.19346)
Keywords: language model, gpt, llm, chat
Abstract: This study presents a novel dual-perspective approach to analyzing user reviews for ChatGPT and DeepSeek on the Google Play Store, integrating lexicon-based sentiment analysis (TextBlob) with deep learning classification models, including Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks. Unlike prior research, which focuses on either lexicon-based strategies or predictive deep learning models in isolation, this study conducts an extensive investigation into user satisfaction with Large Language Model (LLM) based applications. A dataset of 4,000 authentic user reviews was collected, carefully preprocessed, and oversampled to achieve balanced classes. A balanced test set of 1,700 reviews was used for model testing. Results from the experiments reveal that ChatGPT received significantly more positive sentiment than DeepSeek. Furthermore, deep learning based classification demonstrated superior performance over lexicon analysis, with CNN outperforming Bi-LSTM by achieving 96.41 percent accuracy and near-perfect classification of negative reviews, alongside high F1-scores for neutral and positive sentiments. This research sets a new methodological standard for measuring sentiment in LLM-based applications and provides practical insights for developers and researchers seeking to improve user-centric AI system design.

Title: Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks

Authors: Sara Todorovikj, Lars-Peter Meyer, Michael Martin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19347
Pdf URL: https://arxiv.org/pdf/2509.19347
Copy Paste: [[2509.19347]] Characterizing Knowledge Graph Tasks in LLM Benchmarks Using Cognitive Complexity Frameworks(https://arxiv.org/abs/2509.19347)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used for tasks involving Knowledge Graphs (KGs), whose evaluation typically focuses on accuracy and output correctness. We propose a complementary task characterization approach using three complexity frameworks from cognitive psychology. Applying this to the LLM-KG-Bench framework, we highlight value distributions, identify underrepresented demands and motivate richer interpretation and diversity for benchmark evaluation tasks.

Title: ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Authors: Robert Tjarko Lange, Yuki Imajuku, Edoardo Cetin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19349
Pdf URL: https://arxiv.org/pdf/2509.19349
Copy Paste: [[2509.19349]] ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution(https://arxiv.org/abs/2509.19349)
Keywords: language model, llm, agent
Abstract: We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.
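Note: the abstract mentions a bandit-based LLM ensemble selection strategy without naming the algorithm. The sketch below uses UCB1 as a stand-in to show the general idea of choosing which model proposes the next mutation; the class name, model names, and reward values are all hypothetical and are not taken from ShinkaEvolve.

```python
import math


class UCB1Selector:
    """UCB1 bandit over a fixed set of arms (here, candidate LLMs)."""

    def __init__(self, arm_names):
        if not arm_names:
            raise ValueError("at least one arm is required")
        self.counts = {name: 0 for name in arm_names}
        self.totals = {name: 0.0 for name in arm_names}

    def select(self):
        # Try every arm once before using the confidence bound.
        for name, count in self.counts.items():
            if count == 0:
                return name
        total_pulls = sum(self.counts.values())

        def upper_bound(name):
            mean = self.totals[name] / self.counts[name]
            bonus = math.sqrt(2.0 * math.log(total_pulls) / self.counts[name])
            return mean + bonus

        return max(self.counts, key=upper_bound)

    def update(self, name, reward):
        """Record the reward observed after using the given arm."""
        self.counts[name] += 1
        self.totals[name] += reward


# Hypothetical usage: reward could be the fitness gain of a proposed program.
selector = UCB1Selector(["model_a", "model_b"])
fake_rewards = {"model_a": 0.8, "model_b": 0.3}  # toy, fixed rewards
for _ in range(20):
    chosen = selector.select()
    selector.update(chosen, fake_rewards[chosen])
print(selector.counts)
```

The exploration bonus shrinks as an arm is used more often, so a model that keeps producing useful mutations is sampled more while weaker ones are still revisited occasionally.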
摘要：我们介绍了Shinkaevolve：一个新的开源框架，利用大型语言模型（LLMS）以最先进的性能和前所未有的效率来推进科学发现。缩放推理时间计算LLM的最新进展已在广义科学发现中取得了重大进展。这些方法依赖于将LLMS作为突变操作员生成候选解决方案的进化代理线束。但是，当前的代码演化方法受到关键局限性：它们是样本效率低下的样本，需要数千个样本来识别有效的解决方案，并保持封闭源，并阻碍了广泛的采用和扩展。 Shinkaevolve解决了这些局限性，引入了三个关键创新：父母抽样技术平衡探索和剥削，有效的搜索空间探索的代码新颖性拒绝抽样，以及基于强盗的LLM集合选择策略。我们评估了跨不同任务的Shinkaevolve，表明样本效率和解决方案质量的一致性提高。 Shinkaevolve仅使用150个样本发现了一种新的最先进的圆形包装解决方案，设计了高性能的代理安全带，以实现AIME数学推理任务，确定对Ale-Bench竞争性编程解决方案的改进，并发现新颖的Expert载荷负载负载损失损失功能，以阐明优化策略的空间。我们的结果表明，Shinkaevolve具有出色的样本效率，可实现广泛的适用性。通过提供开源可访问性和成本效益，这项工作使各种计算问题的开放式发现民主化。
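
The bandit-based LLM ensemble selection named in the ShinkaEvolve abstract can be illustrated with a small sketch. The abstract does not say which bandit algorithm the framework uses, so the classic UCB1 rule is assumed here purely for illustration. The class name `UCB1Selector`, the model names, and the success probabilities are all hypothetical.

```python
import math
import random


class UCB1Selector:
    """Choose one LLM per mutation step with the UCB1 bandit rule.

    Assumption: UCB1 is a stand-in; the abstract only says "bandit-based".
    """

    def __init__(self, model_names, exploration=math.sqrt(2)):
        self.models = list(model_names)
        self.exploration = exploration
        self.counts = {m: 0 for m in self.models}    # times each model was used
        self.totals = {m: 0.0 for m in self.models}  # summed rewards per model

    def select(self):
        # Use every model once before trusting any confidence bound.
        for m in self.models:
            if self.counts[m] == 0:
                return m
        n = sum(self.counts.values())

        def score(m):
            mean = self.totals[m] / self.counts[m]
            bonus = self.exploration * math.sqrt(math.log(n) / self.counts[m])
            return mean + bonus

        return max(self.models, key=score)

    def update(self, model, reward):
        # Reward could be 1.0 when the mutated program improves fitness.
        self.counts[model] += 1
        self.totals[model] += reward


def simulate(rounds=2000, seed=0):
    """Toy loop: each hypothetical model improves a program with fixed odds."""
    rng = random.Random(seed)
    success_prob = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.35}  # made up
    selector = UCB1Selector(success_prob)
    for _ in range(rounds):
        model = selector.select()
        reward = 1.0 if rng.random() < success_prob[model] else 0.0
        selector.update(model, reward)
    return selector.counts
```

Calling `simulate()` returns how often each hypothetical model was chosen. The exploration bonus shrinks as a model is used more, so the selector gradually concentrates samples on whichever model has been producing improvements.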

Title: TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities

Authors: Jiajun Chen, Yangyang Wu, Xiaoye Miao, Mengying Zhu, Meng Xi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19352
Pdf URL: https://arxiv.org/pdf/2509.19352
Copy Paste: [[2509.19352]] TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities(https://arxiv.org/abs/2509.19352)
Keywords: prompt
Abstract: The widespread presence of incomplete modalities in multimodal data poses a significant challenge to achieving accurate rumor detection. Existing multimodal rumor detection methods primarily focus on learning joint modality representations from complete multimodal training data, rendering them ineffective in addressing the common occurrence of missing modalities in real-world scenarios. In this paper, we propose a hierarchical soft prompt model, TriSPrompt, which integrates three types of prompts, i.e., modality-aware (MA) prompt, modality-missing (MM) prompt, and mutual-views (MV) prompt, to effectively detect rumors in incomplete multimodal data. The MA prompt captures both heterogeneous information from specific modalities and homogeneous features from available data, aiding in modality recovery. The MM prompt models missing states in incomplete data, enhancing the model's adaptability to missing information. The MV prompt learns relationships between subjective (i.e., text and image) and objective (i.e., comments) perspectives, effectively detecting rumors. Extensive experiments on three real-world benchmarks demonstrate that TriSPrompt achieves an accuracy gain of over 13% compared to state-of-the-art methods. The codes and datasets are available at https: //anonymous.this http URL.
摘要：多模式数据中普遍存在不完整的模式的存在为实现准确的谣言检测带来了重大挑战。现有的多模式谣言检测方法主要集中于从\ emph {完整}多模式训练数据中学习关节模式表示，使它们无效地解决了现实世界中\ emph {缺失模态}的常见发生。在本文中，我们提出了一个层次软提示模型\ textsf {trisprompt}，该模型集成了三种类型的提示，\ textit {i.e。}，\ emph {moditaly-ware}（ma）提示不完整的多模式数据。 MA提示将从特定模式和可用数据中的均质功能中捕获异质信息，从而有助于模态恢复。 MM提示模型在数据不完整中缺少状态，从而增强了模型对缺少信息的适应性。 MV提示将学习主观（\ textit {i.e。}，文本和图像）与目标（\ textit {i.e。}，注释）观点之间的关系，有效地检测到谣言。对三个现实世界基准的广泛实验表明，与最先进的方法相比，\ textsf {trisprompt}的准确性增长超过13 \％。代码和数据集可在https：// anonymous..this http url上找到。

Title: RoadMind: Towards a Geospatial AI Expert for Disaster Response

Authors: Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19354
Pdf URL: https://arxiv.org/pdf/2509.19354
Copy Paste: [[2509.19354]] RoadMind: Towards a Geospatial AI Expert for Disaster Response(https://arxiv.org/abs/2509.19354)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have shown impressive performance across a range of natural language tasks, but remain limited in their ability to reason about geospatial data, particularly road networks, distances, and directions. This gap poses challenges in disaster scenarios, where spatial understanding is critical for tasks such as evacuation planning and resource allocation. In this work, we present RoadMind, a self-supervised framework that enhances the geospatial reasoning capabilities of LLMs using structured data from OpenStreetMap (OSM). Our automated pipeline extracts road infrastructure data for a given city and converts it into multiple supervision formats tailored to key spatial tasks. We pretrain and fine-tune LLMs on these representations using QLoRA adapters and 4-bit quantized models. We evaluate our approach on three disaster-prone cities with varying global representation, Los Angeles, Christchurch, and Manila, across tasks such as road segment identification, nearest road retrieval, and distance/direction estimation. Our results show that models trained via RoadMind significantly outperform strong baselines, including state-of-the-art LLMs equipped with advanced prompt engineering. This demonstrates the potential of structured geospatial data to enhance language models with robust spatial reasoning, enabling more effective offline AI systems for disaster response.
摘要：大型语言模型（LLMS）在一系列自然语言任务中表现出令人印象深刻的表现，但其推理地理空间数据，尤其是道路网络，距离，距离和方向的能力仍然有限。该差距在灾难场景中构成了挑战，在灾难场景中，空间理解对于诸如疏散计划和资源分配等任务至关重要。在这项工作中，我们提出了一种自制的框架，它使用OpenStreetMap（OSM）的结构化数据来增强LLM的地理空间推理能力。我们的自动化管道将为给定的城市提取道路基础设施数据，并将其转换为针对关键空间任务的多个监督格式。我们使用Qlora适配器和4位量化模型在这些表示上预先计算LLM。我们在三个具有不同的全球代表性，洛杉矶，克赖斯特彻奇和马尼拉的城市中评估我们的方法，跨越路段识别，最近的道路检索以及距离/方向估计等任务。我们的结果表明，通过路线训练的型号明显胜过强大的基线，包括配备高级及时工程的最先进的LLM。这证明了结构化地理空间数据的潜力，可以通过强大的空间推理来增强语言模型，从而使更有效的离线AI系统进行灾难响应。

Title: Benchmarking and Improving LLM Robustness for Personalized Generation

Authors: Chimaobi Okite, Naihao Deng, Kiran Bodipati, Huaidian Hou, Joyce Chai, Rada Mihalcea
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19358
Pdf URL: https://arxiv.org/pdf/2509.19358
Copy Paste: [[2509.19358]] Benchmarking and Improving LLM Robustness for Personalized Generation(https://arxiv.org/abs/2509.19358)
Keywords: language model, gpt, llm, prompt
Abstract: Recent years have witnessed a growing interest in personalizing the responses of large language models (LLMs). While existing evaluations primarily focus on whether a response aligns with a user's preferences, we argue that factuality is an equally important yet often overlooked dimension. In the context of personalization, we define a model as robust if its responses are both factually accurate and align with the user preferences. To assess this, we introduce PERG, a scalable framework for evaluating robustness in LLMs, along with a new dataset, PERGData. We evaluate fourteen models from five different model families using different prompting methods. Our findings show that current LLMs struggle with robust personalization: even the strongest models (GPT-4.1, LLaMA3-70B) fail to maintain correctness in 5% of previously successful cases without personalization, while smaller models (e.g., 7B-scale) can fail more than 20% of the time. Further analysis reveals that robustness is significantly affected by the nature of the query and the type of user preference. To mitigate these failures, we propose Pref-Aligner, a two-stage approach that improves robustness by an average of 25% across models. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned LLM deployments.
摘要：近年来，人们对个性化大语模型（LLM）的响应的兴趣日益增加。虽然现有的评估主要集中于响应是否与用户的偏好保持一致，但我们认为事实是同样重要但经常被忽略的维度。在个性化的背景下，如果其响应既准确又与用户偏好保持一致，则将模型定义为强大的。为了评估这一点，我们介绍了PERG，这是一个可扩展的框架，用于评估LLMS的鲁棒性以及新的数据集Pergdata。我们使用不同的提示方法评估了来自五个不同模型家族的14个模型。我们的发现表明，当前的LLM与强大的个性化斗争：即使是最强的模型（GPT-4.1，Llama3-70B）也无法在没有个性化的5％以前成功的案例中保持正确性，而较小的模型（例如，7B规模）可能会在20％以上的时间内失败。进一步的分析表明，鲁棒性受查询性质和用户喜好类型的显着影响。为了减轻这些失败，我们提出了Pref-Aligner，这是一种两阶段的方法，可在模型中平均提高鲁棒性25％。我们的工作突出了当前评估实践中的关键差距，并引入了工具和指标，以支持更可靠的用户一致的LLM部署。

Title: Semantic Representation Attack against Aligned Large Language Models

Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19360
Pdf URL: https://arxiv.org/pdf/2509.19360
Copy Paste: [[2509.19360]] Semantic Representation Attack against Aligned Large Language Models(https://arxiv.org/abs/2509.19360)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as "Sure, here is...", suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41% averaged across 18 LLMs, including 100% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.
摘要：大型语言模型（LLMS）越来越多地采用对齐技术来防止有害产出。尽管有这些保障措施，攻击者仍可以通过制定提示来构成攻击者，从而诱使LLM产生有害内容。当前方法通常针对精确的肯定响应，例如``当然，……''，遭受有限的收敛性，不自然的提示和高计算成本。我们介绍了语义表示攻击，这是一种新颖的范式，从根本上重新概念化了针对统一的LLM的对抗性目标。我们的方法不是针对确切的文本模式，而是利用语义表示空间，其中包括各种响应，具有等效的有害含义。这项创新解决了困扰现有方法的攻击功效与迅速自然性之间的固有权衡。提出了语义表示启发式搜索算法，以通过在增量扩张期间保持可解释性来有效地产生语义相干和简洁的对抗提示。我们为语义融合建立了严格的理论保证，并证明我们的方法实现了前所未有的攻击成功率（在18个LLM中平均89.41 \％，包括11个模型的100 \％），同时保持隐身性和效率。全面的实验结果证实了我们语义表示攻击的总体优势。该代码将公开可用。

Title: The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior

Authors: Angelina Wang, Daniel E. Ho, Sanmi Koyejo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19364
Pdf URL: https://arxiv.org/pdf/2509.19364
Copy Paste: [[2509.19364]] The Inadequacy of Offline LLM Evaluations: A Need to Account for Personalization in Model Behavior(https://arxiv.org/abs/2509.19364)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Standard offline evaluations for language models -- a series of independent, state-less inferences made by models -- fail to capture how language models actually behave in practice, where personalization fundamentally alters model behavior. For instance, identical benchmark questions to the same language model can produce markedly different responses when prompted to a state-less system, in one user's chat session, or in a different user's chat session. In this work, we provide empirical evidence showcasing this phenomenon by comparing offline evaluations to field evaluations conducted by having 800 real users of ChatGPT and Gemini pose benchmark and other provided questions to their chat interfaces.
摘要：语言模型的标准离线评估 - 一系列由模型做出的一系列独立的，无国家的推论 - 无法捕获语言模型在实践中的实际行为方式，在这个中，个性化从根本上改变了模型行为。例如，当提示对一个无国务系统，在一个用户的聊天会话中或在其他用户的聊天会话中提示，与同一语言模型相同的基准问题可能会产生明显不同的答案。在这项工作中，我们提供了经验证据，通过将离线评估与拥有800名Chatgpt和Gemini姿势基准的真正用户进行比较，并通过将离线评估与实地评估进行比较，并为他们的聊天界面提出了问题，从而展示了这种现象。

Title: LLM-Assisted Topic Reduction for BERTopic on Social Media Data

Authors: Wannes Janssens, Matthias Bogaert, Dirk Van den Poel
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19365
Pdf URL: https://arxiv.org/pdf/2509.19365
Copy Paste: [[2509.19365]] LLM-Assisted Topic Reduction for BERTopic on Social Media Data(https://arxiv.org/abs/2509.19365)
Keywords: language model, llm
Abstract: The BERTopic framework leverages transformer embeddings and hierarchical clustering to extract latent topics from unstructured text corpora. While effective, it often struggles with social media data, which tends to be noisy and sparse, resulting in an excessive number of overlapping topics. Recent work explored the use of large language models for end-to-end topic modelling. However, these approaches typically require significant computational overhead, limiting their scalability in big data contexts. In this work, we propose a framework that combines BERTopic for topic generation with large language models for topic reduction. The method first generates an initial set of topics and constructs a representation for each. These representations are then provided as input to the language model, which iteratively identifies and merges semantically similar topics. We evaluate the approach across three Twitter/X datasets and four different language models. Our method outperforms the baseline approach in enhancing topic diversity and, in many cases, coherence, with some sensitivity to dataset characteristics and initial parameter selection.
摘要：Bertopic框架利用变形金刚的嵌入和分层聚类从非结构化的文本语料库中提取潜在主题。尽管有效，但它通常与社交媒体数据斗争，这往往嘈杂且稀疏，导致过多的重叠主题。最近的工作探索了大型语言模型用于端到端主题建模。但是，这些方法通常需要大量的计算开销，从而限制了它们在大数据上下文中的可扩展性。在这项工作中，我们提出了一个框架，将主题生成的伯托与大型语言模型结合在一起，以减少主题。该方法首先生成一组初始主题，并为每个主题构建一个表示。然后将这些表示形式作为语言模型的输入提供，迭代地识别并合并了语义上相似的主题。我们评估了三个Twitter/x数据集和四个不同语言模型的方法。我们的方法优于增强主题多样性的基线方法，并且在许多情况下，对数据集特性和初始参数选择具有一定的敏感性。

Title: Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

Authors: Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19368
Pdf URL: https://arxiv.org/pdf/2509.19368
Copy Paste: [[2509.19368]] Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding(https://arxiv.org/abs/2509.19368)
Keywords: language model, llm
Abstract: Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such a draft-then-verify paradigm, even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may outweigh the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. First, we configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. Second, we interleave drafting and verification per token: while the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on the fly, analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x to 3.81x, which is almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
摘要：大型语言模型（LLMS）具有令人印象深刻的发电质量，但是由于所有模型层都会自动产生自动回归，因此推理成本很高。基于早期外观的自定义解码（EESD）已出现以减轻这一成本。但是，实际上，即使以良好的早期偏远地区和选定的退出地位，许多方法也很难在此类草案中达到预期的加速度。我们的分析表明，EESD仅在LLM接受绝大多数草稿代币时才有所回报。否则，草案成本可能会克服加速度增长并导致负速度。为了减轻这种情况，我们提出了充分管道草稿和验证工作的管道并行自我指码（PPSD），以便在预测失败的情况下不会浪费努力。它有两个关键的创新。我们将模型层配置为管道，其中早期外观（草稿）计算和剩余层（验证）计算重叠。我们插入了每个令牌的起草和验证。尽管LLM在其最后一层中验证当前令牌，但早期的路径同时起草了下一个令牌。这样的验证时，请放入方案使所有单元保持繁忙，并验证代币与管道猜测和验证阶段的同样类似。经验结果证实，PPSD在自我指定LLM推理中实现了最新的加速度。在不同的基准测试基准上，PPSD达到2.01x〜3.81x的速度比率，该比率几乎以固定的接受率和退出位置获得了最佳加速度，从而展示了其在提供有效的自我调查方面的进步。
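
The PPSD abstract's claim that sequential draft-then-verify decoding only pays off at high acceptance rates can be checked with a back-of-the-envelope model. The sketch below is not the paper's analysis. It assumes the commonly used simplification that each draft token is accepted independently with the same probability, that one verification pass costs one unit, and that each draft step costs a fixed fraction `draft_cost` of that unit. All function and parameter names are illustrative.

```python
def expected_tokens_per_cycle(accept_rate, draft_len):
    """Expected tokens emitted per verification pass.

    Assumes each of the `draft_len` draft tokens is accepted independently
    with probability `accept_rate`; the verifier always contributes one token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def sequential_speedup(accept_rate, draft_len, draft_cost):
    """Speedup over plain decoding when drafting and verifying run back to back.

    `draft_cost` is the cost of one draft step relative to one full forward
    pass (an assumed constant, e.g. 0.3 for an exit at 30% of the layers).
    """
    cycle_cost = draft_len * draft_cost + 1.0
    return expected_tokens_per_cycle(accept_rate, draft_len) / cycle_cost


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(sequential_speedup(rate, draft_len=4, draft_cost=0.3), 3))
```

Under these assumptions a 50% acceptance rate with four draft tokens gives a ratio below 1, i.e. a slowdown, while a 90% acceptance rate gives a clear gain. This matches the qualitative observation in the abstract that motivates overlapping drafting with verification.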

Title: SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use

Authors: Changhyun Jeon, Jinhee Park, Jungwoo Choi, Keonwoo Kim, Jisu Kim, Minji Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19369
Pdf URL: https://arxiv.org/pdf/2509.19369
Copy Paste: [[2509.19369]] SLM-Based Agentic AI with P-C-G: Optimized for Korean Tool Use(https://arxiv.org/abs/2509.19369)
Keywords: language model, llm, agent
Abstract: We propose a small-scale language model (SLM) based agent architecture, Planner-Caller-Generator (P-C-G), optimized for Korean tool use. P-C-G separates planning, calling, and generation by role: the Planner produces an initial batch plan with limited on-demand replanning; the Caller returns a normalized call object after joint schema-value validation; and the Generator integrates tool outputs to produce the final answer. We apply a Korean-first value policy to reduce execution failures caused by frequent Korean-to-English code switching in Korean settings. Evaluation assumes Korean queries and Korean tool/parameter specifications; it covers single-chain, multi-chain, missing-parameters, and missing-functions scenarios, and is conducted via an LLM-as-a-Judge protocol averaged over five runs under a unified I/O interface. Results show that P-C-G delivers competitive tool-use accuracy and end-to-end quality while reducing tokens and maintaining acceptable latency, indicating that role-specialized SLMs are a cost-effective alternative for Korean tool-use agents.
摘要：我们提出了一个基于小规模的语言模型（SLM）的代理体系结构，计划者 - 呼叫者生成器（P-C-G），已针对韩国工具的使用进行了优化。 P-C-G可以通过角色将计划，呼叫和生成分开：计划者生成的初始批处理计划有限地重新启动；呼叫者在联合模式值验证后返回一个归一化的调用对象；并且发电机集成了工具输出以产生最终答案。我们采用韩国先进的价值政策来减少韩国环境中频繁到英语代码造成的执行故障。评估假设韩国查询和韩国工具/参数规格；它涵盖了单链，多链，缺失参数和缺失功能的方案，并通过在统一的I/O接口下通过五次运行的LLM-AS-A-A-Gudge协议进行。结果表明，P-C-G具有竞争性的工具使用精度和端到端质量，同时降低令牌并保持可接受的延迟，这表明角色专题化的SLM是韩国工具使用代理商的成本效益替代方案。

Title: Meow: End-to-End Outline Writing for Automatic Academic Survey

Authors: Zhaoyu Ma, Yuan Shan, Jiahao Zhao, Nan Xu, Lei Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19370
Pdf URL: https://arxiv.org/pdf/2509.19370
Copy Paste: [[2509.19370]] Meow: End-to-End Outline Writing for Automatic Academic Survey(https://arxiv.org/abs/2509.19370)
Keywords: llm
Abstract: As academic paper publication numbers grow exponentially, conducting in-depth surveys with LLMs automatically has become an inevitable trend. Outline writing, which aims to systematically organize related works, is critical for automated survey generation. Yet existing automatic survey methods treat outline writing as mere workflow steps in the overall pipeline. Such template-based workflows produce outlines that lack in-depth understanding of the survey topic and fine-grained styles. To address these limitations, we propose Meow, the first metadata-driven outline writing framework that produces organized and faithful outlines efficiently. Specifically, we first formulate outline writing as an end-to-end task that generates hierarchical structured outlines from paper metadata. We then curate a high-quality dataset of surveys from arXiv, bioRxiv, and medRxiv, and establish systematic evaluation metrics for outline quality assessment. Finally, we employ a two-stage training approach combining supervised fine-tuning and reinforcement learning. Our 8B reasoning model demonstrates strong performance with high structural fidelity and stylistic coherence.
摘要：随着学术报纸出版物的成倍增长，自动对LLMS进行深入的调查已成为不可避免的趋势。旨在系统地组织相关作品的大纲写作对于自动化的测量生成至关重要。然而，现有的自动调查方法将大纲写作视为整体管道中的工作流程步骤。这种基于模板的工作流程产生的大纲缺乏对调查主题和细粒度样式的深入了解。为了解决这些局限性，我们提出了Meow，这是第一个由元数据驱动的大纲写作框架，可有效地有效地概述。具体而言，我们首先将大纲写作作为端到端任务，该任务生成了纸质元数据的层次结构概述。然后，我们策划了来自Arxiv，Biorxiv和MedRxiv的调查的高质量数据集，并为轮廓质量评估建立系统评估指标。最后，我们采用了两阶段的培训方法，结合了监督的微调和强化学习。我们的8B推理模型以高结构保真度和风格连贯性表现出强劲的性能。

Title: How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models

Authors: Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19371
Pdf URL: https://arxiv.org/pdf/2509.19371
Copy Paste: [[2509.19371]] How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models(https://arxiv.org/abs/2509.19371)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.
摘要：大型语言模型（LLMS）由于在各种下游任务中令人印象深刻的一般能力而引起了极大的关注。但是，如果没有特定领域的优化，它们通常在专业知识基准上表现不佳，甚至产生幻觉。最近的研究表明，在预训练期间策略性地注入域知识可以大大改善下游性能。一个关键的挑战在于平衡这种输液权衡：注入太少的域特异性数据会产生足够的专业化，而过度的输液触发了灾难性的忘记先前获得的知识。在这项工作中，我们关注过度灌注引起的记忆崩溃现象。通过系统的实验，我们进行了两个关键的观察，即1）关键崩溃点：每个模型都表现出一个阈值，其知识保留能力急剧下降。 2）比例尺相关：这些崩溃点与模型的大小一致。在这些见解的基础上，我们提出了一项知识输液缩放定律，该定律定律通过分析较小的对应物来预测最佳的领域知识数量，以注入大型LLM。跨不同模型尺寸和有关代币预算的广泛实验证明了我们扩展定律的有效性和概括性。

Title: A Pipeline to Assess Merging Methods via Behavior and Internals

Authors: Yutaro Sigris, Andreas Waldis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19476
Pdf URL: https://arxiv.org/pdf/2509.19476
Copy Paste: [[2509.19476]] A Pipeline to Assess Merging Methods via Behavior and Internals(https://arxiv.org/abs/2509.19476)
Keywords: language model
Abstract: Merging methods combine the weights of multiple language models (LMs) to leverage their capacities, such as for domain adaptation. While existing studies investigate merged models from a solely behavioral perspective, we offer the first comprehensive view by assessing and connecting their behavior and internals. We present a novel evaluation pipeline that first merges multiple parent LMs, and then evaluates the merged models in comparison to the initial ones based on their behavior on downstream tasks, like MMLU, and their internally encoded linguistic competence. We showcase this pipeline by assessing the merging of instruction fine-tuned with math- and code-adapted LMs from the Qwen2.5 family. Our results show that merging methods impact behavior and internals differently. While the performance of merged models is typically between that of the two parent models, their encoded information about linguistic phenomena, particularly in morphology and syntax, can surpass the parent models. Moreover, we find a weak ranking correlation between the behavioral and internal evaluations. With our pipeline and initial results, we emphasize the need for more comprehensive evaluations of model merging methods to gain a faithful understanding of their capabilities and reliability, beyond potential superficial behavioral advances.
摘要：合并方法结合了多种语言模型（LMS）的权重以利用其能力，例如域适应性。尽管现有研究从唯一的行为角度研究了合并模型，但我们通过评估和联系其行为和内部质量来提供第一个全面的观点。我们提出了一条新的评估管道，该管道首先合并多个父级LM，然后根据其在下游任务（例如MMLU）和内部编码的语言能力的行为而评估合并模型。我们通过评估QWEN2.5家族的数学和代码适应的LMS进行微调的教学合并来展示该管道。我们的结果表明，合并方法对行为和内部的影响有所不同。虽然合并模型的性能通常在两个父模型之间，但它们有关语言现象的编码信息，尤其是在形态和语法方面，可以超越父模型。此外，我们发现这种行为和内部评估之间的排名相关性较弱。通过我们的管道和最初的结果，我们强调了对模型合并方法进行更全面的评估，以忠实地了解其能力和可靠性，而不是潜在的肤浅行为进步。

Title: Do LLMs Encode Frame Semantics? Evidence from Frame Identification

Authors: Jayanth Krishna Chundru, Rudrashis Poddar, Jie Cao, Tianyu Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19540
Pdf URL: https://arxiv.org/pdf/2509.19540
Copy Paste: [[2509.19540]] Do LLMs Encode Frame Semantics? Evidence from Frame Identification(https://arxiv.org/abs/2509.19540)
Keywords: language model, llm, prompt
Abstract: We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model's internalized understanding of frame semantics.
摘要：我们调查了大型语言模型是否编码框架语义的潜在知识，专注于框架标识，这是框架语义解析中的核心挑战，涉及在上下文中为目标单词选择适当的语义框架。使用Framenet词汇资源，我们在基于及时的推理下评估模型，并观察到即使没有明确的监督，它们也可以有效地执行框架识别。为了评估特定于任务的培训的影响，我们将模型调整在Framenet数据上，该模型可以大大提高内域准确性，同时很好地推广到跨域基准测试。进一步的分析表明，这些模型可以生成语义上一致的框架定义，从而突出了模型对框架语义的内在理解。

Title: Confidence Calibration in Large Language Model-Based Entity Matching

Authors: Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19557
Pdf URL: https://arxiv.org/pdf/2509.19557
Copy Paste: [[2509.19557]] Confidence Calibration in Large Language Model-Based Entity Matching(https://arxiv.org/abs/2509.19557)
Keywords: language model
Abstract: This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
摘要：这项研究旨在探索大型语言模型和实体匹配中置信度校准的交集。为此，我们进行了一项经验研究，以比较实体匹配任务的基线罗伯塔（Roberta）的信心与使用温度缩放，蒙特卡洛辍学和合奏进行校准的信心。我们使用ABT-BUY，DBLP-ACM，ITUNES-AMAZON和公司数据集。研究结果表明，所提出的修改后的Roberta模型表现出轻微的自信，预期的校准误差得分范围从数据集范围为0.0043至0.0552。我们发现，可以使用温度缩放来缓解这种过度自信，从而将预期的校准误差得分降低了23.83％。
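
The two quantities at the centre of the entity-matching calibration abstract, Expected Calibration Error and Temperature Scaling, are easy to state in code. The sketch below is a minimal pure-Python version under stated assumptions: equal-width confidence bins, a single scalar temperature, and a coarse grid search standing in for the gradient-based fit that is more common in practice. The toy logits and labels in the usage note are invented, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature in [0.5, 5.0] that minimises validation NLL."""
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))
```

As a usage example, four validation pairs with logits `[4.0, 0.0]` and labels `[0, 0, 0, 1]` are 75% accurate but about 98% confident. `fit_temperature` returns a value above 1, and the Expected Calibration Error computed from the rescaled confidences is lower than before scaling.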

Title: Uncertainty in Semantic Language Modeling with PIXELS

Authors: Stefania Radu, Marco Zullich, Matias Valdenegro-Toro
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19563
Pdf URL: https://arxiv.org/pdf/2509.19563
Copy Paste: [[2509.19563]] Uncertainty in Semantic Language Modeling with PIXELS(https://arxiv.org/abs/2509.19563)
Keywords: language model
Abstract: Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in pixel-based language models across 18 languages and 7 scripts, all part of 3 semantically challenging tasks. This is achieved through several methods such as Monte Carlo Dropout, Transformer Attention, and Ensemble Learning. The results suggest that pixel-based models underestimate uncertainty when reconstructing patches. The uncertainty is also influenced by the script, with Latin languages displaying lower uncertainty. The findings on ensemble learning show better performance when applying hyperparameter tuning during the named entity recognition and question-answering tasks across 16 languages.
摘要：基于像素的语言模型旨在解决语言建模中的词汇瓶颈问题，但是不确定性量化的挑战仍然开放。这项工作的新颖性包括分析对18种语言和7种脚本的基于像素的语言模型的不确定性和信心，这都是3个语义上具有挑战性的任务的一部分。这是通过多种方法（例如蒙特卡洛辍学，变压器的注意力和集合学习）来实现的。结果表明，基于像素的模型在重建斑块时低估了不确定性。不确定性也受脚本的影响，拉丁语的不确定性较低。合奏学习的发现在应用16种语言的指定实体识别和提问任务期间应用超参数调整时表现出更好的性能。

Title: Retrieval Augmented Generation based context discovery for ASR

Authors: Dimitrios Siskos, Stavros Papadopoulos, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Anastasios Drosou
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.19567
Pdf URL: https://arxiv.org/pdf/2509.19567
Copy Paste: [[2509.19567]] Retrieval Augmented Generation based context discovery for ASR(https://arxiv.org/abs/2509.19567)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: This work investigates retrieval augmented generation as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, in order to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. However, identifying the right context automatically remains an open challenge. This work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21 and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (percentage difference) relative to using no context, while the oracle context results in a reduction of up to 24.1%.
摘要：这项工作调查了在上下文感知的自动语音识别（ASR）系统中自动上下文发现的有效策略的检索增强生成，以便在存在稀有或少量分类术语的情况下提高转录精度。但是，自动确定正确的上下文仍然是一个开放的挑战。这项工作提出了一种基于高效的基于嵌入的检索方法，用于ASR中的自动上下文发现。为了使其有效性化，还评估了基于大语言模型（LLM）的两种替代方案：（1）通过提示通过提示生成的大语言模型（LLM）基于上下文的生成，以及（2）使用LLMS的后识别后成绩单校正。相对于使用Not-No-Context的TED-LIUMV3，ENATIONS21和SPGISPEECH的实验表明，所提出的方法最多将WER降低了17％（百分比差异），而Oracle上下文则减少了24.1％。
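
The headline numbers in the ASR abstract are relative changes in word error rate (WER). For readers who want to reproduce that kind of figure, the sketch below computes WER with a word-level Levenshtein distance and then the relative reduction between two systems. Reading "percentage difference" as (baseline - improved) / baseline is an assumption, since the abstract does not give the formula, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction by which WER dropped relative to the baseline (assumed formula)."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    reference = "the quarterly earnings call starts at noon"
    no_context = "the quarterly earning call start at noon"
    with_context = "the quarterly earnings call start at noon"
    base = word_error_rate(reference, no_context)
    better = word_error_rate(reference, with_context)
    print(base, better, relative_reduction(base, better))
```

Note that a 17% relative reduction is much smaller in absolute terms than a 17-point drop: a baseline WER of 10% would fall to about 8.3%.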

Title: ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

Authors: Aleksis Datseris, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19569
Pdf URL: https://arxiv.org/pdf/2509.19569
Copy Paste: [[2509.19569]] ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities(https://arxiv.org/abs/2509.19569)
Keywords: language model
Abstract: This paper introduces a novel approach to position embeddings in transformer models, named "Exact Positional Embeddings" (ExPE), an absolute positional embedding method that can extrapolate to sequences longer than the ones it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model's ability to generalize to more extended sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings, when tested on sequences longer than those used in training.
摘要：本文介绍了一种新颖的方法，用于在变压器模型中定位嵌入，称为“精确位置嵌入”（expe）。一种绝对的位置嵌入方法，可以推断长到长度序列长于其经过训练的长度序列。传统的变压器模型依靠绝对或相对位置嵌入将位置信息纳入令牌嵌入中，这些嵌入通常在推断到序列上比训练期间所看到的更长的序列遇到困难。我们提出的方法采用了一种新颖的嵌入策略，该策略通过覆盖嵌入向量的特定维度来编码确切的位置信息，从而实现了令牌位置的更精确表示。所提出的方法不仅保持原始嵌入的完整性，而且还增强了模型将其推广到更扩展序列的能力。在因果语言建模中，与旋转和正弦式嵌入相比，我们的expe嵌入可显着降低困惑性，而对序列进行测试的时间比训练中使用的序列长。
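
The ExPE abstract describes the core idea only as "overriding specific dimensions of the embedding vectors" with exact positional information. The sketch below illustrates that general idea and nothing more. Which dimensions are overridden, and what values are written, are not stated in the abstract, so the choices here (the first `n_pos_dims` dimensions, filled with a linear function of the position) are hypothetical and should not be read as the authors' scheme.

```python
def override_with_position(embeddings, n_pos_dims=2, step=0.01):
    """Return copies of token embeddings with reserved dims set from position.

    Hypothetical scheme for illustration: dimension d (0-based) of the token
    at position p is overwritten with p * step * (d + 1). All other
    dimensions keep the original token embedding values.
    """
    result = []
    for position, vector in enumerate(embeddings):
        if n_pos_dims > len(vector):
            raise ValueError("n_pos_dims exceeds the embedding width")
        updated = list(vector)  # copy so the input is left untouched
        for d in range(n_pos_dims):
            updated[d] = position * step * (d + 1)
        result.append(updated)
    return result


if __name__ == "__main__":
    tokens = [[0.5, -0.2, 0.1, 0.9] for _ in range(4)]  # toy 4-dim embeddings
    for row in override_with_position(tokens):
        print(row)
```

The contrast with additive sinusoidal encodings is that the reserved dimensions carry the position by itself rather than a sum of content and position, and a rule of this kind stays well defined for positions beyond any training length.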

Title: LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines

Authors: Yanfang (Fanny)Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Patricia Culligan, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh Chawla
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19580
Pdf URL: https://arxiv.org/pdf/2509.19580
Copy Paste: [[2509.19580]] LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines(https://arxiv.org/abs/2509.19580)
Keywords: language model, gpt, llm, chat
Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, applications based on Large Language Models (LLMs), such as ChatGPT, have shown the capability of generating human-like conversation on extensive topics. Given their impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that LLMs can bring to broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper offers an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines, along with key observations and insights, can help researchers and practitioners interested in exploiting LLMs to advance their work in diverse real-world applications.

Title: GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Authors: Dylan Hutson, Daniel Vennemeyer, Aneesh Deshmukh, Justin Zhan, Tianyu Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19593
Pdf URL: https://arxiv.org/pdf/2509.19593
Copy Paste: [[2509.19593]] GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models(https://arxiv.org/abs/2509.19593)
Keywords: language model, llm, prompt
Abstract: We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
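Illustrative sketch (not from the paper): the entropy-based idea above can be pictured as the drop in entropy when an answer filters the candidate set. The code assumes a uniform belief over the remaining candidates; the candidate pool, the `edible` set, and the function names are hypothetical, and the ConceptNet-based filtering used by the authors is not reproduced here.

```python
import math

def entropy_bits(n_candidates: int) -> float:
    """Shannon entropy (in bits) of a uniform distribution over n candidates."""
    return math.log2(n_candidates) if n_candidates > 0 else 0.0

def information_gain(candidates, keep):
    """Entropy drop after keeping only candidates consistent with an answer.

    `keep` is a predicate standing in for whatever filter the answer implies.
    """
    remaining = [c for c in candidates if keep(c)]
    gain = entropy_bits(len(candidates)) - entropy_bits(len(remaining))
    return gain, remaining

# Hypothetical candidate pool and answer filter (illustration only).
candidates = ["apple", "banana", "hammer", "wrench",
              "sparrow", "eagle", "oak", "pine"]
edible = {"apple", "banana"}

# Question: "Is it edible?" answered "yes" keeps 2 of 8 candidates.
gain, remaining = information_gain(candidates, lambda c: c in edible)
print(f"information gain: {gain:.2f} bits, remaining: {remaining}")
```

Under the uniform assumption, a question that cuts eight candidates to two gains log2(8) - log2(2) = 2 bits; a question that removes no candidates gains nothing.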

Title: Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models

Authors: Mohammad Saim, Phan Anh Duong, Cat Luong, Aniket Bhanderi, Tianyu Jiang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.19595
Pdf URL: https://arxiv.org/pdf/2509.19595
Copy Paste: [[2509.19595]] Anatomy of a Feeling: Narrating Embodied Emotions via Large Vision-Language Models(https://arxiv.org/abs/2509.19595)
Keywords: language model
Abstract: The embodiment of emotional reactions from body parts contains rich information about our affective experiences. We propose a framework that utilizes state-of-the-art large vision-language models (LVLMs) to generate Embodied LVLM Emotion Narratives (ELENA). These are well-defined, multi-layered text outputs, primarily comprising descriptions that focus on the salient body parts involved in emotional reactions. We also employ attention maps and observe that contemporary models exhibit a persistent bias towards the facial region. Despite this limitation, we observe that our employed framework can effectively recognize embodied emotions in face-masked images, outperforming baselines without any fine-tuning. ELENA opens a new trajectory for embodied emotion analysis across the modality of vision and enriches modeling in an affect-aware setting.

Title: Evaluating Language Translation Models by Playing Telephone

Authors: Syeda Jannatus Saba, Steven Skiena
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19611
Pdf URL: https://arxiv.org/pdf/2509.19611
Copy Paste: [[2509.19611]] Evaluating Language Translation Models by Playing Telephone(https://arxiv.org/abs/2509.19611)
Keywords: language model
Abstract: Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models, which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We evaluate evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.

Title: AutoSpec: An Agentic Framework for Automatically Drafting Patent Specification

Authors: Ryan Shea, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19640
Pdf URL: https://arxiv.org/pdf/2509.19640
Copy Paste: [[2509.19640]] AutoSpec: An Agentic Framework for Automatically Drafting Patent Specification(https://arxiv.org/abs/2509.19640)
Keywords: language model, llm, long context, agent
Abstract: Patents play a critical role in driving technological innovation by granting inventors exclusive rights to their inventions. However, the process of drafting a patent application is often expensive and time-consuming, making it a prime candidate for automation. Despite recent advancements in language models, several challenges hinder the development of robust automated patent drafting systems. First, the information within a patent application is highly confidential, which often prevents the use of closed-source LLMs for automating this task. Second, the process of drafting a patent application is difficult for even the most advanced language models due to their long context, technical writing style, and specialized domain knowledge. To address these challenges, we introduce AutoSpec, a secure, agentic framework for Automatically drafting patent Specification. Our approach decomposes the drafting process into a sequence of manageable subtasks, each solvable by smaller, open-source language models enhanced with custom tools tailored for drafting patent specification. To assess our system, we design a novel evaluation protocol in collaboration with experienced patent attorneys. Our automatic and expert evaluations show that AutoSpec outperforms existing baselines on a patent drafting task.

Title: Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections

Authors: Yicheng Yang, Zixian Li, Jean Paul Bizimana, Niaz Zafri, Yongfeng Dong, Tianyi Li
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2509.19657
Pdf URL: https://arxiv.org/pdf/2509.19657
Copy Paste: [[2509.19657]] Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections(https://arxiv.org/abs/2509.19657)
Keywords: language model, gpt, llm, prompt
Abstract: Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver-pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian-driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.

Title: Personality Vector: Modulating Personality of Large Language Models by Model Merging

Authors: Seungjong Sun, Seo Yeon Baek, Jang Hyun Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19727
Pdf URL: https://arxiv.org/pdf/2509.19727
Copy Paste: [[2509.19727]] Personality Vector: Modulating Personality of Large Language Models by Model Merging(https://arxiv.org/abs/2509.19727)
Keywords: language model, llm
Abstract: Driven by the demand for personalized AI systems, there is growing interest in aligning the behavior of large language models (LLMs) with human traits such as personality. Previous attempts to induce personality in LLMs have shown promising results, but they struggle to capture the continuous and multidimensional nature of human traits. In this work, we propose a novel method for personality modulation in LLMs via model merging. Specifically, we construct personality vectors by subtracting the weights of a pre-trained model from those of the fine-tuned model on a given personality trait. By merging personality vectors, we enable LLMs to exhibit desired personality traits without additional training. Extensive experiments show that personality vectors enable continuous control over trait intensity and support the composition of multiple traits. Furthermore, personality vectors transfer across diverse downstream models, suggesting that they encode generalizable representations of personality. Our code is publicly available.
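Illustrative sketch (not the authors' code): the weight arithmetic described above can be shown on toy parameters. A personality vector is the fine-tuned weights minus the pre-trained weights, and merging adds one or more such vectors to a base model. The scaling coefficients, the parameter name `layer.weight`, and the toy values are assumptions made for illustration; the abstract does not specify the exact merging scheme.

```python
def personality_vector(pretrained, finetuned):
    """Per-parameter difference: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }

def merge(base, vectors, scales):
    """Add scaled personality vectors to a copy of the base model's weights.

    `scales` holds one hypothetical intensity coefficient per vector.
    """
    merged = {name: list(weights) for name, weights in base.items()}
    for vec, alpha in zip(vectors, scales):
        for name, delta in vec.items():
            merged[name] = [w + alpha * d for w, d in zip(merged[name], delta)]
    return merged

# Toy weights standing in for real model parameters (illustration only).
pretrained = {"layer.weight": [1.0, 2.0, 3.0]}
finetuned_trait = {"layer.weight": [1.5, 2.0, 2.0]}

vec = personality_vector(pretrained, finetuned_trait)
half_strength = merge(pretrained, [vec], [0.5])
print(vec, half_strength)
```

Varying the coefficient is one way to picture continuous control over trait intensity, and passing several vectors at once corresponds to composing multiple traits.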

Title: HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST

Authors: Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.19742
Pdf URL: https://arxiv.org/pdf/2509.19742
Copy Paste: [[2509.19742]] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST(https://arxiv.org/abs/2509.19742)
Keywords: prompt
Abstract: Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at this https URL.

Title: PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs

Authors: Pei Zhang, Andong Chen, Xi Chen, Baosong Yang, Derek F. Wong, Fei Huang
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2509.19745
Pdf URL: https://arxiv.org/pdf/2509.19745
Copy Paste: [[2509.19745]] PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs(https://arxiv.org/abs/2509.19745)
Keywords: language model, llm
Abstract: Large language models (LLMs) have expanded from text to speech, giving rise to Speech Large Models (SLMs) that support recognition, translation, and synthesis. A key challenge is aligning speech and text representations, which becomes harder in multilingual settings. Existing methods often freeze LLM parameters and train encoders on multilingual data, but this forces cross-language convergence and limits performance. We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. During cross-language training, LLM parameters are dynamically activated, and text-based tasks are later introduced to enhance multilingual understanding. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization. These results demonstrate PART's effectiveness and generality for multilingual speech modality alignment.

Title: CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

Authors: Sina J. Semnani, Han Zhang, Xinyan He, Merve Tekgürler, Monica S. Lam
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.19768
Pdf URL: https://arxiv.org/pdf/2509.19768
Copy Paste: [[2509.19768]] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition(https://arxiv.org/abs/2509.19768)
Keywords: language model
Abstract: Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
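Illustrative sketch (not from the paper): a normalized Levenshtein similarity can be computed as one minus the edit distance divided by the length of the longer string. That normalization is an assumption, since the abstract does not state the exact definition used for the reported scores, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / max length; two empty strings count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

# Hypothetical reference transcription and model output.
reference = "kitten"
prediction = "sitting"
print(f"{normalized_similarity(reference, prediction):.3f}")
```

A score of 1.0 means the prediction matches the reference exactly, and lower values reflect more character-level edits relative to the text length.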

Title: EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation

Authors: Sen Yang, Yu Bao, Yu Lu, Jiajun Chen, Shujian Huang, Shanbo Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19770
Pdf URL: https://arxiv.org/pdf/2509.19770
Copy Paste: [[2509.19770]] EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation(https://arxiv.org/abs/2509.19770)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models' established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release code, datasets, and model checkpoints at this https URL.

Title: bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Authors: Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2509.19775
Pdf URL: https://arxiv.org/pdf/2509.19775
Copy Paste: [[2509.19775]] bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs(https://arxiv.org/abs/2509.19775)
Keywords: language model, llm
Abstract: With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers, such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF), each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

Title: Polarity Detection of Sustainable Detection Goals in News Text

Authors: Andrea Cadeddua, Alessandro Chessa, Vincenzo De Leo, Gianni Fenu, Francesco Osborne, Diego Reforgiato Recupero, Angelo Salatino, Luca Secchi
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2509.19833
Pdf URL: https://arxiv.org/pdf/2509.19833
Copy Paste: [[2509.19833]] Polarity Detection of Sustainable Detection Goals in News Text(https://arxiv.org/abs/2509.19833)
Keywords: language model, llm
Abstract: The United Nations' Sustainable Development Goals (SDGs) provide a globally recognised framework for addressing critical societal, environmental, and economic challenges. Recent developments in natural language processing (NLP) and large language models (LLMs) have facilitated the automatic classification of textual data according to their relevance to specific SDGs. Nevertheless, in many applications, it is equally important to determine the directionality of this relevance; that is, to assess whether the described impact is positive, neutral, or negative. To tackle this challenge, we propose the novel task of SDG polarity detection, which assesses whether a text segment indicates progress toward a specific SDG or conveys an intention to achieve such progress. To support research in this area, we introduce SDG-POD, a benchmark dataset designed specifically for this task, combining original and synthetically generated data. We perform a comprehensive evaluation using six state-of-the-art LLMs, considering both zero-shot and fine-tuned configurations. Our results suggest that the task remains challenging for the current generation of LLMs. Nevertheless, some fine-tuned models, particularly QWQ-32B, achieve good performance, especially on specific Sustainable Development Goals such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land). Furthermore, we demonstrate that augmenting the fine-tuning dataset with synthetically generated examples yields improved model performance on this task. This result highlights the effectiveness of data enrichment techniques in addressing the challenges of this resource-constrained domain. This work advances the methodological toolkit for sustainability monitoring and provides actionable insights into the development of efficient, high-performing polarity detection systems.

Title: TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios

Authors: Ji Yin, Menglan He, Yujie Zhang, Linshuai Zhang, Tingting Ma, Ce Tian, Jie Wu, Lin Xu, Tao Jiang, ((1) School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, China (2) The Acupuncture and Tuina School, Chengdu University of Traditional Chinese Medicine, Chengdu, China (3) Center of Preventive Medicine, Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu, China (4) MD School of Intelligent Medicine Chengdu University of Traditional Chinese Medicine, Liutai Avenue Wenjiang District Chengdu, China (5) MD School of Intelligent Medicine Chengdu University of Traditional Chinese Medicine, Liutai Avenue Wenjiang District Chengdu, China)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19834
Pdf URL: https://arxiv.org/pdf/2509.19834
Copy Paste: [[2509.19834]] TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios(https://arxiv.org/abs/2509.19834)
Keywords: language model, llm
Abstract: Domain-specific LLMs in TCM face limitations in research settings due to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256, epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.

Title: Benchmarking Gaslighting Attacks Against Speech Large Language Models

Authors: Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19858
Pdf URL: https://arxiv.org/pdf/2509.19858
Copy Paste: [[2509.19858]] Benchmarking Gaslighting Attacks Against Speech Large Language Models(https://arxiv.org/abs/2509.19858)
Keywords: language model, llm, prompt
Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, we comprehensively evaluate 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets, which reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.
摘要：随着语音大语模型（语音LLM）越来越多地集成到基于语音的应用中，确保其对操纵或对抗性输入的鲁棒性变得至关重要。尽管先前的工作已经研究了基于文本的LLM和视觉模型中的对抗性攻击，但基于语音的互动的独特认知和感知挑战仍未得到充实。相比之下，语音呈现出固有的歧义，连续性和知觉多样性，这使得对抗攻击更难以检测。在本文中，我们引入了燃气攻击，策略性地制作的提示，旨在误导，覆盖或扭曲模型推理，以此作为评估语音LLM脆弱性的一种手段。具体来说，我们构建了五种操纵策略：愤怒，认知破坏，讽刺，隐性和专业否定，旨在测试各种任务跨越的模型鲁棒性。值得注意的是，我们的框架同时捕获了绩效下降和行为反应，包括未经请求的道歉和拒绝，以诊断易感性的不同维度。此外，进行声学扰动实验以评估多模式鲁棒性。为了量化模型脆弱性，对5个不同数据集的10,000多个测试样本进行了5个语音和多模式LLM的全面评估，显示在5种气光攻击的情况下，平均准确性下降了24.3％，表明行为脆弱性很大。这些发现凸显了需要更具弹性和值得信赖的基于语音的AI系统。

Title: SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

Authors: Alba Maria Marmol-Romero, Manuel Garcia-Vega, Miguel Angel Garcia-Cumbreras, Arturo Montejo-Raez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19861
Pdf URL: https://arxiv.org/pdf/2509.19861
Copy Paste: [[2509.19861]] SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection(https://arxiv.org/abs/2509.19861)
Keywords: language model, llm
Abstract: This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTA Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.
摘要：本文描述了西奈 - 乌贾团队参与ERISK@CLEF 2025实验室。具体来说，我们解决了两个提议的任务：（i）任务2：上下文化抑郁症的早期检测，（ii）试验任务：通过LLMS进行对话抑郁症检测。我们的任务2方法结合了广泛的预处理管道与使用基于变压器的几种模型，例如Roberta Base或Menterroberta大型模型，以捕获多用户对话的上下文和顺序性质。对于试点任务，我们设计了一系列对话策略来与LLM驱动的角色互动，重点是在有限数量的对话回合中最大化信息增益。在任务2中，我们的系统基于F1分数在12个参与球队中排名第8。但是，更深入的分析表明，我们的模型是发出早期预测的最快的模型之一，这是现实部署场景中的关键因素。这突出了早期检测和分类准确性之间的权衡，这表明了在未来工作中共同优化两者的潜在途径。在试点任务中，我们在5个团队中获得了第一名，在所有评估指标中获得了最佳的总体表现：DCHR，ADODL和ASHR。我们在这项任务中的成功证明了结构化对话设计与强大的语言模型相结合的有效性，从而增强了在敏感的心理健康评估环境中部署LLM的可行性。

Title: Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation

Authors: Wei-Hsiang Lin, Sheng-Lun Wei, Hen-Hsen Huang, Hsin-Hsi Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19880
Pdf URL: https://arxiv.org/pdf/2509.19880
Copy Paste: [[2509.19880]] Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation(https://arxiv.org/abs/2509.19880)
Keywords: llm
Abstract: LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
摘要：LLM-As-Gudge框架在AI评估中越来越流行，但是关于模型的产生和判断力之间关系的研究结果仍然不一致。我们通过在11个模型和21个不同任务的系统数据集和实例级分析中调查了这种关系。尽管这两种功能都依赖相同的基本知识，但我们的分析表明，它们仅弱关联，这主要是由于LLMS对所判断的响应的敏感性。为了解决这个问题，我们提出了一种自我参考引导的评估策略，该策略利用模型自己的答案作为参考。这种方法显着加强了发电和判断能力之间的相关性，为这些技能保持一致，并为评估任务中的模型选择提供了可靠的代理。

Title: Future Policy Aware Preference Learning for Mathematical Reasoning

Authors: Minjae Oh, Yunho Choi, Dongmin Choi, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19893
Pdf URL: https://arxiv.org/pdf/2509.19893
Copy Paste: [[2509.19893]] Future Policy Aware Preference Learning for Mathematical Reasoning(https://arxiv.org/abs/2509.19893)
Keywords: language model, llm
Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
摘要：诸如直接偏好优化（DPO）之类的偏好学习方法已成为大型语言模型（LLM）培训后的标准，但它们通常对数学推理无效。一个关键的挑战是，优先和分配轨迹之间的大量令牌重叠；降低分配轨迹的概率也降低了共享有用令牌的可能性，从而导致过度培训和整体性能崩溃。作为缓解措施，现有的算法包括当前策略下轨迹的概率作为正规化项，这会在概率较低时降低梯度的效果。但是，到这种效果达到时，随着模型开始降级，有用的令牌可能已经过度占用。为了解决这个问题，我们提出了未来的政策意识（FPA）偏好学习，该学习在正规化期限内用未来的政策取代了当前的政策。这项未来的策略是通过从参考模型到当前模型的轻质，logit空间外推估计的。 FPA通过先发制于潜在有问题的梯度来实现更安全的培训。我们将FPA应用于DPO，RPO和Simper，并根据数学和GSM8K基准进行评估。 FPA可产生一致的性能增长，并在Simper观察到最大的改进，达到高达5.75％的增长。我们证明，FPA在保留共享，有用的数学令牌的可能性的同时提供了主动的正则化，并通过可忽略的计算开销实现了更长的无降解训练。我们将在出版时公开发布我们的代码。
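
Illustrative sketch (not the authors' code): the abstract says the future policy is estimated by logit-space extrapolation from a reference model toward the current model. One simple way to read this is a linear step past the current logits along the reference-to-current direction. The function names, the linear form, and the step size `lam` are assumptions made for this example; the paper may define the estimate differently.

```python
import math


def extrapolate_logits(ref_logits, cur_logits, lam=1.0):
    """Assumed form: future = current + lam * (current - reference), per token."""
    if len(ref_logits) != len(cur_logits):
        raise ValueError("logit vectors must have the same length")
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Hypothetical logits for a 3-token vocabulary.
ref = [2.0, 1.0, 0.0]   # reference model
cur = [1.0, 1.0, 0.0]   # current policy: token 0 has already been pushed down
future = extrapolate_logits(ref, cur, lam=1.0)

# Token 0 is less likely under the extrapolated policy than under the current one,
# so a regularizer based on the future policy would react to the decline earlier.
cur_logp = log_softmax(cur)
future_logp = log_softmax(future)
```

With `lam=0.0` the estimate reduces to the current policy, which is the setting the abstract attributes to existing algorithms.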

Title: WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Authors: Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19902
Pdf URL: https://arxiv.org/pdf/2509.19902
Copy Paste: [[2509.19902]] WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction(https://arxiv.org/abs/2509.19902)
Keywords: language model, llm
Abstract: In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at this https URL
摘要：在本文中，我们向West（We Speak Toolkit）展示了基于大型语言模型（LLM）的语音工具包，用于语音理解，生成和互动。西方有三个关键特征：1）完全基于LLM的：站在巨人的肩膀上，通过重复成熟的体系结构，生态系统（例如，拥抱面）和大型模型的方法（例如，序列包装）。 2）全栈：支持诸如识别，综合，理解，对话和多模式功能之类的任务，并具有可扩展的开源模型。 3）简单而愚蠢的：一个简单而愚蠢的语音工具包，每个人都可以触摸。此外，West提供两种类型的食谱，模型和实验结果。第一个完全基于开源模型和开源数据，使用户可以在本文中充分重现实验，并用作验证系统或最小系统基线。第二个经过大量数据训练，提供出色的性能，因此用户可以将其直接应用于开箱即用。 West在此HTTPS URL上公开可用

Title: The Knowledge-Behaviour Disconnect in LLM-based Chatbots

Authors: Jan Broersen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20004
Pdf URL: https://arxiv.org/pdf/2509.20004
Copy Paste: [[2509.20004]] The Knowledge-Behaviour Disconnect in LLM-based Chatbots(https://arxiv.org/abs/2509.20004)
Keywords: language model, gpt, llm, hallucination, chat, agent
Abstract: Large language model-based artificial conversational agents (like ChatGPT) give answers to all kinds of questions, and often enough these answers are correct. Just on the basis of that capacity alone, we may attribute knowledge to them. But do these models use this knowledge as a basis for their own conversational behaviour? I argue this is not the case, and I will refer to this failure as a 'disconnect'. I further argue this disconnect is fundamental in the sense that with more data and more training of the LLM on which a conversational chatbot is based, it will not disappear. The reason is, as I will claim, that the core technique used to train LLMs does not allow for the establishment of the connection we are after. The disconnect reflects a fundamental limitation on the capacities of LLMs, and explains the source of hallucinations. I will furthermore consider the ethical version of the disconnect (ethical conversational knowledge not being aligned with ethical conversational behaviour), since in this domain researchers have come up with several additional techniques to influence a chatbot's behaviour. I will discuss how these techniques do nothing to solve the disconnect and can make it worse.
摘要：基于语言模型的大型人工会话代理（如chatgpt）给出了各种问题的答案，而且这些答案通常是正确的。仅仅基于这种能力，我们可能会将知识归因于他们。但是，这些模型是否将这些知识用作自己的对话行为的基础？我认为事实并非如此，我将这种失败称为“断开连接”。我进一步认为，这种脱节是基本的，因为随着更多的数据和对会话聊天机器人所基于的LLM的更多培训，它将不会消失。正如我将声称的那样，原因是用于培训LLM的核心技术不允许建立我们所追求的连接。断开连接反映了对LLM的能力的基本限制，并解释了幻觉的来源。我将考虑截断的道德版本（道德对话知识与道德对话行为不符），因为在该领域中，研究人员提出了几种其他技术来影响聊天机器人的行为。我将讨论这些技术如何无助于解决断开连接并使情况变得更糟。

Title: DiffNator: Generating Structured Explanations of Time-Series Differences

Authors: Kota Dohi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20007
Pdf URL: https://arxiv.org/pdf/2509.20007
Copy Paste: [[2509.20007]] DiffNator: Generating Structured Explanations of Time-Series Differences(https://arxiv.org/abs/2509.20007)
Keywords: llm
Abstract: In many IoT applications, the central interest lies not in individual sensor signals but in their differences, yet interpreting such differences requires expert knowledge. We propose DiffNator, a framework for structured explanations of differences between two time series. We first design a JSON schema that captures the essential properties of such differences. Using the Time-series Observations of Real-world IoT (TORI) dataset, we generate paired sequences and train a model that combines a time-series encoder with a frozen LLM to output JSON-formatted explanations. Experimental results show that DiffNator generates accurate difference explanations and substantially outperforms both a visual question answering (VQA) baseline and a retrieval method using a pre-trained time-series encoder.
摘要：在许多物联网应用中，中心利益不在于单个传感器信号，而在于它们的差异，但是解释这种差异需要专家知识。我们提出了散射器，这是两个时间序列之间差异的结构化解释的框架。我们首先设计一个JSON模式，该模式捕获了这种差异的基本属性。使用现实世界IoT（Tori）数据集的时间序列观察，我们生成配对序列并训练一个模型，该模型将时间序列编码器与冷冻LLM结合在一起，以输出JSON形式的解释。实验结果表明，Diffnator会产生准确的差异解释，并且使用预先训练的时间序列编码器既优于视觉问题答案（VQA）基线和检索方法。

Title: Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks

Authors: Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, Fabio Rinaldi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20045
Pdf URL: https://arxiv.org/pdf/2509.20045
Copy Paste: [[2509.20045]] Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks(https://arxiv.org/abs/2509.20045)
Keywords: llm
Abstract: Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs often might mask deeper mismatches at the script or token level.
摘要：方言数据的特征是语言差异，这对人类看来很小，但对模型的性能有重大影响。然而，这种方言差距与各种因素（例如数据规模，经济和社会因素）有关，但事实证明，这些因素的影响是不一致的。在这项工作中，我们调查了对模型性能的影响的因素：我们将令牌化平价（TP）和信息奇偶校验（IP）相关联，作为预先训练的多语言模型中代表性偏见的度量，以及下游性能。我们将最新的仅解码器LLM与基于编码器的模型进行了三个任务进行比较：方言分类，主题分类和提取性问题答案，控制不同的脚本（拉丁语与非LATIN）和资源可用性（高与低）。我们的分析表明，TP可以更好地预测依赖于句法和形态学提示的任务的性能（例如，提取性质量请访问），而IP可以更好地预测语义任务中的性能（例如主题分类）。互补分析，包括令牌机行为，词汇覆盖和定性见解，表明LLM的语言支持主张通常可能会在脚本或令牌级别上掩盖更深层的不匹配。

Title: From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors

Authors: Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20065
Pdf URL: https://arxiv.org/pdf/2509.20065
Copy Paste: [[2509.20065]] From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors(https://arxiv.org/abs/2509.20065)
Keywords: language model
Abstract: Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.
摘要：语言模型通常会在惯用，象征性或上下文敏感的输入中遇到困难，不是因为它们产生有缺陷的输出，而是因为它们从一开始就误解了输入。我们提出了一种仅输入方法，用于预测使用令牌级的可能性功能，灵感来自于惊奇和统一信息密度假设。这些功能捕获了五个语言具有挑战性的数据集中的输入理解和优于标准基线的局部不确定性。我们表明，跨定位的特征可改善较大模型的错误检测，而较小的模型则受益于全局模式。我们的方法不需要访问输出或隐藏激活，提供了一种轻巧且可推广的方法来进行产生前错误预测。
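
Illustrative sketch (not the authors' feature set): the abstract describes input-only, token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. Surprisal of a token is the negative log of its probability, and UID-style features commonly summarize how evenly surprisal is spread across the input. The function names and the particular summary statistics below are assumptions for this example, and the token probabilities are made up.

```python
import math


def surprisals(token_probs):
    """Per-token surprisal in bits: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]


def likelihood_features(token_probs):
    """Summarize surprisal over an input: mean, variance (UID-style), and peak."""
    s = surprisals(token_probs)
    mean = sum(s) / len(s)
    variance = sum((x - mean) ** 2 for x in s) / len(s)
    return {"mean": mean, "variance": variance, "max": max(s)}


# Hypothetical probabilities a model assigns to each input token.
even_input = [0.5, 0.5, 0.5]      # information spread uniformly
spiky_input = [0.5, 0.125]        # one token is much more surprising

even_features = likelihood_features(even_input)
spiky_features = likelihood_features(spiky_input)
```

Features like these need only the model's probabilities over the input tokens, which matches the abstract's claim that no outputs or hidden activations are required.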

Title: From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Authors: Tianqiao Liu, Xueyi Li, Hao Wang, Haoxuan Li, Zhichao Chen, Weiqi Luo, Zitao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20072
Pdf URL: https://arxiv.org/pdf/2509.20072
Copy Paste: [[2509.20072]] From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training(https://arxiv.org/abs/2509.20072)
Keywords: language model, llm
Abstract: Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.
摘要：大型语言模型的最新进展引起了人们对将其能力扩展到多模式情景的重大兴趣，特别是对于语音 - 语音流出的对话系统。但是，现有的多模型处理交错音频和文本（例如Moshi）需要复杂的多阶段训练管道，从而产生了实质性的计算成本。此外，这些模型均匀地对文本和音频令牌应用自回归产生，忽略其依赖性结构中的基本不对称性：虽然文本令牌表现出强大的目标目标依赖性需要因果关系订购，但音频令牌主要是由源目标驱动的，而源目标则主要是由源依赖的源依赖于源文本，而不是源文本，而不是源代码的autio and Audio，而不是源代码。在这项工作中，我们提出了TTT，这是一种统一的音频文本建模框架，将AR文本生成与非自动回程音频扩散集成在从预读的LLM初始化的单个变压器体系结构中。

Title: OLaPh: Optimal Language Phonemizer

Authors: Johannes Wirth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20086
Pdf URL: https://arxiv.org/pdf/2509.20086
Copy Paste: [[2509.20086]] OLaPh: Optimal Language Phonemizer(https://arxiv.org/abs/2509.20086)
Keywords: language model, llm
Abstract: Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
摘要：音素化（文本转换为音素）是文本到语音的关键步骤。传统方法使用基于规则的转换和词典查找，而更高级的方法应用了预处理技术或神经网络，以提高室外词汇的准确性。但是，所有系统都在为名称，借出词，缩写和同型物所努力。这项工作介绍了Olaph（最佳语言音符器），该框架结合了大型词典，多个NLP技术和复合分辨率和概率评分函数。在德语和英语中的评估表明，与以前的方法（包括具有挑战性的数据集）相比，准确性提高了。为了进一步解决未解决的案例，我们培训了一个大型语言模型，以实现Olaph生成的数据，从而实现了更强的概括和性能。框架和LLM共同提高了音素化的一致性，并为将来的研究提供了免费的资源。

Title: Causal Understanding by LLMs: The Role of Uncertainty

Authors: Oscar Lithgow-Serrano, Vani Kanjirangat, Alessandro Antonucci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20088
Pdf URL: https://arxiv.org/pdf/2509.20088
Copy Paste: [[2509.20088]] Causal Understanding by LLMs: The Role of Uncertainty(https://arxiv.org/abs/2509.20088)
Keywords: gpt, llm
Abstract: Recent papers show LLMs achieve near-random accuracy in causal relation classification, raising questions about whether such failures arise from limited pretraining exposure or deeper representational gaps. We investigate this under uncertainty-based evaluation, testing whether pretraining exposure to causal examples improves causal understanding on >18K PubMed sentences -- half from The Pile corpus, half post-2024 -- across seven models (Pythia-1.4B/7B/12B, GPT-J-6B, Dolly-7B/12B, Qwen-7B). We analyze model behavior through: (i) causal classification, where the model identifies causal relationships in text, and (ii) verbatim memorization probing, where we assess whether the model prefers previously seen causal statements over their paraphrases. Models perform four-way classification (direct/conditional/correlational/no-relationship) and select between originals and their generated paraphrases. Results show almost identical accuracy on seen/unseen sentences (p > 0.05), no memorization bias (24.8% original selection), and an almost flat output distribution over the possible options, with entropy values near the maximum (1.35/1.39), confirming random guessing. Instruction-tuned models show severe miscalibration (Qwen: > 95% confidence, 32.8% accuracy, ECE=0.49). Conditional relations induce the highest entropy (+11% vs. direct). These findings suggest that failures in causal understanding arise from the lack of structured causal representation, rather than insufficient exposure to causal examples during pretraining.
摘要：最近的论文显示，LLMS在因果关系分类中达到了几乎随机的准确性，提出了有关这种故障是由于预算验而导致的，还是更深层的代表性差距引起的问题。我们在基于不确定性的评估下对此进行了调查，测试了预处理因果示例是否改善了因果理解> 18K PubMed句子（一半来自桩copcus，一半，一半，在2024年以后的一半）（Pythia-1.4b/7b/7b/7b/12b，gpt-j-j-6b，gpt-j-6b，dolly-j-6b，dolly-j-7b，dolly-7b，dolly-7b，dolly-7b/12b，qwen-qwen-7b，qwen-7b）。我们通过以下方式分析模型行为：（i）因果分类，该模型在文本中识别因果关系，以及（ii）逐字记忆探测，我们评估该模型是否喜欢以前看到的因果陈述而不是其释义。模型执行四向分类（直接/条件/相关/无关关系），并在原件及其生成的释义之间进行选择。结果表明，看到/看不见的句子（p> 0.05），没有记忆偏见（原始选择24.8％），可能的选项上的输出分布几乎是平坦的，熵值接近最大值（1.35/1.39），确认随机猜测几乎相同。指导调整的模型显示出严重的错误校准（QWEN：> 95％的置信度，精度为32.8％，ECE = 0.49）。条件关系诱导最高熵（+11％与直接）。这些发现表明，因果理解中的失败是由于缺乏结构化因果代表，而不是在训练过程中对因果的实例不足。
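
Worked example (a sketch, not from the paper): the reported maximum of 1.39 matches the Shannon entropy, in nats, of a uniform distribution over the four relation classes, since ln 4 ≈ 1.386. The code below computes that ceiling and shows that a near-flat distribution lands close to it. The example probabilities are invented for illustration.

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability options contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Four-way classification: direct / conditional / correlational / no-relationship.
max_entropy = math.log(4)                        # uniform guess, about 1.386 nats
uniform = entropy_nats([0.25, 0.25, 0.25, 0.25])

# Hypothetical near-flat model output: entropy stays close to the maximum.
near_flat = entropy_nats([0.3, 0.3, 0.2, 0.2])   # about 1.366 nats

# A confident prediction has zero entropy.
confident = entropy_nats([1.0, 0.0, 0.0, 0.0])
```

An observed value of 1.35 against a ceiling of about 1.39 therefore indicates output distributions that are nearly indistinguishable from uniform guessing.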

Title: Integrated Framework for LLM Evaluation with Answer Generation

Authors: Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20097
Pdf URL: https://arxiv.org/pdf/2509.20097
Copy Paste: [[2509.20097]] Integrated Framework for LLM Evaluation with Answer Generation(https://arxiv.org/abs/2509.20097)
Keywords: language model, llm, hallucination
Abstract: Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.
摘要：对大语言模型的可靠评估对于确保其在实际情况下的适用性至关重要。传统的基于基准的评估方法通常依赖于固定的参考答案，从而限制了它们捕获生成反应的重要定性方面的能力。为了解决这些缺点，我们提出了一个综合评估框架，称为\ textit {使用专家驱动的诊断}，速度}，该框架利用专业功能专家来对模型输出进行全面的描述性分析。与传统的方法不同，速度可以积极地纳入跨多个维度的专家反馈，包括幻觉检测，毒性评估和词汇上下文适当性。实验结果表明，速度可以实现各种域和数据集的稳健和一致的评估性能。此外，通过采用相对紧凑的专家模型，与大型评估者相比，速度证明了较高的资源效率。这些发现表明，速度显着提高了LLM评估中的公平性和可解释性，为现有评估方法提供了有希望的替代方案。

Title: Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation

Authors: Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wud, Zichen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20162
Pdf URL: https://arxiv.org/pdf/2509.20162
Copy Paste: [[2509.20162]] Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation(https://arxiv.org/abs/2509.20162)
Keywords: language model, llm
Abstract: Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at this https URL.
摘要：大型语言模型（LLM）在特定于领域的任务上通常表现出有限的性能，因为在其培训数据中对专业信息的自然表示以及这些数据集的静态性质。知识稀缺性和时间滞后为域应用创造知识差距。尽管域数据集上的后培训可以将知识嵌入模型中，但现有方法有一些局限性。持续的预训练（CPT）以同等的重视对待域文档中的所有令牌，而未能优先考虑关键知识点，而监督的微调（SFT）则与问答式的斗争斗争，以开发复杂推理任务所需的一致性知识结构。为了应对这些挑战，我们建议从增强生成（RLAG）中进行加强学习。我们的方法在采样世代之间迭代循环，并通过计算出的奖励来优化模型，有效地嵌入了关键和上下文相干域知识。我们选择具有最高日志概率的生成的输出作为采样结果，然后计算三个量身定制的奖励指标来指导优化过程。为了全面评估领域的专业知识，我们评估了答案的准确性以及为正确回答问题而生成的解释的合理性。医疗，法律，天文学和时事数据集的实验结果表明，我们提出的方法显着超过了基线方法。我们的代码和数据是在此HTTPS URL上开源的。

Title: Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian

Authors: Ghazal Kalhor, Behnam Bahrak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20168
Pdf URL: https://arxiv.org/pdf/2509.20168
Copy Paste: [[2509.20168]] Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian(https://arxiv.org/abs/2509.20168)
Keywords: language model, gpt, llm
Abstract: Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.
摘要：多语言大型语言模型（LLMS）在全球范围内越来越多地使用，这对于确保他们没有性别偏见以防止代表性危害至关重要。虽然先前的研究已经检查了高农产品语言的这种偏见，但低资源语言仍在研究。在本文中，我们提出了一种基于模板的探测方法，可针对现实世界数据进行验证，以发现LLMS中的性别刻板印象。作为此框架的一部分，我们介绍了特定领域的性别偏斜指数（DS-GSI），该指标量化了与性别平价的偏差。我们跨四个语义域，评估了四个突出的模型，即GPT-4O Mini，DeepSeek R1，Gemini 2.0 Flash和QWEN QWQ 32B，重点是波斯语，波斯语是一种具有独特语言特征的低资源语言。我们的结果表明，所有模型都表现出性别刻板印象，波斯语的差异比在所有领域的英语中都更大。其中，运动反映了最严格的性别偏见。这项研究强调了包容性NLP实践的需求，并提供了评估其他低资源语言偏见的框架。

Title: Thinking Augmented Pre-training

Authors: Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.20186
Pdf URL: https://arxiv.org/pdf/2509.20186
Copy Paste: [[2509.20186]] Thinking Augmented Pre-training(https://arxiv.org/abs/2509.20186)
Keywords: language model, llm
Abstract: This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.
摘要：本文通过使用思维轨迹来增强现有文本数据来提高大语模型（LLM）培训的数据效率（LLM）培训的数据效率。训练前LLM的计算以前所未有的速度增长，而高质量数据的可用性仍然有限。因此，最大化可用数据的实用性构成了重大的研究挑战。主要的障碍是，鉴于固定模型的能力，很难学习某些高质量的代币，因为单个令牌的基本原理可能非常复杂且深度很深。为了解决这个问题，我们建议思考增强预训练（TPT），这是一种通用方法论，可以通过自动生成的思维轨迹增强文本。这种增强有效地增加了训练数据的数量，并通过分步推理和分解使高质量的代币更加可学习。我们将TPT应用于最高$ 100 $ b代币的各种培训配置，包括预先培训，并通过限制和丰富的数据以及强大的开源检查点进行了中期培训。实验结果表明，我们的方法显着提高了各种模型和家庭中LLM的性能。值得注意的是，TPT提高了LLM预培训的数据效率，提高了$ 3 $。对于$ 3 $ b的参数模型，它在几种具有挑战性的推理基准上，将培训后的性能提高了10美元。

Title: Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs

Authors: Parker Glenn, Alfy Samuel, Daben Liu
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2509.20208
Pdf URL: https://arxiv.org/pdf/2509.20208
Copy Paste: [[2509.20208]] Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs(https://arxiv.org/abs/2509.20208)
Keywords: language model, llm
Abstract: Integrating LLM-powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of open-source language models of various sizes to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at this https URL
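Summary: Integrating LLM-powered operators into declarative query languages makes it possible to combine cheap, interpretable functions with powerful, generalizable language model reasoning. However, to benefit from the optimized execution of a database query language such as SQL, generated outputs must conform to the rules enforced by both type checkers and database contents. Current approaches address this with orchestrations made up of many LLM-based post-processing calls that align generated outputs with database values, which introduces performance bottlenecks. We study how well open-source language models of various sizes can parse and execute functions in a SQL-based query language, and show that small language models can excel as function executors over hybrid data sources. We then propose an efficient solution for enforcing the well-typedness of LLM functions, demonstrating a 7% accuracy improvement on a multi-hop question answering dataset and a 53% latency improvement over comparable solutions. We make our implementation available at this https URL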
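Example: the abstract's requirement that generated outputs agree with both type checkers and database contents can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the authors' method: the function name `coerce_llm_output`, the type names, and the `allowed_values` parameter are all hypothetical, and the check is a plain post-hoc validator rather than the efficient enforcement the paper proposes.

```python
import re

# Hypothetical validators keyed by a declared SQL-style type (an assumption,
# not taken from the paper).
_PATTERNS = {
    "INTEGER": re.compile(r"[+-]?\d+"),
    "REAL": re.compile(r"[+-]?\d+(?:\.\d+)?"),
    "BOOLEAN": re.compile(r"true|false", re.IGNORECASE),
}


def coerce_llm_output(raw, sql_type, allowed_values=None):
    """Return `raw` as a well-typed value, or raise ValueError.

    `allowed_values` stands in for "database contents": when given, the
    output must match one of the stored values (case-insensitively).
    """
    text = raw.strip()
    if allowed_values is not None:
        by_lower = {value.lower(): value for value in allowed_values}
        if text.lower() in by_lower:
            return by_lower[text.lower()]
        raise ValueError(f"{text!r} is not a value present in the column")
    if sql_type == "TEXT":
        return text
    pattern = _PATTERNS.get(sql_type)
    if pattern is None:
        raise ValueError(f"unsupported type: {sql_type}")
    if not pattern.fullmatch(text):
        raise ValueError(f"{text!r} is not a valid {sql_type}")
    if sql_type == "INTEGER":
        return int(text)
    if sql_type == "REAL":
        return float(text)
    return text.lower() == "true"  # BOOLEAN
```

A call such as `coerce_llm_output(" 42 ", "INTEGER")` yields an integer that a SQL engine can compare or aggregate directly, while a free-text answer such as "forty-two" is rejected instead of silently breaking the query.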

Title: Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models

Authors: Yu Wang, Leyi Lao, Langchu Huang, Gabriel Skantze, Yang Xu, Hendrik Buschmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20237
Pdf URL: https://arxiv.org/pdf/2509.20237
Copy Paste: [[2509.20237]] Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models(https://arxiv.org/abs/2509.20237)
Keywords: language model
Abstract: Backchannels and fillers are important linguistic expressions in dialogue, but are under-represented in modern transformer-based language models (LMs). Our work studies the representation of them in language models using three fine-tuning strategies. The models are trained on three dialogue corpora in English and Japanese, where backchannels and fillers are preserved and annotated, to investigate how fine-tuning can help LMs learn their representations. We first apply clustering analysis to the learnt representation of backchannels and fillers, and find increased silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish the nuanced semantic variation in different backchannel and filler use. We also use natural language generation (NLG) metrics to confirm that the utterances generated by fine-tuned language models resemble human-produced utterances more closely. Our findings suggest the potential of transforming general LMs into conversational LMs that are more capable of producing human-like language.
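Summary: Backchannels and fillers are important linguistic expressions in dialogue, but they are under-represented in modern transformer-based language models (LMs). Our work studies how they are represented in language models under three fine-tuning strategies. The models are trained on three dialogue corpora in English and Japanese in which backchannels and fillers are preserved and annotated, to investigate how fine-tuning helps LMs learn their representations. We first apply clustering analysis to the learned representations of backchannels and fillers and find higher silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish nuanced semantic variation across different backchannel and filler uses. We also use natural language generation (NLG) metrics to confirm that utterances generated by fine-tuned models resemble human-produced utterances more closely. Our findings suggest the potential of turning general LMs into conversational LMs that are better able to produce human-like language.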
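Example: the silhouette score used in the clustering analysis can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and toy 2-D points in place of real LM representations; the abstract does not say which distance or clustering setup the authors used, and the function name is ours.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for token representations (assumed data, not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(points, [0, 0, 1, 1])
badly_assigned = silhouette_score(points, [0, 1, 0, 1])
```

Scores near 1 mean that each point sits much closer to its own cluster than to any other, which is the sense in which a higher score after fine-tuning suggests more distinguishable backchannel and filler uses.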

Title: Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage

Authors: Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20278
Pdf URL: https://arxiv.org/pdf/2509.20278
Copy Paste: [[2509.20278]] Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage(https://arxiv.org/abs/2509.20278)
Keywords: llm, prompt
Abstract: Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.
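Summary: Large language model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, its limitations, especially those stemming from prompt design, remain underexplored. Because users may supply biased or incomplete prompts, often unintentionally, LLMs can be misled, which undermines reliability and creates risks. We call this vulnerability the Instruction Boundary. To investigate it, we distill the phenomenon into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that the reliability of LLM reasoning can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.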

Title: SIM-CoT: Supervised Implicit Chain-of-Thought

Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20317
Pdf URL: https://arxiv.org/pdf/2509.20317
Copy Paste: [[2509.20317]] SIM-CoT: Supervised Implicit Chain-of-Thought(https://arxiv.org/abs/2509.20317)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Implicit Chain-of-Thought (CoT) methods present a promising, token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited the application of implicit CoT. We identify a core latent instability issue by scaling the computational budget of implicit CoT approaches: as we increase the number of implicit reasoning tokens to enhance performance, the training process often becomes unstable and collapses. Our analysis reveals that this instability arises from the latent representations becoming homogeneous and losing their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this issue, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The proposed auxiliary decoder is removed during inference, preserving the computational efficiency of implicit CoT methods with no added overhead. In addition, the auxiliary decoder affords interpretability of implicit reasoning by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly enhances both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines like Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models like LLaMA-3.1 8B.
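Summary: Implicit Chain-of-Thought (CoT) methods offer a promising, token-efficient alternative to explicit CoT reasoning in large language models (LLMs), but a persistent performance gap has limited their application. By scaling the computational budget of implicit CoT approaches, we identify a core latent instability issue: as the number of implicit reasoning tokens is increased to improve performance, training often becomes unstable and collapses. Our analysis shows that this instability arises when the latent representations become homogeneous and lose their semantic diversity, a failure caused by insufficient step-level supervision in existing implicit CoT approaches. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. Specifically, SIM-CoT uses an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring that latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the computational efficiency of implicit CoT methods with no added overhead. The auxiliary decoder also makes implicit reasoning interpretable by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization of semantic roles and diagnosis. SIM-CoT significantly improves both the in-domain accuracy and out-of-domain stability of various implicit CoT methods, boosting baselines such as Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. Demonstrating strong scalability, SIM-CoT also surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while substantially closing the performance gap on larger models such as LLaMA-3.1 8B.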

Title: Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

Authors: Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2509.20319
Pdf URL: https://arxiv.org/pdf/2509.20319
Copy Paste: [[2509.20319]] Z-Scores: A Metric for Linguistically Assessing Disfluency Removal(https://arxiv.org/abs/2509.20319)
Keywords: llm, prompt
Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
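Summary: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level, linguistically grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores let researchers identify model failure modes and design targeted interventions, such as tailored prompts or data augmentation, that yield measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies that are hidden in aggregate F1, directly informing model refinement strategies.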
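Example: the contrast between aggregate word-level scores and category-specific diagnostics can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the word-level precision, recall, and F1 are the standard definitions, but `per_category_removal_rate` is a hypothetical token-level simplification (the paper's Z-Scores are span-level and rely on an alignment module not reproduced here), and the toy utterance is invented.

```python
from collections import defaultdict


def word_level_prf(gold_tags, removed):
    """Precision, recall, F1 for the 'remove this token' decision.

    gold_tags[i] is a disfluency category such as "INTJ", or None for a
    fluent token; removed[i] is True if the system deleted token i.
    """
    pairs = list(zip(gold_tags, removed))
    tp = sum(1 for tag, r in pairs if tag is not None and r)
    fp = sum(1 for tag, r in pairs if tag is None and r)
    fn = sum(1 for tag, r in pairs if tag is not None and not r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def per_category_removal_rate(gold_tags, removed):
    """Fraction of gold disfluent tokens removed, per category (hypothetical)."""
    seen = defaultdict(int)
    hit = defaultdict(int)
    for tag, r in zip(gold_tags, removed):
        if tag is None:
            continue
        seen[tag] += 1
        hit[tag] += int(r)
    return {tag: hit[tag] / seen[tag] for tag in seen}


# Invented utterance: "uh i i want you know a flight"
gold = ["INTJ", "EDITED", None, None, "PRN", "PRN", None, None]
removed = [False, True, False, False, True, True, False, False]

precision, recall, f1 = word_level_prf(gold, removed)
rates = per_category_removal_rate(gold, removed)
```

In this toy case the aggregate scores look healthy (precision 1.0, recall 0.75), while the per-category view shows that every INTJ token was missed, which is the kind of failure the abstract says aggregate F1 hides.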

Title: DRES: Benchmarking LLMs for Disfluency Removal

Authors: Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2509.20321
Pdf URL: https://arxiv.org/pdf/2509.20321
Copy Paste: [[2509.20321]] DRES: Benchmarking LLMs for Disfluency Removal(https://arxiv.org/abs/2509.20321)
Keywords: llm, prompt, agent
Abstract: Disfluencies -- such as "um," "uh," interjections, parentheticals, and edited statements -- remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results reveal that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization abilities. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.
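Summary: Disfluencies such as "um," "uh," interjections, parentheticals, and edited statements remain a persistent challenge for speech-driven systems, degrading accuracy in command interpretation, summarization, and conversational agents. We introduce DRES (Disfluency Removal Evaluation Suite), a controlled text-level benchmark that establishes a reproducible semantic upper bound for this task. DRES builds on human-annotated Switchboard transcripts, isolating disfluency removal from ASR errors and acoustic variability. We systematically evaluate proprietary and open-source LLMs across scales, prompting strategies, and architectures. Our results show that (i) simple segmentation consistently improves performance, even for long-context models; (ii) reasoning-oriented models tend to over-delete fluent tokens; and (iii) fine-tuning achieves near state-of-the-art precision and recall but harms generalization. We further present a set of LLM-specific error modes and offer nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. DRES provides a reproducible, model-agnostic foundation for advancing robust spoken-language systems.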

Title: EmbeddingGemma: Powerful and Lightweight Text Representations

Authors: Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divya Sreepat, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.20354
Pdf URL: https://arxiv.org/pdf/2509.20354
Copy Paste: [[2509.20354]] EmbeddingGemma: Powerful and Lightweight Text Representations(https://arxiv.org/abs/2509.20354)
Keywords: language model
Abstract: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
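Summary: We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our training recipe strategically captures knowledge from larger models through encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models with fewer than 500M parameters, both proprietary and open, and delivers performance comparable to models twice its size, offering an exceptional performance-to-cost ratio. This lead persists when model weights are quantized or embedding outputs are truncated. This makes EmbeddingGemma particularly well suited to low-latency, high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.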
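Example: truncating embedding outputs, as mentioned in the abstract, usually means keeping a prefix of each vector and renormalizing before comparison. This is a minimal sketch assuming that the leading dimensions carry most of the signal; the abstract does not describe how truncation is applied, and the function names and toy vectors are ours, not part of any EmbeddingGemma API.

```python
import math


def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors (assumed data); real embeddings are much larger.
query = [3.0, 4.0, 0.1, 0.2]
document = [3.0, 4.0, -0.2, 0.1]

query_2d = truncate_and_renormalize(query, 2)
document_2d = truncate_and_renormalize(document, 2)
similarity_2d = cosine_similarity(query_2d, document_2d)
```

Shorter vectors cut storage and dot-product cost roughly in proportion to the kept dimension, which is why robustness to truncation matters for the on-device use cases the abstract mentions.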

Title: Language Models that Think, Chat Better

Authors: Adithya Bhaskar, Xi Ye, Danqi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.20357
Pdf URL: https://arxiv.org/pdf/2509.20357
Copy Paste: [[2509.20357]] Language Models that Think, Chat Better(https://arxiv.org/abs/2509.20357)
Keywords: language model, gpt, prompt, chat
Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks -- such as writing outline essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.
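Summary: Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR generalizes poorly to open-ended tasks, such as writing outline essays or making meal plans, where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before responding, and optimizes them with online RL against a preference-based reward model of the kind used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks such as creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, a Llama-3.1-8B base model trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct, which was post-trained with a complex multi-stage pipeline on 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call on future work to understand and employ thinking more broadly.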