2025-09-15

Title: Cross-Layer Attention Probing for Fine-Grained Hallucination Detection

Authors: Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, Nirmalie Wiratunga
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09700
Pdf URL: https://arxiv.org/pdf/2509.09700
Copy Paste: [[2509.09700]] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection(https://arxiv.org/abs/2509.09700)
Keywords: language model, llm, hallucination, prompt
Abstract: With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.
摘要：随着在各种应用中大规模采用大语言模型（LLM），由于其倾向于产生不准确文本的趋势，即幻觉，因此可靠性越来越大。在这项工作中，我们提出了跨层注意探测（CLAP），这是一种用于幻觉检测的新型激活探测技术，该技术将整个残留流的LLM激活作为关节序列处理。我们使用五个LLM和三个任务进行的经验评估表明，与刺激的解码反应的基准相比，拍手改善了幻觉的检测，以及在较高温度下采样的反应，从而可以在不同的采样响应中散发出精细粒度的检测，即能够消除不同的幻觉和非障碍响应。与直接缓解方法相比，这使我们能够使用拍手提出一种检测到的策略，以减少幻觉并提高LLM的可靠性。最后，我们表明，即使分布式分发，拍手也保持高可靠性。

Title: Creativity Benchmark: A benchmark for marketing creativity for LLM models

Authors: Ninad Bhat, Kieran Browne, Pip Bingemann
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.09702
Pdf URL: https://arxiv.org/pdf/2509.09702
Copy Paste: [[2509.09702]] Creativity Benchmark: A benchmark for marketing creativity for LLM models(https://arxiv.org/abs/2509.09702)
Keywords: language model, llm, prompt
Abstract: We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
摘要：我们介绍了创造力基准，这是一个针对营销创造力的大语言模型（LLM）评估框架。该基准涵盖100个品牌（12个类别）和三种提示类型（见解、想法、狂野的想法）。来自678名从业创意人员、共11,012次匿名比较的人类成对偏好，经Bradley-Terry模型分析后显示各模型性能紧密聚集，没有任何模型在各品牌或提示类型上占据主导：最高与最低之间的差距为 $\Delta\theta \approx 0.45$，对应的正面对决胜率为 $0.61$；评分最高的模型仅在约 $61\%$ 的情况下击败评分最低的模型。我们还使用余弦距离分析了模型多样性，以捕获模型内和模型间的变化以及对提示重构的敏感性。将三种LLM-as-judge设置与人类排名进行比较，发现相关性较弱且不一致，并存在评审者特有的偏差，这表明自动化评审无法替代人类评估。常规的创造力测试也只能部分迁移到受品牌约束的任务。总体而言，结果凸显了对专家人类评估和多样性感知工作流程的需求。
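Worked example (not from the paper): the step from a Bradley-Terry gap of about 0.45 to a win probability of about 0.61 can be checked with the logistic link. This is a minimal sketch that assumes the strengths are on the natural-log (logit) scale, which is consistent with the two figures quoted in the abstract; the function name is illustrative.

```python
import math


def bt_win_probability(delta_theta: float) -> float:
    """Probability that the stronger item wins under a Bradley-Terry model.

    delta_theta is the gap in log-strength between the two items
    (assumed to be on the natural-log / logit scale).
    """
    return 1.0 / (1.0 + math.exp(-delta_theta))


# Gap between the highest- and lowest-rated model, as quoted in the abstract.
p_top_vs_bottom = bt_win_probability(0.45)
print(round(p_top_vs_bottom, 2))  # prints 0.61
```

A gap of zero gives 0.5, so 0.61 is only a modest edge over a coin flip, which matches the abstract's reading of tightly clustered performance.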

Title: CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor

Authors: Zhenhua Xu, Xixiang Zhao, Xubin Yue, Shengwei Tian, Changting Lin, Meng Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09703
Pdf URL: https://arxiv.org/pdf/2509.09703
Copy Paste: [[2509.09703]] CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor(https://arxiv.org/abs/2509.09703)
Keywords: language model, llm
Abstract: The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at
摘要：大型语言模型（LLMS）的广泛部署对知识产权保护（IP）保护引起了人们的关注，因为模型盗窃和未经授权的再分配变得越来越可行。为了解决这个问题，模型指纹旨在将可验证的所有权痕迹嵌入LLM中。但是，现有方法在隐形，鲁棒性和概括性之间面临着固有的权衡，可以通过分配转移来检测，一旦揭示了指纹，就容易受到对抗性修饰的影响，或者很容易被无效。在这项工作中，我们介绍了CTCC，这是一个新颖的规则驱动的指纹框架，该框架编码了跨多个对话转弯（例如反事实）的上下文相关性，而不是依靠令牌级别或单转触点。 CTCC在黑盒访问下启用指纹验证，同时减轻误报和指纹泄漏，即使暴露了部分触发器，也可以在共享的语义规则下进行连续构造。跨多个LLM体系结构进行的广泛实验表明，与先前的工作相比，CTCC始终达到更强的隐身和健壮性。我们的发现将CTCC定位为现实LLM部署方案中所有权验证的可靠和实用解决方案。我们的代码和数据可在

Title: Temporal Preferences in Language Models for Long-Horizon Assistance

Authors: Ali Mazyaki, Mohammad Naghizadeh, Samaneh Ranjkhah Zonouzaghi, Hossein Setareh
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.09704
Pdf URL: https://arxiv.org/pdf/2509.09704
Copy Paste: [[2509.09704]] Temporal Preferences in Language Models for Long-Horizon Assistance(https://arxiv.org/abs/2509.09704)
Keywords: language model, prompt
Abstract: We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM's revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.
摘要：我们研究语言模型（LMS）是否在跨期选择中表现出未来的偏好，以及是否可以系统地操纵这些偏好。使用改编的人类实验方案，我们评估了多个LMS在Time-Tradeoff任务上，并根据人类决策者的样本进行基准测试。我们介绍了一个操作指标，即时间取向的操作性（MTO），定义为LM揭示的时间偏好在未来和现在面向现在的提示之间的变化。在我们的测试中，以推理为重点的模型（例如，DeepSeek-Reasoner和Grok-3-Mini）选择以后的以未来为导向的提示下的选项，但仅部分个性化身份或地理位置的决策。此外，正确推荐时间导向的模型将自己作为AI决策者的未来取向内部化。我们讨论了应对异质，长马目标一致的AI助手的设计含义，并概述了有关个性化上下文校准和社会意识部署的研究议程。
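Illustrative sketch (not from the paper): the abstract defines MTO as the change in revealed time preference between future- and present-oriented prompts but does not give a formula. One possible operationalization, assumed here, is the share of "later" choices under the future-oriented prompt minus the same share under the present-oriented prompt. The labels "later" and "sooner", the function names, and the toy data are all hypothetical.

```python
from typing import Sequence


def later_share(choices: Sequence[str]) -> float:
    """Fraction of trials in which the delayed option was chosen."""
    if not choices:
        raise ValueError("need at least one choice")
    return sum(1 for c in choices if c == "later") / len(choices)


def manipulability_of_time_orientation(
    future_prompt_choices: Sequence[str],
    present_prompt_choices: Sequence[str],
) -> float:
    """Assumed MTO: shift in the 'later' share between the two prompt framings."""
    return later_share(future_prompt_choices) - later_share(present_prompt_choices)


# Toy data: 8 time-tradeoff trials per framing (hypothetical).
future = ["later"] * 6 + ["sooner"] * 2   # 6/8 = 0.75 choose later
present = ["later"] * 2 + ["sooner"] * 6  # 2/8 = 0.25 choose later
print(manipulability_of_time_orientation(future, present))  # prints 0.5
```

Under this assumed definition, a value near 0 would mean the framing barely moves the model's choices, and a value near 1 would mean the framing flips them almost entirely.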

Title: The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks

Authors: Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Yago Primerano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09705
Pdf URL: https://arxiv.org/pdf/2509.09705
Copy Paste: [[2509.09705]] The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks(https://arxiv.org/abs/2509.09705)
Keywords: llm
Abstract: This work explores the consistency of small LLMs (2B-8B parameters) in answering the same question multiple times. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
摘要：这项工作探讨了小LLM（2b-8b参数）在回答相同问题时的一致性。我们介绍了一项有关已知的开源LLM的研究，该研究响应了从多项选择基准MMLU-REDUX和MEDQA中重复的问题，考虑了不同的推理温度，小与中等模型（50B-80B），鉴定与基本模型和其他参数。我们还研究了需要多试答案一致性对准确性的影响以及决定哪种模型最能提供两者的权衡。为了支持这些研究，我们提出了一些新的分析和图形工具。结果表明，在模型中，可以在低推理温度下的小型模型的50％-80％范围内回答的问题数量持续变化。同样，一致答案之间的准确性似乎与整体准确性相关。中型模型的结果似乎表明答案的一致性水平更高。
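Illustrative sketch (not from the paper): one way to compute answer consistency over repeated trials. The abstract does not state the exact criterion, so this sketch assumes a question counts as consistent only when all repetitions give the same answer. The function name, the dictionary layout, and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def consistency_report(
    answers: Dict[str, List[str]], gold: Dict[str, str]
) -> Tuple[float, float, Optional[float]]:
    """Return (consistency rate, overall accuracy, accuracy among consistent questions).

    answers maps a question id to the list of answers from repeated trials.
    A question is 'consistent' here if every trial produced the same answer
    (an assumed criterion).
    """
    consistent = [q for q, reps in answers.items() if len(set(reps)) == 1]
    n = len(answers)
    consistency_rate = len(consistent) / n
    # Overall accuracy: mean per-question fraction of correct trials.
    overall_acc = sum(reps.count(gold[q]) / len(reps) for q, reps in answers.items()) / n
    # Accuracy restricted to consistently answered questions.
    acc_consistent = (
        sum(answers[q][0] == gold[q] for q in consistent) / len(consistent)
        if consistent
        else None
    )
    return consistency_rate, overall_acc, acc_consistent


# Toy data: 4 multiple-choice questions, 10 repetitions each (hypothetical).
answers = {
    "q1": ["A"] * 10,              # consistent and correct
    "q2": ["B"] * 10,              # consistent but wrong
    "q3": ["A"] * 6 + ["C"] * 4,   # inconsistent
    "q4": ["D"] * 9 + ["B"],       # inconsistent (one deviating trial)
}
gold = {"q1": "A", "q2": "C", "q3": "A", "q4": "D"}
rate, overall, among_consistent = consistency_report(answers, gold)
print(rate, among_consistent)  # prints 0.5 0.5
```

A looser criterion, such as majority agreement above some threshold, would raise the consistency rate; which threshold to use is a design choice the abstract leaves open.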

Title: Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Authors: Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09708
Pdf URL: https://arxiv.org/pdf/2509.09708
Copy Paste: [[2509.09708]] Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal(https://arxiv.org/abs/2509.09708)
Keywords: language model, llm, prompt
Abstract: Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
摘要：拒绝有害提示是指导调整大型语言模型（LLMS）的关键安全行为，但这种行为的内部原因仍然很熟悉。我们使用对残留流动激活训练的稀疏自动编码器（SAE）研究了两个公共教学模型，Gemma-2-2b-it和Llama-3.1-8B-IT。考虑到有害的提示，我们在SAE潜在空间中寻找功能集，该功能集将模型从拒绝到合规性，表现出因果影响并造成越狱。我们的搜索在三个阶段进行：（1）拒绝方向：找到一个拒绝的导向方向并在该方向附近收集SAE特征；（2）贪婪的过滤：修剪至最小套件；（3）互动发现：拟合分解机（FM），该计算机（FM）捕获其余活动特征和最小设置之间的非线性相互作用。该管道产生了广泛的越狱关键特征，从而深入了解了拒绝的机械基础。此外，除非抑制早期的特征，否则我们发现冗余特征的证据仍然处于休眠状态。我们的发现突出了通过操纵可解释的潜在空间来对安全行为进行细粒度审核和针对性干预的潜力。

Title: Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement

Authors: Jing Ren, Weiqi Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09709
Pdf URL: https://arxiv.org/pdf/2509.09709
Copy Paste: [[2509.09709]] Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement(https://arxiv.org/abs/2509.09709)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics--content quality and reference validity--and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT's writing performance. Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts.
摘要：像chatgpt这样的大型语言模型（LLM）越来越多地用于学术写作中，但诸如不正确或捏造的参考文献等问题引起了道德问题。此外，当前的内容质量评估通常依赖于主观的人类判断，这是劳动密集型并且缺乏客观性，可能损害一致性和可靠性。在这项研究中，为了提供定量评估并增强了LLMS的研究建议写作能力，我们提出了两个关键的评估指标，即质量和参考有效性，以及一种基于从这两个指标得出的分数的迭代提示方法。我们广泛的实验表明，拟议的指标为评估Chatgpt的写作表现提供了一个客观的定量框架。此外，迭代提示会显着提高内容质量，同时减少参考的不准确性和制造，从而应对学术背景下的关键道德挑战。

Title: Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data

Authors: Sepehr Golrokh Amin, Devin Rhoads, Fatemeh Fakhrmoosavi, Nicholas E. Lownes, John N. Ivan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.09710
Pdf URL: https://arxiv.org/pdf/2509.09710
Copy Paste: [[2509.09710]] Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data(https://arxiv.org/abs/2509.09710)
Keywords: language model, llm, prompt, agent
Abstract: This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM's statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM's zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
摘要：这项研究介绍了一种大型语言模型（LLM）方案，用于在基于代理的运输模型中生成单个旅行日记。尽管传统方法依靠大量专有的家庭旅行调查，但本研究中提出的方法从开源美国社区调查（ACS）和智能位置数据库（SLD）数据随机产生角色，然后通过直接提示综合日记。这项研究具有一个新颖的一到圆形现实主义评分：针对跨人口统计学变量匹配的康涅狄格州全州运输研究（CSTS）日记的四个指标（跳闸计数得分，间隔得分，目的评分和模式得分）的综合。该验证利用詹森 - 香农差异来衡量生成和真实日记之间的分布相似性。与经典方法生成的日记相比，在验证集上校准了经典方法（用于模式/目的的多项式logit）时，LLM生成的日记实现了可比的总体现实主义（LLM平均值：0.485：0.485 vs. 0.455）。 LLM在确定行程目的方面表现出色，并表现出更大的一致性（较窄的现实主义分数分布），而经典模型则以数值估计行程数量和活动持续时间为单位。骨料验证证实了LLM的统计代表性（LLM平均值：0.612 vs. 0.435），证明了LLM的零照片可行性，并为未来的合成日记评估系统建立了可量化的日记现实主义度量。
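Illustrative sketch (not from the paper): Jensen-Shannon divergence between two discrete distributions, for example trip-purpose shares in generated versus surveyed diaries. The abstract does not say how the divergence is turned into a realism score, so the category counts are hypothetical and no score mapping is attempted. Base-2 logarithms are used so the value lies between 0 and 1.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical trip-purpose counts: [work, school, shopping, other].
generated = normalize([40, 10, 30, 20])
observed = normalize([35, 15, 25, 25])
print(jensen_shannon_divergence(generated, observed))

# Sanity checks: identical distributions give 0, disjoint ones give 1 (base 2).
print(jensen_shannon_divergence([1.0, 0.0], [1.0, 0.0]))  # prints 0.0
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # prints 1.0
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one distribution has empty categories, which is convenient when comparing sparse diary histograms.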

Title: Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry

Authors: Aya E. Fouda, Abdelrahamn A. Hassan, Radwa J. Hanafy, Mohammed E. Fouda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09711
Pdf URL: https://arxiv.org/pdf/2509.09711
Copy Paste: [[2509.09711]] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry(https://arxiv.org/abs/2509.09711)
Keywords: language model, llm
Abstract: Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBioLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.
摘要：大型语言模型（LLMS）在增强精神病实践方面拥有巨大的希望，从提高诊断准确性到简化临床文档和治疗支持。但是，现有的评估资源在很大程度上依赖于小型临床访谈语料库，社交媒体帖子或合成对话，这限制了其临床有效性，并且无法捕捉精神病推理的全部复杂性。在这项工作中，我们介绍了精神病院，这是一个严格的精心策划的基准，该基准仅基于权威，专家验证的精神病学教科书和案例书。精神病学院包括从诊断推理和治疗计划到纵向后续行动，管理计划，临床方法，顺序案例分析以及多项选择/扩展的匹配格式，总计5,300个专家注册的项目，包括11个不同的提问任务。我们使用常规指标和“ LLM-AS-Judge-judge-judge”相似的评分框架，评估了一套各种各样的Frontier LLM（包括Google Gemini，DeepSeek，Llame 3和QWQ-32）（例如，使用常规指标和“ LLM-AS-Judge-Judge”相似的评分框架，以及使用“ LLM-AS-Judge-Judge”的领先开源医疗模型（例如OpenBilollm，Medgemma）。我们的结果揭示了临床一致性和安全性的巨大差距，尤其是在多转化的随访和管理任务中，强调了对专业模型调整的需求和更强大的评估范例。 Psychiatrybench提供了一个模块化，可扩展的平台，用于在高风险心理健康应用中进行基准测试和改善LLM的性能。

Title: The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

Authors: Talha Tahir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09712
Pdf URL: https://arxiv.org/pdf/2509.09712
Copy Paste: [[2509.09712]] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization(https://arxiv.org/abs/2509.09712)
Keywords: language model, llm, chain-of-thought
Abstract: Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic 'process' over imitating 'content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.
摘要：接受和承诺疗法（ACT）是第三波认知行为疗法，在几种精神病疾病中有疗效的新证据。这项研究调查了训练后方法和明确推理对小型开放权重模型（LLM）提供ACT的能力的影响。使用Mistral-Large产生的50组合成ACT转录本，我们使用两种不同的方法训练了Llama-3.2-3b-Instruct，有两种不同的方法，有监督的微调（SFT）和优势比策略优化（ORPO），每个方法都有和没有明确的猜测链（COT）的推理步骤。通过将这四种训练后的变体与基本指导模型进行比较，可以评估性能。在模拟治疗课程中对这些模型进行了基准测试，其绩效对ACT保真度量度（ACT-FM）进行了定量评估，而LLM法官则对人类评估进行了微调。我们的发现表明，受ORPO训练的模型大大优于其SFT和指导ACT Fidelity（$ \ chi^2（5）= 185.15，p <.001 $）和治疗性移情（$ \ chi^2（5）= 140.37，p <.001 $）。 COT的效果是有条件的，因为它为SFT模型提供了重大好处，将ACT-FM分数平均提高了2.68点（$ P <.001 $），同时对上级ORPO或指导调节变体没有明显的优势。我们认为，ORPO的优势源于其学习治疗性“过程”而不是模仿“内容”的能力，这是行动的关键方面，而COT则是仅通过模仿而训练的模型的必要脚手。这项研究确定，与偏好一致的政策优化可以有效地灌输小型LLM的ACT能力，并且明确推理的实用性高度依赖于基本的培训范式。

Title: HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering

Authors: Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09713
Pdf URL: https://arxiv.org/pdf/2509.09713
Copy Paste: [[2509.09713]] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering(https://arxiv.org/abs/2509.09713)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
摘要：检索功能生成（RAG）方法通过将信息检索（IR）技术与大语言模型（LLMS）集成（LLMS），从而增强了提问系统和对话生成任务。该策略从外部知识库中检索信息以增强生成模型的响应能力，并取得了一定的成功。但是，当前的破布方法在处理多跳查询时仍然面临许多挑战。例如，某些方法过于依赖于迭代检索，在复合查询上浪费了太多的检索步骤。此外，使用原始的复杂查询进行检索可能无法捕获与特定子查询相关的内容，从而导致嘈杂的检索内容。如果没有管理噪声，则可能导致噪声积累问题。为了解决这些问题，我们介绍了Hanrag，这是一种基于启发式的新型框架，旨在有效地解决不同复杂性的问题。在强大的启示器驱动器的驱动器中，Hanrag路线查询，将其分解为子征服，并从检索到的文档中过滤噪声。这增强了系统的适应性和抗噪声性，使其高度能够处理各种查询。我们将提议的框架与各种基准的其他领先行业方法进行了比较。结果表明，我们的框架在单跳和多跳的提问任务中获得了卓越的性能。

Title: How Small Transformation Expose the Weakness of Semantic Similarity Measures

Authors: Serge Lionel Nikiema, Albérick Euraste Djire, Abdoul Aziz Bonkoungou, Micheline Bénédicte Moumoula, Jordan Samhi, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09714
Pdf URL: https://arxiv.org/pdf/2509.09714
Copy Paste: [[2509.09714]] How Small Transformation Expose the Weakness of Semantic Similarity Measures(https://arxiv.org/abs/2509.09714)
Keywords: language model, llm
Abstract: This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods' poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
摘要：这项研究研究了不同方法如何衡量语义相似性，这对于各种软件工程应用程序（例如代码搜索，API建议，自动代码审查和重构工具）很重要。尽管大型语言模型越来越多地用于这些相似性评估，但有关它们是真正了解语义关系还是仅仅识别表面模式的问题仍然存在。该研究测试了18种不同的相似性测量方法，包括基于单词的方法，嵌入技术，基于LLM的系统和结构感知算法。研究人员创建了一个系统的测试框架，该框架将受控更改应用于文本和代码，以评估每种方法如何处理不同类型的语义关系。结果显示了常用指标的重大问题。一些基于嵌入的方法错误地将语义对立识别为99.9％的时间相似，而某些基于变压器的方法偶尔将相反的含义与同义词更相似。研究发现，嵌入方法的差经常源于它们如何计算距离。从欧几里得距离转换为余弦相似性，结果提高了24％至66％。基于LLM的方法在区分语义差异方面的表现更好，与嵌入不正确分配的高分（0.82至0.99）的嵌入方法相比，对真正不同的含义产生了低相似性得分（0.00至0.29）。
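Illustrative sketch (not from the paper): why the choice between Euclidean distance and cosine similarity can change which embeddings look "similar". The three vectors are toy values chosen for illustration, not the study's data. Euclidean distance is sensitive to vector length, while cosine similarity depends only on direction.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; grows with differences in vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


anchor = [1.0, 2.0, 3.0]
same_direction = [10.0, 20.0, 30.0]  # anchor scaled by 10: same direction, longer
different_direction = [3.0, 2.0, 1.0]  # similar length, different direction

# Euclidean distance ranks the differently oriented vector as the closer one...
print(euclidean_distance(anchor, same_direction) > euclidean_distance(anchor, different_direction))  # prints True
# ...while cosine similarity ranks the same-direction vector as the more similar one.
print(cosine_similarity(anchor, same_direction) > cosine_similarity(anchor, different_direction))  # prints True
```

The two measures disagree on the ranking here, which is the kind of effect that can make a metric swap change evaluation results when embedding norms vary.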

Title: Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA

Authors: Naveen Lamba, Sanju Tiwari, Manas Gaur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09715
Pdf URL: https://arxiv.org/pdf/2509.09715
Copy Paste: [[2509.09715]] Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA(https://arxiv.org/abs/2509.09715)
Keywords: language model, llm, hallucination
Abstract: Hallucination in Large Language Models (LLMs) is a well-studied problem. However, the properties that make LLMs intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model's internal mechanisms. To substantiate these properties, we used two established datasets, HaluEval and TruthfulQA, and converted their existing question-answering format into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs--regardless of their scale.
摘要：大语言模型（LLMS）中的幻觉是一个精心研究的问题。但是，尚未鉴定和研究使LLM本质上容易受到幻觉的特性。这项研究确定并表征了关键特性，使我们能够在模型的内部机制中查明漏洞。为了巩固这些属性，我们利用了两个已建立的数据集，即Halueval和Throtfulqa，并将其现有的问题格式转换为其他各种格式，以缩小这些属性，以此作为幻觉的原因。我们的发现表明，符号属性的幻觉百分比对于GEMMA-2-2B而言显着很高，在任务和数据集中平均为79.0％。随着模型量表的增加，Gemma-2-9b的幻觉下降到73.6％，Gemma-2-27b的幻觉下降到63.9％，反映出总体上降低了15个百分点。尽管幻觉速率随着模型大小的增加而降低，但符号特性引起的大量幻觉仍然存在。这对于修饰符（从84.76％到94.98％）和命名的实体（从83.87％到93.96％）的修饰符尤其明显。这些发现表明，符号要素继续使模型感到困惑，这表明这些LLM在处理此类输入的方式（没有规模）的基本弱点。
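Worked arithmetic (derived only from the figures in the abstract): the quoted "15 percentage point reduction" is the absolute gap between the 2B and 27B averages. The relative reduction, computed here for comparison, does not appear in the abstract. Function names are illustrative.

```python
# Average hallucination rates (%) as quoted in the abstract.
rates = {"Gemma-2-2B": 79.0, "Gemma-2-9B": 73.6, "Gemma-2-27B": 63.9}


def percentage_point_drop(start: float, end: float) -> float:
    """Absolute difference between two percentages."""
    return start - end


def relative_reduction(start: float, end: float) -> float:
    """Drop expressed as a percentage of the starting rate."""
    return (start - end) / start * 100


pp = percentage_point_drop(rates["Gemma-2-2B"], rates["Gemma-2-27B"])
rel = relative_reduction(rates["Gemma-2-2B"], rates["Gemma-2-27B"])
print(f"{pp:.1f} percentage points, {rel:.1f}% relative")
# prints: 15.1 percentage points, 19.1% relative
```

Going from the smallest to the largest model removes roughly a fifth of the hallucinations in relative terms, which leaves the 27B rate well above 60%.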

Title: ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

Authors: Kai R. Larsen, Sen Yan, Roland Müller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson
Subjects: cs.CL, cs.AI, cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2509.09723
Pdf URL: https://arxiv.org/pdf/2509.09723
Copy Paste: [[2509.09723]] ALIGNS: Unlocking nomological networks in psychological measurement through a large language model(https://arxiv.org/abs/2509.09723)
Keywords: language model
Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at this http URL, complementing traditional validation methods with large-scale nomological analysis.
摘要：心理测量对许多学科至关重要。尽管测量的进步，建立法制网络，但概念和衡量与建立有效性的关系的理论图仍然是在克朗巴赫和梅尔提出将其作为验证基础的70年之后的70年。这种限制会带来实际的后果：临床试验可能无法检测到治疗效果，公共政策可能针对错误的结果。我们介绍了潜在指标的分析以生成法制结构（Aligns），这是一种基于语言模型的大型系统，该系统训练有经过验证的问卷测量。 Aligns提供了三个综合的法制网络，其中包含心理学，医学，社会政策和其他领域的550,000多个指标。这代表了大型语言模型在测量验证中解决基础问题的第一个应用。我们报告用于开发模型的分类精度测试以及三个评估。在第一次评估中，广泛使用的NIH Promis焦虑症和抑郁仪被证明会融合到情绪困扰的单一维度。第二次评估检查了儿童气质措施，并确定了当前框架未捕获的四个潜在维度，并质疑一个现有维度。第三次评估是适用性检查，与评估系统重要性，可访问性和适用性的专家心理医生有关。 Aligns可以在此HTTP URL中免费获得，并通过大规模的法分析补充了传统验证方法。

Title: DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

Authors: Wonyoung Kim, Sujeong Seo, Juhyun Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.09724
Pdf URL: https://arxiv.org/pdf/2509.09724
Copy Paste: [[2509.09724]] DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model(https://arxiv.org/abs/2509.09724)
Keywords: language model, llm, prompt, chat
Abstract: Technology opportunities are critical information that serve as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
摘要：技术机会是关键信息，它是技术，工业和创新发展的基础。本文根据技术之间的时间关系提出了一个框架，以识别新兴的技术机会。所提出的框架开始于从专利数据集中提取文本，然后绘制基于文本的主题以发现技术间的关系。然后，通过跟踪随着时间的推移这些主题的变化来确定技术机会。为了提高效率，该框架利用大型语言模型提取主题，并采用基于聊天的语言模型的提示来支持发现技术机会。使用美国专利商标局提供的人工智能专利数据集评估了该框架。实验结果表明，人工智能技术正在发展为促进日常可访问性的形式。这种方法证明了拟议框架的潜力来确定未来的技术机会。

Title: Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure

Authors: Seiji Hattori, Takuya Matsuzaki, Makoto Fujiwara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.09726
Pdf URL: https://arxiv.org/pdf/2509.09726
Copy Paste: [[2509.09726]] Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure(https://arxiv.org/abs/2509.09726)
Keywords: llm
Abstract: This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we will demonstrate that this method can output highly readable and accurate natural language proofs by applying it to an existing formal proof library of the Lean proof assistant.
摘要：本文提出了一种自然语言翻译方法，用于机器可验证的正式证明，以利用不当度（正式语言证明步骤的语言化）和LLM的摘要功能。为了进行评估，它应用于根据本科级教科书获得的自然语言证明创建的正式证明数据，并根据原始的自然语言证明分析了生成的自然语言证明的质量。此外，我们将证明该方法可以通过将其应用于精益证明助手的现有正式证明库来输出高度可读和准确的自然语言证明。

Title: A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs

Authors: Andy Zhu, Yingjun Du
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2509.09727
Pdf URL: https://arxiv.org/pdf/2509.09727
Copy Paste: [[2509.09727]] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs(https://arxiv.org/abs/2509.09727)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation, chain-of-thought, agent
Abstract: Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from this http URL, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.
摘要：问题回答（QA）在金融教育中起着核心作用，但是现有的大语言模型（LLM）方法通常无法捕获解决财务问题所需的细微和专业推理。金融领域需要多步数定量推理，对特定领域术语的熟悉以及对现实情况的理解。我们提出了一个多代理框架，该框架利用基于角色的提示来提高域特异性QA的性能。我们的框架包括基本发电机，证据回试者和专家审稿人，它们在单次迭代中工作以产生精致的答案。我们从这个在线学习平台HTTP URL中评估了3,532套专家设计的财务教育问题的框架。我们利用检索效果的生成（RAG）从6份财务教科书中获得上下文证据，并促使域 - 专家审稿人促使策略。我们的实验表明，基于批评的改进将答案的准确性提高了6.6-8.3％，而零投篮链的基准线的表现最高，而Gemini-2.0-Flash的性能最高。此外，我们的方法使GPT-4O-Mini能够实现与金融调节的Fingpt-Mt_llama3-8B_lora相当的性能。我们的结果表明，一种具有成本效益的方法来增强财务质量保证和为多代理金融LLM系统的进一步研究提供见解。

Title: A meta-analysis on the performance of machine-learning based language models for sentiment analysis

Authors: Elena Rohde, Jonas Klingwort, Christian Borgs
Subjects: cs.CL, cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2509.09728
Pdf URL: https://arxiv.org/pdf/2509.09728
Copy Paste: [[2509.09728]] A meta-analysis on the performance of machine-learning based language models for sentiment analysis(https://arxiv.org/abs/2509.09728)
Keywords: language model
Abstract: This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.
摘要：本文提出了一项荟萃分析，评估了Twitter数据的情感分析中ML性能。该研究旨在估计平均性能，评估研究之间和内部的异质性，并分析研究特征如何影响模型性能。使用PRISMA指南，我们搜索了学术数据库，并从20个研究功能的20个研究中选择了195个试验。总体精度是最报告的性能度量，使用双弧形转换和三级随机效应模型分析。 AIC优化模型的平均总准确度为0.80 [0.76，0.84]。本文提供了两个关键的见解：1）总体准确性被广泛使用，但由于其对阶级失衡的敏感性和情感类别的数量，通常会产生误导，从而突出了对归一化的需求。 2）模型性能的标准化报告，包括针对独立测试集的报告混乱矩阵，对于跨研究的ML分类器的可靠比较至关重要，这似乎远非常见实践。
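
The double arcsine transformation and the class-imbalance caveat mentioned in this abstract can be illustrated with a minimal sketch. It assumes the averaged Freeman-Tukey form of the transform with the usual variance approximation 1/(4n + 2); the abstract does not say which scaling the authors used, and packages differ on this. The function names and the example counts are hypothetical and not taken from the paper.

```python
import math


def double_arcsine(correct, n):
    """Freeman-Tukey double arcsine transform of an accuracy correct/n.

    Returns (transformed value, approximate sampling variance).
    Assumption: averaged form; some packages omit the factor 0.5.
    """
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    t = 0.5 * (math.asin(math.sqrt(correct / (n + 1)))
               + math.asin(math.sqrt((correct + 1) / (n + 1))))
    variance = 1.0 / (4 * n + 2)
    return t, variance


def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts) / sum(class_counts)


if __name__ == "__main__":
    # Hypothetical trial: 80 correct predictions out of 100 test tweets.
    print(double_arcsine(80, 100))
    # Hypothetical class distributions: imbalanced vs. nearly balanced.
    print(majority_baseline_accuracy([900, 50, 50]))
    print(majority_baseline_accuracy([334, 333, 333]))
```

The second helper shows why raw accuracy can mislead: on the imbalanced counts, a classifier that never looks at the text already matches a reported accuracy of 0.90, while on the nearly balanced three-class set the same constant prediction scores roughly one third.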

Title: Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

Authors: Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.09731
Pdf URL: https://arxiv.org/pdf/2509.09731
Copy Paste: [[2509.09731]] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning(https://arxiv.org/abs/2509.09731)
Keywords: language model
Abstract: Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
摘要：中国古老的文件，中国历史和文化的千年载体，在各种领域拥有丰富的知识，但是在数字化和理解中面临挑战，即传统方法仅扫描图像，而当前的视觉模型（VLMS）则与其视觉和语言复杂性斗争。现有的文档基准专注于英语印刷文本或简化中文，留下了用于评估古代中国文档中VLM的差距。为了解决这个问题，我们介绍了古代文档的第一个基准，旨在评估从OCR到知识推理的VLM。古代DOC包括五项任务（页面级OCR，白话翻译，基于推理的质量质量检查，基于知识的QA，语言变体QA），涵盖了14种文档类型，超过100本书，大约3,000页。基于古老的Doc，我们使用多个指标评估主流VLM，并以人类一致的大语言模型进行评分。

Title: MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Authors: Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.09734
Pdf URL: https://arxiv.org/pdf/2509.09734
Copy Paste: [[2509.09734]] MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools(https://arxiv.org/abs/2509.09734)
Keywords: agent
Abstract: The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
摘要：模型上下文协议（MCP）迅速成为一个关键的开放标准，旨在增强代理工具的集成和互操作性，并定位为解锁强大，互连和真正实用的代理AI的新时代。但是，尽管MCP的采用越来越大，但现有的基准通常无法在这种新范式中捕获现实世界的代理商绩效，从而导致对其真实运营价值的看法扭曲，并且无法可靠地区分能力。为了弥合这个关键的评估差距，我们介绍了MCP-AgentBench - 一种专门设计的全面基准，该基准专门评估MCP介导的工具交互中的语言代理能力。 MCP-AgentBench的核心贡献包括：建立一个强大的MCP测试床，其中包括33个具有188个不同工具的操作服务器；基准的开发具有600个系统设计的查询，分布在6种不同类别的不同相互作用复杂性上；以及MCP-eval的引入，这是一种以结果为导向的评估方法，优先考虑现实世界任务成功。通过对领先语言代理的广泛经验评估，我们提供了基本的见解。 MCP-AgentBench旨在为研究社区提供一个标准化且可靠的框架，以建立，验证和前进的代理，能够完全利用MCP的变革性益处，从而加速真正有能力且可互操作的AI系统的进步。

Title: Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation

Authors: Willem Huijzer, Jieying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.09735
Pdf URL: https://arxiv.org/pdf/2509.09735
Copy Paste: [[2509.09735]] Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation(https://arxiv.org/abs/2509.09735)
Keywords: language model, gpt, llm, prompt
Abstract: The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.
摘要：大型语言模型（LLM）迅速整合到各种领域，这引起了人们对社会不平等和信息偏见的担忧。这项研究研究了与背景，性别和年龄有关的LLMS的偏见，重点是它们对决策和摘要任务的影响。此外，该研究还研究了这些偏见的跨语性传播，并评估了迅速的缓解策略的有效性。使用Tamkin等人的数据集的改编版。（2023）被翻译成荷兰语，我们为决策任务创建了151,200个独特的提示，摘要任务创建了176,400。在GPT-3.5和GPT-4O上测试了各种人口统计学变量，说明，显着性水平和语言。我们的分析表明，在决策过程中，这两个模型都显着偏见，偏爱女性，年轻年龄和某些背景，例如非裔美国人背景。相反，汇总任务显示出偏见的最低证据，尽管与年龄相关的差异显着，而gpt-3.5在英语中出现了。跨语性分析表明，偏见模式在英语和荷兰语之间大致相似，尽管在特定的人群类别中观察到了显着的差异。新提出的缓解指示虽然无法完全消除偏见，但在减少它们方面表现出了潜力。最有效的指示实现了27 \％的平均降低，但最有利的人口统计学之间的差距。值得注意的是，与GPT-3.5相反，GPT-4O显示了所有英语提示的偏差减少，这表明在较新的型号中，基于迅速缓解的特定潜力。这项研究强调了谨慎采用LLM的重要性和特定于上下文的偏见测试，强调了继续制定有效缓解策略以确保负责任部署AI的必要性。

Title: HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning

Authors: Brennen Hill
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.09801
Pdf URL: https://arxiv.org/pdf/2509.09801
Copy Paste: [[2509.09801]] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning(https://arxiv.org/abs/2509.09801)
Keywords: language model, llm
Abstract: The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model's weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.
摘要：大型语言模型（LLM）适应专业推理任务的基本上受到计算资源的限制。参数有效的微调（PEFT）方法已成为一种强大的解决方案，但是这些技术的景观是多种多样的，在模型的重量空间或其表示空间中使用不同的方法。本文调查了以下假设：这些范式的协同组合可以释放出色的性能和效率。我们介绍了Heft（分层有效的微调），这是一种新型的分层适应策略，以粗略的方式构成了两种不同的PEFT方法：首先，使用低秩适应（LORA）在体重空间中进行了广泛的基础适应（LORA），然后是精确的，随后是一种使用代表性的内部激活（使用代表性的）。我们通过在Boolq基准上微调Llama-2-7b模型来评估这种方法，Boolq Benchmark是一个具有挑战性的推理推理数据集。我们的结果揭示了深远的协同作用。一个模型仅针对我们的HEFT策略进行了三个时期的微调，其精度达到了85.17 \％，超过了对20个时期训练的模型的性能，仅使用lora for lora（85.05 \％）或仅reft-infly（83.36 \％）方法。这项工作表明，PEFT方法的周到组成是一种有效的算法创新，为推进语言模型的推理能力提供了更有效的途径。通过以计算预算的一小部分取得优越的结果，我们的发现提出了一种有原则的方法来克服适应复杂认知任务的大规模模型固有的障碍。
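
To make the weight-space versus representation-space distinction concrete, here is a minimal sketch of the two kinds of update on a single toy layer. It is not the HEFT implementation. It assumes the standard LoRA form y = Wx + scale * B(Ax) and, for the representation edit, the low-rank LoReFT-style form h + R^T(W_r h + b - R h), since the abstract does not specify which ReFT variant is used. All names and the tiny matrices are hypothetical, and plain Python lists stand in for tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def lora_forward(W, A, B, x, scale=1.0):
    """Weight-space adaptation: frozen W plus a trainable low-rank term B @ A."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def reft_intervene(h, R, W_r, b):
    """Representation-space edit: h + R^T (W_r h + b - R h).

    Assumption: LoReFT-style intervention; R has orthonormal rows.
    """
    projected = matvec(R, h)
    target = [t + bias for t, bias in zip(matvec(W_r, h), b)]
    delta = [t - p for t, p in zip(target, projected)]
    update = matvec(transpose(R), delta)
    return [hi + u for hi, u in zip(h, update)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy identity)
    A = [[0.5, 0.5]]               # rank-1 down-projection
    B = [[1.0], [0.0]]             # rank-1 up-projection
    x = [2.0, 3.0]
    h = lora_forward(W, A, B, x)   # coarse stage: weight-space update
    R = [[1.0, 0.0]]               # edit only the first hidden direction
    print(reft_intervene(h, R, [[0.0, 0.0]], [5.0]))  # fine stage
```

In this toy setting the coarse stage changes the layer's output everywhere through the low-rank weight update, while the fine stage overwrites only the component of the hidden vector that lies in the subspace spanned by the rows of R and leaves the rest untouched.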

Title: Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization

Authors: Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.09852
Pdf URL: https://arxiv.org/pdf/2509.09852
Copy Paste: [[2509.09852]] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization(https://arxiv.org/abs/2509.09852)
Keywords: language model, llm, prompt
Abstract: A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
摘要：多文档摘要（MDS）中的关键挑战是有效地整合了来自多个来源的信息，同时保持连贯性和局部相关性。尽管大型语言模型在单文件摘要中显示出令人印象深刻的结果，但它们在MDS上的性能仍然留出了改进的空间。在本文中，我们提出了一种主题引导的强化学习方法，以改善MD中的内容选择。我们首先表明，明确提示具有主题标签的模型可以增强生成的摘要的信息。在这种见解的基础上，我们在小组相对政策优化（GRPO）框架内提出了一个新的主题奖励，以衡量生成的摘要和源文档之间的主题对齐。多新的和多XSCIECH数据集的实验结果表明，我们的方法始终胜过强大的基准，强调了利用MDS中局部提示的有效性。
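
The idea of a topic reward inside GRPO can be sketched as follows. The abstract does not define the reward, so the Jaccard overlap between topic-label sets used here is purely an assumption for illustration. The group-relative step follows the usual GRPO recipe of standardising each sampled summary's reward against the other samples for the same input, here with the population standard deviation, which is an implementation choice. Function names and topic labels are hypothetical.

```python
import statistics


def topic_reward(summary_topics, source_topics):
    """Hypothetical topic-alignment reward: Jaccard overlap of topic labels."""
    summary_topics, source_topics = set(summary_topics), set(source_topics)
    union = summary_topics | source_topics
    if not union:
        return 0.0
    return len(summary_topics & source_topics) / len(union)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    source = {"election", "economy", "health"}
    # Topic labels of three sampled summaries for the same document cluster.
    group = [
        {"election", "economy", "health"},
        {"election", "sports"},
        {"weather"},
    ]
    rewards = [topic_reward(topics, source) for topics in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Summaries whose topics match the sources better than the group average receive positive advantages and are reinforced; those below average are pushed down. When all samples score equally, every advantage is zero and the group contributes no learning signal.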

Title: Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case

Authors: Bastián González-Bustamante, Nando Verelst, Carla Cisternas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09871
Pdf URL: https://arxiv.org/pdf/2509.09871
Copy Paste: [[2509.09871]] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case(https://arxiv.org/abs/2509.09871)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI's GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.
摘要：大型语言模型（LLMS）通过使用合成受访者模仿人类的答案和行为，为调查研究中的方法论和应用创新提供了有希望的途径，从而有可能减轻测量和表示错误。但是，LLM恢复汇总项目分布的程度仍然不确定，下游应用程序风险再现从培训数据中继承的社会刻板印象和偏见。我们评估了LLM生成的合成调查响应的可靠性，以针对智利公众舆论概率调查的基础真相反应。具体而言，我们基准了128个及时模型问题的三重态，在128个问题 - 次样样本中的荟萃分析中产生189,696个合成概况和池性能指标（即准确性，精度，精度，回忆和F1得分），以测试沿键入社会社会杂种的偏见。评估涵盖了OpenAI的GPT家族和O系列推理模型，以及Llama和Qwen检查站。三个结果突出。首先，综合响应在信任项目上具有出色的性能（F1得分和准确性> 0.90）。其次，GPT-4O，GPT-4O-Mini和Llama 4 Maverick在此任务上表现相当。第三，在45-59岁的受访者中，合成人类的比对最高。总体而言，基于LLM的合成样品近似于概率样本的响应，尽管具有很大的项目水平异质性。捕捉全部舆论的细微差别仍然具有挑战性，需要仔细的校准和其他分配测试，以确保算法忠诚度并减少错误。

Title: Large Language Models Meet Legal Artificial Intelligence: A Survey

Authors: Zhitian Hou, Zihan Ye, Nanli Zeng, Tianyong Hao, Kun Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.09969
Pdf URL: https://arxiv.org/pdf/2509.09969
Copy Paste: [[2509.09969]] Large Language Models Meet Legal Artificial Intelligence: A Survey(https://arxiv.org/abs/2509.09969)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in the legal domain, this paper provides a comprehensive review of 16 legal LLM series and 47 LLM-based frameworks for legal tasks, and also gathers 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at this https URL.
摘要：近年来，大型语言模型（LLM）显着推动了法律人工智能（法定AI）的发展，从而提高了法律任务的效率和准确性。为了推进法律领域中基于LLM的方法的研究和应用，本文对16个法律LLMS系列和47个基于LLM的法律任务框架进行了全面评论，还收集了15个基准测试和29个数据集，以评估不同的法律能力。此外，我们分析了挑战，并讨论了法律领域中基于LLM的方法的未来方向。我们希望本文为初学者提供系统的介绍，并鼓励该领域的未来研究。资源可在此HTTPS URL上找到。

Title: Unsupervised Hallucination Detection by Inspecting Reasoning Processes

Authors: Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10004
Pdf URL: https://arxiv.org/pdf/2509.10004
Copy Paste: [[2509.10004]] Unsupervised Hallucination Detection by Inspecting Reasoning Processes(https://arxiv.org/abs/2509.10004)
Keywords: language model, llm, hallucination, prompt
Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.
摘要：无监督的幻觉检测旨在识别大语模型（LLMS）产生的幻觉内容而不依赖标签数据。尽管无监督的方法通过消除劳动密集型的人类注释而越来越受欢迎，但他们经常依靠与事实正确性无关的代理信号。这种未对准的检测探针偏向于表面或非真实性相关的方面，从而限制了跨数据集和方案的普遍性。为了克服这些局限性，我们提出了一个无监督的幻觉检测框架Iris，利用了事实正确性的内部表示。 IRIS提示LLM仔细验证给定陈述的真实性，并获得其上下文化的嵌入作为培训的信息特征。同时，每个反应的不确定性被认为是真实性的软伪标记。实验结果表明，虹膜始终优于现有的无监督方法。我们的方法是完全无监督的，计算低成本，即使很少有培训数据也可以很好地工作，因此它适合实时检测。

Title: Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs

Authors: Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2509.10010
Pdf URL: https://arxiv.org/pdf/2509.10010
Copy Paste: [[2509.10010]] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs(https://arxiv.org/abs/2509.10010)
Keywords: language model, llm, prompt, chat
Abstract: In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. Mistral-7B-v0.1 outperforms the two other generative models on 11 of the 14 intent classes in terms of F-Score, with a weighted average of 0.50. It also has relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier has superior performance compared to the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.
摘要：在本文中，我们使用开源，可公开可用的大型语言模型（LLMS）对多标签意图分类进行了广泛的分析，并且可以在消费者硬件中运行。我们使用对话系统域中的Multiwoz 2.1数据集（对话系统域中的基准）来研究三个流行的开源预训练的LLM的功效，即Llama2-7B-HF，Mismtral-7b-v0.1和YI-6B。我们在几次设置中执行分类任务，并在提示中提供了20个示例，并提供了一些说明。我们的方法着重于通过有条不紊地评估这些模型在多标签意图分类任务上，这些模型在几个绩效指标上的性能差异。此外，我们将基于教学的微调方法的性能与使用较小的变压器模型BertforeSequencececectification作为基线的监督学习进行了比较。为了评估模型的性能，我们使用评估指标，例如准确性，精度和召回率以及微型，宏和加权F1评分。我们还报告了推理时间，VRAM需求等。Mistral-7b-v0.1在F-SCORE方面，在14个意图上的11个意图类别上优于其他两个生成模型，平均加权平均为0.50。它还具有相对较低的嗡嗡声损失和较高的Jaccard相似性，使其成为少数拍摄环境中的获胜模型。我们发现基于BERT的监督分类器与表现最佳的少数发电性LLM相比具有出色的性能。这项研究为小型开源LLM提供了一个框架，以检测复杂的多智能对话，从而增强了以任务为导向的聊天机器人的自然理解方面。
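
The two multi-label metrics named in this abstract, Hamming Loss and Jaccard Similarity, can be sketched as below. This is a minimal illustration, not the paper's evaluation code. It assumes each utterance's intents are given as a set drawn from a fixed label list and uses the sample-averaged Jaccard score. The intent names are hypothetical, not the MultiWOZ 2.1 label set.

```python
def hamming_loss(true_sets, pred_sets, labels):
    """Fraction of (utterance, label) decisions that are wrong."""
    # The symmetric difference counts labels that are missed or spurious.
    mismatches = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return mismatches / (len(true_sets) * len(labels))


def jaccard_similarity(true_sets, pred_sets):
    """Mean over utterances of |true & pred| / |true | pred|."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Two empty label sets agree perfectly (a convention, assumed here).
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    labels = ["inform", "request", "book"]       # hypothetical intents
    true_sets = [{"inform", "request"}, {"book"}]
    pred_sets = [{"inform"}, {"book", "request"}]
    print(hamming_loss(true_sets, pred_sets, labels))
    print(jaccard_similarity(true_sets, pred_sets))
```

Lower Hamming Loss and higher Jaccard Similarity both indicate better multi-label predictions. Hamming Loss penalises each wrong label decision equally, whereas Jaccard Similarity measures per-utterance overlap and so is less flattered by the many labels that are correctly left out.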

Title: Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models

Authors: Dongmin Choi, Woojung Song, Jongwook Han, Eun-Ju Lee, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10078
Pdf URL: https://arxiv.org/pdf/2509.10078
Copy Paste: [[2509.10078]] Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models(https://arxiv.org/abs/2509.10078)
Keywords: language model, llm, prompt
Abstract: Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity--the extent to which survey questions adequately reflect and resemble real-world contexts in which LLMs generate texts in response to user queries. However, it remains unclear how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.
摘要：研究人员已经应用了既定的心理测量问卷（例如BFI，PVQ）来衡量大语模型（LLMS）回答中反映的人格特质和价值观。但是，人们对将这些设计的调查表应用于LLM的关注感到担忧。这样一个关注的是它们缺乏生态有效性 - 调查问题在多大程度上充分反映和类似于现实世界中的环境，其中LLMS响应用户查询而生成文本。但是，目前尚不清楚确定的问卷和生态上有效的问卷如何在其结果上有所不同，以及这些差异可能提供的见解。在本文中，我们对两种类型的问卷进行了全面的比较分析。 Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs.总体而言，我们的工作警告不使用针对LLM的既定心理问卷。我们的代码将在出版后发布。

Title: Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery

Authors: Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10087
Pdf URL: https://arxiv.org/pdf/2509.10087
Copy Paste: [[2509.10087]] Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery(https://arxiv.org/abs/2509.10087)
Keywords: language model
Abstract: The growing complexity and volume of climate science literature make it increasingly difficult for researchers to find relevant information across models, datasets, regions, and variables. This paper introduces a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts, aimed at improving how climate knowledge is accessed and used. Unlike keyword-based search, our KG supports structured, semantic queries that help researchers discover precise connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. We demonstrate how the KG answers such questions using Cypher queries, and outline its integration with large language models in RAG systems to improve transparency and reliability in climate-related question answering. This work moves beyond KG construction to show its real-world value for climate researchers, model developers, and others who rely on accurate, contextual scientific information.
摘要：气候科学文献的复杂性和数量的日益增长使研究人员越来越难以在模型，数据集，区域和变量之间找到相关信息。本文介绍了由气候出版物和更广泛的科学文本构建的特定领域知识图（KG），旨在改善气候知识的访问和使用方式。与基于关键字的搜索不同，我们的KG支持结构化的语义查询，以帮助研究人员发现精确的连接，例如在特定区域验证了哪些模型或哪些数据集已在某些远程连接模式中使用。我们演示了KG如何使用Cypher查询来回答此类问题，并概述了与抹布系统中的大型语言模型的集成，以提高与气候相关的问题回答中的透明度和可靠性。这项工作超越了KG的建设，以表明其对气候研究人员，模型开发人员以及其他依靠准确，上下文科学信息的人的现实价值。

Title: Arabic Large Language Models for Medical Text Generation

Authors: Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Ammar Mohammed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10095
Pdf URL: https://arxiv.org/pdf/2509.10095
Copy Paste: [[2509.10095]] Arabic Large Language Models for Medical Text Generation(https://arxiv.org/abs/2509.10095)
Keywords: language model, gpt, llm
Abstract: Efficient hospital management systems (HMS) are critical worldwide to address challenges such as overcrowding, limited resources, and poor availability of urgent health care. Existing methods often lack the ability to provide accurate, real-time medical advice, particularly for irregular inputs and underrepresented languages. To overcome these limitations, this study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input. The research methodology required the collection of a unique dataset from social media platforms, capturing real-world medical conversations between patients and doctors. The dataset, which includes patient complaints together with medical advice, was properly cleaned and preprocessed to account for multiple Arabic dialects. Fine-tuning state-of-the-art generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium, optimized the system's ability to generate reliable medical text. Results from evaluations indicate that the fine-tuned Mistral-7B model outperformed the other models, achieving average BERTScore (Bidirectional Encoder Representations from Transformers Score) values for precision, recall, and F1 of 68.5%, 69.08%, and 68.5%, respectively. Comparative benchmarking and qualitative assessments validate the system's ability to produce coherent and relevant medical replies to informal input. This study highlights the potential of generative artificial intelligence (AI) in advancing HMS, offering a scalable and adaptable solution for global healthcare challenges, especially in linguistically and culturally diverse environments.
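For readers unfamiliar with the metric reported above, BERTScore-style precision, recall, and F1 come from greedily matching token embeddings between a candidate and a reference by cosine similarity. The sketch below shows only that arithmetic on hand-made 2-D vectors. The vectors and the function name `greedy_match_scores` are assumptions for illustration, not the paper's evaluation code; the real metric uses contextual embeddings from a pretrained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_scores(candidate, reference):
    """Return (precision, recall, f1) from greedy cosine matching.

    candidate, reference: lists of token embedding vectors.
    """
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (hypothetical values chosen for illustration only).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case every candidate token has an exact counterpart in the reference, so precision is perfect, while the unmatched third reference token pulls recall (and therefore F1) below 1.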

Title: Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records

Authors: Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10108
Pdf URL: https://arxiv.org/pdf/2509.10108
Copy Paste: [[2509.10108]] Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records(https://arxiv.org/abs/2509.10108)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using the advanced generative AI systems ChatGPT-4o and Gemini 2.5 Pro, we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.

Title: Population-Aligned Persona Generation for LLM-based Social Simulation

Authors: Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10127
Pdf URL: https://arxiv.org/pdf/2509.10127
Copy Paste: [[2509.10127]] Population-Aligned Persona Generation for LLM-based Social Simulation(https://arxiv.org/abs/2509.10127)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
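The importance-sampling step mentioned in the abstract can be sketched as follows: each persona is weighted by the ratio of the reference (target) probability to the probability under the raw pool, and the pool is then resampled with those weights. The three trait bins and their proportions below are hypothetical, and discretising a trait into bins is a simplifying assumption made here for illustration; the paper's actual alignment procedure may differ.

```python
import random
from collections import Counter

def importance_weights(samples, source_probs, target_probs):
    """Weight each sample by target probability / source probability of its bin."""
    return [target_probs[s] / source_probs[s] for s in samples]

def resample(samples, weights, k, rng):
    """Draw k items with replacement, proportionally to the importance weights."""
    return rng.choices(samples, weights=weights, k=k)

rng = random.Random(0)
# Hypothetical trait bins: the raw persona pool over-represents "high" scorers.
source_probs = {"low": 0.1, "mid": 0.3, "high": 0.6}
# Hypothetical reference distribution that the aligned set should follow.
target_probs = {"low": 0.3, "mid": 0.4, "high": 0.3}

bins = list(source_probs)
pool = rng.choices(bins, weights=[source_probs[b] for b in bins], k=20000)
weights = importance_weights(pool, source_probs, target_probs)
aligned = resample(pool, weights, k=20000, rng=rng)

counts = Counter(aligned)
aligned_share = {b: counts[b] / len(aligned) for b in bins}
print(aligned_share)
```

After resampling, the share of each bin should sit close to the reference proportions even though the raw pool was skewed.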

Title: Towards Reliable and Interpretable Document Question Answering via VLMs

Authors: Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.10129
Pdf URL: https://arxiv.org/pdf/2509.10129
Copy Paste: [[2509.10129]] Towards Reliable and Interpretable Document Question Answering via VLMs(https://arxiv.org/abs/2509.10129)
Keywords: language model
Abstract: Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.

Title: Benchmark of stylistic variation in LLM-generated texts

Authors: Jiří Milička, Anna Marklová, Václav Cvrček
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10179
Pdf URL: https://arxiv.org/pdf/2509.10179
Copy Paste: [[2509.10179]] Benchmark of stylistic variation in LLM-generated texts(https://arxiv.org/abs/2509.10179)
Keywords: language model, llm, prompt
Abstract: This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber's multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts, in order to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus, AI-Brown, is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, a similar analysis is replicated on Czech using the AI-Koditex corpus and a Czech multidimensional model. Sixteen frontier models were examined in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.

Title: Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations

Authors: Leen Almajed, Abeer ALdayel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10184
Pdf URL: https://arxiv.org/pdf/2509.10184
Copy Paste: [[2509.10184]] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations(https://arxiv.org/abs/2509.10184)
Keywords: language model, llm
Abstract: In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM-generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower- and higher-stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we fine-tune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in the online supportive dialogue, paving the way toward context-aware and trust-preserving online conversation systems.

Title: Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

Authors: Miklós Sebők, Viktor Kovács, Martin Bánóczy, Daniel Møller Eriksen, Nathalie Neptune, Philippe Roussille
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10199
Pdf URL: https://arxiv.org/pdf/2509.10199
Copy Paste: [[2509.10199]] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification(https://arxiv.org/abs/2509.10199)
Keywords: language model, gpt
Abstract: The most widely used large language models in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and, therefore, are not particularly amenable to processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.

Title: SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning

Authors: Shengqiang Fu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.10208
Pdf URL: https://arxiv.org/pdf/2509.10208
Copy Paste: [[2509.10208]] SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning(https://arxiv.org/abs/2509.10208)
Keywords: language model, llm
Abstract: Large Language Models often generate unfaithful responses in knowledge-intensive tasks due to knowledge conflict, that is, a preference for relying on internal parametric knowledge rather than the provided context. To address this issue, we propose a novel self-improving framework, Self-Improving Faithfulness-Aware Contrastive Tuning. The framework uses a self-instruct mechanism that allows the base LLM to automatically generate high-quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples simulating unfaithful [...]. This approach significantly reduces the cost of manual [...]. [...], contrastive learning is applied to train the model, enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space. Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI-FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method, while significantly reducing dependence on internal [...]. The results indicate that SI-FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
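The "pull faithful responses closer, push unfaithful responses apart" objective can be illustrated with a standard contrastive loss. The abstract does not name the exact loss, so the InfoNCE form, the temperature value, and the 2-D toy embeddings below are all assumptions made for illustration rather than the paper's training objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor with one positive and one or more negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    # The loss is small when the positive dominates the similarity mass.
    return -math.log(pos / (pos + neg))

# Hypothetical embeddings: the anchor and the faithful response point the same way.
anchor = [1.0, 0.0]
faithful = [0.9, 0.1]
unfaithful = [0.0, 1.0]

loss_aligned = info_nce(anchor, faithful, [unfaithful])
loss_swapped = info_nce(anchor, unfaithful, [faithful])
print(f"aligned={loss_aligned:.5f} swapped={loss_swapped:.5f}")
```

The loss is near zero when the faithful response is the anchor's nearest neighbour and large when the roles are swapped, which is the gradient signal that separates the two kinds of response in representation space.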

Title: Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs

Authors: Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, zhiliang wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10377
Pdf URL: https://arxiv.org/pdf/2509.10377
Copy Paste: [[2509.10377]] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs(https://arxiv.org/abs/2509.10377)
Keywords: language model, llm
Abstract: Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
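The first two of the three steps described in the abstract (prune experts by router statistics, then reassign neuron-level segments of the dropped experts to the most compatible retained expert) can be sketched as below. This is a toy illustration under stated assumptions, not DERN itself: routing counts stand in for "router statistics", a single neuron vector stands in for a "segment", and cosine similarity to the retained expert's mean neuron stands in for "compatibility". The final merge step is left out because the abstract does not specify it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def prune_and_reassign(experts, router_counts, keep):
    """experts: name -> list of neuron vectors; router_counts: name -> tokens routed."""
    # Step 1 (assumed criterion): retain the `keep` most frequently routed experts.
    ranked = sorted(experts, key=lambda name: router_counts[name], reverse=True)
    retained, dropped = ranked[:keep], ranked[keep:]
    merged = {name: list(experts[name]) for name in retained}
    centroids = {name: centroid(experts[name]) for name in retained}
    # Step 2 (assumed criterion): send each neuron of a dropped expert to the
    # retained expert whose mean neuron it is most similar to.
    for name in dropped:
        for neuron in experts[name]:
            best = max(retained, key=lambda r: cosine(neuron, centroids[r]))
            merged[best].append(neuron)
    # Step 3 (merging received segments into a compact form) is not sketched here.
    return retained, merged

# Hypothetical experts with 2-D "neurons" and routing counts.
experts = {
    "A": [[1.0, 0.0], [0.9, 0.1]],
    "B": [[0.0, 1.0]],
    "C": [[0.8, 0.2], [0.1, 0.9]],
}
router_counts = {"A": 50, "B": 40, "C": 10}
retained, merged = prune_and_reassign(experts, router_counts, keep=2)
print(retained, {name: len(neurons) for name, neurons in merged.items()})
```

Here the rarely routed expert C is dropped and its two neurons are split between A and B according to similarity, rather than C being discarded or merged wholesale into a single expert.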

Title: Is In-Context Learning Learning?

Authors: Adrian de Wynter
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.10414
Pdf URL: https://arxiv.org/pdf/2509.10414
Copy Paste: [[2509.10414]] Is In-Context Learning Learning?(https://arxiv.org/abs/2509.10414)
Keywords: prompt, chain-of-thought
Abstract: In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL, ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism, and that it suggests limited all-purpose generalisability.

Title: Long Context Automated Essay Scoring with Language Models

Authors: Christopher Ormerod, Gitit Kehat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10417
Pdf URL: https://arxiv.org/pdf/2509.10417
Copy Paste: [[2509.10417]] Long Context Automated Essay Scoring with Language Models(https://arxiv.org/abs/2509.10417)
Keywords: language model, long context
Abstract: Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model's ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.

Title: RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment

Authors: Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10436
Pdf URL: https://arxiv.org/pdf/2509.10436
Copy Paste: [[2509.10436]] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment(https://arxiv.org/abs/2509.10436)
Keywords: language model, llm, prompt, agent
Abstract: To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.

Title: DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

Authors: Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.10446
Pdf URL: https://arxiv.org/pdf/2509.10446
Copy Paste: [[2509.10446]] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL(https://arxiv.org/abs/2509.10446)
Keywords: language model, llm, agent
Abstract: Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at this https URL.