2025-10-06

Title: Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Authors: Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Oliver Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02324
Pdf URL: https://arxiv.org/pdf/2510.02324
Copy Paste: [[2510.02324]] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning(https://arxiv.org/abs/2510.02324)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into the model's weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL's lightweight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data-scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL's flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired methods to practical deployment in production systems.
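A minimal sketch of the contrastive steering idea described in the CASAL abstract, not the authors' implementation. It assumes the steering direction is the difference between mean activations on known and unknown questions, and that the amortized training target for the trainable submodule is the steered activation. The function names and the `alpha` scale are hypothetical.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(known_acts, unknown_acts):
    """Contrastive direction: mean(known) - mean(unknown). Assumed form."""
    mu_known = mean_vector(known_acts)
    mu_unknown = mean_vector(unknown_acts)
    return [k - u for k, u in zip(mu_known, mu_unknown)]

def steered_target(activation, direction, is_known, alpha=1.0):
    """Target activation a trainable submodule could regress onto.

    Known questions are pushed along +direction (answer),
    unknown questions along -direction (abstain). Hypothetical convention.
    """
    sign = 1.0 if is_known else -1.0
    return [a + sign * alpha * d for a, d in zip(activation, direction)]

# Toy 2-dimensional activations, purely illustrative.
known = [[1.0, 0.0], [3.0, 2.0]]
unknown = [[0.0, 1.0], [0.0, 3.0]]
v = steering_vector(known, unknown)
target = steered_target([0.5, 0.5], v, is_known=False, alpha=0.5)
```

Training against such targets would move the steering step from inference time into the weights, which is the sense of "amortized" in the abstract.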

Title: Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval

Authors: Vivek Bhavsar, Joseph Ereifej, Aravanan Gurusami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02326
Pdf URL: https://arxiv.org/pdf/2510.02326
Copy Paste: [[2510.02326]] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval(https://arxiv.org/abs/2510.02326)
Keywords: language model, gpt, hallucination
Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) baseline and a vanilla single-pass Default GPT API call baseline, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
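The Relevance -> Confidence -> Knowledge loop from the RA-FSM abstract can be sketched as a small state machine. This is only an illustration under stated assumptions: the callbacks (`in_scope`, `answerability`, `retrieve`, `generate`), the `threshold` value, and the confidence labels are hypothetical stand-ins, and question decomposition is omitted.

```python
from enum import Enum, auto

class State(Enum):
    RELEVANCE = auto()
    CONFIDENCE = auto()
    KNOWLEDGE = auto()
    DONE = auto()

def run_controller(query, in_scope, answerability, retrieve, generate, threshold=0.7):
    """Walk the control loop once and return (answer, visited state names)."""
    state, trace = State.RELEVANCE, []
    score, evidence, answer = 0.0, [], None
    while state is not State.DONE:
        trace.append(state.name)
        if state is State.RELEVANCE:
            if in_scope(query):
                state = State.CONFIDENCE
            else:
                # Out-of-scope queries are filtered before any generation.
                answer = {"text": "Out of scope.", "confidence": "n/a", "references": []}
                state = State.DONE
        elif state is State.CONFIDENCE:
            score = answerability(query)
            if score < threshold:
                evidence = retrieve(query)  # retrieval only when needed
            state = State.KNOWLEDGE
        elif state is State.KNOWLEDGE:
            refs = list(dict.fromkeys(evidence))  # de-duplicate, keep order
            label = "high" if score >= threshold else "retrieval-backed"
            answer = {"text": generate(query, refs), "confidence": label, "references": refs}
            state = State.DONE
    return answer, trace

result, trace = run_controller(
    "What limits ring resonator Q?",
    in_scope=lambda q: "resonator" in q,
    answerability=lambda q: 0.4,
    retrieve=lambda q: ["doc-1", "doc-2", "doc-1"],
    generate=lambda q, refs: f"answer using {len(refs)} passages",
)
```

Keeping the control flow in an explicit state machine makes each decision (filter, score, retrieve) inspectable, which fits the abstract's emphasis on transparent, well-cited answers.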

Title: KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

Authors: So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2510.02327
Pdf URL: https://arxiv.org/pdf/2510.02327
Copy Paste: [[2510.02327]] KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI(https://arxiv.org/abs/2510.02327)
Keywords: language model, llm
Abstract: Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

Title: AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

Authors: Ziqing Wang, Chengsheng Mao, Xiaole Wen, Yuan Luo, Kaize Ding
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2510.02328
Pdf URL: https://arxiv.org/pdf/2510.02328
Copy Paste: [[2510.02328]] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering(https://arxiv.org/abs/2510.02328)
Keywords: language model, llm, agent
Abstract: Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at this https URL.

Title: SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Authors: Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02329
Pdf URL: https://arxiv.org/pdf/2510.02329
Copy Paste: [[2510.02329]] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification(https://arxiv.org/abs/2510.02329)
Keywords: llm
Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding speeds up this process by relaxing the verification criteria, accepting draft tokens that may exhibit minor discrepancies from the target model's output. Existing methods, however, are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves better inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
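To illustrate the difference between strict verification and judge-relaxed verification described in the SelfJudge abstract, here is a minimal greedy sketch. The `judge` callable is a hand-written synonym lookup standing in for a learned verifier; it is a hypothetical placeholder, not the paper's self-supervised judge.

```python
def verify_draft(draft_tokens, target_tokens, judge=None):
    """Return the tokens kept after verifying one drafted block.

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the target model's own choice at each drafted position.
    judge:         optional callable (position, draft_token, target_token) -> bool
                   that may accept a mismatching draft token.
    """
    kept = []
    for i, (d, t) in enumerate(zip(draft_tokens, target_tokens)):
        if d == t or (judge is not None and judge(i, d, t)):
            kept.append(d)       # draft token accepted
        else:
            kept.append(t)       # reject: fall back to the target token and stop
            break
    return kept

draft = ["The", "film", "was", "great", "!"]
target = ["The", "movie", "was", "good", "!"]

# Toy judge: accept substitutions listed as meaning-preserving.
synonyms = {("film", "movie")}
toy_judge = lambda i, d, t: (d, t) in synonyms

strict = verify_draft(draft, target)              # exact-match verification
relaxed = verify_draft(draft, target, toy_judge)  # judge-relaxed verification
```

In this toy run the relaxed verifier keeps more draft tokens per block than exact matching, which is the source of the speed-up; whether accepted substitutions really preserve meaning depends entirely on the quality of the judge.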

Title: EntropyLong: Effective Long-Context Training via Predictive Uncertainty

Authors: Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Zijia Lin, Debing Zhang, Songlin Hu, Binghui Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02330
Pdf URL: https://arxiv.org/pdf/2510.02330
Copy Paste: [[2510.02330]] EntropyLong: Effective Long-Context Training via Predictive Uncertainty(https://arxiv.org/abs/2510.02330)
Keywords: language model
Abstract: Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWebEdu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBenchv2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
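The two checks named in the EntropyLong abstract (find high-entropy positions, then verify that a retrieved context lowers prediction entropy) can be sketched with Shannon entropy over next-token distributions. The `threshold` and `min_gain` parameters and the toy distributions are assumptions for illustration; the paper's actual selection criteria are not given in the abstract.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def high_entropy_positions(dists, threshold):
    """Indices whose predictive distribution is more uncertain than threshold."""
    return [i for i, d in enumerate(dists) if entropy(d) > threshold]

def context_is_useful(dist_without, dist_with, min_gain=0.1):
    """Keep a retrieved context only if it reduces entropy by at least min_gain bits."""
    return entropy(dist_without) - entropy(dist_with) >= min_gain

# Toy per-position distributions: certain, coin flip, uniform over four tokens.
dists = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
positions = high_entropy_positions(dists, threshold=0.9)

# Prediction at the last position with and without a candidate context prepended.
helps = context_is_useful(dists[2], [0.97, 0.01, 0.01, 0.01])
no_help = context_is_useful(dists[1], [0.5, 0.5])
```

In practice the distributions would come from a language model scored twice, once on the document alone and once with the candidate context prepended.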

Title: Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)

Authors: Moonkyung Ryu, Chih-Wei Hsu, Yinlam Chow, Mohammad Ghavamzadeh, Craig Boutilier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02331
Pdf URL: https://arxiv.org/pdf/2510.02331
Copy Paste: [[2510.02331]] Synthetic Dialogue Generation for Interactive Conversational Elicitation & Recommendation (ICER)(https://arxiv.org/abs/2510.02331)
Keywords: language model, prompt
Abstract: While language models (LMs) offer great potential for conversational recommender systems (CRSs), the paucity of public CRS data makes fine-tuning LMs for CRSs challenging. In response, LMs as user simulators qua data generators can be used to train LM-based CRSs, but often lack behavioral consistency, generating utterance sequences inconsistent with those of any real user. To address this, we develop a methodology for generating natural dialogues that are consistent with a user's underlying state using behavior simulators together with LM-prompting. We illustrate our approach by generating a large, open-source CRS data set with both preference elicitation and example critiquing. Rater evaluation on some of these dialogues shows them to exhibit considerable consistency, factuality and naturalness.

Title: Human Mobility Datasets Enriched With Contextual and Social Dimensions

Authors: Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2510.02333
Pdf URL: https://arxiv.org/pdf/2510.02333
Copy Paste: [[2510.02333]] Human Mobility Datasets Enriched With Contextual and Social Dimensions(https://arxiv.org/abs/2510.02333)
Keywords: language model, llm
Abstract: In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.

Title: Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

Authors: Zhe Li, Wei Zhao, Yige Li, Jun Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02334
Pdf URL: https://arxiv.org/pdf/2510.02334
Copy Paste: [[2510.02334]] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing(https://arxiv.org/abs/2510.02334)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitive noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients, which operates directly in the model's activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at this https URL.

Title: FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

Authors: Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, Xiaoxing Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02335
Pdf URL: https://arxiv.org/pdf/2510.02335
Copy Paste: [[2510.02335]] FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory(https://arxiv.org/abs/2510.02335)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion.

Title: CRACQ: A Multi-Dimensional Approach To Automated Document Assessment

Authors: Ishak Soltani, Francisco Belo, Bernardo Tavares
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02337
Pdf URL: https://arxiv.org/pdf/2510.02337
Copy Paste: [[2510.02337]] CRACQ: A Multi-Dimensional Approach To Automated Document Assessment(https://arxiv.org/abs/2510.02337)
Keywords: llm
Abstract: This paper presents CRACQ, a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. Building on insights from trait-based Automated Essay Scoring (AES), CRACQ expands its focus beyond essays to encompass diverse forms of machine-generated text, providing a rubric-driven and interpretable methodology for automated evaluation. Unlike single-score approaches, CRACQ integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis. Trained on 500 synthetic grant proposals, CRACQ was benchmarked against an LLM-as-a-judge and further tested on both strong and weak real applications. Preliminary results indicate that CRACQ produces more stable and interpretable trait-level judgments than direct LLM evaluation, though challenges in reliability and domain scope remain.

Title: Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards

Authors: Samyak Jhaveri, Praphul Singh, Jangwon Kim, Tara Taghavi, Krishnaram Kenthapadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02338
Pdf URL: https://arxiv.org/pdf/2510.02338
Copy Paste: [[2510.02338]] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards(https://arxiv.org/abs/2510.02338)
Keywords: language model, gpt, hallucination
Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
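For readers unfamiliar with GRPO, the group-relative step it is named for can be sketched as follows: several outputs are sampled for one prompt, each gets a scalar reward, and each reward is normalized against the group's mean and standard deviation. The rewards below are made-up numbers standing in for claim-level evaluator scores. The use of the population standard deviation and the `eps` constant are assumptions, and the abstract's reward-gating strategy is not modeled because its details are not given.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    rewards: scalar rewards for several outputs sampled from one prompt.
    eps:     small constant to avoid division by zero when all rewards tie.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four candidate clinical notes from one dialogue.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value or reward model is needed for this step, which matches the abstract's point about avoiding a separately trained reward model.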

Title: Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Authors: Kevin Zhou, Adam Dejl, Gabriel Freedman, Lihu Chen, Antonio Rago, Francesca Toni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02339
Pdf URL: https://arxiv.org/pdf/2510.02339
Copy Paste: [[2510.02339]] Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models(https://arxiv.org/abs/2510.02339)
Keywords: language model, llm, prompt
Abstract: Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods' effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

Title: Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Authors: Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02340
Pdf URL: https://arxiv.org/pdf/2510.02340
Copy Paste: [[2510.02340]] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs(https://arxiv.org/abs/2510.02340)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate an earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when the model is directly queried about information after that date, they struggle to induce forgetting when the forgotten content is not directly asked about but is causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at this https URL.

Title: DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Authors: Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02341
Pdf URL: https://arxiv.org/pdf/2510.02341
Copy Paste: [[2510.02341]] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning(https://arxiv.org/abs/2510.02341)
Keywords: language model, gpt
Abstract: Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at this https URL.

Title: BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training

Authors: Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02343
Pdf URL: https://arxiv.org/pdf/2510.02343
Copy Paste: [[2510.02343]] BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training(https://arxiv.org/abs/2510.02343)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes a sizable action set of 12 social media interaction types (likes, replies, reposts, etc.), each instance tied to the posting activity preceding it. This supports the development of agents that model social media users with context-dependence not only in language but also in interaction behaviours. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.

Title: Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

Authors: Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang
Subjects: cs.CL, cs.AI, cs.DC, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2510.02345
Pdf URL: https://arxiv.org/pdf/2510.02345
Copy Paste: [[2510.02345]] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression(https://arxiv.org/abs/2510.02345)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
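The storage arithmetic behind the shared-base-plus-residual structure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cluster_cost`, the cluster size, the matrix shape, and the adapter rank are all hypothetical, and quantization metadata (scales, zero points) is ignored.

```python
def cluster_cost(n_experts, d_out, d_in, rank):
    """Compare dense expert storage with one shared base plus low-rank residuals.

    Assumption: each expert is approximated as base + A_i @ B_i, where
    A_i is (d_out, rank) and B_i is (rank, d_in).
    """
    dense_params = n_experts * d_out * d_in
    base_params = d_out * d_in
    residual_params = n_experts * rank * (d_out + d_in)

    dense_bytes = dense_params * 2             # every expert stored in FP16
    compressed_bytes = (
        base_params * 2                        # shared base in FP16 (2 bytes each)
        + residual_params // 2                 # residual factors in INT4 (2 values per byte)
    )
    return {
        "param_ratio": dense_params / (base_params + residual_params),
        "byte_ratio": dense_bytes / compressed_bytes,
    }


# Hypothetical cluster: 8 experts of shape 1024 x 1024 with rank-64 residuals.
cost = cluster_cost(n_experts=8, d_out=1024, d_in=1024, rank=64)
print(cost)  # param_ratio is 4.0 and byte_ratio is 6.4 for these shapes
```

The ratios depend entirely on the chosen shapes and rank; a lower rank or a larger cluster pushes the parameter ratio higher, at the cost of a coarser approximation of each expert.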

Title: Small Language Models for Curriculum-based Guidance

Authors: Konstantinos Katharakis, Sippo Rossi, Raghava Rao Mukkamala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02347
Pdf URL: https://arxiv.org/pdf/2510.02347
Copy Paste: [[2510.02347]] Small Language Models for Curriculum-based Guidance(https://arxiv.org/abs/2510.02347)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: The adoption of generative AI and large language models (LLMs) in education is still emerging. In this study, we explore the development and evaluation of AI teaching assistants that provide curriculum-based guidance using a retrieval-augmented generation (RAG) pipeline applied to selected open-source small language models (SLMs). We benchmarked eight SLMs, including LLaMA 3.1, IBM Granite 3.3, and Gemma 3 (7-17B parameters), against GPT-4o. Our findings show that with proper prompting and targeted retrieval, SLMs can match LLMs in delivering accurate, pedagogically aligned responses. Importantly, SLMs offer significant sustainability benefits due to their lower computational and energy requirements, enabling real-time use on consumer-grade hardware without depending on cloud infrastructure. This makes them not only cost-effective and privacy-preserving but also environmentally responsible, positioning them as viable AI teaching assistants for educational institutions aiming to scale personalized learning in a sustainable and energy-efficient manner.

Title: LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

Authors: Dzmitry Pihulski, Karol Charchut, Viktoria Novogrodskaia, Jan Kocoń
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02350
Pdf URL: https://arxiv.org/pdf/2510.02350
Copy Paste: [[2510.02350]] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL(https://arxiv.org/abs/2510.02350)
Keywords: language model, gpt, llm
Abstract: Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.
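Because LLMSQL exposes questions and full SQL queries as plain text, generated queries can be scored by running them. The sketch below shows one common scoring approach, execution match against a gold query, using Python's standard `sqlite3` module. The abstract does not state which metric the benchmark uses, so the metric, the `players` table, and the helper name `execution_match` are assumptions made for illustration.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


# Hypothetical single-table database in the spirit of WikiSQL-style questions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann", "Red", 31), ("Bo", "Blue", 12), ("Cy", "Red", 7)],
)

gold = "SELECT name FROM players WHERE team = 'Red' AND points > 10"
equivalent = "SELECT name FROM players WHERE points > 10 AND team = 'Red'"
wrong = "SELECT name FROM players WHERE team = 'Blue'"
broken = "SELEC name FROM players"

for candidate in (equivalent, wrong, broken):
    print(execution_match(conn, candidate, gold))
```

Execution match accepts queries that differ textually but return the same result, which string comparison would reject.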

Title: Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

Authors: Dzmitry Pihulski, Jan Kocoń
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02351
Pdf URL: https://arxiv.org/pdf/2510.02351
Copy Paste: [[2510.02351]] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs(https://arxiv.org/abs/2510.02351)
Keywords: language model, gpt, llm, prompt
Abstract: We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.

Title: Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

Authors: Yihao Wu, Tianrui Wang, Yizhou Peng, Yi-Wen Chao, Xuyi Zhuang, Xinsheng Wang, Shunshun Yin, Ziyang Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02352
Pdf URL: https://arxiv.org/pdf/2510.02352
Copy Paste: [[2510.02352]] Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations(https://arxiv.org/abs/2510.02352)
Keywords: language model, gpt, llm
Abstract: While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations, across both open-source models like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We found that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.

Title: An Senegalese Legal Texts Structuration Using LLM-augmented Knowledge Graph

Authors: Oumar Kane, Mouhamad M. Allaya, Dame Samb, Mamadou Bousso
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02353
Pdf URL: https://arxiv.org/pdf/2510.02353
Copy Paste: [[2510.02353]] An Senegalese Legal Texts Structuration Using LLM-augmented Knowledge Graph(https://arxiv.org/abs/2510.02353)
Keywords: language model, gpt, llm
Abstract: This study examines the application of artificial intelligence (AI) and large language models (LLMs) to improve access to legal texts in Senegal's judicial system. The emphasis is on the difficulties of extracting and organizing legal documents, highlighting the need for better access to judicial information. The research successfully extracted 7,967 articles from various legal documents, particularly focusing on the Land and Public Domain Code. A detailed graph database was developed, which contains 2,872 nodes and 10,774 relationships, aiding in the visualization of interconnections within legal texts. In addition, advanced triple extraction techniques were used to extract knowledge, demonstrating the effectiveness of models such as GPT-4o, GPT-4, and Mistral-Large in identifying relationships and relevant metadata. Through these technologies, the aim is to create a solid framework that allows Senegalese citizens and legal professionals to more effectively understand their rights and responsibilities.

Title: Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

Authors: Shreya Saha, Shurui Li, Greta Tuckute, Yuanning Li, Ru-Yuan Zhang, Leila Wehbe, Evelina Fedorenko, Meenakshi Khosla
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02354
Pdf URL: https://arxiv.org/pdf/2510.02354
Copy Paste: [[2510.02354]] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness(https://arxiv.org/abs/2510.02354)
Keywords: language model
Abstract: The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting "I had a pancake" to include details like "maple syrup") further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.

Title: DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

Authors: Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02358
Pdf URL: https://arxiv.org/pdf/2510.02358
Copy Paste: [[2510.02358]] DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding(https://arxiv.org/abs/2510.02358)
Keywords: language model, llm
Abstract: As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which are then verified in parallel by the target model. However, many deployments still rely on AR drafters, where sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass, while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, parallel per-position candidates form a token lattice in which the locally highest-probability token at each position need not form a causal left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts next proposal size based on recent acceptance feedback and realized generated length. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
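The draft-then-verify loop that this abstract builds on can be sketched in a few lines. The example is a toy under stated assumptions: verification is greedy (a draft token is kept only if it equals the target model's own next token), the target model is a lookup into a fixed reference sequence, and the doubling/halving rule in `next_draft_length` is a hypothetical stand-in for the paper's ADL controller, whose actual update rule is not given in the abstract.

```python
def verify_draft(draft, target_next):
    """Keep draft tokens while they match the target model's greedy choice.

    Returns the emitted tokens and the number of draft tokens accepted.
    On the first mismatch the target's own token is emitted and checking stops.
    """
    emitted = []
    for token in draft:
        expected = target_next(emitted)
        if token != expected:
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(token)
    return emitted, len(emitted)


def next_draft_length(current, n_accepted, min_len=2, max_len=16):
    """Hypothetical controller: grow after full acceptance, shrink after poor acceptance."""
    if n_accepted >= current:
        return min(max_len, current * 2)
    if n_accepted < current // 2:
        return max(min_len, current // 2)
    return current


# Toy target model: always continues a fixed reference sequence.
reference = [5, 6, 7, 8]
target_next = lambda prefix: reference[len(prefix)]

emitted, n_accepted = verify_draft([5, 6, 9, 8], target_next)
print(emitted, n_accepted)                  # [5, 6, 7] 2
print(next_draft_length(4, n_accepted))     # 4
```

In a real system the target model scores all draft positions in one parallel forward pass; the sequential loop here only mirrors the acceptance logic.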

Title: Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis

Authors: Jiashu Ye, Tong Wu, Weiwen Chen, Hao Zhang, Zeteng Lin, Xingxing Li, Shujuan Weng, Manni Zhu, Xin Yuan, Xinlong Hong, Jingjie Li, Junyu Zheng, Zhijiong Huang, Jing Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02359
Pdf URL: https://arxiv.org/pdf/2510.02359
Copy Paste: [[2510.02359]] Emission-GPT: A domain-specific language model agent for knowledge retrieval, emission inventory and data analysis(https://arxiv.org/abs/2510.02359)
Keywords: language model, gpt, prompt, agent
Abstract: Improving air quality and addressing climate change rely on accurate understanding and analysis of air pollutant and greenhouse gas emissions. However, emission-related knowledge is often fragmented and highly specialized, while existing methods for accessing and compiling emissions data remain inefficient. These issues hinder the ability of non-experts to interpret emissions information, posing challenges to research and management. To address this, we present Emission-GPT, a knowledge-enhanced large language model agent tailored for the atmospheric emissions domain. Built on a curated knowledge base of over 10,000 documents (including standards, reports, guidebooks, and peer-reviewed literature), Emission-GPT integrates prompt engineering and question completion to support accurate domain-specific question answering. Emission-GPT also enables users to interactively analyze emissions data via natural language, such as querying and visualizing inventories, analyzing source contributions, and recommending emission factors for user-defined scenarios. A case study in Guangdong Province demonstrates that Emission-GPT can extract key insights--such as point source distributions and sectoral trends--directly from raw data with simple prompts. Its modular and extensible architecture facilitates automation of traditionally manual workflows, positioning Emission-GPT as a foundational tool for next-generation emission inventory development and scenario-based assessment.

Title: Spiral of Silence in Large Language Model Agents

Authors: Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02360
Pdf URL: https://arxiv.org/pdf/2510.02360
Copy Paste: [[2510.02360]] Spiral of Silence in Large Language Model Agents(https://arxiv.org/abs/2510.02360)
Keywords: language model, llm, agent
Abstract: The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the 'agents' are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of 'History' and 'Persona' signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman's rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.
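The Mann-Kendall trend test named in this abstract can be sketched with the standard library alone. The version below uses the normal approximation with continuity correction and the no-ties variance formula, so it assumes the series has no repeated values. The `majority_share` series is made-up illustration data, not a result from the paper.

```python
import math


def mann_kendall(series):
    """Mann-Kendall trend test (no-ties variance, normal approximation).

    Returns the S statistic, the z-score, and a two-sided p-value.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)       # sign of each pairwise difference

    var_s = n * (n - 1) * (2 * n + 5) / 18     # valid only when there are no ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return s, z, p_value


# Hypothetical share of agents voicing the majority opinion over ten rounds.
majority_share = [0.52, 0.55, 0.58, 0.61, 0.66, 0.70, 0.74, 0.79, 0.85, 0.90]
s, z, p_value = mann_kendall(majority_share)
print(s, round(z, 3), p_value < 0.001)
```

A large positive S with a small p-value indicates a monotonic upward trend, which is the kind of signal a rising majority share would produce.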

Title: ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Authors: Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02361
Pdf URL: https://arxiv.org/pdf/2510.02361
Copy Paste: [[2510.02361]] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference(https://arxiv.org/abs/2510.02361)
Keywords: llm
Abstract: Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to self-attention's quadratic complexity in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they suffer from either semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.
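The inference-time idea in this abstract, re-running chunk selection only when a chunk boundary is detected, can be illustrated with a toy loop. Everything here is a simplification chosen for illustration: the boundary detector is a punctuation check rather than a learned Chunk Adapter, the `score` function is a static stand-in for query-dependent chunk attention, and `run_chunk_selection` is a hypothetical name.

```python
def run_chunk_selection(tokens, is_boundary, score, top_k):
    """Re-select the top-k chunks only at chunk boundaries; reuse the selection otherwise."""
    chunks, current = [], []
    selected, n_selections = [], 0
    trace = []                                  # selection in force after each token
    for token in tokens:
        current.append(token)
        if is_boundary(token):
            chunks.append(current)              # close the chunk that just ended
            current = []
            ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)
            selected = sorted(ranked[:top_k])   # keep chunk indices in document order
            n_selections += 1
        trace.append(list(selected))
    return chunks, trace, n_selections


tokens = "the cat sat . a dog ran far away . ok .".split()
chunks, trace, n_selections = run_chunk_selection(
    tokens,
    is_boundary=lambda tok: tok == ".",         # toy boundary detector
    score=len,                                  # toy relevance score: chunk length
    top_k=2,
)
print(n_selections, "selections for", len(tokens), "tokens")
print(trace[-1])                                # [0, 1]
```

Selection cost is paid once per chunk instead of once per token, which is the source of the speedup the abstract describes; the fraction of cached keys and values retained depends on `top_k` and the chunk sizes.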

Title: A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History

Authors: Matei-Iulian Cocu, Răzvan-Cosmin Cristia, Adrian Marius Dumitran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02362
Pdf URL: https://arxiv.org/pdf/2510.02362
Copy Paste: [[2510.02362]] A Cross-Lingual Analysis of Bias in Large Language Models Using Romanian History(https://arxiv.org/abs/2510.02362)
Keywords: language model, llm
Abstract: In this case study, we select a set of controversial Romanian historical questions and ask multiple Large Language Models to answer them across languages and contexts, in order to assess their biases. Besides serving educational purposes, the study is motivated by the recognition that history is often presented through altered perspectives, primarily influenced by the culture and ideals of a state, even by large language models. Since these models are often trained on data sets that may contain ambiguities, their lack of neutrality is subsequently passed on to users. The research process was carried out in three stages, to confirm the idea that the type of response expected can influence, to a certain extent, the response itself; after providing an affirmative answer to a given question, an LLM could shift its way of thinking when asked the same question again but told to respond with a numerical value on a scale. Results show that binary response stability is relatively high but far from perfect and varies by language. Models often flip stance across languages or between formats; numeric ratings frequently diverge from the initial binary choice, and the most consistent models are not always those judged most accurate or neutral. Our research brings to light the predisposition of models to such inconsistencies, within a specific contextualization of the language for the question asked.

Title: Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

Authors: Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02369
Pdf URL: https://arxiv.org/pdf/2510.02369
Copy Paste: [[2510.02369]] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents(https://arxiv.org/abs/2510.02369)
Keywords: language model, llm, prompt, agent
Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct's mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

Title: Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

Authors: Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02370
Pdf URL: https://arxiv.org/pdf/2510.02370
Copy Paste: [[2510.02370]] Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models(https://arxiv.org/abs/2510.02370)
Keywords: language model, retrieval-augmented generation
Abstract: Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.

Title: Pretraining with hierarchical memories: separating long-tail and common knowledge

Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02375
Pdf URL: https://arxiv.org/pdf/2510.02375
Copy Paste: [[2510.02375]] Pretraining with hierarchical memories: separating long-tail and common knowledge(https://arxiv.org/abs/2510.02375)
Keywords: language model, prompt
Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

Title: Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Authors: Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02377
Pdf URL: https://arxiv.org/pdf/2510.02377
Copy Paste: [[2510.02377]] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems(https://arxiv.org/abs/2510.02377)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
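A minimal sketch of likelihood-based answer selection across several LLMs, in the spirit of the abstract above. The abstract does not specify how the log-likelihood score is calibrated, so the two steps used here (length normalization and subtracting each scorer's mean) are assumptions for illustration only, and the function name `select_response` is hypothetical.

```python
from statistics import mean

def select_response(loglik, lengths):
    """Pick the candidate response with the highest pooled score.

    loglik[m][r] is the total log-likelihood that scorer model m assigns to
    candidate response r; lengths[r] is the token count of response r.
    """
    n_resp = len(lengths)
    scores = [0.0] * n_resp
    for per_model in loglik:
        # Length-normalize so long responses are not penalized for length alone.
        norm = [ll / n for ll, n in zip(per_model, lengths)]
        # Assumed calibration: centre each scorer so that a uniformly
        # over- or under-confident model does not dominate the sum.
        mu = mean(norm)
        for r, value in enumerate(norm):
            scores[r] += value - mu
    best = max(range(n_resp), key=scores.__getitem__)
    return best, scores

# Two scorer models, three candidate responses of equal length.
best, scores = select_response(
    loglik=[[-10.0, -30.0, -20.0], [-12.0, -40.0, -8.0]],
    lengths=[10, 10, 10],
)
```

Here each model scores every candidate, including candidates produced by other models, so no external verifier or repeated sampling is needed.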

Title: Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation

Authors: Haoyue Bai, Haoyu Wang, Shengyu Chen, Zhengzhang Chen, Lu-An Tang, Wei Cheng, Haifeng Chen, Yanjie Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02388
Pdf URL: https://arxiv.org/pdf/2510.02388
Copy Paste: [[2510.02388]] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation(https://arxiv.org/abs/2510.02388)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.

Title: KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

Authors: Yinyi Luo, Zhexian Zhou, Hao Chen, Kai Qiu, Marios Savvides, Yixuan Li, Jindong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02392
Pdf URL: https://arxiv.org/pdf/2510.02392
Copy Paste: [[2510.02392]] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning(https://arxiv.org/abs/2510.02392)
Keywords: language model, llm
Abstract: Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? How do editing and unlearning differ as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights into knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not update knowledge the way humans do across different levels of knowledge, and that there exists a consistency-capacity trade-off. We hope our findings can offer suggestions for the design of more reliable and scalable strategies. Code: this https URL

Title: Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

Authors: Manasi Patwardhan, Ayush Agarwal, Shabbirhussain Bhaisaheb, Aseem Arora, Lovekesh Vig, Sunita Sarawagi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02394
Pdf URL: https://arxiv.org/pdf/2510.02394
Copy Paste: [[2510.02394]] Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing(https://arxiv.org/abs/2510.02394)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
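The sub-string level retrieval mentioned in the abstract above could look roughly like the sketch below. The abstract gives no implementation details, so everything here is an assumption: the `(term, text)` statement layout, the `min_chars` threshold, ranking by longest common substring, and the example statements are all hypothetical.

```python
from difflib import SequenceMatcher

def retrieve(query, statements, min_chars=5, top_k=3):
    """Return domain statements whose trigger term overlaps the query.

    statements is a list of (term, text) pairs. A statement matches when its
    term shares a contiguous substring of at least min_chars characters with
    the query; matches are ranked by the length of that substring.
    """
    q = query.lower()
    scored = []
    for term, text in statements:
        t = term.lower()
        matcher = SequenceMatcher(None, q, t, autojunk=False)
        match = matcher.find_longest_match(0, len(q), 0, len(t))
        if match.size >= min_chars:
            scored.append((match.size, text))
    # Stable sort: longer overlaps first, ties keep their original order.
    scored.sort(key=lambda pair: -pair[0])
    return [text for _, text in scored[:top_k]]

# Hypothetical database-level domain statements.
STATEMENTS = [
    ("churned customer", "A churned customer has no orders in the last 90 days."),
    ("gross margin", "Gross margin = (revenue - cogs) / revenue."),
    ("fiscal year", "The fiscal year starts on 1 April."),
]

hits = retrieve("How many customers churned in fiscal 2023?", STATEMENTS)
```

Matching on substrings rather than whole terms lets the query "customers churned" reach the statement keyed on "churned customer" even though the exact phrase never appears; the retrieved statements would then be added to the text-to-SQL prompt.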

Title: Words That Make Language Models Perceive

Authors: Sophie L. Wang, Phillip Isola, Brian Cheung
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02425
Pdf URL: https://arxiv.org/pdf/2510.02425
Copy Paste: [[2510.02425]] Words That Make Language Models Perceive(https://arxiv.org/abs/2510.02425)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to 'see' or 'hear', it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

Title: CLARITY: Clinical Assistant for Routing, Inference, and Triage

Authors: Vladimir Shaposhnikov, Aleksandr Nesterov, Ilia Kopanichuk, Ivan Bakulin, Egor Zhelvakov, Ruslan Abramov, Ekaterina Tsapieva, Dmitry V. Dylov, Ivan Oseledets
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2510.02463
Pdf URL: https://arxiv.org/pdf/2510.02463
Copy Paste: [[2510.02463]] CLARITY: Clinical Assistant for Routing, Inference, and Triage(https://arxiv.org/abs/2510.02463)
Keywords: language model, llm, agent
Abstract: We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients' conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ a Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, and is flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within two months of deployment, 2,500 of which were expert-annotated for subsequent validation. The validation results show that CLARITY surpasses human-level performance in terms of first-attempt routing precision, while requiring consultations up to 3 times shorter than those with a human.

Title: Unraveling Syntax: How Language Models Learn Context-Free Grammars

Authors: Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio
Subjects: cs.CL, cs.FL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02524
Pdf URL: https://arxiv.org/pdf/2510.02524
Copy Paste: [[2510.02524]] Unraveling Syntax: How Language Models Learn Context-Free Grammars(https://arxiv.org/abs/2510.02524)
Keywords: language model
Abstract: We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.

Title: Hierarchical Semantic Retrieval with Cobweb

Authors: Anant Gupta, Karthik Singaravadivelan, Zekun Wang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.02539
Pdf URL: https://arxiv.org/pdf/2510.02539
Copy Paste: [[2510.02539]] Hierarchical Semantic Retrieval with Cobweb(https://arxiv.org/abs/2510.02539)
Keywords: gpt
Abstract: Neural document retrieval often treats a corpus as a flat cloud of vectors scored at a single granularity, leaving corpus structure underused and explanations opaque. We use Cobweb--a hierarchy-aware framework--to organize sentence embeddings into a prototype tree and rank documents via coarse-to-fine traversal. Internal nodes act as concept prototypes, providing multi-granular relevance signals and a transparent rationale through retrieval paths. We instantiate two inference approaches: a generalized best-first search and a lightweight path-sum ranker. We evaluate our approaches on MS MARCO and QQP with encoder (e.g., BERT/T5) and decoder (GPT-2) representations. Our results show that our retrieval approaches match the dot product search on strong encoder embeddings while remaining robust when kNN degrades: with GPT-2 vectors, dot product performance collapses whereas our approaches still retrieve relevant results. Overall, our experiments suggest that Cobweb provides competitive effectiveness, improved robustness to embedding quality, scalability, and interpretable retrieval via hierarchical prototypes.

Title: Knowledge-Graph Based RAG System Evaluation Framework

Authors: Sicheng Dong, Vahid Zolfaghari, Nenad Petrovic, Alois Knoll
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02549
Pdf URL: https://arxiv.org/pdf/2510.02549
Copy Paste: [[2510.02549]] Knowledge-Graph Based RAG System Evaluation Framework(https://arxiv.org/abs/2510.02549)
Keywords: language model, llm, retrieval augmented generation
Abstract: Large language models (LLMs) have become a significant research focus and are utilized in various fields, such as text generation and dialog systems. One of the most essential applications of LLMs is Retrieval Augmented Generation (RAG), which greatly enhances generated content's reliability and relevance. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to effectively capture the key features of modern LLM-generated content that often exhibits high fluency and naturalness. Inspired by the RAGAS tool, a well-known RAG evaluation framework, we extended this framework into a KG-based evaluation paradigm, enabling multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these comprehensive evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.

Title: Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Authors: Tolúlọpé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02569
Pdf URL: https://arxiv.org/pdf/2510.02569
Copy Paste: [[2510.02569]] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models(https://arxiv.org/abs/2510.02569)
Keywords: language model
Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don't, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which strategy arises depends on whether the speech encoder is trained only for speech recognition or also for translation.

Title: Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Authors: Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02629
Pdf URL: https://arxiv.org/pdf/2510.02629
Copy Paste: [[2510.02629]] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models(https://arxiv.org/abs/2510.02629)
Keywords: language model
Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.

Title: Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions

Authors: Fulei Zhang, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02645
Pdf URL: https://arxiv.org/pdf/2510.02645
Copy Paste: [[2510.02645]] Mind the Gap: Linguistic Divergence and Adaptation Strategies in Human-LLM Assistant vs. Human-Human Interactions(https://arxiv.org/abs/2510.02645)
Keywords: language model, llm, chat, agent
Abstract: As Large Language Models (LLMs) are increasingly deployed in customer-facing applications, a critical yet underexplored question is how users communicate differently with LLM chatbots compared to human agents. In this study, we present empirical evidence that users adopt distinct communication styles when they interact with chatbots versus human agents. Our analysis reveals significant differences in grammatical fluency, politeness, and lexical diversity in user language between the two settings. These findings suggest that models trained exclusively on human-human interaction data may not adequately accommodate the communication style shift that occurs once an LLM chatbot is deployed. To enhance LLM robustness to post-launch communication style changes, we experimented with two strategies: (1) data augmentation during the post-training phase and (2) inference-time user message reformulation. Our results indicate that models trained on stylistically diverse datasets significantly outperform those trained exclusively on original or stylistically uniform datasets, while inference-time reformulation proved less effective. These insights help us to better adapt our models for improved LLM-user interaction experiences.

Title: SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

Authors: Rui Qi, Zhibo Man, Yufeng Chen, Fengran Mo, Jinan Xu, Kaiyu Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02648
Pdf URL: https://arxiv.org/pdf/2510.02648
Copy Paste: [[2510.02648]] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models(https://arxiv.org/abs/2510.02648)
Keywords: language model, llm, prompt
Abstract: Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, the capacity for reasoning has not been successfully transferred to non-high-resource languages due to resource constraints, so models struggle with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves the performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand queries in different languages in a more sophisticated way. Besides, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapting to various backbones of LLMs. It can also be integrated with other training-free strategies for further improvements. Our code is available at this https URL.

Title: Self-Improvement in Multimodal Large Language Models: A Survey

Authors: Shijian Deng, Kai Wang, Tianyu Yang, Harsh Singh, Yapeng Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02665
Pdf URL: https://arxiv.org/pdf/2510.02665
Copy Paste: [[2510.02665]] Self-Improvement in Multimodal Large Language Models: A Survey(https://arxiv.org/abs/2510.02665)
Keywords: language model, llm
Abstract: Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.

Title: Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

Authors: Yavuz Bakman, Sungmin Kang, Zhiqi Huang, Duygu Nur Yaldiz, Catarina G. Belém, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Salman Avestimehr, Daben Liu, Sai Praneeth Karimireddy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02671
Pdf URL: https://arxiv.org/pdf/2510.02671
Copy Paste: [[2510.02671]] Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering(https://arxiv.org/abs/2510.02671)
Keywords: llm, prompt
Abstract: Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.
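Illustrative sketch (not the authors' code): the abstract defines token-level uncertainty as the cross-entropy between the true distribution and the model's predictive distribution, and says the epistemic component is isolated by decomposing it. A standard way to split cross-entropy is H(p*, q) = H(p*) + KL(p* || q); whether the paper uses exactly this split is an assumption here, and the two distributions below are made-up toy values.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p_true, q_model):
    """Cross-entropy H(p_true, q_model) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p_true, q_model) if pi > 0)

def kl_divergence(p_true, q_model):
    """KL(p_true || q_model) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p_true, q_model) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
p_true = [0.7, 0.2, 0.1]   # stand-in for the (unknown) true distribution
q_model = [0.4, 0.4, 0.2]  # stand-in for the given model's prediction

total = cross_entropy(p_true, q_model)      # total token-level uncertainty
aleatoric = entropy(p_true)                 # part that no model can remove
epistemic = kl_divergence(p_true, q_model)  # part due to the model's gap

# The identity H(p*, q) = H(p*) + KL(p* || q) holds up to rounding.
print(total, aleatoric + epistemic)
```

In practice the true distribution is unknown, which is why the abstract approximates it with an idealized, perfectly prompted model.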

Title: Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Authors: Yubo Li, Ramayya Krishnan, Rema Padman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.02712
Pdf URL: https://arxiv.org/pdf/2510.02712
Copy Paste: [[2510.02712]] Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks(https://arxiv.org/abs/2510.02712)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present the first comprehensive survival analysis of conversational AI robustness, analyzing 36,951 conversation turns across 9 state-of-the-art LLMs to model failure as a time-to-event process. Our survival modeling framework, which employs Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest approaches, reveals extraordinary temporal dynamics. We find that abrupt, prompt-to-prompt (P2P) semantic drift is catastrophic, dramatically increasing the hazard of conversational failure. In stark contrast, gradual, cumulative drift is highly protective, vastly reducing the failure hazard and enabling significantly longer dialogues. AFT models with interactions demonstrate superior performance, achieving excellent discrimination and exceptional calibration. These findings establish survival analysis as a powerful paradigm for evaluating LLM robustness, offer concrete insights for designing resilient conversational agents, and challenge prevailing assumptions about the necessity of semantic consistency in conversational AI systems.
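Illustrative sketch (not the authors' code): the time-to-event framing treats each conversation as a subject that either fails at some turn or is censored when the dialogue ends without failure. The abstract names Cox, AFT, and Random Survival Forest models; the simpler Kaplan-Meier estimator is swapped in below only to show the data layout, and the durations and event flags are made-up toy values.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(turn, survival_probability)] at each turn where a failure occurred.

    durations[i] is the turn of failure, or the last turn seen if censored.
    observed[i] is True when conversation i actually failed at durations[i].
    """
    failures = Counter(t for t, failed in zip(durations, observed) if failed)
    survival = 1.0
    curve = []
    for t in sorted(failures):
        # Conversations still running (not yet failed or censored) just before turn t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - failures[t] / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: six conversations, three failed and three censored.
durations = [2, 3, 3, 5, 8, 8]
observed = [True, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for turn, prob in curve:
    print(f"turn {turn}: P(no failure yet) = {prob:.3f}")
```

A regression model such as Cox proportional hazards would additionally take per-turn covariates (for example, a drift measure) and estimate how they scale the failure hazard.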

Title: TravelBench : Exploring LLM Performance in Low-Resource Domains

Authors: Srinivas Billa, Xiaonan Jing
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02719
Pdf URL: https://arxiv.org/pdf/2510.02719
Copy Paste: [[2510.02719]] TravelBench : Exploring LLM Performance in Low-Resource Domains(https://arxiv.org/abs/2510.02719)
Keywords: llm
Abstract: Results on existing LLM benchmarks capture little information about model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed the performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Despite the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

Title: IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context

Authors: Santhosh G S, Akshay Govind S, Gokul S Krishnan, Balaraman Ravindran, Sriraam Natarajan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02742
Pdf URL: https://arxiv.org/pdf/2510.02742
Copy Paste: [[2510.02742]] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context(https://arxiv.org/abs/2510.02742)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high-stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India, where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset, IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes), comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability-related biases being notably persistent and religion bias generally lower, likely due to global debiasing efforts, demonstrating the need for fairer model development.

Title: The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

Authors: Hangfan Zhang, Siyuan Xu, Zhimeng Guo, Huaisheng Zhu, Shicheng Liu, Xinrun Wang, Qiaosheng Zhang, Yang Chen, Peng Ye, Lei Bai, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02752
Pdf URL: https://arxiv.org/pdf/2510.02752
Copy Paste: [[2510.02752]] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback(https://arxiv.org/abs/2510.02752)
Keywords: language model, llm, agent
Abstract: Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks showing a 53.8% relative improvement with less than 1.2% extra data demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.

Title: StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

Authors: Tengjun Ni, Xin Yuan, Shenghong Li, Kai Wu, Ren Ping Liu, Wei Ni, Wenjie Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.02827
Pdf URL: https://arxiv.org/pdf/2510.02827
Copy Paste: [[2510.02827]] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering(https://arxiv.org/abs/2510.02827)
Keywords: language model, hallucination, retrieval-augmented generation, chain-of-thought
Abstract: Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
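Illustrative sketch (not the authors' code): the abstract describes a BFS-based traversal that expands along relevant edges of a knowledge graph and assembles explicit evidence chains for each sub-question. The function below shows plain breadth-first search with a hop limit over a tiny hand-written graph. The graph contents, the entity names, and the `max_hops` parameter are hypothetical, and the relevance filtering the paper mentions is omitted.

```python
from collections import deque

def bfs_evidence_chains(graph, start, targets, max_hops=3):
    """Collect chains of (head, relation, tail) triples from start to any target.

    graph maps an entity to a list of (relation, neighbor) pairs.
    Chains are returned in order of increasing length (BFS order).
    """
    queue = deque([(start, [])])
    visited = {start}
    chains = []
    while queue:
        node, path = queue.popleft()
        if node in targets and path:
            chains.append(path)
            continue
        if len(path) >= max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return chains

# Hypothetical graph parsed from retrieved passages.
graph = {
    "FilmX": [("directed_by", "DirectorY"), ("released_in", "Year2010")],
    "DirectorY": [("born_in", "CityZ")],
}

# Sub-question: "Where was the director of FilmX born?"
chains = bfs_evidence_chains(graph, "FilmX", targets={"CityZ"})
for chain in chains:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in chain))
```

Only the triples on a returned chain would need to be shown to the language model, which is one way to read the abstract's point about avoiding superfluous context.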

Title: Evaluating Large Language Models for IUCN Red List Species Information

Authors: Shinya Uryu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.02830
Pdf URL: https://arxiv.org/pdf/2510.02830
Copy Paste: [[2510.02830]] Evaluating Large Language Models for IUCN Red List Species Information(https://arxiv.org/abs/2510.02830)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.

Title: Self-Reflective Generation at Test Time

Authors: Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02919
Pdf URL: https://arxiv.org/pdf/2510.02919
Copy Paste: [[2510.02919]] Self-Reflective Generation at Test Time(https://arxiv.org/abs/2510.02919)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
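Illustrative sketch (not the authors' code): the abstract says high-uncertainty tokens are identified by dynamic entropy thresholding but does not give the rule. One plausible reading, assumed here, is to flag a decoding step when the entropy of its next-token distribution exceeds the mean plus `k` standard deviations of the entropies in a recent window. The `window` and `k` parameters and the toy distributions are hypothetical.

```python
import math
from statistics import mean, pstdev

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain_steps(step_distributions, window=4, k=1.0):
    """Return indices of steps whose entropy exceeds a moving threshold.

    The threshold for step i is mean + k * std of the entropies of up to
    `window` preceding steps; steps with fewer than 2 predecessors are skipped.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    flagged = []
    for i, h in enumerate(entropies):
        history = entropies[max(0, i - window):i]
        if len(history) < 2:
            continue
        if h > mean(history) + k * pstdev(history):
            flagged.append(i)
    return flagged

# Hypothetical per-step distributions over a 3-token vocabulary.
steps = [
    [0.97, 0.02, 0.01],
    [0.95, 0.03, 0.02],
    [0.96, 0.03, 0.01],
    [0.34, 0.33, 0.33],  # near-uniform: the model is unsure here
    [0.98, 0.01, 0.01],
]
flagged = flag_uncertain_steps(steps)
print(flagged)
```

In the method described above, a flagged step is where a corrective vector would be trained before the token is emitted; that optimization step is not shown here.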

Title: Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Authors: Jingqi Zhang, Ruibo Chen, Yingqing Yang, Peihua Mai, Heng Huang, Yan Pang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.02962
Pdf URL: https://arxiv.org/pdf/2510.02962
Copy Paste: [[2510.02962]] Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking(https://arxiv.org/abs/2510.02962)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections (p<0.05), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: this https URL.

Title: Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines

Authors: Matthew Lewis, Samuel Thio, Richard JB Dobson, Spiros Denaxas
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.02967
Pdf URL: https://arxiv.org/pdf/2510.02967
Copy Paste: [[2510.02967]] Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines(https://arxiv.org/abs/2510.02967)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper presents the development and evaluation of a Retrieval-Augmented Generation (RAG) system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The extensive length and volume of these guidelines can impede their utilisation within a time-constrained healthcare system, a challenge this project addresses through the creation of a system capable of providing users with precisely matched information in response to natural language queries. The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries. The most significant impact of the RAG system was observed during the generation phase. When evaluated on a manually curated dataset of seventy question-answer pairs, RAG-enhanced models showed substantial gains in performance. Faithfulness, the measure of whether an answer is supported by the source text, was increased by 64.7 percentage points to 99.5% for the RAG-enhanced O4-Mini model and significantly outperformed the medical-focused Meditron3-8B LLM, which scored 43%. This, combined with a perfect Context Precision score of 1 for all RAG-enhanced models, confirms the system's ability to prevent information fabrication by grounding its answers in relevant source material. This study thus establishes RAG as an effective, reliable, and scalable approach for applying generative AI in healthcare, enabling cost-effective access to medical guidelines.
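Illustrative sketch (not the authors' code): the retrieval figures quoted above (MRR, recall at the first chunk, recall within the top ten) can be computed from ranked retrieval results as follows. The sketch assumes exactly one relevant chunk per query, which the abstract does not state, and the chunk IDs are made-up toy values.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_ids, relevant_id) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)

def recall_at_k(results, k):
    """Fraction of queries whose relevant chunk appears in the top k."""
    hits = sum(1 for ranked, rel in results if rel in ranked[:k])
    return hits / len(results)

# Hypothetical retrieval output: (ranked chunk IDs, ID of the relevant chunk).
results = [
    (["a", "b", "c"], "a"),  # relevant chunk ranked first
    (["d", "e", "f"], "e"),  # relevant chunk ranked second
    (["g", "h", "i"], "z"),  # relevant chunk not retrieved
]

print("MRR:", mean_reciprocal_rank(results))
print("Recall@1:", recall_at_k(results, 1))
print("Recall@2:", recall_at_k(results, 2))
```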

Title: Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Authors: Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2510.03093
Pdf URL: https://arxiv.org/pdf/2510.03093
Copy Paste: [[2510.03093]] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?(https://arxiv.org/abs/2510.03093)
Keywords: llm, prompt, chain-of-thought
Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

Title: Semantic Similarity in Radiology Reports via LLMs and NER

Authors: Beth Pearson, Ahmed Adnan, Zahraa Abdallah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03102
Pdf URL: https://arxiv.org/pdf/2510.03102
Copy Paste: [[2510.03102]] Semantic Similarity in Radiology Reports via LLMs and NER(https://arxiv.org/abs/2510.03102)
Keywords: language model, llm
Abstract: Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by evaluating how well several LLMs compare radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer junior doctors valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently. Code is available at: this https URL.

Title: Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Authors: Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2510.03115
Pdf URL: https://arxiv.org/pdf/2510.03115
Copy Paste: [[2510.03115]] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation(https://arxiv.org/abs/2510.03115)
Keywords: prompt, chain-of-thought
Abstract: Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

Title: SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03120
Pdf URL: https://arxiv.org/pdf/2510.03120
Copy Paste: [[2510.03120]] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?(https://arxiv.org/abs/2510.03120)
Keywords: llm, agent
Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework, SurveyBench, featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).

Title: Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Authors: Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03136
Pdf URL: https://arxiv.org/pdf/2510.03136
Copy Paste: [[2510.03136]] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models(https://arxiv.org/abs/2510.03136)
Keywords: language model, llm
Abstract: Confidence calibration, the alignment of a model's predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model's internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight: late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
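Illustrative sketch (not the authors' code): calibration, as defined above, compares predicted confidence with actual accuracy. A common way to measure the mismatch is Expected Calibration Error (ECE) with equal-width confidence bins. The abstract does not say which metric the paper reports, so ECE is an assumption here, and the confidences and correctness labels are made-up toy values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: two confident and correct, two unsure with one wrong.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

Computing such a score separately per language, and per layer used to derive the confidence, is one way to reproduce the kind of comparison the abstract describes.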

Title: EditLens: Quantifying the Extent of AI Editing in Text

Authors: Katherine Thai, Bradley Emi, Elyas Masrour, Mohit Iyyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03154
Pdf URL: https://arxiv.org/pdf/2510.03154
Copy Paste: [[2510.03154]] EditLens: Quantifying the Extent of AI Editing in Text(https://arxiv.org/abs/2510.03154)
Keywords: language model
Abstract: A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
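Note: the abstract mentions "lightweight similarity metrics" that compare an edited text with its human-written original, without naming them. As a rough illustration only, the sketch below uses a word-level `difflib.SequenceMatcher` ratio from the Python standard library; this metric and the function name `edit_magnitude` are assumptions, not the authors' method.

```python
from difflib import SequenceMatcher


def edit_magnitude(original, edited):
    """Return a score in [0, 1]: 0.0 means unchanged, 1.0 means no words in common.

    Assumption: word-level sequence similarity stands in for the
    (unspecified) similarity metrics described in the abstract.
    """
    original_words = original.split()
    edited_words = edited.split()
    # autojunk=False keeps the heuristic that discards frequent tokens switched off.
    matcher = SequenceMatcher(None, original_words, edited_words, autojunk=False)
    return 1.0 - matcher.ratio()
```

For example, `edit_magnitude("the cat sat", "the dog sat")` is about 0.33, since two of the three words on each side are preserved. A score like this could serve as a regression target, which is the role the abstract assigns to its similarity metrics.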

Title: Neural Correlates of Language Models Are Specific to Human Language

Authors: Iñigo Parra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03156
Pdf URL: https://arxiv.org/pdf/2510.03156
Copy Paste: [[2510.03156]] Neural Correlates of Language Models Are Specific to Human Language(https://arxiv.org/abs/2510.03156)
Keywords: language model
Abstract: Previous work has shown correlations between the hidden states of large language models and fMRI brain responses on language tasks. These correlations have been taken as evidence of the representational similarity of these models and brain states. This study tests whether these previous results are robust to several possible concerns. Specifically, this study shows: (i) that the previous results are still found after dimensionality reduction, and thus are not attributable to the curse of dimensionality; (ii) that previous results are confirmed when using new measures of similarity; (iii) that correlations between brain representations and those from models are specific to models trained on human language; and (iv) that the results are dependent on the presence of positional encoding in the models. These results confirm and strengthen the results of previous research and contribute to the debate on the biological plausibility and interpretability of state-of-the-art large language models.

Title: Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?

Authors: Xuan Xu, Haolun Li, Zhongliang Yang, Beilin Chu, Jia Song, Moxuan Xu, Linna Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03174
Pdf URL: https://arxiv.org/pdf/2510.03174
Copy Paste: [[2510.03174]] Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?(https://arxiv.org/abs/2510.03174)
Keywords: language model, llm, prompt
Abstract: Traditional topic models such as neural topic models (NTMs) rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling (TM) in the era of large language models, framing TM as a long-form generation task whose definition is updated in this paradigm. We propose a simple but practical approach to implement LLM-based topic model tasks out of the box (sample a data subset, generate topics and representative text with our prompt, and assign text by keyword matching). We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that "a majority of NTMs are outdated."
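Note: the last step of the pipeline described above, text assignment by keyword match, could look roughly like the sketch below. The abstract gives no details, so the tokenization, the "most keyword hits wins" rule, and the names `assign_topics` and `topic_keywords` are all hypothetical; the topic keywords would come from the LLM generation step, which is not shown.

```python
import re


def assign_topics(docs, topic_keywords):
    """Assign each document to the topic whose keywords it mentions most often.

    docs: list of strings.
    topic_keywords: dict mapping a topic name to a list of single-word keywords
        (assumed to be produced earlier by prompting an LLM).
    Returns a list of topic names, with None where no keyword matched.
    """
    assignments = []
    for doc in docs:
        # Lowercased word set; a deliberately simple tokenizer for illustration.
        tokens = set(re.findall(r"[a-z0-9]+", doc.lower()))
        best_topic, best_hits = None, 0
        for topic, keywords in topic_keywords.items():
            hits = sum(1 for keyword in keywords if keyword.lower() in tokens)
            if hits > best_hits:  # ties keep the first topic encountered
                best_topic, best_hits = topic, hits
        assignments.append(best_topic)
    return assignments
```

Documents that match no keyword are left unassigned (`None`) rather than forced into a topic, which is one possible design choice among several.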

Title: FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03204
Pdf URL: https://arxiv.org/pdf/2510.03204
Copy Paste: [[2510.03204]] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents(https://arxiv.org/abs/2510.03204)
Keywords: language model, llm, prompt, agent
Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

Title: Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.03215
Pdf URL: https://arxiv.org/pdf/2510.03215
Copy Paste: [[2510.03215]] Cache-to-Cache: Direct Semantic Communication Between Large Language Models(https://arxiv.org/abs/2510.03215)
Keywords: language model, llm
Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at this https URL.

Title: Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

Authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03223
Pdf URL: https://arxiv.org/pdf/2510.03223
Copy Paste: [[2510.03223]] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment(https://arxiv.org/abs/2510.03223)
Keywords: language model, llm, prompt
Abstract: To solve complex reasoning tasks for Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model's attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiments show that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between "non-reasoning" models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.

Title: Reward Models are Metrics in a Trench Coat

Authors: Sebastian Gehrmann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03231
Pdf URL: https://arxiv.org/pdf/2510.03231
Copy Paste: [[2510.03231]] Reward Models are Metrics in a Trench Coat(https://arxiv.org/abs/2510.03231)
Keywords: language model
Abstract: The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.