2025-09-03

Title: MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

Authors: Marshall Thomas, Edward Fish, Richard Bowden
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.00030
Pdf URL: https://arxiv.org/pdf/2509.00030
Copy Paste: [[2509.00030]] MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation(https://arxiv.org/abs/2509.00030)
Keywords: language model, llm
Abstract: Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.

Title: Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

Authors: Teo Susnjak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00038
Pdf URL: https://arxiv.org/pdf/2509.00038
Copy Paste: [[2509.00038]] Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis(https://arxiv.org/abs/2509.00038)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.

Title: Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Authors: Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian McAuley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00190
Pdf URL: https://arxiv.org/pdf/2509.00190
Copy Paste: [[2509.00190]] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics(https://arxiv.org/abs/2509.00190)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
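The Markov-chain step described in the abstract above can be illustrated with a minimal sketch. It assumes each reasoning step has already been assigned an integer latent-state label (the spectral embedding and clustering stages are not shown), and the function name `transition_matrix` and the toy trajectories are hypothetical, not taken from the paper.

```python
def transition_matrix(trajectories, num_states):
    """Estimate row-normalised transition probabilities from state sequences."""
    counts = [[0] * num_states for _ in range(num_states)]
    for states in trajectories:
        # Count every consecutive (current state, next state) pair.
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left (e.g. a final state) keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical cluster labels for the steps of three reasoning chains.
trajectories = [[0, 1, 2], [0, 1, 1, 2], [0, 2]]
P = transition_matrix(trajectories, num_states=3)
```

Here `P[i][j]` is the empirical probability of moving from latent state `i` to state `j`; rows of such a matrix are what a consistency or pattern analysis would inspect.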

Title: The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

Authors: Seiji Maekawa, Hayate Iso, Nikita Bhutani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00245
Pdf URL: https://arxiv.org/pdf/2509.00245
Copy Paste: [[2509.00245]] The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs(https://arxiv.org/abs/2509.00245)
Keywords: llm
Abstract: Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
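The rarity criterion in the abstract above (features appearing in less than 10% of documents) can be made concrete with a small sketch. It assumes features have already been extracted as a set of strings per document; the function name `distinctive_features` and the sample data are hypothetical and are not part of DiFBench.

```python
from collections import Counter


def distinctive_features(doc_features, threshold=0.10):
    """Map each document id to its features that occur in fewer than
    `threshold` (as a fraction) of all documents."""
    n = len(doc_features)
    # Document frequency: in how many documents does each feature appear?
    doc_freq = Counter(f for feats in doc_features.values() for f in set(feats))
    rare = {f for f, c in doc_freq.items() if c / n < threshold}
    return {doc: sorted(set(feats) & rare) for doc, feats in doc_features.items()}


# Hypothetical collection of 20 candidate profiles; only one mentions "rust".
docs = {f"cv{i}": {"python"} for i in range(20)}
docs["cv3"] = {"python", "rust"}
result = distinctive_features(docs)
```

The frequent feature `python` is filtered out everywhere, which is the behaviour the abstract reports models getting wrong when they misidentify frequent features as distinctive.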

Title: The Temporal Game: A New Perspective on Temporal Relation Extraction

Authors: Hugo Sousa, Ricardo Campos, Alípio Jorge
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00250
Pdf URL: https://arxiv.org/pdf/2509.00250
Copy Paste: [[2509.00250]] The Temporal Game: A New Perspective on Temporal Relation Extraction(https://arxiv.org/abs/2509.00250)
Keywords: agent
Abstract: In this paper we demo the Temporal Game, a novel approach to temporal relation extraction that casts the task as an interactive game. Instead of directly annotating interval-level relations, our approach decomposes them into point-wise comparisons between the start and end points of temporal entities. At each step, players classify a single point relation, and the system applies temporal closure to infer additional relations and enforce consistency. This point-based strategy naturally supports both interval and instant entities, enabling more fine-grained and flexible annotation than any previous approach. The Temporal Game also lays the groundwork for training reinforcement learning agents, by treating temporal annotation as a sequential decision-making task. To showcase this potential, the demo presented in this paper includes a Game mode, in which users annotate texts from the TempEval-3 dataset and receive feedback based on a scoring system, and an Annotation mode, which allows custom documents to be annotated and the resulting timeline to be exported. Therefore, this demo serves both as a research tool and an annotation interface. The demo is publicly available at this https URL, and the source code is open-sourced to foster further research and community-driven development in temporal reasoning and annotation.
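The temporal-closure step mentioned in the abstract above can be sketched as a transitive closure over point relations. This is a simplification that assumes only a strict "before" relation between points (the demo may also handle equality and other relations); the names `temporal_closure`, `is_consistent`, and the `A.start`-style point labels are hypothetical.

```python
from itertools import product


def temporal_closure(before):
    """Return the transitive closure of a set of (earlier, later) point pairs."""
    closed = set(before)
    changed = True
    while changed:
        changed = False
        # Chain every pair (a < b) with every pair (b < d) to infer (a < d).
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed


def is_consistent(closed):
    """A strict 'before' order is inconsistent if any point precedes itself."""
    return all(a != b for a, b in closed)


# Two interval entities A and B; the player has annotated end(A) < start(B).
annotated = {("A.start", "A.end"), ("B.start", "B.end"), ("A.end", "B.start")}
closed = temporal_closure(annotated)
```

From a single annotated point relation, the closure infers the remaining ones (for example `A.start` before `B.end`), and a contradictory annotation such as `B.end` before `A.start` would be flagged by `is_consistent`.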

Title: Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval

Authors: Yuxiang Liu, Tian Wang, Gourab Kundu, Tianyu Cao, Guang Cheng, Zhen Ge, Jianshu Chen, Qingjun Cui, Trishul Chilimbi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00276
Pdf URL: https://arxiv.org/pdf/2509.00276
Copy Paste: [[2509.00276]] Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval(https://arxiv.org/abs/2509.00276)
Keywords: language model, llm
Abstract: Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents beyond surface-level lexical matching, where encoder-only retrievers often fall short. Decoder-only large language models (LLMs), known for their strong reasoning capabilities, offer a promising alternative. Despite this potential, existing LLM-based embedding methods primarily focus on contextual representation and do not fully exploit the reasoning strength of LLMs. To bridge this gap, we propose Reasoning-Infused Text Embedding (RITE), a simple but effective approach that integrates logical reasoning into the text embedding process using generative LLMs. RITE builds upon existing language model embedding techniques by generating intermediate reasoning texts in the token space before computing embeddings, thereby enriching representations with inferential depth. Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains, underscoring the effectiveness of incorporating reasoning into the embedding process.

Title: OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews

Authors: Mir Tafseer Nayeem, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.00285
Pdf URL: https://arxiv.org/pdf/2509.00285
Copy Paste: [[2509.00285]] OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews(https://arxiv.org/abs/2509.00285)
Keywords: llm
Abstract: We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.

Title: Wage Sentiment Indices Derived from Survey Comments via Large Language Models

Authors: Taihei Sone
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00290
Pdf URL: https://arxiv.org/pdf/2509.00290
Copy Paste: [[2509.00290]] Wage Sentiment Indices Derived from Survey Comments via Large Language Models(https://arxiv.org/abs/2509.00290)
Keywords: language model, llm
Abstract: The emergence of generative Artificial Intelligence (AI) has created new opportunities for economic text analysis. This study proposes a Wage Sentiment Index (WSI) constructed with Large Language Models (LLMs) to forecast wage dynamics in Japan. The analysis is based on the Economy Watchers Survey (EWS), a monthly survey conducted by the Cabinet Office of Japan that captures real-time economic assessments from workers in industries highly sensitive to business conditions. The WSI extends the framework of the Price Sentiment Index (PSI) used in prior studies, adapting it specifically to wage-related sentiment. To ensure scalability and adaptability, a data architecture is also developed that enables integration of additional sources such as newspapers and social media. Experimental results demonstrate that WSI models based on LLMs significantly outperform both baseline approaches and pretrained models. These findings highlight the potential of LLM-driven sentiment indices to enhance the timeliness and effectiveness of economic policy design by governments and central banks.

Title: Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models

Authors: Chen Zheng, Yiyuan Ma, Yuan Yang, Deyi Liu, Jing Liu, Zuquan Song, Yuxin Song, Cheng Ren, Hang Zhu, Xin Liu, Yiyuan Ma, Siyuan Qiao, Xun Zhou, Liang Xiang, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00309
Pdf URL: https://arxiv.org/pdf/2509.00309
Copy Paste: [[2509.00309]] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models(https://arxiv.org/abs/2509.00309)
Keywords: language model
Abstract: The development of alignment and reasoning capabilities in large language models has seen remarkable progress through two paradigms: the instruction tuning and reinforcement learning from human feedback (RLHF) alignment paradigm, and the distillation-based reasoning fine-tuning paradigm. While both approaches prove effective independently, the third paradigm of applying RLHF to distillation-trained models presents significant challenges. Our investigation reveals two critical phenomena that emerge in this paradigm: Sequence Length Collapse, where generation length drops dramatically during early RLHF training, and the Reward Hockey Stick Curve, featuring severe reward score drops followed by gradual recovery. These instabilities fundamentally compromise the model's alignment and reasoning capabilities. To address these challenges, we propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach. BAI first merges instruction-following and distillation-based reasoning fine-tuned models, then further combines this intermediate model with the pretrained model to preserve foundational knowledge. Through comprehensive experiments across diverse benchmarks and detailed analysis of training experiments, we demonstrate that BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and enables continuous sequence length improvement during training. Additionally, our analysis reveals that balanced merging ratios achieve optimal trade-offs between training stability and reasoning capability preservation. Our work provides an effective solution for stable training in this third paradigm, enabling more capable reasoning models that combine distillation efficiency with RLHF alignment.
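The two-stage merge described in the abstract above can be sketched as follows. The abstract only says "weighted model merging", so plain linear interpolation of parameters is an assumption here, as are the names `merge` and `balanced_actor_init`, the ratio arguments, and the use of single floats in place of weight tensors.

```python
def merge(params_a, params_b, weight_a):
    """Linearly interpolate two parameter dicts that share the same keys."""
    return {
        name: weight_a * params_a[name] + (1.0 - weight_a) * params_b[name]
        for name in params_a
    }


def balanced_actor_init(instruct, distilled, pretrained, ratio_1=0.5, ratio_2=0.5):
    """Stage 1: merge the instruction-tuned and distilled reasoning models.
    Stage 2: merge that intermediate model with the pretrained model."""
    intermediate = merge(instruct, distilled, ratio_1)
    return merge(intermediate, pretrained, ratio_2)


# Hypothetical one-parameter "models"; real checkpoints would hold tensors.
actor = balanced_actor_init({"w": 1.0}, {"w": 3.0}, {"w": 0.0})
```

With both ratios at 0.5 the intermediate parameter is 2.0 and the final actor parameter is 1.0; the abstract's finding about balanced merging ratios corresponds to the choice of `ratio_1` and `ratio_2` in this sketch.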

Title: GIER: Gap-Driven Self-Refinement for Large Language Models

Authors: Rinku Dewri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.00325
Pdf URL: https://arxiv.org/pdf/2509.00325
Copy Paste: [[2509.00325]] GIER: Gap-Driven Self-Refinement for Large Language Models(https://arxiv.org/abs/2509.00325)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: We introduce GIER (Gap-driven Iterative Enhancement of Responses), a general framework for improving large language model (LLM) outputs through self-reflection and revision based on conceptual quality criteria. Unlike prompting strategies that rely on demonstrations, examples, or chain-of-thought templates, GIER utilizes natural language descriptions of reasoning gaps, and prompts a model to iteratively critique and refine its own outputs to better satisfy these criteria. Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four LLMs (GPT-4.1, GPT-4o Mini, Gemini 1.5 Pro, and Llama 3.3 70B), GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy. Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.

Title: Open Data Synthesis For Deep Research

Authors: Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00375
Pdf URL: https://arxiv.org/pdf/2509.00375
Copy Paste: [[2509.00375]] Open Data Synthesis For Deep Research(https://arxiv.org/abs/2509.00375)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via reject sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On a challenging benchmark BrowseComp-Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our codes and datasets at this https URL.

Title: GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction

Authors: Xuelin Li, Xiangqi Jin, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00388
Pdf URL: https://arxiv.org/pdf/2509.00388
Copy Paste: [[2509.00388]] GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction(https://arxiv.org/abs/2509.00388)
Keywords: language model, llm
Abstract: Efficient Key-Value (KV) cache management is essential for processing long text sequences in large language models (LLMs), where memory constraints often limit performance. Conventional KV eviction strategies, such as top-k selection based on attention scores, depend on static heuristics that fail to capture the evolving implicit dependencies among tokens during inference. To overcome this, we propose GraphKV, a graph-based framework that redefines token selection for KV cache compression. In GraphKV, tokens are modeled as nodes with importance scores, and edges represent their similarity relationships. Through a decay-signal-propagation mechanism, token importance is dynamically updated by propagating information across the graph, enabling adaptive retention of the most contextually significant tokens. GraphKV can be seamlessly utilized in existing KV cache eviction methods such as SnapKV and PyramidKV in a plug-and-play manner. Code will be released on GitHub.
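The abstract above does not specify the decay-signal-propagation rule, so the following is only one possible reading, offered as an illustration of how graph-based selection can differ from static top-k. It assumes that once a token is kept, the importance of its remaining neighbours is decayed in proportion to their similarity to it; the function `select_tokens`, the `decay` parameter, and the toy scores are all hypothetical and are not GraphKV's actual algorithm.

```python
def select_tokens(scores, similarity, budget, decay=0.5):
    """Greedily keep `budget` tokens, decaying the scores of tokens that are
    similar to ones already kept (an assumed propagation rule)."""
    scores = list(scores)  # work on a copy so the caller's scores are untouched
    kept = []
    remaining = set(range(len(scores)))
    while remaining and len(kept) < budget:
        best = max(remaining, key=lambda i: scores[i])
        kept.append(best)
        remaining.remove(best)
        # Propagate a decay signal from the kept token to its neighbours.
        for j in remaining:
            scores[j] *= 1.0 - decay * similarity[best][j]
    return sorted(kept)


# Tokens 0 and 1 are near-duplicates; token 2 is unrelated to both.
scores = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept = select_tokens(scores, similarity, budget=2)
```

In this toy case static top-2 selection would keep the two near-duplicate tokens 0 and 1, whereas the similarity-aware update keeps tokens 0 and 2.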

Title: The Resurgence of GCG Adversarial Attacks on Large Language Models

Authors: Yuting Tan, Xuying Li, Zhuo Li, Huizhen Shu, Peikang Hu
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.00391
Pdf URL: https://arxiv.org/pdf/2509.00391
Copy Paste: [[2509.00391]] The Resurgence of GCG Adversarial Attacks on Large Language Models(https://arxiv.org/abs/2509.00391)
Keywords: language model, gpt, llm, prompt
Abstract: Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.

Title: MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature

Authors: Juraj Vladika, Florian Matthes
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.00414
Pdf URL: https://arxiv.org/pdf/2509.00414
Copy Paste: [[2509.00414]] MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature(https://arxiv.org/abs/2509.00414)
Keywords: language model
Abstract: In the digital age, people often turn to the Internet in search of medical advice and recommendations. With the increasing volume of online content, it has become difficult to distinguish reliable sources from misleading information. Similarly, millions of medical studies are published every year, making it challenging for researchers to keep track of the latest scientific findings. These evolving studies can reach differing conclusions, which is not reflected in traditional search tools. To address these challenges, we introduce MedSEBA, an interactive AI-powered system for synthesizing evidence-based answers to medical questions. It utilizes the power of Large Language Models to generate coherent and expressive answers, but grounds them in trustworthy medical studies dynamically retrieved from the research database PubMed. The answers consist of key points and arguments, which can be traced back to respective studies. Notably, the platform also provides an overview of the extent to which the most relevant studies support or refute the given medical claim, and a visualization of how the research consensus evolved through time. Our user study revealed that medical experts and lay users find the system usable and helpful, and the provided answers trustworthy and informative. This makes the system well-suited for both everyday health questions and advanced research insights.

Title: The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

Authors: Fenghua Liu, Yulong Chen, Yixuan Liu, Zhujun Jin, Solomon Tsai, Ming Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00425
Pdf URL: https://arxiv.org/pdf/2509.00425
Copy Paste: [[2509.00425]] The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang(https://arxiv.org/abs/2509.00425)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm where human learners can reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite where solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment, while GPT-5 shows emerging metalinguistic awareness to a limited extent but not the systematic grammatical mastery that humans show. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.
摘要：大型语言模型（LLMS）在许多基准中实现了金色的性能，但尚不清楚这种成功是否反映了真正的推理或模式匹配。从认知科学的角度来看，一项信息的测试是，模型是否可以通过明确的金属语言性学习学习来掌握一种陌生的语言，这是一种范式，人类学习者可以通过金属语言推理可靠地化语法系统。我们与Camlang一起解决了这个问题，Camlang是一种新颖的构造语言，展示了自然主义而未进行的特征组合。 Camlang由两个明确的资源，一本语法书籍和双语词典组成，它们通过明确的语法规则和词汇查找反映了成人的第二语言学习，并使我们能够在形态 - 词法，词汇，词汇语义和句子级别的推理中解散错误的错误。人类实验表明，这些资源足以让参与者获得CAMLANG并成功解决Camlang任务。为了进行评估，我们将CommonSenseQA调整为Camlang，创建Camlang-CSQA-V0，这是更广泛的套件中的第一个任务，在该套件中，解决问题需要应用语法规则和词汇映射。实验结果表明，GPT-5在英语中达到98 \％EM的准确性，而在Camlang中只有47 \％，远低于87 \％的人类表现，而其他最先进的推理LLMS的表现甚至更差。人类的验证进一步表明，大多数模型的成功源于浅词汇对准，而GPT-5在有限的程度上显示出新兴的元素语言意识，而不是像人类一样系统的语法掌握。 Camlang建立了一个认知扎根的评估范式，该范式揭示了当前模型与人类属性能力之间的基本差距。

Title: GOSU: Retrieval-Augmented Generation with Global-Level Optimized Semantic Unit-Centric Framework

Authors: Xuecheng Zou, Ke Liu, Bingbing Wang, Huafei Deng, Li Zhang, Yu Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00449
Pdf URL: https://arxiv.org/pdf/2509.00449
Copy Paste: [[2509.00449]] GOSU: Retrieval-Augmented Generation with Global-Level Optimized Semantic Unit-Centric Framework(https://arxiv.org/abs/2509.00449)
Keywords: retrieval-augmented generation
Abstract: Building upon the standard graph-based Retrieval-Augmented Generation (RAG), the introduction of heterogeneous graphs and hypergraphs aims to enrich retrieval and generation by leveraging the relationships between multiple entities through the concept of semantic units (SUs). But this also raises a key issue: The extraction of high-level SUs limited to local text chunks is prone to ambiguity, complex coupling, and increased retrieval overhead due to the lack of global knowledge or the neglect of fine-grained relationships. To address these issues, we propose GOSU, a semantic unit-centric RAG framework that efficiently performs global disambiguation and utilizes SUs to capture interconnections between different nodes across the global context. In the graph construction phase, GOSU performs global merging on the pre-extracted SUs from local text chunks and guides entity and relationship extraction, reducing the difficulty of coreference resolution while uncovering global semantic objects across text chunks. In the retrieval and generation phase, we introduce hierarchical keyword extraction and semantic unit completion. The former uncovers the fine-grained binary relationships overlooked by the latter, while the latter compensates for the coarse-grained n-ary relationships missing from the former. Evaluation across multiple tasks demonstrates that GOSU outperforms the baseline RAG methods in terms of generation quality.
摘要：在基于标准的基于图的检索增强生成（RAG）的基础上，异质图和超图的引入旨在通过通过语义单元的概念（SUS）利用多个实体之间的关系来丰富检索和产生。但这也提出了一个关键问题：由于缺乏全球知识或忽视细粒度的关系而导致的歧义，复杂的耦合和增加的检索间接费用，高级SUS的提取容易受到歧义，复杂的耦合和增加的检索开销。为了解决这些问题，我们提出了GOSU，GOSU是一种以语义单位为中心的RAG框架，可有效执行全球歧义，并利用SUS捕获整个全球环境中不同节点之间的互连。在图形构造阶段，GOSU在本地文本块和指南实体和关系提取中进行了全球合并，并降低了核心分辨率的难度，同时又揭示了跨文本块的全局语义对象。在检索和生成阶段，我们介绍了层次关键字提取和语义单元的完成。前者发现了后者忽略的细粒二进制关系，而后者则补偿了前者缺少的粗粒n- ary关系。跨多个任务的评估表明，GOSU在发电质量方面的表现优于基线抹布方法。

Title: CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

Authors: Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli, Cosimo Distante, Abdenour Hadid
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.00457
Pdf URL: https://arxiv.org/pdf/2509.00457
Copy Paste: [[2509.00457]] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning(https://arxiv.org/abs/2509.00457)
Keywords: llm
Abstract: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.
摘要：伊斯兰遗产法（ILM al-Mawarith）需要精确识别继承人和对股份的计算，这对AI构成了挑战。在本文中，我们提出了一个轻巧的框架，用于使用专门的阿拉伯文本编码器和专业的相关性评分（ARS）解决多项选择继承问题。该系统根据语义相关性对答案选项进行排名，并在没有生成推理的情况下实现快速的，设备的推理。我们评估阿拉伯语编码器（Marbert，Arabicbert，Arabert），并将其与基于API的LLM（Gemini，DeepSeek）在QIAS 2025数据集上进行比较。尽管大型模型的准确性高达87.6％，但它们需要更多的资源，并且依赖上下文。我们的基于玛伯特的方法达到了69.87％的准确性，提出了令人信服的效率，设备可部署性和隐私的案例。虽然这低于表现最好的LLM所实现的87.6％，但我们的工作量化了大型模型的峰值性能与高风险域中较小的专业系统的实际优势之间的关键权衡。

Title: TECP: Token-Entropy Conformal Prediction for LLMs

Authors: Beining Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00461
Pdf URL: https://arxiv.org/pdf/2509.00461
Copy Paste: [[2509.00461]] TECP: Token-Entropy Conformal Prediction for LLMs(https://arxiv.org/abs/2509.00461)
Keywords: language model, llm
Abstract: Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, especially under black-box constraints where internal model signals are inaccessible. In this paper, we introduce Token-Entropy Conformal Prediction (TECP), a novel framework that leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction (CP) pipeline to construct prediction sets with formal coverage guarantees. Unlike existing approaches that rely on semantic consistency heuristics or white-box features, TECP directly estimates epistemic uncertainty from the token entropy structure of sampled generations and calibrates uncertainty thresholds via CP quantiles to ensure provable error control. Empirical evaluations across six large language models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods. Our method provides a principled and efficient solution for trustworthy generation in black-box LLM settings.
摘要：开放式语言生成的不确定性量化（UQ）仍然是一个关键但毫无争议的挑战，尤其是在无法访问的内部模型信号的黑框约束下。在本文中，我们介绍了代币 - 注入综合预测（TECP），这是一个新颖的框架，该框架利用令牌级熵作为无逻辑的无参考的不确定性度量，并将其整合到分裂的保形预测（CP）中，以构建预测集，并提供正式覆盖范围。与依赖语义一致性启发式方法或白盒特征的现有方法不同，TECP直接从采样世代的令牌熵结构中直接估算了认知不确定性，并通过CP量码校准了不确定性阈值，以确保可证明的误差控制。跨六个大语言模型和两个基准（COQA和Triviaqa）的经验评估表明，TECP始终达到可靠的覆盖范围和紧凑的预测集，表现优于先前基于自我矛盾的UQ方法。我们的方法为Black-Box LLM设置中的值得信赖的生成提供了一种有原则而有效的解决方案。
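
The split conformal step that the TECP abstract relies on can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each sampled response already carries a scalar token-entropy score (lower means more confident), and that the calibration scores come from responses known to be correct. The function names and the score format are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    calibration_scores: entropy scores of correct responses on held-out data
    (an assumption made for this sketch).
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: accept everything.
        return math.inf
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidates, threshold):
    """Keep every sampled response whose entropy score is within the threshold.

    candidates: list of (response_text, entropy_score) pairs.
    """
    return [text for text, score in candidates if score <= threshold]
```

With `alpha = 0.2`, the threshold is chosen so that, under exchangeability of calibration and test data, the prediction set misses a correct response at most about 20% of the time.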

Title: Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

Authors: Saksorn Ruangtanusak, Pittawat Taveekitworachai, Kunat Pipatanakul
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.00482
Pdf URL: https://arxiv.org/pdf/2509.00482
Copy Paste: [[2509.00482]] Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting(https://arxiv.org/abs/2509.00482)
Keywords: language model, llm, prompt, agent
Abstract: This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) human-crafted role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques--character-card/scene-contract design and strict enforcement of function calling--which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at this https URL.
摘要：该报告调查了促使工具增强的大语言模型（LLM）在API中充当角色扮演的对话代理，在常识性角色接地对话挑战（CPDC）2025中。在回答之前存在或进行不必要的工具调用。我们探讨了解决这些问题的四种提示方法：1）基本角色提示，2）人力制作的角色提示，3）自动及时及时优化（APO）和4）基于规则的角色提示。基于规则的角色提示（RRP）方法通过两种新型技术（Character-Card/scene-cartract设计和严格执行函数呼叫的执行）实现了最佳性能 - 这使总体得分为0.571，提高了0.519的零击基线得分。这些发现表明，与更精细的方法（例如APO）相比，RRP设计可以显着提高角色扮演对话剂的有效性和可靠性。为了支持未来的角色提示上的努力，我们正在为所有表现最佳的提示和APO工具开放式外源。源代码可在此HTTPS URL上找到。

Title: ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Authors: Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00496
Pdf URL: https://arxiv.org/pdf/2509.00496
Copy Paste: [[2509.00496]] ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics(https://arxiv.org/abs/2509.00496)
Keywords: llm, agent
Abstract: Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we are able to construct an automatic pairwise judge obtaining 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
摘要：评估对研究查询的长期响应在很大程度上取决于专家注释者，将注意力限制在研究人员可以方便地吸引同事之类的AI领域。然而，研究专业知识是普遍的：调查文章综合了分布在文献中的知识。我们介绍了ResearchQA，这是一种通过将75个研究领域的调查文章提炼为21k查询和160k标题项目来评估LLM系统的资源。每个专栏列出了调查部分的查询，列出了特定的答案评估标准，即引用论文，解释并描述局限性。评估31博士学位8个字段中的注释者表示96％的查询支持博士学位。信息需求和87％的标题项目应在系统响应中通过句子或更多。使用我们的专栏，我们能够构建一个自动成对法官，该法官与专家判断获得了74％的协议。我们利用ResearchQA分析超过7.6K的成对评估中18个系统中的能力差距。在覆盖标题项目时，我们评估的没有参数或检索功能的系统超过70％，而最高的代理系统显示75％的覆盖范围。错误分析表明，最高级别的系统完全解决了不到11％的引文标题项目，48％的限制项目以及49％的比较项目。我们发布我们的数据以促进更全面的多场评估。

Title: Entropy-based Coarse and Compressed Semantic Speech Representation Learning

Authors: Jialong Zuo, Guangyan Zhang, Minghui Fang, Shengpeng Ji, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Zhou Zhao
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.00503
Pdf URL: https://arxiv.org/pdf/2509.00503
Copy Paste: [[2509.00503]] Entropy-based Coarse and Compressed Semantic Speech Representation Learning(https://arxiv.org/abs/2509.00503)
Keywords: language model
Abstract: Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
摘要：离散的语音表示学习最近引起了对声学和语义建模的兴趣日益增加。现有方法通常以每秒25或50令牌的速率将16 kHz波形编码为离散令牌。但是，鉴于语音通常每秒仅传达2到5个单词，因此这种细粒的令牌化引入了冗余，并在下游训练和推理中引起了阻碍效率。此外，此频率下的语义语音表示主要捕获语音级别的信息，而语义理解可能不需要这样详细的令牌级别的分辨率。为了解决这些局限性，我们提出了一个基于熵的动态聚合框架，用于学习压缩语义语音表示。语言模型首先是通过对大规模未标记数据的下一步预测进行预训练的，以捕获频繁的令牌模式。然后，预测性熵用于自适应确定聚合边界，然后是一个融合每个段中信息的交叉意见模块。通过调节熵阈值，可以灵活地控制表示表示的粒度和压缩比。关于ASR，语音到文本翻译和语音转换任务的实验表明，压缩表示形式以与密集的令牌序列相同或更好，证明了所提出的方法的有效性。
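
The entropy-driven aggregation described in this abstract can be illustrated with a small sketch. It assumes that a token whose predictive entropy exceeds the threshold opens a new segment (the abstract does not state the exact boundary rule), and it swaps the paper's cross-attention fusion for plain mean pooling to stay short. All names are hypothetical.

```python
def segment_by_entropy(entropies, threshold):
    """Group token indices into segments; a high-entropy token starts a new one."""
    segments = []
    current = []
    for index, entropy in enumerate(entropies):
        if current and entropy > threshold:
            segments.append(current)
            current = []
        current.append(index)
    if current:
        segments.append(current)
    return segments

def mean_pool(features, segments):
    """Average the feature vectors in each segment (stand-in for cross-attention)."""
    pooled = []
    for segment in segments:
        dim = len(features[segment[0]])
        pooled.append(
            [sum(features[i][d] for i in segment) / len(segment) for d in range(dim)]
        )
    return pooled
```

Raising the threshold produces fewer, longer segments and therefore a higher compression ratio, which matches the control knob the abstract describes.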

Title: Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization

Authors: Eunjung Cho, Alexander Hoyle, Yoan Hermstrüwer
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.00529
Pdf URL: https://arxiv.org/pdf/2509.00529
Copy Paste: [[2509.00529]] Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization(https://arxiv.org/abs/2509.00529)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used to generate user-tailored summaries, adapting outputs to specific stakeholders. In legal contexts, this raises important questions about motivated reasoning -- how models strategically frame information to align with a stakeholder's position within the legal system. Building on theories of legal realism and recent trends in legal practice, we investigate how LLMs respond to prompts conditioned on different legal roles (e.g., judges, prosecutors, attorneys) when summarizing judicial decisions. We introduce an evaluation framework grounded in legal fact and reasoning inclusion, also considering favorability towards stakeholders. Our results show that even when prompts include balancing instructions, models exhibit selective inclusion patterns that reflect role-consistent perspectives. These findings raise broader concerns about how similar alignment may emerge as LLMs begin to infer user roles from prior interactions or context, even without explicit role instructions. Our results underscore the need for role-aware evaluation of LLM summarization behavior in high-stakes legal settings.
摘要：大型语言模型（LLMS）越来越多地用于生成用户限制的摘要，将输出调整到特定利益相关者。在法律背景下，这提出了有关积极推理的重要问题 - 模型如何通过战略性地构建信息以与利益相关者在法律体系中的地位保持一致。在法律现实主义理论和法律实践的最新趋势的基础上，我们调查了LLM在总结司法决定时如何以不同的法律角色（例如，法官，检察官，检察官，律师）为条件的提示。我们介绍了一个以法律事实和推理包容为基础的评估框架，也考虑对利益相关者的好处。我们的结果表明，即使提示包括平衡说明，模型也会表现出反映角色一致观点的选择性包含模式。这些发现引起了人们对与LLMS开始从先前的互动或上下文中推断用户角色（即使没有明确的角色指令）的角色的更广泛关注。我们的结果强调了对高风险法律环境中LLM摘要行为的角色感知评估的需求。

Title: Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

Authors: Hanqi Yan, Hainiu Xu, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00544
Pdf URL: https://arxiv.org/pdf/2509.00544
Copy Paste: [[2509.00544]] Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs(https://arxiv.org/abs/2509.00544)
Keywords: language model, llm
Abstract: With Large Language Models (LLMs) becoming increasingly widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow and malicious datasets induces misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment. Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, via switching to "think-mode" or fine-tuning on benign math datasets, with dense models particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings provide new insights into the emerging reasoning-safety trade-off and underscore the urgency of advancing alignment for advanced reasoning models.
摘要：随着大型语言模型（LLMS）越来越广泛地采用，人们对其安全性和与人类价值观的一致性的担忧加剧了。先前的研究表明，狭窄和恶意数据集上的微调LLM会引起未对准的行为。在这项工作中，我们报告了有关现象，推理诱导的未对准的更多。具体而言，我们观察到，通过切换到良性数学数据集上的“思考模式”或微调，LLM会对恶意请求更敏感，而密集的模型尤其容易受到攻击。此外，我们分析了内部模型指出，并发现注意力转移和专业专家的专业专家有助于将过度推理重定向到安全护栏。这些发现为新兴推理安全的权衡提供了新的见解，并强调了提高高级推理模型的一致性的紧迫性。

Title: StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

Authors: Lang Xiong, Nishant Bhargava, Wesley Chang, Jianhang Hong, Haihao Liu, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00591
Pdf URL: https://arxiv.org/pdf/2509.00591
Copy Paste: [[2509.00591]] StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks(https://arxiv.org/abs/2509.00591)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
摘要：当大型语言模型（LLMS）认为从现实世界部署环境变为受控评估设置时，通常会表现出重大的行为转变，这种现象称为“评估意识”。由于基准性能可能无法准确反映模型的真正安全性和诚实，因此这种差异对AI的一致性构成了关键的挑战。在这项工作中，我们通过操纵提示的感知上下文来系统地量化这些行为变化。我们介绍了一种方法，该方法使用线性探测器以连续的规模为提示得分，从“测试般”到“部署”，并利用LLM重写策略将这些提示转移到更自然的部署式环境的同时，同时保留原始任务。使用这种方法，我们在重写后的战略角色扮演数据集中，平均探针得分提高了30％。评估这些原始和重写的提示上的一系列最新模型，我们发现重写的“部署样”提示会引起行为的重大和一致的转变。在所有模型中，我们观察到诚实反应的平均增加为5.26％，欺骗性反应的平均平均减少为12.40％。此外，拒绝率平均增加了6.38％，表明安全性提高了。我们的发现表明，评估意识是直接影响LLM行为的可量化和可操作因素，表明模型更容易在感知到的测试环境中不安全或欺骗性输出。这强调了更现实的评估框架的迫切需求，以便在部署前准确衡量真实的模型对齐。

Title: Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

Authors: Rishiraj Acharya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.00605
Pdf URL: https://arxiv.org/pdf/2509.00605
Copy Paste: [[2509.00605]] Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling(https://arxiv.org/abs/2509.00605)
Keywords: long context
Abstract: The Transformer architecture, underpinned by the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, its core computational primitive scales quadratically with sequence length (O(N^2)), creating a significant bottleneck for processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling that exhibits linear complexity (O(N)) with respect to sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution to efficiently capture local, position-dependent context, and a parallel associative memory retrieval mechanism to model global, content-based patterns. These pathways are dynamically fused using a gating mechanism, allowing the model to flexibly combine local and global information for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets, establishing it as a promising and efficient alternative for sequence modeling.
摘要：以自我发项机制为基础的变压器结构已成为序列建模任务的事实上的标准。但是，其核心计算原始尺度在序列长度（O（n^2））上二次地量表，为处理长上下文创建了重要的瓶颈。在本文中，我们提出了封闭式的关联记忆（GAM）网络，这是一种用于序列建模的新颖的，完全平行的架构，相对于序列长度表现出线性复杂性（O（n））。 GAM块用两种平行途径代替了自我发场层：一种因果卷积，可有效捕获本地，与位置有关的上下文，以及一种并行的关联内存检索机制，以模拟基于内容的全局，基于内容的模式。这些途径使用门控机制动态融合，从而使模型可以灵活地结合每个令牌的本地和全局信息。我们从头开始实施GAM，并针对标准变压器模型和Wikitext-2基准上的现代线性时间基线（MAMBA）进行了严格的比较分析，以及针对Tinystories数据集中的变压器。我们的实验表明，GAM始终更快，在训练速度上的两个基准都优于两个基准，并在所有数据集中实现了卓越或竞争性的最终验证困惑，将其确立为序列建模的有希望和有效的替代方案。
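
The two-pathway design in the GAM abstract can be sketched on scalar token features. This is only an illustration under stated assumptions: the memory pathway is modelled as a softmax read from a fixed-size bank of key/value slots, and the gate is a sigmoid of the current token, neither of which is confirmed by the abstract. Every name and parameter below is hypothetical.

```python
import math

def causal_conv(xs, kernel):
    """Local pathway: y[t] = sum_k kernel[k] * x[t - k], using only past and current tokens."""
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                acc += weight * xs[t - k]
        out.append(acc)
    return out

def memory_read(x, keys, values):
    """Global pathway: softmax-weighted read from a memory bank whose size does not depend on N."""
    scores = [x * key for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def gam_block(xs, kernel, keys, values, gate_w, gate_b):
    """Fuse the two pathways per token with a sigmoid gate."""
    local = causal_conv(xs, kernel)
    out = []
    for x, loc in zip(xs, local):
        glob = memory_read(x, keys, values)
        gate = 1.0 / (1.0 + math.exp(-(gate_w * x + gate_b)))
        out.append(gate * loc + (1.0 - gate) * glob)
    return out
```

Because the kernel width and the number of memory slots are constants, the cost of this sketch grows linearly with the sequence length, and each position can be computed independently of the others.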

Title: Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

Authors: Md Tanzib Hosain, Md Kishor Morol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00629
Pdf URL: https://arxiv.org/pdf/2509.00629
Copy Paste: [[2509.00629]] Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?(https://arxiv.org/abs/2509.00629)
Keywords: language model, prompt, chain-of-thought, agent
Abstract: Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has not received enough attention. This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. Each problem includes official analysis, reference code, and sample, high-quality unit, and hidden tests. With these resources, we are able to develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 only achieves a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judge with reflection and retrieval over episodic information, raises this to 42.2%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. Our quantitative findings and qualitative research provide a step toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at this https URL.
摘要：人类最艰巨的任务之一是在竞争性编程中发现的，问题需要复杂的算法思维，解决难题和创建有效的代码。作为评估语言模型（LMS）的领域，它没有得到足够的关注。这项研究提出了ICPC基准，该基准包括254个国际大学编程竞赛（ICPC）任务。每个问题包括官方分析，参考代码和样本，高质量单元和隐藏测试。我们能够通过这些资源开发和评估各种LM推理技术，用于竞争性编程。随着零射链的提示，我们发现O1仅达到19.1 \％通过@1求解率。借助我们的最佳推理技术，将多转弯的自我判断与反射和对情节信息的检索结合在一起，将其提高到42.2 \％。此外，我们进行了一项新的人类调查，以更深入地了解剩余困难。出人意料的是，我们发现O1可以解决18个问题中的17个，这些问题以前仅通过一些特定说明而无法解决任何模型或技术。我们的定量发现和定性研究提供了具有扎根，想象力和算法思维的LMS的脚步。我们在此HTTPS URL上开放代码和数据。

Title: Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

Authors: Sanjeeevan Selvaganapathy, Mehwish Nasim
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.00673
Pdf URL: https://arxiv.org/pdf/2509.00673
Copy Paste: [[2509.00673]] Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech(https://arxiv.org/abs/2509.00673)
Keywords: language model, llm
Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining whether models with minimal safety alignment (uncensored) might provide more objective classification capabilities compared to their heavily-aligned (censored) counterparts. While uncensored models theoretically offer a less constrained perspective free from moral guardrails that could bias classification decisions, our results reveal a surprising trade-off: censored models significantly outperform their uncensored counterparts in both accuracy and robustness, achieving 78.7% versus 64.1% strict accuracy. However, this enhanced performance comes with its own limitation -- the safety alignment acts as a strong ideological anchor, making censored models resistant to persona-based influence, while uncensored models prove highly malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency.
摘要：我们研究了大语言模型（LLM）在检测隐式和明确的仇恨言论方面的功效，研究了与其对应的（审查）对应物相比，使用最小的安全对齐模型（未经审查）是否可以提供更客观的分类能力。虽然理论上未经审查的模型提供了不受道德护栏的限制观点，而这些护栏可能会偏向分类的决策，但我们的结果表明了一个令人惊讶的权衡：审查的模型在准确性和鲁棒性上都显着超过了他们未经审查的同行，可实现78.7％，而严格准确的准确性为64.1％。但是，这种增强的性能具有其自身的局限性 - 安全对准起着强大的意识形态锚，使对基于角色的影响力具有抵抗力的审查模型，而未经审查的模型则证明具有高度可延展的意识形态框架。此外，我们在理解诸如讽刺之类的细微差别语言中确定了所有模型中所有模型的关键失败。我们还发现，不同目标群体的性能和系统性过高的表现方面令人震惊的公平差异，这使自我报告的确定性不可靠。这些发现挑战了LLM作为客观仲裁者的概念，并强调了对更复杂的审计框架的需求，这些框架涉及公平，校准和意识形态的一致性。

Title: Do small language models generate realistic variable-quality fake news headlines?

Authors: Austin McCutcheon, Chris Brogly
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.00680
Pdf URL: https://arxiv.org/pdf/2509.00680
Copy Paste: [[2509.00680]] Do small language models generate realistic variable-quality fake news headlines?(https://arxiv.org/abs/2509.00680)
Keywords: language model, llm, prompt
Abstract: Small language models (SLMs) have the capability for text generation and may potentially be used to generate falsified texts online. This study evaluates 14 SLMs (1.7B-14B parameters) including LLaMA, Gemma, Phi, SmolLM, Mistral, and Granite families in generating perceived low and high quality fake news headlines when explicitly prompted, and whether they appear to be similar to real-world news headlines. Using controlled prompt engineering, 24,000 headlines were generated across low-quality and high-quality deceptive categories. Existing machine learning and deep learning-based news headline quality detectors were then applied against these SLM-generated fake news headlines. SLMs demonstrated high compliance rates with minimal ethical resistance, though there were some occasional exceptions. Headline quality detection using established DistilBERT and bagging classifier models showed that quality misclassification was common, with detection accuracies only ranging from 35.2% to 63.5%. These findings suggest the following: tested SLMs generally are compliant in generating falsified headlines, although there are slight variations in ethical restraints, and the generated headlines did not closely resemble existing primarily human-written content on the web, given the low quality classification accuracy.
摘要：小语言模型（SLM）具有文本生成的能力，并且有可能用于在线生成伪造的文本。这项研究评估了14个SLM（1.7b-14b参数），包括Llama，Gemma，Phi，Smollm，Mistral和Granite家族，在明确提示时会产生感知到的低和高质量的假新闻头条，以及它们是否似乎类似于现实世界新闻头条。使用受控的及时工程，在低品质和高质量的欺骗性类别中产生了24,000个头条新闻。然后针对这些SLM生成的虚假新闻头条应用现有的机器学习和深度学习的新闻标题质量探测器。 SLM表现出很高的依从性，尽管偶尔会有一些例外，但伦理抵抗的率很小。使用已建立的Distilbert和Bagging分类器模型的标题质量检测表明，质量错误分类很常见，检测精度仅为35.2％至63.5％。这些发现表明：经过伪造的标题通常符合伪造的头条，尽管道德约束存在略有差异，并且由于低质量分类的精度，产生的头条新闻差异很小，而产生的头条新闻并不类似于网络上主要的人为所写内容。

Title: CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Authors: Alex Gulko, Yusen Peng, Sachin Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00691
Pdf URL: https://arxiv.org/pdf/2509.00691
Copy Paste: [[2509.00691]] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders(https://arxiv.org/abs/2509.00691)
Keywords: language model, llm
Abstract: Probing with sparse autoencoders is a promising approach for uncovering interpretable features in large language models (LLMs). However, the lack of automated evaluation methods has hindered their broader adoption and development. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive ablation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks, all without requiring an external LLM. The official implementation and evaluation dataset are open-sourced under the MIT License.
摘要：用稀疏的自动编码器进行探测是一种在大型语言模型（LLM）中揭示可解释功能的有前途的方法。但是，缺乏自动评估方法阻碍了他们更广泛的采用和发展。在这项工作中，我们介绍了CE-Bench，这是一种稀疏自动编码器的新颖且轻巧的对比度评估基准，该基准构建在策划的对比故事对的数据集中。我们进行全面的消融研究，以验证方法的有效性。我们的结果表明，CE Bench可靠地衡量稀疏自动编码器的可解释性，并与现有基准测试良好，而无需外部LLM。正式的实施和评估数据集是根据MIT许可证开源的。

Title: Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs

Authors: Kaiwen Wei, Jinpeng Gao, Jiang Zhong, Yuming Yang, Fengmao Lv, Zhenyang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00698
Pdf URL: https://arxiv.org/pdf/2509.00698
Copy Paste: [[2509.00698]] Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs(https://arxiv.org/abs/2509.00698)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown strong potential in recommendation tasks due to their strengths in language understanding, reasoning and knowledge integration. These capabilities are especially beneficial for review-based recommendation, which relies on semantically rich user-generated texts to reveal fine-grained user preferences and item attributes. However, effectively incorporating reviews into LLM-based recommendation remains challenging because (1) user reviews are hard to utilize dynamically and efficiently under LLMs' constrained context windows, and (2) effective mechanisms for prioritizing the reviews most relevant to the user's current decision context are lacking. To address these challenges, we propose RevBrowse, a review-driven recommendation framework inspired by the "browse-then-decide" decision process commonly observed in online user behavior. RevBrowse integrates user reviews into the LLM-based reranking process to enhance its ability to distinguish between candidate items. To improve the relevance and efficiency of review usage, we introduce PrefRAG, a retrieval-augmented module that disentangles user and item representations into structured forms and adaptively retrieves preference-relevant content conditioned on the target item. Extensive experiments on four Amazon review datasets demonstrate that RevBrowse achieves consistent and significant improvements over strong baselines, highlighting its generalizability and effectiveness in modeling dynamic user preferences. Furthermore, since the retrieval-augmented process is transparent, RevBrowse offers a certain level of interpretability by making visible which reviews influence the final recommendation.
摘要：大型语言模型（LLMS）由于其在语言理解，推理和知识整合方面的优势，在推荐任务方面表现出了强大的潜力。这些功能对于基于审核的建议特别有益，该建议依赖于语义上丰富的用户生成的文本来揭示细粒的用户偏好和项目属性。但是，由于（1）在LLMS受约束的上下文窗口下动态使用用户评论，并且（2）缺乏有效的机制来优先考虑与用户当前决策的问题最相关的评论，因此有效地将评论纳入基于LLM的建议中仍然具有挑战性。为了应对这些挑战，我们提出了RevBrowse，这是一个审查驱动的推荐框架，灵感来自在线用户行为中通常观察到的“浏览 - 任务”决策过程。 RevBrowse将用户评论集成到基于LLM的重新依克过程中，以增强其区分候选项目的能力。为了提高审核用法的相关性和效率，我们介绍了预段，这是一个检索功能的模块，将用户和项目表示形式删除为结构化形式，并适应地检索在目标项目上的优先含量与与偏爱的内容。在四个亚马逊评论数据集上进行的广泛实验表明，Revbrowse对强基础实现了一致和显着的改进，从而突出了其在模拟动态用户偏好时的普遍性和有效性。此外，由于检索提示过程是透明的，因此RevBrowse通过使可见的审查影响最终建议，从而提供了一定程度的解释性。

Title: Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

Authors: Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00707
Pdf URL: https://arxiv.org/pdf/2509.00707
Copy Paste: [[2509.00707]] Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs(https://arxiv.org/abs/2509.00707)
Keywords: language model, llm
Abstract: Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
摘要：蒙版扩散模型（MDMS）为大型语言建模提供了有希望的非自动回旋替代方案。 MDMS的标准解码方法（例如基于置信度的采样），根据每个扩散步骤中的单个令牌信心独立选择令牌。但是，我们观察到，这种独立的令牌选择通常会导致类似于顺序自回归过程的生成订单，从而限制了非自动回归建模的优势。为了减轻这种性质，我们提出了奖励加权采样（RWS），这是一种新颖的解码策略，该策略利用外部奖励模型在迭代扩散过程中提供原则上的全局信号。具体而言，在每个扩散步骤中，RWS评估了整个中间序列的质量，并相应地标记逻辑，从而通过整合全局序列级相干性来指导令牌选择。该方法有选择地提高了最初分数较低的代币的信心，从而促进了更不可能的产生顺序。此外，我们提供理论上的理由表明，奖励加权的logit缩放缩放会在代币选择中引起有益的等级逆转，并始终提高预期奖励。实验表明，RWS显着促进了非自动回归产生顺序，从而导致多个评估指标的改进。这些结果突出了整合全球信号在增强非自动回旋特性和MDMS整体性能方面的有效性。
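
The abstract above says that scaling token logits by a sequence-level reward can reverse which masked position is decoded first. The following is a minimal sketch of that effect only, not the authors' implementation: the mapping `reward_to_scale`, the value of `alpha`, and the toy logits are all assumptions made for illustration, since the abstract does not give the exact scaling rule.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_confident_position(position_logits, scale=1.0):
    # Confidence-based selection: pick the masked position whose top token
    # has the highest probability after multiplying its logits by `scale`.
    confidences = [max(softmax([scale * x for x in logits]))
                   for logits in position_logits]
    return max(range(len(confidences)), key=confidences.__getitem__)

def reward_to_scale(reward, alpha=2.0):
    # Hypothetical mapping from a sequence-level reward in [0, 1] to a
    # logit scale; the real RWS rule may differ.
    return 1.0 + alpha * reward

# Two masked positions with toy logits (assumed values).
position_logits = [
    [3.0, 2.5, -5.0],           # position 0: two close competitors
    [2.0, 1.0, 1.0, 1.0, 1.0],  # position 1: one leader, many weak rivals
]

baseline_choice = most_confident_position(position_logits)
rws_choice = most_confident_position(position_logits, reward_to_scale(1.0))
print(baseline_choice, rws_choice)
```

With these toy values the unscaled rule selects position 0, while the reward-scaled rule selects position 1, which is the kind of rank reversal the abstract refers to.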

Title: Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI

Authors: Elias Ra, Seung Je Kim, Eui-Yeong Seo, Geunju So
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00709
Pdf URL: https://arxiv.org/pdf/2509.00709
Copy Paste: [[2509.00709]] Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI(https://arxiv.org/abs/2509.00709)
Keywords: prompt, agent
Abstract: Higher education faces growing challenges in delivering personalized, scalable, and pedagogically coherent learning experiences. This study introduces a structured framework for designing an AI-powered Learning Management System (AI-LMS) that integrates generative and conversational AI to support adaptive, interactive, and learner-centered instruction. Using a design-based research (DBR) methodology, the framework unfolds through five phases: literature review, SWOT analysis, development of ethical-pedagogical principles, system design, and instructional strategy formulation. The resulting AI-LMS features modular components -- including configurable prompts, adaptive feedback loops, and multi-agent conversation flows -- aligned with pedagogical paradigms such as behaviorist, constructivist, and connectivist learning theories. By combining AI capabilities with human-centered design and ethical safeguards, this study advances a practical model for AI integration in education. Future research will validate and refine the system through real-world implementation.
摘要：高等教育在提供个性化，可扩展和教学上连贯的学习经验方面面临着越来越多的挑战。这项研究介绍了一个结构化的框架，用于设计AI驱动的学习管理系统（AI-LMS），该系统将生成和对话的AI集成以支持适应性，互动和以学习者为中心的指导。使用基于设计的研究（DBR）方法，该框架通过五个阶段展开：文献综述，SWOT分析，道德雄性原理的发展，系统设计和教学策略制定。由此产生的AI-LMS具有模块化组件 - 包括可配置的提示，自适应反馈循环以及多代理对话流 - 与教学范式（例如行为主义者，建构主义者和连接主义者学习理论）一致。通过将AI功能与以人为中心的设计和道德保护措施相结合，这项研究为AI整合教育的实用模型提供了发展。未来的研究将通过现实世界实施来验证和完善系统。

Title: LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA

Authors: Houji Jin, Negin Ashrafi, Armin Abdollahi, Wei Liu, Jian Wang, Ganyu Gui, Maryam Pishgar, Huanghao Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00731
Pdf URL: https://arxiv.org/pdf/2509.00731
Copy Paste: [[2509.00731]] LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA(https://arxiv.org/abs/2509.00731)
Keywords: language model, llm, prompt
Abstract: The rapid growth of large language models (LLMs) has heightened the demand for accurate detection of AI-generated text, particularly in languages like Chinese, where subtle linguistic nuances pose significant challenges to current methods. In this study, we conduct a systematic comparison of encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/DeepSeek-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline using the publicly available dataset from the NLPCC 2025 Chinese AI-Generated Text Detection Task. Encoder models were fine-tuned using a novel prompt-based masked language modeling approach, while Qwen2.5-7B was adapted for classification with an instruction-format input and a lightweight classification head trained via LoRA. Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts (RoBERTa: 76.3% test accuracy; BERT: 79.3%). FastText demonstrates surprising lexical robustness (83.5% accuracy) yet lacks deeper semantic understanding. In contrast, the LoRA-adapted Qwen2.5-7B achieves 95.94% test accuracy with balanced precision-recall metrics, indicating superior generalization and resilience to dataset-specific artifacts. These findings underscore the efficacy of decoder-based LLMs with parameter-efficient fine-tuning for robust Chinese AI-generated text detection. Future work will explore next-generation Qwen3 models, distilled variants, and ensemble strategies to enhance cross-domain robustness further.
摘要：大语言模型（LLM）的快速增长增强了对AI生成的文本准确检测的需求，尤其是在诸如中国语言之类的语言中，在这种语言中，微妙的语言细微差别对当前方法构成了重大挑战。 In this study, we conduct a systematic comparison of encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/DeepSeek-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline using the publicly available dataset from the NLPCC 2025年中国AI生成的文本检测任务。使用一种新颖的基于及时的掩盖语言建模方法对编码器模型进行了微调，而QWEN2.5-7B进行了调整以使用指令 - 格式输入和通过Lora训练的轻量级分类头进行分类。实验表明，尽管编码模型几乎记住了训练数据，但它们在分配变化下遭受了显着的性能降解（Roberta：76.3％的测试准确性； BERT：79.3％）。 FastText表现出令人惊讶的词汇鲁棒性（精度为83.5％），但缺乏更深的语义理解。相比之下，洛拉适应的QWEN2.5-7B具有平衡的精确记录指标的测试精度为95.94％，表明对数据集特异性人工制品具有出色的概括和弹性。这些发现强调了基于解码器的LLM和参数有效的微调对鲁棒中国AI生成的文本检测的功效。未来的工作将探索下一代QWEN3模型，蒸馏变种和整体策略，以进一步增强跨域鲁棒性。

Title: Decomposing and Revising What Language Models Generate

Authors: Zhichao Yan, Jiaoyan Chen, Jiapu Wang, Xiaoli Li, Ru Li, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00765
Pdf URL: https://arxiv.org/pdf/2509.00765
Copy Paste: [[2509.00765]] Decomposing and Revising What Language Models Generate(https://arxiv.org/abs/2509.00765)
Keywords: language model, gpt, llm
Abstract: Attribution is crucial in question answering (QA) with Large Language Models (LLMs). SOTA question decomposition-based approaches use long form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts in this http URL approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (faithful context enhanced fact decomposition and evidence aggregation) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, such sub-facts will be revised accordingly. Finally, the evidence snippets are aggregated according to the original this http URL evaluation has been conducted with six datasets, with an additionally proposed new metric called $Attr_{auto-P}$ for evaluating the evidence precision. FIDES outperforms the SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini and Llama 70B series.
摘要：归因在有问题的答案（QA）中至关重要的是使用大语言模型（LLMS）.SOTA问题分解方法使用长形式答案来为检索相关文档生成问题。但是，产生的问题通常是无关紧要和不完整的，导致这种HTTP URL方法中的事实损失也无法汇总来自不同文档和段落的证据剪接。为了解决这些问题，我们提出了一个新的基于分解的框架，称为fides（\ textit {忠实的上下文增强了事实分解和证据汇总}）。 FIDE使用上下文增强的两阶段忠实分解方法将长形式的答案分解为子事实，然后猎犬将其用于检索相关的证据片段。如果检索到的证据片段与相关的子事实冲突，则将相应地修改此类子事实。最后，根据六个数据集进行了HTTP URL评估的原始汇总证据摘要，另外提出的新指标称为$ attr_ {auto-p} $，用于评估证据精度。通过GPT-3.5-Turbo，Gemini和Llama 70B系列，FIDE的表现平均超过14 \％。

Title: CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

Authors: Reem Abdel-Salam, Mary Adewunmi, Modinat A. Abayomi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.00806
Pdf URL: https://arxiv.org/pdf/2509.00806
Copy Paste: [[2509.00806]] CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA(https://arxiv.org/abs/2509.00806)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used for accurate question answering across various domains. However, rigorous evaluation of their complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges persist in generating strictly formatted outputs. Our findings highlight the gap between semantic understanding and exact answer evaluation in biomedical LLM applications, motivating further research in output control and post-processing strategies.
摘要：大型语言模型（LLM）越来越明显，可以在各个领域进行准确的问题回答。但是，在现实生物医学和医疗保健应用中部署之前，对其对复杂问题（QA）功能的绩效进行严格的评估至关重要。本文介绍了我们对生物综合IX共享任务Medhopqa轨道的方法，该任务的重点是涉及疾病，基因和化学物质的多跳生物医学问题。我们采用了利用Llama 3 8B的监督微调策略，并通过从Bioasq，Medquad和Trec等外部来源编译的精心策划的生物医学问题解答数据集增强了。探索了三个实验设置：合并的简短和长答案，仅短答案以及仅长时间答案进行微调。尽管我们的模型表现出强烈的领域理解，但实现概念级的准确度得分高达0.8，但它们的确切匹配（EM）得分仍然显着降低，尤其是在测试阶段。我们引入了一个两阶段的推理管道，以精确的短答案提取，以减轻冗长的速度并改善评估指标的一致性。尽管有部分改进，但挑战仍在产生严格格式的产出。我们的发现突出了生物医学LLM应用中语义理解和确切答案评估之间的差距，从而激发了进一步的输出控制和后处理策略的研究。
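
The gap the abstract describes between concept-level accuracy and Exact Match can be illustrated with a small sketch. The normalization below (lowercasing, stripping punctuation, collapsing whitespace) is an assumed, commonly used convention; the shared task's actual scorer may normalize differently, and `contains_answer` is a hypothetical stand-in for a looser concept-level check.

```python
import re
import string

def normalize(text):
    # Lowercase, drop ASCII punctuation, and collapse runs of whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    # Strict metric: the whole normalized prediction must equal the gold answer.
    return normalize(prediction) == normalize(gold)

def contains_answer(prediction, gold):
    # Looser check: the gold answer appears somewhere in the prediction.
    # This is a plain substring test, so it can also match inside longer words.
    return normalize(gold) in normalize(prediction)

verbose = "The gene involved is BRCA1."
print(exact_match(verbose, "BRCA1"), contains_answer(verbose, "BRCA1"))
print(exact_match("brca1", "BRCA1"))
```

A verbose but correct answer passes the looser check and fails Exact Match, which is why a short-answer extraction stage can matter for EM-based evaluation.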

Title: Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

Authors: Michelle Elizabeth, Alicja Kasicka, Natalia Krawczyk, Magalie Ochs, Gwénolé Lecorvé, Justyna Gromada, Lina M. Rojas-Barahona
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00841
Pdf URL: https://arxiv.org/pdf/2509.00841
Copy Paste: [[2509.00841]] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations(https://arxiv.org/abs/2509.00841)
Keywords: language model, prompt
Abstract: The growing number of generative AI-based dialogue systems has made their evaluation a crucial challenge. This paper presents our contribution to this important problem through the Dialogue System Technology Challenge (DSTC-12, Track 1), where we developed models to predict dialogue-level, dimension-specific scores. Given the constraint of using relatively small models (i.e., fewer than 13 billion parameters), our work follows two main strategies: employing Language Models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Our results show that while LM prompting achieves only modest correlations with human judgments, it still ranks second on the test set, outperformed only by the baseline. The regression and classification models, with significantly fewer parameters, demonstrate high correlation for some dimensions on the validation set. Although their performance decreases on the test set, it is important to note that the test set contains annotations with significantly different score ranges for some of the dimensions with respect to the train and validation sets.
摘要：越来越多的基于AI的对话系统使他们的评估成为至关重要的挑战。本文通过对话系统技术挑战（DSTC-12，TRACK 1）提出了我们对这一重要问题的贡献，我们开发了模型来预测对话级别，维度特定的分数。考虑到使用相对较小的模型（即少于130亿个参数）的限制，我们的工作遵循两种主要策略：通过提示采用语言模型（LMS）作为评估者，以及基于培训编码器的分类和回归模型。我们的结果表明，尽管LM提示仅与人类判断实现适度的相关性，但它在测试集中仍然排名第二，仅比基线优于基线。回归和分类模型的参数明显较少，证明了验证集上的某些维度的高相关性。尽管它们的性能在测试集上降低，但重要的是要注意，测试集包含针对火车和验证集的某些维度的分数范围明显不同的注释。

Title: Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings

Authors: Tengyu Pan, Zhichao Duan, Zhenyu Li, Bowen Dong, Ning Liu, Xiuxing Li, Jianyong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00842
Pdf URL: https://arxiv.org/pdf/2509.00842
Copy Paste: [[2509.00842]] Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings(https://arxiv.org/abs/2509.00842)
Keywords: language model, llm
Abstract: Text embedding models are essential for various natural language processing tasks, enabling the effective encoding of semantic information into dense vector representations. These models are typically optimized using triplets of (query, positive, negative) data pairs for contrastive learning, where the negative samples play a critical role in enhancing the model's ability to discern subtle semantic distinctions. In this work, we introduce a Multi-Granularity Hard-negative (MGH) synthesis framework that leverages large language models (LLMs) to generate diverse negative samples with varying levels of similarity with the query. This approach facilitates a coarse-to-fine curriculum learning strategy during supervised training, allowing the embedding model to progressively learn more nuanced semantic representations. Meanwhile, we propose an Anchor Token Aware (ATA) pooling method that assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving text embedding accuracy without increasing model complexity. Comprehensive experiments on the MTEB benchmark demonstrate that our methods achieve state-of-the-art performance, surpassing existing synthesis strategies both with synthetic data and when combined with public retrieval datasets.
摘要：文本嵌入模型对于各种自然语言处理任务至关重要，可以将语义信息有效地编码为密集的向量表示。这些模型通常使用（查询，正，负）数据对进行对比学习进行优化，其中负样本在增强模型辨别微妙的语义区分的能力方面起着关键作用。在这项工作中，我们引入了一个多粒性硬性（MGH）综合框架，该框架利用大型语言模型（LLMS）生成与查询相似程度不同的不同负面样本。这种方法促进了监督培训期间的粗到精细课程学习策略，从而允许嵌入模型逐步学习更多细微的语义表示。同时，我们提出了一种锚定令牌意识（ATA）合并方法，该方法基于在LLMS中观察到的聚合模式为锚定令牌分配更高的权重，从而提高了文本嵌入精度而不增加模型复杂性。对MTEB基准测试的全面实验表明，我们的方法实现最先进的性能，通过合成数据以及与公共检索数据集相结合，超过了现有的合成策略。
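
The pooling idea in the abstract above, giving anchor tokens a larger weight than other tokens, amounts to a weighted mean over token vectors. The sketch below shows only that arithmetic; how ATA actually identifies anchor tokens and sets their weights is not stated in the abstract, so the weights here are assumed values.

```python
def weighted_pool(token_vectors, weights):
    # Weighted mean of token vectors: sum_i w_i * h_i / sum_i w_i.
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors)) / total
        for d in range(dim)
    ]

# Three toy token embeddings of dimension 2 (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

mean_pooled = weighted_pool(tokens, [1.0, 1.0, 1.0])    # plain mean pooling
anchor_pooled = weighted_pool(tokens, [4.0, 1.0, 1.0])  # token 0 treated as an anchor
print(mean_pooled, anchor_pooled)
```

Uniform weights recover ordinary mean pooling, so the weighting changes the embedding without adding parameters to the model.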

Title: Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

Authors: Shaina Raza, Maximus Powers, Partha Pratim Saha, Mahveen Raza, Rizwan Qureshi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00849
Pdf URL: https://arxiv.org/pdf/2509.00849
Copy Paste: [[2509.00849]] Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations(https://arxiv.org/abs/2509.00849)
Keywords: prompt
Abstract: Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models: closed-source (DALLE 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility at this https URL.
摘要：文本对图像（TTI）模型是强大的创意工具，但风险扩大有害的社会偏见。我们将代表性的社会偏见评估作为图像策划和评估任务，并引入了跨越五个社会显着角色的职业刻画的试点基准（首席执行官，护士，软件工程师，老师，运动员）。使用五个最先进的型号：封闭源（Dalle 3，Gemini Imagen 4.0）和开源源（Flux.1-DEV，稳定的扩散XL Turbo，GROK-2图像），我们将中性基线提示与旨在鼓励范围多样性的公平控制的提示进行比较。所有输出均以性别（男性，女性）和种族（亚洲，黑色，白色）的注释，可以实现结构化分布分析。结果表明，提示可以实质上改变人口统计学表示，但是具有高度模型的效果：有些系统有效地多样化，而另一些系统则过度纠正了不切实际的均匀性，而有些系统则显示出很少的响应能力。这些发现突出了提示作为公平干预措施的承诺和局限性，强调了对互补模型级策略的需求。我们发布了所有代码和数据，以获得透明度和可重复性，此HTTPS URL。

Title: Exploring and Mitigating Fawning Hallucinations in Large Language Models

Authors: Zixuan Shangguan, Yanjie Dong, Lanjun Wang, Xiaoyi Fan, Victor C. M. Leung, Xiping Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00869
Pdf URL: https://arxiv.org/pdf/2509.00869
Copy Paste: [[2509.00869]] Exploring and Mitigating Fawning Hallucinations in Large Language Models(https://arxiv.org/abs/2509.00869)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have demonstrated exceptional proficiency in language understanding. However, when LLMs align their outputs with deceptive and/or misleading prompts, the generated responses could deviate from factual information. Such observations are known as fawning hallucinations, where the model prioritizes alignment with the input's implied perspective over accuracy and truthfulness. In this work, we analyze fawning hallucinations in various natural language processing tasks and tailor the contrastive decoding method to mitigate them. Specifically, we design two paradigms to generate corresponding deceptive and/or misleading inputs that consistently induce fawning hallucinations. Then, we propose collaborative contrastive decoding (CCD) to handle fawning hallucinations across different tasks in LLMs. By contrasting the deviation in output distribution between induced and transformed neutral inputs, the proposed CCD can reduce reliance on deceptive and/or misleading information without requiring additional training. Extensive experiments demonstrate that the proposed CCD can effectively mitigate fawning hallucinations and improve the factuality of the generated responses over various tasks.
摘要：大型语言模型（LLMS）表现出在语言理解方面具有出色的水平。但是，当LLMS与欺骗性和/或误导性提示对齐其输出时，生成的响应可能会偏离事实上的信息。这种观察被称为幻觉，该模型优先考虑与输入的隐含观点相对于准确性和真实性的一致性。在这项工作中，我们分析了各种自然语言处理任务中的幻觉，并为缓解效果缓解的所谓对比解码方法量身定制。具体而言，我们设计了两个范式来生成相应的欺骗性和/或误导性输入，以持续出现幻觉感应。然后，我们提出了协作对比解码（CCD），以处理LLM中不同任务的幻觉。通过将诱导和转化的中性输入之间的输出分布的偏差进行对比，提出的CCD可以减少对欺骗性和/或误导信息的依赖，而无需进行额外的培训。广泛的实验表明，提出的CCD可以有效地减轻幻觉，并改善各种任务上产生的响应的事实。
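
The abstract above describes contrasting the output distributions obtained from an induced (misleading) input and a transformed neutral input. The sketch below uses a generic contrastive-decoding score, (1 + alpha) * log p_neutral - alpha * log p_induced, as an assumed stand-in; the abstract does not give the exact CCD formula, and the token probabilities are toy values.

```python
import math

def contrastive_scores(neutral_probs, induced_probs, alpha=1.0):
    # Generic contrastive decoding: boost tokens the neutral input supports
    # and penalize tokens favored mainly under the misleading input.
    return {
        tok: (1.0 + alpha) * math.log(neutral_probs[tok])
             - alpha * math.log(induced_probs[tok])
        for tok in neutral_probs
    }

# Toy next-token distributions for the same question (assumed values).
neutral = {"Paris": 0.55, "Lyon": 0.45}   # neutral rewrite of the prompt
induced = {"Paris": 0.20, "Lyon": 0.80}   # prompt that asserts a wrong premise

greedy_induced = max(induced, key=induced.get)
scores = contrastive_scores(neutral, induced)
contrastive_choice = max(scores, key=scores.get)
print(greedy_induced, contrastive_choice)
```

Because the adjustment is applied to output distributions at decoding time, a scheme of this kind needs no additional training, which matches the claim in the abstract.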

Title: EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

Authors: Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Can Yi, Changhua Meng, Yuchen Zhou, Yongliang Shen, Shuai Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00877
Pdf URL: https://arxiv.org/pdf/2509.00877
Copy Paste: [[2509.00877]] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes(https://arxiv.org/abs/2509.00877)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) empowered with retrieval mechanisms have achieved strong progress in open-domain question answering (QA). Yet, the conventional retrieve-then-answer paradigm often suffers from two key limitations: (1) low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and (2) error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. To address these challenges, we present EviNote-RAG, an agentic RAG framework that introduces a structured retrieve-note-answer pipeline. Instead of directly reasoning over raw retrievals, the model is trained to compose Supportive-Evidence Notes (SENs), concise, human-like notes that preserve only answer-relevant information, highlight uncertainty, and explicitly state when no useful evidence exists. This distillation process is further reinforced by the Evidence Quality Reward (EQR), an entailment-based signal that evaluates whether SENs logically support the final answer. Together, SENs and EQR guide the model toward faithful and robust reasoning, while reducing the impact of noise. Experiments on in-domain and out-of-domain QA benchmarks show that EviNote-RAG consistently outperforms strong baselines in accuracy, generalization, and training stability. In particular, it achieves state-of-the-art results while enhancing robustness and efficiency, yielding relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256) via denser rewards and reduced verbosity.
摘要：通过检索机制授权的大型语言模型（LLM）在开放域问题答案（QA）方面取得了强大的进步。然而，常规检索（然后是 - - 然后是 - 范式通常都有两个关键局限性：（1）在检索的证据中低信噪比比率低，在涉及不完全或嘈杂的通道时，将有用的信息埋在无关的含量下，以及（2）在多跳的推理中积累了误差。为了应对这些挑战，我们提出了Evinote-rag，这是一个介绍结构化检索的代理抹布框架 - 宣传 - 通道。该模型没有直接对原始检索进行推理，而是经过训练，可以构成支持性的证据（SENS），简洁，类似于人类的音符，这些音符仅保留与答案相关的信息，突出不确定性，并明确说明当不存在有用的证据时说明。证据质量奖励（eqr）进一步加强了这种蒸馏过程，这是一个基于基于的基于基于的信号，该信号评估了逻辑上是否支持最终答案。 Sens和EQR一起将模型指导到忠实而强大的推理，同时减少噪声的影响。关于域内和域外质量障碍基准测试的实验表明，Evinote-rag在准确性，概括和训练稳定性方面始终优于强大的基准。特别是，它可以在提高鲁棒性和效率的同时获得最先进的结果，从而在hotpotqa（+0.093）上获得20 \％的相对F1增益，在班boogle（+0.151）上获得40 \％的相对F1，而2wiki（+0.256）的相对F1获得（+0.151）的相对F1 \％\％。
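
The abstract above reports each F1 gain both as a relative percentage and as an absolute difference, so the implied baseline can be estimated as absolute gain divided by relative gain. The sketch below performs only that back-of-envelope arithmetic; the resulting baselines are approximations derived from rounded figures, not numbers reported by the authors.

```python
def implied_baseline(absolute_gain, relative_gain):
    # relative_gain = absolute_gain / baseline  =>  baseline = absolute_gain / relative_gain
    return absolute_gain / relative_gain

# (absolute F1 gain, relative F1 gain) as quoted in the abstract.
reported = {
    "HotpotQA": (0.093, 0.20),
    "Bamboogle": (0.151, 0.40),
    "2Wiki": (0.256, 0.91),
}

for name, (abs_gain, rel_gain) in reported.items():
    base = implied_baseline(abs_gain, rel_gain)
    # Both values are rough estimates because the inputs are rounded.
    print(f"{name}: implied baseline ~ {base:.3f}, implied new F1 ~ {base + abs_gain:.3f}")
```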

Title: SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset

Authors: Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel, Florin Pop
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00893
Pdf URL: https://arxiv.org/pdf/2509.00893
Copy Paste: [[2509.00893]] SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset(https://arxiv.org/abs/2509.00893)
Keywords: language model, llm
Abstract: Satire, irony, and sarcasm are techniques typically used to express humor and critique, rather than deceive; however, they can occasionally be mistaken for factual reporting, akin to fake news. These techniques can be applied at a more granular level, allowing satirical information to be incorporated into news articles. In this paper, we introduce the first sentence-level dataset for Romanian satire detection for news articles, called SeLeRoSa. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies. With the rise and recent progress of large language models (LLMs) in the natural language processing literature, LLMs have demonstrated enhanced capabilities to tackle various tasks in zero-shot settings. We evaluate multiple baseline models based on LLMs in both zero-shot and fine-tuning settings, as well as baseline transformer-based models. Our findings reveal the current limitations of these models in the sentence-level satire detection task, paving the way for new research directions.
摘要：讽刺，讽刺和讽刺是通常用来表达幽默和批评的技术，而不是欺骗。但是，他们有时可能会被误认为是事实报道，类似于假新闻。这些技术可以在更详细的水平上应用，从而可以将讽刺信息纳入新闻文章中。在本文中，我们介绍了第一个句子级数据集，用于新闻文章的罗马尼亚讽刺检测，即Selerosa。该数据集包含13,873个手动注释的句子，这些句子涵盖了各个领域，包括社会问题，IT，科学和电影。随着自然语言处理文献中大型语言模型（LLM）的上升和最新进展，LLMS已表现出增强的能力，可以在零局部设置中解决各种任务。我们在零击和微调设置以及基线变压器模型中基于LLM的多个基线模型评估了多个基线模型。我们的发现揭示了这些模型在句子级讽刺检测任务中的当前局限性，为新的研究方向铺平了道路。

Title: Supervised In-Context Fine-Tuning for Generative Sequence Labeling

Authors: David Dukić, Goran Glavaš, Jan Šnajder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00921
Pdf URL: https://arxiv.org/pdf/2509.00921
Copy Paste: [[2509.00921]] Supervised In-Context Fine-Tuning for Generative Sequence Labeling(https://arxiv.org/abs/2509.00921)
Keywords: llm, long context
Abstract: Sequence labeling (SL) tasks, where labels are assigned to tokens, are abundant in NLP (e.g., named entity recognition and aspect-based sentiment analysis). Owing to the intuition that they require bidirectional context, SL tasks are commonly tackled with encoder-only models. Recent work also shows that removing the causal mask in fine-tuning enables decoder-based LLMs to become effective token classifiers. Less work, however, has focused on (supervised) generative SL, a more natural setting for causal LLMs. Due to their rapid scaling, causal LLMs applied to SL are expected to outperform encoders, whose own development has stagnated. In this work, we propose supervised in-context fine-tuning (SIFT) for generative SL. SIFT casts SL tasks as constrained response generation, natural to LLMs, combining (1) in-context learning (ICL) from demonstrations with (2) supervised fine-tuning. SIFT considerably outperforms both ICL and decoder-as-encoder fine-tuning baselines on a range of standard SL tasks. We further find that although long context hinders the performance of generative SL in both ICL and SIFT, this deficiency can be mitigated by removing the instruction, as instructions are shown to be largely unnecessary for achieving strong SL performance with SIFT. Our findings highlight strengths and limitations of SL with LLMs, underscoring the importance of a response-based generative task formulation for effective SL performance.
摘要：序列标记（SL）任务（将标签分配给令牌）在NLP中很丰富（例如，命名实体识别和基于方面的情感分析）。由于他们需要双向上下文的直觉，因此SL任务通常是通过仅编码模型来解决的。最近的工作还表明，在微调中删除因果面具，使基于解码器的LLMS成为有效的令牌分类器。但是，更少的工作集中在（有监督的）生成型SL上，这是因果LLM的更自然的环境。由于它们的快速缩放，应用于SL的因果LLM有望超越自己的发育停滞不前的编码器。在这项工作中，我们建议对生成sl的监督中自我调整（SIFT）。 SIFT将SL任务施放为受约束的响应产生，自然到LLM，并将（1）中文学习（ICL）与（2）受监督的微调结合在一起。在一系列标准SL任务上，筛分大大优于ICL和解码器编码基线。我们进一步发现，尽管长篇小说阻碍了ICL和SIFT中生成SL的性能，但可以通过删除指令来缓解这种缺陷，因为证明说明在很大程度上是不必要的，这对于通过SIFT实现了强大的SL性能。我们的发现突出了SL使用LLM的优势和局限性，强调了基于响应的生成任务配方对有效SL性能的重要性。

Title: MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework

Authors: Md Shahidul Salim, Lian Fu, Arav Adikesh Ramakrishnan, Zonghai Yao, Hong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00934
Pdf URL: https://arxiv.org/pdf/2509.00934
Copy Paste: [[2509.00934]] MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework(https://arxiv.org/abs/2509.00934)
Keywords: language model, gpt, llm, prompt
Abstract: We present MedCOD (Medical Chain-of-Dictionary), a hybrid framework designed to improve English-to-Spanish medical translation by integrating domain-specific structured knowledge into large language models (LLMs). MedCOD integrates domain-specific knowledge from both the Unified Medical Language System (UMLS) and the LLM-as-Knowledge-Base (LLM-KB) paradigm to enhance structured prompting and fine-tuning. We constructed a parallel corpus of 2,999 English-Spanish MedlinePlus articles and a 100-sentence test set annotated with structured medical contexts. Four open-source LLMs (Phi-4, Qwen2.5-14B, Qwen2.5-7B, and LLaMA-3.1-8B) were evaluated using structured prompts that incorporated multilingual variants, medical synonyms, and UMLS-derived definitions, combined with LoRA-based fine-tuning. Experimental results demonstrate that MedCOD significantly improves translation quality across all models. For example, Phi-4 with MedCOD and fine-tuning achieved BLEU 44.23, chrF++ 28.91, and COMET 0.863, surpassing strong baseline models like GPT-4o and GPT-4o-mini. Ablation studies confirm that both MedCOD prompting and model adaptation independently contribute to performance gains, with their combination yielding the highest improvements. These findings highlight the potential of structured knowledge integration to enhance LLMs for medical translation tasks.
摘要：我们提出了MedCod（医学连锁店），这是一个混合框架，旨在通过将特定领域的结构化知识集成到大语言模型（LLMS）中，以改善英语对西班牙的医学翻译。 MedCod从统一的医学语言系统（UMLS）和LLM-AS知识基础（LLM-KB）范式中整合了特定领域的知识，以增强结构化的提示和微调。我们构建了一个平行的语料库，该语料库为2,999个英语 - 西班牙杂志的文章，并用结构化的医疗环境注释了100个句子测试集。使用结构化提示提示评估了四个开源LLMS（PHI-4，QWEN2.5-14B，QWEN2.5-7B和LLAMA-3.1-8B），这些提示将多种语言变体，医学同义词和UMLS衍生定义结合在一起，并与基于Lora的Fine-Fine-Tuning结合使用。实验结果表明，MEDCOD可显着提高所有模型的翻译质量。例如，具有MEDCOD和微调的PHI-4实现了BLEU 44.23，CHRF ++ 28.91和0.863彗星，超过了强大的基线模型，例如GPT-4O和GPT-4O-Mini。消融研究证实，MEDCOD提示和模型适应性都独立地有助于性能提高，其组合均可取得最大的改善。这些发现突出了结构化知识集成的潜力，以增强医疗翻译任务的LLM。

Title: Structure and Destructure: Dual Forces in the Making of Knowledge Engines

Authors: Yihong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.00949
Pdf URL: https://arxiv.org/pdf/2509.00949
Copy Paste: [[2509.00949]] Structure and Destructure: Dual Forces in the Making of Knowledge Engines(https://arxiv.org/abs/2509.00949)
Keywords: language model
Abstract: The making of knowledge engines in natural language processing has been shaped by two seemingly distinct paradigms: one grounded in structure, the other driven by massively available unstructured data. The structured paradigm leverages predefined symbolic interactions, such as knowledge graphs, as priors and designs models to capture them. In contrast, the unstructured paradigm centers on scaling transformer architectures with increasingly vast data and model sizes, as seen in modern large language models. Despite their divergence, this thesis seeks to establish conceptual connections bridging these paradigms. Two complementary forces, structure and destructure, emerge across both paradigms: structure organizes seen symbolic interactions, while destructure, through periodic embedding resets, improves model plasticity and generalization to unseen scenarios. These connections form a new recipe for developing general knowledge engines that can support transparent, controllable, and adaptable intelligent systems.
摘要：自然语言处理中的知识引擎的制造是由两个看似独特的范式塑造的：一个以结构为基础，另一个是由大量可用的非结构化数据驱动的。结构化范式利用了预定义的符号相互作用，例如知识图，例如先验和设计模型来捕获它们。相比之下，正如现代大型语言模型所看到的那样，非结构化的范式集中于具有越来越巨大的数据和模型大小的缩放变压器体系结构。尽管有分歧，但本文试图建立桥接这些范式的概念联系。在两个范式上出现了两个互补力，结构和破坏：结构组织看到了象征性的相互作用，而通过定期嵌入重置的破坏则改善了模型的可塑性和概括，从而使场景变得不见了。这些连接构成了开发可以支持透明，可控制和适应能力智能系统的通用知识引擎的新食谱。

Title: RPRO:Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Authors: Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Feng Liu, Fang-Ming Hung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.00974
Pdf URL: https://arxiv.org/pdf/2509.00974
Copy Paste: [[2509.00974]] RPRO:Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning(https://arxiv.org/abs/2509.00974)
Keywords: language model, llm, chain-of-thought
Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
摘要：医学问题回答需要将域知识与逻辑推理整合在一起的高级推理。但是，现有的大型语言模型（LLM）通常会产生缺乏事实准确性和临床可靠性的推理链。我们提出了排名的偏好增强优化（RPRO），这是一个新颖的框架，将强化学习与偏好驱动的推理精炼相结合，以增强临床思想链（COT）性能。 RPRO通过采用任务自适应推理模板和概率评估机制将自己与已建立的临床工作流程一致的概率评估机制区分开来，同时自动识别和纠正低质量的推理链。与传统的成对偏好方法不同，RPRO基于Bradley-Terry模型引入了组排名优化，并结合了KL-Divergence正则化以进行稳定训练。 PubMedQA和MEDQA-USMLE的实验表现出对强基础的一致性改进。值得注意的是，我们的1.1b参数模型的表现优于更大的7B-13B模型，包括医学专业的变体。这些发现表明，将偏好优化与质量驱动的改进相结合，为建立更可靠的临床扎根医学LLM提供了可扩展有效的方法。
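
The groupwise ranking objective mentioned in the RPRO abstract can be illustrated with a minimal sketch. The abstract only states that the ranking is "based on the Bradley-Terry model" with KL-divergence regularization; the Plackett-Luce form used here, the function names, and the `beta` weight are assumptions made for illustration, not the authors' implementation.

```python
import math

def groupwise_ranking_loss(scores_in_rank_order):
    """Plackett-Luce negative log-likelihood for one ranked group.

    scores_in_rank_order[0] is the model score of the best-ranked response.
    For a group of size 2 this reduces to the Bradley-Terry pairwise loss.
    """
    loss = 0.0
    for i in range(len(scores_in_rank_order) - 1):
        tail = scores_in_rank_order[i:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_norm - scores_in_rank_order[i]
    return loss

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(scores, policy_probs, ref_probs, beta=0.1):
    """Ranking loss plus a KL penalty toward a reference policy (beta is hypothetical)."""
    return groupwise_ranking_loss(scores) + beta * kl_divergence(policy_probs, ref_probs)
```

Scores that already agree with the ranking give a lower loss than scores in the reverse order, which is the behavior a ranking objective of this kind is meant to reward.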

Title: We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

Authors: Nikta Gohari Sadr, Sahar Heidariasl, Karine Megerdoomian, Laleh Seyyed-Kalantari, Ali Emami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01035
Pdf URL: https://arxiv.org/pdf/2509.01035
Copy Paste: [[2509.01035]] We Politely Insist: Your LLM Must Learn the Persian Art of Taarof(https://arxiv.org/abs/2509.01035)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies between interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve 21.8% and 42.3% improvement in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) forms baselines in varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.
摘要：大型语言模型（LLMS）难以浏览文化特定的沟通规范，从而限制了它们在全球环境中的有效性。我们专注于波斯·塔罗夫（Persian Taarof），这是伊朗互动中的社会规范，这是一种仪式礼貌的复杂体系，强调尊重，谦虚和间接性，但仍缺乏现有的文化基准。我们介绍了Taarofbench，这是评估LLM对Taarof的第一个基准，包括450个角色扮演场景，涵盖了12个常见的社会互动主题，并由母语人士验证。我们对五个边境LLM的评估揭示了文化能力的巨大差距，当Taarof在文化上适当时，准确性率低于母语者40-48％。互动主题之间的性能各不相同，随着波斯语提示提高并表现出基于性别的不对称性。我们还表明，标准指标评估“礼貌”的回应通常违反了Taarof规范，表明西方礼貌框架的局限性。通过监督的微调和直接偏好优化，我们在文化期望的模型一致性方面取得了21.8％和42.3％的提高。我们对33名参与者（11名本地波斯语，11个遗产和11位非伊朗人说话者）的人类研究以不同程度的熟悉波斯规范形成基准。这项工作为发展多样化和文化意识的LLM的基础奠定了基础，从而为更好地浏览复杂的社交互动的应用程序提供了基础。
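
For readers unfamiliar with Direct Preference Optimization, which the TaarofBench abstract names as one of its alignment methods, the standard per-pair DPO loss can be sketched as below. The function name, argument names, and `beta` value are illustrative assumptions; the abstract does not describe the training configuration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the probability of the preferred (here, taarof-appropriate) response relative to the reference model and lowers that of the rejected one.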

Title: A Dynamic Fusion Model for Consistent Crisis Response

Authors: Xiaoying Song, Anirban Saha Anik, Eduardo Blanco, Vanessa Frias-Martinez, Lingzi Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01053
Pdf URL: https://arxiv.org/pdf/2509.01053
Copy Paste: [[2509.01053]] A Dynamic Fusion Model for Consistent Crisis Response(https://arxiv.org/abs/2509.01053)
Keywords: language model
Abstract: In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.
摘要：为了应对与受危机影响的人群有效沟通的迫切需要，已经提出了由语言模型驱动的自动反应，以协助危机沟通。一个关键但经常被忽视的因素是响应方式的一致性，这可能会影响受影响者在响应者中的信任。尽管它很重要，但很少有研究探讨了在生成的响应之间维持风格一致性的方法。为了解决这一差距，我们提出了一个新型指标，以评估样式一致性，并引入基于融合的生成方法，该方法基于该指标。我们的方法采用了两个阶段的过程：它首先评估候选响应的样式，然后通过融合过程在实例级别进行优化和集成。这可以产生高质量的响应，同时显着减少实例之间的风格差异。多个数据集的实验结果表明，我们的方法在响应质量和风格均匀性方面始终优于基准。

Title: Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

Authors: Xiaoying Song, Anirban Saha Anik, Dibakar Barua, Pengcheng Luo, Junhua Ding, Lingzi Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01058
Pdf URL: https://arxiv.org/pdf/2509.01058
Copy Paste: [[2509.01058]] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL(https://arxiv.org/abs/2509.01058)
Keywords: retrieval-augmented generation
Abstract: Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.
摘要：在网上传播的健康错误信息对公共卫生构成了重大威胁。研究人员探索了自动生成反语作为缓解策略的方法的方法。现有方法通常会产生统一的反应，而忽略了受众的健康素养水平可能会影响反语的可及性和有效性。我们提出了使用增强学习（RL）的检索授权生成（RAG）的受控识别框架，以生成适合不同健康素养水平的量身定制的反语。特别是，我们检索了与特定的健康素养水平保持一致的知识，从而实现了可访问和事实信息以支持发电。我们设计了一个奖励功能，其中包含主观用户的偏好和基于客观可读性的奖励，以优化针对目标健康素养水平的反语。实验结果表明，通过生成更容易访问和用户偏爱的counterspeech，受控范围的表现优于基线。这项研究通过改善对健康错误信息的可访问性和理解，有助于更公平和有影响力的公共卫生沟通。

Title: Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

Authors: Abdessalam Bouchekif, Samer Rashwani, Heba Sbahi, Shahd Gaben, Mutez Al-Khatib, Mohammed Ghaly
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01081
Pdf URL: https://arxiv.org/pdf/2509.01081
Copy Paste: [[2509.01081]] Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation(https://arxiv.org/abs/2509.01081)
Keywords: language model, llm
Abstract: This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, known as 'ilm al-mawarith. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test models' ability to understand the inheritance context and compute the distribution of shares prescribed by Islamic jurisprudence. The results reveal a significant performance gap: o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning. Code: this https URL
摘要：本文评估了伊斯兰继承法中大语言模型的知识和推理能力，称为“ ilm al-mawarith”。我们使用1,000个多项选择问题的基准评估了七个LLM的性能，涵盖了各种继承方案，旨在测试模型的能力，以了解伊斯兰法学规定的股份和计算股票的分布。结果表明，性能差距很大：O3和Gemini 2.5的精度超过90％，而Allam，Fanar，Llama和Mistral得分低于50％。这些差异反映了推理能力和领域适应性的重要差异。我们进行了详细的错误分析，以识别模型之间的经常性故障模式，包括对继承情景的误解，法律规则的不正确应用以及域知识不足。我们的发现凸显了处理结构化的法律推理的局限性，并提出了改善伊斯兰法律推理绩效的指示。代码：此HTTPS URL

Title: Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation

Authors: Jinwen Chen, Hainan Zhang, Liang Pang, Yongxin Tong, Haibo Zhou, Yuan Zhan, Wei Lin, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01088
Pdf URL: https://arxiv.org/pdf/2509.01088
Copy Paste: [[2509.01088]] Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation(https://arxiv.org/abs/2509.01088)
Keywords: llm, retrieval augmented generation
Abstract: The current RAG system requires uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data, lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multiple documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure. Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG's hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.
摘要：当前的破布系统需要将明文文档上传到云中，从而冒着私人数据泄漏的风险。参数抹布（Prag）通过将文档作为LORA在LLMS中的LORA编码来解决此问题，从而在不暴露原始内容的情况下启用推理。但是，它仍然面临两个问题：（1）布拉格要求为每个单独的文档合成QA对和微调LLM，以创建其相应的LORA，从而导致不可接受的推理潜伏期。（2）布拉格的性能仅依赖于综合质量检查数据，缺乏与标准抹布的内部对齐，从而导致对分布（OOD）输入的概括不佳。因此，在保持破布级别的同时实现高效率参数化是对保护隐私推理的关键挑战。在本文中，我们提出了DistilledPrag，这是一种可概括的知识缩放参数抹布模型，该模型与文档结构和参数激活中的标准抹布对齐。我们首先将QA对从单个和多材中合成QA对，以增强跨文档推理。然后，我们使用特殊令牌掩盖了明文文档，并通过参数生成器将其转换为Lora，并维护标准的抹布文档结构。最后，在综合质量检查数据的指导下，我们训练参数生成器以匹配标准抹布的隐藏状态和输出逻辑，从而无需原始文档即可实现抹布式的推理。四个质量检查数据集的实验表明，蒸馏器的表现优于准确性，并且可以很好地概括OOD数据。

Title: REFRAG: Rethinking RAG based Decoding

Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.01092
Pdf URL: https://arxiv.org/pdf/2509.01092
Copy Paste: [[2509.01092]] REFRAG: Rethinking RAG based Decoding(https://arxiv.org/abs/2509.01092)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.
摘要：大型语言模型（LLMS）在利用广泛的外部知识来增强多转弯和代理应用的响应（例如检索功能增强的生成（RAG））方面具有出色的功能。但是，处理长篇小说输入会引入大量的系统延迟，并要求对键值缓存进行大量内存，从而减少吞吐量和知识富集和系统效率之间的基本权衡。虽然最大程度地减少长篇小说输入的延迟是LLM的主要目标，但我们认为RAG需要专门考虑。在抹布中，LLM上下文的大部分都由从检索中的串联段落组成，只有一个与查询直接相关的小子集。这些段落通常由于多样性或重新排列过程的重复数据删除而表现出较低的语义相似性，从而导致与标准LLM生成任务不同的障碍物注意力模式。基于此观察，我们认为在解码过程中的大多数计算是不必要的，并且可以消除对性能的最小影响。为此，我们提出了Refrag，这是一个有效的解码框架，可以压缩，感官和扩展以改善抹布应用程序的延迟。通过利用稀疏性结构，我们证明了30.85的时间第一加速度（以前的工作改善3.75）而不会损失困惑。此外，我们针对大上下文的优化框架使Refrag能够将LLM的上下文大小扩展到16。我们提供了跨不同长篇小说任务的重新验证的严格验证，包括抹布，多转交换对话和长文档摘要，跨越了广泛的数据集。实验结果证实，与各种环境尺寸的Llama模型和其他最先进的基线相比，REDRAG可以提供大量加速，而准确性没有损失。

Title: Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

Authors: Yulong Wu, Viktor Schlegel, Riza Batista-Navarro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01093
Pdf URL: https://arxiv.org/pdf/2509.01093
Copy Paste: [[2509.01093]] Natural Context Drift Undermines the Natural Language Understanding of Large Language Models(https://arxiv.org/abs/2509.01093)
Keywords: language model, llm
Abstract: How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average model accuracy on BoolQ drops by over 30% from the highest to lowest similarity bins, with slopes exceeding 70 across several LLMs. These findings suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.
摘要：上下文段落的自然演变如何影响生成大语言模型（LLM）中的问题回答？为了进行调查，我们提出了一个框架，用于策划自然发展，人文编辑的阅读段落的变体，这些变体从当代QA基准测试和分析一系列语义相似性分数中LLM性能分析，从而量化了每个变体与在预交前所见的内容相结合的程度。使用此框架，我们通过公开培训数据评估了六个QA数据集和八个LLM。我们的实验表明，LLM性能下降，因为阅读段落自然与预审议期间遇到的版本不同，即使是在推理时问题和所有必要信息仍然存在的所有必要信息。例如，Boolq的平均模型准确性从最高到最低的相似性垃圾箱下降了30％以上，几个LLM的斜率超过70。这些发现表明，自然文本进化对LLM的语言理解能力构成了重大挑战。
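
The "slopes" reported in the abstract above relate accuracy to semantic similarity. The abstract does not say how they are computed; the sketch below assumes an ordinary least-squares fit of accuracy (in percentage points) against the similarity score of each bin, and the function name and the numbers in the usage line are synthetic illustrations, not results from the paper.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic example: accuracy rising from 10 to 80 points as similarity
# goes from 0.0 to 1.0 corresponds to a slope of 70 points per unit similarity.
example_slope = ols_slope([0.0, 0.5, 1.0], [10.0, 45.0, 80.0])
```

Under this reading, a slope above 70 means that a passage drifting one full unit of similarity away from its pretraining version would cost more than 70 accuracy points.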

Title: Dream-Coder 7B: An Open Diffusion Language Model for Code

Authors: Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, Lingpeng Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01142
Pdf URL: https://arxiv.org/pdf/2509.01142
Copy Paste: [[2509.01142]] Dream-Coder 7B: An Open Diffusion Language Model for Code(https://arxiv.org/abs/2509.01142)
Keywords: language model, prompt
Abstract: We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits emergent any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left-to-right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task: sketch-first generation for complex algorithms, left-to-right generation for straightforward completions, and interleaved reasoning generation for code understanding tasks. We adapt a pretrained AR checkpoint to a discrete diffusion framework with a continuous-time weighted cross-entropy objective. Our post-training recipe comprises (i) supervised fine-tuning, where we mitigate padding pathologies via random truncation and a padding penalty to improve sample efficiency and stabilize generation; and (ii) reinforcement learning with verifiable rewards over a curated high-quality prompt set drawn from open-source datasets, using a tailored reinforcement learning recipe for diffusion language models. The resulting Dream-Coder 7B Instruct attains 21.4% pass@1 on LiveCodeBench (2410-2505) and demonstrates competitive performance on HumanEval, MBPP, BigCodeBench, and CRUXEval. We release Dream-Coder-7B and Dream-Coder-7B-Instruct checkpoints, training recipes, preprocessing pipelines, and inference code to facilitate reproducibility and further research.
摘要：我们提出Dream-Coder 7B，这是代码生成的开源离散扩散语言模型，展示了任何订单生成功能。与严格从左到右解码的传统自回旋（AR）模型不同，Dream-Coder 7b根据编码任务自适应地确定其解码策略：复杂算法的素描至上一代，从左右的生成直接完成，直接完成，以及与代码理解任务的交错推理。我们将经过验证的AR检查点调整为具有连续的加权交叉透镜物镜的离散扩散框架。我们的训练后食谱包括（i）监督的微调，我们通过随机截断和填充惩罚来减轻填充病理，以提高样品效率并稳定产生；（ii）使用定制的扩散语言模型的量身定制的加固学习配方，对从开源数据集中绘制的精心策划的高质量提示设置进行了可验证的奖励。由此产生的Dream-Coder 7b指示在LiveCodeBench（2410--2505）上获得21.4 \％PASS@1，并在Humaneval，MBPP，BigCodeBench和Cruxeval上展示了竞争性能。我们发布Dream-Coder-7b和Dream-Coder-7b-r-Insustruct检查点，培训食谱，预处理管道和推理代码，以促进可重复性和进一步的研究。
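
The pass@1 figure quoted above is an instance of the pass@k metric. The abstract does not state how it was estimated; a common choice is the unbiased estimator introduced with HumanEval, sketched here under that assumption (`n` samples per problem, `c` of which pass the tests).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a problem, c: samples that pass, k: budget.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems to give the benchmark score.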

Title: Zero-shot Cross-lingual NER via Mitigating Language Difference: An Entity-aligned Translation Perspective

Authors: Zhihao Zhang, Sophia Yat Mei Lee, Dong Zhang, Shoushan Li, Guodong Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01147
Pdf URL: https://arxiv.org/pdf/2509.01147
Copy Paste: [[2509.01147]] Zero-shot Cross-lingual NER via Mitigating Language Difference: An Entity-aligned Translation Perspective(https://arxiv.org/abs/2509.01147)
Keywords: language model, llm
Abstract: Cross-lingual Named Entity Recognition (CL-NER) aims to transfer knowledge from high-resource languages to low-resource languages. However, existing zero-shot CL-NER (ZCL-NER) approaches primarily focus on Latin script language (LSL), where shared linguistic features facilitate effective knowledge transfer. In contrast, for non-Latin script language (NSL), such as Chinese and Japanese, performance often degrades due to deep structural differences. To address these challenges, we propose an entity-aligned translation (EAT) approach. Leveraging large language models (LLMs), EAT employs a dual-translation strategy to align entities between NSL and English. In addition, we fine-tune LLMs using multilingual Wikipedia data to enhance the entity alignment from source to target languages.
摘要：跨语言命名实体识别（CL-NER）旨在将知识从高资源语言转移到低资源语言。但是，现有的零击Cl-ner（ZCL-NER）方法主要集中在拉丁文脚本语言（LSL）上，其中共同的语言特征有助于有效的知识转移。相反，对于非拉丁文脚本语言（NSL），例如中文和日语，由于具有深厚的结构差异，性能通常会降低。为了应对这些挑战，我们提出了一种与实体一致的翻译（EAT）方法。利用大型语言模型（LLMS），EAT采用双重翻译策略来使NSL和英语之间的实体保持一致。此外，我们使用多语言Wikipedia数据微调LLM，以增强从源到目标语言的实体对齐。

Title: Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning

Authors: Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01166
Pdf URL: https://arxiv.org/pdf/2509.01166
Copy Paste: [[2509.01166]] Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning(https://arxiv.org/abs/2509.01166)
Keywords: language model, llm
Abstract: Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicated work and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task, with improvements ranging from 8.7% to 29.8%.
摘要：知识图完成（KGC）旨在推断新知识并从知识图中做出预测。最近，大型语言模型（LLMS）表现出了显着的推理能力。 LLM增强的KGC方法主要集中于设计特定于任务的指令，从而实现有希望的进步。但是，仍然面临两个关键挑战。首先，现有方法通常忽略自然语言和图形结构之间的不一致表示空间。其次，大多数方法设计针对不同KGC任务的单独说明，从而导致重复的作品和耗时的过程。为了应对这些挑战，我们提出了SAT，这是一个新颖的框架，可通过结构感知的对齐方式增强KGC的LLM。具体而言，我们首先引入层次知识对齐，以通过多任务对比度学习将嵌入图形嵌入与自然语言空间相结合。然后，我们建议使用统一的图形指令与轻量级知识适配器相结合，以指导LLM在kgs上执行结构感知推理。对四个基准数据集的两个KGC任务的实验结果表明，SAT明显优于最先进的方法，尤其是在链接预测任务中，改进的范围为8.7％至29.8％。

Title: Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation

Authors: Seganrasan Subramanian, Abhigya Verma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01185
Pdf URL: https://arxiv.org/pdf/2509.01185
Copy Paste: [[2509.01185]] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation(https://arxiv.org/abs/2509.01185)
Keywords: language model, llm, prompt
Abstract: The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.
摘要：大型语言模型（LLMS）在长期文本输入中处理和推理的能力对于广泛的现实应用程序至关重要。但是，由于缺乏适合培训和评估的高质量，多样化和可验证的长篇小说数据集，该领域的进展受到了显着限制。这项工作引入了一个模块化，可扩展的框架，用于通过与LLMS的及时交互，用于合成长篇小说数据生成。该框架支持多个培训和对齐目标，包括监督微调（SFT），直接偏好优化（DPO）和小组相对策略优化（GRPO）。它涵盖了四个核心生成范式：多转交谈对话，文档接地输出对对，可验证的指令 - 响应任务和长篇文章的推理示例。通过模板提示，模型不足的体系结构和元数据富含的输出，所提出的方法有助于可扩展，可控制和专用的数据集创建，以提高LLMS中的长篇文化功能。

Title: Statutory Construction and Interpretation for Artificial Intelligence

Authors: Luxi He, Nimra Nadeem, Michel Liao, Howard Chen, Danqi Chen, Mariano-Florentino Cuéllar, Peter Henderson
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.01186
Pdf URL: https://arxiv.org/pdf/2509.01186
Copy Paste: [[2509.01186]] Statutory Construction and Interpretation for Artificial Intelligence(https://arxiv.org/abs/2509.01186)
Keywords: prompt, chat
Abstract: AI systems are increasingly governed by natural language principles, yet a key challenge arising from reliance on language remains underexplored: interpretive ambiguity. As in legal systems, ambiguity arises both from how these principles are written and how they are applied. But while legal systems use institutional safeguards to manage such ambiguity, such as transparent appellate review policing interpretive constraints, AI alignment pipelines offer no comparable protections. Different interpretations of the same rule can lead to inconsistent or unstable model behavior. Drawing on legal theory, we identify key gaps in current alignment pipelines by examining how legal systems constrain ambiguity at both the rule creation and rule application steps. We then propose a computational framework that mirrors two legal mechanisms: (1) a rule refinement pipeline that minimizes interpretive disagreement by revising ambiguous rules (analogous to agency rulemaking or iterative legislative action), and (2) prompt-based interpretive constraints that reduce inconsistency in rule application (analogous to legal canons that guide judicial discretion). We evaluate our framework on a 5,000-scenario subset of the WildChat dataset and show that both interventions significantly improve judgment consistency across a panel of reasonable interpreters. Our approach offers a first step toward systematically managing interpretive ambiguity, an essential step for building more robust, law-following AI systems.
摘要：人工智能系统越来越受自然语言原则的控制，但是依赖语言引起的主要挑战仍然没有得到充实：解释性歧义。与法律体系一样，歧义既来自构成这些原则的编写方式和应用方式。但是，尽管法律制度使用机构保障来管理这种歧义，例如透明的上诉审查警务解释性约束，但AI Alignment Pipelines没有可比的保护。对同一规则的不同解释会导致不一致或不稳定的模型行为。利用法律理论，我们通过检查法律系统如何限制规则创建和规则应用程序步骤中的歧义，从而确定当前对齐管道中的关键差距。然后，我们提出了一个计算框架，该计算框架反映了两种法律机制：（1）规则改进管道通过修改模棱两可的规则（类似于代理机构规则制定或迭代立法行动）来最大程度地减少解释性分歧，以及（2）基于及时的解释性约束，从而减少对法律规定的不合规性（类似法律对指导的指南）。我们在Wildchat数据集的5,000个季节子集上评估了我们的框架，并表明这两种干预措施都大大提高了整个合理口译员的判断一致性。我们的方法为系统地管理解释性歧义提供了第一步，这是建立更强大，遵守AI系统的重要步骤。

Title: Efficient Large Language Models with Zero-Shot Adjustable Acceleration

Authors: Sajjad Kachuee, Mohammad Sharifkhani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01190
Pdf URL: https://arxiv.org/pdf/2509.01190
Copy Paste: [[2509.01190]] Efficient Large Language Models with Zero-Shot Adjustable Acceleration(https://arxiv.org/abs/2509.01190)
Keywords: language model, llm
Abstract: Using Large Language Models (LLMs) in real-world applications presents significant challenges, particularly in balancing computational efficiency and performance. Optimizing acceleration after the fine-tuning phase and during inference is crucial for building an efficient architecture. This paper introduces Zero-Shot Adjustable Acceleration, a novel training and inference method that dynamically adjusts hardware usage during inference without requiring additional fine-tuning. The proposed approach is applied to newly developed models and evaluated across multiple classification and text generation tasks. Experimental results demonstrate that the method enables a wide range of acceleration in a zero-shot manner and achieves up to an 11x speedup compared to the baseline.
摘要：在现实世界应用中使用大型语言模型（LLM）提出了重大挑战，尤其是在平衡计算效率和性能方面。在微调阶段和推理期间优化加速度对于建立有效的体系结构至关重要。本文介绍了可调零的加速度，这是一种新颖的培训和推理方法，该方法在推理过程中动态调整硬件的使用而无需进行其他微调。所提出的方法应用于新开发的模型，并在多个分类和文本生成任务中进行了评估。实验结果表明，该方法可以以零拍的方式实现广泛的加速度，并且与基线相比，达到了11倍的速度。

Title: Mitigating Catastrophic Forgetting in Continual Learning through Model Growth

Authors: Ege Süalp, Mina Rezaei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01213
Pdf URL: https://arxiv.org/pdf/2509.01213
Copy Paste: [[2509.01213]] Mitigating Catastrophic Forgetting in Continual Learning through Model Growth(https://arxiv.org/abs/2509.01213)
Keywords: language model, llm
Abstract: Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones for mitigating the catastrophic forgetting problem. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models, one trained with growth (Stack LLM) and one without (LLM), exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60-61%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.
摘要：灾难性的遗忘是持续学习的重大挑战，在该模型对新任务进行微调时，模型将失去先验知识。这个问题对于经过持续学习的大型语言模型（LLM）尤其重要，因为在各个领域之间保持绩效对其通用效用非常重要。在本文中，我们探讨了模型增长，这是一种有前途的策略，利用较小的模型加快和构建较大的模型来减轻灾难性遗忘问题。尽管基于增长的预读，尤其是通过变压器堆叠，在加速融合方面表现出了希望，但其对遗忘的影响仍然不足。因此，我们评估了基于增长的模型是否可以在涉及领域知识，推理，阅读理解和偏见的一系列微调任务中更有效地保留先前学习的能力。我们的发现表明，这两种模型 - 一个经过增长（堆栈LLM）和没有（LLM）的模型 - 在域知识方面都有改善。但是，推理和阅读理解随着时间的流逝而降低，表明灾难性遗忘的迹象。堆栈LLM始终显示出较少的降解，尤其是在阅读理解方面，表明保留能力增强。有趣的是，在偏见评估中，基线LLM随着持续的微调而逐渐中性，而Stack LLM在60--61 \％左右保持稳定的偏置比。这些结果表明，基于增长的预读可能会在抵抗灾难性遗忘方面提供适度的改善，尽管权衡取舍仍然在处理社会偏见方面。

Title: DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression

Authors: Wei Huang, Huang Wei, Yinggui Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.01221
Pdf URL: https://arxiv.org/pdf/2509.01221
Copy Paste: [[2509.01221]] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Taks Based on Data and Model Compression(https://arxiv.org/abs/2509.01221)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning on downstream tasks is challenging; the central question is how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge at two levels. 1) Data level: we first establish a systematic categorization of data filtering methodologies for LLMs, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we enhance the density of key tokens in the text to achieve token compression. Subsequently, we use an LLM to iteratively rewrite the text to optimize its expression. 2) Model level: we use layer similarity scores to assess each layer's importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model's capability as possible. Extensive experiments on four datasets (medical Q&A, financial Q&A, general Q&A, and reading comprehension) show that we can select the optimal LLM while reducing training time approximately 20-fold.
摘要：大型语言模型（LLMS）在一般任务中表现出色，但与特定于领域的模型斗争，需要对特定数据进行微调。有了许多开源LLM，选择用于微调下游任务的最佳模型是具有挑战性的，主要关注如何快速识别最佳LLM。我们介绍了一个数据和模型压缩框架（DAMOC），该框架通过：1）数据级别：1）数据级别：首先建立了LLMS数据过滤方法的系统分类，将它们分类为三种不同的范式：（1）分布意识到的方法，（2）质量意识的方法，以及（3）杂种方法，（3）介绍了两种尺度的方法。此外，我们在文本中增强了要达到令牌压缩的文本中的密度。随后，我们使用LLM进行迭代重写文本以优化其表达式。 2）模型级别：我们使用层相似性得分来评估每层的重要性，并删除那些重要性较低的人。然后，我们引入了一个稀疏的合并范式，以保留尽可能多的原始模型能力。在四个数据集，医疗问答，财务问答，一般问答和阅读理解上进行了广泛的实验，表明我们可以选择最佳LLM，同时节省约20倍的培训时间。
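
The model-level step in the DaMoC abstract (score each layer by similarity, drop the least important ones) can be illustrated with a minimal sketch. The abstract does not define the similarity score, so everything here is an assumption: the sketch uses cosine similarity between a layer's input and output hidden states, treats a layer that barely changes its input as unimportant, and the names `layer_importance` and `layers_to_keep` are hypothetical. The sparse merging step is not shown.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_importance(hidden_states: List[Sequence[float]]) -> List[float]:
    """Assumed score: 1 - cos(input, output) for each layer.

    hidden_states[i] is the input to layer i and hidden_states[i + 1]
    is its output, so N layers need N + 1 states.
    """
    return [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]


def layers_to_keep(importance: List[float], n_remove: int) -> List[int]:
    """Drop the n_remove lowest-scoring layers, keep the rest in order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    removed = set(ranked[:n_remove])
    return [i for i in range(len(importance)) if i not in removed]


# Toy hidden states for a 3-layer model (made-up numbers).
# Layer 1 only rescales its input, so its assumed importance is 0.
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [-2.0, 0.0]]
scores = layer_importance(states)
kept = layers_to_keep(scores, n_remove=1)
```

In this toy run, layer 1 is the one removed because its output points in the same direction as its input. A real pipeline would average such scores over many tokens and calibration samples before pruning.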

Title: Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors

Authors: Hao Yang, Zhiyu Yang, Yunjie Zhang, Shanyi Zhu, Lin Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01236
Pdf URL: https://arxiv.org/pdf/2509.01236
Copy Paste: [[2509.01236]] Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors(https://arxiv.org/abs/2509.01236)
Keywords: language model, prompt, chain-of-thought
Abstract: Chain-of-Thought reasoning has emerged as a pivotal methodology for enhancing model inference capabilities. Despite growing interest in Chain-of-Thought reasoning, its underlying mechanisms remain unclear. This paper explores the working mechanisms of Chain-of-Thought reasoning from the perspective of the dual relationship between in-context learning and pretrained priors. We first conduct a fine-grained lexical-level analysis of rationales to examine the model's reasoning behavior. Then, by incrementally introducing noisy exemplars, we examine how the model balances pretrained priors against erroneous in-context information. Finally, we investigate whether prompt engineering can induce slow thinking in large language models. Our extensive experiments reveal three key findings: (1) The model not only quickly learns the reasoning structure at the lexical level but also grasps deeper logical reasoning patterns, yet it heavily relies on pretrained priors. (2) Providing sufficient exemplars shifts the model's decision-making from pretrained priors to in-context signals, while misleading prompts introduce instability. (3) Long Chain-of-Thought prompting can induce the model to generate longer reasoning chains, thereby improving its performance on downstream tasks.
摘要：经过思考的推理已成为增强模型推理能力的关键方法。尽管对经过思想的推理的兴趣日益增加，但其基本机制仍不清楚。本文从内在学习与预审核的先知之间的双重关系的角度探讨了思想链推理的工作机制。我们首先对理由进行细粒的词汇水平分析，以检查模型的推理行为。然后，通过逐步引入嘈杂的示例，我们检查了该模型如何平衡鉴定的先知与错误的中文信息。最后，我们调查了迅速的工程是否可以在大型语言模型中引起缓慢的思考。我们的广泛实验揭示了三个关键发现：（1）模型不仅快速学习了词汇水平的推理结构，而且还掌握了更深层次的逻辑推理模式，还在很大程度上依赖于预处理的先验。（2）提供足够的示例将模型的决策从预审计的先知转移到了内在信号，同时误导提示引起了不稳定。（3）长期的经过思考的提示可以引起该模型产生更长的推理链，从而提高其在下游任务上的性能。

Title: Annotation and modeling of emotions in a textual corpus: an evaluative approach

Authors: Jonas Noblet (LIDILEM)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01260
Pdf URL: https://arxiv.org/pdf/2509.01260
Copy Paste: [[2509.01260]] Annotation and modeling of emotions in a textual corpus: an evaluative approach(https://arxiv.org/abs/2509.01260)
Keywords: language model
Abstract: Emotion is a crucial phenomenon in the functioning of human beings in society. However, it remains a widely open subject, particularly in its textual manifestations. This paper examines an industrial corpus manually annotated following an evaluative approach to emotion. This theoretical framework, which is currently underutilized, offers a different perspective that complements traditional approaches. Noting that the annotations we collected exhibit significant disagreement, we hypothesized that they nonetheless follow stable statistical trends. Using language models trained on these annotations, we demonstrate that it is possible to model the labeling process and that variability is driven by underlying linguistic features. Conversely, our results indicate that language models seem capable of distinguishing emotional situations based on evaluative criteria.
摘要：情感是人类在社会中运作的关键现象。但是，它仍然是一个广泛开放的主题，尤其是在其文本表现中。本文研究了一种评估情感方法，研究了手动注释的工业语料库。当前未充分利用的这个理论框架提供了一种与传统方法相辅相成的不同观点。我们指出，我们收集的注释表现出很大的分歧，我们假设它们仍然遵循稳定的统计趋势。使用对这些注释训练的语言模型，我们证明可以对标记过程进行建模，并且可变性是由潜在的语言特征驱动的。相反，我们的结果表明，语言模型似乎能够根据评估标准来区分情感情况。

Title: Culture is Everywhere: A Call for Intentionally Cultural Evaluation

Authors: Juhyun Oh, Inha Cha, Michael Saxon, Hyunseung Lim, Shaily Bhatt, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01301
Pdf URL: https://arxiv.org/pdf/2509.01301
Copy Paste: [[2509.01301]] Culture is Everywhere: A Call for Intentionally Cultural Evaluation(https://arxiv.org/abs/2509.01301)
Keywords: language model, llm
Abstract: The prevailing "trivia-centered paradigm" for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly "neutral" evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don't know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.
摘要：随着这些模型变得更加先进和广泛部署，越来越不足以评估大型语言模型（LLMS）的文化一致性（LLMS）的流行``以琐事中心的范式''的流行。现有方法通常将文化减少到静态事实或价值观，通过多项选择或简短回答的问题测试模型，这些问题将文化视为孤立的琐事。这样的方法忽略了文化的多元化和互动现实，而忽略了文化假设如何渗透到表面上``中性''评估环境。在该立场论文中，我们主张\ textbf {有意文化评估}：一种系统地研究嵌入在评估各个方面的文化假设的方法，而不仅仅是明确的文化任务。我们从系统地表征了在评估中出现文化偶然考虑的事情，方式和环境，并强调了研究人员地位对促进包容性的，文化上一致的NLP研究的重要性。最后，我们讨论了超出当前基准测试实践的含义和未来方向，发现我们不知道的重要应用程序，并通过HCI启发的参与式方法使社区参与评估设计。

Title: TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering

Authors: Sishi Xiong, Ziyang He, Zhongjiang He, Yu Zhao, Changzai Pan, Jie Zhang, Zhenhe Wu, Shuangyong Song, Yongxiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01312
Pdf URL: https://arxiv.org/pdf/2509.01312
Copy Paste: [[2509.01312]] TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering(https://arxiv.org/abs/2509.01312)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively.
摘要：尽管大型语言模型（LLMS）通过及时工程在表问题回答（TQA）任务中显示出希望，但它们在工业应用中面临挑战，包括结构异质性，目标数据定位的困难以及复杂推理中的瓶颈。为了解决这些局限性，本文介绍了TableZoomer，这是一种新型的LLM驱动，基于编程的代理框架。它引入了三个关键的创新：（1）用结构化表格架代替原始的完全言语的表，以弥合语义间隙并降低计算复杂性；（2）一种查询意识到的缩放机制，该机制通过列的选择和实体链接动态生成子表模式，从而显着提高了目标定位效率；（3）一种经营计划（POT）策略，将查询转换为可执行代码以减轻数值幻觉。此外，我们将推理工作流与React范式集成在一起，以实现迭代推理。广泛的实验表明，我们的框架保持可用性优势，同时可以大大提高不同尺度表的性能和可伸缩性。当使用QWEN3-8B-INSTRUCT LLM实施时，TableZoomer在大型数据库数据集和TableBench数据集的小规模事实检查任务上，准确提高了19.34％和25％的准确性。
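
The three ideas in the TableZoomer abstract (a compact table schema in place of the verbalized table, query-aware zooming to a sub-schema, and answering through executable code) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the functions `table_schema` and `zoom` are hypothetical, the column selection is a naive substring match standing in for LLM-driven column selection and entity linking, and the final "program" is hand-written where the framework would have an LLM generate it.

```python
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def table_schema(rows: Table, n_examples: int = 2) -> List[Dict[str, Any]]:
    """Describe each column by name, type, and a few example values,
    so the prompt size does not grow with the number of rows."""
    return [
        {
            "name": col,
            "type": type(rows[0][col]).__name__,
            "examples": [row[col] for row in rows[:n_examples]],
        }
        for col in rows[0]
    ]


def zoom(schema: List[Dict[str, Any]], query: str) -> List[Dict[str, Any]]:
    """Keep only columns whose name appears in the query (naive stand-in
    for the column selection and entity linking described in the abstract)."""
    q = query.lower()
    return [col for col in schema if col["name"].lower() in q]


# Made-up example table.
rows = [
    {"city": "Seoul", "population": 9700000, "country": "KR"},
    {"city": "Lyon", "population": 520000, "country": "FR"},
]
query = "Which city has the largest population?"

schema = table_schema(rows)
sub_schema = zoom(schema, query)

# Program-of-Thoughts step: the answer is computed by code over the table,
# not produced as free text, which avoids arithmetic slips by the model.
answer = max(rows, key=lambda r: r["population"])["city"]
```

Here `sub_schema` retains only the `city` and `population` columns, which is what the generated program needs to read.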

Title: Can Smaller LLMs do better? Unlocking Cross-Domain Potential through Parameter-Efficient Fine-Tuning for Text Summarization

Authors: Anum Afzal, Mehul Kumawat, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01314
Pdf URL: https://arxiv.org/pdf/2509.01314
Copy Paste: [[2509.01314]] Can Smaller LLMs do better? Unlocking Cross-Domain Potential through Parameter-Efficient Fine-Tuning for Text Summarization(https://arxiv.org/abs/2509.01314)
Keywords: language model, llm
Abstract: Large Language Models (LLMs), being generic task solvers, are versatile. However, despite the vast amount of data they are trained on, there are speculations about their adaptation capabilities to a new domain. Additionally, the simple fine-tuning of the model to incorporate knowledge of a new domain is computationally expensive and time-consuming. This becomes more challenging when the domain in question is also low-resource, and labeled data is unavailable. We leverage parameter-efficient fine-tuning techniques (PEFTs) on high-resource datasets to address these challenges to improve performance on unseen low-resource domains. Throughout our experiments, we evaluate whether intrinsic linguistic commonalities between datasets can be leveraged for efficient domain adaptation. We benchmark six PEFTs with Llama-3-8B-Instruct on 14 training datasets from the Scientific, Medical, Legal, and News domains for a Text Summarization task. Our experiments show that for low-resource domains, inference using Within-Domain Adapters can achieve better performance than Few-Shot as well as a much larger Llama-3-70B-Instruct. Lastly, in the absence of Within-Domain Adapters, we explore the concept of using Cross-Domain Adapters as well as the strategic combinations of adapters to leverage intrinsic language similarities across domains, facilitating better adaptability and performance in low-resource settings.
摘要：大型语言模型（LLMS）是通用的任务解决者，它具有通用性。但是，尽管他们接受了大量数据的培训，但仍有关于它们适应新领域的适应能力的猜测。此外，该模型的简单微调以合并新领域的知识是计算昂贵且耗时的。当相关的域也是低资源，并且标记的数据不可用时，这将变得更具挑战性。我们利用高资源数据集的参数有效的微调技术（PEFT）来解决这些挑战，以提高看不见的低资源域的性能。在我们的整个实验中，我们评估数据集之间的内在语言共同点是否可以利用有效的域适应性。我们从科学，医学，法律和新闻领域的14个培训数据集上使用\ texttt {llama-3-8b-instruct}基准六个PEFT，用于文本摘要任务。我们的实验表明，对于低资源域而言，使用内域适配器的推断可以比少数拍摄以及更大的\ texttt {llama-3-70b-instruct}获得更好的性能。最后，在没有域内适配器的情况下，我们探讨了使用跨域适配器的概念以及适配器的战略组合来利用跨领域的固有语言相似性，从而促进了低资源环境中更好的适应性和性能。

Title: LongCat-Flash Technical Report

Authors: Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2509.01322
Pdf URL: https://arxiv.org/pdf/2509.01322
Copy Paste: [[2509.01322]] LongCat-Flash Technical Report(https://arxiv.org/abs/2509.01322)
Keywords: language model, chat, agent
Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation and activate 18.6B-31.3B parameters (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: this https URL Hugging Face: this https URL GitHub: this https URL
摘要：我们引入了Longcat-Flash，这是专为计算效率和先进代理功能设计的560亿参数混合物（MOE）语言模型。朗猫 - 弗拉什（LongCat-Flash）的需求是由于需要可扩展的效率，采用了两种新颖的设计：（a）零委托专家，该专家可以使动态的计算预算分配并激活18.6b-31.3b（平均为27B）（平均为27b），根据上下文需求，可以通过上下文需求进行优化资源使用。（b）与相当规模的模型相比，与相当的模型相比，捷径连接的MOE扩大了计算通信重叠窗口，在推理效率和吞吐量方面表现出显着的提高。我们为大型模型开发了一个综合的缩放框架，该框架结合了超参数转移，模型增长初始化，多管齐下的稳定套件以及确定性计算，以实现稳定且可重复的培训。值得注意的是，我们利用可扩展的建筑设计和基础设施工作中的协同作用，在30天内完成了超过20万亿代币的模型培训，同时以每秒0.70美元的价格达到每秒100个代币（TPS）的推断，为每百万个产量为0.70美元。为了培养longcat-flash针对代理智能，我们对优化的混合物进行了大规模的预训练，然后在推理，代码和说明上进行了针对性的中和后培训，并从合成数据和工具使用任务中进一步增强。全面的评估表明，作为一种无思想的基础模型，Longcat-Flash在其他领先模型之间提供了高度竞争性的性能，具有出色的代理任务优势。 Longcat-Flash的模型检查站开源以促进社区研究。 longcat聊天：此HTTPS URL拥抱面孔：此https url github：此https url

Title: KoBLEX: Open Legal Question Answering with Multi-hop Reasoning

Authors: Jihyung Lee, Daehui Kim, Seonjeong Hwang, Hyounghun Kim, Gary Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01324
Pdf URL: https://arxiv.org/pdf/2509.01324
Copy Paste: [[2509.01324]] KoBLEX: Open Legal Question Answering with Multi-hop Reasoning(https://arxiv.org/abs/2509.01324)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs' legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves gains of +37.91 in F1 and +30.81 in LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of ParSeR.
摘要：大型语言模型（LLM）在一般领域中取得了出色的表现，现在正在进入法律的专家领域。已经提出了一些基准来评估LLMS的法律能力。但是，这些基准无法评估开放式和提供的基础问题回答（QA）。为了解决这个问题，我们引入了韩国基准，用于法律可解释的QA（KOBLEX），该基准旨在评估提供供应的多跳法律推理。 Koblex包括226个基于方案的质量检查实例及其支持规定，使用混合LLM-Human Expert Pipeline创建。我们还提出了一种称为参数提供指导选择检索（Parser）的方法，该方法使用LLM生成的参数规定来指导合法扎根和可靠的答案。解析器通过产生参数规定并采用三阶段的顺序检索过程来促进多跳的推理。此外，为了更好地评估生成的答案的法律保真度，我们提出了法律保真度评估（LF-eval）。 LF-eval是一个自动指标，共同考虑了问题，回答和支持规定，并显示了与人类判断的高度相关性。实验结果表明，解析器始终胜过强大的基准，从而在多个LLM中取得了最佳结果。值得注意的是，与使用GPT-4O的标准检索相比，解析器可实现+37.91的F1和+30.81较高的LF-eval。进一步的分析表明，解析器有效地在推理深度上提供一致的性能，并证实了解析器的有效性。

Title: Can Large Language Models Master Complex Card Games?

Authors: Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01328
Pdf URL: https://arxiv.org/pdf/2509.01328
Copy Paste: [[2509.01328]] Can Large Language Models Master Complex Card Games?(https://arxiv.org/abs/2509.01328)
Keywords: language model, llm
Abstract: Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.
摘要：长期以来，复杂的游戏一直是测试人工智能算法进度的重要基准。 Alphago，Alphazero和Muzero在Go and Chess中击败了顶级人类球员，引起了人们对人工智能的广泛关注。同时，大型语言模型（LLM）在各种任务中表现出了非凡的功能，提出了一个问题，即LLM是否可以在复杂游戏中取得类似的成功。在本文中，我们探讨了LLM在掌握复杂纸牌游戏中的潜力。我们系统地评估了LLM在八种不同的纸牌游戏中的学习能力，评估了微调对高质量游戏数据的影响，并检查模型在掌握这些游戏时保持一般能力的能力。 Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data.评估结果表明LLM的学习能力和多功能性很强。

Title: Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

Authors: Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01363
Pdf URL: https://arxiv.org/pdf/2509.01363
Copy Paste: [[2509.01363]] Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic(https://arxiv.org/abs/2509.01363)
Keywords: language model, chain-of-thought
Abstract: Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
摘要：大型语言模型通常需要昂贵的优化，例如加固学习，以掌握复杂的推理任务。这项工作表明，一旦学会了推理能力，就可以在模型之间提取和传递作为紧凑的任务向量。我们采购了两个公开可用的，相同初始化的QWEN2.5模型，一种通过监督微调（SFT）进行了微调，另一个则在同一数据集上进行了相对策略优化（GRPO）。从这些中，我们提取一个推理向量：$ v _ {\ text {quach}} = \ theta _ {\ text {grpo}} - \ theta _ {\ text {sft {sft}} $。我们假设该矢量捕获了通过强化学习所灌输的推理能力，同时从SFT过程中考虑了共同的知识。当通过简单算术添加到兼容的指令调整模型中时，该矢量始终提高各种推理基准的性能：GSM8K（+4.9％），HumaneVal（+4.3％），Sciq（+1.7％）和BigBenchhard（1.5B模型的+12.3％）。在对抗条件下，绩效的改善持续存在。相反，减去向量会导致显着的性能下降（GSM8K的-11.8％），这表明了矢量对模型推理能力的强有力贡献。这项工作表明了通常通过昂贵的培训开发的推理能力如何从现有的开源模型中提取，并通过简单的张量算术重新使用，从而通过回收先前的计算投资来增强模型的实用方法。
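
The formula in the Reasoning Vectors abstract, $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$, is plain element-wise arithmetic over matching parameter tensors. The sketch below shows that arithmetic on toy values. It is an illustration only: parameters are modeled as dictionaries of float lists instead of real checkpoints, the names `extract_vector` and `apply_vector` and the `scale` argument are hypothetical, and all numbers are made up.

```python
from typing import Dict, List

# A "model" here is just parameter name -> flat list of weights (assumption).
Params = Dict[str, List[float]]


def extract_vector(theta_grpo: Params, theta_sft: Params) -> Params:
    """v_reason = theta_GRPO - theta_SFT, computed per parameter tensor."""
    return {
        name: [g - s for g, s in zip(theta_grpo[name], theta_sft[name])]
        for name in theta_grpo
    }


def apply_vector(theta: Params, vector: Params, scale: float = 1.0) -> Params:
    """theta + scale * v; scale=-1.0 subtracts the vector instead."""
    return {
        name: [w + scale * v for w, v in zip(theta[name], vector[name])]
        for name in theta
    }


# Two checkpoints that share an initialization (toy values).
theta_sft = {"layer0.weight": [0.5, -1.0], "layer0.bias": [0.25]}
theta_grpo = {"layer0.weight": [0.75, -1.5], "layer0.bias": [0.5]}

# A compatible target model with the same parameter names and shapes.
theta_target = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}

v_reason = extract_vector(theta_grpo, theta_sft)
enhanced = apply_vector(theta_target, v_reason)            # add the vector
ablated = apply_vector(theta_target, v_reason, scale=-1.0)  # subtract it
```

The arithmetic only makes sense when the target shares architecture and parameter layout with the source models, which matches the abstract's restriction to compatible instruction-tuned models.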

Title: WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data

Authors: Paloma Piot, Diego Sánchez, Javier Parapar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01379
Pdf URL: https://arxiv.org/pdf/2509.01379
Copy Paste: [[2509.01379]] WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data(https://arxiv.org/abs/2509.01379)
Keywords: language model, chat, chain-of-thought, agent
Abstract: Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT-based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain-of-thought reasoning, and checks platform guidelines to explain and support its decisions. This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state-of-the-art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.
摘要：在线危害是数字空间中日益增长的问题，使用户安全处于危险之中并降低社交媒体平台的信任。最持久的伤害形式之一是仇恨言论。为了解决这个问题，我们需要将自动化系统的速度和规模与人类主持人的判断和见解相结合的工具。这些工具不仅应该找到有害内容，而且还应清楚地解释他们的决定，有助于建立信任和理解。在本文中，我们介绍了一个聊天机器人，旨在支持内容主持人解决仇恨言论。聊天机器人是作为人工智能代理系统构建的，该系统使用大型语言模型以及多种专用工具。它将新帖子与仇恨言论和中性内容的真实示例进行了比较，使用基于BERT的分类器来帮助标记有害信息，使用Urban Dictiator等资料来源，抬起语和非正式语言，产生思想链的推理，并检查平台指南来解释和支持其决策。这种组合不仅允许聊天机器人检测仇恨言论，而且可以解释为什么内容被认为是有害的，既有先例和政策。实验结果表明，我们所提出的方法超过了现有的最新方法，宏F1得分为0.91。该工具专为主持人，安全团队和研究人员而设计，有助于通过支持AI和人类监督之间的协作来减少在线危害。

Title: ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links

Authors: Serwar Basch, Ilia Kuznetsov, Tom Hope, Iryna Gurevych
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.01387
Pdf URL: https://arxiv.org/pdf/2509.01387
Copy Paste: [[2509.01387]] ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links(https://arxiv.org/abs/2509.01387)
Keywords: llm
Abstract: Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods to create training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains, peer review and news, and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay a foundation for numerous cross-document tasks like media framing and peer review. We make the code, data, and annotation protocols openly available.
摘要：了解文档之间的细粒关系对于许多应用领域至关重要。但是，对自动辅助的研究受到缺乏创建跨文档链接培训和评估数据集的有效方法的限制。为了解决这个问题，我们介绍了一个新的域 - 不知不线框架，用于从头开始选择最佳的方法和注释的跨文档链接。我们首先生成和验证互连文档的半合成数据集。该数据用于执行自动评估，并提供最佳表现链接方法的入围名单。然后，在一项广泛的人类评估研究中使用了这些方法，从而对自然文本对产生了绩效估计。我们将我们的框架应用于两个不同的领域 - 同行评审和新闻 - 并表明将检索模型与LLMS相结合78 \％的链接批准来自人类评估者，这使得仅是强犬的精确度增加了一倍以上。我们的框架可以系统地研究跨应用程序场景的跨文档理解，并且由此产生的新颖数据集为众多跨文档任务（例如媒体框架和同行评审）奠定了基础。我们公开提供代码，数据和注释协议。

Title: LLMs cannot spot math errors, even when allowed to peek into the solution

Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01395
Pdf URL: https://arxiv.org/pdf/2509.01395
Copy Paste: [[2509.01395]] LLMs cannot spot math errors, even when allowed to peek into the solution(https://arxiv.org/abs/2509.01395)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To that end, we propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student's solution, which helps improve performance.
摘要：大型语言模型（LLMS）在数学单词问题上表现出了出色的表现，但是它们已被证明在诸如识别学生解决方案中的错误之类的元方法中挣扎。在这项工作中，我们研究了使用两个错误推理数据集在逐步解决方案中找到第一个错误步骤的挑战：VTG和PRM800K。我们的实验表明，即使可以访问参考解决方案，最先进的LLM也很难在学生解决方案中找到第一个错误步骤。为此，我们提出了一种生成中级校正学生解决方案的方法，与原始学生的解决方案更加一致，这有助于提高性能。

Title: Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

Authors: Kaviraj Pather, Elena Hadjigeorgiou, Arben Krasniqi, Claire Schmit, Irina Rusu, Marc Pons, Kabir Khan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01412
Pdf URL: https://arxiv.org/pdf/2509.01412
Copy Paste: [[2509.01412]] Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning(https://arxiv.org/abs/2509.01412)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.
摘要：大型语言模型（LLMS）通过提示（COT）提示显示出强烈的推理，但是该过程是不透明的，这使得在高风险设置中验证，调试和控制困难。我们提出了Vis-Cot，这是一种人类的循环框架，将线性COT文本转换为交互推理图。用户可以通过修剪不正确的路径并嫁接新的，用户定义的前提，可视化逻辑流，识别有缺陷的步骤并进行干预。这将相互作用从被动观察转变为主动协作，转向模型转向更准确和值得信赖的结论。在GSM8K和StrategionQA中，Vis-Cot在非交互式基准的情况下提高了24个百分点的最终准确性。一项用户研究还显示了可感知的可用性和信任的巨大收益。通过将LLM与有针对性的人类监督相结合，可以指出实用的途径，以实现更可靠，易于理解和协作推理。

Title: On the Alignment of Large Language Models with Global Human Opinion

Authors: Yang Liu, Masahiro Kaneko, Chenhui Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01418
Pdf URL: https://arxiv.org/pdf/2509.01418
Copy Paste: [[2509.01418]] On the Alignment of Large Language Models with Global Human Opinion(https://arxiv.org/abs/2509.01418)
Keywords: language model, llm, prompt
Abstract: Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. They also overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs align appropriately, or over-align, with the opinions of only a few countries while under-aligning with the opinions of most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire steers LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at this https URL.
摘要：当今的大型语言模型（LLMS）能够支持多语言场景，从而使用户可以用其母语与LLM进行交互。当LLM响应用户提出的主观问题时，他们应与特定人口组或历史时期的观点保持一致，这是由用户与模型互动的语言所塑造的。现有的研究主要集中于研究美国或少数国家 /地区的人群中LLM所代表的观点，缺乏全球的国家样本和关于不同历史时期人类意见的研究，并且缺乏关于使用语言来指导LLMS的讨论。此外，他们还忽略了迅速语言对LLMS意见的一致性的潜在影响。在这项研究中，我们的目标是填补这些空白。为此，我们根据世界价值调查（WVS）创建一个评估框架，以系统地评估LLM与世界各地不同国家，语言和历史时期的人类意见的一致性。我们发现，在与大多数国家的观点不相关的同时，LLMS适当或过度与仅几个国家的观点过度结合。此外，将提示的语言更改为匹配问卷中使用的语言可以有效地引导LLMS比现有转向方法更有效地与相应国家的观点保持一致。同时，LLM与当代人群的意见更加一致。据我们所知，我们的研究是对全球，语言和时间维度跨LLM的意见一致性主题的首次全面调查。我们的代码和数据在此HTTPS URL上公开可用。

Title: Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

Authors: Markus Oehri, Giulia Conti, Kaviraj Pather, Alexandre Rossi, Laia Serra, Adrian Parody, Rogvi Johannesen, Aviaja Petersen, Arben Krasniqi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01455
Pdf URL: https://arxiv.org/pdf/2509.01455
Copy Paste: [[2509.01455]] Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal(https://arxiv.org/abs/2509.01455)
Keywords: language model, hallucination
Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
摘要：部署的语言模型不仅必须决定要回答什么，还必须决定不回答。我们提出了UNICR，这是一个将异质不确定性证据的统一框架，包括序列可能性，自矛盾分散，检索兼容性以及工具或验证者的反馈变为正确性的概率，然后通过原理拒绝来强制使用用户指定的误差预算。 UNICR学习具有温度缩放和适当评分的轻质校准头，通过黑盒功能支持仅API-FOLILY型号，并使用保形风险控制提供无分配保证。为了长期产生，我们通过监督从检索到的证据得出的原子事实得分来使信心与语义忠诚度相结合，从而减少了自信的幻觉，同时保留了覆盖范围。与术前或前logit阈值相比，有关QA的实验，具有执行测试的代码，具有执行测试的代码和检索长期质量质量质量标准的校准指标，风险覆盖曲线下的较低面积以及固定风险的较高覆盖范围与熵或logit阈值相比，固定风险更高的覆盖范围较高。分析表明，证据矛盾，语义分散和工具不一致是弃权的主要驱动因素，产生了信息丰富的用户拒绝信息。结果是可移植的配方，证明了融合的证据融合到校准的概率，以控制风险控制的决策，可以提高信任度，而无需微调基本模型，并且在分配变化下仍然有效。
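A minimal sketch of risk-controlled refusal as described in the abstract above, assuming calibrated confidences and correctness labels are available on a held-out calibration set. It is a plain empirical threshold search, not the conformal risk control procedure the paper uses, and the function names are hypothetical.

```python
def pick_threshold(confidences, correct, error_budget):
    """Lowest confidence threshold whose answered subset stays within the budget.

    Returns None when no threshold satisfies the budget on the calibration set.
    """
    pairs = sorted(zip(confidences, correct), reverse=True)
    best = None
    errors = 0
    for i, (conf, ok) in enumerate(pairs):
        errors += 0 if ok else 1
        # Only evaluate once all items sharing this confidence are counted.
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != conf
        if last_of_tie and errors / (i + 1) <= error_budget:
            best = conf
    return best


def should_answer(confidence, threshold):
    # Refuse whenever no valid threshold exists or the confidence falls below it.
    return threshold is not None and confidence >= threshold


calibration_conf = [0.9, 0.8, 0.7, 0.6, 0.5]
calibration_ok = [True, True, False, True, False]
threshold = pick_threshold(calibration_conf, calibration_ok, error_budget=0.25)
```

With this toy data the threshold is 0.6: answering the four most confident items yields one error in four, which meets the 25% budget, while also answering the fifth would exceed it.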

Title: Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA

Authors: Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01468
Pdf URL: https://arxiv.org/pdf/2509.01468
Copy Paste: [[2509.01468]] Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA(https://arxiv.org/abs/2509.01468)
Keywords: language model, llm
Abstract: Large language models (LLMs) encode vast amounts of world knowledge but remain static once trained, making the timely integration of emerging facts prohibitively expensive via full retraining. Knowledge-editing techniques have thus emerged to inject or overwrite specific facts into LLMs, yet they either over-rely on superficial cues or incur complex, iterative pipelines that collapse under noisy, multi-hop conditions. We introduce Reason-KE, an end-to-end reasoning-chain-based editing framework that steers a pretrained LLM through four structured stages (fact acknowledgment, relevance determination, selective application, and final reasoning) to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates Qwen2.5-7B's multi-hop QA accuracy to 90.2% while suffering merely a 6.3% drop under heavy distraction and <1% when answers are leaked. Our quantitative analysis confirms Reason-KE's resilience and efficiency, establishing a new state-of-the-art for reliable LLM knowledge updates.
摘要：大型语言模型（LLMS）编码了大量的世界知识，但一旦训练，仍然保持静态知识，从而使新兴事实及时地整合到通过完整的重新培训来过于昂贵。因此，知识编辑的技术已经出现了将特定事实注入或覆盖为LLM，但是它们要么过度呈浅层提示，要么在噪音，多跳的条件下崩溃的浅色复合物，迭代的迭代管道。我们介绍了理性 - 基于端到端推理链的编辑框架，该框架通过四个结构化阶段的确认，相关性确定，选择性应用程序和最终推理，以单个通过的过滤器分散分散术，可以通过四个结构化阶段的确认，相关性确定，选择性应用和最终推理来处理验证的LLM。在Mquake-CF上接受了多达四个无关的事实的培训，Reason-ke将QWEN2.5-7B的多跳QA准确度提升到90.2％，而在沉重的分散注意力下仅遭受6.3％的下降，答案泄漏时<1％。我们的定量分析证实了原因 - Ke的韧性和效率，为可靠的LLM知识更新建立了新的最新技术。

Title: Do Retrieval Augmented Language Models Know When They Don't Know?

Authors: Youchao Zhou, Heyan Huang, Yicheng Liu, Rui Dai, Xinglin Wang, Xingchen Zhang, Shumin Shi, Yang Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01476
Pdf URL: https://arxiv.org/pdf/2509.01476
Copy Paste: [[2509.01476]] Do Retrieval Augmented Language Models Know When They Don't Know?(https://arxiv.org/abs/2509.01476)
Keywords: language model, llm, hallucination
Abstract: Existing Large Language Models (LLMs) occasionally generate plausible yet factually incorrect responses, known as hallucinations. Researchers are primarily using two approaches to mitigate hallucinations, namely Retrieval Augmented Language Models (RALMs) and refusal post-training. However, current research predominantly emphasizes their individual effectiveness while overlooking the evaluation of the refusal capability of RALMs. In this study, we ask the fundamental question: Do RALMs know when they don't know? Specifically, we ask three questions. First, are RALMs well-calibrated regarding different internal and external knowledge states? We examine the influence of various factors. Contrary to expectations, we find that LLMs exhibit significant over-refusal behavior. Then, how does refusal post-training affect the over-refusal issue? We investigate the Refusal-aware Instruction Tuning (R-tuning) and In-Context Fine-tuning methods. Our results show that the over-refusal problem is mitigated by In-Context Fine-tuning but magnified by R-tuning. However, we also find that the refusal ability may conflict with the quality of the answer. Finally, we develop a simple yet effective refusal method for refusal post-trained models to improve their overall answer quality in terms of refusal and correct answers. Our study provides a more comprehensive understanding of the influence of important factors on RALM systems.
摘要：现有的大型语言模型（LLMS）偶尔会产生合理但事实不正确的反应，称为幻觉。研究人员主要使用两种方法来减轻幻觉，即检索增强语言模型（RALMS）和拒绝培训。但是，当前的研究主要强调了他们的个人效率，同时忽略了对拉尔姆斯拒绝能力的评估。在这项研究中，我们提出了一个基本问题：拉尔姆斯知道他们何时不知道吗？具体来说，我们问三个问题。首先，关于不同的内部和外部知识状态的拉尔姆斯是否校准了？我们检查了各种因素的影响。与期望相反，我们发现llms表现出重要的\ textbf {过度复杂}的行为。然后，拒绝训练后如何影响过度垃圾问题？我们研究了拒绝感知的说明调整和内部文化微调方法。我们的结果表明，过度的问题可以通过封闭式微调来减轻。但通过r-tuning放大。但是，我们还发现拒绝能力可能与答案的质量相抵触。最后，我们为拒绝训练后的模型开发了一种简单而有效的拒绝方法，以在拒绝和正确的答案方面提高其整体答案质量。我们的研究对重要因素对RALM系统的影响提供了更全面的理解。

Title: MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

Authors: Andreas Ottem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01514
Pdf URL: https://arxiv.org/pdf/2509.01514
Copy Paste: [[2509.01514]] MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models(https://arxiv.org/abs/2509.01514)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems typically face constraints because of their inherent mechanism: a simple top-k semantic search [1]. The approach often leads to the incorporation of irrelevant or redundant information in the context, degrading performance and efficiency [10][11]. This paper presents MeVe, a novel modular architecture intended for Memory Verification and smart context composition. MeVe rethinks the RAG paradigm by proposing a five-phase modular design that breaks down the retrieval and context composition process into distinct, auditable, and independently tunable phases: initial retrieval, relevance verification, fallback retrieval, context prioritization, and token budgeting. This architecture enables fine-grained control of what knowledge is made available to an LLM, enabling task-dependent filtering and adaptation. We release a reference implementation of MeVe as a proof of concept and evaluate its performance on knowledge-heavy QA tasks over a subset of English Wikipedia [22]. Our results demonstrate that by actively verifying information before composition, MeVe significantly improves context efficiency, achieving a 57% reduction on the Wikipedia dataset and a 75% reduction on the more complex HotpotQA dataset compared to standard RAG implementations [25]. This work provides a framework for more scalable and reliable LLM applications. By refining and distilling contextual information, MeVe offers a path toward better grounding and more accurate factual support [16].
摘要：检索增强的生成（RAG）系统通常由于其固有机制而面临约束：简单的top-k语义搜索[1]。该方法通常会导致在上下文中纳入无关或冗余信息，从而降低性能和效率[10] [11]。本文介绍了MeVe，这是一种用于内存验证和智能上下文组成的新型模块化体系结构。 MeVe通过提出一种五阶段模块化设计来重新考虑RAG范式，该设计将检索和上下文组成过程分解为独立的，可审计的和可单独调节的阶段：初始检索，相关性验证，后备检索，上下文优先级排序和令牌预算。该体系结构可以对LLM提供的知识进行细粒度控制，从而实现任务依赖性的过滤和适应。我们发布了MeVe作为概念证明的参考实现，并在英语Wikipedia的一个子集上评估了其在知识密集型问答任务上的表现[22]。我们的结果表明，通过在组成之前积极验证信息，MeVe显着提高了上下文效率，与标准的RAG实现相比，Wikipedia数据集的降低了57％，更复杂的HotpotQA数据集降低了75％[25]。这项工作为更可扩展和可靠的LLM应用程序提供了一个框架。通过完善和提炼上下文信息，MeVe为更好的基础和更准确的事实支持提供了途径[16]。
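The five phases named in the abstract above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the released reference implementation: the word-overlap scorer stands in for a real retriever and verifier, whitespace splitting stands in for a tokenizer, and `meve_like_context` and its parameters are hypothetical names.

```python
def overlap(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def meve_like_context(query, corpus, k=3, min_relevance=0.5,
                      token_budget=20, fallback=None):
    # Phase 1: initial retrieval (top-k by score).
    candidates = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)[:k]
    # Phase 2: relevance verification (drop weak candidates).
    verified = [d for d in candidates if overlap(query, d) >= min_relevance]
    # Phase 3: fallback retrieval, used only when verification leaves nothing.
    if not verified and fallback is not None:
        verified = list(fallback(query))
    # Phase 4: context prioritization (deduplicate, best first).
    ordered = sorted(dict.fromkeys(verified),
                     key=lambda d: overlap(query, d), reverse=True)
    # Phase 5: token budgeting (whitespace tokens as a stand-in).
    context, used = [], 0
    for doc in ordered:
        cost = len(doc.split())
        if used + cost <= token_budget:
            context.append(doc)
            used += cost
    return context


corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "bananas are yellow",
]
context = meve_like_context("capital of france", corpus,
                            min_relevance=0.7, token_budget=10)
```

Keeping each phase as a separate step is what makes it independently tunable: the threshold, the fallback, and the budget can each be changed without touching the others.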

Title: CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models

Authors: Kairong Han, Wenshuo Zhao, Ziyu Zhao, JunJian Ye, Lujia Pan, Kun Kuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01535
Pdf URL: https://arxiv.org/pdf/2509.01535
Copy Paste: [[2509.01535]] CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models(https://arxiv.org/abs/2509.01535)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. Implementation details can be found at this https URL.
摘要：大型语言模型（LLM）在各个领域取得了巨大的成功。但是，仍然存在一个基本问题：LLM可以有效利用因果知识进行预测和产生吗？通过实证研究，我们发现直接在大规模数据上训练的LLM通常会捕获虚假的相关性而不是真正的因果关系，从而导致次优性能，尤其是在分布外（OOD）场景中。为了应对这一挑战，我们提出了因果注意调整（CAT），这是一种新颖的方法，将细粒的因果知识注入了注意机制。我们提出了一条自动化管道，该管道利用人类先验自动产生令牌级别的因果信号，并引入重新注意力的机制来指导培训，从而帮助该模型关注因果结构，同时减轻注意力评分的噪声和偏见。对我们提出的伪造令牌游戏（STG）基准和多个下游任务的实验结果表明，我们的方法有效地利用了因果知识进行预测，并且在OOD场景中保持强大。实施详细信息可以在此HTTPS URL上找到。

Title: In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

Authors: Seungkyu Lee, Nalim Kim, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01560
Pdf URL: https://arxiv.org/pdf/2509.01560
Copy Paste: [[2509.01560]] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents(https://arxiv.org/abs/2509.01560)
Keywords: llm, agent
Abstract: Tool agents -- LLM-based systems that interact with external APIs -- offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.
摘要：工具代理 - 与外部API交互的基于LLM的系统 - 提供了执行现实世界任务的方法。但是，随着任务变得越来越复杂，这些代理商努力以适当的顺序识别和调用正确的API。为了解决此问题，我们调查将API文档转换为结构化的API图，该图形捕获API依赖性并利用其用于需要组成API调用的多工具查询。为了支持这一点，我们介绍了In-N-Out，这是第一个由两个现实世界API基准及其文档构建的API图的专家通知数据集。使用IN-N-OUT可显着提高工具检索和多工具查询生成的性能，仅使用文档将LLM的速度几乎翻了一番。此外，在In-N-Out关闭此差距的90％的模型生成的图表中生成的图表，表明我们的数据集有助于模型学习理解API文档和参数关系。我们的发现强调了将expip api图用于工具代理的承诺，并将IN-N-OUT作为宝贵资源的实用性。我们将公开发布数据集和代码。
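A minimal sketch of how a parameter-level API graph could be used to order compositional calls, assuming each edge links one API's output parameter to another API's input parameter. The API and parameter names are invented for illustration and do not come from the In-N-Out dataset.

```python
from graphlib import TopologicalSorter

# Each edge: (producer API, output parameter) -> (consumer API, input parameter).
edges = [
    (("search_city", "city_id"), ("get_weather", "city_id")),
    (("get_weather", "forecast"), ("send_message", "body")),
]


def call_order(param_edges):
    """Order APIs so that every producer is called before its consumers."""
    dependencies = {}
    for (producer, _), (consumer, _) in param_edges:
        dependencies.setdefault(producer, set())
        dependencies.setdefault(consumer, set()).add(producer)
    return list(TopologicalSorter(dependencies).static_order())


def wiring(param_edges):
    """Map each consumer input to the producer output that fills it."""
    return {f"{c_api}.{c_param}": f"{p_api}.{p_param}"
            for (p_api, p_param), (c_api, c_param) in param_edges}


order = call_order(edges)
links = wiring(edges)
```

The graph gives an agent both the call order and the argument wiring, which documentation text alone leaves implicit.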

Title: Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

Authors: Zeguan Xiao, Diyang Dou, Boya Xiong, Yun Chen, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01564
Pdf URL: https://arxiv.org/pdf/2509.01564
Copy Paste: [[2509.01564]] Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief(https://arxiv.org/abs/2509.01564)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models that have undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment. In this paper, we propose EAGLE (Expectation of AGgregated internaL bElief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. Instead of relying on the model's final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model's internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines. We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.
摘要：大型语言模型（LLMS）在各种自然语言任务中取得了巨大的成功，但通常表现出过度自信并产生合理但不正确的答案。这种过度自信，尤其是在从人类反馈中学习的强化学习的模型中，对可靠的不确定性估计和安全部署提出了重大挑战。在本文中，我们提出了Eagle（预期汇总内部BEIEF），这是一种基于自我评估的新型校准方法，它利用LLMS的内部隐藏状态来得出更准确的置信度评分。我们的方法不依赖模型的最终输出，而是在自我评估过程中从多个中间层中提取内部信念。通过汇总这些层面信念并计算对由此产生的置信度分布的期望，Eagle产生了精致的置信度评分，更忠实地反映了模型的内部确定性。对不同数据集和LLM的广泛实验表明，Eagle显着提高了现有基准的校准性能。我们还提供了对鹰的深入分析，包括对不确定性模式进行层次检查，对自我评估提示的影响的研究以及对自我评估得分范围效果的分析。
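A minimal sketch of the aggregate-then-expect step described in the abstract above, assuming each intermediate layer has already been turned into a probability distribution over a discrete self-evaluation score range. Mean aggregation and the final rescaling to [0, 1] are assumptions made for illustration; the function name is hypothetical.

```python
def eagle_like_confidence(layer_probs, scores):
    """Average per-layer score distributions, then take the expected score.

    layer_probs: one distribution per layer, each aligned with `scores`.
    scores: the self-evaluation score values, e.g. [0, 1, ..., 10].
    """
    n_layers = len(layer_probs)
    # Aggregate layer-wise beliefs (assumption: simple mean over layers).
    aggregated = [sum(p[i] for p in layer_probs) / n_layers
                  for i in range(len(scores))]
    # Renormalize in case the per-layer inputs were not exactly normalized.
    total = sum(aggregated)
    aggregated = [a / total for a in aggregated]
    # Expectation over the aggregated distribution, rescaled to [0, 1].
    expected = sum(s * a for s, a in zip(scores, aggregated))
    return expected / max(scores)


scores = [0, 1, 2]
layer_probs = [
    [0.0, 0.0, 1.0],  # one layer is certain of the top score
    [0.0, 1.0, 0.0],  # another layer prefers the middle score
]
confidence = eagle_like_confidence(layer_probs, scores)
```

Taking an expectation instead of the single most likely score lets disagreement between layers lower the final confidence smoothly.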

Title: Benchmarking the Detection of LLMs-Generated Modern Chinese Poetry

Authors: Shanshan Wang, Junchao Wu, Fengying Ye, Jingming Yao, Lidia S. Chao, Derek F. Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01620
Pdf URL: https://arxiv.org/pdf/2509.01620
Copy Paste: [[2509.01620]] Benchmarking the Detection of LLMs-Generated Modern Chinese Poetry(https://arxiv.org/abs/2509.01620)
Keywords: language model, llm
Abstract: The rapid development of advanced large language models (LLMs) has made AI-generated text indistinguishable from human-written text. Previous work on detecting AI-generated text has made effective progress, but has not involved modern Chinese poetry. Due to the distinctive characteristics of modern Chinese poetry, it is difficult to identify whether a poem originated from humans or AI. The proliferation of AI-generated modern Chinese poetry has significantly disrupted the poetry ecosystem. Based on the urgency of identifying AI-generated poetry in the real Chinese world, this paper proposes a novel benchmark for detecting LLMs-generated modern Chinese poetry. We first construct a high-quality dataset, which includes both 800 poems written by six professional poets and 41,600 poems generated by four mainstream LLMs. Subsequently, we conduct systematic performance assessments of six detectors on this dataset. Experimental results demonstrate that current detectors cannot be used as reliable tools to detect modern Chinese poems generated by LLMs. The most difficult poetic features to detect are intrinsic qualities, especially style. The detection results verify the effectiveness and necessity of our proposed benchmark. Our work lays a foundation for future detection of AI-generated poetry.
摘要：高级大语模型（LLM）的快速发展使AI生成的文本与人写的文本无法区分。以前关于检测AI生成的文本的工作已取得了有效的进步，但没有涉及现代中国诗歌。由于现代中国诗歌的独特特征，很难确定一首诗是起源于人类还是AI。 AI生成的现代中国诗歌的扩散严重破坏了诗歌生态系统。基于在真正的中国世界中识别AI生成的诗歌的紧迫性，本文提出了一种新颖的基准，用于检测LLMS生成的现代中国诗歌。我们首先构建了一个高质量的数据集，其中包括由六首专业诗人写的800首诗和四个主流LLM产生的41,600首诗。随后，我们对该数据集的六个检测器进行系统的性能评估。实验结果表明，当前的检测器不能用作可靠的工具来检测LLMS产生的现代中国诗歌。要检测到的最困难的诗意特征是内在品质，尤其是风格。检测结果验证了我们提出的基准的有效性和必要性。我们的作品为未来对AI生成的诗歌的检测奠定了基础。

Title: chDzDT: Word-level morphology-aware language model for Algerian social media text

Authors: Abdelkrime Aries
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01772
Pdf URL: https://arxiv.org/pdf/2509.01772
Copy Paste: [[2509.01772]] chDzDT: Word-level morphology-aware language model for Algerian social media text(https://arxiv.org/abs/2509.01772)
Keywords: language model
Abstract: Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.
摘要：预训练的语言模型（PLM）通过提供上下文敏感的文本表示，具有基本高级的自然语言处理。但是，阿尔及利亚方言的代表性不足，很少有专用模型可用。由于其复杂的形态，频繁的代码切换，多个脚本以及来自其他语言的强烈词汇影响，因此处理此方言是具有挑战性的。这些特征使令牌化复杂并降低了常规单词或子词级方法的有效性。为了解决这一差距，我们介绍了CHDZDT，这是一种针对阿尔及利亚形态量身定制的角色级预训练的语言模型。与依靠令牌序列的常规PLM不同，CHDZDT接受了孤立的单词的训练。该设计使该模型可以坚固地编码形态模式，而无需取决于令牌边界或标准化拼字法。培训语料库借鉴了各种来源，包括YouTube评论，法语，英语和柏柏尔维基百科，以及Tatoeba项目。它涵盖了多个脚本和语言品种，从而实现了大量的预训练工作量。我们的贡献是三倍：（i）使用YouTube评论对阿尔及利亚方言的详细形态分析；（ii）构建多语言阿尔及利亚词典数据集；（iii）对角色级PLM作为以形态为中心任务的编码器的开发和广泛评估。提出的方法证明了角色级建模对于形态上富含，低资源的方言，并为更具包容性和适应性的NLP系统奠定了基础。

Title: Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Authors: Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, Yao Qin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.01790
Pdf URL: https://arxiv.org/pdf/2509.01790
Copy Paste: [[2509.01790]] Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs(https://arxiv.org/abs/2509.01790)
Keywords: language model, gpt, llm, prompt
Abstract: Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
摘要：提示敏感性，指的是释义（即使用不同的词语重复书面或口头内容）导致大语言模型（LLM）表现的重大变化的现象，已被广泛接受为LLMS的核心限制。在这项工作中，我们重新审视了这个问题，并提出：广泛报道的高提示敏感性是LLM的固有弱点，还是在很大程度上是评估过程的产物？为了回答这个问题，我们在6个基准中系统地评估了7个LLM（例如GPT和Gemini家族），包括12个不同的提示模板上的多项选择和开放式任务。我们发现，许多提示敏感性源于启发式评估方法，包括对数似然评分和刚性的答案匹配，这些方法通常忽略了通过替代措辞（例如同义词或释义）表达的语义上正确的响应。当我们采用LLM-as-a-Judge评估时，我们观察到性能方差大幅降低，并且跨提示的模型排名相关性持续更高。我们的发现表明，现代LLM对提示模板的鲁棒性比以前认为的更强，并且提示敏感性可能更多是评估的产物，而不是模型中的缺陷。

Title: Mic Drop or Data Flop? Evaluating the Fitness for Purpose of AI Voice Interviewers for Data Collection within Quantitative & Qualitative Research Contexts

Authors: Shreyas Tirumala, Nishant Jain, Danny D. Leybzon, Trent D. Buskirk
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2509.01814
Pdf URL: https://arxiv.org/pdf/2509.01814
Copy Paste: [[2509.01814]] Mic Drop or Data Flop? Evaluating the Fitness for Purpose of AI Voice Interviewers for Data Collection within Quantitative & Qualitative Research Contexts(https://arxiv.org/abs/2509.01814)
Keywords: language model, llm
Abstract: Transformer-based Large Language Models (LLMs) have paved the way for "AI interviewers" that can administer voice-based surveys with respondents in real-time. This position paper reviews emerging evidence to understand when such AI interviewing systems are fit for purpose for collecting data within quantitative and qualitative research contexts. We evaluate the capabilities of AI interviewers as well as current Interactive Voice Response (IVR) systems across two dimensions: input/output performance (i.e., speech recognition, answer recording, emotion handling) and verbal reasoning (i.e., ability to probe, clarify, and handle branching logic). Field studies suggest that AI interviewers already exceed IVR capabilities for both quantitative and qualitative data collection, but real-time transcription error rates, limited emotion detection abilities, and uneven follow-up quality indicate that the utility, use and adoption of current AI interviewer technology may be context-dependent for qualitative data collection efforts.
摘要：基于变形金刚的大型语言模型（LLM）为“ AI访调员”铺平了道路，该模型可以实时与受访者进行基于语音的调查。该立场论文回顾了新兴的证据，以了解何时适合在定量和定性研究环境中收集数据的目的。我们评估了跨两个维度的AI访调员的功能以及当前的交互式语音响应（IVR）系统：输入/输出性能（即语音识别，答案记录，情感处理）和口头推理（即能够进行探测，澄清，澄清和处理分支逻辑）。现场研究表明，AI访调员已经超过了定量和定性数据收集的IVR功能，但是实时转录错误率，有限的情绪检测能力以及不均匀的随访质量表明，当前AI访调员技术的实用性，使用和采用可能依赖于定性数据收集工作。

Title: Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning

Authors: Zhimeng Luo, Abhibha Gupta, Adam Frisch, Daqing He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.01885
Pdf URL: https://arxiv.org/pdf/2509.01885
Copy Paste: [[2509.01885]] Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning(https://arxiv.org/abs/2509.01885)
Keywords: language model, llm
Abstract: The extraction of critical patient information from Electronic Health Records (EHRs) poses significant challenges due to the complexity and unstructured nature of the data. Traditional machine learning approaches often fail to capture pertinent details efficiently, making it difficult for clinicians to utilize these tools effectively in patient care. This paper introduces a novel approach to extracting the OPQRST assessment from EHRs by leveraging the capabilities of Large Language Models (LLMs). We propose to reframe the task from sequence labeling to text generation, enabling the models to provide reasoning steps that mimic a physician's cognitive processes. This approach enhances interpretability and adapts to the limited availability of labeled data in healthcare settings. Furthermore, we address the challenge of evaluating the accuracy of machine-generated text in clinical contexts by proposing a modification to traditional Named Entity Recognition (NER) metrics. This includes the integration of semantic similarity measures, such as the BERT Score, to assess the alignment between generated text and the clinical intent of the original records. Our contributions demonstrate a significant advancement in the use of AI in healthcare, offering a scalable solution that improves the accuracy and usability of information extraction from EHRs, thereby aiding clinicians in making more informed decisions and enhancing patient care outcomes.
摘要：由于数据的复杂性和非结构化性质，从电子健康记录（EHR）中提取关键患者信息（EHRS）提出了重大挑战。传统的机器学习方法通常无法有效捕获相关细节，因此临床医生很难有效地利用这些工具。本文通过利用大语模型（LLMS）的能力来从EHR中提取OPQRST评估的新颖方法。我们建议将任务从序列标记到文本生成，使模型能够提供模仿医师的认知过程的推理步骤。这种方法可增强可解释性，并适应医疗保健设置中标记数据的有限可用性。此外，我们通过提出对传统命名实体识别（NER）指标的修改来评估机器生成文本的准确性的挑战。这包括集成语义相似性度量，例如BERT评分，以评估生成的文本与原始记录的临床意图之间的比对。我们的贡献表明，在医疗保健中使用AI，提供了可扩展的解决方案，可提高EHR信息的准确性和可用性，从而帮助临床医生做出更明智的决定并增强患者护理结果。

Title: DRAssist: Dispute Resolution Assistance using Large Language Models

Authors: Sachin Pawar, Manoj Apte, Girish K. Palshikar, Basit Ali, Nitin Ramrakhiyani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.01962
Pdf URL: https://arxiv.org/pdf/2509.01962
Copy Paste: [[2509.01962]] DRAssist: Dispute Resolution Assistance using Large Language Models(https://arxiv.org/abs/2509.01962)
Keywords: language model, llm, prompt
Abstract: Disputes between two parties occur in almost all domains such as taxation, insurance, banking, healthcare, etc. The disputes are generally resolved in a specific forum (e.g., consumer court) where facts are presented, points of disagreement are discussed, arguments as well as specific demands of the parties are heard, and finally a human judge resolves the dispute, often by favouring one of the two parties. In this paper, we explore the use of large language models (LLMs) as assistants for the human judge to resolve such disputes, as part of our DRAssist system. We focus on disputes from two specific domains -- automobile insurance and domain name disputes. DRAssist identifies certain key structural elements (e.g., facts, aspects of disagreement, arguments) of the disputes and summarizes the unstructured dispute descriptions to produce a structured summary for each dispute. We then explore multiple prompting strategies with multiple LLMs for their ability to assist in resolving the disputes in these domains. In DRAssist, these LLMs are prompted to produce the resolution output at three different levels -- (i) identifying an overall stronger party in a dispute, (ii) deciding whether each specific demand of each contesting party can be accepted or not, (iii) evaluating whether each argument by each contesting party is strong or weak. We evaluate the performance of LLMs on all these tasks by comparing them with relevant baselines using suitable evaluation metrics.
摘要：两党之间的争议在几乎所有领域中发生，例如税收，保险，银行业，医疗保健等。争议通常是在提出事实的特定论坛（例如，消费者法院）中解决的，讨论了分歧，争论以及对政党的特定要求，最后是人类法官通过经常通过两个部分来解决争端的问题。在本文中，我们探讨了大型语言模型（LLM）作为人类法官解决此类争议的助手，这是我们的德拉斯主义系统的一部分。我们专注于来自两个特定领域的争议 - 汽车保险和域名争议。 Drassist确定了争议的某些关键结构要素（例如事实，方面或分歧，论点），并总结了非结构化的争议描述，以为每个争端产生结构化摘要。然后，我们探索具有多个LLM的多个提示策略，以帮助他们协助解决这些领域中的争议。在Drassist中，这些LLM被提示以三个不同的级别产生分辨率输出 - （i）确定争议中总体强大的政党，（ii）确定每个竞争方的每个特定需求是否可以接受，（iii）评估每个竞争方的每个参数是强的还是弱的。我们通过使用合适的评估指标将LLM与相关基线进行比较，评估LLM在所有这些任务上的性能。

Title: StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching

Authors: Chao Xue, Ziyuan Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02033
Pdf URL: https://arxiv.org/pdf/2509.02033
Copy Paste: [[2509.02033]] StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching(https://arxiv.org/abs/2509.02033)
Keywords: language model
Abstract: Text semantic matching requires nuanced understanding of both structural relationships and fine-grained semantic distinctions. While pre-trained language models excel at capturing token-level interactions, they often overlook hierarchical structural patterns and struggle with subtle semantic discrimination. In this paper, we propose StructCoh, a graph-enhanced contrastive learning framework that synergistically combines structural reasoning with representation space optimization. Our approach features two key innovations: (1) A dual-graph encoder constructs semantic graphs via dependency parsing and topic modeling, then employs graph isomorphism networks to propagate structural features across syntactic dependencies and cross-document concept nodes. (2) A hierarchical contrastive objective enforces consistency at multiple granularities: node-level contrastive regularization preserves core semantic units, while graph-aware contrastive learning aligns inter-document structural semantics through both explicit and implicit negative sampling strategies. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements over state-of-the-art methods. Notably, StructCoh achieves 86.7% F1-score (+6.2% absolute gain) on legal statute matching by effectively identifying argument structure similarities.

Title: DeepSeek performs better than other Large Language Models in Dental Cases

Authors: Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02036
Pdf URL: https://arxiv.org/pdf/2509.02036
Copy Paste: [[2509.02036]] DeepSeek performs better than other Large Language Models in Dental Cases(https://arxiv.org/abs/2509.02036)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) hold transformative potential in healthcare, yet their capacity to interpret longitudinal patient narratives remains inadequately explored. Dentistry, with its rich repository of structured clinical data, presents a unique opportunity to rigorously assess LLMs' reasoning abilities. While several commercial LLMs already exist, DeepSeek, a model that gained significant attention earlier this year, has also joined the competition. This study evaluated four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on their ability to analyze longitudinal dental case vignettes through open-ended clinical tasks. Using 34 standardized longitudinal periodontal cases (comprising 258 question-answer pairs), we assessed model performance via automated metrics and blinded evaluations by licensed dentists. DeepSeek emerged as the top performer, demonstrating superior faithfulness (median score = 0.528 vs. 0.367-0.457) and higher expert ratings (median = 4.5/5 vs. 4.0/5), without significantly compromising readability. Our study positions DeepSeek as the leading LLM for case analysis, endorses its integration as an adjunct tool in both medical education and research, and highlights its potential as a domain-specific agent.

Title: Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Authors: Guangzeng Han, Weisi Liu, Xiaolei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02040
Pdf URL: https://arxiv.org/pdf/2509.02040
Copy Paste: [[2509.02040]] Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation(https://arxiv.org/abs/2509.02040)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.

Title: How Instruction-Tuning Imparts Length Control: A Cross-Lingual Mechanistic Analysis

Authors: Elisabetta Rocchetti, Alfio Ferrara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02075
Pdf URL: https://arxiv.org/pdf/2509.02075
Copy Paste: [[2509.02075]] How Instruction-Tuning Imparts Length Control: A Cross-Lingual Mechanistic Analysis(https://arxiv.org/abs/2509.02075)
Keywords: language model, llm
Abstract: Adhering to explicit length constraints, such as generating text with a precise word count, remains a significant challenge for Large Language Models (LLMs). This study investigates the differences between foundation models and their instruction-tuned (IT) counterparts on length-controlled text generation in English and Italian. We analyze both performance and internal component contributions using Cumulative Weighted Attribution, a metric derived from Direct Logit Attribution. Our findings reveal that instruction-tuning substantially improves length control, primarily by specializing components in deeper model layers. Specifically, attention heads in later layers of IT models show increasingly positive contributions, particularly in English. In Italian, while attention contributions are more attenuated, final-layer MLPs exhibit a stronger positive role, suggesting a compensatory mechanism. These results indicate that instruction-tuning reconfigures later layers for task adherence, with component-level strategies potentially adapting to linguistic context.
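As a rough illustration of the Direct Logit Attribution idea that the abstract above builds on: the final logit is approximately linear in the residual stream, so each component's output vector can be projected onto the unembedding direction of a target token to read off its direct contribution. The sketch below is a toy under stated assumptions: the component names and 2-dimensional vectors are made up, the final layer normalization is ignored, and it does not implement the paper's Cumulative Weighted Attribution metric.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def direct_logit_attribution(component_outputs, unembed_direction):
    """Project each component's output onto the target token's unembedding direction."""
    return {name: dot(vec, unembed_direction) for name, vec in component_outputs.items()}


# Hypothetical outputs that three components write into the residual stream.
components = {
    "attn_L1": [0.5, 0.0],
    "mlp_L1": [0.0, 1.0],
    "attn_L2": [1.0, 0.5],
}
# Hypothetical unembedding direction for the target token.
direction = [2.0, 1.0]

contributions = direct_logit_attribution(components, direction)

# By linearity, the per-component contributions add up to the logit
# computed from the summed residual stream.
residual = [sum(vec[i] for vec in components.values()) for i in range(len(direction))]
total_logit = dot(residual, direction)
```

Comparing such per-component contributions between a foundation model and its instruction-tuned counterpart is the kind of analysis the abstract describes, although the exact weighting used there is not reproduced here.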

Title: Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

Authors: Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.02093
Pdf URL: https://arxiv.org/pdf/2509.02093
Copy Paste: [[2509.02093]] Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization(https://arxiv.org/abs/2509.02093)
Keywords: language model, llm, prompt
Abstract: Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval augmented reasoning process. Our approach retrieves top k reference prompts from the HelpSteer2 dataset, an open-source collection annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high, medium, and low quality prompts to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best prompts along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

Title: JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer

Authors: Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Yuanzhuo Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02097
Pdf URL: https://arxiv.org/pdf/2509.02097
Copy Paste: [[2509.02097]] JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer(https://arxiv.org/abs/2509.02097)
Keywords: language model, llm, agent
Abstract: Evaluating the capabilities of large language models (LLMs) is an essential step to ensure the successful application of LLMs across various domains. The current evaluation of LLMs is based on a paradigm that involves querying them with predefined question sets and assessing their outputs. This paradigm offers controllable processes and simplicity, but faces challenges such as limited interaction with targets, insufficient difficulty control, and difficulties in verifying the validity of evaluation results, making it hard to precisely determine the knowledge and capability boundaries of target models. To address these challenges, we propose JudgeAgent, a knowledge-target adaptive dynamic evaluation framework based on a new interviewer-style evaluation paradigm. JudgeAgent employs a comprehensive evaluation approach consisting of benchmark grading, interactive extension, and evaluation feedback. It utilizes knowledge-driven data synthesis and target-adaptive difficulty adjustment methods to conduct extended testing, providing accurate and effective evaluation results. We also introduce a novel insight into validating evaluation methods, demonstrating the effectiveness of JudgeAgent and its dynamic evaluation paradigm through extensive experiments.

Title: CMRAG: Co-modality-based document retrieval and visual question answering

Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02123
Pdf URL: https://arxiv.org/pdf/2509.02123
Copy Paste: [[2509.02123]] CMRAG: Co-modality-based document retrieval and visual question answering(https://arxiv.org/abs/2509.02123)
Keywords: language model, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggles to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal generation results. This paper proposes co-modality-based RAG (CMRAG), which can simultaneously leverage text and images for efficient retrieval and generation. Specifically, we first perform structured parsing on documents to obtain co-modality representations of text segments and image regions. Subsequently, in response to user queries, we retrieve candidate evidence from text and image channels, respectively, and aggregate the results at the cross-modal retrieval level. Finally, we prompt the VLM to generate the final response based on the co-modality retrieval results. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex document visual question-answering (VQA) systems.
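The abstract above says that candidates are retrieved from a text channel and an image channel and then aggregated, but it does not say how. One common way to combine two retrieval channels is to min-max normalize each channel's scores and take a weighted sum; the sketch below shows that generic scheme only. The function names, the page identifiers, the scores, and the `text_weight` parameter are all assumptions for illustration, not the CMRAG aggregation rule.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_channels(text_scores, image_scores, text_weight=0.5):
    """Rank documents by a weighted sum of normalized per-channel scores."""
    text_norm, image_norm = min_max(text_scores), min_max(image_scores)
    docs = set(text_norm) | set(image_norm)
    fused = {
        doc: text_weight * text_norm.get(doc, 0.0)
        + (1.0 - text_weight) * image_norm.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Hypothetical similarity scores from each channel.
text_hits = {"p1": 0.9, "p2": 0.1, "p3": 0.5}
image_hits = {"p2": 0.8, "p3": 0.7, "p4": 0.2}
ranking = fuse_channels(text_hits, image_hits)
```

A page that scores moderately in both channels (here `p3`) can outrank a page that scores highly in only one, which is the intuition behind using both modalities.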

Title: AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models

Authors: Snehasis Mukhopadhyay, Aryan Kasat, Shivam Dubey, Rahul Karthikeyan, Dhruv Sood, Vinija Jain, Aman Chadha, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02133
Pdf URL: https://arxiv.org/pdf/2509.02133
Copy Paste: [[2509.02133]] AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models(https://arxiv.org/abs/2509.02133)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can inadvertently reflect societal biases present in their training data, leading to harmful or prejudiced outputs. In the Indian context, our empirical evaluations across a suite of models reveal that biases around caste and religion are particularly salient. Yet, most existing mitigation strategies are Western-centric and fail to address these local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide LLM outputs toward fairness, neutrality, and inclusion in line with Articles 14 to 17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the AI Constitution of India and applied only at inference time, without any parameter updates to the base model. We incorporate a speculative decoding algorithm that proactively reduces casteist and communal bias during generation. This mitigation layer operates directly within the decoding process, avoiding changes to model internals and lowering the computational and infrastructural costs associated with retraining. We reinterpret speculative decoding not merely as an efficiency tool but as a mechanism for fairness. In this framework, a Small Language Model (SLM) acts as a potentially biased generator, while a constitutionally guided Large Language Model (LLM) serves as the verifier. Rather than accelerating generation, the LLM enforces bias-robust trajectories in the SLM outputs. This inversion of roles gives rise to a fairness-by-speculation paradigm. Our approach yields an absolute reduction of bias up to 26.41 percent compared to baseline. Our source code, datasets, and results are available at this https URL

Title: Avoidance Decoding for Diverse Multi-Branch Story Generation

Authors: Kyeongman Park, Nakyeong Yang, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02170
Pdf URL: https://arxiv.org/pdf/2509.02170
Copy Paste: [[2509.02170]] Avoidance Decoding for Diverse Multi-Branch Story Generation(https://arxiv.org/abs/2509.02170)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often generate repetitive and monotonous outputs, especially in tasks like story generation, due to limited creative diversity when given the same input prompt. To address this challenge, we propose a novel decoding strategy, Avoidance Decoding, that modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse multi-branch stories. This penalty adaptively balances two similarity measures: (1) Concept-level Similarity Penalty, which is prioritized in early stages to diversify initial story concepts, and (2) Narrative-level Similarity Penalty, which is increasingly emphasized later to ensure natural yet diverse plot development. Notably, our method achieves up to 2.6 times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Furthermore, we reveal that our method activates a broader range of neurons, demonstrating that it leverages the model's intrinsic creativity.
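To make the logit-penalty idea in the abstract above concrete, here is a minimal sketch in which each candidate token's logit is reduced by a mix of two similarity scores, with the mix shifting from the concept-level score early in generation to the narrative-level score later. The linear schedule, the `strength` parameter, and the assumption that both similarity scores are already available per candidate token are illustrative choices, not the formulation used in the paper.

```python
def avoidance_adjusted_logits(logits, concept_sim, narrative_sim,
                              step, total_steps, strength=1.0):
    """Penalize candidate-token logits by similarity to earlier outputs.

    concept_sim and narrative_sim hold one similarity score per candidate
    token (assumed to be computed elsewhere). The concept-level penalty
    dominates at the first step and the narrative-level penalty at the last.
    """
    progress = step / max(total_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    w_concept, w_narrative = 1.0 - progress, progress
    return [
        logit - strength * (w_concept * c + w_narrative * n)
        for logit, c, n in zip(logits, concept_sim, narrative_sim)
    ]


# Two hypothetical candidate tokens with made-up similarity scores.
early = avoidance_adjusted_logits([2.0, 1.0], [1.0, 0.0], [0.0, 1.0], step=0, total_steps=5)
late = avoidance_adjusted_logits([2.0, 1.0], [1.0, 0.0], [0.0, 1.0], step=4, total_steps=5)
```

In this toy run the first token is penalized early, because it repeats an earlier concept, and the second is penalized late, because it repeats earlier narrative content.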

Title: FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain

Authors: Anum Afzal, Juraj Vladika, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02198
Pdf URL: https://arxiv.org/pdf/2509.02198
Copy Paste: [[2509.02198]] FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain(https://arxiv.org/abs/2509.02198)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Large Language Models tend to struggle when dealing with specialized domains. While all aspects of evaluation hold importance, factuality is the most critical one. Similarly, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing a comprehensive fact-checking benchmark, FActBench, covering four generation tasks and six state-of-the-art Large Language Models (LLMs) for the medical domain. We use two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) Prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores acquired through the Unanimous Voting of both techniques correlate best with Domain Expert Evaluation.
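One plausible reading of the "Unanimous Voting" mentioned in the abstract above is that a generated fact counts as supported only when both checkers, CoT prompting and NLI, agree that it is supported. The sketch below implements that reading. The function name, the boolean verdict format, and the example verdicts are assumptions, since the abstract does not spell out the scoring rule.

```python
def unanimous_factuality_score(cot_verdicts, nli_verdicts):
    """Fraction of facts that both checkers mark as supported.

    Each argument is a list of booleans, one per extracted fact, where
    True means the checker judged the fact to be supported.
    """
    if len(cot_verdicts) != len(nli_verdicts):
        raise ValueError("both checkers must judge the same facts")
    if not cot_verdicts:
        return 0.0
    agreed = sum(1 for c, n in zip(cot_verdicts, nli_verdicts) if c and n)
    return agreed / len(cot_verdicts)


# Hypothetical verdicts for four facts extracted from one generated text.
score = unanimous_factuality_score(
    cot_verdicts=[True, True, False, True],
    nli_verdicts=[True, False, False, True],
)
```

Requiring agreement makes the score stricter than either checker alone: in this example, the second fact is rejected because only one checker supports it.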

Title: Towards Fundamental Language Models: Does Linguistic Competence Scale with Model Size?

Authors: Jaime Collado-Montañez, L. Alfonso Ureña-López, Arturo Montejo-Ráez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02225
Pdf URL: https://arxiv.org/pdf/2509.02225
Copy Paste: [[2509.02225]] Towards Fundamental Language Models: Does Linguistic Competence Scale with Model Size?(https://arxiv.org/abs/2509.02225)
Keywords: language model, hallucination
Abstract: Large Language Models offer impressive language capabilities but suffer from well-known limitations, including hallucinations, biases, privacy concerns, and high computational costs. These issues are largely driven by the combination of linguistic competence and factual memorization within a single monolithic model. This paper introduces and empirically supports the Fundamental Language Model (FLM) paradigm, which advocates for smaller, linguistically competent models that offload factual retrieval to external tools. We evaluate models ranging from 135M to 32B parameters across three dimensions: linguistic competence, external factual knowledge, and internal factual knowledge. Our findings reveal that while both linguistic competence and factual knowledge improve with scale, internal factual knowledge grows significantly faster, suggesting that model size is more closely tied to memorization than to core language ability. These results support a modular approach to language modeling, where compact, linguistically proficient models serve as the foundation for tool-augmented systems. The FLM paradigm offers a path toward more efficient, interpretable, and sustainable NLP solutions.

Title: LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue

Authors: Katharine Kowalyshyn, Matthias Scheutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02292
Pdf URL: https://arxiv.org/pdf/2509.02292
Copy Paste: [[2509.02292]] LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue(https://arxiv.org/abs/2509.02292)
Keywords: language model, llm
Abstract: What if large language models could not only infer human mindsets but also expose every blind spot in team dialogue such as discrepancies in the team members' joint understanding? We present a novel, two-step framework that leverages large language models (LLMs) both as human-style annotators of team dialogues to track the team's shared mental models (SMMs) and as automated discrepancy detectors among individuals' mental states. In the first step, an LLM generates annotations by identifying SMM elements within task-oriented dialogues from the Cooperative Remote Search Task (CReST) corpus. Then, a secondary LLM compares these LLM-derived annotations and human annotations against gold-standard labels to detect and characterize divergences. We define an SMM coherence evaluation framework for this use case and apply it to six CReST dialogues, ultimately producing: (1) a dataset of human and LLM annotations; (2) a reproducible evaluation framework for SMM coherence; and (3) an empirical assessment of LLM-based discrepancy detection. Our results reveal that, although LLMs exhibit apparent coherence on straightforward natural-language annotation tasks, they systematically err in scenarios requiring spatial reasoning or disambiguation of prosodic cues.

Title: DCPO: Dynamic Clipping Policy Optimization

Authors: Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02333
Pdf URL: https://arxiv.org/pdf/2509.02333
Copy Paste: [[2509.02333]] DCPO: Dynamic Clipping Policy Optimization(https://arxiv.org/abs/2509.02333)
Keywords: language model
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
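The zero-gradient issue that the abstract above attributes to "the standardization of identical rewards" can be seen with a few lines of arithmetic. In GRPO-style training, each response's advantage is its reward standardized within the group of responses sampled for the same prompt, so a group whose rewards are all equal yields all-zero advantages and contributes no gradient. The sketch below shows only this baseline behaviour. The function name and the `eps` constant are assumptions, and DCPO's smooth advantage standardization and dynamic clipping are not implemented here.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes gives non-zero advantages.
mixed = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward gives all-zero
# advantages, so none of its responses contribute to the update.
identical = group_standardized_advantages([1.0, 1.0, 1.0, 1.0])
```

Groups like `identical` become more common as a policy gets consistently right or consistently wrong on a prompt, which is why the abstract describes the generated responses as underutilized.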

Title: Implicit Reasoning in Large Language Models: A Comprehensive Survey

Authors: Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, Rex Ying
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02350
Pdf URL: https://arxiv.org/pdf/2509.02350
Copy Paste: [[2509.02350]] Implicit Reasoning in Large Language Models: A Comprehensive Survey(https://arxiv.org/abs/2509.02350)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Reasoning with LLMs is central to solving multi-step problems and complex decision-making. To support efficient reasoning, recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning, where reasoning occurs silently via latent structures without emitting intermediate textual steps. Implicit reasoning brings advantages such as lower generation cost, faster inference, and better alignment with internal computation. Although prior surveys have discussed latent representations in the context of reasoning, a dedicated and mechanism-level examination of how reasoning unfolds internally within LLMs remains absent. This survey fills that gap by introducing a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies. We organize existing methods into three execution paradigms based on how and where internal computation unfolds: latent optimization, signal-guided control, and layer-recurrent execution. We also review structural, behavioral and representation-based evidence that supports the presence of implicit reasoning in LLMs. We further provide a structured overview of the evaluation metrics and benchmarks used in existing works to assess the effectiveness and reliability of implicit reasoning. We maintain a continuously updated project at: this https URL.

Title: Towards Temporal Knowledge-Base Creation for Fine-Grained Opinion Analysis with Language Models

Authors: Gaurav Negi, Atul Kr. Ojha, Omnia Zayed, Paul Buitelaar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02363
Pdf URL: https://arxiv.org/pdf/2509.02363
Copy Paste: [[2509.02363]] Towards Temporal Knowledge-Base Creation for Fine-Grained Opinion Analysis with Language Models(https://arxiv.org/abs/2509.02363)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: We propose a scalable method for constructing a temporal opinion knowledge base with large language models (LLMs) as automated annotators. Despite the demonstrated utility of time-series opinion analysis of text for downstream applications such as forecasting and trend analysis, existing methodologies underexploit this potential due to the absence of temporally grounded fine-grained annotations. Our approach addresses this gap by integrating well-established opinion mining formulations into a declarative LLM annotation pipeline, enabling structured opinion extraction without manual prompt engineering. We define three data models grounded in sentiment and opinion mining literature, serving as schemas for structured representation. We perform rigorous quantitative evaluation of our pipeline using human-annotated test samples. We carry out the final annotations using two separate LLMs, and inter-annotator agreement is computed label-wise across the fine-grained opinion dimensions, analogous to human annotation protocols. The resulting knowledge base encapsulates time-aligned, structured opinions and is compatible with applications in Retrieval-Augmented Generation (RAG), temporal question answering, and timeline summarisation.

Title: An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction

Authors: Ali Hamdi, Malak Mohamed, Rokaia Emad, Khaled Shaban
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02446
Pdf URL: https://arxiv.org/pdf/2509.02446
Copy Paste: [[2509.02446]] An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction(https://arxiv.org/abs/2509.02446)
Keywords: language model, gpt, llm
Abstract: Social telehealth has made remarkable progress in healthcare by allowing patients to post symptoms and participate in medical consultations remotely. Users frequently post symptoms on social media and online health platforms, creating a huge repository of medical data that can be leveraged for disease classification. Large language models (LLMs) such as LLAMA3 and GPT-3.5, along with transformer-based models like BERT, have demonstrated strong capabilities in processing complex medical text. In this study, we evaluate three Arabic medical text preprocessing methods, namely summarization, refinement, and Named Entity Recognition (NER), before applying fine-tuned Arabic transformer models (CAMeLBERT, AraBERT, and AsafayaBERT). To enhance robustness, we adopt a majority voting ensemble that combines predictions from original and preprocessed text representations. This approach achieved the best classification accuracy of 80.56%, showing its effectiveness in leveraging various text representations and model predictions to improve the understanding of medical texts. To the best of our knowledge, this is the first work that integrates LLM-based preprocessing with fine-tuned Arabic transformer models and ensemble learning for disease classification in Arabic social telehealth data.
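Abstract (translated from the Chinese): Social telehealth has brought remarkable progress to healthcare by letting patients post symptoms and take part in medical consultations remotely. Users frequently post symptoms on social media and online health platforms, creating a large repository of medical data that can be used for disease classification. Large language models (LLMs) such as LLAMA3 and GPT-3.5, together with transformer-based models such as BERT, have shown strong capabilities in processing complex medical text. In this study, we evaluate three Arabic medical text preprocessing methods (summarization, refinement, and Named Entity Recognition) before applying fine-tuned Arabic transformer models (CAMeLBERT, AraBERT, and AsafayaBERT). To improve robustness, we adopt a majority voting ensemble that combines predictions from the original and preprocessed text representations. This approach reached the best classification accuracy of 80.56%, demonstrating the value of combining multiple text representations and model predictions for understanding medical text. To the best of our knowledge, this is the first work to integrate LLM-based preprocessing with fine-tuned Arabic transformer models and ensemble learning for disease classification in Arabic social telehealth data.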
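
The majority voting step over original and preprocessed text representations could look roughly like the sketch below. It assumes each representation yields one predicted label per post; the representation names, disease labels, and tie-breaking rule (first label seen wins) are hypothetical and not taken from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the label that appears first."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

# Hypothetical predictions for one post, one per text representation.
votes = {
    "original": "influenza",
    "summarized": "influenza",
    "refined": "common cold",
    "ner": "influenza",
}
final_label = majority_vote(list(votes.values()))
```

A fixed tie-breaking rule keeps the ensemble deterministic when an even number of classifiers split their votes.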

Title: Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions

Authors: Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru, Francis Ferraro, Manas Gaur
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02452
Pdf URL: https://arxiv.org/pdf/2509.02452
Copy Paste: [[2509.02452]] Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions(https://arxiv.org/abs/2509.02452)
Keywords: llm
Abstract: Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM's task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.
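Abstract (translated from the Chinese): Do LLMs genuinely incorporate external definitions, or do they rely mainly on their parametric knowledge? To answer these questions, we run controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results show that although explicit label definitions can improve accuracy and explainability, their integration into an LLM's task-solving process is neither guaranteed nor consistent, which suggests reliance on internalized representations in many cases. Models often default to their internal representations, especially in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.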

Title: SpecEval: Evaluating Model Adherence to Behavior Specifications

Authors: Ahmed Ahmed, Kevin Klyman, Yi Zeng, Sanmi Koyejo, Percy Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02464
Pdf URL: https://arxiv.org/pdf/2509.02464
Copy Paste: [[2509.02464]] SpecEval: Evaluating Model Adherence to Behavior Specifications(https://arxiv.org/abs/2509.02464)
Keywords: prompt
Abstract: Companies that develop foundation models publish behavioral guidelines they pledge their models will follow, but it remains unclear if models actually do so. While providers such as OpenAI, Anthropic, and Google have published detailed specifications describing both desired safety constraints and qualitative traits for their models, there has been no systematic audit of adherence to these guidelines. We introduce an automated framework that audits models against their providers' specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is on three-way consistency between a provider's specification, its model outputs, and its own models as judges, an extension of prior two-way generator-validator consistency. This establishes a necessary baseline: at minimum, a foundation model should consistently satisfy its developer's behavioral specifications when judged by the developer's evaluator models. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies including compliance gaps of up to 20 percent across providers.
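Abstract (translated from the Chinese): Companies that develop foundation models publish behavioral guidelines that they pledge their models will follow, but it is still unclear whether the models actually do so. Although providers such as OpenAI, Anthropic, and Google have published detailed specifications describing the desired safety constraints and qualitative traits of their models, adherence to these guidelines has not been systematically audited. We introduce an automated framework that audits models against their providers' specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is three-way consistency between a provider's specification, its model outputs, and its own models acting as judges, which extends earlier two-way generator-validator consistency. This establishes a necessary baseline: at minimum, a foundation model should consistently satisfy its developer's behavioral specifications when judged by the developer's own evaluator models. We apply the framework to 16 models from six developers across more than 100 behavioral statements and find systematic inconsistencies, including compliance gaps of up to 20 percent across providers.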

Title: MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

Authors: Junxi Wu, Jinpeng Wang, Zheng Liu, Bin Chen, Dongjian Hu, Hao Wu, Shu-Tao Xiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02499
Pdf URL: https://arxiv.org/pdf/2509.02499
Copy Paste: [[2509.02499]] MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds(https://arxiv.org/abs/2509.02499)
Keywords: language model
Abstract: The rapid advancement of large language models has intensified public concerns about their potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework, which enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, SAR can activate the appropriate reference data in SRR and provide them to CTE. Subsequently, CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows a more evident improvement of 39.15% in the low-resource case. Our code is available at this https URL.
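Abstract (translated from the Chinese): The rapid development of large language models has heightened public concern about their potential misuse, so building trustworthy AI-generated text detection systems is important. Existing methods neglect stylistic modeling and rely mostly on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework, which enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs has three core components: the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For an input text, the router (SAR) activates the appropriate reference data in SRR and passes them to CTE. CTE then jointly models linguistic statistical properties and semantic features to determine the optimal threshold dynamically. Given a discrimination score, MoSEs produces a predicted label with a corresponding confidence level. Compared with baselines, our framework improves detection performance by 11.34% on average, and the improvement is more pronounced, at 39.15%, in the low-resource case. Our code is available at this https URL.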

Title: L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages

Authors: Nishant Tanksale, Tanmay Kokate, Darshan Gohad, Sarvadnyaa Barate, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02503
Pdf URL: https://arxiv.org/pdf/2509.02503
Copy Paste: [[2509.02503]] L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages(https://arxiv.org/abs/2509.02503)
Keywords: llm, retrieval-augmented generation
Abstract: Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages (Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, and Bengali) and English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at this https URL.
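Abstract (translated from the Chinese): Semantic evaluation in low-resource languages remains a major challenge in NLP. Sentence transformers perform strongly in high-resource settings, but their effectiveness in Indic languages is underexplored because high-quality benchmarks are lacking. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset covering ten low-resource Indic languages (Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, and Bengali) and English. Each language contains 20,000 news articles, each paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task is to select the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. The results show that multilingual models perform consistently well, while language-specific models vary in effectiveness. Given the growing use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, the dataset is also a valuable resource for evaluating and improving semantic understanding in such applications. In addition, it can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at this https URL.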
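
The headline identification task, selecting one of four candidates by article-headline cosine similarity, can be sketched as follows. The vectors are toy stand-ins for sentence-transformer embeddings and the function names are hypothetical; the benchmark itself would obtain embeddings from the models under evaluation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_headline(article_vec, headline_vecs):
    """Return the index of the candidate headline most similar to the article."""
    scores = [cosine_similarity(article_vec, h) for h in headline_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings: original, semantically similar, lexically similar, unrelated.
article = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.0, 0.9],
    [0.9, 0.2, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]
chosen = pick_headline(article, candidates)
```

Accuracy on the benchmark would then be the fraction of articles for which the chosen index matches the original headline.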

Title: Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Authors: Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.02510
Pdf URL: https://arxiv.org/pdf/2509.02510
Copy Paste: [[2509.02510]] Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation(https://arxiv.org/abs/2509.02510)
Keywords: language model, llm
Abstract: Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-$p$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-$p$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at this https URL.
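Abstract (translated from the Chinese): Despite their impressive performance across a wide range of tasks, large language models (LLMs) often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim to manage this trade-off, but they have limitations, particularly in how effectively they incorporate the model's confidence into the sampling strategy. For example, min-$p$ sampling relies on a single top token as a heuristic for confidence and so underuses the information in the probability distribution. To incorporate model confidence effectively, we present **top-H** decoding. We first establish a theoretical foundation for the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove that this minimization problem is equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm for the ECMM problem. Extensive empirical evaluation shows that top-H outperforms the state-of-the-art (SoTA) alternative, min-$p$ sampling, by up to **25.63%** on creative writing benchmarks while remaining robust on question-answering datasets such as GPQA, GSM8K, and MT-Bench. An *LLM-as-judge* evaluation further confirms that top-H produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances the SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at this https URL.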

Title: Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition

Authors: Mayur Shirke, Amey Shembade, Pavan Thorat, Madhushri Wagh, Raviraj Joshi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02514
Pdf URL: https://arxiv.org/pdf/2509.02514
Copy Paste: [[2509.02514]] Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition(https://arxiv.org/abs/2509.02514)
Keywords: language model, llm
Abstract: Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. This study conducts a comparative evaluation of code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingMBERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa and MuRIL (trained on non-code-mixed multilingual data). We also assess the performance of Google Gemini in a zero-shot setting using a modified version of the dataset with NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Results show that code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperform others - including closed-source LLMs like Google Gemini - due to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.
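Abstract (translated from the Chinese): Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), poses unique challenges because of informal structure, transliteration, and frequent language switching. This study compares code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingMBERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa, and MuRIL (trained on non-code-mixed multilingual data). We also assess Google Gemini in a zero-shot setting using a modified version of the dataset with the NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. The results show that code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperform the others, including closed-source LLMs such as Google Gemini, owing to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.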

Title: Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Authors: Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02522
Pdf URL: https://arxiv.org/pdf/2509.02522
Copy Paste: [[2509.02522]] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR(https://arxiv.org/abs/2509.02522)
Keywords: language model, llm
Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address the challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at this https URL.
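Abstract (translated from the Chinese): Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have enabled large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR uses verifiable outcome rewards to guide policy optimization, allowing LLMs to improve output quality progressively in a grounded and reliable way. Despite its promise, the RLVR paradigm poses significant challenges, because existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address these challenges, we propose PACS, a novel RLVR framework that achieves imPlicit Actor Critic coupling via a Supervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem as a supervised learning task over a score function parameterized by the policy model and optimized with cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling the actor and critic roles, which yields more stable and efficient training. On challenging mathematical reasoning benchmarks, PACS outperforms strong RLVR baselines such as PPO and GRPO. For instance, PACS reaches 59.78% at pass@256 on AIME 2025, an improvement of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLM post-training with verifiable rewards. Our code and data are available as open source at this https URL.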

Title: Jointly Reinforcing Diversity and Quality in Language Model Generations

Authors: Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.02534
Pdf URL: https://arxiv.org/pdf/2509.02534
Copy Paste: [[2509.02534]] Jointly Reinforcing Diversity and Quality in Language Model Generations(https://arxiv.org/abs/2509.02534)
Keywords: language model
Abstract: Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
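Abstract (translated from the Chinese): Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: post-training improves response quality, but it also sharpens output distributions and narrows the range of ideas, which limits the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variation. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs of both higher quality and higher novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which shows up as higher-quality responses.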
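
The pass@1 and pass@k metrics mentioned above are commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not state which estimator the authors use, so the sketch below is an assumption about the metric, not a description of their evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 generations, 3 of them verified correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 reads as solution quality, while larger k rewards a model for covering at least one correct solution among varied attempts.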

Title: PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture

Authors: Fakhraddin Alwajih, Abdellah El Mekki, Hamdy Mubarak, Majd Hawasly, Abubakr Mohamed, Muhammad Abdul-Mageed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02550
Pdf URL: https://arxiv.org/pdf/2509.02550
Copy Paste: [[2509.02550]] PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture(https://arxiv.org/abs/2509.02550)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent.
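Abstract (translated from the Chinese): Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during pre-training. Because this data comes predominantly from the web, it is likely to be skewed towards high-resourced languages and cultures, such as those of the West. As a result, LLMs often show a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures and that grows more pronounced for increasingly under-represented topics. To address this challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these domains. The task consists of two subtasks with multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. The subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, resulting in nine and six valid submissions, respectively. Our findings show that task-specific fine-tuning substantially improves performance over baseline models. The top-performing systems reached an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the usefulness of data augmentation was found to be domain-dependent.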