2026-01-09

Title: MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

Authors: Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04195
Pdf URL: https://arxiv.org/pdf/2601.04195
Copy Paste: [[2601.04195]] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions(https://arxiv.org/abs/2601.04195)
Keywords: language model, gpt, llm, prompt
Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models -- Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 -- across 366 AI Patients and 7,097 conversations using a standardized "vanilla clinician" prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.

Title: RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation

Authors: Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.04196
Pdf URL: https://arxiv.org/pdf/2601.04196
Copy Paste: [[2601.04196]] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation(https://arxiv.org/abs/2601.04196)
Keywords: retrieval-augmented generation, agent
Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.

Title: Automatic Construction of Chinese Verb Collostruction Database

Authors: Xuri Tang, Daohuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04197
Pdf URL: https://arxiv.org/pdf/2601.04197
Copy Paste: [[2601.04197]] Automatic Construction of Chinese Verb Collostruction Database(https://arxiv.org/abs/2601.04197)
Keywords: llm
Abstract: This paper proposes a fully unsupervised approach to constructing a verb collostruction database for the Chinese language, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from a large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.

Title: Attribute-Aware Controlled Product Generation with LLMs for E-commerce

Authors: Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04200
Pdf URL: https://arxiv.org/pdf/2601.04200
Copy Paste: [[2601.04200]] Attribute-Aware Controlled Product Generation with LLMs for E-commerce(https://arxiv.org/abs/2601.04200)
Keywords: language model, llm, prompt
Abstract: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.

Title: Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems

Authors: Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2601.04201
Pdf URL: https://arxiv.org/pdf/2601.04201
Copy Paste: [[2601.04201]] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems(https://arxiv.org/abs/2601.04201)
Keywords: language model, llm
Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.

Title: TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation

Authors: Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04202
Pdf URL: https://arxiv.org/pdf/2601.04202
Copy Paste: [[2601.04202]] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation(https://arxiv.org/abs/2601.04202)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards rely heavily on tables to present essential information, yet LLMs' knowledge of such tables and their ability to interpret them remain largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating limited exposure to telecom standards in their pretraining and insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.

Title: FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

Authors: Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen
Subjects: cs.CL, cs.CV, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2601.04203
Pdf URL: https://arxiv.org/pdf/2601.04203
Copy Paste: [[2601.04203]] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback(https://arxiv.org/abs/2601.04203)
Keywords: language model, agent
Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that leverages a web agent to simulate users and explore the website, thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that have not been systematically explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

Title: STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models

Authors: Xinhao Sun, Maoliang Li, Zihao Zheng, Jiayu Chen, Hezhao Xu, Yun Liang, Xiang Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04205
Pdf URL: https://arxiv.org/pdf/2601.04205
Copy Paste: [[2601.04205]] STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models(https://arxiv.org/abs/2601.04205)
Keywords: language model
Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
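Digest note: the contrast between a single global threshold and a per-token threshold can be illustrated with a toy sketch. This is not the authors' algorithm. The function names, the additive formula, and the `base`, `alpha`, `beta`, and `window` parameters are all assumptions made for illustration, and the abstract does not say how the two signals are actually defined or combined.

```python
from statistics import mean, pvariance


def adaptive_thresholds(conf_history, base=0.9, alpha=0.5, beta=0.5, window=1):
    """Return one confidence threshold per token position.

    conf_history is a list of decoding steps, each a list of per-token
    confidences. The additive rule below is a hypothetical stand-in for
    the paper's method.
    """
    latest = conf_history[-1]
    n = len(latest)
    thresholds = []
    for i in range(n):
        # Temporal signal: variance of this token's confidence across steps.
        temporal_var = pvariance([step[i] for step in conf_history])
        # Spatial signal: gap between this token and its neighbours' mean.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [latest[j] for j in range(lo, hi) if j != i]
        spatial_dev = abs(latest[i] - mean(neighbours)) if neighbours else 0.0
        # Unstable or outlying tokens get a stricter (higher) threshold.
        thresholds.append(min(1.0, base + alpha * temporal_var + beta * spatial_dev))
    return thresholds


def select_tokens_to_decode(conf_history, **kwargs):
    """Indices whose latest confidence meets their own threshold."""
    latest = conf_history[-1]
    thresholds = adaptive_thresholds(conf_history, **kwargs)
    return [i for i, (c, t) in enumerate(zip(latest, thresholds)) if c >= t]


# Two steps, three tokens: the middle token jumped from 0.5 to 0.9.
history = [[0.95, 0.5, 0.95], [0.95, 0.9, 0.95]]
fixed = [i for i, c in enumerate(history[-1]) if c >= 0.9]
adaptive = select_tokens_to_decode(history)
```

In this toy input a fixed threshold of 0.9 would decode all three tokens, while the per-token rule defers the middle token because its confidence is still moving.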

Title: Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation

Authors: Aram Virabyan
Subjects: cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2601.04206
Pdf URL: https://arxiv.org/pdf/2601.04206
Copy Paste: [[2601.04206]] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation(https://arxiv.org/abs/2601.04206)
Keywords: language model, retrieval-augmented generation
Abstract: University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.

Title: Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis

Authors: Wei Xia, Haowen Tang, Luozheng Li
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2601.04207
Pdf URL: https://arxiv.org/pdf/2601.04207
Copy Paste: [[2601.04207]] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis(https://arxiv.org/abs/2601.04207)
Keywords: llm
Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully aligned with human ideological space. This misalignment is systematic, model specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculated a bias score from its internal features and directly adjusted the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
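Digest note: the general idea of scoring a bias from internal features and shifting the output layer without retraining can be sketched as below. This is an illustration only and not the paper's implementation. The probe weights, the label set, the `direction` vector, and the `strength` parameter are hypothetical; the abstract does not specify how the score is computed or applied.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe_score(features, weights, bias=0.0):
    """Linear probe: a scalar bias score from hidden features (hypothetical weights)."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def steer_logits(logits, score, direction, strength=1.0):
    """Shift each label logit against the measured bias along `direction`."""
    return [l - strength * score * d for l, d in zip(logits, direction)]


labels = ["left", "neutral", "right"]   # assumed label set
logits = [2.0, 1.0, 0.5]                # model's original label logits
features = [1.0, 0.0]                   # stand-in for internal features
score = probe_score(features, weights=[1.0, 0.5])  # positive = leans "left"
steered = steer_logits(logits, score, direction=[1.0, 0.0, -1.0])
before = labels[max(range(3), key=lambda i: softmax(logits)[i])]
after = labels[max(range(3), key=lambda i: softmax(steered)[i])]
```

Only the final logits change, so the model weights stay untouched, which is the property the abstract emphasizes.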

Title: LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach

Authors: Xiang Cheng, Wen Wang, Anindya Ghose
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04208
Pdf URL: https://arxiv.org/pdf/2601.04208
Copy Paste: [[2601.04208]] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach(https://arxiv.org/abs/2601.04208)
Keywords: language model, llm
Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.

Title: Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

Authors: Seokhwan Ko, Donghyeon Lee, Jaewoo Chun, Hyungsoo Han, Junghwan Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04209
Pdf URL: https://arxiv.org/pdf/2601.04209
Copy Paste: [[2601.04209]] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments(https://arxiv.org/abs/2601.04209)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.

Title: Complexity Agnostic Recursive Decomposition of Thoughts

Authors: Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2601.04210
Pdf URL: https://arxiv.org/pdf/2601.04210
Copy Paste: [[2601.04210]] Complexity Agnostic Recursive Decomposition of Thoughts(https://arxiv.org/abs/2601.04210)
Keywords: language model
Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text, and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
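Digest note: the three budget tiers named in the abstract (1, 5-9, or 10 thoughts) can be pictured as a mapping from an estimated complexity to a per-step budget. The sketch below is a guess at the shape of such a mapping, not the paper's rule. The single scalar `complexity` in [0, 1], the cut-offs `easy_max` and `hard_min`, and the linear scaling inside the middle tier are all assumptions.

```python
def thought_budget(complexity, easy_max=0.3, hard_min=0.8):
    """Map a complexity estimate in [0, 1] to a thought budget.

    Tiers follow the abstract (1, 5-9, or 10 thoughts); the thresholds
    and the scaling inside the middle tier are hypothetical.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if complexity < easy_max:
        return 1                      # easy step: a single thought
    if complexity >= hard_min:
        return 10                     # hard step: the full budget
    # Medium step: scale linearly across the 5..9 range.
    frac = (complexity - easy_max) / (hard_min - easy_max)
    return 5 + min(4, int(frac * 5))


# Budgets for a hypothetical three-step decomposition.
step_complexities = [0.1, 0.55, 0.9]
budgets = [thought_budget(c) for c in step_complexities]
total_thoughts = sum(budgets)
```

Compared with giving every step the maximum of 10 thoughts, this toy allocation spends 18 thoughts instead of 30 on the three steps, which is the kind of saving an adaptive budget is meant to capture.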

Title: TrueBrief: Faithful Summarization through Small Language Models

Authors: Kumud Lakara, Ruibo Shi, Fran Silavong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04212
Pdf URL: https://arxiv.org/pdf/2601.04212
Copy Paste: [[2601.04212]] TrueBrief: Faithful Summarization through Small Language Models(https://arxiv.org/abs/2601.04212)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.

Title: AnimatedLLM: Explaining LLMs with Interactive Visualizations

Authors: Zdeněk Kasner, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04213
Pdf URL: https://arxiv.org/pdf/2601.04213
Copy Paste: [[2601.04213]] AnimatedLLM: Explaining LLMs with Interactive Visualizations(https://arxiv.org/abs/2601.04213)
Keywords: language model, llm
Abstract: Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at this https URL, both as a teaching aid and for self-educational purposes.

Title: From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

Authors: Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04278
Pdf URL: https://arxiv.org/pdf/2601.04278
Copy Paste: [[2601.04278]] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning(https://arxiv.org/abs/2601.04278)
Keywords: llm, prompt
Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}0.05$ while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.

Title: RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation

Authors: Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04350
Pdf URL: https://arxiv.org/pdf/2601.04350
Copy Paste: [[2601.04350]] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation(https://arxiv.org/abs/2601.04350)
Keywords: llm
Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

Title: Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Authors: Dongqi Liu, Hang Ding, Qiming Feng, Jian Li, Xurong Xie, Zhucun Xue, Chengjie Wang, Jiangning Zhang, Yabiao Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04377
Pdf URL: https://arxiv.org/pdf/2601.04377
Copy Paste: [[2601.04377]] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation(https://arxiv.org/abs/2601.04377)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

Title: MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

Authors: Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04389
Pdf URL: https://arxiv.org/pdf/2601.04389
Copy Paste: [[2601.04389]] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking(https://arxiv.org/abs/2601.04389)
Keywords: language model, llm, prompt
Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not instill a principle of non-discrimination but instead reinforce memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts on GitHub to encourage research into granular demographic alignment.
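
The per-group comparison described in the MiJaBench abstract can be illustrated with a minimal sketch. It assumes each record is a `(group, was_defended)` pair for a single model; the function names, group labels, and toy data are hypothetical and are not taken from the paper or its released scripts.

```python
from collections import defaultdict

def defense_rates(records):
    """Return {group: fraction of prompts the model defended against}."""
    totals = defaultdict(int)
    defended = defaultdict(int)
    for group, was_defended in records:
        totals[group] += 1
        defended[group] += int(was_defended)
    return {g: defended[g] / totals[g] for g in totals}

def max_gap(rates):
    """Largest difference in defense rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical toy labels for one model; NOT data from the paper.
records = (
    [("group_a", True)] * 9 + [("group_a", False)] * 1
    + [("group_b", True)] * 6 + [("group_b", False)] * 4
)
rates = defense_rates(records)
gap = max_gap(rates)

# Consistency check on the counts quoted in the abstract:
# 44,000 prompts evaluated on 12 models.
total_pairs = 44_000 * 12
```

A single aggregate score over these toy records would hide the gap between the two groups, which is the kind of masking the abstract describes.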

Title: ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Authors: Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04394
Pdf URL: https://arxiv.org/pdf/2601.04394
Copy Paste: [[2601.04394]] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models(https://arxiv.org/abs/2601.04394)
Keywords: language model, llm, hallucination
Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack the human-like ability to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space and should not be treated as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but, owing to adversarial training, is also more versatile than RLHF-aligned models in generating soft refusals. We make our codebase available at this https URL.

Title: Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Authors: Yao Dou, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04424
Pdf URL: https://arxiv.org/pdf/2601.04424
Copy Paste: [[2601.04424]] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization(https://arxiv.org/abs/2601.04424)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.

Title: Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs

Authors: Myra Cheng, Robert D. Hawkins, Dan Jurafsky
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.04435
Pdf URL: https://arxiv.org/pdf/2601.04435
Copy Paste: [[2601.04435]] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs(https://arxiv.org/abs/2601.04435)
Keywords: language model, llm
Abstract: Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.

Title: Learning to Simulate Human Dialogue

Authors: Kanishk Gandhi, Agam Bhatia, Noah D. Goodman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04436
Pdf URL: https://arxiv.org/pdf/2601.04436
Copy Paste: [[2601.04436]] Learning to Simulate Human Dialogue(https://arxiv.org/abs/2601.04436)
Keywords: llm, chain-of-thought
Abstract: To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded: either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training; however, it decreases the likelihood assigned to ground-truth human responses and decreases the win rate when human judges choose the more human-like response between a real and a synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
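
The abstract does not state the lower bound itself. One standard way such a bound can arise, shown here only as an illustrative sketch, is Jensen's inequality applied to the marginal over chains of thought. The notation is an assumption: $x$ is the dialogue context, $y$ the observed human response, $z$ the latent chain-of-thought, and $p_\theta$ the model. The paper's actual bound may use a different proposal distribution or extra terms.

```latex
% x: dialogue context, y: observed human response, z: latent chain-of-thought
\begin{align*}
\log p_\theta(y \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z) \\
  &= \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[ p_\theta(y \mid x, z) \big] \\
  &\ge \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\big[ \log p_\theta(y \mid x, z) \big]
\end{align*}
```

The last line follows from the concavity of the logarithm. Under this sketch, the right-hand side can be estimated by sampling chains of thought from the model and scoring the true human response conditioned on each sample.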

Title: Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

Authors: San Kim, Gary Geunbae Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04448
Pdf URL: https://arxiv.org/pdf/2601.04448
Copy Paste: [[2601.04448]] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models(https://arxiv.org/abs/2601.04448)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets (often collected from human or web sources) makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.

Title: Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

Authors: Chengyuan Yang, Zequn Sun, Wei Wei, Wei Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04463
Pdf URL: https://arxiv.org/pdf/2601.04463
Copy Paste: [[2601.04463]] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents(https://arxiv.org/abs/2601.04463)
Keywords: llm, agent
Abstract: Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is "ahead-of-time", acting as a blind "feed-forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one-off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.
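
The recurrent feedback loop described in the ProMem abstract can be sketched as a small control loop. Everything named here is hypothetical: `proactive_extract` and the `extract`, `ask`, and `answer` callables stand in for LLM calls and are not the paper's API. The sketch covers only the recovery of missing facts, not error correction.

```python
def proactive_extract(dialogue, extract, ask, answer, max_rounds=3):
    """Iteratively extend an initial memory by self-questioning the dialogue."""
    memory = list(extract(dialogue))           # one-off "feed-forward" pass
    for _ in range(max_rounds):
        added = False
        for question in ask(memory):           # self-questioning over current memory
            fact = answer(dialogue, question)  # probe the raw dialogue history
            if fact is not None and fact not in memory:
                memory.append(fact)            # recover a missing fact
                added = True
        if not added:                          # stop once no new facts are found
            break
    return memory

# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
dialogue = ["I moved to Oslo in May.", "My dog Rex hates rain."]

def extract(turns):
    return [turns[0]]                          # naive summary keeps only the first turn

def ask(memory):
    return [] if any("dog" in m for m in memory) else ["pet"]

def answer(turns, question):
    return next((t for t in turns if question == "pet" and "dog" in t), None)

memory = proactive_extract(dialogue, extract, ask, answer)
```

In this toy run the one-off pass drops the second turn, and the self-questioning round recovers it before the loop terminates.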

Title: Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

Authors: Ignacio Sastre, Aiala Rosá
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04465
Pdf URL: https://arxiv.org/pdf/2601.04465
Copy Paste: [[2601.04465]] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions(https://arxiv.org/abs/2601.04465)
Keywords: llm, hallucination
Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.

Title: SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

Authors: Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04469
Pdf URL: https://arxiv.org/pdf/2601.04469
Copy Paste: [[2601.04469]] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers(https://arxiv.org/abs/2601.04469)
Keywords: language model
Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues and is therefore suited to low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: this https URL

Title: WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Authors: Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2601.04508
Pdf URL: https://arxiv.org/pdf/2601.04508
Copy Paste: [[2601.04508]] WESR: Scaling and Evaluating Word-level Event-Speech Recognition(https://arxiv.org/abs/2601.04508)
Keywords: language model
Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.

Title: LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

Authors: Yuxiao Ye, Yiming Zhang, Yiran Ma, Huiyuan Xie, Huining Zhu, Zhiyuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04516
Pdf URL: https://arxiv.org/pdf/2601.04516
Copy Paste: [[2601.04516]] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation(https://arxiv.org/abs/2601.04516)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

Title: GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence

Authors: Yibo Zhao, Jiapeng Zhu, Zichen Ding, Xiang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04525
Pdf URL: https://arxiv.org/pdf/2601.04525
Copy Paste: [[2601.04525]] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence(https://arxiv.org/abs/2601.04525)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at this https URL.
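
The idea of a multi-stage gated reward, as named in the GRACE abstract, can be illustrated with a minimal sketch in which each stage earns credit only if the previous stage passed. The function name, the dictionary fields, and the stage weights (0.2, 0.3, 0.5) are all assumptions made for illustration. The abstract does not specify the actual reward design.

```python
def gated_reward(output, gold):
    """Toy gated reward: later stages earn credit only if earlier ones pass."""
    # Stage 1 (gate): is the evidence-sufficiency judgment correct?
    if output["sufficient"] != gold["sufficient"]:
        return 0.0
    # Insufficient context: the only correct behavior is explicit abstention.
    if not gold["sufficient"]:
        return 1.0 if output["answer"] is None else 0.0
    reward = 0.2  # hypothetical credit for a correct sufficiency judgment
    # Stage 2 (gate): cited evidence must be non-empty and within the gold set.
    cited = set(output["evidence"])
    if not cited or not cited <= set(gold["evidence"]):
        return reward
    reward += 0.3  # hypothetical credit for grounded evidence
    # Stage 3: exact-match answer after light normalization.
    answer = output["answer"]
    if answer is not None and answer.strip().lower() == gold["answer"].strip().lower():
        reward += 0.5  # hypothetical credit for a correct answer
    return reward

gold = {"sufficient": True, "evidence": ["p1", "p2"], "answer": "Paris"}
full = gated_reward({"sufficient": True, "evidence": ["p1"], "answer": " paris "}, gold)
ungrounded = gated_reward({"sufficient": True, "evidence": ["p9"], "answer": "Paris"}, gold)
```

In this sketch a correct answer with ungrounded evidence earns only the first-stage credit, which mirrors the abstract's concern about correct answers that lack explicit grounding.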

Title: BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

Authors: Amit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04534
Pdf URL: https://arxiv.org/pdf/2601.04534
Copy Paste: [[2601.04534]] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation(https://arxiv.org/abs/2601.04534)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods (KGW, Exponential Sampling (EXP), and Waterfall) for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a $3\times$ to $4\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.

Title: Identifying Good and Bad Neurons for Task-Level Controllable LLMs

Authors: Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04548
Pdf URL: https://arxiv.org/pdf/2601.04548
Copy Paste: [[2601.04548]] Identifying Good and Bad Neurons for Task-Level Controllable LLMs(https://arxiv.org/abs/2601.04548)
Keywords: language model, llm
Abstract: Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies have made progress on identifying the neurons responsible for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles (such as inhibitive roles) and neuron attribution that is misled by fortuitous behaviors in LLMs (i.e., correctly answering questions by chance rather than through genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.
摘要：大型语言模型在多项选择题回答基准上表现出了卓越的能力，但其大规模神经元背后的复杂机制仍然不透明，这对理解和指导法学硕士提出了重大挑战。虽然最近的研究在识别某些能力的负责神经元方面取得了进展，但这些特定于能力的方法对于需要协调使用多种能力的以任务为中心的场景来说是不可行的。此外，这些方法只关注与任务完成呈正相关的支持性神经元，而忽略了具有其他角色的神经元（例如抑制性角色）以及由于法学硕士中的偶然行为而误导的神经元归因（即，偶然正确回答问题而不是真正理解问题）。为了应对这些挑战，我们提出了NeuronLLM，一种新颖的任务级LLM理解框架，采用功能拮抗的生物学原理进行LLM神经元识别。关键的见解是，任务表现是由具有两种相反作用的神经元共同决定的：促进任务完成的好神经元和抑制任务的坏神经元。 NeuronLLM 通过对好神经元和坏神经元的对比学习来实现神经元的整体建模，同时利用增强的问题集来减轻 LLM 中的偶然行为。对不同规模和家族的法学硕士进行的综合实验表明，NeuronLLM 在四个 NLP 任务中优于现有方法，为法学硕士功能组织提供了新的见解。

Title: FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Authors: Seongyeub Chu, Jongwoo Kim, Munyong Yi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04574
Pdf URL: https://arxiv.org/pdf/2601.04574
Copy Paste: [[2601.04574]] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback(https://arxiv.org/abs/2601.04574)
Keywords: llm
Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon acceptance.
摘要：除了数字分数的预测之外，最近对自动论文评分的研究越来越强调生成高质量的反馈，以提供合理性和可操作的指导。为了降低专家注释的高昂成本，之前的工作通常依赖法学硕士生成的反馈来训练论文评估模型。然而，此类反馈通常在没有明确的质量验证的情况下被纳入，从而导致下游应用中的噪声传播。为了解决这个限制，我们提出了FeedEval，一个基于法学硕士的框架，用于沿着三个教学基础维度评估法学硕士生成的论文反馈：特异性、有用性和有效性。 FeedEval 使用在本研究中策划的数据集上接受过培训的维度专业 LLM 评估员来评估多个反馈候选者并选择高质量的反馈供下游使用。 ASAP++ 基准测试表明，FeedEval 与人类专家判断紧密一致，并且使用 FeedEval 过滤的高质量反馈训练的论文评分模型可实现卓越的评分性能。此外，使用小型法学硕士的修改实验表明，FeedEval 识别的高质量反馈可以带来更有效的论文修改。我们将在接受后发布我们的代码和精选数据集。

Title: Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Authors: Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04582
Pdf URL: https://arxiv.org/pdf/2601.04582
Copy Paste: [[2601.04582]] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization(https://arxiv.org/abs/2601.04582)
Keywords: gpt, llm
Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at this https URL.
摘要：文本到可视化 (Text2Vis) 系统将表格数据上的自然语言查询转换为简洁的答案和可执行的可视化。虽然闭源法学硕士生成功能代码，但生成的图表通常缺乏语义一致性和清晰度，而这些质量只能在执行后进行评估。开源模型的处境更加艰难，经常产生不可执行或视觉效果不佳的输出。尽管有监督的微调可以提高代码的可执行性，但它无法提高整体可视化质量，因为传统的 SFT 损失无法捕获执行后的反馈。为了解决这一差距，我们提出了 RL-Text2Vis，这是第一个用于 Text2Vis 生成的强化学习框架。我们的方法建立在组相对策略优化（GRPO）的基础上，使用一种新颖的多目标奖励，利用执行后反馈共同优化文本准确性、代码有效性和可视化质量。通过训练 Qwen2.5 模型（7B 和 14B），RL-Text2Vis 在 Text2Vis 基准测试中的图表质量比 GPT-4o 提高了 22%，并且相对于其零样本基准，代码执行成功率从 78% 提高到 97%。我们的模型显着优于强大的零样本和监督基线，并且还展示了对 VIS-Eval 和 NVBench 等域外数据集的强大泛化能力。这些结果使 GRPO 成为可视化生成中结构化、多模式推理的有效策略。我们在此 https URL 发布我们的代码。
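
The multi-objective, group-relative reward described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the weights, the helper names `combined_reward` and `group_relative_advantages`, and the sample scores are hypothetical, and the paper's actual reward design may differ.

```python
from statistics import mean, pstdev

def combined_reward(text_score, code_ok, vis_score, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three objectives; the weights are hypothetical."""
    w_text, w_code, w_vis = weights
    return w_text * text_score + w_code * float(code_ok) + w_vis * vis_score

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical samples for one query:
# (textual accuracy, did the code execute?, post-execution chart quality)
samples = [(0.9, True, 0.8), (0.7, True, 0.4), (0.9, False, 0.0), (0.2, True, 0.6)]
rewards = [combined_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, a sample whose code fails to execute is pushed down even when its textual answer is accurate, which is the kind of post-execution signal a plain SFT loss cannot express.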

Title: THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report

Authors: KBTG Labs: Anuruth Lertpiya, Danupat Khamnuansin, Kantapong Sucharitpongpan, Pornchanan Balee, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04597
Pdf URL: https://arxiv.org/pdf/2601.04597
Copy Paste: [[2601.04597]] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report(https://arxiv.org/abs/2601.04597)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models and the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift on the M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination further improves performance across both general and financial domains, with an uplift on the M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.
摘要：大型语言模型 (LLM) 在各个领域都展现出了巨大的潜力，特别是在银行和金融领域，它们可以自动化复杂的任务并大规模增强决策。出于隐私、安全和监管方面的考虑，组织通常更喜欢本地部署 LLM。 ThaiLLM 计划旨在增强开放式 LLM 中的泰语能力，使泰国业界能够利用先进的语言模型。然而，组织经常面临部署多个专用模型与训练单个多功能模型的高昂费用之间的权衡。为了解决这个问题，我们探索模型合并作为开发高性能、多功能法学硕士的资源高效替代方案。我们展示了两个关键实验的结果：首先，将 Qwen-8B 与 ThaiLLM-8B 合并，展示了 ThaiLLM-8B 如何增强泰语一般能力，显示出 M3 和 M6 O-NET 考试相对于遵循 Qwen-8B 的一般指令的提升。其次，我们将 Qwen-8B 与 ThaiLLM-8B 和 THaLLE-CFA-8B 合并。通过展示 M3 和 M6 O-NET、Flare-CFA 和 Thai-IC 基准的提升，这种组合进一步提高了一般领域和金融领域的性能。该报告展示了模型合并对于有效创建多能力法学硕士的可行性。
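
The abstract does not say which merging method is used. As a point of reference only, the sketch below shows the simplest variant, a weighted average of parameters that share names and shapes. The function `merge_linear`, the toy parameter dictionaries, and the weights are all hypothetical.

```python
def merge_linear(state_dicts, weights):
    """Weighted average of parameter vectors that share names and shapes."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged = {}
    for name in state_dicts[0]:
        # Walk the same parameter position across all models in parallel.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged

# Toy stand-ins for three checkpoints with identical architecture.
base = {"layer.weight": [0.0, 2.0]}
thai = {"layer.weight": [1.0, 2.0]}
finance = {"layer.weight": [2.0, 5.0]}
merged = merge_linear([base, thai, finance], [0.5, 0.25, 0.25])
```

Merging of this kind needs no gradient updates, which is why it is attractive when training a single multi-capability model is too expensive.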

Title: When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Authors: Rhea Kapur, Robert Hawkins, Elisa Kreiss
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04609
Pdf URL: https://arxiv.org/pdf/2601.04609
Copy Paste: [[2601.04609]] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation(https://arxiv.org/abs/2601.04609)
Keywords: language model
Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with description length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
摘要：视觉语言模型 (VLM) 越来越多地用于通过基于文本的描述来访问视觉内容。然而，在当前的系统中，描述的特异性常常与其长度混为一谈。我们认为，这两个概念必须分开：描述可以简洁但信息丰富，也可以冗长但空洞。我们定义相对于对比集的特异性，其中描述更具体，因为它比其他可能的图像更好地挑选出目标图像。我们构建了一个数据集，在改变信息内容的同时控制长度，并验证人们确实更喜欢更具体的描述，而不管长度如何。我们发现仅控制长度并不能解释特异性的差异：长度预算的分配方式会产生影响。这些结果支持直接优先考虑特异性而非冗长性的评估方法。
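
A toy model of contrast-set specificity, assuming images and descriptions can be reduced to sets of attribute strings (a simplification introduced here, not the paper's metric). The `specificity` function and all attribute sets are hypothetical.

```python
def specificity(description: set, target: set, contrast_set: list) -> float:
    """Fraction of contrast images ruled out by a description that fits the target."""
    if not description <= target:
        raise ValueError("description must be true of the target image")
    # A contrast image is ruled out when it lacks any described attribute.
    ruled_out = sum(1 for image in contrast_set if not description <= image)
    return ruled_out / len(contrast_set)

target = {"dog", "red collar", "grass", "outdoors", "daytime"}
contrast = [
    {"dog", "grass", "outdoors", "daytime"},
    {"cat", "grass", "outdoors", "daytime"},
    {"dog", "blue collar", "outdoors", "daytime"},
]
short_specific = {"dog", "red collar"}             # 2 attributes
long_vague = {"grass", "outdoors", "daytime"}      # 3 attributes

short_score = specificity(short_specific, target, contrast)
long_score = specificity(long_vague, target, contrast)
```

In this toy setup the shorter description rules out every contrast image while the longer one rules out only one of three, which mirrors the claim that how the length budget is spent matters more than its size.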

Title: Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR

Authors: Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu, Haifeng Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04611
Pdf URL: https://arxiv.org/pdf/2601.04611
Copy Paste: [[2601.04611]] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR(https://arxiv.org/abs/2601.04611)
Keywords: agent
Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory and others.
摘要：当前的角色扮演代理（RPA）通常是通过模仿表面行为来构建的，但这种方法缺乏内部认知一致性，常常在复杂情况下导致不符合性格的错误。为了解决这个问题，我们提出了Character-R1，这是一个旨在为有效的角色感知推理提供全面的可验证奖励信号的框架，而这在最近的研究中是缺失的。具体来说，我们的框架包括三个核心设计：（1）认知焦点奖励，它强制对 10 个角色元素（例如世界观）进行基于标签的明确分析，以构建内部认知； (2)参考引导奖励，利用基于重叠的指标和参考响应作为优化锚点来增强探索和性能； (3)角色条件奖励标准化，根据角色类别调整奖励分配，以确保跨异构角色的稳健优化。大量实验表明，Character-R1 在知识、记忆等方面显着优于现有方法。

Title: From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

Authors: Haneul Yoo, Won Ik Cho, Geunhye Kim, Jiyoon Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04632
Pdf URL: https://arxiv.org/pdf/2601.04632
Copy Paste: [[2601.04632]] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset(https://arxiv.org/abs/2601.04632)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.
摘要：大型语言模型 (LLM) 在许多任务上都取得了出色的表现，但它们在不同语言和文化中的进展仍然不平衡，通常反映了以英语为中心的训练数据中潜在的价值。为了实现实际的文化一致性，我们提出了一种可扩展的方法，利用国家社会研究课程作为文化意识监督的基础。我们引入了 CuCu，一个自动化的多智能体法学硕士框架，可将国家教科书课程转变为开放式的、针对特定文化的问答对。我们将 CuCu 应用于韩国国家社会研究课程，构建了 KCaQA，其中包含 34.1k 个开放式问答对。我们的定量和定性分析表明，KCaQA 涵盖了特定文化的主题，并根据当地社会文化背景做出了回应。

Title: MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Authors: Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04633
Pdf URL: https://arxiv.org/pdf/2601.04633
Copy Paste: [[2601.04633]] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark(https://arxiv.org/abs/2601.04633)
Keywords: language model, llm, prompt
Abstract: The alignment of Large Language Models (LLMs) is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. The generalization ability of fine-tuned detectors is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augmentation of the generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to the reasoning process; Reinforced Learning from Detectors Feedback (RLDF), which we systematically propose, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to be informative for future research on the generalization detection ability of detectors.
摘要：大型语言模型 (LLM) 的一致性正在不断发展。机器生成的文本 (MGT) 与人类编写的文本 (HWT) 变得越来越难以区分。这加剧了虚假新闻和网络欺诈等滥用问题。微调检测器的泛化能力高度依赖于数据集质量，仅仅扩展 MGT 的来源是不够的。需要进一步增强生成过程。根据HC-Var的理论，增强生成文本的对齐不仅可以方便对现有检测器的攻击以测试其鲁棒性，还有助于提高在其上微调的检测器的泛化能力。因此，我们提出 \textbf{M}achine-\textbf{A}ugment-\textbf{G} 通过 \textbf{A}lignment (MAGA) 生成文本。 MAGA 的流程实现了从提示构建到推理过程的全面对接，其中我们系统提出的从 \textbf{D}etectors \textbf{F}eedback (RLDF) 获得的 \textbf{R}einforced \textbf{L} 是关键组件。在我们的实验中，在 MAGA 训练集上进行微调的 RoBERTa 检测器在泛化检测 AUC 方面取得了 4.60% 的平均改进。 MAGA数据集导致所选检测器的AUC平均下降8.13%，有望为未来检测器泛化检测能力的研究提供指示意义。

Title: SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

Authors: Sirry Chen, Jieyi Wang, Wei Chen, Zhongyu Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04638
Pdf URL: https://arxiv.org/pdf/2601.04638
Copy Paste: [[2601.04638]] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation(https://arxiv.org/abs/2601.04638)
Keywords: language model
Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
摘要：医疗咨询本质上是以言语为中心的。然而，大多数先前的工作都集中在基于长文本的交互上，这既麻烦又对患者不友好。语音语言模型（SpeechLM）的最新进展使得基于语音的交互更加自然，但医疗语音数据的稀缺性和对语音数据直接微调的低效率共同阻碍了 SpeechLM 在医疗咨询中的采用。在本文中，我们提出了 SpeechMedAssist，这是一种本身能够与患者进行基于语音的多轮交互的 SpeechLM。通过利用 SpeechLM 的架构特性，我们将传统的一阶段训练解耦为两阶段范式，包括 (1) 通过文本注入知识和能力以及 (2) 使用有限语音数据进行模态重新对齐，从而将医学语音数据的需求减少到仅 10k 合成样本。为了评估医疗咨询场景中的 SpeechLM，我们设计了一个包含单轮问答和多轮模拟交互的基准。实验结果表明，在大多数评估设置中，我们的模型在有效性和鲁棒性方面都优于所有基线。

Title: CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

Authors: Yifan Le, Yunliang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04664
Pdf URL: https://arxiv.org/pdf/2601.04664
Copy Paste: [[2601.04664]] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models(https://arxiv.org/abs/2601.04664)
Keywords: language model, llm, chat
Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.
摘要：多语言大语言模型 (LLM) 在跨语言方面取得了出色的性能，但人们对语言能力在神经元级别如何组织的了解仍知之甚少。先前的工作主要通过基于激活的启发法来识别与语言相关的神经元，该启发法将语言偏好与功能重要性混为一谈。先前的工作主要通过基于激活的启发法来识别与语言相关的神经元，该启发法将语言偏好与功能重要性混为一谈。我们提出了 CRANE，一种基于相关性的分析框架，它根据功能必要性重新定义了语言特异性，通过有针对性的神经元水平干预来识别语言特异性神经元。 CRANE 通过神经元对语言条件预测的贡献而不是激活幅度来表征神经元的专业化。我们的实施将公开。神经元水平的干预揭示了一致的不对称模式：屏蔽与目标语言相关的神经元会选择性地降低该语言的性能，同时在很大程度上保留其他语言的性能，表明语言选择性但非排他性的神经元专业化。跨多个基准对英语、中文和越南语进行的实验，以及专用的基于相关性的度量和基础到聊天模型的迁移分析表明，CRANE 比基于激活的方法更精确地隔离特定于语言的组件。

Title: ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

Authors: Yanming Liu, Xinyue Peng, Jiannan Cao, Xinyi Wang, Songhang Deng, Jintao Chen, Jianwei Yin, Xuhong Zhang
Subjects: cs.CL, cs.AI, cs.FL
Abstract URL: https://arxiv.org/abs/2601.04688
Pdf URL: https://arxiv.org/pdf/2601.04688
Copy Paste: [[2601.04688]] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs(https://arxiv.org/abs/2601.04688)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present ToolGate, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
摘要：通过外部工具增强的大型语言模型 (LLM) 在复杂的推理任务中表现出了卓越的能力。然而，现有框架严重依赖自然语言推理来确定何时可以调用工具以及是否应提交其结果，缺乏对逻辑安全性和可验证性的正式保证。我们提出了 \textbf{ToolGate}，一个前向执行框架，为 LLM 工具调用提供逻辑安全保证和可验证的状态演化。 ToolGate 维护一个显式的符号状态空间，作为在整个推理过程中表示可信世界信息的类型化键值映射。每个工具都被形式化为由前置条件和后置条件组成的 Hoare 式契约，其中前置条件通过检查当前状态是否满足所需条件来控制工具调用，后置条件确定工具的结果是否可以通过运行时验证提交以更新状态。我们的方法保证符号状态仅通过经过验证的工具执行来演化，从而防止无效或幻觉结果破坏世界表征。实验验证表明，ToolGate 显着提高了工具增强的 LLM 系统的可靠性和可验证性，同时保持在复杂的多步骤推理任务上的竞争性能。这项工作为构建更值得信赖和可调试的人工智能系统奠定了基础，该系统将语言模型与外部工具集成。
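
The precondition/postcondition gating described in this abstract can be sketched roughly as below. This is not ToolGate's API: the `ToolContract` class, the `execute` function, and the currency-conversion tool with its fixed rate are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]

@dataclass
class ToolContract:
    name: str
    pre: Callable[[State], bool]          # gates invocation
    run: Callable[[State], Any]           # the tool itself
    post: Callable[[State, Any], bool]    # gates the commit
    commit_key: str                       # state key updated on success

def execute(state: State, tool: ToolContract) -> State:
    """Return an updated copy of the state only if both checks pass."""
    if not tool.pre(state):
        return state                      # precondition failed: tool is not called
    result = tool.run(state)
    if not tool.post(state, result):
        return state                      # result rejected: state left untouched
    return {**state, tool.commit_key: result}

# Hypothetical tool: a converter whose result must be a positive float.
convert = ToolContract(
    name="convert_usd_to_eur",
    pre=lambda s: isinstance(s.get("amount_usd"), (int, float)),
    run=lambda s: s["amount_usd"] * 0.9,  # fixed illustrative rate
    post=lambda s, r: isinstance(r, float) and r > 0,
    commit_key="amount_eur",
)

ok_state = execute({"amount_usd": 100}, convert)
blocked_state = execute({}, convert)
```

Since `execute` never mutates its input and only returns a new mapping after both checks succeed, the state can change only through a verified tool call, which is the invariant the abstract describes.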

Title: See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

Authors: Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan, Animesh Mukherjee
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.04692
Pdf URL: https://arxiv.org/pdf/2601.04692
Copy Paste: [[2601.04692]] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation(https://arxiv.org/abs/2601.04692)
Keywords: agent
Abstract: In this work, we examine hateful memes from three complementary angles (how to detect them, how to explain their content, and how to intervene before they are posted) by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and that it has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic content.
摘要：在这项工作中，我们通过应用一系列基于生成人工智能模型的策略，从三个互补的角度研究仇恨模因——如何检测它们、如何解释它们的内容以及如何在发布之前对其进行干预。据我们所知，解释和干预通常与检测分开研究，这不能反映现实世界的情况。此外，由于管理用于模因审核的大型带注释数据集非常昂贵，因此我们提出了一种新颖的框架，该框架利用特定于任务的生成多模态代理和大型多模态模型的少数样本适应性来满足不同类型的模因。我们相信这是第一个专注于有限数据条件下的普遍仇恨模因调节的工作，并且在现实生产场景中具有强大的部署潜力。警告：含有潜在有毒成分。

Title: Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Authors: Sungmok Jung, Yeonkyoung So, Joonhak Lee, Sangho Kim, Yelim Ahn, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04693
Pdf URL: https://arxiv.org/pdf/2601.04693
Copy Paste: [[2601.04693]] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding(https://arxiv.org/abs/2601.04693)
Keywords: language model, llm
Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
摘要：尽管已知否定会挑战大型语言模型（LLM），但评估否定理解的基准（尤其是韩语）却很少。我们对韩语否定进行了基于语料库的分析，结果表明法学硕士的表现在否定的情况下会下降。然后我们介绍 Thunder-KoNUBench，一个反映韩语否定现象经验分布的句子级基准。我们评估了 47 个法学硕士，分析了模型大小和指令调整的影响，并表明 Thunder-KoNUBench 上的微调可以提高韩语中的否定理解和更广泛的上下文理解。

Title: PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards

Authors: Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu, Xuan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04700
Pdf URL: https://arxiv.org/pdf/2601.04700
Copy Paste: [[2601.04700]] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards(https://arxiv.org/abs/2601.04700)
Keywords: language model, llm
Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract a learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, we show in this work that they are unreliable signals for large-scale and long-term training. To address this unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model's internal confidence in check.
摘要：目前大型语言模型 (LLM) 的后训练技术要么依赖成本高昂的人工监督，要么依赖外部验证者来提高数学推理和代码生成等任务的性能。然而，随着法学硕士提高他们解决问题的能力，任何进一步的改进都可能需要高质量的解决方案来解决人类无法解决的难题。因此，从未标记的数据中学习在研究界变得越来越有吸引力。现有方法通过多数投票或将模型的内部置信度转化为奖励，从模型的一致性中提取学习信号。尽管熵或自我确定性等内部一致性度量不需要人为干预，但正如我们在这项工作中所示，这些对于大规模和长期的训练来说是不可靠的信号。为了解决不可靠性问题，我们提出了 PRISM，这是一个统一的训练框架，它使用过程奖励模型 (PRM) 来指导学习以及在缺乏真实标签的情况下模型的内部置信度。我们证明，有效地将 PRM 与自我确定性结合起来可以带来稳定的训练和更好的测试时性能，并保持模型的内部信心受到控制。

Title: Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

Authors: Feihu Jin, Shipeng Cen, Ying Tan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04710
Pdf URL: https://arxiv.org/pdf/2601.04710
Copy Paste: [[2601.04710]] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning(https://arxiv.org/abs/2601.04710)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
摘要：微调大型语言模型 (LLM) 在各种 NLP 任务中取得了显着的成功，但反向传播期间的大量内存开销仍然是一个关键瓶颈，尤其是随着模型规模的增长。零阶 (ZO) 优化通过前向传播和高斯采样估计梯度，避免了反向传播的需要，从而缓解了这个问题。然而，传统的 ZO 方法由于依赖随机扰动，在梯度估计方面存在较大方差，导致收敛速度慢且性能不佳。我们提出了一种简单的即插即用方法，该方法结合了先验信息的扰动来改进梯度估计。我们的方法动态地计算来自高斯样本的引导向量，该向量将扰动引向信息更丰富的方向，与标准 ZO 方法相比，显着加速了收敛速度。我们进一步研究贪婪扰动策略，以探索先验知识对梯度估计的影响。从理论上讲，我们证明我们的梯度估计器与真实梯度方向实现了更强的对齐，从而提高了优化效率。不同规模和架构的法学硕士的广泛实验表明，我们提出的方法可以无缝集成到现有的优化方法中，从而提供更快的收敛和卓越的性能。值得注意的是，在 OPT-13B 模型上，我们的方法在所有 11 项基准任务中都优于传统的 ZO 优化，并在 11 项任务中的 9 项上超过基于梯度的基线，在效率和准确性之间建立了稳健的平衡。
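
For orientation, the sketch below shows the conventional two-point zeroth-order estimator that this abstract takes as its baseline: the gradient is approximated from two forward evaluations along a random Gaussian direction, with no backpropagation. It does not implement the paper's prior-informed guiding vector. The function names, the toy quadratic loss, and the hyperparameters are assumptions made for illustration.

```python
import random

def zo_gradient(loss, theta, eps=1e-3, rng=random):
    """Two-point zeroth-order estimate along one Gaussian direction z."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2.0 * eps)   # directional derivative along z
    return [scale * zi for zi in z]

def zo_sgd(loss, theta, lr=0.02, steps=500, seed=0):
    """Plain SGD driven only by forward evaluations of the loss."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zo_gradient(loss, theta, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def quadratic(theta):
    """Toy loss with its minimum at (1, -2)."""
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

start = [0.0, 0.0]
final = zo_sgd(quadratic, start)
```

Each step needs only two loss evaluations, which is the source of the memory savings. The estimate points along a random direction, however, which is the variance problem that prior-informed perturbations aim to reduce.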

Title: DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

Authors: Anh Thi-Hoang Nguyen, Khanh Quoc Tran, Tin Van Huynh, Phuoc Tan-Hoang Nguyen, Cam Tan Nguyen, Kiet Van Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04711
Pdf URL: https://arxiv.org/pdf/2601.04711
Copy Paste: [[2601.04711]] DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs(https://arxiv.org/abs/2601.04711)
Keywords: language model, llm, hallucination, prompt
Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations: fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types (factual, noisy, and adversarial) to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.

Title: Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents

Authors: Yonghyun Jun, Junhyuk Choi, Jihyeong Park, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04716
Pdf URL: https://arxiv.org/pdf/2601.04716
Copy Paste: [[2601.04716]] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents(https://arxiv.org/abs/2601.04716)
Keywords: language model, llm, agent
Abstract: Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character's identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of Character Identity, a multidimensional construct that disentangles a character into two distinct layers: (1) Parametric Identity, referring to character-specific knowledge encoded from the LLM's pre-training, and (2) Attributive Identity, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify "Fame Fades": while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that "Nature Remains": while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.

Title: Automatic Classifiers Underdetect Emotions Expressed by Men

Authors: Ivan Smirnov, Segun T. Aroyehun, Paul Plener, David Garcia
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2601.04730
Pdf URL: https://arxiv.org/pdf/2601.04730
Copy Paste: [[2601.04730]] Automatic Classifiers Underdetect Emotions Expressed by Men(https://arxiv.org/abs/2601.04730)
Keywords: language model
Abstract: The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.

Title: AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Authors: Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04736
Pdf URL: https://arxiv.org/pdf/2601.04736
Copy Paste: [[2601.04736]] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs(https://arxiv.org/abs/2601.04736)
Keywords: language model, llm
Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate (ASR), together with an increase of at least 8% in the harmless dimension and over 13% in the helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.

Title: RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

Authors: Huawei Zheng, Xinqi Jiang, Sen Yang, Shouling Ji, Yingcai Wu, Dazhen Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04740
Pdf URL: https://arxiv.org/pdf/2601.04740
Copy Paste: [[2601.04740]] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation(https://arxiv.org/abs/2601.04740)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts, expressed through indirect domain knowledge, are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets on GitHub.

Title: Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

Authors: Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04742
Pdf URL: https://arxiv.org/pdf/2601.04742
Copy Paste: [[2601.04742]] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval(https://arxiv.org/abs/2601.04742)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.

Title: PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks

Authors: Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04758
Pdf URL: https://arxiv.org/pdf/2601.04758
Copy Paste: [[2601.04758]] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks(https://arxiv.org/abs/2601.04758)
Keywords: language model, llm
Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at this https URL.

Title: Differential syntactic and semantic encoding in LLMs

Authors: Santiago Acevedo, Alessandro Laio, Marco Baroni
Subjects: cs.CL, cs.AI, cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2601.04765
Pdf URL: https://arxiv.org/pdf/2601.04765
Copy Paste: [[2601.04765]] Differential syntactic and semantic encoding in LLMs(https://arxiv.org/abs/2601.04765)
Keywords: language model, llm
Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
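Note on the centroid operation: the averaging-and-subtracting step described in the abstract can be pictured with a few toy vectors. The sketch below is only an illustration under assumed inputs; the three-dimensional vectors are invented stand-ins for hidden representations, and the paper's actual layers, pooling, and similarity measure may differ.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors (the 'centroid')."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sentence vectors that share a common component near [1, 1, 0]
# (standing in for a shared structure) plus small sentence-specific parts.
group = [
    [1.0, 1.0, 0.2],
    [1.0, 1.0, -0.2],
    [1.2, 0.8, 0.0],
    [0.8, 1.2, 0.0],
]
centroid = mean_vector(group)

# Similarity between two group members before and after removing the centroid.
before = cosine(group[0], group[1])
after = cosine(subtract(group[0], centroid), subtract(group[1], centroid))
```

In this toy setup the two vectors look nearly identical before subtraction and dissimilar afterwards, because most of their similarity came from the shared component that the centroid captures.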

Title: Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

Authors: Shengyin Sun, Yiming Li, Renxi Liu, Weizhe Lin, Hui-Ling Zhen, Xianzhi Yu, Mingxuan Yuan, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04766
Pdf URL: https://arxiv.org/pdf/2601.04766
Copy Paste: [[2601.04766]] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence(https://arxiv.org/abs/2601.04766)
Keywords: llm
Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the "criticality" scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
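Note on KL-based verification: one way to picture a training-free check of this kind is to accept a drafted token whenever the target and draft next-token distributions are close in KL divergence. The sketch below is an assumption-laden illustration, not the authors' method: the direction KL(target || draft), the threshold value, and the function names are all hypothetical choices.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def accept_draft_token(target_logits, draft_logits, threshold=0.1):
    """Accept when the two distributions are close (hypothetical rule)."""
    p = softmax(target_logits)  # target model distribution
    q = softmax(draft_logits)   # draft model distribution
    return kl_divergence(p, q) <= threshold


# Toy logits over a three-token vocabulary (made up for illustration).
agree = accept_draft_token([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
disagree = accept_draft_token([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

Identical distributions give a divergence of zero and are accepted, while distributions that put their mass on different tokens are rejected; in practice the threshold would need tuning per task.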

Title: NC2C: Automated Convexification of Generic Non-Convex Optimization Problems

Authors: Xinyue Peng, Yanming Liu, Yihan Cang, Yuwei Zhang, Xinyi Wang, Songhang Deng, Jiannan Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04789
Pdf URL: https://arxiv.org/pdf/2601.04789
Copy Paste: [[2601.04789]] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems(https://arxiv.org/abs/2601.04789)
Keywords: language model, llm
Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs' mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C's ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.

Title: Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

Authors: Junhyuk Choi, Jeongyoun Kwon, Heeju Kim, Haeun Cho, Hayeong Jung, Sehee Min, Bugeun Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04790
Pdf URL: https://arxiv.org/pdf/2601.04790
Copy Paste: [[2601.04790]] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework(https://arxiv.org/abs/2601.04790)
Keywords: language model, gpt, chat, agent
Abstract: Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven's power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.

Title: RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

Authors: Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04853
Pdf URL: https://arxiv.org/pdf/2601.04853
Copy Paste: [[2601.04853]] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection(https://arxiv.org/abs/2601.04853)
Keywords: language model, llm, agent
Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at this https URL.

Title: Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

Authors: Oshri Naparstek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04854
Pdf URL: https://arxiv.org/pdf/2601.04854
Copy Paste: [[2601.04854]] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics(https://arxiv.org/abs/2601.04854)
Keywords: language model
Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that mature over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

Title: MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News

Authors: Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo, Min Peng, Qianqian Xie, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04857
Pdf URL: https://arxiv.org/pdf/2601.04857
Copy Paste: [[2601.04857]] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News(https://arxiv.org/abs/2601.04857)
Keywords: llm
Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at this https URL.

Title: A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs

Authors: Maxime Delmas, Lei Xu, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04859
Pdf URL: https://arxiv.org/pdf/2601.04859
Copy Paste: [[2601.04859]] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs(https://arxiv.org/abs/2601.04859)
Keywords: llm
Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at this https URL.

Title: EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

Authors: Xuanguang Pan, Chongyang Tao, Jiayuan Bai, Jianling Gao, Zhengwei Tao, Xiansheng Zhou, Gavin Cheung, Shuai Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04875
Pdf URL: https://arxiv.org/pdf/2601.04875
Copy Paste: [[2601.04875]] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis(https://arxiv.org/abs/2601.04875)
Keywords: llm, prompt
Abstract: Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.
摘要：由于缺乏高质量、多样化且结构复杂的数据集，训练有效的文本到 SQL 模型仍然具有挑战性。现有方法要么依赖有限的人工注释语料库，要么通过简单地提示 LLM 直接合成数据集，而无需对 SQL 结构进行显式控制，这通常会导致结构多样性和复杂性有限。为了解决这个问题，我们引入了 EvolSQL，这是一个结构感知的数据合成框架，它将 SQL 查询从种子数据演变为更丰富、语义更多样化的形式。 EvolSQL 从探索性 Query-SQL 扩展开始，以扩大问题多样性并提高模式覆盖率，然后使用从 SQL 抽象语法树派生的六个原子转换运算符来应用自适应定向演化策略，以逐步增加跨关系、谓词、聚合和嵌套维度的查询复杂性。基于执行的 SQL 细化模块和模式感知重复数据删除进一步确保创建高质量、结构多样的映射对。实验结果表明，根据我们的数据进行微调的 7B 模型优于仅使用 1/18 数据在更大的 SynSQL 数据集上训练的模型。

Title: Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis

Authors: Mingyue Cheng, Daoyu Wang, Qi Liu, Shuo Yu, Xiaoyu Tao, Yuqian Wang, Chengzhong Chu, Yu Duan, Mingkang Long, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04879
Pdf URL: https://arxiv.org/pdf/2601.04879
Copy Paste: [[2601.04879]] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis(https://arxiv.org/abs/2601.04879)
Keywords: language model, llm, agent
Abstract: Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at this https URL.
摘要：从大量嘈杂的网络资源中综合信息丰富的商业报告对于高风险的业务决策至关重要。尽管当前的深度研究机构取得了显着进展，但他们的报告在质量、可靠性和覆盖范围方面仍然有限。在这项工作中，我们提出了 Mind2Report，这是一种认知深度研究代理，可以模拟商业分析师来合成专家级报告。具体来说，它首先探测细粒度的意图，然后搜索网络资源并动态记录提取的信息，然后迭代地综合报告。我们将 Mind2Report 设计为一种免训练的代理工作流程，通过动态记忆增强通用大语言模型 (LLM)，以支持这些长形式的认知过程。为了严格评估Mind2Report，我们进一步构建了包含200个现实世界商业任务的QRC-Eval，并建立了整体评估策略来评估报告质量、可靠性和覆盖率。实验表明，Mind2Report 的性能优于领先的基线，包括 OpenAI 和 Gemini 深度研究代理。尽管这是一项初步研究，但我们希望它能够成为推进商业深度研究代理未来设计的基础。我们的代码和数据可在此 https URL 中获取。

Title: CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

Authors: Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.04885
Pdf URL: https://arxiv.org/pdf/2601.04885
Copy Paste: [[2601.04885]] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters(https://arxiv.org/abs/2601.04885)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at this https URL.
摘要：由于大型语言模型 (LLM) 服务于全球受众，因此一致性必须从强制达成普遍共识转变为尊重文化多元化。我们证明，当被迫适应相互冲突的值分布时，密集模型会遭受 Mean Collapse 的影响，收敛到无法代表不同群体的通用平均值。我们将此归因于文化稀疏性，其中梯度干扰阻止密集参数跨越不同的文化模式。为了解决这个问题，我们提出了 CuMA（Cultural Mixture of Adapters），这是一个将对齐视为条件容量分离问题的框架。通过结合人口统计感知路由，CuMA 内化了潜在文化拓扑，以明确地将冲突梯度分解到专门的专家子空间中。对 WorldValuesBench、Community Alignment 和 PRISM 的广泛评估表明 CuMA 实现了最先进的性能，显着优于密集基线和纯语义 MoE。至关重要的是，我们的分析证实 CuMA 有效地减轻了均值崩溃，保护了文化多样性。我们的代码可以在这个 https URL 上找到。

Title: Faithful Summarisation under Disagreement via Belief-Level Aggregation

Authors: Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04889
Pdf URL: https://arxiv.org/pdf/2601.04889
Copy Paste: [[2601.04889]] Faithful Summarisation under Disagreement via Belief-Level Aggregation(https://arxiv.org/abs/2601.04889)
Keywords: language model, llm, prompt
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
摘要：意见和多文档总结通常涉及真正相互冲突的观点，但许多现有方法，特别是基于法学硕士的系统，隐含地消除了分歧并过度代表了多数意见。这限制了在观点较多的环境中生成的摘要的忠实度。我们引入了一种分歧感知综合管道，它将信念级别聚合与语言生成分开。文档首先被表示为结构化信念集，并使用基于距离的信念合并算子进行聚合，这些算子明确地模拟了冲突。然后，大型语言模型仅用于将聚合信念实现为自然语言摘要。我们跨多个模型系列和规模评估该方法，并将其与在生成过程中执行显式聚合的方法进行比较。我们的结果表明，虽然足够大的模型可以在生成时处理聚合时匹配置信级别聚合，但这种行为在架构或容量之间并不稳定。相比之下，信念级别的聚合与简单的提示相结合，可以在模型之间产生一致的强大的分歧意识性能，同时保持流畅和扎实的总结。
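
Illustrative sketch (not from the paper): one common family of distance-based belief merging operators picks the truth assignments that minimise the total distance to all sources. The version below assumes Hamming distance and sum aggregation; the abstract does not say which operators the authors use, and the atoms and review data are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Number of atoms on which two truth assignments differ."""
    return sum(x != y for x, y in zip(a, b))

def merge(belief_sets, n_atoms):
    """Return the assignments with minimal summed distance to all sources.

    Each source is a set of acceptable assignments (its models); the distance
    from a candidate assignment to a source is the distance to its closest model.
    """
    best, best_cost = [], None
    for world in product([False, True], repeat=n_atoms):
        cost = sum(min(hamming(world, m) for m in models) for models in belief_sets)
        if best_cost is None or cost < best_cost:
            best, best_cost = [world], cost
        elif cost == best_cost:
            best.append(world)
    return best, best_cost

# Hypothetical atoms: (battery_is_good, screen_is_good).
review_a = {(True, True)}                    # both good
review_b = {(True, False)}                   # battery good, screen bad
review_c = {(True, False), (False, False)}   # screen bad, no view on battery

merged, cost = merge([review_a, review_b, review_c], n_atoms=2)
print(merged, cost)
```

The merged result keeps the conflict visible as a nonzero cost: one source still disagrees with the outcome, which is information a language model could be asked to verbalise rather than smooth over.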

Title: V-FAT: Benchmarking Visual Fidelity Against Text-bias

Authors: Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong, Songxiang Liu
Subjects: cs.CL, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2601.04897
Pdf URL: https://arxiv.org/pdf/2601.04897
Copy Paste: [[2601.04897]] V-FAT: Benchmarking Visual Fidelity Against Text-bias(https://arxiv.org/abs/2601.04897)
Keywords: language model, llm
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
摘要：多模态大型语言模型 (MLLM) 的最新进展在标准视觉推理基准上展示了令人印象深刻的性能。然而，人们越来越担心这些模型过度依赖语言捷径而不是真正的视觉基础，我们将这种现象称为文本偏差。在本文中，我们研究了视觉感知和语言先验之间的根本张力。我们将这种偏差的来源分解为两个维度：内部语料库偏差，源于预训练中的统计相关性；外部指令偏差，源于对齐引起的阿谀奉承倾向。为了量化这种影响，我们引入了 V-FAT（针对文本偏差的视觉保真度），这是一种诊断基准，包含跨六个语义域的 4,026 个 VQA 实例。 V-FAT 采用三级评估框架，系统地增加了视觉证据和文本信息之间的冲突：（L1）来自非典型图像的内部偏差，（L2）来自误导性指令的外部偏差，以及（L3）两者一致的协同偏差。我们引入了视觉鲁棒性得分（VRS），这是一种旨在惩罚“幸运”的语言猜测并奖励真正的视觉保真度的指标。我们对 12 个前沿 MLLM 的评估表明，虽然模型在现有基准中表现出色，但它们在高度语言主导下经历了严重的视觉崩溃。

Title: Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

Authors: Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04925
Pdf URL: https://arxiv.org/pdf/2601.04925
Copy Paste: [[2601.04925]] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences(https://arxiv.org/abs/2601.04925)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.
摘要：大型语言模型 (LLM) 可以生成极具说服力的文本，引发人们对其滥用于宣传、操纵和其他有害目的的担忧。这引出了我们的核心问题：法学硕士生成的说服是否比人类编写的说服更难自动检测？为了解决这个问题，我们对法学硕士制作有说服力的内容的可控生成方法进行了分类，并引入了 Persuaficial，这是一种高质量的多语言基准，涵盖六种语言：英语、德语、波兰语、意大利语、法语和俄语。使用这个基准，我们对人类撰写的和法学硕士生成的有说服力的文本进行了广泛的实证评估。我们发现，尽管 LLM 生成的明显有说服力的文本比人类编写的文本更容易检测，但 LLM 生成的微妙说服力始终会降低自动检测性能。除了检测性能之外，我们还提供了第一个全面的语言分析，对比人类和法学硕士生成的有说服力的文本，提供可以指导开发更可解释和更强大的检测工具的见解。

Title: GenProve: Learning to Generate Text with Fine-Grained Provenance

Authors: Jingxuan Wei, Xingyue Wang, Yanghaoyu Liao, Jie Dong, Yuchen Liu, Caijun Jia, Bihui Yu, Junnan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04932
Pdf URL: https://arxiv.org/pdf/2601.04932
Copy Paste: [[2601.04932]] GenProve: Learning to Generate Text with Fine-Grained Provenance(https://arxiv.org/abs/2601.04932)
Keywords: language model, llm
Abstract: Large language models (LLMs) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert-verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
摘要：大型语言模型 (LLM) 经常产生幻觉，虽然添加引用是一种常见的解决方案，但由于用户难以验证引用的来源如何支持生成的声明，因此它通常不足以实现问责制。现有的方法通常是粗粒度的，无法区分直接引用和复杂推理。在本文中，我们介绍了生成时间细粒度起源（Generation-time Fine-grained Provenance），该任务中模型必须生成流畅的答案，同时生成结构化的句子级起源三元组。为了实现这一点，我们提出了 ReFInE（关系感知细粒度可解释性和证据），这是一个具有专家验证注释的数据集，可区分引用、压缩和推理。在 ReFInE 的基础上，我们提出了 GenProve，这是一个将监督微调 (SFT) 与组相对策略优化 (GRPO) 相结合的框架。通过优化答案保真度和出处正确性的综合奖励，GenProve 在联合评估中显着优于 14 名实力雄厚的法学硕士。至关重要的是，我们的分析揭示了推理差距，即模型在表面引用方面表现出色，但在基于推理的来源方面却表现不佳，这表明可验证的推理仍然是与表面引用不同的前沿挑战。
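
Illustrative sketch (not from the paper): GRPO scores each sampled response relative to the other responses drawn for the same prompt. The snippet below shows that group-relative normalisation applied to a composite reward. The equal weights, the 0-to-1 score ranges, and the function names are assumptions; the abstract does not specify how the answer-fidelity and provenance terms are combined.

```python
from statistics import mean, pstdev

def composite_reward(answer_score, provenance_score, w_answer=0.5, w_prov=0.5):
    """Weighted sum of the two reward terms (weights are hypothetical)."""
    return w_answer * answer_score + w_prov * provenance_score

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical responses, scored as (answer fidelity, provenance correctness).
scores = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
rewards = [composite_reward(a, p) for a, p in scores]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

A response that is right on both terms gets a positive advantage, one that is wrong on both gets a negative one, and the mixed cases sit at the group mean, so no separate value network is needed to produce a learning signal.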

Title: A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction

Authors: Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Xuelong Li
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2601.04960
Pdf URL: https://arxiv.org/pdf/2601.04960
Copy Paste: [[2601.04960]] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction(https://arxiv.org/abs/2601.04960)
Keywords: language model, llm
Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
摘要：本文提出了一种统一的情商口语模型，并通过一种称为注入情感归因思维（IEAT）的新颖数据构建策略进行了增强。 IEAT 将用户情绪状态及其根本原因纳入模型的内部推理过程中，使情绪感知推理能够内化，而不是被视为显式监督。该模型采用两阶段渐进策略进行训练。第一阶段通过自蒸馏进行语音文本对齐和情感属性建模，第二阶段进行端到端跨模态联合优化，以确保文本和口语情感表达的一致性。类人口语对话系统挑战赛 (HumDial) 情绪智力基准的实验表明，所提出的方法在基于法学硕士和人类评估的情绪轨迹建模、情绪推理和移情反应生成方面均取得了一流的性能。

Title: Text as a Universal Interface for Transferable Personalization

Authors: Yuting Liu, Jian Guan, Jia-Nan Li, Wei Wu, Jiang-Ming Yang, Jianzhe Zhao, Guibing Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.04963
Pdf URL: https://arxiv.org/pdf/2601.04963
Copy Paste: [[2601.04963]] Text as a Universal Interface for Transferable Personalization(https://arxiv.org/abs/2601.04963)
Keywords: language model, llm
Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque "black-box" profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performance -- outperforming substantially larger open-source models -- while exhibiting strong transferability across tasks, model families, and interaction formats.
摘要：我们研究大型语言模型（LLM）中的个性化问题。先前的工作主要将用户偏好表示为隐式的、特定于模型的向量或参数，产生不透明的“黑盒”配置文件，难以解释和跨模型和任务传输。相比之下，我们主张将自然语言作为偏好表示的通用、模型和任务无关的界面。该公式导致可解释和可重用的偏好描述，同时随着观察到新的交互而自然地支持持续进化。为了学习这种表示，我们引入了一个两阶段训练框架，它将高质量合成数据的监督微调与强化学习结合起来，以优化长期效用和跨任务可转移性。基于此框架，我们开发了 AlignXplore+，这是一种生成文本偏好摘要的通用偏好推理模型。九个基准测试的实验表明，我们的 8B 模型实现了最先进的性能——远远优于更大的开源模型——同时表现出跨任务、模型系列和交互格式的强大可移植性。

Title: Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Authors: Xueyun Tian (1 and 2), Minghua Ma (3), Bingbing Xu (1 and 4), Nuoyan Lyu (1 and 2), Wei Li, Heng Dong (4), Zheng Chu (3), Yuanzhuo Wang (1), Huawei Shen (1 and 2) ((1) CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China, (2) University of Chinese Academy of Sciences, Beijing, China (3) Harbin Institute of Technology, Harbin, China, (4) Tsinghua University, Beijing, China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.04992
Pdf URL: https://arxiv.org/pdf/2601.04992
Copy Paste: [[2601.04992]] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization(https://arxiv.org/abs/2601.04992)
Keywords: language model, chain-of-thought
Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
摘要：对思想链 (CoT) 轨迹演示进行监督微调 (SFT) 是在大型语言模型中实现推理的常用方法。标准实践通常只保留具有正确最终答案（正数）的轨迹，而忽略其余部分（负数）。我们认为这种范式放弃了实质性的监督并加剧了过度拟合，限制了域外（OOD）泛化。具体来说，我们令人惊讶地发现，将负轨迹纳入 SFT 比纯正训练产生了显着的 OOD 泛化增益，因为尽管最终答案不正确，这些轨迹通常仍保留有效的中间推理。为了深入了解这种影响，我们系统地分析了数据、训练动态和推理行为，识别了负链中的 22 个重复模式，这些模式具有双重作用：它们调节损失下降以减轻训练期间的过度拟合，并在推理期间将策略熵提高 35.67% 以促进探索。受这些观察的启发，我们进一步提出了基于增益的 LOss 加权（GLOW），这是一种自适应的、样本感知的方案，它通过根据跨时期的进展重新调整每个样本的损失来利用这种独特的训练动态。根据经验，GLOW 有效地利用了未过滤的轨迹，在 Qwen2.5-7B 上比纯正 SFT 产生了 5.51% 的 OOD 增益，并将 MMLU 从 72.82% 提升到 76.47% 作为 RL 初始化。

Title: Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Authors: Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05004
Pdf URL: https://arxiv.org/pdf/2601.05004
Copy Paste: [[2601.05004]] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei(https://arxiv.org/abs/2601.05004)
Keywords: language model, llm, agent
Abstract: Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.
摘要：自毁行为与复杂的心理状态有关，并且很难诊断。由于其独特的表达方式，这些行为在亚文化群体中可能更难识别。随着大型语言模型（LLM）在各个领域的应用，一些研究人员已经开始探索其在检测自毁行为方面的应用。受此启发，我们使用当前基于法学硕士的方法研究亚文化中的自毁行为检测。然而，这些方法有两个主要挑战：（1）知识滞后：亚文化俚语发展迅速，比法学硕士的培训周期更快； (2) 语义错位：掌握亚文化特有的具体而细致的表达方式具有挑战性。为了解决这些问题，我们提出了亚文化对齐求解器（SAS），这是一个结合了自动检索和亚文化对齐的多智能体框架，显着提高了法学硕士在检测自我毁灭行为方面的表现。我们的实验结果表明 SAS 优于当前先进的多智能体框架 OWL。值得注意的是，它与经过微调的法学硕士有很好的竞争优势。我们希望 SAS 能够推动亚文化背景下自毁行为检测领域的发展，并成为未来研究人员的宝贵资源。

Title: Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

Authors: Yueqing Hu, Xinyang Peng, Shuting Peng, Hanqi Wang, Tianhong Wang
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2601.05019
Pdf URL: https://arxiv.org/pdf/2601.05019
Copy Paste: [[2601.05019]] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models(https://arxiv.org/abs/2601.05019)
Keywords: language model
Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
摘要：最近通过强化学习训练的大型推理模型表现出与人类认知成本的“自然”一致性。然而，我们表明，流行的推理蒸馏范式——通过监督微调（SFT）训练学生模型来模仿这些痕迹——无法传递这种认知结构。在 14 个模型中测试“Hán Dān Xué Bù”（表面拟态）假设，我们发现蒸馏会导致“功能对齐崩溃”：虽然教师模型反映了人类难度缩放（$\bar{r}=0.64$），但经过蒸馏的学生显着降低了这种对齐（$\bar{r}=0.34$），通常表现低于他们自己的蒸馏前基线（“负迁移”）。我们的分析表明，SFT 会引发“货物崇拜”效应，即学生仪式性地复制推理的语言形式（冗长），而不内化教师的动态资源分配政策。因此，推理蒸馏将计算成本与认知需求脱钩，揭示类人认知是主动强化的新兴属性，而不是被动模仿。
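
Illustrative sketch (not from the paper): the alignment figures quoted above are correlation coefficients, averaged over models. One way to picture such a measurement is a Pearson correlation between per-item human difficulty and the length of a model's reasoning trace. The choice of Pearson's r, the use of token counts as the cost measure, and all numbers below are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: five problems ordered by human difficulty.
human_difficulty = [1, 2, 3, 4, 5]
teacher_tokens = [100, 220, 310, 390, 520]   # effort grows with difficulty
student_tokens = [400, 380, 420, 390, 410]   # verbose regardless of difficulty

r_teacher = pearson_r(human_difficulty, teacher_tokens)
r_student = pearson_r(human_difficulty, student_tokens)
print(round(r_teacher, 3), round(r_student, 3))
```

In this toy setup the student writes long traces everywhere, so its cost tracks difficulty only weakly, which is the kind of decoupling the abstract describes.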

Title: ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG

Authors: Jianbo Li, Yi Jiang, Sendong Zhao, Bairui Hu, Haochun Wang, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05038
Pdf URL: https://arxiv.org/pdf/2601.05038
Copy Paste: [[2601.05038]] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG(https://arxiv.org/abs/2601.05038)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried "compressing" these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive "gating" system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.
摘要：检索增强生成 (RAG) 有助于法学硕士保持准确性，但将长文档输入到提示中会使模型变得缓慢且昂贵。这推动了上下文压缩，从令牌修剪和摘要到基于嵌入的压缩。虽然研究人员尝试将这些文档“压缩”成更小的摘要或数学嵌入，但有一个问题：数据压缩得越多，法学硕士就越难以理解它。为了应对这一挑战，我们提出了 ArcAligner（自适应递归上下文 *Aligner*），这是一个集成到语言模型层中的轻量级模块，可帮助模型更好地利用高度压缩的上下文表示进行下游生成。它使用自适应“门控”系统，仅在信息复杂时增加额外的处理能力，从而保持系统快速。在知识密集型 QA 基准测试中，ArcAligner 在相当的压缩率下始终优于压缩基线，尤其是在多跳和长尾设置上。源代码是公开的。

Title: Compositional Steering of Large Language Models with Steering Tokens

Authors: Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05062
Pdf URL: https://arxiv.org/pdf/2601.05062
Copy Paste: [[2601.05062]] Compositional Steering of Large Language Models with Steering Tokens(https://arxiv.org/abs/2601.05062)
Keywords: language model, llm
Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, compositional steering -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose compositional steering tokens for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated composition token on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to unseen compositions, including those with unseen behaviors as well as those with an unseen number of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
摘要：在现实应用中部署法学硕士需要可控的输出，同时满足多种需求。虽然现有的工作广泛解决了单一行为的法学硕士指导，但组合指导（即同时指导法学硕士走向多种行为）仍然是一个尚未充分探索的问题。在这项工作中，我们提出了用于多行为转向的组合转向标记。我们首先通过自我蒸馏将以自然语言指令表示的个人行为嵌入到专用令牌中。与大多数在激活空间中操作的先前工作相反，我们的行为引导存在于输入令牌的空间中，从而实现更有效的零样本组合。然后，我们在行为对上训练专用的 composition token，并表明它成功地捕获了组合的概念：它很好地推广到 unseen 组合，包括那些具有未见行为的组合以及那些具有未见的 number 行为的组合。我们在不同 LLM 架构上的实验表明，与竞争方法（指令、激活转向和 LoRA 合并）相比，转向令牌可带来卓越的多行为控制。此外，我们还表明，转向标记可以补充自然语言指令，它们的组合可以带来进一步的收益。

Title: SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment

Authors: Ziyang Chen, Zhenxuan Huang, Yile Wang, Weiqin Wang, Lu Yin, Hui Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05075
Pdf URL: https://arxiv.org/pdf/2601.05075
Copy Paste: [[2601.05075]] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment(https://arxiv.org/abs/2601.05075)
Keywords: language model, llm, prompt
Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, embedding methods based on generative large language models (LLMs) have emerged. These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving its inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
摘要：传统的句子嵌入方法在非生成预训练模型上采用标记级对比学习。最近，出现了基于生成大语言模型（LLM）的嵌入方法。这些方法要么依赖于固定的提示模板，要么涉及对模型架构的修改。前者缺乏对模型的进一步优化，导致性能有限，而后者改变了模型的内部计算机制，从而损害了其生成能力。我们提出了 SemPA，这是一种新颖的方法，可以增强句子表示，同时通过语义偏好对齐保留法学硕士的生成能力。我们利用句子级直接偏好优化 (DPO) 来有效优化 LLM 的释义生成任务，其中模型学习区分语义上等效的句子，同时保留固有的生成能力。理论上，我们在 Plackett-Luce 模型框架下建立了 DPO 和对比学习之间的正式联系。根据经验，语义文本相似性任务和法学硕士各种基准的实验结果表明，SemPA 在不牺牲法学硕士固有的生成能力的情况下实现了更好的语义表示。
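
Illustrative sketch (not from the paper): the standard DPO objective for a single preference pair compares how much the policy and a frozen reference model each favour the chosen output over the rejected one. The snippet computes that loss from sequence log-probabilities. Treating a paraphrase as the chosen output and a non-paraphrase as the rejected one is an assumption about how SemPA builds its pairs, and the beta value and log-probabilities are made up.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is the policy's log-ratio advantage for the chosen output
    minus the same quantity for the rejected output, both measured
    against the reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) equals log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy now favours the chosen (paraphrase) sentence over the rejected one.
loss_later = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(loss_start, loss_later)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy separates the two sentences, which is the pairwise contrast that the abstract relates to contrastive learning.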

Title: How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness

Authors: Florence Bernays, Marco Henriques Pereira, Jochen Menges (University of Zurich)
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2601.05104
Pdf URL: https://arxiv.org/pdf/2601.05104
Copy Paste: [[2601.05104]] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness(https://arxiv.org/abs/2601.05104)
Keywords: gpt, prompt, chat
Abstract: This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that, compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher, albeit smaller, improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.

Title: Agent-as-a-Judge

Authors: Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.05111
Pdf URL: https://arxiv.org/pdf/2601.05111
Copy Paste: [[2601.05111]] Agent-as-a-Judge(https://arxiv.org/abs/2601.05111)
Keywords: language model, llm, agent
Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

Title: DocDancer: Towards Agentic Document-Grounded Information Seeking

Authors: Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05163
Pdf URL: https://arxiv.org/pdf/2601.05163
Copy Paste: [[2601.05163]] DocDancer: Towards Agentic Document-Grounded Information Seeking(https://arxiv.org/abs/2601.05163)
Keywords: agent
Abstract: Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data prove effective on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights into agentic tool design and synthetic data.

Title: RelayLLM: Efficient Reasoning via Collaborative Decoding

Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05167
Pdf URL: https://arxiv.org/pdf/2601.05167
Copy Paste: [[2601.05167]] RelayLLM: Efficient Reasoning via Collaborative Decoding(https://arxiv.org/abs/2601.05167)
Keywords: language model, llm
Abstract: The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.

Title: Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05170
Pdf URL: https://arxiv.org/pdf/2601.05170
Copy Paste: [[2601.05170]] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference(https://arxiv.org/abs/2601.05170)
Keywords: language model, llm
Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.

Title: Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems

Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05171
Pdf URL: https://arxiv.org/pdf/2601.05171
Copy Paste: [[2601.05171]] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems(https://arxiv.org/abs/2601.05171)
Keywords: agent
Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
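The {ADD, UPDATE, DELETE, NO_OP} operations named in the abstract above can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the tree is modeled as nested dictionaries, the top-level keys stand in for the schema-constrained trunk, and the function name `apply_op` and its path-based signature are hypothetical.

```python
def apply_op(tree, op, path=(), value=None):
    """Apply one memory operation to a nested-dict persona tree, in place.

    `path` is a tuple of keys; its first element must be an existing
    top-level (trunk) key, so operations can only change branches and leaves.
    """
    if op == "NO_OP":
        return tree
    if op not in ("ADD", "UPDATE", "DELETE"):
        raise ValueError(f"unknown operation: {op}")
    if len(path) < 2 or path[0] not in tree:
        raise KeyError("path must start at an existing trunk key and go below it")

    # Walk to the parent of the target leaf; ADD may create missing branches.
    node = tree
    for key in path[:-1]:
        node = node.setdefault(key, {}) if op == "ADD" else node[key]

    leaf = path[-1]
    if op == "ADD":
        if leaf in node:
            raise KeyError(f"{leaf!r} already exists; use UPDATE")
        node[leaf] = value
    elif op == "UPDATE":
        if leaf not in node:
            raise KeyError(f"{leaf!r} does not exist; use ADD")
        node[leaf] = value
    else:  # DELETE
        del node[leaf]
    return tree


# Hypothetical trunk schema with two top-level categories.
persona = {"preferences": {}, "background": {}}
apply_op(persona, "ADD", ("preferences", "diet"), "vegetarian")
apply_op(persona, "UPDATE", ("preferences", "diet"), "vegan")
apply_op(persona, "ADD", ("background", "work", "role"), "nurse")
apply_op(persona, "NO_OP")
```

Rejecting any path that does not start at an existing trunk key is one simple way to keep growth controllable: new information can only attach beneath the predefined categories.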

Title: LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation

Authors: Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.05192
Pdf URL: https://arxiv.org/pdf/2601.05192
Copy Paste: [[2601.05192]] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation(https://arxiv.org/abs/2601.05192)
Keywords: language model, llm
Abstract: Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.

Title: Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

Authors: P. Gilda (1), P. Dungarwal (1), A. Thongkham (1), E. T. Ajayi (2), S. Choudhary (1), T. M. Terol (1), C. Lam (1), J. P. Araujo (1), M. McFadyen-Mungalln (1), L. S. Liebovitch (1), P. T. Coleman (1), H. West (1), K. Sieck (3), S. Carter (3) ((1) Columbia University, (2) St John's University, (3) Toyota Research Institute)
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05232
Pdf URL: https://arxiv.org/pdf/2601.05232
Copy Paste: [[2601.05232]] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence(https://arxiv.org/abs/2601.05232)
Keywords: language model
Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old view most of their news daily through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making viewers angry in order to engage them and increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

Title: GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.05242
Pdf URL: https://arxiv.org/pdf/2601.05242
Copy Paste: [[2601.05242]] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization(https://arxiv.org/abs/2601.05242)
Keywords: language model
Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
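The collapse described in the abstract above can be reproduced with a toy group of rollouts. The sketch below is a simplified illustration, not the paper's exact formulation: it compares z-scoring the summed reward within a group against z-scoring each reward separately and then summing. The function names, the epsilon constant, and the two-reward group are all assumptions made for the example.

```python
from statistics import mean, pstdev

EPS = 1e-8  # avoids division by zero when a reward is constant in the group


def zscore(values):
    """Normalize values to zero mean and unit variance within the group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def summed_then_normalized(rewards):
    """Coupled scheme: add up each rollout's rewards, then normalize the totals."""
    return zscore([sum(r) for r in rewards])


def normalized_then_summed(rewards):
    """Decoupled scheme: normalize each reward across the group, then add."""
    per_reward = [zscore(list(column)) for column in zip(*rewards)]
    return [sum(parts) for parts in zip(*per_reward)]


# Hypothetical group of four rollouts scored on (correctness, format).
group = [(1, 0), (0, 1), (0, 0), (0, 1)]
coupled = summed_then_normalized(group)
decoupled = normalized_then_summed(group)
```

Here the first two rollouts earn different reward combinations, (1, 0) and (0, 1), but the same total, so the coupled scheme assigns them the same advantage. The decoupled scheme separates them because correctness is rarer than format compliance in this group and therefore carries a larger normalized value.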