2025-11-18

Title: TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy

Authors: James McCammon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.11594
Pdf URL: https://arxiv.org/pdf/2511.11594
Copy Paste: [[2511.11594]] TimeStampEval: A Simple LLM Eval and a Little Fuzzy Matching Trick to Improve Search Accuracy(https://arxiv.org/abs/2511.11594)
Keywords: llm, prompt
Abstract: Traditional fuzzy matching often fails when searching for quotes that are semantically identical but syntactically different across documents, a common issue when aligning official written records with speech-to-text transcripts. We introduce TimeStampEval, a benchmark for retrieving precise millisecond timestamps from long transcripts given non-verbatim quotes. Our simple two-stage method dramatically improves retrieval accuracy while cutting inference costs by over 90%. The motivating use case is an automated long-form podcast that assembles Congressional Record clips into AI-hosted narration. The technical challenge: given a sentence-timestamped transcript and a target quote that may differ due to transcription or editorial drift, return exact start and end boundaries. Standard algorithms handle verbatim text but break under fuzzier variants. Evaluating six modern LLMs on a 2,800-sentence (120k-token) transcript revealed four key findings. (1) Prompt design matters more than model choice: placing the query before the transcript and using compact formatting improved accuracy by 3-20 points while reducing token count by 30-40%. (2) Off-by-one errors form a distinct category, showing models understand the task but misplace boundaries. (3) A modest reasoning budget (600-850 tokens) raises accuracy from 37% to 77% for weak setups and to above 90% for strong ones. (4) Our "Assisted Fuzzy" approach (RapidFuzz pre-filtering followed by LLM verification on short snippets) improves fuzzy match accuracy by up to 50 points while halving latency and reducing cost per correct result by up to 96%. Extended tests on ten transcripts (50k-900k tokens, 1989-2025) confirm robustness to transcript length, vocabulary drift, and domain change, maintaining 95-100% rejection accuracy for absent targets.
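Sketch (illustration only, not the paper's code): the two-stage idea in finding (4) can be pictured as a cheap string-similarity pre-filter that narrows the transcript to a few candidate sentences, followed by a verifier that only sees a short snippet. The abstract names RapidFuzz for the first stage; this sketch swaps in the standard library's difflib so that it runs without extra packages. The `toy_verify` function is a hypothetical stand-in for the LLM call, and the transcript, field names, and thresholds are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stdlib stand-in for a RapidFuzz ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prefilter(quote, sentences, top_k=3):
    """Stage 1: rank sentences by similarity to the quote and keep the best few indices."""
    scored = sorted(
        ((similarity(quote, s["text"]), i) for i, s in enumerate(sentences)),
        reverse=True,
    )
    return [i for _, i in scored[:top_k]]


def assisted_fuzzy(quote, sentences, verify, top_k=3, window=1):
    """Stage 2: pass a short snippet around each candidate to the verifier."""
    for i in prefilter(quote, sentences, top_k):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        snippet = sentences[lo:hi]
        hit = verify(quote, snippet)  # (first_idx, last_idx) within snippet, or None
        if hit is not None:
            first, last = hit
            return snippet[first]["start_ms"], snippet[last]["end_ms"]
    return None  # the verifier rejected every candidate: treat the quote as absent


def toy_verify(quote, snippet, threshold=0.7):
    """Hypothetical placeholder for the LLM verification step."""
    best = max(range(len(snippet)), key=lambda j: similarity(quote, snippet[j]["text"]))
    if similarity(quote, snippet[best]["text"]) >= threshold:
        return best, best
    return None


# Made-up sentence-timestamped transcript.
transcript = [
    {"text": "Mr. Speaker, I rise today in support of the bill.", "start_ms": 0, "end_ms": 3200},
    {"text": "This legislation will help working families across the country.", "start_ms": 3200, "end_ms": 7100},
    {"text": "I yield back the balance of my time.", "start_ms": 7100, "end_ms": 9000},
]

found = assisted_fuzzy("This legislation helps working families all across our country.", transcript, toy_verify)
absent = assisted_fuzzy("The weather in Paris was lovely last spring.", transcript, toy_verify)
print(found, absent)
```

Because the verifier only ever receives `2 * window + 1` sentences, its input size does not grow with transcript length, which is the intuition behind the cost reduction the abstract reports.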

Title: MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Authors: MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.11793
Pdf URL: https://arxiv.org/pdf/2511.11793
Copy Paste: [[2511.11793]] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling(https://arxiv.org/abs/2511.11793)
Keywords: gpt, llm, agent
Abstract: We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

Title: On the Notion that Language Models Reason

Authors: Bertram Højer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.11810
Pdf URL: https://arxiv.org/pdf/2511.11810
Copy Paste: [[2511.11810]] On the Notion that Language Models Reason(https://arxiv.org/abs/2511.11810)
Keywords: language model
Abstract: Language models (LMs) are said to be exhibiting reasoning, but what does this entail? We assess definitions of reasoning and how key papers in the field of natural language processing (NLP) use the notion and argue that the definitions provided are not consistent with how LMs are trained, process information, and generate new tokens. To illustrate this incommensurability we assume the view that transformer-based LMs implement an implicit finite-order Markov kernel mapping contexts to conditional token distributions. In this view, reasoning-like outputs correspond to statistical regularities and approximate statistical invariances in the learned kernel rather than the implementation of explicit logical mechanisms. This view is illustrative of the claim that LMs are "statistical pattern matchers" and not genuine reasoners and provides a perspective that clarifies why reasoning-like outputs arise in LMs without any guarantees of logical consistency. This distinction is fundamental to how epistemic uncertainty is evaluated in LMs. We invite a discussion on the importance of how the computational processes of the systems we build and analyze in NLP research are described.
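Sketch (illustration only, not from the paper): a finite-order Markov kernel is a map from the last k tokens to a conditional distribution over the next token. The toy below estimates such a kernel explicitly from counts on a made-up corpus. The abstract's claim concerns an implicit kernel realized by a transformer, so this table is only meant to make the object concrete; `fit_kernel` and the corpus are hypothetical.

```python
from collections import Counter, defaultdict


def fit_kernel(tokens, order=2):
    """Estimate P(next token | previous `order` tokens) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        context = tuple(tokens[i - order:i])
        counts[context][tokens[i]] += 1
    return {
        context: {tok: n / sum(nexts.values()) for tok, n in nexts.items()}
        for context, nexts in counts.items()
    }


corpus = "all men are mortal . socrates is a man . all cats are mortal .".split()

order1 = fit_kernel(corpus, order=1)
order2 = fit_kernel(corpus, order=2)

print(order1[("all",)])           # distribution over what follows "all"
print(order2[("cats", "are")])    # distribution over what follows "cats are"
```

The kernel completes "cats are" with "mortal" purely because that continuation was observed after the same context, with no syllogism represented anywhere. That is the sense in which a reasoning-like output can be a statistical regularity.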

Title: Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis

Authors: Hong-Jun Yoon, Faisal Ashraf, Thomas A. Ruggles, Debjani Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.11821
Pdf URL: https://arxiv.org/pdf/2511.11821
Copy Paste: [[2511.11821]] Scaling Open-Weight Large Language Models for Hydropower Regulatory Information Extraction: A Systematic Analysis(https://arxiv.org/abs/2511.11821)
Keywords: language model, hallucination
Abstract: Information extraction from regulatory documents using large language models presents critical trade-offs between performance and computational resources. We evaluated seven open-weight models (0.6B-70B parameters) on hydropower licensing documentation to provide empirical deployment guidance. Our analysis identified a pronounced 14B parameter threshold where validation methods transition from ineffective (F1 < 0.15) to viable (F1 = 0.64). Consumer-deployable models achieve 64% F1 through appropriate validation, while smaller models plateau at 51%. Large-scale models approach 77% F1 but require enterprise infrastructure. We identified systematic hallucination patterns where perfect recall indicates extraction failure rather than success in smaller models. Our findings establish the first comprehensive resource-performance mapping for open-weight information extraction in regulatory contexts, enabling evidence-based model selection. These results provide immediate value for hydropower compliance while contributing insights into parameter scaling effects that generalize across information extraction tasks.
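Sketch (illustration only, not from the paper): one way perfect recall can signal failure rather than success is over-extraction. A system that emits every candidate value trivially covers the gold set while its precision collapses. The field names and sets below are made up, and the paper's own validation methods are not reproduced here.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and a model that emits every candidate it sees.
gold = {"license_number", "expiry_date"}
over_extraction = gold | {f"spurious_field_{i}" for i in range(8)}

precision, recall, f1 = precision_recall_f1(over_extraction, gold)
print(precision, recall, round(f1, 3))
```

Here recall is 1.0 while precision is 0.2, which is why recall alone is a poor indicator of extraction quality.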

Title: Towards Autoformalization of LLM-generated Outputs for Requirement Verification

Authors: Mihir Gupte, Ramesh S
Subjects: cs.CL, cs.AI, cs.FL, cs.LO
Abstract URL: https://arxiv.org/abs/2511.11829
Pdf URL: https://arxiv.org/pdf/2511.11829
Copy Paste: [[2511.11829]] Towards Autoformalization of LLM-generated Outputs for Requirement Verification(https://arxiv.org/abs/2511.11829)
Keywords: language model, llm
Abstract: Autoformalization, the process of translating informal statements into formal logic, has gained renewed interest with the emergence of powerful Large Language Models (LLMs). While LLMs show promise in generating structured outputs from natural language (NL), such as Gherkin Scenarios from NL feature requirements, there's currently no formal method to verify if these outputs are accurate. This paper takes a preliminary step toward addressing this gap by exploring the use of a simple LLM-based autoformalizer to verify LLM-generated outputs against a small set of natural language requirements. We conducted two distinct experiments. In the first one, the autoformalizer successfully identified that two differently-worded NL requirements were logically equivalent, demonstrating the pipeline's potential for consistency checks. In the second, the autoformalizer was used to identify a logical inconsistency between a given NL requirement and an LLM-generated output, highlighting its utility as a formal verification tool. Our findings, while limited, suggest that autoformalization holds significant potential for ensuring the fidelity and logical consistency of LLM-generated outputs, laying a crucial foundation for future, more extensive studies into this novel application.

Title: Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

Authors: Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.11867
Pdf URL: https://arxiv.org/pdf/2511.11867
Copy Paste: [[2511.11867]] Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches(https://arxiv.org/abs/2511.11867)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
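Sketch (illustration only, not the paper's code): a 95% confidence interval for F1 via non-parametric bootstrapping can be obtained by resampling the evaluation set with replacement and reading off percentiles of the resampled scores. This version assumes the simple percentile method and binary labels; the labels, the resample count, and the seed are made up.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 with 1 as the positive class; returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample example indices with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up labels: 1 = follow-up imaging performed, 0 = not performed.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

point = f1_score(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred)
print(point, low, high)
```

With only ten examples the interval is very wide; on a corpus of thousands of reports it would tighten considerably.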

Title: MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Authors: Fernanda Bufon Färber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.11878
Pdf URL: https://arxiv.org/pdf/2511.11878
Copy Paste: [[2511.11878]] MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers(https://arxiv.org/abs/2511.11878)
Keywords: language model, llm
Abstract: While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages, creating a critical barrier for others as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus for Brazilian Portuguese, comprising 384,095 authentic question-answer pairs from patient-doctor interactions. The dataset underwent a meticulous multi-stage curation protocol, using a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries. We further augmented the corpus via LLM-driven annotation, classifying questions into seven semantic types to capture user intent. Our analysis reveals its thematic breadth (3,200 topics) and unique linguistic properties, like the natural asymmetry in patient-doctor communication. To validate its utility, we benchmark a medical specialty routing task: fine-tuning a 1.7B parameter model achieves an outstanding 94% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset's deep semantic richness. We publicly release MedPT to foster the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.

Title: ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts

Authors: Karthikeyan K, Raghuveer Thirukovalluru, David Carlson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.11883
Pdf URL: https://arxiv.org/pdf/2511.11883
Copy Paste: [[2511.11883]] ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts(https://arxiv.org/abs/2511.11883)
Keywords: language model, llm
Abstract: Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), and poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences) and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question-answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2-3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

Title: Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support

Authors: Eric Hua Qing Zhang, Julia Ive
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.11884
Pdf URL: https://arxiv.org/pdf/2511.11884
Copy Paste: [[2511.11884]] Context-Emotion Aware Therapeutic Dialogue Generation: A Multi-component Reinforcement Learning Approach to Language Models for Mental Health Support(https://arxiv.org/abs/2511.11884)
Keywords: language model, gpt, llm
Abstract: Mental health illness represents a substantial global socioeconomic burden, with COVID-19 further exacerbating accessibility challenges and driving increased demand for telehealth mental health support. While large language models (LLMs) offer promising solutions through 24/7 availability and non-judgmental interactions, pre-trained models often lack the contextual and emotional awareness necessary for appropriate therapeutic responses. This paper investigated the application of supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance GPT-2's capacity for therapeutic dialogue generation. The methodology restructured input formats to enable simultaneous processing of contextual information and emotional states alongside user input, employing a multi-component reward function that aligned model outputs with professional therapist responses and annotated emotions. Results demonstrated improvements through reinforcement learning over baseline GPT-2 across multiple evaluation metrics: BLEU (0.0111), ROUGE-1 (0.1397), ROUGE-2 (0.0213), ROUGE-L (0.1317), and METEOR (0.0581). LLM evaluation confirmed high contextual relevance and professionalism, while reinforcement learning achieved 99.34% emotion accuracy compared to 66.96% for baseline GPT-2. These findings demonstrate reinforcement learning's effectiveness in developing therapeutic dialogue systems that can serve as valuable assistive tools for therapists while maintaining essential human clinical oversight.

Title: Additive Large Language Models for Semi-Structured Text

Authors: Karthikeyan K, Raghuveer Thirukovalluru, David Carlson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.11922
Pdf URL: https://arxiv.org/pdf/2511.11922
Copy Paste: [[2511.11922]] Additive Large Language Models for Semi-Structured Text(https://arxiv.org/abs/2511.11922)
Keywords: language model, llm
Abstract: Large Language Models have advanced clinical text classification, but their opaque predictions remain a critical barrier to practical adoption in research and clinical settings where investigators and physicians need to understand which parts of a patient's record drive risk signals. To address this challenge, we introduce CALM, short for Classification with Additive Large Language Models, an interpretable framework for semi-structured text where inputs are composed of semantically meaningful components, such as sections of an admission note or question-answer fields from an intake form. CALM predicts outcomes as the additive sum of each component's contribution, making these contributions part of the forward computation itself and enabling faithful explanations at both the patient and population level. The additive structure also enables clear visualizations, such as component-level risk curves similar to those used in generalized additive models, making the learned relationships easier to inspect and communicate. Although CALM expects semi-structured inputs, many clinical documents already have this form, and similar structure can often be automatically extracted from free-text notes. CALM achieves performance comparable to conventional LLM classifiers while improving trust, supporting quality-assurance checks, and revealing clinically meaningful patterns during model development and auditing.
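Sketch (illustration only, not the paper's code): the additive structure can be pictured as one scalar contribution per input component, summed with a bias and passed through an output link. Here a hypothetical keyword scorer stands in for the per-component language model, and a logistic link is assumed. The weights, the note, and the function names are all made up.

```python
import math

# Hypothetical keyword weights standing in for a learned per-component scorer.
TOY_WEIGHTS = {"sepsis": 1.5, "ventilator": 1.0, "stable": -1.0}


def component_score(text):
    """Scalar contribution of one component (placeholder for a model forward pass)."""
    lowered = text.lower()
    return sum(weight for keyword, weight in TOY_WEIGHTS.items() if keyword in lowered)


def additive_predict(components, bias=-1.0):
    """Risk = sigmoid(bias + sum of per-component contributions)."""
    contributions = {name: component_score(text) for name, text in components.items()}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


# Made-up semi-structured admission note.
note = {
    "chief_complaint": "Suspected sepsis",
    "respiratory": "On ventilator",
    "vitals": "Stable overnight",
}

risk, parts = additive_predict(note)
print(round(risk, 4), parts)
```

Because the prediction is literally the sum of the entries in `parts`, the explanation is the computation itself rather than a post-hoc attribution, which is the property the abstract describes as faithful.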

Title: InData: Towards Secure Multi-Step, Tool-Based Data Analysis

Authors: Karthikeyan K, Raghuveer Thirukovalluru, Bhuwan Dhingra, David Edwin Carlson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.11933
Pdf URL: https://arxiv.org/pdf/2511.11933
Copy Paste: [[2511.11933]] InData: Towards Secure Multi-Step, Tool-Based Data Analysis(https://arxiv.org/abs/2511.11933)
Keywords: language model, gpt, llm, agent
Abstract: Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: restrict LLMs from direct code generation and data access, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To reduce this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels (Easy, Medium, and Hard), capturing increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
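Sketch (illustration only, not the InData toolset): restricting a model to predefined tools can be pictured as an executor that accepts a plan of named tool calls and refuses anything outside an allow-list, so the model never writes code or touches the data directly. The tool names, the plan format, and the table are hypothetical.

```python
# Allow-list of vetted tools; each takes the previous step's result as its first argument.
TOOLS = {
    "filter_eq": lambda rows, col, val: [r for r in rows if r[col] == val],
    "column": lambda rows, col: [r[col] for r in rows],
    "mean": lambda values: sum(values) / len(values),
}


def run_plan(plan, table):
    """Execute (tool_name, kwargs) steps in order, chaining each result into the next."""
    result = table
    for name, kwargs in plan:
        if name not in TOOLS:
            raise PermissionError(f"tool not allowed: {name}")
        result = TOOLS[name](result, **kwargs)
    return result


# Made-up table and a three-step plan such as a model might propose for
# "What is the mean length of stay in the ICU?"
table = [
    {"dept": "ICU", "los": 4},
    {"dept": "ICU", "los": 6},
    {"dept": "ER", "los": 1},
]
plan = [
    ("filter_eq", {"col": "dept", "val": "ICU"}),
    ("column", {"col": "los"}),
    ("mean", {}),
]

answer = run_plan(plan, table)
print(answer)
```

Harder questions correspond to longer plans in which each step depends on earlier results, which is the kind of multi-step composition the benchmark is designed to stress.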

Title: Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization

Authors: Hadi Sheikhi, Chenyang Huang, Osmar R. Zaïane
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.11946
Pdf URL: https://arxiv.org/pdf/2511.11946
Copy Paste: [[2511.11946]] Improving LLM's Attachment to External Knowledge In Dialogue Generation Tasks Through Entity Anonymization(https://arxiv.org/abs/2511.11946)
Keywords: language model, llm
Abstract: Knowledge graph-based dialogue generation (KG-DG) is a challenging task requiring models to effectively incorporate external knowledge into conversational responses. While large language models (LLMs) have achieved impressive results across various NLP tasks, their ability to utilize external knowledge in KG-DG remains under-explored. We observe that LLMs often rely on internal knowledge, leading to detachment from provided knowledge graphs, even when they are given a flawlessly retrieved knowledge graph. First, we introduce LLM-KAT, an evaluation procedure for measuring knowledge attachment in generated responses. Second, we propose a simple yet effective entity anonymization technique to encourage LLMs to better leverage external knowledge. Experiments on the OpenDialKG dataset demonstrate that our approach improves LLMs' attachment on external knowledge.
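Sketch (illustration only, the paper's exact procedure is not given in the abstract): entity anonymization can be pictured as replacing each entity in the knowledge triples and the dialogue with a neutral placeholder before prompting, then mapping placeholders back in the model's reply. The placeholder format, the helper names, and the example are assumptions.

```python
def anonymize(triples, utterance):
    """Replace every head/tail entity with a stable placeholder such as ENTITY_1."""
    mapping = {}

    def placeholder(entity):
        if entity not in mapping:
            mapping[entity] = f"ENTITY_{len(mapping) + 1}"
        return mapping[entity]

    anon_triples = [(placeholder(h), rel, placeholder(t)) for h, rel, t in triples]
    anon_utterance = utterance
    # Replace longer names first so an entity contained in another is not clobbered.
    for entity in sorted(mapping, key=len, reverse=True):
        anon_utterance = anon_utterance.replace(entity, mapping[entity])
    return anon_triples, anon_utterance, mapping


def deanonymize(text, mapping):
    """Map placeholders back to entity names (longest placeholder first)."""
    for entity, tag in sorted(mapping.items(), key=lambda kv: len(kv[1]), reverse=True):
        text = text.replace(tag, entity)
    return text


triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "directed", "Interstellar"),
]
anon_triples, anon_question, mapping = anonymize(triples, "Who directed Inception?")

# Hypothetical model reply written in terms of placeholders.
reply = deanonymize("ENTITY_2 directed ENTITY_1.", mapping)
print(anon_question, "->", reply)
```

With names hidden, the model cannot answer from memorized facts about the entities and has to read the relation off the supplied triples.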

Title: On the Entropy Calibration of Language Models

Authors: Steven Cao, Gregory Valiant, Percy Liang
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.11966
Pdf URL: https://arxiv.org/pdf/2511.11966
Copy Paste: [[2511.11966]] On the Entropy Calibration of Language Models(https://arxiv.org/abs/2511.11966)
Keywords: language model
Abstract: We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
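Sketch (illustration only, a single-step toy rather than the paper's multi-step setting): entropy calibration compares the entropy of what the model generates with its log loss on human text. Below, a made-up model distribution is flatter than the made-up human distribution, so its entropy exceeds its log loss. Top-k truncation lowers the entropy but drives the log loss to infinity, because a token humans sometimes use now has zero probability.

```python
import math


def entropy(model):
    """Entropy (in nats) of the model's own next-token distribution."""
    return -sum(p * math.log(p) for p in model.values() if p > 0)


def log_loss(human, model):
    """Expected negative log-likelihood of human tokens under the model."""
    total = 0.0
    for token, p in human.items():
        if p == 0:
            continue
        q = model.get(token, 0.0)
        if q == 0:
            return math.inf  # the model assigns zero mass to a token humans use
        total -= p * math.log(q)
    return total


def truncate_top_k(model, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(model.items(), key=lambda kv: kv[1], reverse=True)[:k]
    mass = sum(p for _, p in kept)
    return {token: p / mass for token, p in kept}


# Made-up single-step distributions.
human = {"a": 0.7, "b": 0.2, "c": 0.1}
model = {"a": 0.5, "b": 0.3, "c": 0.2}
truncated = truncate_top_k(model, 2)

print(entropy(model), log_loss(human, model))
print(entropy(truncated), log_loss(human, truncated))
```

The second line of output shows the trade-off the abstract describes: truncation reduces entropy at the cost of log loss.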

Title: A Reasoning Paradigm for Named Entity Recognition

Authors: Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.11978
Pdf URL: https://arxiv.org/pdf/2511.11978
Copy Paste: [[2511.11978]] A Reasoning Paradigm for Named Entity Recognition(https://arxiv.org/abs/2511.11978)
Keywords: gpt, llm
Abstract: Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This "cognitive shortcutting" leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contain task-relevant reasoning chains. Then, they are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our codes are available at this https URL.

Title: Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations

Authors: Eunkyu Park, Wesley Hanwen Deng, Vasudha Varadarajan, Mingxi Yan, Gunhee Kim, Maarten Sap, Motahhare Eslami
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.12001
Pdf URL: https://arxiv.org/pdf/2511.12001
Copy Paste: [[2511.12001]] Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations(https://arxiv.org/abs/2511.12001)
Keywords: language model, chain-of-thought
Abstract: Explanations are often promoted as tools for transparency, but they can also foster confirmation bias; users may assume reasoning is correct whenever outputs appear acceptable. We study this double-edged role of Chain-of-Thought (CoT) explanations in multimodal moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Specifically, we analyze reasoning errors in vision language models (VLMs) and how they impact user trust and the ability to detect errors. Our findings reveal two key effects: (1) users often equate trust with outcome agreement, sustaining reliance even when reasoning is flawed, and (2) the confident tone suppresses error detection while maintaining reliance, showing that delivery styles can override correctness. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust. All code will be released publicly.

Title: CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs

Authors: Truong Vo, Sanmi Koyejo
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.12014
Pdf URL: https://arxiv.org/pdf/2511.12014
Copy Paste: [[2511.12014]] CURE: Cultural Understanding and Reasoning Evaluation - A Framework for "Thick" Culture Alignment Evaluation in LLMs(https://arxiv.org/abs/2511.12014)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in culturally diverse environments, yet existing evaluations of cultural competence remain limited. Existing methods focus on de-contextualized correctness or forced-choice judgments, overlooking the cultural understanding and reasoning required for appropriate responses. To address this gap, we introduce a set of benchmarks that, instead of directly probing abstract norms or isolated statements, present models with realistic situational contexts that require culturally grounded reasoning. In addition to the standard Exact Match metric, we introduce four complementary metrics (Coverage, Specificity, Connotation, and Coherence) to capture different dimensions of a model's response quality. Empirical analysis across frontier models reveals that thin evaluation systematically overestimates cultural competence and produces unstable assessments with high variance. In contrast, thick evaluation exposes differences in reasoning depth, reduces variance, and provides more stable, interpretable signals of cultural understanding.

Title: LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models

Authors: Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12116
Pdf URL: https://arxiv.org/pdf/2511.12116
Copy Paste: [[2511.12116]] LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models(https://arxiv.org/abs/2511.12116)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM's training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.

Title: PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection

Authors: Bingbing Wang, Zhixin Bai, Zhengda Jin, Zihan Wang, Xintong Song, Jingjie Lin, Sixuan Li, Jing Li, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12130
Pdf URL: https://arxiv.org/pdf/2511.12130
Copy Paste: [[2511.12130]] PRISM of Opinions: A Persona-Reasoned Multimodal Framework for User-centric Conversational Stance Detection(https://arxiv.org/abs/2511.12130)
Keywords: chain-of-thought
Abstract: The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users' attitudes toward specific targets within complex discussions. However, existing studies remain limited by: **1) pseudo-multimodality**, where visual cues appear only in source posts while comments are treated as text-only, misaligning with real-world multimodal interactions; and **2) user homogeneity**, where diverse users are treated uniformly, neglecting personal traits that shape stance expression. To address these issues, we introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose **PRISM**, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within conversational context via Chain-of-Thought to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism is employed to jointly optimize stance detection and stance-aware response generation for bidirectional knowledge transfer. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.

Title: AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

Authors: Qingyu Zhang, Chunlei Xin, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Qing Ye, Qianlong Xie, Xingxing Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12133
Pdf URL: https://arxiv.org/pdf/2511.12133
Copy Paste: [[2511.12133]] AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing(https://arxiv.org/abs/2511.12133)
Keywords: language model, llm, hallucination, agent
Abstract: Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). A lack of task-specific data often limits previous works, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios.

Title: Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

Authors: Pinxue Guo, Chongruo Wu, Xinyu Zhou, Lingyi Hong, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Sen-ching Samson Cheung, Wei Zhang, Wenqiang Zhang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.12140
Pdf URL: https://arxiv.org/pdf/2511.12140
Copy Paste: [[2511.12140]] Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding(https://arxiv.org/abs/2511.12140)
Keywords: language model, gpt, llm, hallucination
Abstract: Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of "Seeing is Believing", we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLM-generated responses with visual inputs, by leveraging a pixel-level Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2-HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute-, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2-HalBench, even rivaling GPT-4o's capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All codes, data, and models are available at this https URL.

Title: CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

Authors: Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, Dongbin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12159
Pdf URL: https://arxiv.org/pdf/2511.12159
Copy Paste: [[2511.12159]] CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic(https://arxiv.org/abs/2511.12159)
Keywords: language model, llm, agent
Abstract: Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.

Title: MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues

Authors: Liang Xue, Haoyu Liu, Yajun Tian, Xinyu Zhong, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12213
Pdf URL: https://arxiv.org/pdf/2511.12213
Copy Paste: [[2511.12213]] MME-RAG: Multi-Manager-Expert Retrieval-Augmented Generation for Fine-Grained Entity Recognition in Task-Oriented Dialogues(https://arxiv.org/abs/2511.12213)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Fine-grained entity recognition is crucial for reasoning and decision-making in task-oriented dialogues, yet current large language models (LLMs) continue to face challenges in domain adaptation and retrieval controllability. We introduce MME-RAG, a Multi-Manager-Expert Retrieval-Augmented Generation framework that decomposes entity recognition into two coordinated stages: type-level judgment by lightweight managers and span-level extraction by specialized experts. Each expert is supported by a KeyInfo retriever that injects semantically aligned, few-shot exemplars during inference, enabling precise and domain-adaptive extraction without additional training. Experiments on CrossNER, MIT-Movie, MIT-Restaurant, and our newly constructed multi-domain customer-service dataset demonstrate that MME-RAG performs better than recent baselines in most domains. Ablation studies further show that both the hierarchical decomposition and KeyInfo-guided retrieval are key drivers of robustness and cross-domain generalization, establishing MME-RAG as a scalable and interpretable solution for adaptive dialogue understanding.

Title: Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts

Authors: Raavi Gupta, Pranav Hari Panicker, Sumit Bhatia, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.12236
Pdf URL: https://arxiv.org/pdf/2511.12236
Copy Paste: [[2511.12236]] Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts(https://arxiv.org/abs/2511.12236)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings where the model access is restricted or constrained by resources typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets that cover both the generation of factual texts and the open generation shows that CONFACTCHECK can detect hallucinated facts efficiently using fewer resources and achieves higher accuracy scores compared to existing baselines that operate under similar conditions. Our code is available here.

Title: Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

Authors: Ivan Zakazov, Alexander Sharipov, Berke Argin, Oussama Gabouj, Kamel Charaf, Alexi Semiz, Lorenzo Drudi, Nicolas Baldwin, Robert West
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.12281
Pdf URL: https://arxiv.org/pdf/2511.12281
Copy Paste: [[2511.12281]] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor(https://arxiv.org/abs/2511.12281)
Keywords: language model, gpt, llm, prompt
Abstract: Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models' compression ability in terms of (i) preserving semantically important information and (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM, Qwen3-4B, and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of adhering to the CR and maximizing downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr's generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.

Title: Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering

Authors: Naoya Sugiura, Kosuke Yamada, Yasuhiro Ogawa, Katsuhiko Toyama, Ryohei Sasano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12300
Pdf URL: https://arxiv.org/pdf/2511.12300
Copy Paste: [[2511.12300]] Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering(https://arxiv.org/abs/2511.12300)
Keywords: llm, prompt
Abstract: LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and the correct response rate of humans, then prompt LLMs to answer the quizzes under several settings, and compare their correct answer rate to that of humans from two analytical perspectives. The experimental results show that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.

Title: Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

Authors: Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, Kevin Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12381
Pdf URL: https://arxiv.org/pdf/2511.12381
Copy Paste: [[2511.12381]] Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load(https://arxiv.org/abs/2511.12381)
Keywords: language model, llm, prompt
Abstract: Negation instructions such as 'do not mention X' can paradoxically increase the accessibility of X in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. (1) Load & content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. (2) Polarity separation: we test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.

Title: From Phonemes to Meaning: Evaluating Large Language Models on Tamil

Authors: Jeyarajalingam Varsha, Menan Velayuthan, Sumirtha Karunakaran, Rasan Nivethiga, Kengatharaiyer Sarveswaran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12387
Pdf URL: https://arxiv.org/pdf/2511.12387
Copy Paste: [[2511.12387]] From Phonemes to Meaning: Evaluating Large Language Models on Tamil(https://arxiv.org/abs/2511.12387)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown strong generalization across tasks in high-resource languages; however, their linguistic competence in low-resource and morphologically rich languages such as Tamil remains largely unexplored. Existing multilingual benchmarks often rely on translated English datasets, failing to capture the linguistic and cultural nuances of the target language. To address this gap, we introduce ILAKKANAM, the first Tamil-specific linguistic evaluation benchmark manually curated using 820 questions from Sri Lankan school-level Tamil subject examination papers. Each question is annotated by trained linguists under five linguistic categories and a factual knowledge category, spanning Grades 1-13 to ensure broad linguistic coverage. We evaluate both closed-source and open-source LLMs using a standardized evaluation framework. Our results show that Gemini 2.5 achieves the highest overall performance, while open-source models lag behind, highlighting the gap in linguistic grounding. Category- and grade-wise analyses reveal that all models perform well on lower-grade questions but show a clear decline as linguistic complexity increases. Further, no strong correlation is observed between a model's overall performance and its ability to identify linguistic categories, suggesting that performance may be driven by exposure rather than genuine understanding.

Title: Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

Authors: Chenglong Wang, Yifu Huo, Yang Gan, Yongyu Mu, Qiaozhi He, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Anxiang Ma, Zhengtao Yu, Jingbo Zhu, Tong Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12464
Pdf URL: https://arxiv.org/pdf/2511.12464
Copy Paste: [[2511.12464]] Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models(https://arxiv.org/abs/2511.12464)
Keywords: language model, llm
Abstract: Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models (LLMs), making it a reliable reference for developing advanced reward models. Our analysis of MRMBench evaluation results reveals that reward models often struggle to capture preferences across multiple dimensions, highlighting the potential of multi-objective optimization in reward modeling. Additionally, our findings show that the proposed inference-time probing method offers a reliable metric for assessing the confidence of reward predictions, which ultimately improves the alignment of LLMs.

Title: Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing

Authors: Mengying Wang, Chenhui Ma, Ao Jiao, Tuo Liang, Pengjun Lu, Shrinidhi Hegde, Yu Yin, Evren Gurkan-Cavusoglu, Yinghui Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12472
Pdf URL: https://arxiv.org/pdf/2511.12472
Copy Paste: [[2511.12472]] Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing(https://arxiv.org/abs/2511.12472)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have greatly advanced knowledge graph question answering (KGQA), yet existing systems are typically optimized for returning highly relevant but predictable answers. A missing yet desired capacity is to exploit LLMs to suggest surprising and novel ("serendipitous") answers. In this paper, we formally define the serendipity-aware KGQA task and propose the SerenQA framework to evaluate LLMs' ability to uncover unexpected insights in scientific KGQA tasks. SerenQA includes a rigorous serendipity metric based on relevance, novelty, and surprise, along with an expert-annotated benchmark derived from the Clinical Knowledge Graph, focused on drug repurposing. Additionally, it features a structured evaluation pipeline encompassing three subtasks: knowledge retrieval, subgraph reasoning, and serendipity exploration. Our experiments reveal that while state-of-the-art LLMs perform well on retrieval, they still struggle to identify genuinely surprising and valuable discoveries, underscoring significant room for future improvements. Our curated resources and extended version are released at: this https URL.

Title: SGuard-v1: Safety Guardrail for Large Language Models

Authors: JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee, Juree Seok
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.12497
Pdf URL: https://arxiv.org/pdf/2511.12497
Copy Paste: [[2511.12497]] SGuard-v1: Safety Guardrail for Large Language Models(https://arxiv.org/abs/2511.12497)
Keywords: language model, llm, prompt
Abstract: We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two components according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.

Title: TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

Authors: Jie Zhang, Bo Tang, Wanzi Shao, Wenqiang Wei, Jihao Zhao, Jianqing Zhu, Zhiyu li, Wen Xi, Zehao Lin, Feiyu Xiong, Yanchao Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12520
Pdf URL: https://arxiv.org/pdf/2511.12520
Copy Paste: [[2511.12520]] TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction(https://arxiv.org/abs/2511.12520)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

Title: Mitigating Length Bias in RLHF through a Causal Lens

Authors: Hyeonji Kim, Sujeong Oh, Sanghack Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12573
Pdf URL: https://arxiv.org/pdf/2511.12573
Copy Paste: [[2511.12573]] Mitigating Length Bias in RLHF through a Causal Lens(https://arxiv.org/abs/2511.12573)
Keywords: language model, llm
Abstract: Reinforcement learning from human feedback (RLHF) is widely used to align large language models (LLMs) with human preferences. However, RLHF-trained reward models often exhibit length bias -- a systematic tendency to favor longer responses by conflating verbosity with quality. We propose a causal framework for analyzing and mitigating length bias in RLHF reward modeling. Central to our approach is a counterfactual data augmentation method that generates response pairs designed to isolate content quality from verbosity. These counterfactual examples are then used to train the reward model, enabling it to assess responses based on content quality independently of verbosity. Specifically, we construct (1) length-divergent pairs with similar content and (2) content-divergent pairs of similar length. Empirical evaluations show that our method reduces length bias in reward assignment and leads to more concise, content-focused outputs from the policy model. These findings demonstrate that the proposed approach effectively reduces length bias and improves the robustness and content sensitivity of reward modeling in RLHF pipelines.

Title: MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

Authors: Pu-Hai Yang, Heyan Huang, Heng-Da Xu, Fanshu Sun, Xian-Ling Mao, Chaoxu Mu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12586
Pdf URL: https://arxiv.org/pdf/2511.12586
Copy Paste: [[2511.12586]] MMWOZ: Building Multimodal Agent for Task-oriented Dialogue(https://arxiv.org/abs/2511.12586)
Keywords: agent
Abstract: Task-oriented dialogue systems have garnered significant attention due to their conversational ability to accomplish goals, such as booking airline tickets for users. Traditionally, task-oriented dialogue systems are conceptualized as intelligent agents that interact with users using natural language and have access to customized back-end APIs. However, in real-world scenarios, the widespread presence of front-end Graphical User Interfaces (GUIs) and the absence of customized back-end APIs create a significant gap for traditional task-oriented dialogue systems in practical applications. In this paper, to bridge the gap, we collect MMWOZ, a new multimodal dialogue dataset that is extended from the MultiWOZ 2.3 dataset. Specifically, we begin by developing a web-style GUI to serve as the front-end. Next, we devise an automated script to convert the dialogue states and system actions from the original dataset into operation instructions for the GUI. Lastly, we collect snapshots of the web pages along with their corresponding operation instructions. In addition, we propose a novel multimodal model called MATE (Multimodal Agent for Task-oriEnted dialogue) as the baseline model for the MMWOZ dataset. Furthermore, we conduct comprehensive experimental analysis using MATE to investigate the construction of a practical multimodal agent for task-oriented dialogue.

Title: Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

Authors: Oron Anschel, Alon Shoshan, Adam Botach, Shunit Haviv Hakimi, Asaf Gendler, Emanuel Ben Baruch, Nadav Bhonker, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.12596
Pdf URL: https://arxiv.org/pdf/2511.12596
Copy Paste: [[2511.12596]] Group-Aware Reinforcement Learning for Output Diversity in Large Language Models(https://arxiv.org/abs/2511.12596)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
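A frequency-aware group reward of the kind the abstract mentions can be illustrated with a small sketch. The abstract does not give the reward formula, so the rule below (a valid completion earns one minus its share of the group, an invalid one earns zero) and the names `frequency_aware_rewards`, `group`, and `valid` are assumptions for illustration, not the GAPO definition.

```python
from collections import Counter
from typing import List, Set


def frequency_aware_rewards(group: List[str], valid: Set[str]) -> List[float]:
    """Score every completion in a sampled group at once.

    Hypothetical rule (not taken from the paper): a valid completion earns
    1 - (its count / group size), so answers that dominate the group earn
    less than rare ones; invalid completions earn 0.
    """
    if not group:
        return []
    counts = Counter(c for c in group if c in valid)
    n = len(group)
    return [1.0 - counts[c] / n if c in valid else 0.0 for c in group]


rewards = frequency_aware_rewards(["a", "a", "b", "x"], {"a", "b", "c"})
```

The point of the sketch is that the reward for one completion depends on the whole group, which a per-sample reward cannot express.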

Title: Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

Authors: Maoqi Liu, Quan Fang, Yang Yang, Can Zhao, Kaiquan Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12630
Pdf URL: https://arxiv.org/pdf/2511.12630
Copy Paste: [[2511.12630]] Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing(https://arxiv.org/abs/2511.12630)
Keywords: llm, prompt, agent
Abstract: Notice to Air Missions (NOTAMs) serve as a critical channel for disseminating key flight safety information, yet their complex linguistic structures and implicit reasoning pose significant challenges for automated parsing. Existing research mainly focuses on surface-level tasks such as classification and named entity recognition, lacking deep semantic understanding. To address this gap, we propose NOTAM semantic parsing, a task emphasizing semantic inference and the integration of aviation domain knowledge to produce structured, inference-rich outputs. To support this task, we construct Knots (Knowledge and NOTAM Semantics), a high-quality dataset of 12,347 expert-annotated NOTAMs covering 194 Flight Information Regions, enhanced through a multi-agent collaborative framework for comprehensive field discovery. We systematically evaluate a wide range of prompt-engineering strategies and model-adaptation techniques, achieving substantial improvements in aviation text understanding and processing. Our experimental results demonstrate the effectiveness of the proposed approach and offer valuable insights for automated NOTAM analysis systems. Our code is available at: this https URL.

Title: Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing

Authors: Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12661
Pdf URL: https://arxiv.org/pdf/2511.12661
Copy Paste: [[2511.12661]] Reason-KE++: Aligning the Process, Not Just the Outcome, for Faithful LLM Knowledge Editing(https://arxiv.org/abs/2511.12661)
Keywords: language model, llm, hallucination
Abstract: Aligning Large Language Models (LLMs) to be faithful to new knowledge in complex, multi-hop reasoning tasks is a critical, yet unsolved, challenge. We find that SFT-based methods, e.g., Reason-KE, while state-of-the-art, suffer from a "faithfulness gap": they optimize for format mimicry rather than sound reasoning. This gap enables the LLM's powerful parametric priors to override new contextual facts, resulting in critical factual hallucinations (e.g., incorrectly reasoning "Houston" from "NASA" despite an explicit edit). To solve this core LLM alignment problem, we propose Reason-KE++, an SFT+RL framework that instills process-level faithfulness. Its core is a Stage-aware Reward mechanism that provides dense supervision for intermediate reasoning steps (e.g., Decomposition, Sub-answer Correctness). Crucially, we identify that naive outcome-only RL is a deceptive trap for LLM alignment: it collapses reasoning integrity (e.g., 19.00% Hop acc) while superficially boosting final accuracy. Our process-aware framework sets a new SOTA of 95.48% on MQUAKE-CF-3k (+5.28%), demonstrating that for complex tasks, aligning the reasoning process is essential for building trustworthy LLMs.

Title: Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Authors: Sina Rashidi, Hossein Sameti
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.12690
Pdf URL: https://arxiv.org/pdf/2511.12690
Copy Paste: [[2511.12690]] Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data(https://arxiv.org/abs/2511.12690)
Keywords: language model
Abstract: Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.

Title: Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Authors: Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2511.12710
Pdf URL: https://arxiv.org/pdf/2511.12710
Copy Paste: [[2511.12710]] Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs(https://arxiv.org/abs/2511.12710)
Keywords: language model, llm, prompt, agent
Abstract: Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet they share a fundamental limitation: their jailbreak logic is confined to selecting, combining, or refining pre-existing attack strategies. This binds their creativity and leaves them unable to autonomously invent entirely new attack mechanisms. To overcome this gap, we introduce EvoSynth, an autonomous framework that shifts the paradigm from attack planning to the evolutionary synthesis of jailbreak methods. Instead of refining prompts, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute novel, code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite its own attack logic in response to failure. Through extensive experiments, we demonstrate that EvoSynth not only establishes a new state-of-the-art by achieving an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5, but also generates attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research in this new direction of evolutionary synthesis of jailbreak methods. Code is available at: this https URL.

Title: Adaptive Focus Memory for Language Models

Authors: Christopher Cruz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12712
Pdf URL: https://arxiv.org/pdf/2511.12712
Copy Paste: [[2511.12712]] Adaptive Focus Memory for Language Models(https://arxiv.org/abs/2511.12712)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, but their behavior is still bottlenecked by fixed context windows and naive memory strategies. Replaying the full conversation at every turn is simple but expensive, while static summarization or recency-only heuristics often erase safety-critical user details. We present Adaptive Focus Memory (AFM), a dynamic context manager that assigns each past message one of three fidelity levels -- FULL, COMPRESSED, or PLACEHOLDER -- based on semantic similarity to the current query, half-life recency weighting, and importance classification. AFM packs messages chronologically under a strict token budget, preferring high fidelity for the most relevant turns while aiming to preserve a cheap trace of the dialogue. In a safety-oriented benchmark involving a user with a severe peanut allergy planning a trip to Thailand, AFM retains the allergy across both short and medium-length conversations, matches the safety performance of naive replay, and cuts average token usage by 66% relative to a replay baseline. We release a modular Python implementation of AFM designed for OpenAI-compatible APIs and offline operation, enabling practitioners to reduce inference cost without sacrificing safety or factual continuity in the evaluated scenario.
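The three signals named in the abstract (semantic similarity, half-life recency weighting, importance classification) can be combined into a fidelity decision along the following lines. This is a minimal sketch: the abstract does not say how AFM combines the signals, so the multiplicative score, the thresholds `full_at` and `compressed_at`, the default `half_life`, and the function names are all hypothetical.

```python
from enum import Enum


class Fidelity(Enum):
    FULL = "full"
    COMPRESSED = "compressed"
    PLACEHOLDER = "placeholder"


def recency_weight(age_turns: int, half_life: float) -> float:
    """Half-life decay: the weight halves every `half_life` turns."""
    return 0.5 ** (age_turns / half_life)


def assign_fidelity(
    similarity: float,          # similarity to the current query, assumed in [0, 1]
    age_turns: int,             # how many turns ago the message was sent
    important: bool,            # output of some importance classifier (assumed given)
    half_life: float = 4.0,     # hypothetical default
    full_at: float = 0.5,       # hypothetical threshold
    compressed_at: float = 0.2, # hypothetical threshold
) -> Fidelity:
    # Assumption: messages flagged important are always kept verbatim.
    if important:
        return Fidelity.FULL
    score = similarity * recency_weight(age_turns, half_life)
    if score >= full_at:
        return Fidelity.FULL
    if score >= compressed_at:
        return Fidelity.COMPRESSED
    return Fidelity.PLACEHOLDER
```

Under these assumed settings, a highly similar message from eight turns ago drops to COMPRESSED, while a message flagged as important (such as the allergy statement in the benchmark scenario) stays FULL regardless of age.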

Title: On the Brittleness of LLMs: A Journey around Set Membership

Authors: Lea Hergert, Gábor Berend, Mario Szegedy, Gyorgy Turan, Márk Jelasity
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12728
Pdf URL: https://arxiv.org/pdf/2511.12728
Copy Paste: [[2511.12728]] On the Brittleness of LLMs: A Journey around Set Membership(https://arxiv.org/abs/2511.12728)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries -- among the most fundamental forms of reasoning -- using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle, and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
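To make the element-ordering dimension concrete, the sketch below enumerates every ordering of a small set and pairs each prompt with its ground-truth answer. The prompt template follows the example quoted in the abstract; the function name `membership_prompts` and the exhaustive-permutation strategy are assumptions for illustration, not the authors' experimental harness.

```python
from itertools import permutations
from typing import Iterator, List, Tuple


def membership_prompts(element: str, items: List[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (prompt, expected_answer) for every ordering of `items`."""
    expected = element in items
    for order in permutations(items):
        set_text = "{" + ", ".join(order) + "}"
        yield f"Is {element} an element of the set {set_text}?", expected


prompts = list(membership_prompts("apple", ["pear", "apple"]))
```

Because the expected answer is identical across all orderings, any variation in a model's responses over `prompts` isolates sensitivity to ordering alone.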

Title: Evidence of Phase Transitions in Small Transformer-Based Language Models

Authors: Noah Hong, Tao Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12768
Pdf URL: https://arxiv.org/pdf/2511.12768
Copy Paste: [[2511.12768]] Evidence of Phase Transitions in Small Transformer-Based Language Models(https://arxiv.org/abs/2511.12768)
Keywords: language model, gpt, llm
Abstract: Phase transitions have been proposed as the origin of emergent abilities in large language models (LLMs), where new capabilities appear abruptly once models surpass critical thresholds of scale. Prior work, such as that of Wei et al., demonstrated these phenomena under model and data scaling, with transitions revealed after applying a log scale to training compute. In this work, we ask three complementary questions: (1) Are phase transitions unique to large models, or can they also be observed in small transformer-based language models? (2) Can such transitions be detected directly in linear training space, rather than only after log rescaling? and (3) Can these transitions emerge at early stages of training? To investigate, we train a small GPT-style transformer on a character-level corpus and analyze the evolution of vocabulary usage throughout training. We track the average word length, the number of correct versus incorrect words, and shifts in vocabulary diversity. Building on these measures, we apply Poisson and sub-Poisson statistics to quantify how words connect and reorganize. This combined analysis reveals a distinct transition point during training. Notably, these transitions are not apparent in standard loss or validation curves, but become visible through our vocabulary- and statistics-based probes. Our findings suggest that phase-transition reorganizations are a general feature of language model training, observable even in modest models, detectable directly in linear training space, and occurring surprisingly early as coherence emerges. This perspective provides new insight into the nonlinear dynamics of language model training and underscores the importance of tailored metrics for uncovering phase transition behaviors.

Title: LLM Reinforcement in Context

Authors: Thomas Rivasseau
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2511.12782
Pdf URL: https://arxiv.org/pdf/2511.12782
Copy Paste: [[2511.12782]] LLM Reinforcement in Context(https://arxiv.org/abs/2511.12782)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Current Large Language Model alignment research mostly focuses on improving model robustness against adversarial attacks and misbehavior by training on examples and prompting. Research has shown that LLM jailbreak probability increases with the size of the user input or conversation length. There is a lack of appropriate research into means of strengthening alignment which also scale with user input length. We propose interruptions as a possible solution to this problem. Interruptions are control sentences added to the user input approximately every x tokens for some arbitrary x. We suggest that this can be generalized to the Chain-of-Thought process to prevent scheming.
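The interruption mechanism described above (a control sentence added roughly every x tokens of user input) can be sketched in a few lines. Whitespace splitting stands in for a real tokenizer here, and the function name `add_interruptions` and the example control string are hypothetical; the abstract does not specify either.

```python
def add_interruptions(user_input: str, control: str, x: int) -> str:
    """Insert `control` after every x whitespace-delimited tokens.

    Assumption: whitespace tokens approximate model tokens. No control
    sentence is appended after the final token.
    """
    if x <= 0:
        raise ValueError("x must be a positive integer")
    tokens = user_input.split()
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % x == 0 and i < len(tokens):
            out.append(control)
    return " ".join(out)


example = add_interruptions("a b c d e", "[CONTROL]", 2)
```

Since the number of inserted control sentences grows with the input length, the amount of reinforcement scales with the input, which is the property the abstract argues is missing from existing approaches.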

Title: Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Authors: Hayden Moore, Asfahan Shah
Subjects: cs.CL, cs.LO
Abstract URL: https://arxiv.org/abs/2511.12784
Pdf URL: https://arxiv.org/pdf/2511.12784
Copy Paste: [[2511.12784]] Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing(https://arxiv.org/abs/2511.12784)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved (Safarzadeh, Oroojlooyjadid, and Roth 2025). In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F (Zheng, Han, and Polu 2021) and the Lean 4 version of ProofNet (Xin et al. 2024), and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

Title: BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals

Authors: Ruiyu Wang, Yuzhang Xie, Xiao Hu, Carl Yang, Jiaying Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12821
Pdf URL: https://arxiv.org/pdf/2511.12821
Copy Paste: [[2511.12821]] BioMedJImpact: A Comprehensive Dataset and LLM Pipeline for AI Engagement and Scientific Impact Analysis of Biomedical Journals(https://arxiv.org/abs/2511.12821)
Keywords: llm
Abstract: Assessing journal impact is central to scholarly communication, yet existing open resources rarely capture how collaboration structures and artificial intelligence (AI) research jointly shape venue prestige in biomedicine. We present BioMedJImpact, a large-scale, biomedical-oriented dataset designed to advance journal-level analysis of scientific impact and AI engagement. Built from 1.74 million PubMed Central articles across 2,744 journals, BioMedJImpact integrates bibliometric indicators, collaboration features, and LLM-derived semantic indicators for AI engagement. Specifically, the AI engagement feature is extracted through a reproducible three-stage LLM pipeline that we propose. Using this dataset, we analyze how collaboration intensity and AI engagement jointly influence scientific impact across pre- and post-pandemic periods (2016-2019, 2020-2023). Two consistent trends emerge: journals with higher collaboration intensity, particularly those with larger and more diverse author teams, tend to achieve greater citation impact, and AI engagement has become an increasingly strong correlate of journal prestige, especially in quartile rankings. To further validate the three-stage LLM pipeline we proposed for deriving the AI engagement feature, we conduct human evaluation, confirming substantial agreement in AI relevance detection and consistent subfield classification. Together, these contributions demonstrate that BioMedJImpact serves as both a comprehensive dataset capturing the intersection of biomedicine and AI, and a validated methodological framework enabling scalable, content-aware scientometric analysis of scientific impact and innovation dynamics. Code is available at this https URL.

Title: From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation

Authors: Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12832
Pdf URL: https://arxiv.org/pdf/2511.12832
Copy Paste: [[2511.12832]] From Passive to Persuasive: Steering Emotional Nuance in Human-AI Negotiation(https://arxiv.org/abs/2511.12832)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, locating a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable framework and new directions for the study of conversational AI.
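The contrastive-pair step in this abstract can be illustrated with a minimal sketch. It assumes a steering vector is the difference between the mean activations of positive and negative examples, added to a hidden state with a scaling factor. The function names, the toy 3-dimensional activations, and the `alpha` parameter are illustrative assumptions and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations: positive examples minus negative examples."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(activation: Vector, steer: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one activation (hypothetical intervention)."""
    return [a + alpha * s for a, s in zip(activation, steer)]


# Toy activations standing in for hidden states at the chosen layer (made up).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

steer = steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5, 0.5], steer, alpha=0.5)
```

In a real model the activations would be read from, and written back to, a specific layer chosen by the attribution step; that plumbing is omitted here.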

Title: NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation

Authors: Kang Yin, Hye-Bin Shin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.12851
Pdf URL: https://arxiv.org/pdf/2511.12851
Copy Paste: [[2511.12851]] NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation(https://arxiv.org/abs/2511.12851)
Keywords: language model, hallucination
Abstract: Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.

Title: From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Authors: Wenxin Zhu, Andong Chen, Yuchen Song, Kehai Chen, Conghui Zhu, Ziyan Chen, Tiejun Zhao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.12861
Pdf URL: https://arxiv.org/pdf/2511.12861
Copy Paste: [[2511.12861]] From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models(https://arxiv.org/abs/2511.12861)
Keywords: language model, llm, chain-of-thought
Abstract: With the remarkable success of Multimodal Large Language Models (MLLMs) in perception tasks, enhancing their complex reasoning capabilities has emerged as a critical research focus. Existing models still suffer from challenges such as opaque reasoning paths and insufficient generalization ability. Chain-of-Thought (CoT) reasoning, which has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability, holds promise for improving model reasoning capabilities when extended to the multimodal domain. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT). First, it analyzes the background and theoretical motivations for its inception from the perspectives of technical evolution and task demands. Then, it introduces mainstream MCoT methods from three aspects: CoT paradigms, the post-training stage, and the inference stage, while also analyzing their underlying mechanisms. Furthermore, the paper summarizes existing evaluation benchmarks and metrics, and discusses the application scenarios of MCoT. Finally, it analyzes the challenges currently facing MCoT and provides an outlook on its future research directions.

Title: Classification of Hope in Textual Data using Transformer-Based Models

Authors: Chukwuebuka Fortunate Ijezue, Tania-Amanda Fredrick Eneye, Maaz Amjad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.12874
Pdf URL: https://arxiv.org/pdf/2511.12874
Copy Paste: [[2511.12874]] Classification of Hope in Textual Data using Transformer-Based Models(https://arxiv.org/abs/2511.12874)
Keywords: gpt
Abstract: This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (Hope vs. Not Hope) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed the lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.

Title: Visual Room 2.0: Seeing is Not Understanding for MLLMs

Authors: Haokun Li, Yazhou Zhang, Jizhi Ding, Qiuchi Li, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12928
Pdf URL: https://arxiv.org/pdf/2511.12928
Copy Paste: [[2511.12928]] Visual Room 2.0: Seeing is Not Understanding for MLLMs(https://arxiv.org/abs/2511.12928)
Keywords: language model, llm
Abstract: Can multi-modal large language models (MLLMs) truly understand what they can see? Extending Searle's Chinese Room into the multi-modal domain, this paper proposes the Visual Room argument: MLLMs may describe every visual detail precisely yet fail to comprehend the underlying emotions and intentions, namely seeing is not understanding. Building on this, we introduce Visual Room 2.0, a hierarchical benchmark for evaluating perception-cognition alignment of MLLMs. We model human perceptive and cognitive processes across three levels: low, middle, and high, covering 17 representative tasks. The perception component ranges from attribute recognition to scene understanding, while the cognition component extends from textual entailment to causal and social reasoning. The dataset contains 350 multi-modal samples, each with six progressive questions (2,100 in total) spanning perception to cognition. Evaluating 10 state-of-the-art (SoTA) MLLMs, we highlight three key findings: (1) MLLMs exhibit stronger perceptual competence than cognitive ability (8.0%↑); (2) cognition appears not causally dependent on perception-based reasoning; and (3) cognition scales with model size, but perception does not consistently improve with larger variants. This work operationalizes Seeing ≠ Understanding as a testable hypothesis, offering a new paradigm from perceptual processing to cognitive reasoning in MLLMs. Our dataset is available at this https URL.

Title: Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

Authors: Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou, Qingyun Sun, Jianxin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.12991
Pdf URL: https://arxiv.org/pdf/2511.12991
Copy Paste: [[2511.12991]] Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty(https://arxiv.org/abs/2511.12991)
Keywords: language model, llm
Abstract: The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.

Title: AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

Authors: Declan Jackson, William Keating, George Cameron, Micah Hill-Smith
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.13029
Pdf URL: https://arxiv.org/pdf/2511.13029
Copy Paste: [[2511.13029]] AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models(https://arxiv.org/abs/2511.13029)
Keywords: language model, hallucination
Abstract: Existing language model evaluations primarily measure general capabilities, yet reliable use of these models across a range of domains demands factual accuracy and recognition of knowledge gaps. We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
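One formula consistent with the description of the Omniscience Index in this abstract (bounded between -100 and 100, zero when correct and incorrect answers balance, abstentions neither rewarded nor penalized) is sketched below. The exact definition used by the benchmark may differ; the function name and the toy counts are assumptions for illustration.

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per incorrect answer,
    0 per abstention, averaged over all questions and scaled by 100.

    Assumed formula, chosen only to match the properties stated in the abstract.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Two hypothetical models with the same number of correct answers:
# one abstains when unsure, the other guesses and gets those answers wrong.
cautious = omniscience_index(correct=30, incorrect=10, abstained=60)
guessing = omniscience_index(correct=30, incorrect=70, abstained=0)
```

Under this assumed scoring, a model that abstains instead of guessing wrong ends up with the higher index even though both answer the same number of questions correctly.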

Title: Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Authors: Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13043
Pdf URL: https://arxiv.org/pdf/2511.13043
Copy Paste: [[2511.13043]] Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training(https://arxiv.org/abs/2511.13043)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. A key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly-sized open-source models, attaining a 37.0% average pass rate (pass@32). It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: this https URL, this https URL.
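The pass@32 figures quoted in this abstract can be read through the standard pass@k metric. The sketch below uses the common unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled proofs and c the number that verify. Whether the paper uses this estimator or simply draws exactly 32 samples per problem is not stated, so treat this as an assumption.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated attempts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any k samples include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_pass_rate(results: List[Tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy results for three hypothetical problems with 32 samples each.
toy = [(32, 0), (32, 1), (32, 5)]
rate = average_pass_rate(toy, k=32)
```

With n equal to k, as in the toy data, the estimator reduces to a simple check of whether any of the 32 attempts succeeded.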

Title: BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Authors: Chuyuan Li, Giuseppe Carenini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13095
Pdf URL: https://arxiv.org/pdf/2511.13095
Copy Paste: [[2511.13095]] BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models(https://arxiv.org/abs/2511.13095)
Keywords: language model, gpt, llm
Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across discourse lexicon, (multi-)sentential, and documental levels, with 52 individual datasets in total. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as some novel challenges such as discourse particle disambiguation (e.g., "just"), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance in the arithmetic aspect of temporal reasoning, but they struggle with full document reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

Title: Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study

Authors: Zhichao He, Mouxiao Bian, Jianhong Zhu, Jiayuan Chen, Yunqiu Wang, Wenxia Zhao, Tianbin Li, Bing Han, Jie Xu, Junyan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13107
Pdf URL: https://arxiv.org/pdf/2511.13107
Copy Paste: [[2511.13107]] Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study(https://arxiv.org/abs/2511.13107)
Keywords: language model, gpt, llm
Abstract: The Consolidated Standards of Reporting Trials (CONSORT) statement is the global benchmark for transparent and high-quality reporting of randomized controlled trials (RCTs). Manual verification of CONSORT adherence is a laborious, time-intensive process that constitutes a significant bottleneck in peer review and evidence synthesis. This study aimed to systematically evaluate the accuracy and reliability of contemporary LLMs in identifying the adherence of published RCTs to the CONSORT 2010 statement under a zero-shot setting. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The primary outcome was the macro-averaged F1-score for the three-class classification task, supplemented by item-wise performance metrics and qualitative error analysis. Overall model performance was modest. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively, indicating only fair agreement with expert consensus. A striking performance disparity was observed across classes: while most models could identify compliant items with high accuracy (F1 score > 0.850), they struggled profoundly with identifying non-compliant and not applicable items, where F1 scores rarely exceeded 0.400. Notably, some high-profile models like GPT-4o underperformed, achieving a macro F1-score of only 0.521. LLMs show potential as preliminary screening assistants for CONSORT checks, capably identifying well-reported items. However, their current inability to reliably detect reporting omissions or methodological flaws makes them unsuitable for replacing human expertise in the critical appraisal of trial quality.
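The two headline metrics in this abstract, macro-averaged F1 and Cohen's Kappa, can be computed as in the sketch below. The formulas are the standard ones; the three label strings and the eight toy predictions are invented for illustration and do not come from the study's data.

```python
from collections import Counter
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohens_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[l] * pred_counts[l] for l in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


LABELS = ["compliant", "non-compliant", "not applicable"]
# Toy expert labels and model predictions (hypothetical).
expert = ["compliant"] * 4 + ["non-compliant"] * 2 + ["not applicable"] * 2
model = ["compliant"] * 4 + ["compliant", "non-compliant", "compliant", "not applicable"]

f1 = macro_f1(expert, model, LABELS)
kappa = cohens_kappa(expert, model)
```

Because macro F1 averages per-class scores without weighting, a model that over-predicts the majority "compliant" class is pulled down by the two minority classes, which is the pattern the abstract reports.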

Title: Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

Authors: Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang, Ling Tian, Ke Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.13118
Pdf URL: https://arxiv.org/pdf/2511.13118
Copy Paste: [[2511.13118]] Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction(https://arxiv.org/abs/2511.13118)
Keywords: language model, llm, prompt, agent
Abstract: Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs--such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks--retrieval, planning, coding, and verification--each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on this https URL.
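The idea of representing an event schema as an executable class definition, so that a candidate extraction either instantiates cleanly or fails with precise feedback, can be sketched as follows. The `AttackEvent` schema, its fields, and the `validate` helper are hypothetical examples and are not the framework's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class SchemaError(ValueError):
    """Raised when a field value violates the event schema."""


@dataclass
class AttackEvent:
    """Hypothetical event schema: one required trigger plus optional arguments."""
    trigger: str
    attacker: Optional[str] = None
    target: Optional[str] = None
    place: Optional[str] = None

    def __post_init__(self) -> None:
        if not isinstance(self.trigger, str) or not self.trigger.strip():
            raise SchemaError("trigger must be a non-empty string")


def validate(candidate: dict) -> Tuple[Optional[AttackEvent], Optional[str]]:
    """Instantiate the schema from a candidate extraction.

    Returns (event, None) on success or (None, feedback) on failure, so the
    feedback string could be handed back to a generating agent for another try.
    """
    try:
        return AttackEvent(**candidate), None
    except (TypeError, SchemaError) as exc:
        # TypeError covers unknown or missing argument names.
        return None, f"schema violation: {exc}"


ok_event, ok_feedback = validate({"trigger": "bombed", "target": "bridge"})
bad_event, bad_feedback = validate({"trigger": "bombed", "weapon": "missile"})
```

Validation here is deterministic: the same candidate always produces the same acceptance or the same error message, independent of any model call.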

Title: Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

Authors: Sourya Dipta Das, Shubham Kumar, Kuldeep Yadav
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13152
Pdf URL: https://arxiv.org/pdf/2511.13152
Copy Paste: [[2511.13152]] Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels(https://arxiv.org/abs/2511.13152)
Keywords: language model, llm, prompt
Abstract: Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.

Title: Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

Authors: Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13159
Pdf URL: https://arxiv.org/pdf/2511.13159
Copy Paste: [[2511.13159]] Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis(https://arxiv.org/abs/2511.13159)
Keywords: language model, llm, prompt
Abstract: Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.

Title: TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine

Authors: Tianai Huang, Jiayuan Chen, Lu Lu, Pengcheng Chen, Tianbin Li, Bing Han, Wenchao Tang, Jie Xu, Ming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13169
Pdf URL: https://arxiv.org/pdf/2511.13169
Copy Paste: [[2511.13169]] TCM-5CEval: Extended Deep Evaluation Benchmark for LLM's Comprehensive Clinical Research Competence in Traditional Chinese Medicine(https://arxiv.org/abs/2511.13169)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in general domains, yet their application in highly specialized and culturally-rich fields like Traditional Chinese Medicine (TCM) requires rigorous and nuanced evaluation. Building upon prior foundational work such as TCM-3CEval, which highlighted systemic knowledge gaps and the importance of cultural-contextual alignment, we introduce TCM-5CEval, a more granular and comprehensive benchmark. TCM-5CEval is designed to assess LLMs across five critical dimensions: (1) Core Knowledge (TCM-Exam), (2) Classical Literacy (TCM-LitQA), (3) Clinical Decision-making (TCM-MRCD), (4) Chinese Materia Medica (TCM-CMM), and (5) Clinical Non-pharmacological Therapy (TCM-ClinNPT). We conducted a thorough evaluation of fifteen prominent LLMs, revealing significant performance disparities and identifying top-performing models like deepseek_r1 and gemini_2_5_pro. Our findings show that while models exhibit proficiency in recalling foundational knowledge, they struggle with the interpretative complexities of classical texts. Critically, permutation-based consistency testing reveals widespread fragilities in model inference. All evaluated models, including the highest-scoring ones, displayed a substantial performance degradation when faced with varied question option ordering, indicating a pervasive sensitivity to positional bias and a lack of robust understanding. TCM-5CEval not only provides a more detailed diagnostic tool for LLM capabilities in TCM but also exposes fundamental weaknesses in their reasoning stability. To promote further research and standardized comparison, TCM-5CEval has been uploaded to the Medbench platform, joining its predecessor in the "In-depth Challenge for Comprehensive TCM Abilities" special track.
摘要：大语言模型 (LLM) 在一般领域表现出了卓越的能力，但它们在传统中医 (TCM) 等高度专业化和文化丰富的领域的应用需要严格而细致的评估。基于 TCM-3CEval 等先前的基础工作（强调系统知识差距和文化背景协调的重要性），我们引入了 TCM-5CEval，这是一个更细粒度和更全面的基准。 TCM-5CEval 旨在从五个关键维度评估法学硕士：(1) 核心知识 (TCM-Exam)、(2) 古典素养 (TCM-LitQA)、(3) 临床决策 (TCM-MRCD)、(4) 中药学 (TCM-CMM) 和 (5) 临床非药物治疗 (TCM-ClinNPT)。我们对 15 个著名的法学硕士进行了全面评估，揭示了显着的性能差异，并确定了 Deepseek\_r1 和 Gemini\_2\_5\_pro 等表现最佳的模型。我们的研究结果表明，虽然模型在回忆基础知识方面表现出熟练程度，但它们却难以应对经典文本的解释复杂性。至关重要的是，基于排列的一致性测试揭示了模型推理中普遍存在的脆弱性。所有评估的模型，包括得分最高的模型，在面对不同的问题选项排序时都表现出显着的性能下降，这表明对位置偏差普遍敏感且缺乏可靠的理解。 TCM-5CEval 不仅为 TCM 中的 LLM 能力提供了更详细的诊断工具，而且还暴露了其推理稳定性的根本弱点。为推动进一步研究和标准化比对，TCM-5CEval已上传至Medbench平台，与前身一起加入“中医综合能力深度挑战”专项赛道。
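
Illustrative sketch (not from the paper): the abstract does not spell out how permutation-based consistency testing is scored, so the Python below is only one plausible reading. It shows every ordering of a question's options to a model and reports the share of orderings on which the model picks its most frequent answer text. The names `permutation_consistency`, `answer_fn`, `first_option_model`, and `content_model` are hypothetical stand-ins, and the toy "models" are plain functions rather than real LLM calls.

```python
from itertools import permutations
from typing import Callable, Sequence


def permutation_consistency(
    question: str,
    options: Sequence[str],
    answer_fn: Callable[[str, Sequence[str]], int],
) -> float:
    """Share of option orderings on which answer_fn picks its modal answer text.

    answer_fn is a hypothetical model wrapper: it receives the question and the
    options in their displayed order, and returns the index it selects.
    """
    picks = []
    for order in permutations(options):
        chosen_index = answer_fn(question, order)
        picks.append(order[chosen_index])  # record the text, not the position
    modal_pick = max(set(picks), key=picks.count)
    return picks.count(modal_pick) / len(picks)


def first_option_model(question: str, options: Sequence[str]) -> int:
    # Toy model with pure positional bias: always selects the first slot.
    return 0


def content_model(question: str, options: Sequence[str]) -> int:
    # Toy model that tracks content: always finds the same option text.
    return list(options).index("Ginseng")


if __name__ == "__main__":
    opts = ["Licorice", "Ephedra", "Rhubarb", "Ginseng"]
    print(permutation_consistency("Which herb ...?", opts, content_model))
    print(permutation_consistency("Which herb ...?", opts, first_option_model))
```

A model that follows content scores 1.0 under this measure, while a model that always answers by position scores 1/n for n options, which is the kind of gap the abstract attributes to positional bias.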

Title: Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study

Authors: Mihai Dan Nadas, Laura Diosan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13182
Pdf URL: https://arxiv.org/pdf/2511.13182
Copy Paste: [[2511.13182]] Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study(https://arxiv.org/abs/2511.13182)
Keywords: language model, gpt, llm, prompt
Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
摘要：自动变音符号恢复对于变音符号丰富的语言（例如罗马尼亚语）的文本处理至关重要。本研究评估了几种大型语言模型 (LLM) 在恢复罗马尼亚语文本中的变音符号方面的性能。使用综合语料库，我们在从零镜头到复杂多镜头的多种提示模板下测试了模型，包括 OpenAI 的 GPT-3.5、GPT-4、GPT-4o、Google 的 Gemini 1.0 Pro、Meta 的 Llama 2 和 Llama 3、MistralAI 的 Mixtral 8x7B Instruct、airoboros 70B 和 OpenLLM-Ro 的 RoLlama 2 7B说明。结果表明，GPT-4o 等模型实现了较高的变音符号恢复精度，始终超过中性回波基线，而包括 Meta 的 Llama 系列在内的其他模型则表现出更大的变异性。这些发现强调了模型架构、训练数据和提示设计对变音符号恢复性能的影响，并概述了改进变音符号丰富语言的 NLP 工具的有希望的方向。
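
Illustrative sketch (not from the paper): the abstract mentions a "neutral echo baseline" without defining its scoring, so the Python below assumes the baseline simply returns the diacritic-free input unchanged and is scored by character-level accuracy against the reference. The helper names `strip_diacritics`, `echo_baseline`, and `char_accuracy` are hypothetical, and the paper's actual metric may differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def echo_baseline(stripped: str) -> str:
    """Assumed baseline: return the input without restoring anything."""
    return stripped


def char_accuracy(predicted: str, reference: str) -> float:
    """Fraction of positions where predicted matches reference.

    Returns 0.0 when lengths differ, since restoration should not change
    the number of characters in precomposed (NFC) text.
    """
    if len(predicted) != len(reference):
        return 0.0
    if not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Romanian phrase written with escapes so the code points are unambiguous:
# U+021B (t with comma below), U+0103 (a with breve), U+0219 (s with comma below).
reference = "\u021bar\u0103 \u0219i ora\u0219"
stripped = strip_diacritics(reference)

if __name__ == "__main__":
    print(stripped)
    print(char_accuracy(echo_baseline(stripped), reference))
```

Any restoration system has to beat the echo score to show it is adding information; on this 12-character phrase with four diacritic letters the echo baseline matches 8 of 12 characters.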

Title: Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

Authors: Tyler Loakman, Joseph James, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13225
Pdf URL: https://arxiv.org/pdf/2511.13225
Copy Paste: [[2511.13225]] Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms(https://arxiv.org/abs/2511.13225)
Keywords: language model, llm
Abstract: With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
摘要：随着大型语言模型（LLM）及其视觉对应模型（VLM）的兴起，许多工作研究了它们在融合视觉和语言模式的任务中的能力。在这项工作中，我们对 VLM 能够充当训练有素的语音学家、解释语音频谱图和波形的程度进行了基准测试。为此，我们综合了一个新颖的数据集，其中包含 4k 多个独立的英语单词以及风格一致的频谱图和波形图。我们通过一项多项选择任务来测试 VLM 理解这些语音表示的能力，其中模型必须预测口语单词在 3 个干扰转录中的正确音位或字形转录，这些干扰转录是根据其与基本事实的音素编辑距离而选择的。我们观察到，零样本和微调模型很少表现出高于机会的结果，这表明需要如何解释这些数字的特定参数知识，而不是单独的配对样本。
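
Illustrative sketch (not from the paper): the abstract says distractors are chosen by phonemic edit distance to the ground truth but gives no procedure. The Python below assumes the closest candidates are preferred, computes Levenshtein distance over phoneme sequences, and picks three distractors from a toy lexicon. The lexicon, its ARPAbet-style symbols, and the function names are all made up for illustration.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def pick_distractors(target: str, lexicon: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Return the k words closest to target in phoneme space.

    Ties are broken alphabetically so the result is deterministic.
    """
    target_phones = lexicon[target]
    candidates = [w for w in lexicon if w != target]
    candidates.sort(key=lambda w: (edit_distance(lexicon[w], target_phones), w))
    return candidates[:k]


# Toy lexicon (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "cats": ["K", "AE", "T", "S"],
    "dog": ["D", "AO", "G"],
}

if __name__ == "__main__":
    print(pick_distractors("cat", LEXICON))
```

Operating on phoneme symbols rather than letters matters here: two words can be close in spelling yet far apart in pronunciation, and the task is meant to probe what is visible in the spectrogram.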

Title: Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance

Authors: Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13254
Pdf URL: https://arxiv.org/pdf/2511.13254
Copy Paste: [[2511.13254]] Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance(https://arxiv.org/abs/2511.13254)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.
摘要：大型语言模型 (LLM) 在不同领域展示了卓越的能力，但它们的训练仍然是资源和时间密集型的，需要大量的计算能力和精心编排的训练过程。模型汤化（对同一架构的多个模型的权重进行平均的做法）已成为一种有前途的训练前和训练后技术，可以在无需昂贵的重新训练的情况下提高性能。在本文中，我们介绍了类别专家汤 (SoCE)，这是一种模型汤的原理方法，它利用基准组合来识别最佳模型候选，并应用非均匀加权平均来最大化性能。与以前的均匀平均方法相反，我们的方法利用了基准类别通常在模型性能中表现出较低的相互相关性的观察结果。 SoCE 为每个弱相关类别集群识别“专家”模型，并使用优化的加权平均而不是统一权重将它们组合起来。我们证明，所提出的方法提高了多个领域的性能和鲁棒性，包括多语言功能、工具调用和数学，并在伯克利函数调用排行榜上取得了最先进的结果。
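
Illustrative sketch (not from the paper): the core arithmetic of model souping is a per-parameter weighted average of checkpoints that share one architecture. The Python below shows only that averaging step, with parameters held as plain lists of floats. It does not reproduce how SoCE selects expert models or optimizes the mixing weights, and the `soup` function and `StateDict` alias are hypothetical names.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]  # parameter name -> flattened values


def soup(models: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Weighted average of parameters across models with identical keys and shapes.

    Weights are normalized to sum to 1, so [1, 1] gives a uniform soup and
    [3, 1] gives a 75/25 mix.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    keys = models[0].keys()
    for m in models:
        if m.keys() != keys:
            raise ValueError("all models must share the same parameter names")

    averaged: StateDict = {}
    for name in keys:
        size = len(models[0][name])
        if any(len(m[name]) != size for m in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        averaged[name] = [
            sum(w * m[name][i] for w, m in zip(norm, models))
            for i in range(size)
        ]
    return averaged


if __name__ == "__main__":
    expert_a = {"layer.w": [1.0, 2.0]}
    expert_b = {"layer.w": [3.0, 6.0]}
    print(soup([expert_a, expert_b], [1, 1]))  # uniform soup
    print(soup([expert_a, expert_b], [3, 1]))  # non-uniform soup
```

With real checkpoints the same loop would run over tensors instead of lists; the non-uniform case is the one the abstract contrasts with earlier uniform-averaging approaches.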

Title: Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

Authors: Kajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart, Anna Korhonen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.13368
Pdf URL: https://arxiv.org/pdf/2511.13368
Copy Paste: [[2511.13368]] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2511.13368)
Keywords: language model, llm
Abstract: Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages and their combinations remains poorly understood. We conduct a controlled PEFT/LoRA study across multiple open-weight LLM families and sizes, treating task and language as transfer axes while conditioning on model family and size; we fine-tune each model on a single task-language source and measure transfer as the percentage-point change versus its baseline score when evaluated on all other task-language target pairs. We decompose transfer into (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language) regimes. We uncover two consistent general patterns. First, a pronounced on-task vs. off-task asymmetry: Matched-Task (Cross-Language) transfer is reliably positive, whereas off-task transfer often incurs collateral degradation. Second, a stable donor-recipient structure across languages and tasks (hub donors vs. brittle recipients). We outline implications for risk-aware fine-tuning and model specialisation.
摘要：大型语言模型 (LLM) 在跨任务和语言方面表现出色，但一项任务或语言的改进如何影响其他任务和语言及其组合仍然知之甚少。我们对多个开放权重 LLM 家族和规模进行了受控 PEFT/LoRA 研究，将任务和语言视为转移轴，同时以模型家族和规模为条件；我们在单个任务语言源上微调每个模型，并在对所有其他任务语言目标对进行评估时，以百分比变化与其基线分数来衡量迁移。我们将迁移分解为（i）匹配任务（跨语言）、（ii）匹配语言（跨任务）和（iii）跨任务（跨语言）机制。我们发现了两种一致的一般模式。首先，任务内与任务外的明显不对称：匹配任务（跨语言）迁移确实是积极的，而任务外迁移通常会导致附带退化。其次，跨语言和任务的稳定的捐赠者-接收者结构（中心捐赠者与脆弱的接收者）。我们概述了风险意识微调和模型专业化的影响。
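
Illustrative sketch (not from the paper): the abstract measures transfer as the percentage-point change against a baseline score and groups source-target pairs into three regimes. The Python below assumes each setting is a `(task, language)` tuple and each score is already in percent. The function names, regime labels as strings, and all scores in the example are invented for illustration.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (task, language)


def regime(source: Pair, target: Pair) -> str:
    """Classify a source -> target pair into one of the three transfer regimes."""
    if source == target:
        raise ValueError("target must differ from the fine-tuning source")
    same_task = source[0] == target[0]
    same_lang = source[1] == target[1]
    if same_task:
        return "matched-task (cross-language)"
    if same_lang:
        return "matched-language (cross-task)"
    return "cross-task (cross-language)"


def transfer_pp(
    baseline: Dict[Pair, float],
    finetuned: Dict[Pair, float],
    source: Pair,
) -> Dict[Pair, Tuple[str, float]]:
    """Percentage-point change per target pair, tagged with its regime.

    Scores are assumed to be in percent, so the change is a plain difference.
    """
    return {
        target: (regime(source, target), finetuned[target] - baseline[target])
        for target in baseline
        if target != source
    }


if __name__ == "__main__":
    src = ("qa", "en")
    base = {("qa", "en"): 60.0, ("qa", "de"): 50.0,
            ("nli", "en"): 70.0, ("nli", "de"): 65.0}
    tuned = {("qa", "en"): 75.0, ("qa", "de"): 54.5,
             ("nli", "en"): 68.0, ("nli", "de"): 64.0}
    for target, (kind, delta) in transfer_pp(base, tuned, src).items():
        print(target, kind, f"{delta:+.1f} pp")
```

Reporting percentage points rather than relative change keeps gains and losses comparable across targets with very different baseline scores.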

Title: Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Authors: Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Xiaoyan Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13381
Pdf URL: https://arxiv.org/pdf/2511.13381
Copy Paste: [[2511.13381]] Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts(https://arxiv.org/abs/2511.13381)
Keywords: language model, gpt, llm
Abstract: With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework centered on a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined ~15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that pediatric LLMs are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and a clinical feedback-model iteration loop to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.
摘要：随着大语言模型（LLM）在医学领域的迅速崛起，一个关键问题是他们能否在现实临床环境中充当合格的儿科医生。我们开发了PEDIASBench，这是一个以知识系统框架为中心、针对现实临床环境量身定制的系统评估框架。 PEDIASBench从基础知识应用、动态诊疗能力、儿科医疗安全与医学伦理三个维度对法学硕士进行评估。我们评估了过去两年发布的12个代表性模型，包括GPT-4o、Qwen3-235B-A22B和DeepSeek-V3，涵盖19个儿科亚专科和211种典型疾病。最先进的模型在基础知识上表现良好，Qwen3-235B-A22B 在许可级别问题上的准确率超过 90%，但随着任务复杂性的增加，性能下降了约 15%，揭示了复杂推理的局限性。多项选择评估突出了综合推理和知识回忆方面的弱点。在动态诊断和治疗场景中，DeepSeek-R1 在病例推理方面得分最高（平均 0.58），但大多数模型难以适应患者的实时变化。在儿科医学伦理和安全任务上，Qwen2.5-72B 表现最好（准确率 92.05%），但人文敏感性仍然有限。这些发现表明儿科法学硕士受到有限的动态决策和不发达的人文关怀的限制。未来的发展应侧重于多模式集成和临床反馈模型迭代循环，以增强安全性、可解释性和人机协作。虽然目前的法学硕士无法独立进行儿科护理，但他们有望提供决策支持、医学教育和患者沟通，为安全、值得信赖和协作的智能儿科医疗保健系统奠定基础。

Title: Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Authors: Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13410
Pdf URL: https://arxiv.org/pdf/2511.13410
Copy Paste: [[2511.13410]] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction(https://arxiv.org/abs/2511.13410)
Keywords: llm, retrieval-augmented generation, agent
Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users' subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
摘要：随着智能个人设备的兴起，面向服务的人机交互变得越来越普遍。这一趋势凸显了对个性化对话助手的需求，这些助手可以了解用户的特定特征，从而准确地解释需求并根据个人偏好定制响应。然而，现有的方法往往忽视了长期交互的复杂性，并且无法捕捉用户的主观特征。为了弥补这些差距，我们提出了 PAL-Bench，这是一个新的基准，旨在评估面向服务的助手在长期用户代理交互中的个性化能力。在缺乏可用的真实世界数据的情况下，我们开发了一个基于 LLM 的多步骤合成管道，并由人类注释者进一步验证和完善。这个过程产生了 PAL-Set，这是第一个包含多会话用户日志和对话历史的中国数据集，它作为 PAL-Bench 的基础。此外，为了改进面向个性化服务的交互，我们提出了 H$^2$Memory，这是一种分层异构内存框架，它结合了检索增强生成来改进个性化响应生成。我们的 PAL-Bench 和外部数据集上的综合实验证明了所提出的内存框架的有效性。

Title: Applying Large Language Models to Characterize Public Narratives

Authors: Elinor Poole-Dayan, Daniel T Kessler, Hannah Chiou, Margaret Hughes, Emily S Lin, Marshall Ganz, Deb Roy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13505
Pdf URL: https://arxiv.org/pdf/2511.13505
Copy Paste: [[2511.13505]] Applying Large Language Models to Characterize Public Narratives(https://arxiv.org/abs/2511.13505)
Keywords: language model, llm
Abstract: Public Narratives (PNs) are key tools for leadership development and civic mobilization, yet their systematic analysis remains challenging due to their subjective interpretation and the high cost of expert annotation. In this work, we propose a novel computational framework that leverages large language models (LLMs) to automate the qualitative annotation of public narratives. Using a codebook we co-developed with subject-matter experts, we evaluate LLM performance against that of expert annotators. Our work reveals that LLMs can achieve near-human-expert performance, with an average F1 score of 0.80 across 8 narratives and 14 codes. We then extend our analysis to empirically explore how PN framework elements manifest across a larger dataset of 22 stories. Lastly, we extrapolate our analysis to a set of political speeches, establishing a novel lens through which to analyze political rhetoric in civic spaces. This study demonstrates the potential of LLM-assisted annotation for scalable narrative analysis and highlights key limitations and directions for future research in computational civic storytelling.
摘要：公共叙事（PN）是领导力发展和公民动员的关键工具，但由于其主观解释和专家注释的高昂成本，对其进行系统分析仍然具有挑战性。在这项工作中，我们提出了一种新颖的计算框架，该框架利用大型语言模型（LLM）来自动对公共叙述进行定性注释。使用我们与主题专家共同开发的密码本，我们根据专家注释者的表现来评估法学硕士的表现。我们的工作表明，法学硕士可以实现接近人类专家的表现，在 8 个叙述和 14 个代码中获得 0.80 的平均 F1 分数。然后，我们扩展分析，以实证方式探索 PN 框架元素如何在包含 22 个故事的更大数据集中体现。最后，我们将我们的分析推断为一系列政治演讲，建立了一个新的视角来分析公民空间中的政治言论。这项研究展示了法学硕士辅助注释在可扩展叙事分析中的潜力，并强调了计算公民故事讲述未来研究的关键局限性和方向。

Title: Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation

Authors: Hao Wang, Yuanfeng Song, Xiaoming Yin, Xing Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.13590
Pdf URL: https://arxiv.org/pdf/2511.13590
Copy Paste: [[2511.13590]] Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation(https://arxiv.org/abs/2511.13590)
Keywords: language model, llm
Abstract: Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
摘要：文本到 SQL 数据集对于训练和评估文本到 SQL 模型至关重要，但现有数据集通常覆盖范围有限，无法捕获现实应用程序的多样性。为了解决这个问题，我们提出了一种基于核心意图、语句类型、语法结构和关键操作等维度的文本到 SQL 分类的新分类法。使用这种分类法，我们评估广泛使用的公共文本到 SQL 数据集（例如 Spider 和 Bird），并揭示其覆盖范围和多样性的局限性。然后，我们引入分类引导的数据集合成管道，生成一个名为 SQL-Synth 的新数据集。这种方法将分类法与大型语言模型 (LLM) 相结合，以确保数据集反映现实世界文本到 SQL 应用程序的广度和复杂性。广泛的分析和实验结果验证了我们分类法的有效性，因为与现有基准相比，SQL-Synth 表现出更大的多样性和覆盖范围。此外，我们发现现有的法学硕士通常无法充分捕获全方位的场景，从而导致 SQL-Synth 的性能有限。然而，微调可以大大提高它们在这些场景中的性能。所提出的分类法具有重大的潜在影响，因为它不仅能够对数据集和不同法学硕士的表现进行全面分析，而且还可以指导法学硕士培训数据的构建。

Title: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Authors: Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.13593
Pdf URL: https://arxiv.org/pdf/2511.13593
Copy Paste: [[2511.13593]] Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents(https://arxiv.org/abs/2511.13593)
Keywords: llm, agent
Abstract: Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem, the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
摘要：LLM驱动的智能体的最新进展已经证明了在产生类人反应方面的巨大潜力；然而，他们在复杂环境中维持长期交互仍然面临挑战，这主要是由于上下文一致性和动态个性化的限制。现有的记忆系统通常依赖于检索之前的语义分组，这可能会忽略语义上不相关但关键的用户信息并引入检索噪声。在本报告中，我们提出了 O-Mem 的初步设计，这是一种基于主动用户分析的新型内存框架，可以从用户与代理的主动交互中动态提取和更新用户特征和事件记录。 O-Mem 支持角色属性和主题相关上下文的分层检索，从而实现更具适应性和连贯性的个性化响应。 O-Mem 在公共 LoCoMo 基准上达到了 51.76%，比之前的最先进的 LangMem 提高了近 3%；在 PERSONAMEM 上达到了 62.99%，比之前的最先进的 A-Mem 提高了 3.5%。与以前的内存框架相比，O-Mem 还提高了令牌和交互响应时间效率。我们的工作为未来开发高效、类人的个性化人工智能助理开辟了有前景的方向。

Title: Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues

Authors: Jiaming Qu, Mengtian Guo, Yue Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.13658
Pdf URL: https://arxiv.org/pdf/2511.13658
Copy Paste: [[2511.13658]] Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues(https://arxiv.org/abs/2511.13658)
Keywords: language model, llm
Abstract: Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of training examples to effectively distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret. In this work, we explore using large language models (LLMs) to translate machine-learned lexical cues into human-understandable language phenomena that can differentiate deceptive reviews from genuine ones. We show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena either in LLMs' prior knowledge or obtained through in-context learning. These language phenomena have the potential to aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
摘要：欺骗性评论会误导消费者、损害企业并破坏对在线市场的信任。机器学习分类器可以从大量的训练示例中学习，以有效区分欺骗性评论和真实评论。然而，这些分类器学到的区分特征通常是微妙的、碎片化的，并且人类难以解释。在这项工作中，我们探索使用大型语言模型（LLM）将机器学习的词汇线索翻译成人类可理解的语言现象，从而区分欺骗性评论和真实评论。我们表明，以这种方式获得的语言现象在经验上以数据为基础，可以在相似的领域中推广，并且比法学硕士先验知识或通过上下文学习获得的现象更具预测性。这些语言现象有可能帮助人们在欺骗检测分类器不可用的环境中批判性地评估在线评论的可信度。

Title: Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation

Authors: Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.13689
Pdf URL: https://arxiv.org/pdf/2511.13689
Copy Paste: [[2511.13689]] Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation(https://arxiv.org/abs/2511.13689)
Keywords: language model, llm, prompt
Abstract: Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader's experience.
摘要：印度诗歌以其语言复杂性和深厚的文化共鸣而闻名，拥有跨越数千年的丰富多样的遗产。然而，它的分层含义、文化典故和复杂的语法结构常常给理解带来挑战，特别是对于非母语人士或不熟悉其上下文和语言的读者而言。尽管具有文化意义，现有的诗歌作品在很大程度上忽视了印度语言诗歌。在本文中，我们提出了翻译和图像生成（TAI）框架，通过适当的提示调整来利用大型语言模型（LLM）和潜在扩散模型。我们的框架通过提高全球受众对文化丰富的印度语言诗歌的可及性，支持联合国优质教育可持续发展目标（SDG 4）和减少不平等（SDG 10）。它包括 (1) 一个翻译模块，使用比值比偏好对齐算法将形态丰富的诗歌准确地翻译成英语；(2) 一个图像生成模块，使用语义图捕获标记、依赖关系以及隐喻及其含义之间的语义关系，以创建印度诗歌的视觉上有意义的表示。我们全面的实验评估，包括人类和定量评估，证明了 TAI Diffusion 在诗歌图像生成任务中的优越性，优于强大的基线。为了进一步解决印度语言诗歌资源稀缺的问题，我们引入了形态丰富的印度语言诗歌 MorphoVerse 数据集，其中包含 21 种资源匮乏的印度语言的 1,570 首诗歌。通过解决诗歌翻译和视觉理解方面的差距，这项工作旨在扩大可访问性并丰富读者的体验。

Title: Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Authors: Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.13703
Pdf URL: https://arxiv.org/pdf/2511.13703
Copy Paste: [[2511.13703]] Generalist Foundation Models Are Not Clinical Enough for Hospital Operations(https://arxiv.org/abs/2511.13703)
Keywords: llm
Abstract: Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
摘要：医院和医疗保健系统依赖于决定患者流量、成本和护理质量的运营决策。尽管在医学知识和对话基准方面表现出色，但在一般文本上训练的基础模型可能缺乏这些操作决策所需的专业知识。我们引入了 Lang1，这是一个在专门的语料库上进行预训练的模型系列（100M-7B 参数），该语料库混合了来自 NYU Langone Health 的 EHR 的 80B 临床标记和来自互联网的 627B 标记。为了在现实环境中严格评估 Lang1，我们开发了现实医学评估 (ReMedE)，这是一个源自 668,331 份 EHR 记录的基准，评估五项关键任务：30 天再入院预测、30 天死亡率预测、住院时间、合并症编码和预测保险索赔拒绝。在零样本设置中，通用模型和专用模型在五项任务中的四项上表现不佳（36.6%-71.7% AUROC），死亡率预测是一个例外。微调后，Lang1-1B 的性能优于微调后的通才模型达 70 倍大，零样本模型达 671 倍大，分别将 AUROC 提高了 3.64%-6.75% 和 1.66%-23.66%。我们还观察到通过对多个任务进行联合微调来实现跨任务扩展，从而改善其他任务。 Lang1-1B 有效地转移到分布外环境，包括其他临床任务和外部卫生系统。我们的研究结果表明，医院运营的预测能力需要明确的监督微调，并且通过 EHR 的域内预训练可以使这一微调过程更加高效。我们的研究结果支持了新的观点，即专业的法学硕士可以在专业任务中与通才模型竞争，并表明有效的医疗保健系统人工智能需要将域内预训练、监督微调和超越代理基准的现实世界评估相结合。