2025-11-13

Title: GMTRouter: Personalized LLM Router over Multi-turn User Interactions

Authors: Encheng Xie, Yihang Sun, Tao Feng, Jiaxuan You
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08590
Pdf URL: https://arxiv.org/pdf/2511.08590
Copy Paste: [[2511.08590]] GMTRouter: Personalized LLM Router over Multi-turn User Interactions(https://arxiv.org/abs/2511.08590)
Keywords: language model, llm
Abstract: Large Language Model (LLM) routing has demonstrated strong capability in balancing response quality with computational cost. As users exhibit diverse preferences, personalization has attracted increasing attention in LLM routing, since even identical queries may require different models to generate responses tailored to individual needs. However, existing approaches are not fully personalized and often fail to capture the complex interactions between specific users and LLMs. Moreover, user preference data is typically scarce, noisy, and inconsistent in format, which limits the effectiveness of methods that rely solely on user-specific data. To address these challenges, we propose GMTRouter, which represents multi-turn user-LLM interactions as a heterogeneous graph with four node types: user, LLM, query, and response, thereby preserving the rich relational structure of the interaction. Through a tailored message-passing mechanism, GMTRouter learns to capture user preferences from few-shot data within a lightweight inductive graph learning framework, enabling effective personalization. Extensive experiments demonstrate that GMTRouter consistently outperforms strong baselines, achieving 0.9 to 21.6 percent higher accuracy and 0.006 to 0.309 higher AUC across multiple datasets. More importantly, we demonstrate that GMTRouter can adapt to new users and evolving preferences using only few-shot data, without extensive fine-tuning. The code for GMTRouter is publicly available at this https URL.
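As a rough illustration of the graph structure described in the abstract, the sketch below builds a heterogeneous graph with the four node types (user, LLM, query, response) from multi-turn interaction records. The record fields, the relation names (`asks`, `answered_by`, `generates`), and the `build_graph` helper are assumptions made for illustration; they are not taken from the GMTRouter code.

```python
from collections import defaultdict

NODE_TYPES = ("user", "llm", "query", "response")

def build_graph(turns):
    """Build typed node sets and typed edge lists from interaction turns.

    Each turn is a dict with hypothetical keys: user, llm, query, response.
    """
    nodes = {t: set() for t in NODE_TYPES}
    edges = defaultdict(list)  # (src_type, relation, dst_type) -> [(src, dst)]
    for i, turn in enumerate(turns):
        q_id, r_id = f"q{i}", f"r{i}"  # one query and one response node per turn
        nodes["user"].add(turn["user"])
        nodes["llm"].add(turn["llm"])
        nodes["query"].add(q_id)
        nodes["response"].add(r_id)
        edges[("user", "asks", "query")].append((turn["user"], q_id))
        edges[("query", "answered_by", "response")].append((q_id, r_id))
        edges[("llm", "generates", "response")].append((turn["llm"], r_id))
    return nodes, dict(edges)

# Toy two-turn conversation: one user, two different (made-up) models.
turns = [
    {"user": "u1", "llm": "model_a", "query": "Explain recursion", "response": "first answer"},
    {"user": "u1", "llm": "model_b", "query": "Shorter, please", "response": "second answer"},
]
nodes, edges = build_graph(turns)
```

A message-passing model would then operate over these typed edges; how GMTRouter does so is not specified in the abstract.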

Title: The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions

Authors: Azza Bouleimen, Giordano De Marzo, Taehee Kim, Nicolò Pagan, Hannah Metzler, Silvia Giordano, David Garcia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08592
Pdf URL: https://arxiv.org/pdf/2511.08592
Copy Paste: [[2511.08592]] The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions(https://arxiv.org/abs/2511.08592)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.

Title: Knowledge Graph Analysis of Legal Understanding and Violations in LLMs

Authors: Abha Jha, Abel Salinas, Fred Morstatter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08593
Pdf URL: https://arxiv.org/pdf/2511.08593
Copy Paste: [[2511.08593]] Knowledge Graph Analysis of Legal Understanding and Violations in LLMs(https://arxiv.org/abs/2511.08593)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The rise of Large Language Models (LLMs) offers transformative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs' understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs' reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains - ensuring they act as protectors of the law rather than inadvertent enablers of its violation.

Title: Diverse Preference Learning for Capabilities and Alignment

Authors: Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08594
Pdf URL: https://arxiv.org/pdf/2511.08594
Copy Paste: [[2511.08594]] Diverse Preference Learning for Capabilities and Alignment(https://arxiv.org/abs/2511.08594)
Keywords: llm
Abstract: The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty - allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
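The decoupling mentioned in the abstract rests on a standard identity: the KL divergence equals a cross-entropy term minus an entropy term. The sketch below writes this out. The separate coefficients alpha and beta are notation assumed here for illustration and may differ from the paper's own formulation.

```latex
\begin{align*}
D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})
  &= \sum_{y} \pi(y) \log \frac{\pi(y)}{\pi_{\mathrm{ref}}(y)} \\
  &= \underbrace{-\sum_{y} \pi(y) \log \pi_{\mathrm{ref}}(y)}_{H(\pi,\, \pi_{\mathrm{ref}})}
     \;-\; \underbrace{\Big(-\sum_{y} \pi(y) \log \pi(y)\Big)}_{H(\pi)} \\
\beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})
  \;&\longrightarrow\;
  \alpha\, H(\pi, \pi_{\mathrm{ref}}) \;-\; \beta\, H(\pi)
\end{align*}
```

Setting alpha equal to beta recovers the usual KL penalty. Under this assumed notation, letting the two coefficients differ is what would allow the entropy of the policy (and hence output diversity) to be tuned separately from its closeness to the reference model.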

Title: Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning

Authors: Joongho Kim, Xirui Huang, Zarreen Reza, Gabriel Grand, Kevin Zhu, Ryan Lagasse
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08595
Pdf URL: https://arxiv.org/pdf/2511.08595
Copy Paste: [[2511.08595]] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning(https://arxiv.org/abs/2511.08595)
Keywords: language model, llm, tree-of-thought
Abstract: Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at this https URL.
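The idea of pruning semantically redundant branches can be sketched as a greedy filter over candidate reasoning steps: keep a candidate only if it is not too similar to one already kept. This is a minimal sketch, not the SSDP implementation; the embeddings, scores, the 0.9 threshold, and the `prune_similar` helper are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_similar(candidates, threshold=0.9):
    """Keep the highest-scoring candidate from each group of near-duplicates.

    candidates: list of (text, embedding, score) tuples.
    """
    kept = []
    # Visit candidates from best to worst so each group keeps its top member.
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(cosine(cand[1], k[1]) < threshold for k in kept):
            kept.append(cand)
    return kept

# Toy candidates with made-up 2-D embeddings; the first two are near-duplicates.
candidates = [
    ("add 3 then double", [1.0, 0.0], 0.8),
    ("double after adding 3", [0.99, 0.05], 0.7),
    ("subtract 2 first", [0.0, 1.0], 0.6),
]
survivors = prune_similar(candidates)
```

In a real tree search this filter would run at each expansion step, so that only one branch per semantic cluster is explored further.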

Title: What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs' Self-consistency Via Adversarial Nudge

Authors: Arka Dutta, Sujan Dutta, Rijul Magu, Soumyajit Datta, Munmun De Choudhury, Ashiqur R. KhudaBukhsh
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.08596
Pdf URL: https://arxiv.org/pdf/2511.08596
Copy Paste: [[2511.08596]] What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs' Self-consistency Via Adversarial Nudge(https://arxiv.org/abs/2511.08596)
Keywords: language model, gpt, llm, hallucination
Abstract: Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudges. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.

Title: Self-HarmLLM: Can Large Language Model Harm Itself?

Authors: Heehwan Kim, Sungjune Park, Daeseon Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08597
Pdf URL: https://arxiv.org/pdf/2511.08597
Copy Paste: [[2511.08597]] Self-HarmLLM: Can Large Language Model Harm Itself?(https://arxiv.org/abs/2511.08597)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.

Title: OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking

Authors: Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, Jiawei Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08598
Pdf URL: https://arxiv.org/pdf/2511.08598
Copy Paste: [[2511.08598]] OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking(https://arxiv.org/abs/2511.08598)
Keywords: language model, llm, agent
Abstract: Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.

Title: Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study

Authors: Yilan Liu
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.08600
Pdf URL: https://arxiv.org/pdf/2511.08600
Copy Paste: [[2511.08600]] Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study(https://arxiv.org/abs/2511.08600)
Keywords: language model, gpt, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.

Title: Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice

Authors: Azmine Toushik Wasi, Wahid Faisal, Mst Rafia Islam
Subjects: cs.CL, cs.CY, cs.HC, cs.MA, cs.MM
Abstract URL: https://arxiv.org/abs/2511.08605
Pdf URL: https://arxiv.org/pdf/2511.08605
Copy Paste: [[2511.08605]] Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice(https://arxiv.org/abs/2511.08605)
Keywords: llm, chat, agent
Abstract: Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.

Title: A Super-Learner with Large Language Models for Medical Emergency Advising

Authors: Sergey K. Aityan, Abdolreza Mosaddegh, Rolando Herrero, Haitham Tayyar, Jiang Han, Vikram Sawant, Qi Chen, Rishabh Jain, Aruna Senthamaraikannan, Stephen Wood, Manuel Mersini, Rita Lazzaro, Mario Balzaneli, Nicola Iacovazzo, Ciro Gargiulo Isacco
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08614
Pdf URL: https://arxiv.org/pdf/2511.08614
Copy Paste: [[2511.08614]] A Super-Learner with Large Language Models for Medical Emergency Advising(https://arxiv.org/abs/2511.08614)
Keywords: language model, gpt, llm
Abstract: Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients' conditions and make diagnoses. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years, and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied the responses of a group of different LLMs to real cases in emergency medicine. The results of our study on the five most renowned LLMs showed significant differences in the capabilities of Large Language Models for diagnosing acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner, MEDAS (Medical Emergency Diagnostic Advising System), of five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning the different capabilities of each LLM to leverage diagnostic accuracy of the model by collective capabilities of all LLMs in the cluster. The results of our study showed that aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
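The abstract does not say which meta-learner MEDAS uses. As one simple possibility, the sketch below weights each base model by its accuracy on labeled cases and combines new predictions by weighted vote. The model names, placeholder diagnosis labels, and both helper functions are hypothetical and serve only to illustrate the super-learner idea.

```python
from collections import defaultdict

def fit_weights(predictions, labels):
    """Weight each base model by its accuracy on labeled training cases."""
    weights = {}
    for model, preds in predictions.items():
        correct = sum(p == y for p, y in zip(preds, labels))
        weights[model] = correct / len(labels)
    return weights

def predict(weights, case_predictions):
    """Return the label with the largest total weight across base models."""
    totals = defaultdict(float)
    for model, label in case_predictions.items():
        totals[label] += weights.get(model, 0.0)
    return max(totals, key=totals.get)

# Toy training data: placeholder labels, not real clinical cases.
train_preds = {
    "model_a": ["dx1", "dx2", "dx3", "dx1"],
    "model_b": ["dx1", "dx4", "dx3", "dx5"],
    "model_c": ["dx5", "dx4", "dx6", "dx5"],
}
train_labels = ["dx1", "dx2", "dx3", "dx1"]
weights = fit_weights(train_preds, train_labels)

# Two weaker models agree on "dx4", but the more accurate model outweighs them.
diagnosis = predict(weights, {"model_a": "dx2", "model_b": "dx4", "model_c": "dx4"})
```

The example is chosen so that the weighted vote disagrees with a plain majority vote, which is the kind of behavior a learned combiner adds over simple voting.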

Title: Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM

Authors: Yibai Liu, Shihang Wang, Zeming Liu, Zheming Song, Junzhe Wang, Jingjing Liu, Qingjie Liu, Yunhong Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.08620
Pdf URL: https://arxiv.org/pdf/2511.08620
Copy Paste: [[2511.08620]] Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM(https://arxiv.org/abs/2511.08620)
Keywords: language model, llm
Abstract: Although large language models (LLMs) have achieved impressive results across numerous tasks, supervised fine-tuning (SFT) remains essential for adapting these models to specialized domains. However, SFT for domain specialization can be resource-intensive and sometimes leads to a deterioration in performance over general capabilities due to catastrophic forgetting (CF). To address these issues, we propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of LLMs, which identifies effective subsets of training data by analyzing gradients obtained from a preliminary training phase. Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model's learning process. This approach enables the acquisition of representative samples that enhance LLMs' understanding of domain-specific tasks. Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness. Remarkably, utilizing merely 5% of the selected GrADS data, LLMs already surpass the performance of those fine-tuned on the entire dataset, and increasing to 50% of the data results in significant improvements, with catastrophic forgetting substantially mitigated at the same time. We will release our code for GrADS later.

Title: Structured Uncertainty guided Clarification for LLM Agents

Authors: Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi, Dinesh Manocha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08798
Pdf URL: https://arxiv.org/pdf/2511.08798
Copy Paste: [[2511.08798]] Structured Uncertainty guided Clarification for LLM Agents(https://arxiv.org/abs/2511.08798)
Keywords: language model, llm, prompt, agent
Abstract: LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
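To make the EVPI objective concrete, the sketch below scores a clarifying question by how much it is expected to raise the probability of the agent's best guess for a tool argument. It assumes a utility of 1 for the correct argument and 0 otherwise, a deterministic user answer per hypothesis, and hypothetical hypothesis and answer names. None of these details come from the paper, and the cost model is omitted.

```python
def evpi(belief, answer_of):
    """Expected gain in best-guess probability from asking one question.

    belief: dict mapping hypothesis -> probability (sums to 1).
    answer_of: dict mapping hypothesis -> the answer the user would give.
    """
    prior_best = max(belief.values())
    best_per_answer = {}
    for hypothesis, p in belief.items():
        answer = answer_of[hypothesis]
        best_per_answer[answer] = max(best_per_answer.get(answer, 0.0), p)
    # sum_a P(a) * max_h P(h | a) simplifies to sum_a max_{h consistent with a} P(h).
    return sum(best_per_answer.values()) - prior_best

# Toy belief over which destination a travel-booking call should use.
belief = {"paris_fr": 0.5, "paris_tx": 0.3, "london": 0.2}

# Question 1 separates every hypothesis; question 2 only splits by region.
which_country = {"paris_fr": "France", "paris_tx": "USA", "london": "UK"}
in_europe = {"paris_fr": "yes", "paris_tx": "no", "london": "yes"}

gain_country = evpi(belief, which_country)
gain_europe = evpi(belief, in_europe)
best_question = "which_country" if gain_country >= gain_europe else "in_europe"
```

An agent following this scheme would ask the highest-value question only when its value exceeds the cost of interrupting the user, and otherwise proceed with its current best guess.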

Title: Toward Automated Cognitive Assessment in Parkinson's Disease Using Pretrained Language Models

Authors: Varada Khanna (1), Nilay Bhatt (1), Ikgyu Shin (1), Sule Tinaz (2), Yang Ren (1), Hua Xu (1), Vipina K. Keloth (1) ((1) Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, (2) Department of Neurology, Yale School of Medicine, New Haven, CT)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08806
Pdf URL: https://arxiv.org/pdf/2511.08806
Copy Paste: [[2511.08806]] Toward Automated Cognitive Assessment in Parkinson's Disease Using Pretrained Language Models(https://arxiv.org/abs/2511.08806)
Keywords: language model, gpt
Abstract: Understanding how individuals with Parkinson's disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families were compared on their performance in extracting seven categories: a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a Meta-Llama-3-8B-Instruct model fine-tuned with QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparably to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
摘要：了解帕金森病 (PD) 患者如何描述日常生活中的认知体验可以为与疾病相关的认知和情绪变化提供有价值的见解。然而，由于认知结构的微妙、重叠性质，从非结构化的患者叙述中提取此类信息具有挑战性。这项研究开发并评估了自然语言处理（NLP）模型，以自动识别反映来自去识别的第一人称叙述的各种认知过程的类别。比较了三个模型系列在提取七个类别方面的性能：用于嵌套实体识别的基于 Bio_ClinicalBERT 的跨度分类模型、使用 QLoRA 进行指令跟踪的微调 Meta-Llama-3-8B-Instruct 模型以及在零样本和少样本设置下评估的 GPT-4o mini。我们的研究结果表明，不同类别和模型系列的模型性能差异很大。经过微调的 Meta-Llama-3-8B-Instruct 获得了最高的总体 F1 分数（微观平均 0.74 和宏观平均 0.59），尤其在思维和社交互动等依赖于情境的类别中表现出色。 Bio_ClinicalBERT 表现出高精度但召回率低，并且在某些类别类型（例如位置和时间）上的表现与 Llama 相当，但在其他类别（例如思想、情感和社交互动）上表现不佳。与传统的信息提取任务相比，由于复杂认知过程的叙述帐户的抽象性和重叠性，该任务提出了更大的挑战。 Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
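
Note on the two F1 figures above: micro-averaging pools true positives, false positives, and false negatives over all categories before computing F1, while macro-averaging computes F1 per category and takes the unweighted mean. A micro score well above the macro score (0.74 vs. 0.59) is what one would expect when frequent categories are extracted well and rarer ones poorly. The sketch below illustrates the difference; the category names come from the abstract, but the counts are invented for illustration and are not the paper's data.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps category -> (tp, fp, fn). Returns (micro_f1, macro_f1)."""
    per_category = {cat: f1(*c) for cat, c in counts.items()}
    macro = sum(per_category.values()) / len(per_category)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: one frequent, easy category and one rare, hard category.
counts = {
    "location": (90, 10, 10),  # per-category F1 = 0.90
    "thought": (5, 5, 15),     # per-category F1 = 1/3
}
micro, macro = micro_macro_f1(counts)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these made-up counts the pooled (micro) score is dominated by the frequent category, so it sits above the macro score.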

Title: Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents

Authors: Yejin Yoon, Yuri Son, Namyoung So, Minseo Kim, Minsoo Cho, Chanhee Park, Seungshin Lee, Taeuk Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08835
Pdf URL: https://arxiv.org/pdf/2511.08835
Copy Paste: [[2511.08835]] Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents(https://arxiv.org/abs/2511.08835)
Keywords: gpt, chat, agent
Abstract: Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet, real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent's ability to initiate and recover from mode transitions, we propose two new metrics -- Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
摘要：传统上，会话代理是为面向任务的对话（TOD）或开放式闲聊而开发的，在统一两者方面进展有限。然而，现实世界的对话自然涉及这些模式之间的流畅转换。为了解决这一差距，我们引入了 TACT（TOD-And-Chitchat Transition），这是一个专为转换感知对话建模而设计的数据集，其中包含结构多样和集成的模式流。 TACT 支持用户驱动和代理驱动的模式切换，从而能够对复杂的对话动态进行稳健的建模。为了评估代理启动模式转换和从模式转换中恢复的能力，我们提出了两个新指标——切换和恢复。在 TACT 上训练的模型在意图检测和模式转换处理方面都优于基线。此外，将直接偏好优化 (DPO) 应用于 TACT 训练的模型会产生额外的收益，在人类评估中相对于 GPT-4o 实现 75.74% 的联合模式意图准确度和 70.1% 的获胜率。这些结果表明，将结构多样化的数据与 DPO 配对可以增强响应质量和转换控制，为更主动和具有转换意识的对话代理铺平道路。

Title: BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation

Authors: Fuyi Yang, Chenchen Ye, Mingyu Derek Ma, Yijia Xiao, Matthew Yang, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08866
Pdf URL: https://arxiv.org/pdf/2511.08866
Copy Paste: [[2511.08866]] BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation(https://arxiv.org/abs/2511.08866)
Keywords: language model, llm, agent
Abstract: Hypothesis generation in biomedical research has traditionally centered on uncovering hidden relationships within vast scientific literature, often using methods like Literature-Based Discovery (LBD). Despite progress, current approaches typically depend on single data types or predefined extraction patterns, which restricts the discovery of novel and complex connections. Recent advances in Large Language Model (LLM) agents show significant potential, with capabilities in information retrieval, reasoning, and generation. However, their application to biomedical hypothesis generation has been limited by the absence of standardized datasets and execution environments. To address this, we introduce BioVerge, a comprehensive benchmark, and BioVerge Agent, an LLM-based agent framework, to create a standardized environment for exploring biomedical hypothesis generation at the frontier of existing scientific knowledge. Our dataset includes structured and textual data derived from historical biomedical hypotheses and PubMed literature, organized to support exploration by LLM agents. BioVerge Agent utilizes a ReAct-based approach with distinct Generation and Evaluation modules that iteratively produce and self-assess hypothesis proposals. Through extensive experimentation, we uncover key insights: 1) different architectures of BioVerge Agent influence exploration diversity and reasoning strategies; 2) structured and textual information sources each provide unique, critical contexts that enhance hypothesis generation; and 3) self-evaluation significantly improves the novelty and relevance of proposed hypotheses.
摘要：生物医学研究中的假设生成传统上集中于揭示大量科学文献中隐藏的关系，通常使用基于文献的发现 (LBD) 等方法。尽管取得了进展，但当前的方法通常依赖于单一数据类型或预定义的提取模式，这限制了新颖且复杂的连接的发现。大型语言模型 (LLM) 代理的最新进展显示出巨大的潜力，具有信息检索、推理和生成的能力。然而，由于缺乏标准化数据集和执行环境，它们在生物医学假设生成中的应用受到限制。为了解决这个问题，我们引入了 BioVerge（一个综合基准）和 BioVerge Agent（一个基于 LLM 的代理框架），以创建一个标准化环境，用于在现有科学知识的前沿探索生物医学假设的生成。我们的数据集包括源自历史生物医学假设和 PubMed 文献的结构化和文本数据，这些数据的组织是为了支持法学硕士代理人的探索。 BioVerge Agent 采用基于 ReAct 的方法，具有不同的生成和评估模块，可迭代地生成和自我评估假设提案。通过广泛的实验，我们揭示了关键的见解：1）BioVerge Agent 的不同架构影响探索多样性和推理策略； 2）结构化和文本信息源各自提供独特的、关键的上下文，以增强假设的生成； 3）自我评估显着提高了所提出假设的新颖性和相关性。

Title: Hallucinate or Memorize? The Two Sides of Probabilistic Learning in Large Language Models

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08877
Pdf URL: https://arxiv.org/pdf/2511.08877
Copy Paste: [[2511.08877]] Hallucinate or Memorize? The Two Sides of Probabilistic Learning in Large Language Models(https://arxiv.org/abs/2511.08877)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in citation recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic records depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., those that appear more frequently in the pretraining corpus) showing lower hallucination rates. We therefore treat citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record appears in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) citation count is strongly correlated with factual accuracy, (ii) bibliographic information becomes almost verbatim memorized beyond roughly 1,000 citations, and (iii) memory interference occurs when multiple highly cited papers share similar content. These findings indicate a threshold where generalization shifts into memorization, with highly cited papers being nearly verbatim retained in the model.
摘要：大型语言模型 (LLM) 已越来越多地应用于从自然语言理解到代码生成的各种任务。虽然它们也被用来协助引文推荐，但对不存在论文的幻觉仍然是一个主要问题。基于先前的研究，本研究假设法学硕士正确生成书目记录的能力取决于基础知识是生成还是记忆，被引用率较高的论文（即更频繁地出现在预训练语料库中）显示出较低的幻觉率。因此，我们假设引用计数作为训练数据冗余的代理（即给定书目记录出现在预训练语料库中的频率），并研究引用频率如何影响法学硕士输出中的幻觉参考文献。使用 GPT-4.1，我们生成并手动验证了 20 个计算机科学领域的 100 条引用，并通过生成的元数据和真实元数据之间的余弦相似性来衡量事实一致性。结果显示，(i) 引用计数与事实准确性密切相关，(ii) 超过大约 1,000 次引用后，书目信息几乎被逐字记忆，(iii) 当多篇高被引论文共享相似内容时，会发生记忆干扰。这些发现表明泛化转变为记忆的阈值，被引用次数最多的论文几乎逐字保留在模型中。
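
Note on the consistency measure above: the abstract says factual consistency was measured via cosine similarity between generated and authentic metadata, but does not say how the metadata was vectorized. The sketch below assumes a simple bag-of-words term-count representation, which is only one possible choice and may differ from the paper's. Both citation strings are made-up placeholders, not real references.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased alphanumeric tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Placeholder records: the "generated" one gets a co-author and the year wrong.
authentic = "A. Author, C. Scholar. An Example Study of Graph Routing. Placeholder Venue, 2021."
generated = "A. Author, B. Writer. An Example Study of Graph Routing. Placeholder Venue, 2020."
print(round(cosine(generated, authentic), 3))
```

A score near 1 indicates a nearly verbatim record, and lower scores indicate more fabricated or altered fields. An embedding-based vectorizer could be swapped in for `bag_of_words` without changing the rest.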

Title: HalluClean: A Unified Framework to Combat Hallucinations in LLMs

Authors: Yaxin Zhao, Yu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08916
Pdf URL: https://arxiv.org/pdf/2511.08916
Copy Paste: [[2511.08916]] HalluClean: A Unified Framework to Combat Hallucinations in LLMs(https://arxiv.org/abs/2511.08916)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中取得了令人印象深刻的性能，但它们经常产生破坏事实可靠性的幻觉内容。为了应对这一挑战，我们引入了 HalluClean，这是一个轻量级且与任务无关的框架，用于检测和纠正法学硕士生成文本中的幻觉。 HalluClean 采用推理增强范式，将流程明确分解为规划、执行和修订阶段，以识别和完善不支持的主张。它采用最少的任务路由提示来实现跨不同领域的零样本泛化，而不依赖于外部知识源或监督检测器。我们对五个有代表性的任务——问答、对话、总结、数学应用题和矛盾检测——进行了广泛的评估。实验结果表明，HalluClean 显着提高了事实一致性，并超越了竞争基线，证明了其在现实应用中增强 LLM 输出可信度的潜力。

Title: TiDAR: Think in Diffusion, Talk in Autoregression

Authors: Jingyu Liu, Xin Dong, Zhifan Ye, Rishabh Mehta, Yonggan Fu, Vartika Singh, Jan Kautz, Ce Zhang, Pavlo Molchanov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.08923
Pdf URL: https://arxiv.org/pdf/2511.08923
Copy Paste: [[2511.08923]] TiDAR: Think in Diffusion, Talk in Autoregression(https://arxiv.org/abs/2511.08923)
Keywords: language model
Abstract: Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
摘要：扩散语言模型有望实现快速并行生成，而自回归 (AR) 模型通常由于其因果结构与语言建模自然一致而在质量上表现出色。这就提出了一个基本问题：我们能否实现高吞吐量、更高 GPU 利用率和 AR 级别质量的协同作用？现有方法未能有效平衡这两方面，要么优先考虑 AR，使用较弱的模型进行顺序绘图（推测解码），导致绘图效率较低，要么使用某种形式的从左到右（类似 AR）解码逻辑进行扩散，但仍然会遭受质量下降并丧失其潜在的并行性。我们引入了 TiDAR，一种序列级混合架构，它在 Diffusion 中起草 token（Thinking）并自动回归采样最终输出（Talking）——所有这些都在使用专门设计的结构化注意力掩模的单个前向传递中进行。该设计利用了免费的 GPU 计算密度，在绘图和验证能力之间实现了强有力的平衡。此外，TiDAR 作为独立模型被设计为服务友好（低开销）。我们针对 AR 模型、推测性解码和 1.5B 和 8B 尺度的生成和似然任务的扩散变体广泛评估 TiDAR。得益于并行绘图和采样以及精确的 KV 缓存支持，TiDAR 在测量吞吐量方面优于推测解码，并在效率和质量方面优于 Dream 和 Llada 等扩散模型。最值得注意的是，TiDAR 是第一个缩小与 AR 模型质量差距的架构，同时每秒提供 4.71 倍到 5.91 倍的代币。

Title: EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Authors: Longfei Zuo, Barbara Plank, Siyao Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.08949
Pdf URL: https://arxiv.org/pdf/2511.08949
Copy Paste: [[2511.08949]] EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI(https://arxiv.org/abs/2511.08949)
Keywords: language model, llm
Abstract: High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields greater improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.
摘要：高质量的数据集对于训练和评估可靠的 NLP 模型至关重要。在自然语言推理 (NLI) 等任务中，当多个标签对同一实例有效时，就会出现人工标签变异 (HLV)，从而很难将注释错误与合理的变异区分开来。早期的框架 VARIERR（Weber-Genzel 等人，2024）要求多个注释者在第一轮中解释他们的标签决策，并在第二轮中通过有效性判断来标记错误。然而，进行两轮手动注释成本高昂，并且可能会限制合理标签或解释的覆盖范围。我们的研究提出了一个新框架 EVADE，用于生成和验证解释以使用大型语言模型 (LLM) 检测错误。我们对 NLI 的人为检测错误和法学硕士检测错误进行了全面分析，包括分布比较、验证重叠以及对模型微调的影响。我们的实验表明，LLM 验证细化了生成的解释分布，使其与人类注释更加一致，并且与删除人类注释者识别的错误相比，从训练数据中删除 LLM 检测到的错误可以提高微调性能。这凸显了扩展错误检测的潜力，减少了人力，同时提高了标签变化下的数据集质量。

Title: Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models

Authors: Zhouxing Tan, Ruochong Xiong, Yulong Wan, Jinlong Ma, Hanlin Xue, Qichun Deng, Haifeng Jing, Zhengtong Zhang, Depei Liu, Shiyuan Luo, Junfei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09003
Pdf URL: https://arxiv.org/pdf/2511.09003
Copy Paste: [[2511.09003]] Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models(https://arxiv.org/abs/2511.09003)
Keywords: language model, llm
Abstract: Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.
摘要：情感支持是人机交互的核心能力，其应用包括心理咨询、角色扮演和陪伴。然而，现有的大语言模型（LLM）评估通常依赖于简短的静态对话，无法捕捉情感支持的动态和长期性质。为了克服这一限制，我们从基于快照的评估转向基于轨迹的评估，采用以用户为中心的视角，根据模型随着时间的推移改善和稳定用户情绪状态的能力来评估模型。我们的框架构建了一个由 328 个情绪情境和 1,152 个干扰事件组成的大规模基准，模拟不断变化的对话场景下的真实情绪变化。为了鼓励基于心理的反应，我们使用经过验证的情绪调节策略（例如情境选择和认知重新评估）来限制模型输出。用户情绪轨迹被建模为一阶马尔可夫过程，我们应用因果调整的情绪估计来获得无偏的情绪状态跟踪。基于这个框架，我们引入了三个轨迹级指标：基线情绪水平（BEL）、情绪轨迹波动性（ETV）和情绪质心位置（ECP）。这些指标共同捕获了一段时间内的用户情绪动态，并支持对法学硕士长期情感支持绩效的综合评估。对不同法学硕士的广泛评估揭示了情感支持能力的显着差异，并为模型开发提供了可行的见解。

Title: A Neurosymbolic Approach to Natural Language Formalization and Verification

Authors: Sam Bayless, Stefano Buliani, Darion Cassel, Byron Cook, Duncan Clough, Rémi Delmas, Nafi Diallo, Ferhat Erata, Nick Feng, Dimitra Giannakopoulou, Aman Goel, Aditya Gokhale, Joe Hendrix, Marc Hudak, Dejan Jovanović, Andrew M. Kent, Benjamin Kiesl-Reiter, Jeffrey J. Kuna, Nadia Labai, Joseph Lilien, Divya Raghunathan, Zvonimir Rakamarić, Niloofar Razavi, Michael Tautschnig, Ali Torkamani, Nathaniel Weir, Michael W. Whalen, Jianan Yao
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2511.09008
Pdf URL: https://arxiv.org/pdf/2511.09008
Copy Paste: [[2511.09008]] A Neurosymbolic Approach to Natural Language Formalization and Verification(https://arxiv.org/abs/2511.09008)
Keywords: language model, llm
Abstract: Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.
摘要：大型语言模型在自然语言解释和推理方面表现良好，但其固有的随机性限制了它们在金融和医疗保健等在严格政策下运营的受监管行业中的采用。为了解决这一限制，我们提出了一个两阶段的神经符号框架，该框架（1）使用具有可选人工指导的法学硕士来形式化自然语言策略，从而允许对形式化过程进行细粒度控制；（2）使用推理时间自动形式化来验证自然语言语句针对这些策略的逻辑正确性。当正确性至关重要时，我们在推理时执行多个冗余的形式化步骤，交叉检查形式化的语义等价性。我们的基准测试表明，我们的方法的健全性超过 99%，表明在识别逻辑有效性方面的误报率接近于零。我们的方法产生可审计的逻辑工件，证实验证结果并可用于改进原始文本。

Title: MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

Authors: Gailun Zeng, Ziyang Luo, Hongzhan Lin, Yuchen Tian, Kaixin Li, Ziyang Gong, Jianxiong Guo, Jing Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09067
Pdf URL: https://arxiv.org/pdf/2511.09067
Copy Paste: [[2511.09067]] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique(https://arxiv.org/abs/2511.09067)
Keywords: gpt
Abstract: The ability to critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at this https URL.
摘要：批判能力对于模型的自我完善和作为可靠的人工智能助手至关重要。虽然在纯语言环境中进行了广泛的研究，但对大型多模态模型 (LMM) 的多模态批判仍未得到充分探索，尽管它们在字幕和视觉推理等任务中的能力不断增强。在这项工作中，我们引入了 MM-CRITIC，这是一个用于评估 LMM 跨多个维度的批判能力的整体基准：基础、校正和比较。 MM-CRITIC 涵盖 8 个主要任务类型和 500 多个任务，收集了具有不同模型大小的各种 LMM 的响应，由 4471 个样本组成。为了提高评估的可靠性，我们将专家知情的基本答案整合到评分细则中，指导 GPT-4o 注释回复并生成参考评论，作为可信判断的锚点。大量实验验证了MM-CRITIC的有效性，并在多个维度下对领先的LMM的批评能力进行了全面评估。进一步的分析揭示了一些关键的见解，包括回应质量和批评之间的相关性，以及跨评估维度的不同批评难度。我们的代码可以在这个 https URL 上找到。

Title: Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition

Authors: Chao Wang, Yuqing Cai, Renzeng Duojie, Jin Zhang, Yutong Liu, Nyima Tashi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09085
Pdf URL: https://arxiv.org/pdf/2511.09085
Copy Paste: [[2511.09085]] Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition(https://arxiv.org/abs/2511.09085)
Keywords: language model
Abstract: In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.
摘要：在这项工作中，我们提出了一个安多藏文的流式语音识别框架，该框架建立在具有上下文感知动态分块机制的混合 CTC/注意力架构之上。所提出的策略根据编码状态自适应地调整块宽度，实现灵活的感受野、跨块信息交换以及对不同语速的鲁棒适应，从而减轻固定块方法的上下文截断问题。为了进一步捕捉藏语的语言特征，我们构建了一个基于其拼字原则的词典，提供了语言驱动的建模单元。在解码过程中，集成外部语言模型以增强语义一致性并提高对长句子的识别。实验结果表明，所提出的框架在测试集上实现了 6.23% 的字错误率 (WER)，相对于固定块基线提高了 48.15%，同时显着降低了识别延迟并保持接近全局解码的性能。
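
Note on the reported numbers above: WER is the word-level edit distance between reference and hypothesis divided by the reference length, and a relative improvement is (baseline - new) / baseline. The abstract does not state the fixed-chunk baseline WER; the sketch below only derives the value implied by the two reported figures, assuming both are exact. Whitespace tokenization is also an assumption, since the paper builds its own lexicon of modeling units for Tibetan.

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def implied_baseline(new_wer, relative_improvement):
    """Solve (baseline - new) / baseline = relative_improvement for baseline."""
    return new_wer / (1.0 - relative_improvement)


# Toy strings: one substitution and one deletion against four reference tokens.
print(wer("a b c d", "a x c"))
# Baseline WER (in percent) implied by 6.23% WER and a 48.15% relative gain.
print(round(implied_baseline(6.23, 0.4815), 2))
```

Under these assumptions the implied fixed-chunk baseline is roughly 12% WER, so the reported gain corresponds to cutting the error rate by about half.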

Title: Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

Authors: Wenda Wei, Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Lixin Su, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Xueqi Cheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.09109
Pdf URL: https://arxiv.org/pdf/2511.09109
Copy Paste: [[2511.09109]] Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning(https://arxiv.org/abs/2511.09109)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
摘要：检索增强生成 (RAG) 已被证明可以有效减轻大型语言模型中的幻觉，但其有效性在复杂的多步骤推理中仍然有限。该 http URL 工作已将基于搜索的交互纳入 RAG，从而实现了实时检索的迭代推理。大多数方法依赖于基于结果的监督，没有为中间步骤提供明确的指导。这通常会导致奖励黑客攻击和响应质量下降。我们提出了 Bi-RAR，一种新颖的检索增强推理框架，可以在前向和后向方向上联合评估每个中间步骤。为了评估每个步骤的信息完整性，我们引入了基于柯尔莫哥洛夫复杂度的双向信息距离，通过语言模型生成概率进行近似。这种量化既可以衡量当前推理与答案的差距，也可以衡量它解决问题的程度。为了优化这些双向信号下的推理，我们采用了多目标强化学习框架，该框架具有强调早期轨迹对齐的级联奖励结构。七个问答基准的实证结果表明，Bi-RAR 超越了之前的方法，能够在训练和推理过程中与搜索引擎进行高效的交互和推理。

Title: Assessing the Capabilities of LLMs in Humor:A Multi-dimensional Analysis of Oogiri Generation and Evaluation

Authors: Ritsu Sakabe, Hwichan Kim, Tosho Hirasawa, Mamoru Komachi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09133
Pdf URL: https://arxiv.org/pdf/2511.09133
Copy Paste: [[2511.09133]] Assessing the Capabilities of LLMs in Humor:A Multi-dimensional Analysis of Oogiri Generation and Evaluation(https://arxiv.org/abs/2511.09133)
Keywords: language model, llm, agent
Abstract: Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply "funny." This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.
摘要：计算幽默是创建先进且引人入胜的自然语言处理（NLP）应用程序（例如复杂的对话系统）的前沿。虽然之前的研究对大型语言模型 (LLM) 的幽默能力进行了基准测试，但它们往往依赖于单维评估，例如判断某件事是否只是“有趣”。本文认为，对幽默的多方面理解是必要的，并通过 Oogiri（日本即兴喜剧游戏的一种形式）的视角系统地评估 LLM，从而弥补了这一差距。为了实现这一目标，我们使用新来源的数据扩展了现有的 Oogiri 数据集，然后使用法学硕士生成的 Oogiri 响应来扩充集合。然后，我们通过六个维度的 5 分绝对评级手动注释这个扩展的集合：新颖性、清晰度、相关性、智力、同理心和整体有趣性。使用该数据集，我们评估了最先进的法学硕士在两项核心任务上的能力：他们产生创造性 Oogiri 响应的能力以及他们使用六维评估来评估响应的有趣性的能力。我们的结果表明，虽然法学硕士可以产生介于低级和中级人类表现之间的反应，但他们表现出明显缺乏同理心。这种同理心的缺陷有助于解释他们未能复制人类幽默评估的原因。人类和模型评估数据的相关性分析进一步揭示了评估标准的根本差异：法学硕士优先考虑新颖性，而人类优先考虑同理心。我们向社区发布带注释的语料库，为开发更具情商和复杂的对话代理铺平道路。

Title: One-Topic-Doesn't-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning

Authors: Jieun Han, Daniel Lee, Haneul Yoo, Jinsung Yoon, Junyeong Park, Suin Kim, So-Yeon Ahn, Alice Oh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.09135
Pdf URL: https://arxiv.org/pdf/2511.09135
Copy Paste: [[2511.09135]] One-Topic-Doesn't-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning(https://arxiv.org/abs/2511.09135)
Keywords: gpt
Abstract: Personalized learning has gained attention in English as a Foreign Language (EFL) education, where engagement and motivation play crucial roles in reading comprehension. We propose a novel approach to generating personalized English reading comprehension tests tailored to students' interests. We develop a structured content transcreation pipeline using OpenAI's gpt-4o, where we start with the RACE-C dataset, and generate new passages and multiple-choice reading comprehension questions that are linguistically similar to the original passages but semantically aligned with individual learners' interests. Our methodology integrates topic extraction, question classification based on Bloom's taxonomy, linguistic feature analysis, and content transcreation to enhance student engagement. We conduct a controlled experiment with EFL learners in South Korea to examine the impact of interest-aligned reading materials on comprehension and motivation. Our results show students learning with personalized reading passages demonstrate improved comprehension and motivation retention compared to those learning with non-personalized materials.
摘要：个性化学习在英语作为外语（EFL）教育中受到关注，其中参与度和动机在阅读理解中发挥着至关重要的作用。我们提出了一种根据学生兴趣生成个性化英语阅读理解测试的新颖方法。我们使用 OpenAI 的 gpt-4o 开发结构化内容创译管道，从 RACE-C 数据集开始，生成新的段落和多项选择阅读理解问题，这些问题在语言上与原始段落相似，但在语义上与个体学习者的兴趣一致。我们的方法集成了主题提取、基于布鲁姆分类法的问题分类、语言特征分析和内容创译，以提高学生的参与度。我们对韩国的英语学习者进行了一项对照实验，以研究兴趣一致的阅读材料对理解和动机的影响。我们的结果表明，与使用非个性化材料学习的学生相比，使用个性化阅读段落学习的学生表现出更好的理解力和动机保留。

Title: LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls

Authors: Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, Lei Zhang, Yong Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.09148
Pdf URL: https://arxiv.org/pdf/2511.09148
Copy Paste: [[2511.09148]] LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls(https://arxiv.org/abs/2511.09148)
Keywords: language model, llm
Abstract: Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines where data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model's specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.

Title: A Hybrid Search for Complex Table Question Answering in Securities Report

Authors: Daiki Shirafuji, Koji Tanaka, Tatsuhiko Saito
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09179
Pdf URL: https://arxiv.org/pdf/2511.09179
Copy Paste: [[2511.09179]] A Hybrid Search for Complex Table Question Answering in Securities Report(https://arxiv.org/abs/2511.09179)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have recently gained increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA that requires no manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach on the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). Although we used traditional encoder models for retrieval in this study, we plan to incorporate more efficient text-search models in the future to improve performance and narrow the gap with human evaluation results.
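
The hybrid header retrieval described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the equal weighting `alpha=0.5`, and the character-trigram similarity (a hypothetical stand-in for the contrastively trained language model) are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text (whitespace tokens, smoothed IDF)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in d.items()} for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def char_trigram_sim(a, b):
    """Hypothetical stand-in for a trained encoder's similarity score."""
    def grams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def best_header(question, headers, dense_sim=char_trigram_sim, alpha=0.5):
    """Index of the header with the highest hybrid (dense + TF-IDF) score."""
    vecs = tfidf_vectors([question] + headers)
    q_vec, h_vecs = vecs[0], vecs[1:]
    scores = [
        alpha * dense_sim(question, h) + (1 - alpha) * cosine(q_vec, hv)
        for h, hv in zip(headers, h_vecs)
    ]
    return max(range(len(headers)), key=scores.__getitem__)

def answer_cell(question, row_headers, col_headers, cells):
    """Pick the cell at the intersection of the best-matching row and column."""
    r = best_header(question, row_headers)
    c = best_header(question, col_headers)
    return cells[r][c]
```

In practice the `dense_sim` argument would be replaced by a similarity computed from encoder embeddings; the abstract does not state how the two scores are combined, so the linear mix here is only one plausible choice.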

Title: Context is Enough: Empirical Validation of $\textit{Sequentiality}$ on Essays

Authors: Amal Sunny, Advay Gupta, Vishnu Sreekumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09185
Pdf URL: https://arxiv.org/pdf/2511.09185
Copy Paste: [[2511.09185]] Context is Enough: Empirical Validation of $\textit{Sequentiality}$ on Essays(https://arxiv.org/abs/2511.09185)
Keywords: language model, llm, prompt
Abstract: Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.

Title: The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Authors: Francois Meyer, Jan Buys
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09197
Pdf URL: https://arxiv.org/pdf/2511.09197
Copy Paste: [[2511.09197]] The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages(https://arxiv.org/abs/2511.09197)
Keywords: language model
Abstract: Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

Title: Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

Authors: Jiarui Liu, Kaustubh Dhole, Yingheng Wang, Haoyang Wen, Sarah Zhang, Haitao Mao, Gaotang Li, Neeraj Varshney, Jingguo Liu, Xiaoman Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09222
Pdf URL: https://arxiv.org/pdf/2511.09222
Copy Paste: [[2511.09222]] Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning(https://arxiv.org/abs/2511.09222)
Keywords: language model
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine-tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in-distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground-truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.

Title: POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation

Authors: Xuanchen Li, Chenrui Cui, Tianrui Wang, Meng Ge, Zikang Huang, Jin Li, Yizhou Peng, Longbiao Wang, Jianwu Dang, Nyima Tashi
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2511.09232
Pdf URL: https://arxiv.org/pdf/2511.09232
Copy Paste: [[2511.09232]] POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation(https://arxiv.org/abs/2511.09232)
Keywords: language model, llm
Abstract: Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.

Title: C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

Authors: Yu Li, Zhe Yang, Yi Huang, Xin Liu, Guilin Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09292
Pdf URL: https://arxiv.org/pdf/2511.09292
Copy Paste: [[2511.09292]] C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation(https://arxiv.org/abs/2511.09292)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.

Title: LiteraryTaste: A Preference Dataset for Creative Writing Personalization

Authors: John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yi Wang, Yuqian Sun, Tiffany Wang, Shm Garanganao Almeda, Brett A. Halperin, Yuwen Lu, Max Kreminski
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.09310
Pdf URL: https://arxiv.org/pdf/2511.09310
Copy Paste: [[2511.09310]] LiteraryTaste: A Preference Dataset for Creative Writing Personalization(https://arxiv.org/abs/2511.09310)
Keywords: language model, llm
Abstract: People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.

Title: mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models

Authors: Arka Mukherjee, Shreya Ghosh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09339
Pdf URL: https://arxiv.org/pdf/2511.09339
Copy Paste: [[2511.09339]] mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models(https://arxiv.org/abs/2511.09339)
Keywords: language model, gpt
Abstract: Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontier models from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% of errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: this https URL

Title: Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

Authors: Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09345
Pdf URL: https://arxiv.org/pdf/2511.09345
Copy Paste: [[2511.09345]] Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling(https://arxiv.org/abs/2511.09345)
Keywords: language model, llm
Abstract: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the advance and accurate estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
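
The idea of using answer entropy from a fast first pass to set the sampling budget can be sketched as below. This is a rough illustration only: the abstract does not give the mapping from entropy to budget, so the linear mapping, the function names, and the `min_samples`/`max_samples` defaults are assumptions.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def sample_budget(answers, min_samples=1, max_samples=16):
    """Map the entropy of quick probe answers to a sample count for the slow pass.

    Assumed rule: unanimous probes get the minimum budget; otherwise the budget
    grows linearly with entropy normalized by its maximum, log2(number of probes).
    Expects at least one probe answer.
    """
    if len(set(answers)) <= 1:
        return min_samples
    normalized = answer_entropy(answers) / math.log2(len(answers))
    return min_samples + round(normalized * (max_samples - min_samples))
```

Because the budget is fixed before the expensive pass starts, all of its samples can be requested in parallel, which is where the latency saving described in the abstract would come from.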

Title: MTQ-Eval: Multilingual Text Quality Evaluation for Language Models

Authors: Rhitabrat Pokharel, Ameeta Agrawal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09374
Pdf URL: https://arxiv.org/pdf/2511.09374
Copy Paste: [[2511.09374]] MTQ-Eval: Multilingual Text Quality Evaluation for Language Models(https://arxiv.org/abs/2511.09374)
Keywords: language model, llm
Abstract: The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.

Title: Self-Correcting Large Language Models: Generation vs. Multiple Choice

Authors: Hossein A. Rahmani, Satyapriya Krishna, Xi Wang, Mohammadmehdi Naghiaei, Emine Yilmaz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09381
Pdf URL: https://arxiv.org/pdf/2511.09381
Copy Paste: [[2511.09381]] Self-Correcting Large Language Models: Generation vs. Multiple Choice(https://arxiv.org/abs/2511.09381)
Keywords: language model, llm, agent
Abstract: Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: while open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs.

Title: AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

Authors: Ruibo Deng, Duanyu Feng, Wenqiang Lei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09385
Pdf URL: https://arxiv.org/pdf/2511.09385
Copy Paste: [[2511.09385]] AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment(https://arxiv.org/abs/2511.09385)
Keywords: language model
Abstract: Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, the effectiveness of these methods is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO not only achieves better ranking accuracy and superior downstream alignment performance, but targeted analysis also confirms that it successfully mitigates the core overfitting and underfitting issues.
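
One possible reading of an instance-wise margin "refined by Z-normalization and exponential scaling" is sketched below. The abstract does not give AMaPO's formula, so everything here is an assumption for illustration: `gaps` stands for per-sample reward gaps (chosen minus rejected), the margin is `scale * exp(-z)`, and the loss is a standard logistic loss with the margin subtracted and treated as a constant.

```python
import math

def adaptive_margins(gaps, scale=1.0, eps=1e-8):
    """Z-normalize reward gaps over the batch, then scale exponentially.

    Samples with a low (misranked) gap get a large margin; samples with a
    high (correctly ranked) gap get a small one.
    """
    mean = sum(gaps) / len(gaps)
    std = math.sqrt(sum((g - mean) ** 2 for g in gaps) / len(gaps))
    return [scale * math.exp(-(g - mean) / (std + eps)) for g in gaps]

def margin_losses(gaps, scale=1.0):
    """Per-sample logistic loss -log(sigmoid(gap - margin)), written as softplus."""
    margins = adaptive_margins(gaps, scale)
    # log(1 + exp(x)) computed stably as max(x, 0) + log1p(exp(-|x|)).
    return [
        max(m - g, 0.0) + math.log1p(math.exp(-abs(m - g)))
        for g, m in zip(gaps, margins)
    ]
```

Under this sketch, the gradient of each loss with respect to its gap has magnitude `sigmoid(margin - gap)`, so it is larger for misranked samples and shrinks for confidently correct ones, which matches the reallocation the abstract describes in qualitative terms only.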

Title: Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Authors: Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.09396
Pdf URL: https://arxiv.org/pdf/2511.09396
Copy Paste: [[2511.09396]] Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque(https://arxiv.org/abs/2511.09396)
Keywords: language model, llm
Abstract: Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.

Title: CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

Authors: Bichen Wang, Yixin Sun, Junzhe Wang, Hao Yang, Xing Fu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09407
Pdf URL: https://arxiv.org/pdf/2511.09407
Copy Paste: [[2511.09407]] CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling(https://arxiv.org/abs/2511.09407)
Keywords: language model, llm
Abstract: The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.

Title: GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning

Authors: Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09411
Pdf URL: https://arxiv.org/pdf/2511.09411
Copy Paste: [[2511.09411]] GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning(https://arxiv.org/abs/2511.09411)
Keywords: language model, llm, prompt
Abstract: Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications enables to identify information about research concepts and resources on a large scale and therefore is a pathway to improve understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g. method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction, to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLM). We observe that the performance of state-of-the-art LLM prompting methods is largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This disparity of performance between supervised models and unsupervised usage of LLMs suggests datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.
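Summary: Research in machine learning (ML) and AI evolves rapidly. Information extraction (IE) from scientific publications makes it possible to identify information about research concepts and resources at scale, and is therefore a path toward better understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, such as method training and data usage, we introduce GSAP-ERE. It is a manually curated, fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that the dataset enables fine-tuned models to automatically extract information relevant to downstream tasks, ranging from knowledge graph (KG) construction to monitoring the computational reproducibility of AI research at scale. We also use the dataset as a test suite to explore prompting strategies for IE with large language models (LLMs). We observe that state-of-the-art LLM prompting methods are substantially outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This performance gap between supervised models and unsupervised use of LLMs suggests that datasets like GSAP-ERE are needed to advance research in scholarly information extraction.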

Title: BIG5-TPoT: Predicting BIG Five Personality Traits, Facets, and Items Through Targeted Preselection of Texts

Authors: Triet M. Le, Arjun Chandra, C. Anton Rytting, Valerie P. Karuzis, Vladimir Rife, William A. Simpson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.09426
Pdf URL: https://arxiv.org/pdf/2511.09426
Copy Paste: [[2511.09426]] BIG5-TPoT: Predicting BIG Five Personality Traits, Facets, and Items Through Targeted Preselection of Texts(https://arxiv.org/abs/2511.09426)
Keywords: language model
Abstract: Predicting an individual's personality from their generated texts is a challenging task, especially when the text volume is large. In this paper, we introduce a straightforward yet effective novel strategy called targeted preselection of texts (TPoT). This method semantically filters the texts as input to a deep learning model, specifically designed to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, this strategy not only addresses the issue of input text limits in large language models but also improves the Mean Absolute Error and accuracy metrics in predictions for the Stream of Consciousness Essays dataset.
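Summary: Predicting an individual's personality from the texts they produce is a challenging task, especially when the volume of text is large. In this paper, we introduce a simple yet effective new strategy called targeted preselection of texts (TPoT). The method semantically filters the texts used as input to a deep learning model designed specifically to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, the strategy not only addresses the input length limits of large language models but also improves the mean absolute error and accuracy of predictions on the Stream of Consciousness Essays dataset.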

Title: SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification

Authors: Mohamed Elaraby, Jyoti Prakash Maheswari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.09539
Pdf URL: https://arxiv.org/pdf/2511.09539
Copy Paste: [[2511.09539]] SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification(https://arxiv.org/abs/2511.09539)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification -- a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.
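Summary: Large language models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Building annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating the utility of synthetic data in long-context claim verification, a task central to hallucination detection and fact-checking. The framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error-type variation; and (iii) explanation quality, by measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when it augments existing human-written datasets. Moreover, synthesis improves explanation quality even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.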