2026-01-26

Title: ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation

Authors: Qingyan Yang, Tongxi Wang, Yunsheng Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16217
Pdf URL: https://arxiv.org/pdf/2601.16217
Copy Paste: [[2601.16217]] ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation(https://arxiv.org/abs/2601.16217)
Keywords: language model
Abstract: Code-mixing is increasingly prevalent in interactions between humans and large language models, yet existing work often reduces it to a translation or convertibility problem, making it difficult to assess whether a model's switching behavior is context-appropriate and aligned with human conventions. We introduce ChiEngMixBench, the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built upon a general construction pipeline that enables scalable dataset development across domains and bilingual pairs. ChiEngMixBench formulates code-mixing as a cognitive alignment problem, characterized by two complementary signals: Spontaneity and Naturalness. Empirical evaluation shows that our metrics can systematically distinguish code-mixing performance across models. Beyond benchmarking, we further uncover an implicitly emergent Terminology Layering Strategy, a phenomenon consistent with the Matrix Language Frame (MLF) theory, indicating structured cognitive alignment between multilingual large language models and human communication.

Title: M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models

Authors: Aleix Torres-Camps, Nathaniel Mitrani Hadida, Víctor Conchello Vendrell, Àlex Batlle Casellas, Arnau Padrés Masdemont, Jordi Ros-Giralt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16218
Pdf URL: https://arxiv.org/pdf/2601.16218
Copy Paste: [[2601.16218]] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models(https://arxiv.org/abs/2601.16218)
Keywords: language model
Abstract: Although state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world's largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, some of which include diagrams essential for solving them, with translations into 108 culturally diverse languages. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, with performance scaling with language presence and model size, but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.

Title: Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models

Authors: Erdem Aslan, Pakize Erdoğmuş
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16219
Pdf URL: https://arxiv.org/pdf/2601.16219
Copy Paste: [[2601.16219]] Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models(https://arxiv.org/abs/2601.16219)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) excel in general tasks but often struggle with hallucinations when handling domain-specific or institutional knowledge absent from their pre-training. We present an offline response-based knowledge distillation method that develops high-accuracy specialized assistants under constrained hardware resources. We evaluate three distinct data strategies: general domain adaptation (15,000 lines), unstructured knowledge injection (2,000 lines), and a context-aware synthetic dataset (500 lines) generated by a teacher model. To minimize computational costs, we utilize the Unsloth library to optimize the Qwen-2.5-7B student model, reducing NVIDIA A100 GPU memory requirements from 40 GB to 16 GB. Experimental results demonstrate that while larger unstructured datasets suffer from persistent hallucinations, the 500-line context-aware dataset achieves a 96.7% accuracy rate and robust rejection capability. These findings validate the LIMA hypothesis, showing that data quality and structural alignment are more critical than quantity for domain adaptation in low-resource settings.

Title: Towards Latent Diffusion Suitable For Text

Authors: Nesta Midavaine, Christian A. Naesseth, Grigory Bartosh
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2601.16220
Pdf URL: https://arxiv.org/pdf/2601.16220
Copy Paste: [[2601.16220]] Towards Latent Diffusion Suitable For Text(https://arxiv.org/abs/2601.16220)
Keywords: language model, llm
Abstract: Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.

Title: Limits of n-gram Style Control for LLMs via Logit-Space Injection

Authors: Sami-ul Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16224
Pdf URL: https://arxiv.org/pdf/2601.16224
Copy Paste: [[2601.16224]] Limits of n-gram Style Control for LLMs via Logit-Space Injection(https://arxiv.org/abs/2601.16224)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are typically personalized via prompt engineering or parameter-efficient fine-tuning such as LoRA. However, writing style can be difficult to distill into a single prompt, and LoRA fine-tuning requires computationally intensive training and infrastructure. We investigate a possible lightweight alternative: steering a frozen LLM with n-gram style priors injected in logit space at decoding time. We train an n-gram model on stylistically distinct corpora -- including Don Quixote, CNN/DailyMail news headlines, and arXiv abstracts -- constructing an interpolated 1-to-3-gram prior over next-token probabilities. During generation we modify the LLM's logits by adding a weighted sum of style log-probabilities from each n-gram order that matches the current context, scaled by a control parameter lambda in [0, 1]. We sweep lambda and style corpora and report style perplexity under the n-gram model, base-model perplexity as a proxy for fluency, Jensen-Shannon (JS) divergence between the original and steered token distributions, and token-overlap statistics. On TinyLlama-1.1B we identify a single narrow regime (for the Don Quixote corpus at lambda=0.1) where style perplexity improves by 24.7% and base-model perplexity improves by 51.4% relative to the frozen model. Outside this regime, and for multi-author corpora such as CNN/DailyMail and arXiv abstracts, even small nonzero lambda values generally result in worse style and fluency, and larger lambda values lead to collapse with extreme perplexities and incoherent text. Logit-space injection of n-gram style priors provides lightweight, tunable style control, but it is fragile: it operates effectively only within a narrow range of low lambda values and is consistently outperformed by prompting and LoRA.
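The decoding-time steering described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the toy corpus, the interpolation weights, the add-alpha smoothing, and every function name here are assumptions chosen for clarity, and the uniform `base_logits` stand in for a frozen LLM's real logits.

```python
import math
from collections import Counter, defaultdict


def train_ngram_counts(tokens, max_order=3):
    """Count continuations for every context of length 0..max_order-1."""
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        for order in range(1, max_order + 1):
            if i - (order - 1) < 0:
                break  # not enough history for this order
            context = tuple(tokens[i - order + 1:i])
            counts[context][tok] += 1
    return counts


def style_bonus(counts, history, vocab, weights=(0.2, 0.3, 0.5), alpha=1.0):
    """Weighted sum of smoothed style log-probs from each n-gram order that matches."""
    bonus = {tok: 0.0 for tok in vocab}
    for order, w in enumerate(weights, start=1):
        context = tuple(history[len(history) - (order - 1):]) if order > 1 else ()
        if len(context) != order - 1 or context not in counts:
            continue  # this order does not match the current context
        seen = counts[context]
        total = sum(seen.values()) + alpha * len(vocab)
        for tok in vocab:
            bonus[tok] += w * math.log((seen[tok] + alpha) / total)
    return bonus


def steer_logits(logits, bonus, lam):
    """Add the style bonus to the model logits, scaled by lambda in [0, 1]."""
    return {tok: logits[tok] + lam * bonus[tok] for tok in logits}


def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two distributions on the same keys."""
    def kl(a, b):
        return sum(a[t] * math.log(a[t] / b[t]) for t in a if a[t] > 0)
    mid = {t: 0.5 * (p[t] + q[t]) for t in p}
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


# Toy usage: a tiny "style corpus" and uniform stand-in logits (both hypothetical).
corpus = "the knight rode the horse and the knight fought the giant".split()
vocab = sorted(set(corpus))
counts = train_ngram_counts(corpus)
base_logits = {tok: 0.0 for tok in vocab}
bonus = style_bonus(counts, ["the"], vocab)
steered_logits = steer_logits(base_logits, bonus, lam=0.1)
p, q = softmax(base_logits), softmax(steered_logits)
shift = js_divergence(p, q)
```

Setting `lam=0` leaves the logits untouched, so the JS divergence between original and steered distributions is zero; raising `lam` moves the distribution further toward the n-gram prior, which is the knob the abstract reports sweeping.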

Title: GameTalk: Training LLMs for Strategic Conversation

Authors: Victor Conchello Vendrell, Max Ruiz Luyten, Mihaela van der Schaar
Subjects: cs.CL, cs.AI, cs.GT, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2601.16276
Pdf URL: https://arxiv.org/pdf/2601.16276
Copy Paste: [[2601.16276]] GameTalk: Training LLMs for Strategic Conversation(https://arxiv.org/abs/2601.16276)
Keywords: language model, llm, agent
Abstract: Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce GameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.
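One way to picture a reward that depends on the entire interaction is the group-relative normalization commonly used in GRPO, applied to one scalar reward per full conversation. The sketch below is only an assumption about how such a signal could be wired up, not the GameTalk implementation; the function names and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize conversation-level rewards within a sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(conversations, advantages):
    """Give every model turn in a conversation that conversation's single advantage."""
    return [[adv] * len(turns) for turns, adv in zip(conversations, advantages)]


# Hypothetical group of four sampled conversations for the same game state.
conversations = [
    ["offer", "counter", "accept"],
    ["offer", "reject"],
    ["offer", "counter", "counter", "accept"],
    ["offer", "accept"],
]
rewards = [1.0, 0.0, 0.5, 0.5]  # one outcome-based reward per full conversation
advantages = group_relative_advantages(rewards)
per_turn = broadcast_to_turns(conversations, advantages)
```

Because the reward is only known once the game ends, every turn in a conversation shares the same advantage here; a finer-grained credit assignment would need additional assumptions.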

Title: Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification

Authors: Branislav Pecher, Jan Cegin, Robert Belanec, Ivan Srba, Jakub Simko, Maria Bielikova
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.16278
Pdf URL: https://arxiv.org/pdf/2601.16278
Copy Paste: [[2601.16278]] Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification(https://arxiv.org/abs/2601.16278)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.

Title: Generating Literature-Driven Scientific Theories at Scale

Authors: Peter Jansen, Peter Clark, Doug Downey, Daniel S. Weld
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16282
Pdf URL: https://arxiv.org/pdf/2601.16282
Copy Paste: [[2601.16282]] Generating Literature-Driven Scientific Theories at Scale(https://arxiv.org/abs/2601.16282)
Keywords: llm, agent
Abstract: Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, and examine how literature-grounded versus parametric-knowledge generation, and accuracy-focused versus novelty-focused generation objectives, change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better both at matching existing evidence and at predicting future results from 4.6k subsequently written papers.

Title: Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Authors: Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16312
Pdf URL: https://arxiv.org/pdf/2601.16312
Copy Paste: [[2601.16312]] Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks(https://arxiv.org/abs/2601.16312)
Keywords: language model, llm
Abstract: Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs prove ineffective on this problem space because (i) most models lack polymer-specific knowledge, and (ii) existing aligned models lack coverage of knowledge and capabilities relevant to polymer design. To address this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design related tasks, leveraging a knowledge base of 13M+ data points obtained from experimental and synthetic sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs) of 7B to 14B parameters trained on PolyBench data outperform similarly sized models, and even closed-source frontier LLMs, on the PolyBench test dataset, while also demonstrating gains on other polymer benchmarks.

Title: Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP

Authors: Andres Karjus, Kais Allkivi, Silvia Maine, Katarin Leppik, Krister Kruusmaa, Merilin Aruvee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16314
Pdf URL: https://arxiv.org/pdf/2601.16314
Copy Paste: [[2601.16314]] Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLMs and Statistical NLP(https://arxiv.org/abs/2601.16314)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important in cases where a large number of exams must be graded in a limited time frame, such as nationwide graduation exams in various countries. Here, we examine the applicability of automated scoring on two large datasets of trial exam essays of two full national cohorts from Estonia. We operationalize the official curriculum-based rubric and compare LLM and statistical natural language processing (NLP) based assessments with human panel scores. The results show that automated scoring can achieve performance comparable to that of human raters and tends to fall within the human scoring range. We also evaluate bias, prompt injection risks, and LLMs as essay writers. These findings demonstrate that a principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, which is particularly relevant for digitally advanced societies like Estonia, which is about to adopt a fully electronic examination system. Furthermore, the system produces fine-grained subscore profiles that can be used to generate systematic, personalized feedback for instruction and exam preparation. The study provides evidence that LLM-assisted assessment can be implemented at a national scale, even in a small-language context, while maintaining human oversight and compliance with emerging educational and regulatory standards.

Title: Regional Bias in Large Language Models

Authors: M P V S Gopinadh, Kappara Lakshmi Sindhu, Soma Sekhar Pandu Ranga Raju P, Yesaswini Swarna
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16349
Pdf URL: https://arxiv.org/pdf/2601.16349
Copy Paste: [[2601.16349]] Regional Bias in Large Language Models(https://arxiv.org/abs/2601.16349)
Keywords: language model, gpt, llm, prompt
Abstract: This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs: GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B using a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet scoring the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic biases in language models.

Title: Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans

Authors: Suhong Moon, Minwoo Kang, Joseph Suh, Mustafa Safdari, John Canny
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16355
Pdf URL: https://arxiv.org/pdf/2601.16355
Copy Paste: [[2601.16355]] Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans(https://arxiv.org/abs/2601.16355)
Keywords: language model, llm, chat
Abstract: Humans act via a nuanced process that depends on rational deliberation as well as on identity and contextual factors. In this work, we study how large language models (LLMs) can simulate human action in the context of social dilemma games. While prior work has focused on "steering" (weak binding) of chat models to simulate personas, we analyze here how deep binding of base models with extended backstories leads to more faithful replication of identity-based behaviors. Our main finding is that simulation fidelity relative to human studies improves when base LMs are conditioned on rich narrative-identity context and consistency is checked using instruction-tuned models. We also show that LLMs can model contextual factors such as time (the year in which a study was performed), question framing, and participant pool effects. LLMs therefore allow us to explore details that affect human studies but are often omitted from experiment descriptions, which hampers accurate replication.

Title: PolyAgent: Large Language Model Agent for Polymer Design

Authors: Vani Nigam, Achuth Chandrasekhar, Amir Barati Farimani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16376
Pdf URL: https://arxiv.org/pdf/2601.16376
Copy Paste: [[2601.16376]] PolyAgent: Large Language Model Agent for Polymer Design(https://arxiv.org/abs/2601.16376)
Keywords: language model, llm, agent
Abstract: On-demand polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers involve a long trial-and-error process that leads to lengthy procedures and consumes extensive resources. For these processes, machine learning has accelerated scientific discovery on the property prediction and latent space search fronts. However, due to infrastructure limitations, laboratory researchers cannot readily access the code and models needed to extract individual structures and properties. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.

Title: Cross-Lingual Activation Steering for Multilingual Language Models

Authors: Rhitabrat Pokharel, Ameeta Agrawal, Tanay Nagar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16390
Pdf URL: https://arxiv.org/pdf/2601.16390
Copy Paste: [[2601.16390]] Cross-Lingual Activation Steering for Multilingual Language Models(https://arxiv.org/abs/2601.16390)
Keywords: language model
Abstract: Large language models exhibit strong multilingual capabilities, yet significant performance gaps persist between dominant and non-dominant languages. Prior work attributes this gap to imbalances between shared and language-specific neurons in multilingual representations. We propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations. We evaluate CLAS on classification and generation benchmarks, achieving average improvements of 2.3% (Acc.) and 3.4% (F1) respectively, while maintaining high-resource language performance. We discover that effective transfer operates through functional divergence rather than strict alignment; performance gains correlate with increased language cluster separation. Our results demonstrate that targeted activation steering can unlock latent multilingual capacity in existing models without modification to model weights.

Title: Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification

Authors: Zongwan Cao, Bingbing Wen, Lucy Lu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16400
Pdf URL: https://arxiv.org/pdf/2601.16400
Copy Paste: [[2601.16400]] Clarify or Answer: Reinforcement Learning for Agentic VQA with Context Under-specification(https://arxiv.org/abs/2601.16400)
Keywords: llm, prompt, agent
Abstract: Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY, which comprises a set of ambiguous VQA questions and a contrast set of non-ambiguous ones. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
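The ask-or-answer control flow described above can be sketched as a small dispatcher. This is an illustrative outline under stated assumptions, not the authors' code: the four callables are hypothetical stand-ins for the decision module, the question generator, the answering VLM, and the user, and the keyword rule in the demo is a toy placeholder for a learned decision.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CoAResult:
    asked: Optional[str]  # clarification question, or None if answered directly
    answer: str


def clarify_or_answer(
    image: Any,
    question: str,
    needs_clarification: Callable[[Any, str], bool],
    ask: Callable[[Any, str], str],
    answer: Callable[[Any, str, Optional[str]], str],
    get_user_reply: Callable[[str], str],
) -> CoAResult:
    """Decide first; ask at most one focused question; then answer with the reply."""
    if not needs_clarification(image, question):
        return CoAResult(asked=None, answer=answer(image, question, None))
    clarifying_question = ask(image, question)
    reply = get_user_reply(clarifying_question)
    return CoAResult(asked=clarifying_question, answer=answer(image, question, reply))


# Toy stand-ins (all hypothetical) so the flow can be exercised end to end.
def toy_needs_clarification(image, question):
    return "this" in question.lower().split()


def toy_ask(image, question):
    return "Who is going to eat it?"


def toy_answer(image, question, context):
    return f"answer given context: {context}" if context else "direct answer"


def toy_user(clarifying_question):
    return "a person with a nut allergy"


direct = clarify_or_answer("img", "What color is the car?",
                           toy_needs_clarification, toy_ask, toy_answer, toy_user)
clarified = clarify_or_answer("img", "Is this safe to eat?",
                              toy_needs_clarification, toy_ask, toy_answer, toy_user)
```

Keeping the decision and the question generator as separate callables mirrors the abstract's point that the two are modeled separately, so either can be swapped or trained on its own.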

Title: Jacobian Scopes: token-level causal attributions in LLMs

Authors: Toni J.B. Liu, Baran Zadeoğlu, Nicolas Boullé, Raphaël Sarfati, Christopher J. Earls
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16407
Pdf URL: https://arxiv.org/pdf/2601.16407
Copy Paste: [[2601.16407]] Jacobian Scopes: token-level causal attributions in LLMs(https://arxiv.org/abs/2601.16407)
Keywords: language model, llm
Abstract: Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of the final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model's prediction. We introduce three variants, Semantic, Fisher, and Temperature Scopes, which respectively target the sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation, and in-context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in-context time-series forecasting. Our code and interactive demonstrations are publicly available at this https URL.
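The idea of scoring each input token by how strongly a chosen logit responds to it can be illustrated on a toy model. The sketch below is an assumption-laden stand-in, not the paper's method: it uses central finite differences instead of automatic differentiation, a hypothetical two-dimensional `toy_logit` in place of an LLM, and the gradient norm per token as the influence score.

```python
import math


def toy_logit(embeddings, mix, W, u):
    """Toy model: pool token embeddings with fixed weights, apply tanh(W x), read out one logit."""
    dim = len(embeddings[0])
    pooled = [sum(c * x[j] for c, x in zip(mix, embeddings)) for j in range(dim)]
    hidden = [math.tanh(sum(row[j] * pooled[j] for j in range(dim))) for row in W]
    return sum(ui * hi for ui, hi in zip(u, hidden))


def token_influence(embeddings, f, h=1e-5):
    """Per-token influence: norm of d f / d embedding_t, estimated by central differences."""
    scores = []
    for t, x in enumerate(embeddings):
        grad_sq = 0.0
        for j in range(len(x)):
            plus = [row[:] for row in embeddings]
            minus = [row[:] for row in embeddings]
            plus[t][j] += h
            minus[t][j] -= h
            g = (f(plus) - f(minus)) / (2 * h)
            grad_sq += g * g
        scores.append(math.sqrt(grad_sq))
    return scores


# Hypothetical three-token input; the second token carries the largest mixing weight.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mix = [0.1, 0.7, 0.2]
W = [[1.0, 0.5], [-0.5, 1.0]]
u = [1.0, -1.0]
scores = token_influence(embeddings, lambda e: toy_logit(e, mix, W, u))
```

In this toy model the gradient with respect to each token is the same vector scaled by that token's mixing weight, so the scores recover the ordering of `mix`; in a real transformer the Jacobian would come from backpropagation through all layers.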
摘要：大型语言模型 (LLM) 根据上下文中出现的线索（例如语义描述和上下文示例）进行下一个标记预测。然而，由于现代架构中层和注意力头的激增，阐明哪些先验标记对给定的预测影响最大仍然具有挑战性。我们提出了雅可比范围（Jacobian Scopes），这是一套基于梯度的、标记级的因果归因方法，用于解释 LLM 的预测。通过分析最终隐藏状态相对于输入的线性关系，雅可比范围量化了输入标记如何影响模型的预测。我们引入了三种变体——语义范围、费舍尔范围和温度范围——它们分别针对特定逻辑的敏感性、完整的预测分布和模型置信度（逆温度）。通过涵盖教学理解、翻译和情境学习 (ICL) 的案例研究，我们发现了有趣的发现，例如雅可比范围何时指出隐性政治偏见。我们相信，我们提出的方法也揭示了最近争论的背景时间序列预测的机制。我们的代码和交互式演示可通过此 https URL 公开获取。
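Note: a minimal sketch of the gradient-based, token-level attribution idea described above. It is not the authors' implementation: `toy_logit` is a hypothetical stand-in for a model, and gradients are estimated with central differences rather than automatic differentiation. Each token is scored by the norm of the gradient of one target logit with respect to that token's embedding.

```python
import math

def toy_logit(embeddings, weights):
    """Hypothetical stand-in for a model: maps token embeddings to one target logit."""
    pooled = [0.0] * len(embeddings[0])
    for w, emb in zip(weights, embeddings):
        for d, x in enumerate(emb):
            pooled[d] += w * x  # position-weighted sum of embeddings
    return sum(math.tanh(p) for p in pooled)

def token_attributions(embeddings, weights, eps=1e-5):
    """Per-token gradient norm of the logit, estimated by central differences."""
    scores = []
    for t in range(len(embeddings)):
        squared = 0.0
        for d in range(len(embeddings[t])):
            plus = [row[:] for row in embeddings]
            minus = [row[:] for row in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            grad = (toy_logit(plus, weights) - toy_logit(minus, weights)) / (2 * eps)
            squared += grad * grad
        scores.append(math.sqrt(squared))
    return scores

# Illustrative inputs (assumed): three tokens with 2-dimensional embeddings.
embeddings = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]
weights = [0.1, 1.0, 0.5]  # the toy model relies most on token 1
scores = token_attributions(embeddings, weights)
```

In this toy setup the token with the largest weight receives the largest attribution score, which is the behavior a gradient-based attribution is expected to show.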

Title: Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning

Authors: Qinglong Cao, Yuntian Chen, Chao Ma, Xiaokang Yang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.16419
Pdf URL: https://arxiv.org/pdf/2601.16419
Copy Paste: [[2601.16419]] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning(https://arxiv.org/abs/2601.16419)
Keywords: language model, llm, prompt
Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model's behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate good performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.
摘要：多模态大语言模型（MLLM）在多模态感知和理解任务中表现出了卓越的能力。然而，它们在遥感和医学成像等专业领域的有效性仍然有限。领域适应的一种自然方法是通过文本指令、提示或辅助标题注入领域知识。令人惊讶的是，我们发现这种输入级领域知识注入对科学多模态任务几乎没有任何改进，即使明确提供了领域知识。这一观察结果表明，当前的 MLLM 无法仅通过语言内化特定于领域的先验，并且必须在优化级别集成领域知识。受这种洞察力的启发，我们提出了一个强化微调框架，将领域知识直接纳入学习目标。我们没有将领域知识视为描述性信息，而是将其编码为领域知情的约束和奖励信号，从而塑造模型在输出空间中的行为。在遥感和医学领域的多个数据集上进行的广泛实验一致证明了良好的性能增益，在多模态域任务上取得了最先进的结果。我们的结果强调了优化级领域知识集成的必要性，并揭示了当前 MLLM 中文本域调节的基本局限性。

Title: Exploring the Effects of Alignment on Numerical Bias in Large Language Models

Authors: Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, Mamoru Komachi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16444
Pdf URL: https://arxiv.org/pdf/2601.16444
Copy Paste: [[2601.16444]] Exploring the Effects of Alignment on Numerical Bias in Large Language Models(https://arxiv.org/abs/2601.16444)
Keywords: language model, llm
Abstract: "LLM-as-a-judge," which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading to reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs, and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is most effective in reducing bias and improving performance, though still heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.
摘要：“LLM-as-a-judge”利用大型语言模型（LLM）作为评估器，已在许多评估任务中证明是有效的。然而，评估器 LLM 存在数值偏差，即某些评估分数被不成比例地频繁生成，从而导致评估性能下降。这项研究调查了这种偏差的原因。鉴于大多数评估器 LLM 通过指令调整和偏好调整来对齐，并且先前的研究表明对齐会减少输出多样性，我们假设数值偏差是由对齐引起的。为了测试这一点，我们比较了对齐前和对齐后 LLM 的输出，并观察到对齐确实增加了数值偏差。我们还探索了对齐后 LLM 的缓解策略，包括温度缩放、分布校准和分数范围调整。其中，分数范围调整在减少偏差和提高性能方面最有效，尽管仍然是启发式的。我们的研究结果强调需要进一步研究最佳分数范围选择和更稳健的缓解策略。
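Note: a minimal sketch of temperature scaling, one of the mitigation strategies named above, applied to a distribution over evaluation scores. The logits are made-up illustrative values, not data from the paper, and entropy is used here only as a rough proxy for how concentrated the scores are.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; low entropy means scores pile up on few values."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for scores 1 to 5, with score 4 strongly over-represented.
score_logits = [0.5, 1.0, 1.5, 4.0, 2.0]
sharp = softmax(score_logits, temperature=1.0)
flat = softmax(score_logits, temperature=2.0)
```

Raising the temperature spreads probability mass away from the over-represented score, so `entropy(flat)` exceeds `entropy(sharp)`.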

Title: Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16447
Pdf URL: https://arxiv.org/pdf/2601.16447
Copy Paste: [[2601.16447]] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go(https://arxiv.org/abs/2601.16447)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs' general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present LoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: this https URL.
摘要：大型语言模型（LLM）在数学和编码等推理任务中表现出了卓越的性能，匹配或超越了人类的能力。然而，这些令人印象深刻的推理能力在专业领域面临着巨大的挑战。以围棋为例，尽管AlphaGo已经在围棋领域确立了AI系统的高性能天花板，但主流的 LLM 仍然难以达到初学者水平，更不用说进行自然语言推理了。通用 LLM 和领域专家之间的这种性能差距极大地限制了 LLM 在更广泛的特定领域任务上的应用。在这项工作中，我们的目标是弥合 LLM 的一般推理能力和特定领域任务的专家知识之间的鸿沟。我们使用结构化围棋专业知识和通用长思维链（CoT）推理数据作为冷启动进行混合微调，然后进行强化学习，将围棋专家知识与通用推理能力相结合。通过这种方法，我们提出了 LoGos，一个强大的 LLM，它不仅保持出色的一般推理能力，而且还能用自然语言进行围棋对弈，展示有效的策略推理和准确的下一步预测。 LoGos 的表现可与人类职业选手相媲美，大大超越了所有现有的 LLM。通过这项工作，我们的目标是贡献将一般 LLM 推理能力应用于专业领域的见解。我们将发布第一个用于 LLM 训练的大规模围棋数据集、第一个 LLM 围棋评估基准，以及第一个在围棋方面达到人类专业水平表现的通用 LLM：此 https URL。

Title: Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation

Authors: Zhenghao Liu, Mingyan Wu, Xinze Li, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16462
Pdf URL: https://arxiv.org/pdf/2601.16462
Copy Paste: [[2601.16462]] Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation(https://arxiv.org/abs/2601.16462)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. Nevertheless, effectively integrating and interpreting key evidence scattered across noisy documents remains a critical challenge for existing RAG systems. In this paper, we propose GraphAnchor, a novel Graph-Anchored Knowledge Indexing approach that reconceptualizes graph structures from static knowledge representations into active, evolving knowledge indices. GraphAnchor incrementally updates a graph during iterative retrieval to anchor salient entities and relations, yielding a structured index that guides the LLM in evaluating knowledge sufficiency and formulating subsequent subqueries. The final answer is generated by jointly leveraging all retrieved documents and the final evolved graph. Experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of GraphAnchor, and reveal that GraphAnchor modulates the LLM's attention to more effectively associate key information distributed in retrieved documents. All code and data are available at this https URL.
摘要：检索增强生成（RAG）已成为通过整合外部知识来减轻大型语言模型（LLM）中的幻觉的主导范例。然而，有效整合和解释分散在嘈杂文档中的关键证据仍然是现有 RAG 系统的关键挑战。在本文中，我们提出了 GraphAnchor，一种新颖的图锚定知识索引方法，它将图结构从静态知识表示重新概念化为主动的、不断发展的知识索引。 GraphAnchor 在迭代检索期间增量更新图以锚定显着实体和关系，产生结构化索引，指导 LLM 评估知识充分性并制定后续子查询。最终答案是通过共同利用所有检索到的文档和最终演化图来生成的。对四个多跳问答基准的实验证明了 GraphAnchor 的有效性，并揭示了 GraphAnchor 调节了 LLM 的注意力，以更有效地关联分布在检索到的文档中的关键信息。所有代码和数据均可在此 https URL 中获取。

Title: Persona Jailbreaking in Large Language Models

Authors: Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16466
Pdf URL: https://arxiv.org/pdf/2601.16466
Copy Paste: [[2601.16466]] Persona Jailbreaking in Large Language Models(https://arxiv.org/abs/2601.16466)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety; it embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In high-risk domains such as mental health, tutoring, and customer support, PHISH reliably manipulates personas, as validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient personas in LLMs. Our codebase and dataset are available at: this https URL
摘要：大型语言模型 (LLM) 越来越多地部署在教育、心理健康和客户支持等领域，这些领域稳定一致的角色对于可靠性至关重要。然而，现有的研究侧重于叙事或角色扮演任务，而忽视了对抗性对话历史如何单独重塑诱导角色。黑盒角色操纵仍未被探索，引起了人们对现实交互中鲁棒性的担忧。作为回应，我们引入了角色编辑任务，该任务在黑盒、仅推理设置下通过用户端输入来对抗性地引导 LLM 特征。为此，我们提出了 PHISH（Persona Hijacking via Implicit Steering in History，通过历史中的隐式引导进行角色劫持），这是第一个暴露 LLM 安全性新漏洞的框架，该框架将语义加载的线索嵌入到用户查询中，以逐渐诱导反向角色。我们还定义了一个衡量攻击成功的指标。在 3 个基准和 8 个 LLM 上，PHISH 可预测地改变人物角色，触发相关特征的附带变化，并在多回合设置中表现出更强的效果。在心理健康、辅导和客户支持等高风险领域，PHISH 可靠地操纵角色，并通过人类和 LLM-as-Judge 评估进行验证。重要的是，PHISH 仅导致推理基准性能略有下降，总体效用基本保持不变，同时仍可实现显着的角色操纵。虽然目前的护栏提供了部分保护，但在持续攻击下它们仍然很脆弱。我们的研究结果暴露了角色中的新漏洞，并强调了 LLM 中对情境弹性角色的需求。我们的代码库和数据集位于：此 https URL

Title: DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering

Authors: Haotian Chen, Qingqing Long, Siyu Pu, Xiao Luo, Wei Ju, Meng Xiao, Yuanchun Zhou, Jianghua Zhao, Xuezhi Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16478
Pdf URL: https://arxiv.org/pdf/2601.16478
Copy Paste: [[2601.16478]] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering(https://arxiv.org/abs/2601.16478)
Keywords: llm, retrieval-augmented generation, agent
Abstract: With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. But existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, built from a 10M scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
摘要：随着科学文献的快速增长，科学问答（SciQA）对于探索和利用科学知识变得越来越重要。检索增强生成（RAG）通过整合外部来源的知识来增强 LLM，从而为科学问答提供可靠的证据。但现有的检索和重新排序方法仍然容易受到语义相似但逻辑上不相关的段落的影响，通常会降低事实可靠性并放大幻觉。为了解决这一挑战，我们提出了一种深度证据重新排序代理（DeepEra），它集成了逐步推理，能够更精确地评估超越表面语义的候选段落。为了支持系统评估，我们构建了 SciRAG-SSLI（科学 RAG - 语义相似但逻辑不相关），这是一个大型数据集，包含 10 个主题的约 30 万个 SciQA 实例，由 1000 万规模的科学语料库构建。该数据集将自然检索的上下文与系统生成的干扰项相结合，以测试逻辑稳健性和事实基础。综合评估证实，与领先的重新排序器相比，我们的方法实现了卓越的检索性能。据我们所知，这项工作是第一个全面研究和实证验证两阶段 RAG 框架中不可忽略的 SSLI 问题的工作。

Title: TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization

Authors: Peiji Li, Linyang Li, Handa Sun, Wenjin Mai, Yongkang Chen, Xiaozhe Li, Yue Shen, Yichuan Ma, Yiliu Sun, Jiaxi Cao, Zhishu He, Bo Wang, Xiaoqing Zheng, Zhaori Bi, Xipeng Qiu, Qipeng Guo, Kai Chen, Dahua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16480
Pdf URL: https://arxiv.org/pdf/2601.16480
Copy Paste: [[2601.16480]] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization(https://arxiv.org/abs/2601.16480)
Keywords: language model, agent
Abstract: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under the same simulation budget, demonstrating both strong generalization and practical utility.
摘要：大型语言模型通过工具集成在复杂任务中表现出强大的推理能力，通常被构建为马尔可夫决策过程，并使用 GRPO 等轨迹级 RL 算法进行优化。然而，一类常见的推理任务（迭代优化）提出了独特的挑战：代理在回合中与相同的底层环境状态交互，并且轨迹的价值由最佳回合级奖励而不是累积回报决定。现有的基于 GRPO 的方法无法在此类设置中执行细粒度、回合级优化，而黑盒优化方法则丢弃先验知识和推理能力。为了解决这一差距，我们提出了回合级 GRPO (TL-GRPO)，这是一种轻量级 RL 算法，可以执行回合级分组采样以实现细粒度优化。我们在模拟电路选型 (ACS) 上评估 TL-GRPO，这是一项具有挑战性的科学优化任务，需要多次模拟和领域专业知识。结果表明，TL-GRPO 在各种规格上均优于标准 GRPO 和贝叶斯优化方法。此外，我们用 TL-GRPO 训练的 30B 模型在相同的模拟预算下在 ACS 任务上实现了最先进的性能，展示了强大的泛化性和实用性。
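Note: a minimal sketch of the two ideas named above, under stated assumptions. The group-relative advantage below is the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation); applying it to a group of responses sampled at a single turn is an assumption about how turn-level group sampling might look, not the authors' algorithm. `trajectory_value` reflects the abstract's statement that a trajectory is valued by its best turn-level reward. All reward values are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def trajectory_value(turn_rewards):
    """Value of an iterative-optimization trajectory: its best turn, not the sum."""
    return max(turn_rewards)

# Hypothetical rewards for three candidate responses sampled at the same turn.
turn_group = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(turn_group)

# Hypothetical per-turn rewards along one trajectory.
best = trajectory_value([0.1, 0.9, 0.4])
```

The advantages sum to roughly zero within the group, so responses are only rewarded relative to their siblings at that turn.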

Title: Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic

Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16486
Pdf URL: https://arxiv.org/pdf/2601.16486
Copy Paste: [[2601.16486]] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic(https://arxiv.org/abs/2601.16486)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
摘要：随着大型语言模型 (LLM) 越来越多地处理复杂的推理任务，测试时间扩展对于增强能力变得至关重要。然而，在工具调用频繁的代理场景中，传统的基于生成长度的定义就失效了：工具延迟将推理时间与生成长度解耦。我们提出了 Timely Machine，将测试时间重新定义为挂钟时间，模型根据时间预算动态调整策略。我们引入了 Timely-Eval，这是一个涵盖高频工具调用、低频工具调用和时间约束推理的基准测试。通过改变工具延迟，我们发现较小的模型通过更多的交互而擅长快速反馈，而较大的模型则通过卓越的交互质量主导高延迟设置。此外，现有模型无法使推理适应时间预算。我们提出 Timely-RL 来解决这个问题。经过冷启动监督微调后，我们使用强化学习来增强时间规划。 Timely-RL 提高了时间预算意识，并持续提高了 Timely-Eval 的性能。我们希望我们的工作为代理时代的测试时间扩展提供新的视角。
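Note: a minimal sketch of budgeting by wall-clock time rather than by generation length, as the abstract describes. `run_with_budget` and its parameters are hypothetical names introduced for illustration; this is plain time budgeting and does not reproduce Timely-RL or Timely-Eval.

```python
import time

def run_with_budget(step, budget_seconds, clock=time.monotonic):
    """Call step(i) repeatedly until the wall-clock budget is used up.

    `clock` is injectable so the loop can be exercised deterministically.
    """
    start = clock()
    results = []
    while clock() - start < budget_seconds:
        # With slow tools, fewer steps fit in the same budget.
        results.append(step(len(results)))
    return results

# Usage with a hypothetical step that just squares its index.
outputs = run_with_budget(lambda i: i * i, budget_seconds=0.01)
```

Because the stopping rule depends on elapsed time, the number of completed steps shrinks as tool latency grows, which is the decoupling from generation length that the abstract points to.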

Title: MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine

Authors: Wei Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16503
Pdf URL: https://arxiv.org/pdf/2601.16503
Copy Paste: [[2601.16503]] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine(https://arxiv.org/abs/2601.16503)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, which covers various tasks in English and Chinese and builds a corpus from Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; (c) while RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench's dataset and toolkit under the CC BY 4.0 license upon acceptance, to facilitate applications from both academia and industry.
摘要：虽然检索增强生成（RAG）已在科学和临床问答系统中迅速采用，但医学领域缺乏全面的评估基准。为了弥补这一差距，我们引入了医学检索增强生成（MRAG）基准，涵盖英语和中文的各种任务，并基于 Wikipedia 和 PubMed 构建了语料库。此外，我们还开发了 MRAG 工具包，促进对不同 RAG 组件的系统探索。我们的实验表明：(a) RAG 增强了 MRAG 任务中的 LLM 可靠性；(b) RAG 系统的性能受到检索方法、模型大小和提示策略的影响；(c) 虽然 RAG 提高了实用性和推理质量，但对于长格式问题，LLM 的回答可能会变得可读性稍差。我们将在论文被接收后以 CC BY 4.0 许可证发布 MRAG-Bench 数据集和工具包，以方便学术界和工业界的应用。

Title: LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

Authors: Obed Junias, Maria Leonor Pacheco
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16504
Pdf URL: https://arxiv.org/pdf/2601.16504
Copy Paste: [[2601.16504]] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning(https://arxiv.org/abs/2601.16504)
Keywords: prompt, chain-of-thought
Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
摘要：常识推理通常涉及评估多个看似合理的解释，而不是选择单个原子答案，但大多数基准依赖于单标签评估，模糊了陈述是否共同合理、相互排斥或共同难以置信。我们引入了 LOGICAL-COMMONSENSEQA，这是一个基准，它将常识推理重新构建为使用合理性级别运算符（AND、OR、NEITHER/NOR）的原子语句对的逻辑组合。在零样本、少样本和思维链提示下评估指令调整、推理专用和微调模型时，我们发现，虽然模型在合取推理上表现合理，在选言推理上表现中等，但在基于否定的问题上表现急剧下降。 LOGICAL-COMMONSENSEQA 揭示了基本推理局限性，并为推进组合常识推理提供了一个受控框架。
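Note: a minimal sketch of composing two atomic statements with the plausibility-level operators listed above (AND, OR, NEITHER/NOR). The function name and the inclusive reading of OR are assumptions; the abstract does not spell out the exact semantics used in the benchmark.

```python
def compose(operator, a_plausible, b_plausible):
    """Truth value of a composed statement, given the plausibility of each atom."""
    if operator == "AND":
        return a_plausible and b_plausible
    if operator == "OR":
        # Assumption: inclusive OR; the abstract does not say which reading is used.
        return a_plausible or b_plausible
    if operator == "NEITHER/NOR":
        return not a_plausible and not b_plausible
    raise ValueError(f"unknown operator: {operator}")

# Full truth table over both atoms for every operator.
table = {
    (op, a, b): compose(op, a, b)
    for op in ("AND", "OR", "NEITHER/NOR")
    for a in (True, False)
    for b in (True, False)
}
```

NEITHER/NOR holds only when both atoms are implausible, which is the negation-based case the abstract reports as hardest for models.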

Title: Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ

Authors: Karl Neergaard, Le Qiu, Emmanuele Chersoni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16508
Pdf URL: https://arxiv.org/pdf/2601.16508
Copy Paste: [[2601.16508]] Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ(https://arxiv.org/abs/2601.16508)
Keywords: llm, prompt
Abstract: Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.
摘要：单一提示评估在当前的 LLM 基准测试中占主导地位，但它们无法捕捉现实世界中发生伤害的对话动态。在这项研究中，我们通过评估不同长度和脚手架条件下 BoolQ 数据集上的 LLM 表现，研究了对话长度是否会影响响应的准确性。我们在三个不同的 LLM 上的结果揭示了在单轮测试下不可见的特定于模型的漏洞。我们观察到的长度相关和支架特定的效果证明了静态评估的基本局限性，因为与部署相关的漏洞只能在多轮对话设置中发现。

Title: SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine

Authors: Hoang-Quoc Nguyen-Son, Minh-Son Dao, Koji Zettsu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16512
Pdf URL: https://arxiv.org/pdf/2601.16512
Copy Paste: [[2601.16512]] SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine(https://arxiv.org/abs/2601.16512)
Keywords: language model, llm
Abstract: With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.
摘要：随着大型语言模型 (LLM) 的出现，用户起草文本并利用 LLM 通过释义来提高文本质量已成为常见做法。然而，这个过程有时会导致原始含义的丢失或扭曲。由于 LLM 生成的文本具有类似人类的质量，传统的检测方法经常失败，特别是当文本被改写为密切模仿原始内容时。为了应对这些挑战，我们提出了一种名为SearchLLM的新颖方法，旨在通过利用搜索引擎功能来定位潜在的原始文本来源来识别LLM释义文本。通过分析输入与候选来源的重新生成版本之间的相似性，SearchLLM 可以有效地区分 LLM 释义内容。 SearchLLM 被设计为代理层，允许与现有检测器无缝集成以增强其性能。在各种 LLM 上的实验结果表明，SearchLLM 始终提高了最新检测器在检测与原始内容极为相似的 LLM 释义文本方面的准确性。此外，SearchLLM 还可以帮助检测器防止释义攻击。
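Note: a minimal sketch of the comparison step described above, where an input text is compared with regenerated versions of candidate sources. The search and regeneration steps are not implemented; `regenerated_candidates` is assumed to be supplied by them. The word-level `difflib.SequenceMatcher` ratio and the 0.8 threshold are stand-ins chosen for illustration, not the similarity measure used in the paper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Word-level similarity in [0, 1] between two texts (case-insensitive)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def looks_paraphrased(input_text, regenerated_candidates, threshold=0.8):
    """Flag the input if it closely matches any regenerated candidate source."""
    best = max(
        (similarity(input_text, c) for c in regenerated_candidates),
        default=0.0,  # no candidates found by the search step
    )
    return best >= threshold, best

# Hypothetical example texts.
text = "the quick brown fox jumps over the lazy dog"
flagged, score = looks_paraphrased(text, [text, "completely different words here"])
unflagged, low = looks_paraphrased(text, ["completely different words here"])
```

Returning the best score alongside the flag lets a downstream detector combine it with its own signal, in the spirit of the proxy-layer design the abstract mentions.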

Title: Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification

Authors: Gaurav Maheshwari, Kevin El Haddad
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.16530
Pdf URL: https://arxiv.org/pdf/2601.16530
Copy Paste: [[2601.16530]] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification(https://arxiv.org/abs/2601.16530)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) and high-capacity encoders have advanced zero- and few-shot classification, but their inference cost and latency limit practical deployment. We propose training lightweight text classifiers using dynamically generated supervision from an LLM. Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors. This closed-loop generation and evaluation process progressively improves data quality and adapts it to the downstream classifier and task. Across four widely used benchmarks, our approach consistently outperforms standard zero- and few-shot baselines. These results indicate that LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.
摘要：大型语言模型 (LLM) 和大容量编码器推动了零样本和少样本分类的发展，但它们的推理成本和延迟限制了实际部署。我们建议使用 LLM 动态生成的监督来训练轻量级文本分类器。我们的方法采用迭代、代理循环，其中 LLM 负责整理训练数据、分析模型的成功和失败，并综合有针对性的示例来解决观察到的错误。这种闭环生成和评估过程逐步提高数据质量并使其适应下游分类器和任务。在四个广泛使用的基准中，我们的方法始终优于标准的零样本和少样本基线。这些结果表明 LLM 可以有效地充当数据管理者，实现准确、高效的分类，而无需承担大型模型部署的运营成本。

Title: Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking

Authors: Mingwei Sun, Qianlong Wang, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16555
Pdf URL: https://arxiv.org/pdf/2601.16555
Copy Paste: [[2601.16555]] Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking(https://arxiv.org/abs/2601.16555)
Keywords: language model, llm
Abstract: Fact-checking aims to verify the truthfulness of a claim based on the retrieved evidence. Existing methods typically follow a decomposition paradigm, in which a claim is broken down into sub-claims that are individually verified. However, the decomposition paradigm may introduce noise to the verification process due to irrelevant entities or evidence, ultimately degrading verification accuracy. To address this problem, we propose a Retrieve-Refine-Calibrate (RRC) framework based on large language models (LLMs). Specifically, the framework first identifies the entities mentioned in the claim and retrieves evidence relevant to them. Then, it refines the retrieved evidence based on the claim to reduce irrelevant information. Finally, it calibrates the verification process by re-evaluating low-confidence predictions. Experiments on two popular fact-checking datasets (HOVER and FEVEROUS-S) demonstrate that our framework achieves superior performance compared with competitive baselines.
摘要：事实核查旨在根据检索到的证据来验证主张的真实性。现有方法通常遵循分解范例，其中将权利要求分解为单独验证的子权利要求。然而，分解范式可能会由于不相关的实体或证据而给验证过程带来噪音，最终降低验证的准确性。为了解决这个问题，我们提出了一种基于大语言模型（LLM）的检索-细化-校准（RRC）框架。具体来说，该框架首先识别声明中提到的实体并检索与它们相关的证据。然后，它根据主张提炼检索到的证据，以减少不相关的信息。最后，它通过重新评估低置信度预测来校准验证过程。在两个流行的事实检查数据集（HOVER 和 FEVEROUS-S）上进行的实验表明，与竞争基准相比，我们的框架实现了卓越的性能。

Title: Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis

Authors: Jianyu Wen, Yang Wei, Xiongxi Yu, Changxuan Xiao, Ke Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16596
Pdf URL: https://arxiv.org/pdf/2601.16596
Copy Paste: [[2601.16596]] Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis(https://arxiv.org/abs/2601.16596)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: As the development of Large Language Models (LLMs) shifts from parameter scaling to inference-time collaboration, the Mixture-of-Agents (MoA) framework has emerged as a general paradigm to harness collective intelligence by layering diverse models. While recent MoA variants have introduced dynamic routing and residual connections to improve efficiency, these methods often fail to facilitate deep semantic interaction between agents, limiting the system's ability to actively correct hallucinations and refine logic. In this paper, we introduce Attention-MoA, a novel MoA-based framework that redefines collaboration through Inter-agent Semantic Attention. Complemented by an Inter-layer Residual Module with Adaptive Early Stopping Mechanism, our architecture mitigates information degradation in deep layers while improving computational efficiency. Extensive evaluations across AlpacaEval 2.0, MT-Bench, and FLASK demonstrate that Attention-MoA significantly outperforms state-of-the-art baselines, achieving a 91.15% Length-Controlled Win Rate on AlpacaEval 2.0 and dominating in 10 out of 12 capabilities on FLASK. Notably, Attention-MoA enables an ensemble of small open-source models to outperform massive proprietary models like Claude-4.5-Sonnet and GPT-4.1, achieving an MT-Bench score of 8.83 and an AlpacaEval 2.0 LC Win Rate of 77.36%.
摘要：随着大型语言模型 (LLM) 的发展从参数缩放转向推理时协作，混合代理 (MoA) 框架已成为通过分层不同模型来利用集体智慧的通用范例。虽然最近的 MoA 变体引入了动态路由和残差连接来提高效率，但这些方法通常无法促进代理之间的深层语义交互，从而限制了系统主动纠正幻觉和细化逻辑的能力。在本文中，我们介绍了 Attention-MoA，这是一种基于 MoA 的新型框架，它通过智能体间语义注意力重新定义协作。通过具有自适应提前停止机制的层间残差模块的补充，我们的架构减轻了深层的信息退化，同时提高了计算效率。对 AlpacaEval 2.0、MT-Bench 和 FLASK 的广泛评估表明，Attention-MoA 的性能显着优于最先进的基线，在 AlpacaEval 2.0 上实现了 91.15% 的长度控制胜率，并在 FLASK 上的 12 项功能中的 10 项中占据主导地位。值得注意的是，Attention-MoA 使小型开源模型的集合能够超越 Claude-4.5-Sonnet 和 GPT-4.1 等大型专有模型，获得 8.83 的 MT-Bench 分数和 77.36% 的 AlpacaEval 2.0 LC 胜率。

Title: AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model

Authors: Xiang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16615
Pdf URL: https://arxiv.org/pdf/2601.16615
Copy Paste: [[2601.16615]] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model(https://arxiv.org/abs/2601.16615)
Keywords: language model, llm
Abstract: Recently, due to the advancement of multimodal technology, people are attempting to use visual large language models (VLLMs) in industrial production. Many deep learning models (DLMs) deployed in production environments are gradually being replaced by VLLMs. Compared with DLMs, VLLMs have some advantages in industrial applications: (1) Their strong generalization ability enables them to perform well across a wide range of tasks. (2) They are flexible and can deal with unfamiliar samples quickly through context learning. However, VLLMs also have obvious drawbacks: (1) VLLMs do not perform as well as custom-developed DLMs in specific domains. (2) The number of parameters in VLLMs is generally quite large, and their deployment requires substantial computational resources. (3) VLLMs generally operate much slower than DLMs, making real-time response challenging to achieve. To better utilize VLLMs in industrial applications, in this work we introduce AuroraEdge-V-2B, a compact, robust, and high-speed VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method to improve inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy and fast deployment: It has only 2B parameters and is highly suitable for edge deployment, offering better real-time performance. (2) Fewer visual tokens and lower cost: It significantly reduces the number of visual tokens in the decoding process, thereby halving the floating-point operations during inference and making it cheaper to use. (3) Strong performance: It scores higher on 9 benchmarks than models with the same number of parameters (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).
摘要：近年来，由于多模态技术的进步，人们正在尝试在工业生产中使用视觉大语言模型（VLLM）。许多部署在生产环境中的深度学习模型（DLM）正在逐渐被VLLM所取代。与DLM相比，VLLM在工业应用中具有一些优势：（1）其强大的泛化能力使其能够在广泛的任务中表现良好。（2）它们灵活，可以通过上下文学习快速处理不熟悉的样本。然而，VLLM 也有明显的缺点：(1) VLLM 在特定领域的性能不如定制开发的 DLM。 (2)VLLM中的参数数量通常相当大，其部署需要大量的计算资源。 (3) VLLM 的运行速度通常比 DLM 慢得多，因此难以实现实时响应。为了更好地在工业应用中利用 VLLM，我们在这项工作中引入了 AuroraEdge-V-2B，这是一款专为边缘部署而设计的紧凑、强大且高速的 VLLM。为了使模型运行得更快，我们还提出了一种压缩融合方法来提高推理效率。 AuroraEdge-V-2B具有以下显着特点：（1）部署方便、速度更快：只有2B参数，非常适合边缘部署，实时性能更好。 (2) 更少的视觉标记和更便宜：它显着减少了解码过程中的视觉标记数量，从而将推理过程中的浮点运算减少了一半，并且使用起来更便宜。（3）性能强劲：在9个基准测试中比相同参数数量的模型（例如Qwen2-VL-2B、Qwen2.5-VL-3B、InternVL-2.5-2B）获得更高的分数。

Title: PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs

Authors: Jing Xu, Jiaqi Wang, Daxin Tan, Xiao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16618
Pdf URL: https://arxiv.org/pdf/2601.16618
Copy Paste: [[2601.16618]] PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs(https://arxiv.org/abs/2601.16618)
Keywords: language model, llm
Abstract: Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model's S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.
摘要：尽管大型语言模型 (LLM) 在许多任务中表现出色，但其在语音到语音翻译 (S2ST) 中的应用尚未得到充分探索，并且受到数据稀缺的阻碍。为了弥补这一差距，我们提出 PROST-LLM（渐进式语音到语音翻译）来逐步增强 LLM 的 S2ST 能力。首先，我们使用 CVSS 语料库对 LLM 进行微调，采用专门设计的三任务学习和模态链方法来提高初始性能。然后，利用微调后的模型，我们通过自采样和反向翻译生成偏好对，无需人工评估。最后，这些偏好对用于偏好优化，进一步增强模型的 S2ST 能力。大量的实验证实了我们提出的 PROST-LLM 在提高 LLM 的 S2ST 能力方面的有效性。

Title: How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

Authors: Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16621
Pdf URL: https://arxiv.org/pdf/2601.16621
Copy Paste: [[2601.16621]] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants(https://arxiv.org/abs/2601.16621)
Keywords: language model, llm
Abstract: Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM's intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at this https URL.
摘要：基于大语言模型 (LLM) 的助手最近集成了记录用户偏好的记忆机制，从而产生更加个性化和符合用户需求的响应。然而，不相关的个性化记忆经常被引入上下文中，干扰 LLM 的意图理解。为了全面研究个性化的双重影响，我们开发了 RPEval，这是一个包含个性化意图推理数据集和多粒度评估协议的基准。 RPEval 揭示了现有 LLM 中普遍存在的非理性个性化现象，并通过错误模式分析说明了其对用户体验的负面影响。最后，我们引入了 RP-Reasoner，它将记忆利用视为一个语用推理过程，从而能够选择性地整合个性化信息。实验结果表明，我们的方法在 RPEval 上明显优于精心设计的基线，并解决了在大型商业个性化助手中观察到的 80% 的不良案例，凸显了语用推理在减轻非理性个性化方面的潜力。我们的基准测试可通过此 https URL 公开获取。

Title: MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages

Authors: Weerayut Buaphet, Thanh-Nhi Nguyen, Risa Kondo, Tomoyuki Kajiwara, Yumin Kim, Jimin Lee, Hwanhee Lee, Holy Lovenia, Peerat Limkonchotiwat, Sarana Nutanong, Rob Van der Goot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16623
Pdf URL: https://arxiv.org/pdf/2601.16623
Copy Paste: [[2601.16623]] MultiLexNorm++: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages(https://arxiv.org/abs/2601.16623)
Keywords: language model, llm
Abstract: Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, because it is rich in information yet challenging to process automatically. Since language use on social media is more informal and spontaneous, and reflects many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform data to a standard variant before processing it, which is called lexical normalization. A wide variety of benchmarks and models have been proposed for this task. The MultiLexNorm benchmark was proposed to unify these efforts, but it consists almost solely of languages from the Indo-European language family in the Latin script. Hence, we propose an extension to MultiLexNorm, which covers 5 Asian languages from different language families in 4 different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze remaining errors, revealing future directions for this task.
摘要：十多年来，社交媒体数据一直引起自然语言处理 (NLP) 从业者的兴趣，因为它信息丰富，但也面临自动处理的挑战。由于语言的使用更加非正式、自发，并且遵循许多不同的社会习惯，因此 NLP 模型的性能往往会恶化。解决此问题的一种方法是在处理数据之前将其转换为标准变体，这也称为词汇标准化。为此任务提出了各种各样的基准和模型。 MultiLexNorm 基准建议统一这些工作，但它几乎仅由拉丁字母中的印欧语系语言组成。因此，我们提出了 MultiLexNorm 的扩展，它涵盖了来自不同语系、4 种不同文字的 5 种亚洲语言。我们表明，之前最先进的模型在新语言上表现较差，并提出了一种基于大型语言模型（LLM）的新架构，该架构显示出更稳健的性能。最后，我们分析剩余的错误，揭示该任务的未来方向。

Title: Typologically Informed Parameter Aggregation

Authors: Stef Accou, Wessel Poelman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16629
Pdf URL: https://arxiv.org/pdf/2601.16629
Copy Paste: [[2601.16629]] Typologically Informed Parameter Aggregation(https://arxiv.org/abs/2601.16629)
Keywords: language model
Abstract: Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free method that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X framework, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms or matches baselines such as English-only fine-tuning or selecting the typologically closest language adapter. We see the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.
摘要：大规模多语言语言模型可以实现跨语言泛化，但在低资源语言和未见过的语言上表现不佳。虽然基于适配器的微调提供了参数高效的解决方案，但大规模训练特定于语言的适配器仍然成本高昂。我们引入了类型学参数聚合（TIPA），这是一种免训练方法，通过聚合现有的语言适配器并按类型学相似性加权来构建代理语言适配器。这些代理集成到 MAD-X 框架中，无需额外训练即可实现零样本跨语言迁移。我们在 5 种 NLP 任务和 230 多种语言上评估 TIPA。 TIPA 始终优于或持平于基线，例如仅英语微调或选择类型学上最接近的语言适配器。我们发现缺乏专用适配器的语言获得的收益最大。我们的结果表明，类型学信息聚合为特定语言模块提供了可行的替代方案，无需任何训练。
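
The aggregation idea in this abstract (building a proxy adapter from existing ones, weighted by typological similarity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how adapters are represented or how similarity is computed, so `aggregate_adapters`, the flattened parameter lists, and the similarity scores are hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Dict, List


def aggregate_adapters(
    adapters: Dict[str, List[float]],
    similarity: Dict[str, float],
) -> List[float]:
    """Similarity-weighted average of flattened adapter parameters.

    `adapters` maps a source language to its adapter weights (hypothetical
    flattened form); `similarity` maps the same languages to a non-negative
    typological similarity with the target language (assumed to be given).
    """
    langs = [lang for lang in adapters if similarity.get(lang, 0.0) > 0.0]
    if not langs:
        raise ValueError("no source adapter has positive similarity")
    total = sum(similarity[lang] for lang in langs)
    dim = len(adapters[langs[0]])
    proxy = [0.0] * dim
    for lang in langs:
        if len(adapters[lang]) != dim:
            raise ValueError(f"adapter for {lang!r} has a different size")
        weight = similarity[lang] / total  # normalise weights to sum to 1
        for i, value in enumerate(adapters[lang]):
            proxy[i] += weight * value
    return proxy


# Toy usage: two source adapters, the first three times as similar.
toy_adapters = {"de": [1.0, 0.0], "nl": [0.0, 1.0]}
toy_similarity = {"de": 0.75, "nl": 0.25}
proxy_adapter = aggregate_adapters(toy_adapters, toy_similarity)
```

No training step appears anywhere in the sketch, which is the property the abstract emphasises; a real system would operate on adapter tensors per layer rather than on a single flat list.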

Title: Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations

Authors: Lukas Hinterleitner, Loris Schoenegger, Benjamin Roth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16651
Pdf URL: https://arxiv.org/pdf/2601.16651
Copy Paste: [[2601.16651]] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations(https://arxiv.org/abs/2601.16651)
Keywords: language model, llm
Abstract: Gradient-based methods for instance-based explanation for large language models (LLMs) are hindered by the immense dimensionality of model gradients. In practice, influence estimation is restricted to a subset of model parameters to make computation tractable, but this subset is often chosen ad hoc and rarely justified by systematic evaluation. This paper investigates if it is better to create low-dimensional representations by selecting a small, architecturally informed subset of model components or by projecting the full gradients into a lower-dimensional space. Using a novel benchmark, we show that a greedily selected subset of components captures the information about training data influence needed for a retrieval task more effectively than either the full gradient or random projection. We further find that this approach is more computationally efficient than random projection, demonstrating that targeted component selection is a practical strategy for making instance-based explanations of large models more computationally feasible.
摘要：用于大型语言模型（LLM）基于实例的解释的基于梯度的方法受到模型梯度的巨大维度的阻碍。在实践中，影响估计仅限于模型参数的子集，以使计算易于处理，但该子集通常是临时选择的，很少经过系统评估来证明其合理性。本文研究了是否通过选择一个小型的、架构上已知的模型组件子集或通过将完整梯度投影到较低维空间来创建低维表示更好。使用一种新颖的基准，我们表明贪婪选择的组件子集比全梯度或随机投影更有效地捕获检索任务所需的训练数据影响的信息。我们进一步发现这种方法比随机投影的计算效率更高，这表明目标组件选择是一种实用策略，可以使大型模型的基于实例的解释在计算上更加可行。
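
The two options contrasted in this abstract, selecting a subset of gradient components versus projecting the full gradient into a lower-dimensional space, can be sketched as follows. This is a toy illustration under stated assumptions: gradients are plain lists of floats, the projection is a dense Gaussian random projection, and the selected indices are supplied by hand. The paper's greedy selection procedure and benchmark are not reproduced here, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def select_components(grad: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Keep only the gradient entries at the chosen positions."""
    return [grad[i] for i in indices]


def random_projection(grad: Sequence[float], out_dim: int, seed: int = 0) -> List[float]:
    """Project `grad` to `out_dim` dimensions with a Gaussian random matrix.

    The matrix is regenerated from `seed` on every call, so all gradients
    projected with the same seed share the same projection.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        scale * sum(rng.gauss(0.0, 1.0) * g for g in grad)
        for _ in range(out_dim)
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a simple influence-style score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy usage: compare a test gradient with a training gradient in both spaces.
train_grad = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
test_grad = [0.4, -0.8, 1.9, 0.1, 1.2, -0.6]
sel_score = cosine(select_components(train_grad, [2, 4]),
                   select_components(test_grad, [2, 4]))
proj_score = cosine(random_projection(train_grad, 3),
                    random_projection(test_grad, 3))
```

Selection only reads the chosen entries, whereas this projection touches every gradient entry once per output dimension, which is consistent with the abstract's remark that selection is the cheaper option.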

Title: PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Authors: Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L.A. Clarke, Hu Wei
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2601.16669
Pdf URL: https://arxiv.org/pdf/2601.16669
Copy Paste: [[2601.16669]] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice(https://arxiv.org/abs/2601.16669)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: this https URL.
摘要：随着大型语言模型（LLM）越来越多地应用于法律领域特定的任务，评估它们在现实环境中执行法律工作的能力变得至关重要。然而，现有的法律基准依赖于简化和高度标准化的任务，未能捕捉到真实法律实践的模糊性、复杂性和推理需求。此外，先前的评估通常采用粗略的、单维的指标，并且没有明确评估细粒度的法律推理。为了解决这些限制，我们引入了PLawBench，这是一个实用法律基准，旨在评估现实法律实践场景中的法学硕士。 PLAwBench立足于现实世界的法律工作流程，通过公共法律咨询、实际案例分析和法律文件生成三个任务类别，对法律从业者的核心流程进行建模。这些任务评估模型识别法律问题和关键事实、执行结构化法律推理以及生成法律上一致的文档的能力。 PLAwBench 包含 13 个实际法律场景的 850 个问题，每个问题都附有专家设计的评估细则，从而形成约 12,500 个细粒度评估的细则项目。我们使用基于法学硕士的评估器与人类专家判断相一致，评估了 10 个最先进的法学硕士。实验结果表明，没有一个在PLawBench上取得了优异的表现，揭示了当前法学硕士细粒度法律推理能力的实质性局限性，并凸显了法律法学硕士未来评估和发展的重要方向。数据可在以下位置获取：此 https URL。
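
Rubric-based scoring of the kind described in this abstract can be pictured with a small sketch. The abstract does not specify how rubric items are weighted or combined, so the point values, the `RubricItem` structure, and the normalised sum below are assumptions made purely for illustration. The per-item judgments would come from the LLM-based evaluator and are supplied here as plain booleans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricItem:
    description: str   # what the answer is expected to contain
    points: float      # hypothetical weight of this item
    satisfied: bool    # judgment assumed to come from an evaluator


def rubric_score(items: List[RubricItem]) -> float:
    """Fraction of available rubric points earned by one answer."""
    total = sum(item.points for item in items)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(item.points for item in items if item.satisfied)
    return earned / total


# Toy usage with invented rubric items.
example_items = [
    RubricItem("identifies the key legal issue", 2.0, True),
    RubricItem("cites the relevant provision", 1.0, False),
]
example_score = rubric_score(example_items)
```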

Title: EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

Authors: Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, Aixin Sun
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2601.16690
Pdf URL: https://arxiv.org/pdf/2601.16690
Copy Paste: [[2601.16690]] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents(https://arxiv.org/abs/2601.16690)
Keywords: prompt, agent
Abstract: We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.
摘要：我们推出了 EMemBench，这是一个通过互动游戏评估智能体长期记忆的程序化基准。 EMemBench 不是使用一组固定的问题，而是根据每个代理自己的轨迹生成问题，涵盖文本和视觉游戏环境。每个模板都根据底层游戏信号计算可验证的基本事实，并具有受控的可回答性和平衡的记忆技能覆盖范围：单跳/多跳回忆、归纳、时间、空间、逻辑和对抗性。我们以强大的 LM/VLM 作为骨干，使用上下文提示作为基线来评估记忆代理。在 15 个文本游戏和多个视觉种子中，结果远未达到饱和：归纳和空间推理是持续存在的瓶颈，尤其是在视觉设置中。持久记忆为文本游戏的开放主干带来了明显的收益，但 VLM 代理的改进不太一致，这表明基于视觉的情景记忆仍然是一个开放的挑战。人体研究进一步证实了 EMemBench 的难度。
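
The idea of generating questions from an agent's own trajectory, with ground truth computed from logged game signals, can be sketched as below. The trajectory format, the single question template, and the "unanswerable" label are hypothetical choices made for illustration; the benchmark's actual templates and signals are not described in the abstract.

```python
from typing import Dict, List, Tuple

Step = Dict[str, str]  # hypothetical log entry: room, action, item


def pickup_room_question(trajectory: List[Step], item: str) -> Tuple[str, str]:
    """Build one recall question and compute its answer from the log.

    If the item was never picked up, the question is marked unanswerable,
    which gives a simple handle on answerability.
    """
    question = f"In which room did you pick up the {item}?"
    for step in trajectory:
        if step.get("action") == "take" and step.get("item") == item:
            return question, step["room"]
    return question, "unanswerable"


# Toy trajectory recorded during one hypothetical episode.
episode = [
    {"room": "kitchen", "action": "look", "item": ""},
    {"room": "kitchen", "action": "take", "item": "key"},
    {"room": "cellar", "action": "take", "item": "lamp"},
]
question, answer = pickup_room_question(episode, "lamp")
```

Because the answer is derived from the log rather than written by hand, every generated question can be checked automatically, and a different trajectory yields a different question set.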

Title: Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition

Authors: Shanshan Liu, Noriki Nishida, Fei Cheng, Narumi Tokunaga, Rumana Ferdous Munne, Yuki Yamagata, Kouji Kozaki, Takehito Utsuro, Yuji Matsumoto
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2601.16711
Pdf URL: https://arxiv.org/pdf/2601.16711
Copy Paste: [[2601.16711]] Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition(https://arxiv.org/abs/2601.16711)
Keywords: llm
Abstract: Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at this https URL.
摘要：由于提及不可知的生物医学概念识别（MA-BCR）中人类注释的稀缺，对未见过的概念的泛化是一个主要挑战。这项工作为系统地解决这个问题做出了两个关键贡献。首先，我们提出了一个基于分层概念指数和新颖指标来衡量泛化性的评估框架。其次，我们探索基于 LLM 的自动标记数据 (ALD) 作为可扩展资源，为其生成创建特定于任务的管道。我们的研究明确表明，虽然 LLM 生成的 ALD 不能完全替代手动注释，但它是提高泛化能力的宝贵资源，成功地为模型提供了更广泛的覆盖范围和识别看不见的概念所需的结构知识。代码和数据集可从此 https URL 获取。

Title: Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation

Authors: Xinyi Wang, Grazziela Figueredo, Ruizhe Li, Xin Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16753
Pdf URL: https://arxiv.org/pdf/2601.16753
Copy Paste: [[2601.16753]] Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation(https://arxiv.org/abs/2601.16753)
Keywords: language model, llm
Abstract: Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time, which is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts for meaningful comparisons. Existing annotation methods are typically labor-intensive, relying on the use of manual lexicons and rules. Complex rules are closed-source, domain specific and hard to adapt, whereas overly simple ones tend to miss essential specialised information. Large language models (LLMs) offer a promising annotation alternative, as they are capable of capturing nuanced linguistic patterns and semantic similarities without extensive manual intervention. They also adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline to automatically annotate longitudinal information in radiology reports. The pipeline first identifies sentences containing relevant information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, Qwen2.5-32B was subsequently selected and used to annotate another 95,169 reports from the public MIMIC-CXR dataset. Our Qwen2.5-32B-annotated dataset provided us with a standardized benchmark for evaluating report generation models. Using this new benchmark, we assessed seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.
摘要：放射学报告中的纵向信息是指随着时间的推移对多次检查结果的连续跟踪，这对于监测疾病进展和指导临床决策至关重要。许多最近的自动放射学报告生成方法旨在捕获纵向信息；然而，验证他们的表现具有挑战性。没有适当的工具可以一致地标记真实文本和模型生成文本中的时间变化以进行有意义的比较。现有的注释方法通常是劳动密集型的，依赖于手动词典和规则的使用。复杂的规则是闭源的、特定领域的且难以适应，而过于简单的规则往往会错过重要的专业信息。大型语言模型（LLM）提供了一种有前景的注释替代方案，因为它们能够捕获细微的语言模式和语义相似性，而无需大量的手动干预。他们也能很好地适应新环境。因此，在本研究中，我们提出了一种基于法学硕士的管道来自动注释放射学报告中的纵向信息。该管道首先识别包含相关信息的句子，然后提取疾病的进展。我们使用 500 份手动注释的报告来评估和比较五个主流法学硕士在这两项任务上的情况。考虑到效率和性能，随后选择了 Qwen2.5-32B 并用于注释公共 MIMIC-CXR 数据集中的另外 95,169 份报告。我们的 Qwen2.5-32B 带注释的数据集为我们提供了评估报告生成模型的标准化基准。使用这个新基准，我们评估了七种最先进的报告生成模型。我们基于 LLM 的注释方法优于现有的注释解决方案，纵向信息检测和疾病跟踪的 F1 分数分别提高了 11.3% 和 5.3%。
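
For readers unfamiliar with the metric reported in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage line are invented and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 correctly flagged sentences, 2 spurious, 2 missed.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```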

Title: Do LLM hallucination detectors suffer from low-resource effect?

Authors: Debtanu Datta, Mohan Kishore Chilukuri, Yash Kumar, Saptarshi Ghosh, Muhammad Bilal Zafar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16766
Pdf URL: https://arxiv.org/pdf/2601.16766
Copy Paste: [[2601.16766]] Do LLM hallucination detectors suffer from low-resource effect?(https://arxiv.org/abs/2601.16766)
Keywords: llm, hallucination
Abstract: LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors' accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.
摘要：法学硕士虽然在广泛的任务中表现优于人类，但仍然可能会以意想不到的方式失败。我们关注两种普遍的失败模式：（i）幻觉，模型产生关于世界的错误信息；（ii）低资源效应，模型在英语等高资源语言中表现出令人印象深刻的性能，但在孟加拉语等低资源语言中性能显着下降。我们研究这些问题的交叉点并提出疑问：幻觉探测器是否会受到低资源效应的影响？我们对三个领域（事实回忆、STEM 和人文学科）的五项任务进行了实验。使用四个法学硕士和三个幻觉检测器进行的实验揭示了一个奇怪的发现：正如预期的那样，低资源语言的任务准确性大幅下降（与英语相比）。然而，检测器精度的下降通常比任务精度的下降小几倍。我们的研究结果表明，即使在资源匮乏的语言中，法学硕士的内部机制也可能编码有关其不确定性的信号。此外，检测器在语言（即使是非英语）和多语言设置中都很强大，但在没有语言内监督的跨语言设置中则不然。

Title: Persuasion Tokens for Editing Factual Knowledge in LLMs

Authors: Paul Youssef, Jörg Schlötterer, Christin Seifert
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.16781
Pdf URL: https://arxiv.org/pdf/2601.16781
Copy Paste: [[2601.16781]] Persuasion Tokens for Editing Factual Knowledge in LLMs(https://arxiv.org/abs/2601.16781)
Keywords: language model, llm
Abstract: In-context knowledge editing (IKE) is a promising technique for updating Large Language Models (LLMs) with new information. However, IKE relies on lengthy, fact-specific demonstrations that are costly to create and consume significant context window space. In this paper, we introduce persuasion tokens (P-Tokens) -- special tokens trained to replicate the effect of IKE demonstrations, enabling efficient knowledge editing without requiring fact-specific demonstrations. We evaluate P-Tokens across two editing datasets and three LLMs, demonstrating performance comparable to, and often exceeding, IKE. We further find that editing performance is robust to distractors, with small negative effects on neighboring facts, and that increasing the number of P-Tokens improves performance. Our work addresses key limitations of IKE and provides a more practical and scalable alternative for editing LLMs.
摘要：上下文知识编辑 (IKE) 是一种很有前途的技术，可以用新信息更新大型语言模型 (LLM)。然而，IKE 依赖于冗长的、具体事实的演示，这些演示的创建和消耗大量上下文窗口空间的成本很高。在本文中，我们介绍了说服令牌（P-Tokens）——经过训练可以复制 IKE 演示效果的特殊令牌，无需特定事实的演示即可实现高效的知识编辑。我们在两个编辑数据集和三个法学硕士中评估 P-Token，证明其性能可与 IKE 相媲美，甚至常常超过 IKE。我们进一步发现，编辑性能对于干扰因素来说是稳健的，对邻近事实的负面影响很小，并且增加 P 令牌的数量可以提高性能。我们的工作解决了 IKE 的主要局限性，并为编辑 LLM 提供了更实用和可扩展的替代方案。

Title: Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Authors: Gaurav Negi, MA Waskow, Paul Buitelaar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16800
Pdf URL: https://arxiv.org/pdf/2601.16800
Copy Paste: [[2601.16800]] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis(https://arxiv.org/abs/2601.16800)
Keywords: language model, llm, prompt
Abstract: Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.
摘要：对文本的细粒度意见分析提供了对表达的情感（包括所处理的实体）的详细理解。尽管这种详细程度是合理的，但在训练模型的数据集中注释意见需要大量的人力和大量的成本，特别是在不同的领域和实际应用中。我们探讨了法学硕士作为细粒度意见分析的自动注释器的可行性，解决了特定领域标记数据集的短缺问题。在这项工作中，我们使用声明性注释管道。当使用法学硕士来识别文本中的细粒度意见范围时，这种方法减少了手动提示工程的可变性。我们还为法学硕士提出了一种新颖的方法来裁定多个标签并生成最终注释。在使用不同大小的模型对方面情感三元组提取（ASTE）和方面类别观点情感（ACOS）分析任务的管道进行试验后，我们表明LLM可以充当自动注释器和裁决器，在各个基于LLM的注释器之间实现高注释器间一致性。这减少了创建这些细粒度意见注释数据集所需的成本和人力。

Title: SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation

Authors: Carolin Holtermann, Florian Schneider, Anne Lauscher
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16803
Pdf URL: https://arxiv.org/pdf/2601.16803
Copy Paste: [[2601.16803]] SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation(https://arxiv.org/abs/2601.16803)
Keywords: prompt
Abstract: Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I towards particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt's semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models' SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
摘要：文本到图像 (T2I) 模型越来越多地被全球用户采用。然而，先前的研究指出 T2I 对特定输入语言的高度敏感性 - 当面对英语以外的语言（即同一提示的不同表面形式）时，T2I 模型通常会产生文化上的刻板描述，优先考虑表面而不是提示的语义。然而，对这种行为的全面分析（我们称之为表面语义（SoS））却缺失。我们首次对 T2I 模型的 SoS 趋势进行分析。为此，我们创建了一套涵盖 171 种文化身份的提示，翻译成 14 种语言，并用它来提示 7 个 T2I 模型。为了量化跨模型、语言和文化的 SoS 趋势，我们引入了一种新颖的测量方法，并分析我们识别的趋势如何在视觉上表现出来。我们表明，除了一个模型之外，所有模型都在至少两种语言中表现出强烈的表面趋势，并且这种效应在 T2I 文本编码器的各个层中不断增强。此外，这些表面倾向经常与刻板的视觉描述相关。

Title: Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess

Authors: Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsäcker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.16823
Pdf URL: https://arxiv.org/pdf/2601.16823
Copy Paste: [[2601.16823]] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess(https://arxiv.org/abs/2601.16823)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity--ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.
摘要：大型语言模型（LLM）展现出非凡的能力，但目前尚不清楚这些模型在多大程度上反映了复杂的回忆（结晶智能）或推理能力（流体智能）。我们引入国际象棋作为一个受控的测试平台来解开这些能力。利用游戏的结构和可扩展的引擎评估，我们构建了训练语料库接近度不同的位置分类——从可通过记忆解决的常见状态到需要第一原理推理的新颖状态。我们系统地评估不同推理强度下的多个 GPT 代。我们的分析揭示了一个明显的梯度：随着流体智能需求的增加，性能持续下降。值得注意的是，在分布外的任务中，性能会崩溃到随机水平。虽然新模型有所改进，但训练分布之外的任务的进度显着减慢。此外，虽然推理增强推理提高了性能，但其每个代币的边际收益随着分布接近度的增加而降低。这些结果表明，当前的架构在系统泛化方面仍然有限，强调需要超越规模的机制来实现强大的流体智能。

Title: LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.16890
Pdf URL: https://arxiv.org/pdf/2601.16890
Copy Paste: [[2601.16890]] LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems(https://arxiv.org/abs/2601.16890)
Keywords: llm
Abstract: Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
摘要：自动事实核查 (AFC) 系统容易受到对抗性攻击，导致虚假声明逃避检测。现有的对抗框架通常依赖于注入噪音或改变语义，但现有的框架都没有利用说服技术的对抗潜力，这些技术被广泛用于虚假信息活动中以操纵受众。在本文中，我们通过使用生成法学硕士来使用说服技术重新表述主张，介绍了一种针对 AFC 的新颖的说服性对抗性攻击。考虑到分为 6 类的 15 种技术，我们使用解耦评估策略研究说服对主张验证和证据检索的影响。 FEVER 和 FEVEROUS 基准测试表明，说服攻击会大幅降低验证性能和证据检索。我们的分析将说服技术确定为一类有效的对抗性攻击，强调了对更强大的 AFC 系统的需求。

Title: Strategies for Span Labeling with Large Language Models

Authors: Danil Semin, Ondřej Dušek, Zdeněk Kasner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.16946
Pdf URL: https://arxiv.org/pdf/2601.16946
Copy Paste: [[2601.16946]] Strategies for Span Labeling with Large Language Models(https://arxiv.org/abs/2601.16946)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.
摘要：大型语言模型 (LLM) 越来越多地用于文本分析任务，例如命名实体识别或错误检测。然而，与基于编码器的模型不同，生成架构缺乏明确的机制来引用其输入的特定部分。这导致了跨度标签的各种临时提示策略，通常会产生不一致的结果。在本文中，我们将这些策略分为三个系列：标记输入文本、索引跨度的数字位置以及匹配跨度内容。为了解决内容匹配的局限性，我们引入了 LogitMatch，这是一种新的约束解码方法，可强制模型的输出与有效输入范围对齐。我们评估了四个不同任务中的所有方法。我们发现，虽然标记仍然是一个可靠的基线，但 LogitMatch 通过消除跨度匹配问题改进了基于竞争性匹配的方法，并且在某些设置中优于其他策略。
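
The general idea of constraining decoding so that the output stays aligned with a valid input span can be sketched as follows. This is not LogitMatch itself, whose details are not given in the abstract. It is a generic token-level illustration in which `score_fn` is a hypothetical stand-in for model logits and tokens are plain strings.

```python
from typing import Callable, Dict, List, Sequence, Set

END = "<end>"  # hypothetical end-of-span marker


def allowed_next(source: Sequence[str], prefix: Sequence[str]) -> Set[str]:
    """Tokens that keep `prefix` a contiguous span of `source`."""
    if not prefix:
        return set(source)  # a span may start at any input token
    n = len(prefix)
    allowed: Set[str] = set()
    matched = False
    for start in range(len(source) - n + 1):
        if list(source[start:start + n]) == list(prefix):
            matched = True
            if start + n < len(source):
                allowed.add(source[start + n])
    if not matched:
        raise ValueError("prefix is not a span of the source")
    allowed.add(END)  # any matched prefix is itself a valid span
    return allowed


def constrained_greedy(
    source: Sequence[str],
    score_fn: Callable[[List[str]], Dict[str, float]],
    max_len: int = 10,
) -> List[str]:
    """Greedy decoding restricted to tokens permitted by `allowed_next`."""
    out: List[str] = []
    for _ in range(max_len):
        scores = score_fn(out)
        candidates = allowed_next(source, out)
        best = max(candidates, key=lambda tok: scores.get(tok, float("-inf")))
        if best == END:
            break
        out.append(best)
    return out


# Toy usage: the stand-in "model" prefers a token that is not in the input.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
preferences = {"dog": 5.0, "the": 3.0, "mat": 2.0, "cat": 1.0, END: 0.5}
span = constrained_greedy(sentence, lambda prefix: preferences)
```

Masking disallowed tokens before the argmax means the decoded span always occurs in the input, which removes the post-hoc matching step that content-matching strategies otherwise need.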