2025-09-24

Title: Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLMs

Authors: Xin Hu, Yue Kang, Guanzi Yao, Tianze Kang, Mengjie Wang, Heyao Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18113
Pdf URL: https://arxiv.org/pdf/2509.18113
Copy Paste: [[2509.18113]] Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLMs(https://arxiv.org/abs/2509.18113)
Keywords: language model, llm, prompt
Abstract: This study addresses the generalization limitations commonly observed in large language models under multi-task and cross-domain settings. Unlike prior methods such as SPoT, which depends on fixed prompt templates, our study introduces a unified multi-task learning framework with a dynamic prompt scheduling mechanism. By introducing a prompt pool and a task-aware scheduling strategy, the method dynamically combines and aligns prompts for different tasks. This enhances the model's ability to capture semantic differences across tasks. During prompt fusion, the model uses task embeddings and a gating mechanism to finely control the prompt signals. This ensures alignment between prompt content and task-specific demands. At the same time, it builds flexible sharing pathways across tasks. In addition, the proposed optimization objective centers on joint multi-task learning. It incorporates an automatic learning strategy for scheduling weights, which effectively mitigates task interference and negative transfer. To evaluate the effectiveness of the method, a series of sensitivity experiments were conducted. These experiments examined the impact of prompt temperature parameters and task number variation. The results confirm the advantages of the proposed mechanism in maintaining model stability and enhancing transferability. Experimental findings show that the prompt scheduling method significantly improves performance on a range of language understanding and knowledge reasoning tasks. These results demonstrate its applicability and effectiveness in unified multi-task modeling and cross-domain adaptation.
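The gating step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything here is an assumption: the per-prompt keys, the dot-product scoring, the softmax gate, and the names `softmax` and `fuse_prompts` are all hypothetical. The sketch only shows how a temperature parameter can sharpen or flatten the mixing weights over a prompt pool.

```python
import math

def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_prompts(task_embedding, prompt_keys, prompt_pool, temperature=1.0):
    """Return (weights, fused_prompt) for one task.

    Hypothetical gating: score each pool entry by the dot product between
    the task embedding and that entry's key, then mix the prompt vectors
    with the softmax weights.
    """
    scores = [sum(t * k for t, k in zip(task_embedding, key)) for key in prompt_keys]
    weights = softmax(scores, temperature)
    dim = len(prompt_pool[0])
    fused = [sum(w * prompt[d] for w, prompt in zip(weights, prompt_pool)) for d in range(dim)]
    return weights, fused

# Toy pool of three 2-dimensional prompt vectors with matching keys (made up).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
task = [2.0, 0.0]

sharp_weights, _ = fuse_prompts(task, keys, pool, temperature=0.1)
flat_weights, _ = fuse_prompts(task, keys, pool, temperature=10.0)
```

With a low temperature the gate concentrates almost all weight on the best-matching pool entries, while a high temperature spreads it more evenly; this is the kind of sensitivity the temperature experiments mentioned in the abstract would probe.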

Title: GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

Authors: Yue Zhang, Jiaxin Zhang, Qiuyu Ren, Tahsin Saffat, Xiaoxuan Liu, Zitong Yang, Banghua Zhu, Yi Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18122
Pdf URL: https://arxiv.org/pdf/2509.18122
Copy Paste: [[2509.18122]] GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models(https://arxiv.org/abs/2509.18122)
Keywords: language model, gpt, llm
Abstract: We introduce GAUSS (General Assessment of Underlying Structured Skills in Mathematics), a benchmark that evaluates LLMs' mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models' mathematical abilities. These profiles faithfully represent their underlying mathematical intelligence. To exemplify how to use the GAUSS benchmark, we have derived the skill profile of GPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to o4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.

Title: Event Causality Identification with Synthetic Control

Authors: Haoyu Wang, Fengze Liu, Jiayao Zhang, Dan Roth, Kyle Richardson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18156
Pdf URL: https://arxiv.org/pdf/2509.18156
Copy Paste: [[2509.18156]] Event Causality Identification with Synthetic Control(https://arxiv.org/abs/2509.18156)
Keywords: gpt
Abstract: Event causality identification (ECI), a process that extracts causal relations between events from text, is crucial for distinguishing causation from correlation. Traditional approaches to ECI have primarily utilized linguistic patterns and multi-hop relational inference, risking false causality identification due to informal usage of causality and specious graphical inference. In this paper, we adopt the Rubin Causal Model to identify event causality: given two temporally ordered events, we see the first event as the treatment and the second one as the observed outcome. Determining their causality involves manipulating the treatment and estimating the resultant change in the likelihood of the outcome. Given that it is only possible to implement manipulation conceptually in the text domain, as a work-around, we try to find a twin for the protagonist from existing corpora. This twin should have life experiences identical to the protagonist's before the treatment but undergo an intervention on the treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the synthetic control method to generate such a 'twin' from relevant historical data, leveraging text embedding synthesis and inversion techniques. This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, as demonstrated on a causality benchmark, COPES-hard.

Title: ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization

Authors: Seungyoun Yi, Minsoo Khang, Sungrae Park
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18158
Pdf URL: https://arxiv.org/pdf/2509.18158
Copy Paste: [[2509.18158]] ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization(https://arxiv.org/abs/2509.18158)
Keywords: language model, llm, prompt, agent
Abstract: Automatic Prompt Optimization (APO) improves large language model (LLM) performance by refining prompts for specific tasks. However, prior APO methods typically focus only on user prompts, rely on unstructured feedback, and require large sample sizes and long iteration cycles, making them costly and brittle. We propose ZERA (Zero-init Instruction Evolving Refinement Agent), a novel framework that jointly optimizes both system and user prompts through principled, low-overhead refinement. ZERA scores prompts using eight generalizable criteria with automatically inferred weights, and revises prompts based on these structured critiques. This enables fast convergence to high-quality prompts using minimal examples and short iteration cycles. We evaluate ZERA across five LLMs and nine diverse datasets spanning reasoning, summarization, and code generation tasks. Experimental results demonstrate consistent improvements over strong baselines. Further ablation studies highlight the contribution of each component to more effective prompt construction. Our implementation including all prompts is publicly available at this https URL.

Title: Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning

Authors: Haodong Zhao, Chenyan Zhao, Yansi Li, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18163
Pdf URL: https://arxiv.org/pdf/2509.18163
Copy Paste: [[2509.18163]] Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning(https://arxiv.org/abs/2509.18163)
Keywords: language model, llm
Abstract: The capacity of Large Language Models (LLMs) to reason is fundamental to their application in complex, knowledge-intensive domains. In real-world scenarios, LLMs are often augmented with external information that can be helpful, irrelevant, or even misleading. This paper investigates the causal impact of such auxiliary information on the reasoning process of LLMs with explicit step-by-step thinking capabilities. We introduce SciAux, a new dataset derived from ScienceQA, to systematically test the robustness of the model against these types of information. Our findings reveal a critical vulnerability: the model's deliberative "thinking mode" is a double-edged sword. While helpful context improves accuracy, misleading information causes a catastrophic drop in performance, which is amplified by the thinking process. Instead of conferring robustness, thinking reinforces the degree of error when provided with misinformation. This highlights that the challenge is not merely to make models "think", but to endow them with the critical faculty to evaluate the information upon which their reasoning is based. The SciAux dataset is available at this https URL.

Title: SIRAG: Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework

Authors: Junlin Wang, Zehao Wu, Shaowei Lu, Yanlan Li, Xinghao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18167
Pdf URL: https://arxiv.org/pdf/2509.18167
Copy Paste: [[2509.18167]] SIRAG: Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework(https://arxiv.org/abs/2509.18167)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access external knowledge sources, but the effectiveness of RAG relies on the coordination between the retriever and the generator. Since these components are developed independently, their interaction is often suboptimal: the retriever may return irrelevant or redundant documents, while the generator may fail to fully leverage retrieved evidence. In this work, we propose a process-supervised multi-agent framework to bridge the gap between retriever and generator. The framework introduces two lightweight agents: a Decision Maker, which determines when to continue retrieval or stop for answer generation, and a Knowledge Selector, which filters retrieved documents to retain only the most useful evidence. To provide fine-grained supervision, we employ an LLM-as-a-Judge that evaluates each intermediate action with process-level rewards, ensuring more accurate credit assignment than relying solely on final answer correctness. We further adopt a tree-structured rollout strategy to explore diverse reasoning paths, and train both agents with Proximal Policy Optimization (PPO) in an end-to-end manner. Experiments on single-hop and multi-hop question answering benchmarks show that our approach achieves higher accuracy and more stable convergence, and produces more interpretable reasoning trajectories, than standard RAG baselines. Importantly, the proposed framework is modular and plug-and-play, requiring no modification to the retriever or generator, making it practical for real-world RAG applications.

Title: ERFC: Happy Customers with Emotion Recognition and Forecasting in Conversation in Call Centers

Authors: Aditi Debsharma, Bhushan Jagyasi, Surajit Sen, Priyanka Pandey, Devicharith Dovari, Yuvaraj V.C, Rosalin Parida, Gopali Contractor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18175
Pdf URL: https://arxiv.org/pdf/2509.18175
Copy Paste: [[2509.18175]] ERFC: Happy Customers with Emotion Recognition and Forecasting in Conversation in Call Centers(https://arxiv.org/abs/2509.18175)
Keywords: agent
Abstract: Emotion Recognition in Conversation has been seen to be widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the role of the call center agent is not confined to receiving calls but also includes providing a good customer experience by pacifying the frustration or anger of the customers. This can be achieved by maintaining neutral and positive emotion from the agent. As in any conversation, the emotion of one speaker is usually dependent on the emotion of the other speaker. Hence the positive emotion of an agent, accompanied by the right resolution, will help in enhancing customer experience. This can change an unhappy customer to a happy one. Imparting the right resolution at the right time becomes easier if the agent has insight into the emotion of future utterances. To predict the emotions of future utterances, we propose a novel architecture, Emotion Recognition and Forecasting in Conversation (ERFC). Our proposed ERFC architecture considers multiple modalities, different attributes of emotion, context, and the interdependencies of the utterances of the speakers in the conversation. Our intensive experiments on the IEMOCAP dataset have shown the feasibility of the proposed ERFC. This approach can provide tremendous business value for applications such as call centers, where customer happiness is of utmost importance.

Title: Evaluating Large Language Models for Detecting Antisemitism

Authors: Jay Patel, Hrudayangam Mehta, Jeremy Blackburn
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.18293
Pdf URL: https://arxiv.org/pdf/2509.18293
Copy Paste: [[2509.18293]] Evaluating Large Language Models for Detecting Antisemitism(https://arxiv.org/abs/2509.18293)
Keywords: language model, gpt, llm, prompt
Abstract: Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' capability to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs' utility, explainability, and reliability.

Title: Exploiting Tree Structure for Credit Assignment in RL Training of LLMs

Authors: Hieu Tran, Zonghai Yao, Hong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18314
Pdf URL: https://arxiv.org/pdf/2509.18314
Copy Paste: [[2509.18314]] Exploiting Tree Structure for Credit Assignment in RL Training of LLMs(https://arxiv.org/abs/2509.18314)
Keywords: llm, prompt
Abstract: Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce Prefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes nonparametric prefix values $V(s)$ by aggregating descendant outcomes. Built on P2T, we propose TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with branch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
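To make the prefix-value idea concrete, here is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that $V(s)$ is the mean reward of all sampled responses sharing prefix $s$, and that the token-level TD term is $V(s_t) - V(s_{t-1})$; the abstract does not state how this term is scaled or combined with the GRPO advantage, so that part is omitted. The function names `prefix_values` and `td_terms` and the toy data are hypothetical.

```python
from collections import defaultdict

def prefix_values(responses, rewards):
    """Map each token prefix (as a tuple) to the mean reward of the
    responses that start with it, i.e. a nonparametric V(s)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        for end in range(len(tokens) + 1):
            prefix = tuple(tokens[:end])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}

def td_terms(tokens, values):
    """Per-token difference V(s_t) - V(s_{t-1}) along one response."""
    return [
        values[tuple(tokens[: t + 1])] - values[tuple(tokens[:t])]
        for t in range(len(tokens))
    ]

# Toy group: three sampled responses to one prompt with 0/1 verifiable rewards.
group = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
rewards = [1.0, 0.0, 0.0]

values = prefix_values(group, rewards)
td_first = td_terms(group[0], values)
```

In this toy group every response starts with "a", so the prefix value does not change at that token and its TD term is zero, matching the abstract's remark about non-branch tokens. The terms become nonzero only where the responses diverge ("b" versus "e", then "c" versus "d").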

Title: Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

Authors: Saksham Khatwani, He Cheng, Majid Afshar, Dmitriy Dligach, Yanjun Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18316
Pdf URL: https://arxiv.org/pdf/2509.18316
Copy Paste: [[2509.18316]] Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning(https://arxiv.org/abs/2509.18316)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge-grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval augmented generation or fine-tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model of KG reasoning paths, where the model learns to judge whether a candidate path leads to the correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians' diagnostic assessment, where they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulations for knowledge path judging and eight training paradigms. Second, we test whether the path-judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open-source instruct-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, the transferability to downstream tasks remains weak. Our findings provide the first systematic assessment of "reward model style" reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.

Title: Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

Authors: Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18344
Pdf URL: https://arxiv.org/pdf/2509.18344
Copy Paste: [[2509.18344]] Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding(https://arxiv.org/abs/2509.18344)
Keywords: language model, llm
Abstract: The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade quality, and offloading maintains quality but suffers from slow inference. Speculative decoding presents a promising avenue to accelerate parameter offloading, utilizing a fast draft model to propose multiple draft tokens, which are then verified by the target LLM in parallel with a single forward pass. This method reduces the time-consuming data transfers in forward passes that involve offloaded weight transfers. Existing methods often rely on pretrained weights of the same family, but require additional training to align with custom-trained models. Moreover, approaches that involve draft model training usually yield only modest speedups. This limitation arises from insufficient alignment with the target model, preventing higher token acceptance lengths. To address these challenges and achieve greater speedups, we propose SubSpec, a plug-and-play method to accelerate parameter offloading that is lossless and training-free. SubSpec constructs a highly aligned draft model by generating low-bit quantized substitute layers from offloaded target LLM portions. Additionally, our method shares the remaining GPU-resident layers and the KV-Cache, further reducing memory overhead and enhancing alignment. SubSpec achieves a high average acceptance length, delivering 9.1x speedup for Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
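The draft-then-verify loop that speculative decoding relies on can be sketched in a few lines. This is a generic greedy-verification toy, not SubSpec itself: it does not model quantized substitute layers, offloading, or KV-cache sharing. The names `verify_draft` and `toy_target` are hypothetical, and the "target model" is a stand-in function.

```python
def verify_draft(context, draft_tokens, target_next):
    """Greedy verification of a draft.

    target_next(seq) returns the target model's greedy next token for seq.
    In a real system the target scores all draft positions in one batched
    forward pass; this toy loop calls it position by position instead.
    Returns (accepted_draft_count, tokens_to_append).
    """
    seq = list(context)
    out = []
    for tok in draft_tokens:
        expected = target_next(seq)
        if tok != expected:
            out.append(expected)  # replace the first mismatch with the target's token
            return len(out) - 1, out
        out.append(tok)
        seq.append(tok)
    out.append(target_next(seq))  # all drafts accepted: target supplies a bonus token
    return len(draft_tokens), out

# Hypothetical toy "target model": counts upward modulo 10.
def toy_target(seq):
    return (seq[-1] + 1) % 10

accepted, appended = verify_draft([1, 2], [3, 4, 9, 6], toy_target)
```

Because every appended token is one the target would have chosen greedily, the output matches plain greedy decoding with the target, which is the sense in which such schemes are lossless. The number of accepted draft tokens per verification step is the acceptance length, so a better-aligned draft model yields longer accepted runs and fewer expensive target passes.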

Title: Interactive Real-Time Speaker Diarization Correction with Human Feedback

Authors: Xinlu He, Yiwen Guan, Badrivishal Paurana, Zilin Dai, Jacob Whitehill
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18377
Pdf URL: https://arxiv.org/pdf/2509.18377
Copy Paste: [[2509.18377]] Interactive Real-Time Speaker Diarization Correction with Human Feedback(https://arxiv.org/abs/2509.18377)
Keywords: llm
Abstract: Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users' diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs. full transcript display, the limit on the number of online enrollments, and correction frequency.

Title: NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery

Authors: Minki Hong, Jangho Choi, Jihie Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18395
Pdf URL: https://arxiv.org/pdf/2509.18395
Copy Paste: [[2509.18395]] NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery(https://arxiv.org/abs/2509.18395)
Keywords: llm
Abstract: Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.

Title: Evaluating the Creativity of LLMs in Persian Literary Text Generation

Authors: Armin Tourajmehr, Mohammad Reza Modarres, Yadollah Yaghoobzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18401
Pdf URL: https://arxiv.org/pdf/2509.18401
Copy Paste: [[2509.18401]] Evaluating the Creativity of LLMs in Persian Literary Text Generation(https://arxiv.org/abs/2509.18401)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has primarily centered on English, with limited exploration of non-English literary traditions and without standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary texts spanning 20 diverse topics and assess model outputs along four creativity dimensions (originality, fluency, flexibility, and elaboration) by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models' ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.

Title: Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

Authors: Oscar J. Ponce-Ponte, David Toro-Tobon, Luis F. Figueroa, Michael Gionfriddo, Megan Branda, Victor M. Montori, Saturnino Luz, Juan P. Brito
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18439
Pdf URL: https://arxiv.org/pdf/2509.18439
Copy Paste: [[2509.18439]] Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations(https://arxiv.org/abs/2509.18439)
Keywords: language model
Abstract: Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decision aids for anticoagulation in atrial fibrillation were transcribed and segmented into 42,559 sentences. Context-response pairs and negative sampling were employed to train deep learning (DL) models and fine-tuned BERT models via the next sentence prediction (NSP) task. Each top-performing model was used to calculate four types of CA scores. A random-effects analysis by clinician, adjusting for age, sex, race, and trial arm, assessed the association between CA scores and SDM outcomes: the Decisional Conflict Scale (DCS) and the Observing Patient Involvement in Decision-Making 12 (OPTION12) scores. P-values were corrected for multiple comparisons with the Benjamini-Hochberg method. Among 157 patients (34% female, mean age 70, SD 10.8), clinicians on average spoke more words than patients (1911 vs 773). The DL model without the stylebook strategy achieved a recall@1 of 0.227, while the fine-tuned BERTbase (110M) achieved the highest recall@1 with 0.640. The AbsMax (18.36, SE 7.74, p=0.025) and Max CA (21.02, SE 7.63, p=0.012) scores generated with the DL model without stylebook were associated with OPTION12. The Max CA score generated with the fine-tuned BERTbase (110M) was associated with the DCS score (-27.61, SE 12.63, p=0.037). BERT model sizes did not have an impact on the association between CA scores and SDM. This study introduces an automated, scalable methodology to measure SDM in patient-doctor conversations through explainable CA scores, with potential to evaluate SDM strategies at scale.
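
Editor's sketch: the abstract above mentions correcting p-values with the Benjamini-Hochberg method. The following is a minimal, generic illustration of that step-up procedure, not the study's own analysis code. The function name `benjamini_hochberg` and the input values are assumptions chosen for illustration; they are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative (made-up) raw p-values, not results from the study.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

A hypothesis would then be rejected when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.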

Title: CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Authors: Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18458
Pdf URL: https://arxiv.org/pdf/2509.18458
Copy Paste: [[2509.18458]] CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density(https://arxiv.org/abs/2509.18458)
Keywords: language model, llm
Abstract: Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

Title: LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18467
Pdf URL: https://arxiv.org/pdf/2509.18467
Copy Paste: [[2509.18467]] LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling(https://arxiv.org/abs/2509.18467)
Keywords: long context
Abstract: Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the pre-training tokens used by pre-trained models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.

Title: Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

Authors: Ben Finkelshtein, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen White
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18487
Pdf URL: https://arxiv.org/pdf/2509.18487
Copy Paste: [[2509.18487]] Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference(https://arxiv.org/abs/2509.18487)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.

Title: Trace Is In Sentences: Unbiased Lightweight ChatGPT-Generated Text Detector

Authors: Mo Mu, Dianqiao Lei, Chang Li
Subjects: cs.CL, eess.SP
Abstract URL: https://arxiv.org/abs/2509.18535
Pdf URL: https://arxiv.org/pdf/2509.18535
Copy Paste: [[2509.18535]] Trace Is In Sentences: Unbiased Lightweight ChatGPT-Generated Text Detector(https://arxiv.org/abs/2509.18535)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The widespread adoption of ChatGPT has raised concerns about its misuse, highlighting the need for robust detection of AI-generated text. Current word-level detectors are vulnerable to paraphrasing or simple prompts (PSP), suffer from biases induced by ChatGPT's word-level patterns (CWP) and training data content, degrade on modified text, and often require large models or online LLM interaction. To tackle these issues, we introduce a novel task to detect both original and PSP-modified AI-generated texts, and propose a lightweight framework that classifies texts based on their internal structure, which remains invariant under word-level changes. Our approach encodes sentence embeddings from pre-trained language models and models their relationships via attention. We employ contrastive learning to mitigate embedding biases from autoregressive generation and incorporate a causal graph with counterfactual methods to isolate structural features from topic-related biases. Experiments on two curated datasets, including abstract comparisons and revised life FAQs, validate the effectiveness of our method.

Title: CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Authors: Jin Young Kim, Ji Won Yoon
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18536
Pdf URL: https://arxiv.org/pdf/2509.18536
Copy Paste: [[2509.18536]] CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs(https://arxiv.org/abs/2509.18536)
Keywords: language model, llm
Abstract: Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose Cycle-Consistency in Question Answering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. From the experimental results, it is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at this https URL.
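
Editor's sketch: the selection step described in the CCQA abstract (regenerate a question from each reasoning path and answer, score it against the original question, keep the best candidate) might look roughly like the code below. This is an illustration under stated assumptions, not the authors' implementation: `select_answer` and `jaccard` are hypothetical names, token-level Jaccard overlap stands in for whatever similarity measure the paper uses, and a canned lookup stands in for the Flan-T5 question generator.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity (a stand-in similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_answer(question, candidates, regenerate_question, similarity=jaccard):
    """Pick the candidate whose regenerated question best matches the original.

    candidates: list of (reasoning_path, answer) pairs.
    regenerate_question: callable (reasoning_path, answer) -> question string.
    """
    best_answer, best_score = None, -1.0
    for path, answer in candidates:
        regenerated = regenerate_question(path, answer)
        score = similarity(question, regenerated)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score


# Toy data; the canned questions play the role of a question-generation model.
question = "How many apples does Tom have after buying 3 more?"
candidates = [
    ("Tom starts with 2 apples and buys 3, so 2 + 3 = 5.", "5"),
    ("Tom sells 3 of his 5 pears, so 5 - 3 = 2.", "2"),
]
canned = {
    "5": "How many apples does Tom have after buying 3 more apples?",
    "2": "How many pears does Tom have after selling 3?",
}
answer, score = select_answer(question, candidates, lambda path, ans: canned[ans])
```

In practice the similarity function would more likely be an embedding-based or learned metric; the callable parameter keeps that choice open.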

Title: Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

Authors: Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18577
Pdf URL: https://arxiv.org/pdf/2509.18577
Copy Paste: [[2509.18577]] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity(https://arxiv.org/abs/2509.18577)
Keywords: language model, llm
Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers from drawbacks: substantial time costs and inherent unreliability of the model when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy to PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000x compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
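
Editor's sketch: the abstract describes estimating token priors from corpus-level term frequencies and filtering documents by the mean and standard deviation of those priors. A minimal version of that idea is shown below. Whitespace tokenization, the function names, and the threshold ranges are all assumptions made for illustration; the paper's actual tokenizer and cut-offs are not given in the abstract.

```python
from collections import Counter
from statistics import mean, pstdev


def token_priors(corpus):
    """Estimate each token's prior as its relative frequency in the corpus."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def doc_stats(doc, priors):
    """Mean and (population) standard deviation of token priors in one document."""
    vals = [priors.get(tok, 0.0) for tok in doc.split()]
    return mean(vals), pstdev(vals)


def filter_docs(corpus, mean_range, std_range):
    """Keep documents whose prior statistics fall inside the given ranges.

    mean_range and std_range are hypothetical (low, high) thresholds.
    """
    priors = token_priors(corpus)
    kept = []
    for doc in corpus:
        if not doc.split():
            continue  # skip empty documents
        m, s = doc_stats(doc, priors)
        if mean_range[0] <= m <= mean_range[1] and std_range[0] <= s <= std_range[1]:
            kept.append(doc)
    return kept


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "xq9 zz7 kk3 vv1 pp0 aa2",  # noise: every token is rare
]
kept = filter_docs(corpus, mean_range=(0.1, 1.0), std_range=(0.0, 1.0))
```

No model inference is involved: one pass builds the frequency table and a second pass scores each document, which is consistent with the speed advantage the abstract reports over perplexity-based filtering.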

Title: UniECG: Understanding and Generating ECG in One Unified Model

Authors: Jiarui Jin, Haoyu Wang, Xiang Lan, Jun Li, Gaofeng Cheng, Hongyan Li, Shenda Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18588
Pdf URL: https://arxiv.org/pdf/2509.18588
Copy Paste: [[2509.18588]] UniECG: Understanding and Generating ECG in One Unified Model(https://arxiv.org/abs/2509.18588)
Keywords: gpt
Abstract: Recent unified models such as GPT-5 have achieved encouraging progress on vision-language tasks. However, these unified models typically fail to correctly understand ECG signals and provide accurate medical diagnoses, nor can they correctly generate ECG signals. To address these limitations, we propose UniECG, the first unified model for ECG capable of concurrently performing evidence-based ECG interpretation and text-conditioned ECG generation tasks. Through a decoupled two-stage training approach, the model first learns evidence-based interpretation skills (ECG-to-Text), and then injects ECG generation capabilities (Text-to-ECG) via latent space alignment. UniECG can autonomously choose to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models. Our code and checkpoints will be made publicly available at this https URL upon acceptance.

Title: A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users

Authors: Nishant Balepur, Matthew Shu, Yoo Yeon Sung, Seraphina Goldfarb-Tarrant, Shi Feng, Fumeng Yang, Rachel Rudinger, Jordan Lee Boyd-Graber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18632
Pdf URL: https://arxiv.org/pdf/2509.18632
Copy Paste: [[2509.18632]] A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users(https://arxiv.org/abs/2509.18632)
Keywords: llm, chat, agent
Abstract: To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions, not just preferences of what looks helpful, so we discuss the plan NLP researchers can execute to solve this problem.

Title: Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

Authors: Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18658
Pdf URL: https://arxiv.org/pdf/2509.18658
Copy Paste: [[2509.18658]] Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction(https://arxiv.org/abs/2509.18658)
Keywords: language model, llm, prompt
Abstract: LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to the raw model score and the weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and judge reprompting for better judgment.
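
Editor's sketch: to make the idea of a conformal prediction interval around a judge score concrete, the code below implements plain split conformal prediction with absolute-residual scores. It is a generic textbook variant, not the paper's method: the function name, the toy calibration data, and the simple clipping to a 1-5 rating scale are assumptions, and the clipping is only a stand-in for the ordinal boundary adjustment the abstract mentions.

```python
import math


def conformal_interval(cal_pred, cal_true, new_pred, alpha=0.1, lo=1, hi=5):
    """Split conformal interval around new_pred, clipped to the rating scale.

    cal_pred / cal_true: judge scores and reference scores on a calibration set.
    alpha: target miscoverage rate (0.1 aims for roughly 90% coverage).
    """
    scores = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
    n = len(scores)
    # Finite-sample corrected quantile index.
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else float("inf")
    return max(lo, new_pred - q), min(hi, new_pred + q)


# Toy calibration data (made up for illustration).
cal_pred = [3, 4, 2, 5, 3, 4, 2, 3, 4]
cal_true = [3, 3, 2, 4, 4, 4, 1, 3, 5]
lower, upper = conformal_interval(cal_pred, cal_true, new_pred=4)
midpoint = (lower + upper) / 2
```

The midpoint computed at the end corresponds to the midpoint-based score the abstract proposes as an alternative to the raw judge score.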

Title: MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Authors: Yizhe Huang, Yang Liu, Ruiyu Zhao, Xiaolong Zhong, Xingming Yue, Ling Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18713
Pdf URL: https://arxiv.org/pdf/2509.18713
Copy Paste: [[2509.18713]] MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service(https://arxiv.org/abs/2509.18713)
Keywords: language model, llm, agent
Abstract: Large Language Model-based agents (LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass$^k$ to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long-term reliability of frozen LLM agents in customer service scenarios.
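
Editor's sketch: the abstract names Pass$^k$ as a consistency metric but does not define it. One common estimator, assumed here, treats Pass$^k$ as the probability that k trials drawn without replacement from n recorded trials of a task all succeed, which gives C(c, k) / C(n, k) for c successes. Whether MemOrb uses exactly this estimator is not stated in the abstract; the function name is hypothetical.

```python
from math import comb


def pass_hat_k(successes, trials, k):
    """Estimate the chance that k trials sampled from `trials` all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # comb(successes, k) is 0 when k > successes, so the estimate is 0 there.
    return comb(successes, k) / comb(trials, k)


# A task solved in 3 of 4 trials: the estimate drops as k grows.
estimates = [pass_hat_k(3, 4, k) for k in (1, 2, 3, 4)]
```

Averaging this quantity over tasks gives a benchmark-level score; unlike a plain success rate, it falls quickly when an agent succeeds only intermittently.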

Title: Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models

Authors: Yunan Wang, Jianxin Li, Ziwei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18742
Pdf URL: https://arxiv.org/pdf/2509.18742
Copy Paste: [[2509.18742]] Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models(https://arxiv.org/abs/2509.18742)
Keywords: language model, llm, prompt
Abstract: Dynamic Text-Attributed Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the recent-global temporal semantics: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose Dynamic Global-Recent Adaptive Semantic Processing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP's superiority, achieving up to 34% improvement in Hit@10 for the destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.

Title: False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Authors: Julie Kallini, Dan Jurafsky, Christopher Potts, Martijn Bartelds
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18750
Pdf URL: https://arxiv.org/pdf/2509.18750
Copy Paste: [[2509.18750]] False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models(https://arxiv.org/abs/2509.18750)
Keywords: language model
Abstract: Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models' hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.

Title: When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Authors: Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.18762
Pdf URL: https://arxiv.org/pdf/2509.18762
Copy Paste: [[2509.18762]] When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models(https://arxiv.org/abs/2509.18762)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data have become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

Title: AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Authors: Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18776
Pdf URL: https://arxiv.org/pdf/2509.18776
Copy Paste: [[2509.18776]] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field(https://arxiv.org/abs/2509.18776)
Keywords: language model, llm
Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from code retrieval to specialized document generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

Title: Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing

Authors: Sabri Boughorbel, Fahim Dalvi, Nadir Durrani, Majd Hawasly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18792
Pdf URL: https://arxiv.org/pdf/2509.18792
Copy Paste: [[2509.18792]] Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing(https://arxiv.org/abs/2509.18792)
Keywords: language model, llm, hallucination
Abstract: As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain why one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO-acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.

Title: MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction

Authors: Liting Zhang, Shiwan Zhao, Aobo Kong, Qicheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18813
Pdf URL: https://arxiv.org/pdf/2509.18813
Copy Paste: [[2509.18813]] MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction(https://arxiv.org/abs/2509.18813)
Keywords: language model, llm, prompt, agent
Abstract: Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs' reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at this https URL.

Title: Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?

Authors: Damian Stachura, Joanna Konieczna, Artur Nowak
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18843
Pdf URL: https://arxiv.org/pdf/2509.18843
Copy Paste: [[2509.18843]] Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?(https://arxiv.org/abs/2509.18843)
Keywords: language model, gpt, llm
Abstract: Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at this https URL.

Title: Multi-Hierarchical Feature Detection for Large Language Model Generated Text

Authors: Luyan Zhang, Xinyu Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18862
Pdf URL: https://arxiv.org/pdf/2509.18862
Copy Paste: [[2509.18862]] Multi-Hierarchical Feature Detection for Large Language Model Generated Text(https://arxiv.org/abs/2509.18862)
Keywords: language model, llm
Abstract: With the rapid advancement of large language model technology, there is growing interest in whether multi-feature approaches can significantly improve AI text detection beyond what single neural models achieve. While intuition suggests that combining semantic, syntactic, and statistical features should provide complementary signals, this assumption has not been rigorously tested with modern LLM-generated text. This paper provides a systematic empirical investigation of multi-hierarchical feature integration for AI text detection, specifically testing whether the computational overhead of combining multiple feature types is justified by performance gains. We implement MHFD (Multi-Hierarchical Feature Detection), integrating DeBERTa-based semantic analysis, syntactic parsing, and statistical probability features through adaptive fusion. Our investigation reveals important negative results: despite theoretical expectations, multi-feature integration provides minimal benefits (0.4-0.5% improvement) while incurring substantial computational costs (4.2x overhead), suggesting that modern neural language models may already capture most relevant detection signals efficiently. Experimental results on multiple benchmark datasets demonstrate that the MHFD method achieves 89.7% accuracy in in-domain detection and maintains 84.2% stable performance in cross-domain detection, showing modest improvements of 0.4-2.6% over existing methods.

Title: Diversity Boosts AI-Generated Text Detection

Authors: Advik Raj Basani, Pin-Yu Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.18880
Pdf URL: https://arxiv.org/pdf/2509.18880
Copy Paste: [[2509.18880]] Diversity Boosts AI-Generated Text Detection(https://arxiv.org/abs/2509.18880)
Keywords: llm
Abstract: Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
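The abstract describes DivEye as summarizing how unpredictability fluctuates across a text through surprisal-based features. The following is a minimal sketch of that idea, not the paper's feature set: the three statistics and the function name `surprisal_features` are assumptions chosen for illustration, and the per-token probabilities are assumed to come from some scoring language model that is not shown here.

```python
import math
from statistics import mean, pvariance


def surprisal_features(token_probs):
    """Summarise how surprisal fluctuates across a token sequence.

    token_probs: per-token probabilities p(token | context), each in (0, 1].
    The chosen statistics are illustrative assumptions, not DivEye's features.
    """
    # Surprisal of each token in nats: -log p(token | context).
    surprisals = [-math.log(p) for p in token_probs]
    # First differences measure how sharply unpredictability changes
    # from one token to the next.
    deltas = [b - a for a, b in zip(surprisals, surprisals[1:])]
    return {
        "mean": mean(surprisals),
        "variance": pvariance(surprisals),
        "mean_abs_delta": mean([abs(d) for d in deltas]) if deltas else 0.0,
    }
```

A sequence scored with uniform probabilities yields zero variance and zero mean absolute delta, while a sequence alternating between likely and unlikely tokens yields larger values for both, which is the kind of variability the abstract attributes to human-authored text.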

Title: Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass

Authors: Nicholas Popovič, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.18901
Pdf URL: https://arxiv.org/pdf/2509.18901
Copy Paste: [[2509.18901]] Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass(https://arxiv.org/abs/2509.18901)
Keywords: language model, llm
Abstract: Recent works in Natural Language Inference (NLI) and related tasks, such as automated fact-checking, employ atomic fact decomposition to enhance interpretability and robustness. For this, existing methods rely on resource-intensive generative large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we produce a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in distribution and significantly improves robustness out of distribution and in adversarial settings over models based solely on extractive rationale supervision. Our findings show that interpretability and robust generalization in NLI can be realized using encoder-only architectures and synthetic rationales. Code and data available at this https URL

Title: Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

Authors: Chiara Alzetta, Serena Auriemma, Alessandro Bondielli, Luca Dini, Chiara Fazzone, Alessio Miaschi, Martina Miliani, Marta Sartor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19033
Pdf URL: https://arxiv.org/pdf/2509.19033
Copy Paste: [[2509.19033]] Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus(https://arxiv.org/abs/2509.19033)
Keywords: language model, llm
Abstract: Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.

Title: Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Authors: Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Zhuowan Li, Spurthi Amba Hombaiah, Weize Kong, Tao Chen, Hamed Zamani, Michael Bendersky
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2509.19094
Pdf URL: https://arxiv.org/pdf/2509.19094
Copy Paste: [[2509.19094]] Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering(https://arxiv.org/abs/2509.19094)
Keywords: language model, llm
Abstract: Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.

Title: Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering

Authors: Kun Zhu, Lizi Liao, Yuxuan Gu, Lei Huang, Xiaocheng Feng, Bing Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19125
Pdf URL: https://arxiv.org/pdf/2509.19125
Copy Paste: [[2509.19125]] Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering(https://arxiv.org/abs/2509.19125)
Keywords: language model, llm, prompt
Abstract: The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.

Title: Anecdoctoring: Automated Red-Teaming Across Language and Place

Authors: Alejandro Cuevas, Saloni Dash, Bharat Kumar Nayak, Dan Vann, Madeleine I. G. Daepp
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2509.19143
Pdf URL: https://arxiv.org/pdf/2509.19143
Copy Paste: [[2509.19143]] Anecdoctoring: Automated Red-Teaming Across Language and Place(https://arxiv.org/abs/2509.19143)
Keywords: llm, prompt
Abstract: Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose "anecdoctoring", a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.

Title: Soft Tokens, Hard Truths

Authors: Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, Yann Ollivier
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19170
Pdf URL: https://arxiv.org/pdf/2509.19170
Copy Paste: [[2509.19170]] Soft Tokens, Hard Truths(https://arxiv.org/abs/2509.19170)
Keywords: llm, chain-of-thought
Abstract: The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@1 and surpasses them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show that continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
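The "soft" tokens in the abstract are mixtures of tokens with noise added on the input embedding. A minimal sketch of one way to build such an input is shown below, assuming the mixture weights come from a softmax over next-token logits and the noise is isotropic Gaussian; the abstract does not specify either choice, and the names `soft_token`, `temperature`, and `noise_std` are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embeddings, temperature=1.0, noise_std=0.0, rng=None):
    """Build a continuous input embedding from a distribution over tokens.

    embeddings: one embedding vector (list of floats) per vocabulary entry.
    The softmax weighting and Gaussian noise are assumptions for illustration.
    """
    rng = rng or random.Random(0)
    probs = softmax(logits, temperature)
    dim = len(embeddings[0])
    # Probability-weighted average of the token embeddings (the "mixture").
    mixed = [sum(p * row[j] for p, row in zip(probs, embeddings)) for j in range(dim)]
    # Additive noise on the input embedding, which supplies exploration for RL.
    return [x + rng.gauss(0.0, noise_std) for x in mixed]
```

With `noise_std=0.0` and a sharply peaked distribution, the result approaches the embedding of a single discrete token, which is consistent with the abstract's observation that models trained this way can still be run with discrete tokens at inference.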

Title: Online Process Reward Learning for Agentic Reinforcement Learning

Authors: Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19199
Pdf URL: https://arxiv.org/pdf/2509.19199
Copy Paste: [[2509.19199]] Online Process Reward Learning for Agentic Reinforcement Learning(https://arxiv.org/abs/2509.19199)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high variance from overly fine-grained signals, or failures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
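The abstract states that an implicit PRM turns trajectory preferences into step rewards through a trajectory-based DPO objective, without giving the formulas. The sketch below assumes the standard DPO-style implicit reward, a scaled log-probability ratio between the reward model and a reference model, summed over steps and compared across a preferred and a rejected trajectory. The functions `step_rewards` and `trajectory_dpo_loss` and the parameter `beta` are illustrative assumptions, not the paper's definitions.

```python
import math


def step_rewards(logp_prm, logp_ref, beta=1.0):
    """Implicit per-step rewards as scaled log-probability ratios.

    logp_prm / logp_ref: log-probabilities that the implicit PRM and a
    reference model assign to each action of one trajectory (assumed inputs).
    """
    return [beta * (a - b) for a, b in zip(logp_prm, logp_ref)]


def trajectory_dpo_loss(preferred_rewards, rejected_rewards):
    """DPO-style loss on the gap between summed step rewards.

    Equals -log(sigmoid(margin)), computed as softplus(-margin) for stability.
    """
    margin = sum(preferred_rewards) - sum(rejected_rewards)
    if margin < -30.0:
        # For very negative margins, softplus(-margin) is approximately -margin.
        return -margin
    return math.log1p(math.exp(-margin))
```

Under this assumed form, minimizing the loss raises the summed step rewards of the preferred trajectory relative to the rejected one, and the per-step terms can then serve as the step-level signal that the abstract combines with episode-level advantages.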

Title: Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Authors: Zheyuan Liu, Zhangchen Xu, Guangyao Dou, Xiangchi Yuan, Zhaoxuan Tan, Radha Poovendran, Meng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19212
Pdf URL: https://arxiv.org/pdf/2509.19212
Copy Paste: [[2509.19212]] Steering Multimodal Large Language Models Decoding for Context-Aware Safety(https://arxiv.org/abs/2509.19212)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
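The first SafeCoDe stage contrasts model outputs for the real image against outputs for a Gaussian-noised copy to highlight tokens that depend on visual context. A minimal sketch of that contrast is given below, assuming the two sets of next-token logits are already available and using a common contrastive-decoding form with a hypothetical weight `alpha`; the abstract does not give the exact adjustment rule, so both functions are illustrative only.

```python
def contrastive_logits(real_logits, noised_logits, alpha=1.0):
    """Amplify the part of each logit that depends on the real image.

    real_logits / noised_logits: next-token logits conditioned on the real
    and the Gaussian-noised image (assumed inputs). alpha is hypothetical.
    """
    return [r + alpha * (r - n) for r, n in zip(real_logits, noised_logits)]


def most_context_sensitive(real_logits, noised_logits, k=1):
    """Indices of the k tokens whose logits change most when the image is noised."""
    gaps = [abs(r - n) for r, n in zip(real_logits, noised_logits)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]
```

Tokens whose logits are unchanged by the noise keep their original score, while tokens that rely on the image content are boosted or suppressed in proportion to the gap.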

Title: Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

Authors: Tariq Abdul-Quddoos, Xishuang Dong, Lijun Qian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.19224
Pdf URL: https://arxiv.org/pdf/2509.19224
Copy Paste: [[2509.19224]] Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction(https://arxiv.org/abs/2509.19224)
Keywords: language model
Abstract: Attention-based models have become the leading approach in modeling medical language for Natural Language Processing (NLP) in clinical notes. These models outperform traditional techniques by effectively capturing contextual representations of language. In this research, a comparative analysis is carried out among pre-trained attention-based models, namely Bert Base, BioBert, two variations of Bio+Clinical Bert, RoBerta, and Clinical Longformer, on tasks related to Electronic Health Record (EHR) information extraction. The tasks from Track 1 of Harvard Medical School's 2022 National Clinical NLP Challenges (n2c2) are considered for this comparison, with the Contextualized Medication Event Dataset (CMED) given for these tasks. CMED is a dataset of unstructured EHRs and annotated notes that contain task-relevant information about the EHRs. The goal of the challenge is to develop effective solutions for extracting contextual information related to patient medication events from EHRs using data-driven methods. Each pre-trained model is fine-tuned and applied on CMED to perform medication extraction, medical event detection, and multi-dimensional medication event context classification. Processing methods are also detailed for breaking down EHRs for compatibility with the applied models. Performance analysis has been carried out using a script based on constructing medical terms from the evaluation portion of CMED with metrics including recall, precision, and F1-Score. The results demonstrate that models pre-trained on clinical data are more effective in detecting medication and medication events, but Bert Base, pre-trained on general-domain data, proved the most effective for classifying the context of events related to medications.

Title: CompLLM: Compression for Long Context Q&A

Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19228
Pdf URL: https://arxiv.org/pdf/2509.19228
Copy Paste: [[2509.19228]] CompLLM: Compression for Long Context Q&A(https://arxiv.org/abs/2509.19228)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
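Summary (translated from Chinese): Because self-attention has quadratic complexity, large language models (LLMs) face significant computational challenges when processing long contexts. Soft context compression methods, which map input text to smaller latent representations, have shown promise, but their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which makes compression itself quadratic and prevents computation from being reused across queries with overlapping contexts. This work introduces CompLLM, a soft compression technique designed for practical deployment. Rather than processing the context as a whole, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three key properties: efficiency, since the compression step scales linearly with context length; scalability, since models trained on short sequences (e.g., 1k tokens) generalize to contexts of 100k tokens; and reusability, since compressed segments can be cached and reused across different queries. Experiments show that with a 2x compression rate, at high context lengths, CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. CompLLM also achieves performance comparable to that of the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.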
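
The segment-wise design described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_compress` is a hypothetical stand-in for a learned compressor, and `SegmentCache`, `split_into_segments`, the fixed segment length, and the hash-based cache key are all assumptions made for illustration. The sketch only shows why independent segments give linear compression cost and allow reuse across queries with overlapping contexts.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Latent = Tuple[float, ...]  # stand-in for a compressed segment representation


def split_into_segments(tokens: Sequence[str], segment_len: int) -> List[Tuple[str, ...]]:
    """Cut the context into consecutive fixed-size segments (the last may be shorter)."""
    return [tuple(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def toy_compress(segment: Tuple[str, ...]) -> Latent:
    """Hypothetical 2x compressor: emits one number per pair of tokens.

    A real system would run a learned model here; this only mimics the output size.
    """
    return tuple(
        float(sum(len(tok) for tok in segment[i:i + 2]))
        for i in range(0, len(segment), 2)
    )


class SegmentCache:
    """Compress each segment independently and reuse results across queries."""

    def __init__(self, compress: Callable[[Tuple[str, ...]], Latent], segment_len: int) -> None:
        self.compress = compress
        self.segment_len = segment_len
        self.store: Dict[str, Latent] = {}
        self.misses = 0  # number of segments that actually had to be compressed

    def _key(self, segment: Tuple[str, ...]) -> str:
        # Assumed cache key: a hash of the segment's tokens.
        return hashlib.sha256("\x1f".join(segment).encode("utf-8")).hexdigest()

    def compress_context(self, tokens: Sequence[str]) -> List[Latent]:
        out: List[Latent] = []
        for segment in split_into_segments(tokens, self.segment_len):
            key = self._key(segment)
            if key not in self.store:
                self.store[key] = self.compress(segment)
                self.misses += 1
            out.append(self.store[key])
        return out


doc_a = ["tok%d" % i for i in range(8)]
doc_b = ["other%d" % i for i in range(4)]
cache = SegmentCache(toy_compress, segment_len=4)
first = cache.compress_context(doc_a + doc_b)  # three new segments are compressed
misses_after_first = cache.misses
second = cache.compress_context(doc_a)  # both segments are served from the cache
misses_after_second = cache.misses
```

Because each segment is compressed on its own, the work grows with the number of segments rather than with the square of the context length, and the second query adds no new compression work.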

Title: Reinforcement Learning on Pre-Training Data

Authors: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.19249
Pdf URL: https://arxiv.org/pdf/2509.19249
Copy Paste: [[2509.19249]] Reinforcement Learning on Pre-Training Data(https://arxiv.org/abs/2509.19249)
Keywords: language model, llm
Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
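Summary (translated from Chinese): The growing gap between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, the authors introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT lets the policy autonomously explore meaningful trajectories in order to learn from pre-training data and improve its capability through reinforcement learning (RL). Existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation to construct rewards; RLPT removes this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective that rewards the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further show favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.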
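
To make the next-segment objective in this abstract concrete, here is a minimal sketch of how raw text could be turned into (context, next segment) pairs and scored without human labels. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify how the reward is computed, so `overlap_reward` uses token-level F1 as a stand-in, and both function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def next_segment_examples(segments: List[str]) -> List[Tuple[str, str]]:
    """Pair each prefix of a document with the segment that follows it."""
    return [(" ".join(segments[:i]), segments[i]) for i in range(1, len(segments))]


def overlap_reward(prediction: str, reference: str) -> float:
    """Assumed toy reward: token-level F1 between prediction and the true next segment."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts tokens shared by both strings.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


examples = next_segment_examples(["a b", "c d", "e f"])
exact = overlap_reward("c d", "c d")
partial = overlap_reward("c x", "c d")
wrong = overlap_reward("x y", "c d")
```

The reference segment comes directly from the pre-training text, which is the sense in which the reward needs no human annotation; a policy's sampled continuation would take the place of the hand-written predictions used here.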

Title: Extracting Conceptual Spaces from LLMs Using Prototype Embeddings

Authors: Nitesh Kumar, Usashi Chatterjee, Steven Schockaert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.19269
Pdf URL: https://arxiv.org/pdf/2509.19269
Copy Paste: [[2509.19269]] Extracting Conceptual Spaces from LLMs Using Prototype Embeddings(https://arxiv.org/abs/2509.19269)
Keywords: llm
Abstract: Conceptual spaces represent entities and concepts using cognitively meaningful dimensions, typically referring to perceptual features. Such representations are widely used in cognitive science and have the potential to serve as a cornerstone for explainable AI. Unfortunately, they have proven notoriously difficult to learn, although recent LLMs appear to capture the required perceptual features to a remarkable extent. Nonetheless, practical methods for extracting the corresponding conceptual spaces are currently still lacking. While various methods exist for extracting embeddings from LLMs, extracting conceptual spaces also requires us to encode the underlying features. In this paper, we propose a strategy in which features (e.g. sweetness) are encoded by embedding the description of a corresponding prototype (e.g. a very sweet food). To improve this strategy, we fine-tune the LLM to align the prototype embeddings with the corresponding conceptual space dimensions. Our empirical analysis finds this approach to be highly effective.
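Summary (translated from Chinese): Conceptual spaces represent entities and concepts using cognitively meaningful dimensions, which typically correspond to perceptual features. Such representations are widely used in cognitive science and could serve as a cornerstone for explainable AI. Unfortunately, they have proven notoriously difficult to learn, although recent LLMs appear to capture the required perceptual features to a remarkable extent. Practical methods for extracting the corresponding conceptual spaces are nonetheless still lacking. While various methods exist for extracting embeddings from LLMs, extracting conceptual spaces also requires encoding the underlying features. The paper proposes a strategy in which a feature (e.g. sweetness) is encoded by embedding the description of a corresponding prototype (e.g. a very sweet food). To improve this strategy, the LLM is fine-tuned to align the prototype embeddings with the corresponding conceptual space dimensions. The empirical analysis finds this approach to be highly effective.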

Title: DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture

Authors: Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Nemil Shah, Abhilekh Borah, Vanshika Shah, Nishant Mishra, Sriparna Saha
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.19274
Pdf URL: https://arxiv.org/pdf/2509.19274
Copy Paste: [[2509.19274]] DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture(https://arxiv.org/abs/2509.19274)
Keywords: language model, chain-of-thought
Abstract: We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.
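Summary (translated from Chinese): The authors introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture and designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage of India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, historical heritage, and many more. A wide range of vision-language models (VLMs) is evaluated, including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, in both zero-shot and chain-of-thought settings. The results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed for advancing culturally aware, multimodally competent language technologies.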