2025-10-07

Title: Decomposing Attention To Find Context-Sensitive Neurons

Authors: Alex Gibson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.03315
Pdf URL: https://arxiv.org/pdf/2510.03315
Copy Paste: [[2510.03315]] Decomposing Attention To Find Context-Sensitive Neurons(https://arxiv.org/abs/2510.03315)
Keywords: language model, gpt
Abstract: We study transformer language models, analyzing attention heads whose attention patterns are spread out, and whose attention scores depend weakly on content. We argue that the softmax denominators of these heads are stable when the underlying token distribution is fixed. By sampling softmax denominators from a "calibration text", we can combine together the outputs of multiple such stable heads in the first layer of GPT2-Small, approximating their combined output by a linear summary of the surrounding text. This approximation enables a procedure where from the weights alone - and a single calibration text - we can uncover hundreds of first layer neurons that respond to high-level contextual properties of the surrounding text, including neurons that didn't activate on the calibration text.
摘要：我们研究变压器语言模型，分析注意力模式的注意力头部分散，其注意力评分却弱地取决于内容。我们认为，当固定基础令牌分布时，这些头部的软磁分母是稳定的。通过从“校准文本”中对SoftMax分母采样，我们可以将多个此类稳定头的输出组合在一起，在GPT2-Small的第一层中，通过周围文本的线性摘要近似其组合输出。这种近似能够单独从权重和单个校准文本进行一个过程，我们可以发现数百个对周围文本的高级上下文属性响应的第一层神经元，包括未在校准文本上激活的神经元。

Title: Graph-S3: Enhancing Agentic textual Graph Retrieval with Synthetic Stepwise Supervision

Authors: Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03323
Pdf URL: https://arxiv.org/pdf/2510.03323
Copy Paste: [[2510.03323]] Graph-S3: Enhancing Agentic textual Graph Retrieval with Synthetic Stepwise Supervision(https://arxiv.org/abs/2510.03323)
Keywords: language model, llm, agent
Abstract: A significant portion of real-world data is inherently represented as textual graphs, and integrating these graphs into large language models (LLMs) is promising to enable complex graph-based question answering. However, a key challenge in LLM-based textual graph QA systems lies in graph retrieval, i.e., how to retrieve relevant content from large graphs that is sufficiently informative while remaining compact for the LLM context. Existing retrievers suffer from poor performance since they either rely on shallow embedding similarity or employ interactive retrieving policies that demand excessive data labeling and training cost. To address these issues, we present Graph-$S^3$, an agentic textual graph reasoning framework that employs an LLM-based retriever trained with synthetic stepwise supervision. Instead of rewarding the agent based on the final answers, which may lead to sparse and unstable training signals, we propose to closely evaluate each step of the retriever based on offline-extracted golden subgraphs. Our main techniques include a data synthesis pipeline to extract the golden subgraphs for reward generation and a two-stage training scheme to learn the interactive graph exploration policy based on the synthesized rewards. Based on extensive experiments on three common datasets in comparison with seven strong baselines, our approach achieves an average improvement of 8.1\% in accuracy and 9.7\% in F$_1$ score. The advantage is even higher in more complicated multi-hop reasoning tasks. Our code will be open-sourced.
摘要：现实世界中的很大一部分在本质上表示为文本图，并且将这些图集成到大型语言模型（LLMS）中有望启用基于图形的复杂问题答案。但是，基于LLM的文本图质量图系统中的一个关键挑战在于图检索，即如何从大图中检索相关内容，这些内容足够丰富，而在LLM上下文中保持紧凑。现有的猎犬的性能不佳，因为他们要么依靠浅层嵌入相似性或采用互动检索政策，因此要求过多的数据标签和培训成本。为了解决这些问题，我们提供了图形-S^3 $，这是一种代理文本图形推理框架，该框架采用了基于LLM的回猎商，该曲目接受了合成逐步监督的培训。与其根据最终答案奖励代理商，这可能导致稀疏和不稳定的培训信号，我们建议根据离线提取的金色子图密切评估回收者的每个步骤。我们的主要技术包括一个数据合成管道，以提取奖励生成的黄金子图和两阶段培训方案，以学习基于合成的奖励的交互式图探索策略。基于与七个强基础相比，基于三个常见数据集的广泛实验，我们的方法的准确性平均提高了8.1 \％，而f $ _1 $得分的平均提高为9.7 \％。在更复杂的多跳推理任务中，优势更高。我们的代码将是开源的。

Title: Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks

Authors: Arjun Arunasalam, Madison Pickering, Z. Berkay Celik, Blase Ur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03384
Pdf URL: https://arxiv.org/pdf/2510.03384
Copy Paste: [[2510.03384]] Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks(https://arxiv.org/abs/2510.03384)
Keywords: language model, llm
Abstract: Large language models (LLMs) can underpin AI assistants that help users with everyday tasks, such as by making recommendations or performing basic computation. Despite AI assistants' promise, little is known about the implicit values these assistants display while completing subjective everyday tasks. Humans may consider values like environmentalism, charity, and diversity. To what extent do LLMs exhibit these values in completing everyday tasks? How do they compare with humans? We answer these questions by auditing how six popular LLMs complete 30 everyday tasks, comparing LLMs to each other and to 100 human crowdworkers from the US. We find LLMs often do not align with humans, nor with other LLMs, in the implicit values exhibited.
摘要：大型语言模型（LLMS）可以支持AI助手，以帮助用户完成日常任务，例如提出建议或执行基本计算。尽管AI助手的承诺，但这些助手在完成主观日常任务时所显示的隐含价值知之甚少。人类可以考虑环保，慈善和多样性等价值观。 LLM在完成日常任务时表现出这些价值观？他们如何与人类相比？我们通过审核六个受欢迎的LLM如何完成30个日常任务，将LLMS彼此和来自美国的100名人类人群工作者进行比较来回答这些问题。我们发现在所展示的隐式值中，LLM通常与人类，其他LLM不符。

Title: Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Authors: Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03458
Pdf URL: https://arxiv.org/pdf/2510.03458
Copy Paste: [[2510.03458]] Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video(https://arxiv.org/abs/2510.03458)
Keywords: language model, retrieval-augmented generation
Abstract: We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
摘要：我们提出了Omni-Embed-Nemotron，这是一种统一的多模式检索模型，旨在处理现实世界中信息需求的增长。虽然检索功能的一代（RAG）通过合并外部知识具有显着高级的语言模型，但现有的基于文本的检索器依靠清洁，结构化的输入和与现实世界中文档（例如PDFS，幻灯片或视频或视频）中的视觉和语义上丰富的内容进行斗争。 COLPALI等最新工作表明，使用基于图像的表示形式保存文档布局可以提高检索质量。在此基础上，我们受到QWEN2.5-OMNI等最新多模型模型的功能的启发，我们将检索扩展到文本和图像之外，以支持音频和视频方式。 Omni-Embed-Nemotron可以使用单个模型启用交叉模式（例如，文本 - 视频）和联合模式（例如，文本 - 视频+音频）检索。我们描述了Omni-Embed-Nemotron的架构，培训设置和评估结果，并证明了其在文本，图像和视频检索中的有效性。

Title: SEER: The Span-based Emotion Evidence Retrieval Benchmark

Authors: Aneesha Sampath, Oya Aran, Emily Mower Provost
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03490
Pdf URL: https://arxiv.org/pdf/2510.03490
Copy Paste: [[2510.03490]] SEER: The Span-based Emotion Evidence Retrieval Benchmark(https://arxiv.org/abs/2510.03490)
Keywords: language model, llm
Abstract: We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models' (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
摘要：我们介绍了先知（基于跨度的情感证据检索）基准，以测试大型语言模型（LLMS）识别表达情感的特定文本跨度的能力。与传统的情感识别任务不同，将单个标签分配给整个句子，Seer针对了情绪证据检测的未置换的任务：确定哪种精确短语传达情感。这种跨度级别的方法对于诸如同理心对话和临床支持之类的应用至关重要，这些应用需要知道表达情感的方式，而不仅仅是情感。 Seer包括两个任务：在一个句子中识别情绪证据，并在连续五个句子的短暂通过中确定证据。它包含1200个现实句子的情感和情感证据的新注释。我们评估了14个开源LLMS，并发现，尽管某些模型在单句输入上接近人类的平均绩效，但其精度在更长的段落中降低了。我们的错误分析揭示了关键故障模式，包括对情绪关键字的过度依赖和中性文本中的误报。

Title: ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection

Authors: Ali Khairallah, Arkaitz Zubiaga
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.03502
Pdf URL: https://arxiv.org/pdf/2510.03502
Copy Paste: [[2510.03502]] ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection(https://arxiv.org/abs/2510.03502)
Keywords: llm
Abstract: We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covering both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and originated from multiple human sources, which enables studying generalizability in Arabic LLM-genearted text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze and discuss benchmark experiments using our new dataset, in turn identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are however not always consistent, as we observe challenges when generalizing across genres; indeed, models struggle to generalize when they need to deal with unseen patterns in cross-genre settings, and these challenges are particularly prominent when dealing with news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research related to Arabic LLM-detection and mitigating risks of misinformation, academic dishonesty, and cyber threats.
摘要：我们介绍了ALHD，这是第一个明确设计的大规模综合阿拉伯数据集，以区分人类和LLM生成的文本。 ALHD涵盖了三种类型（新闻，社交媒体，评论），涵盖了MSA和辩证阿拉伯语，并包含由三个领先的LLM产生的400K平衡样品，源自多个人类来源，这使得在阿拉伯LLM LLM循环的文本检测中研究了可推广性。我们提供严格的预处理，丰富的注释和标准化平衡拆分，以支持可重复性。此外，我们介绍，分析和讨论使用新数据集的基准实验，进而识别差距并提出未来的研究方向。在传统分类器，基于BERT的模型和LLMS（零射击和少量照片）之间进行基准测试表明，微调的BERT模型具有竞争性能，优于基于LLM的模型。但是，结果并不总是一致的，因为我们在跨流派概括时会发现挑战。的确，模型在需要处理跨流行器环境中未见模式的情况下努力概括，而在处理新闻文章时，这些挑战尤其突出，在这些新闻文章中，LLM生成的文本类似于人类的文本，这为未来的研究提供了途径。 ALHD为与阿拉伯语LLM检测有关的研究建立了基础，并减轻了错误信息，学术不诚实和网络威胁的风险。

Title: TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning

Authors: Fangxu Yu, Hongyu Zhao, Tianyi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03519
Pdf URL: https://arxiv.org/pdf/2510.03519
Copy Paste: [[2510.03519]] TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning(https://arxiv.org/abs/2510.03519)
Keywords: language model, llm
Abstract: Time series reasoning is crucial to decision-making in diverse domains, including finance, energy usage, traffic, weather, and scientific discovery. While existing time series foundation models (TSFMs) can capture low-level dynamic patterns and provide accurate forecasting, further analysis usually requires additional background knowledge and sophisticated reasoning, which are lacking in most TSFMs but can be achieved through large language models (LLMs). On the other hand, without expensive post-training, LLMs often struggle with the numerical understanding of time series data. Although it is intuitive to integrate the two types of models, developing effective training recipes that align the two modalities for reasoning tasks is still an open challenge. To this end, we propose TS-Reasoner that aligns the latent representations of TSFMs with the textual inputs of LLMs for downstream understanding/reasoning tasks. Specifically, we propose a simple yet effective method to curate diverse, synthetic pairs of time series and textual captions for alignment training. We then develop a two-stage training recipe that applies instruction finetuning after the alignment pretraining. Unlike existing works that train an LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it during training. Extensive experiments on several benchmarks demonstrate that TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision Language Models (VLMs), and Time Series LLMs, but also achieves this with remarkable data efficiency, e.g., using less than half the training data.
摘要：时间序列推理对于各种领域的决策至关重要，包括金融，能源使用，交通，天气和科学发现。尽管现有的时间序列基础模型（TSFM）可以捕获低级动态模式并提供准确的预测，但进一步的分析通常需要其他背景知识和复杂的推理，而大多数TSFM都缺乏，但可以通过大型语言模型（LLMS）实现。另一方面，没有昂贵的培训后，LLMS通常会在对时间序列数据的数值理解中挣扎。尽管将两种类型的模型整合起来是直观的，但是开发有效的培训配方，以将两种方式与推理任务保持一致仍然是一个开放的挑战。为此，我们提出了TS-Reasoner，将TSFM的潜在表示与LLMS的文本输入保持一致，以下游理解/推理任务。具体而言，我们提出了一种简单而有效的方法，以策划多样的时间序列和文本字幕，以进行对齐训练。然后，我们开发了一个两阶段的培训配方，该配方在对齐预处理后采用指令列。与训练LLM以将时间序列作为输入的现有作品不同，我们利用预验证的TSFM并在培训期间冻结。对几个基准测试的广泛实验表明，TS-RESONER不仅要优于多种流行的LLM，视觉语言模型（VLMS）和时间序列LLM，而且还具有出色的数据效率，例如使用较少的数据效率来实现这一目标。

Title: Identifying Financial Risk Information Using RAG with a Contrastive Insight

Authors: Ali Elahi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03521
Pdf URL: https://arxiv.org/pdf/2510.03521
Copy Paste: [[2510.03521]] Identifying Financial Risk Information Using RAG with a Contrastive Insight(https://arxiv.org/abs/2510.03521)
Keywords: llm
Abstract: In specialized domains, humans often compare new problems against similar examples, highlight nuances, and draw conclusions instead of analyzing information in isolation. When applying reasoning in specialized contexts with LLMs on top of a RAG, the pipeline can capture contextually relevant information, but it is not designed to retrieve comparable cases or related problems. While RAG is effective at extracting factual information, its outputs in specialized reasoning tasks often remain generic, reflecting broad facts rather than context-specific insights. In finance, it results in generic risks that are true for the majority of companies. To address this limitation, we propose a peer-aware comparative inference layer on top of RAG. Our contrastive approach outperforms baseline RAG in text generation metrics such as ROUGE and BERTScore in comparison with human-generated equity research and risk.
摘要：在专业领域，人类通常将新问题与类似示例，突出细微差别和得出结论，而不是孤立地分析信息。在用llms在抹布之上的专用环境中应用推理时，管道可以捕获上下文相关的信息，而是旨在检索可比的情况或相关问题。尽管RAG有效地提取事实信息，但其在专业推理任务中的输出通常仍然是通用的，反映了广泛的事实，而不是特定于上下文的见解。在金融中，它会导致大多数公司的通用风险。为了解决这一限制，我们提出了在抹布顶部的同伴感知的比较推理层。与人类生成的公平研究和风险相比，我们的对比方法在文本产生指标（例如鲁日和伯特索）中的基线抹布优于基线抹布。

Title: Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs

Authors: Sayan Ghosh, Shahzaib Saqib Warraich, Dhruv Tarsadiya, Gregory Yauney, Swabha Swayamdipta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03527
Pdf URL: https://arxiv.org/pdf/2510.03527
Copy Paste: [[2510.03527]] Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs(https://arxiv.org/abs/2510.03527)
Keywords: language model, prompt
Abstract: Language models can be sampled multiple times to access the distribution underlying their responses, but existing methods cannot efficiently synthesize rich epistemic signals across different long-form responses. We introduce Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents shared information, as well as semantic variation in a set of sampled LM responses to the same prompt. We construct ConGrs using a light-weight lexical sequence alignment algorithm from bioinformatics, supplemented by the targeted usage of a secondary LM judge. Further, we design task-dependent decoding methods to synthesize a single, final response from our ConGr data structure. Our experiments show that synthesizing responses from ConGrs improves factual precision on two biography generation tasks by up to 31% over an average response and reduces reliance on LM judges by more than 80% compared to other methods. We also use ConGrs for three refusal-based tasks requiring abstention on unanswerable queries and find that abstention rate is increased by up to 56%. We apply our approach to the MATH and AIME reasoning tasks and find an improvement over self-verification and majority vote baselines by up to 6 points of accuracy. We show that ConGrs provide a flexible method for capturing variation in LM responses and using the epistemic signals provided by response variation to synthesize more effective responses.
摘要：语言模型可以多次采样以访问其响应的分布，但是现有方法无法有效地综合不同长期响应的丰富认知信号。我们引入了共识图（恭喜），这是一种基于灵活的DAG数据结构，代表共享信息，以及一组对同一提示的采样LM响应中的语义变化。我们使用来自生物信息学的轻巧词法序列对齐算法进行构建，并补充了次要LM法官的目标用法。此外，我们设计了与任务相关的解码方法，以合成我们的恭喜数据结构的单个最终响应。我们的实验表明，与其他方法相比，与其他方法相比，Congrs的合成反应提高了两项传记生成任务的事实精度高达31％，并将对LM法官的依赖降低了80％以上。我们还使用了三个基于拒绝的任务，要求对无法回答的查询进行弃权，并发现弃权率提高了56％。我们将我们的方法应用于数学和AIME推理任务，并最多可以提高自我验证和多数投票基准的改进。我们表明，恭喜提供了一种灵活的方法来捕获LM响应的变化，并使用响应变化提供的认知信号来综合更有效的响应。

Title: Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance

Authors: Ahmed Alajrami, Xingwei Tan, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03528
Pdf URL: https://arxiv.org/pdf/2510.03528
Copy Paste: [[2510.03528]] Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance(https://arxiv.org/abs/2510.03528)
Keywords: language model, llm
Abstract: Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs' resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs' performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.
摘要：指导调整在增强大语模型（LLM）的任务解决能力方面起着至关重要的作用，从而提高了其在对各种任务的有用响应方面的可用性。但是，以前的工作表明，它们对措辞的微小变化很敏感。在本文中，我们探讨了在指导数据中引入扰动是否可以增强LLMS对嘈杂说明的抵抗力。我们专注于使用扰动的指导调节（例如删除停止单词或洗牌单词）如何影响LLMS在广泛使用基准的原始版本和扰动版本上的性能（MMLU，BBH，GSM8K）。我们进一步评估模型行为的学习动力和潜在的转变。令人惊讶的是，我们的结果表明，在某些情况下，对扰动指令进行指导可以提高下游性能。这些发现突出了在指令调整中包括扰动指令的重要性，这可以使LLMS对嘈杂的用户输入更具弹性。

Title: TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering

Authors: Zhaohan Meng, Zaiqiao Meng, Siwei Liu, Iadh Ounis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03536
Pdf URL: https://arxiv.org/pdf/2510.03536
Copy Paste: [[2510.03536]] TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering(https://arxiv.org/abs/2510.03536)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) perform strongly in static and single-turn medical Question Answer (QA) benchmarks, yet such settings diverge from the iterative information gathering process required in practical clinical consultations. The MEDIQ framework addresses this mismatch by recasting the diagnosis as an interactive dialogue between a patient and an expert system, but the reliability of LLMs drops dramatically when forced to reason with dialogue logs, where clinical facts appear in sentences without clear links. To bridge this gap, we introduce TriMediQ, a triplet-structured approach that summarises patient responses into triplets and integrates them into a Knowledge Graph (KG), enabling multi-hop reasoning. We introduce a frozen triplet generator that extracts clinically relevant triplets, using prompts designed to ensure factual consistency. In parallel, a trainable projection module, comprising a graph encoder and a projector, captures relational information from the KG to enhance expert reasoning. TriMediQ operates in two steps: (i) the projection module fine-tuning with all LLM weights frozen; and (ii) using the fine-tuned module to guide multi-hop reasoning during inference. We evaluate TriMediQ on two interactive QA benchmarks, showing that it achieves up to 10.4\% improvement in accuracy over five baselines on the iMedQA dataset. These results demonstrate that converting patient responses into structured triplet-based graphs enables more accurate clinical reasoning in multi-turn settings, providing a solution for the deployment of LLM-based medical assistants.
摘要：大型语言模型（LLMS）在静态和单转的医学问题答案（QA）基准中表现出色，但是这些设置与实际临床咨询所需的迭代信息收集过程不同。 MEDIQ框架通过将诊断作为患者与专家系统之间的互动对话来解决这一不匹配，但是LLMS的可靠性在被迫使用对话日志推理时急剧下降，其中临床事实出现在句子中，没有明确的链接。为了弥合这一差距，我们介绍了Trimediq，这是一种三联结构的方法，将患者的反应汇总到三胞胎中，并将其集成到知识图（kg）中，从而实现了多跳的推理。我们使用旨在确保事实一致性的提示引入了一个冷冻的三胞胎发电机，该发电机提取临床相关的三胞胎。同时，包括图形编码器和投影仪的可训练投影模块捕获了KG的关系信息，以增强专家推理。 Trimediq分为两个步骤：（i）所有LLM权重冻结的投影模块进行微调；（ii）使用微型模块指导推理过程中的多跳推理。我们在两个交互式QA基准上评估了Trimediq，这表明它在IMEDQA数据集上的五个基线的准确度上的准确性提高了10.4 \％。这些结果表明，将患者的反应转换为基于结构化的三重态图，可以在多转弯设置中更准确的临床推理，从而为部署基于LLM的医疗助手提供了解决方案。

Title: What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

Authors: Andrew Halterman, Katherine A. Keith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03541
Pdf URL: https://arxiv.org/pdf/2510.03541
Copy Paste: [[2510.03541]] What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification(https://arxiv.org/abs/2510.03541)
Keywords: language model, llm, prompt
Abstract: Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, focus on the steps before and after LLM prompting -- conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM-era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
摘要：现在，生成的大语言模型（LLM）广泛用于计算社会科学（CSS）中的文本分类。在这项工作中，专注于LLM提示之前和之后的步骤 - 将要分类的概念概念化并在下游统计推论中使用LLM预测 - 我们认为这在LLM-ers时代的许多CSS中都被忽略了。我们声称LLM可以吸引分析师跳过概念化步骤，从而产生概念化错误，从而使下游估计偏见。使用模拟，我们表明，这种概念化引起的偏见不能仅通过提高LLM精度或事后偏置校正方法来纠正。最后，我们提醒CSS分析师，概念化仍然是LLM-时代的一阶关注点，并就如何追求低成本，无偏见，低变义的下游估计提供了具体建议。

Title: CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making

Authors: Hasibur Rahman, Hanan Salam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03553
Pdf URL: https://arxiv.org/pdf/2510.03553
Copy Paste: [[2510.03553]] CCD-Bench: Probing Cultural Conflict in Large Language Model Decision-Making(https://arxiv.org/abs/2510.03553)
Keywords: language model, llm
Abstract: Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. These dilemmas are presented using a stratified Latin square to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options for Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models recombine Future Orientation and Performance Orientation, and rarely ground choices in Assertiveness or Gender Egalitarianism (both under 3 percent). Ordering effects are negligible (Cramer's V less than 0.10), and symmetrized KL divergence shows clustering by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision making and highlights the need for alignment strategies that substantively engage diverse worldviews.
摘要：尽管大型语言模型（LLM）越来越多地涉及人际关系和社会决策，但它们在合法不同的文化价值体系之间导致明确冲突的能力仍然在很大程度上没有审查。现有基准主要针对文化知识（文化基础），价值预测（WorldValuesbench）或单轴偏见诊断（CDEVAL）；当多个文化扎根的值直接冲突时，没有人评估LLMS如何裁定。我们使用CCD Bench解决这一差距，CCD Bench是一个基准，可评估跨文化价值冲突下的LLM决策。 CCD板凳包括2,182个跨越七个领域的开放式困境，每个困境与十个匿名响应选项配对，与十个地球文化群体相对应。这些困境是使用分层拉丁正方形来减轻排序效应的。我们评估了17个非争议LLM。模型不成比例地偏爱北欧欧洲（平均20.2％）和日耳曼欧洲（12.4％），而东欧，中东和北非的选择不足（5.6％至5.8％）。尽管有87.9％的理由参考多个地球维度，但这种多元化是肤浅的：重组未来的定向和表现取向，并且在自信或性别平等主义方面很少有理由的选择（均低于3％）。订购效应可忽略不计（Cramer的V小于0.10），并且对称的KL差异显示开发人员谱系而不是地理的聚类。这些模式表明，当前的一致性管道促进了一个以共识为导向的世界观，该世界观阐明了场景要求权力谈判，基于权利的推理或性别意识分析。 CCD基础的评估将评估超出孤立的偏见检测到多元化决策，并强调需要实质性地吸引各种世界观的一致性策略。

Title: Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models

Authors: Adam Filipek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.03561
Pdf URL: https://arxiv.org/pdf/2510.03561
Copy Paste: [[2510.03561]] Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models(https://arxiv.org/abs/2510.03561)
Keywords: language model, llm
Abstract: The Transformer architecture has become the de facto standard for Large Language Models (LLMs), demonstrating remarkable capabilities in language understanding and generation. However, its application in conversational AI is fundamentally constrained by its stateless nature and the quadratic computational complexity ($O(L^2)$) with respect to sequence length $L$. Current models emulate memory by reprocessing an ever-expanding conversation history with each turn, leading to prohibitive costs and latency in long dialogues. This paper introduces the Reactive Transformer (RxT), a novel architecture designed to overcome these limitations by shifting from a data-driven to an event-driven paradigm. RxT processes each conversational turn as a discrete event in real-time, maintaining context in an integrated, fixed-size Short-Term Memory (STM) system. The architecture features a distinct operational cycle where a generator-decoder produces a response based on the current query and the previous memory state, after which a memory-encoder and a dedicated Memory Attention network asynchronously update the STM with a representation of the complete interaction. This design fundamentally alters the scaling dynamics, reducing the total user-facing cost of a conversation from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to the number of interactions $N$. By decoupling response generation from memory updates, RxT achieves low latency, enabling truly real-time, stateful, and economically viable long-form conversations. We validated our architecture with a series of proof-of-concept experiments on synthetic data, demonstrating superior performance and constant-time inference latency compared to a baseline stateless model of comparable size.
摘要：变压器体系结构已成为大型语言模型（LLM）的事实上的标准，在语言理解和产生中表现出了显着的功能。但是，其在对话式AI中的应用从根本上受到其无状态性质和相对于序列长度$ l $的无状态性质（$ o（l^2）$）的限制。当前的模型通过重新处理每回合不断扩展的对话历史来模仿内存，从而导致长期对话中的成本和延迟。本文介绍了反应性变压器（RXT），这是一种新型的架构，旨在通过从数据驱动到事件驱动的范式来克服这些局限性。 RXT将每个对话转弯作为一个实时的离散事件处理，并在集成的，固定尺寸的短期内存（STM）系统中维护上下文。该体系结构具有独特的操作周期，其中生成器描述器基于当前查询和先前的内存状态产生响应，此后，内存编码器和专用的内存注意力网络异步地以完整交互的表示来更新STM。这种设计从根本上改变了扩展动力学，从二次（$ o（n^2 \ cdot t）$）到线性（$ o（n \ cdot t）$）的对话的总体成本降低了对话的总成本。通过将响应生成从内存更新中分离出来，RXT可以实现低潜伏期，从而实现真正的实时，状态和经济上可行的长期对话。我们通过一系列关于合成数据的概念验证实验验证了我们的体系结构，与基线无状态模型相比，表明了卓越的性能和恒定的推理潜伏期。

Title: LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction

Authors: Ikram Belmadani, Parisa Nazari Hashemi, Thomas Sebbag, Benoit Favre, Guillaume Fortier, Solen Quiniou, Emmanuel Morin, Richard Dufour
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.03577
Pdf URL: https://arxiv.org/pdf/2510.03577
Copy Paste: [[2510.03577]] LLM, Reporting In! Medical Information Extraction Across Prompting, Fine-tuning and Post-correction(https://arxiv.org/abs/2510.03577)
Keywords: language model, gpt, llm, prompt
Abstract: This work presents our participation in the EvalLLM 2025 challenge on biomedical Named Entity Recognition (NER) and health event extraction in French (few-shot setting). For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing: (1) in-context learning (ICL) with GPT-4.1, incorporating automatic selection of 10 examples and a summary of the annotation guidelines into the prompt, (2) the universal NER system GLiNER, fine-tuned on a synthetic corpus and then verified by an LLM in post-processing, and (3) the open LLM LLaMA-3.1-8B-Instruct, fine-tuned on the same synthetic corpus. Event extraction uses the same ICL strategy with GPT-4.1, reusing the guideline summary in the prompt. Results show GPT-4.1 leads with a macro-F1 of 61.53% for NER and 15.02% for event extraction, highlighting the importance of well-crafted prompting to maximize performance in very low-resource scenarios.
摘要：这项工作介绍了我们参加EVALLLM 2025挑战生物医学指定实体识别（NER）和法语中的健康事件提取（很少射击）。 For NER, we propose three approaches combining large language models (LLMs), annotation guidelines, synthetic data, and post-processing: (1) in-context learning (ICL) with GPT-4.1, incorporating automatic selection of 10 examples and a summary of the annotation guidelines into the prompt, (2) the universal NER system GLiNER, fine-tuned on a synthetic corpus and then verified by an LLM in后处理，以及（3）开放的LLM Llama-3.1-8B教学，在同一合成语料库上进行了微调。事件提取使用GPT-4.1使用相同的ICL策略，在提示中重复使用指南摘要。结果表明，GPT-4.1的宏F1领导NER为61.53％，事件提取为15.02％，强调了精心制作的促使在非常低的资源场景中最大程度地提高性能的重要性。

Title: Decoupling Task-Solving and Output Formatting in LLM Generation

Authors: Haikang Deng, Po-Nien Kung, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03595
Pdf URL: https://arxiv.org/pdf/2510.03595
Copy Paste: [[2510.03595]] Decoupling Task-Solving and Output Formatting in LLM Generation(https://arxiv.org/abs/2510.03595)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly adept at following instructions containing task descriptions to solve complex problems, such as mathematical reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow more complex, models often struggle to adhere to all instructions. This difficulty is especially common when instructive prompts intertwine reasoning directives -- specifying what the model should solve -- with rigid formatting requirements that dictate how the solution must be presented. The entanglement creates competing goals for the model, suggesting that more explicit separation of these two aspects could lead to improved performance. To this front, we introduce Deco-G, a decoding framework that explicitly decouples format adherence from task solving. Deco-G handles format compliance with a separate tractable probabilistic model (TPM), while prompts LLMs with only task instructions. At each decoding step, Deco-G combines next token probabilities from the LLM with the TPM calculated format compliance likelihood to form the output probability. To make this approach both practical and scalable for modern instruction-tuned LLMs, we introduce three key innovations: instruction-aware distillation, a flexible trie-building algorithm, and HMM state pruning for computational efficiency. We demonstrate the effectiveness of Deco-G across a wide range of tasks with diverse format requirements, including mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall, our approach yields 1.0% to 6.0% relative gain over regular prompting practice with guaranteed format compliance.
摘要：大型语言模型（LLMS）越来越擅长于包含任务描述以解决复杂问题的指令，例如数学推理和自动评估（LLM-AS-A-A-Gudge）。但是，随着提示变得越来越复杂，模型通常很难遵守所有指示。当指导性提示Intertwine推理指令（指定模型应解决的问题）时，这种困难尤其普遍 - 使用严格的格式要求，决定了解决方案的呈现方式。纠缠为模型创造了竞争目标，这表明这两个方面更明确地分离可能会改善性能。在这方面，我们介绍了DECO-G，这是一个解码框架，该框架明确地将格式的依从性与任务求解。 DECO-G将格式符合单独的可牵引概率模型（TPM），而提示LLMS仅使用任务说明。在每个解码步骤中，DECO-G将LLM的隔壁概率与TPM计算的格式合规性可能性结合在一起，以形成输出概率。为了使现代指导调节的LLM既实用又可扩展的方法，我们介绍了三个关键创新：指导感知蒸馏，灵活的三位一体建设算法和HMM状态修剪以提高计算效率。我们证明了DECO-G在各种具有多种格式要求的任务中的有效性，包括数学推理，LLM-AS-A-A-a-gudge和事件参数提取。总体而言，我们的方法与常规促进练习相对增长1.0％至6.0％。

Title: Can an LLM Induce a Graph? Investigating Memory Drift and Context Length

Authors: Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.03611
Pdf URL: https://arxiv.org/pdf/2510.03611
Copy Paste: [[2510.03611]] Can an LLM Induce a Graph? Investigating Memory Drift and Context Length(https://arxiv.org/abs/2510.03611)
Keywords: language model, llm, long context
Abstract: Recently proposed evaluation benchmarks aim to characterize the effective context length and the forgetting tendencies of large language models (LLMs). However, these benchmarks often rely on simplistic 'needle in a haystack' retrieval or continuation tasks that may not accurately reflect the performance of these models in information-dense scenarios. Thus, rather than simple next token prediction, we argue for evaluating these models on more complex reasoning tasks that requires them to induce structured relational knowledge from the text - such as graphs from potentially noisy natural language content. While the input text can be viewed as generated in terms of a graph, its structure is not made explicit and connections must be induced from distributed textual cues, separated by long contexts and interspersed with irrelevant information. Our findings reveal that LLMs begin to exhibit memory drift and contextual forgetting at much shorter effective lengths when tasked with this form of relational reasoning, compared to what existing benchmarks suggest. With these findings, we offer recommendations for the optimal use of popular LLMs for complex reasoning tasks. We further show that even models specialized for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in these settings. These results point to significant limitations in the models' ability to abstract structured knowledge from unstructured input and highlight the need for architectural adaptations to improve long-range reasoning.
摘要：最近提出的评估基准旨在表征有效的上下文长度和大型语言模型（LLMS）的遗忘趋势。但是，这些基准通常依靠简单的“针刺”检索或延续任务，这些任务可能无法准确反映这些模型在信息密集的情况下的性能。因此，我们主张在更复杂的推理任务上评估这些模型，而不是简单的标记预测，这需要它们从文本中诱导结构化的关系知识，例如潜在的自然语言内容的图形。虽然可以将输入文本视为根据图生成的，但其结构并非明确，并且必须从分布式文本提示中诱导连接，并以长上下文隔开并散布着无关的信息。我们的发现表明，与现有基准相比，LLMS开始表现出记忆漂移和上下文遗忘，以这种形式的关系推理的任务较短。有了这些发现，我们为最佳使用流行的LLM用于复杂的推理任务提供了建议。我们进一步表明，即使是专门用于推理的模型，例如OpenAI O1，在这些设置中仍然容易受到早期记忆漂移的影响。这些结果表明，模型从非结构化输入中抽象结构化知识的能力中的显着局限性，并强调了对建筑适应性的需求，以改善远程推理。

Title: Towards Unsupervised Speech Recognition at the Syllable-Level

Authors: Liming Wang, Junrui Ni, Kai-Wei Chang, Saurabhchand Bhati, David Harwath, Mark Hasegawa-Johnson, James R. Glass
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03639
Pdf URL: https://arxiv.org/pdf/2510.03639
Copy Paste: [[2510.03639]] Towards Unsupervised Speech Recognition at the Syllable-Level(https://arxiv.org/abs/2510.03639)
Keywords: language model
Abstract: Training speech recognizers with unpaired speech and text -- known as unsupervised speech recognition (UASR) -- is a crucial step toward extending ASR to low-resource languages in the long-tail distribution and enabling multimodal learning from non-parallel data. However, existing approaches based on phones often rely on costly resources such as grapheme-to-phoneme converters (G2Ps) and struggle to generalize to languages with ambiguous phoneme boundaries due to training instability. In this paper, we address both challenges by introducing a syllable-level UASR framework based on masked language modeling, which avoids the need for G2P and the instability of GAN-based methods. Our approach achieves up to a 40\% relative reduction in character error rate (CER) on LibriSpeech and generalizes effectively to Mandarin, a language that has remained particularly difficult for prior methods. Code will be released upon acceptance.
摘要：培训语音识别者具有未配对的语音和文本（称为无监督的语音识别（UASR））是朝着将ASR扩展到长尾分布中的低资源语言的关键步骤，并从非平行数据中启用了多模式学习。但是，基于手机的现有方法通常依赖于昂贵的资源，例如素卡转换器（G2PS），并且由于训练不稳定而难以推广具有模棱两可的音素界限的语言。在本文中，我们通过基于蒙版语言建模引入音节级UASR框架来解决这两个挑战，从而避免了对G2P的需求和基于GAN的方法的不稳定性。我们的方法在LibrisPeech上的字符错误率（CER）相对降低了40 \％，并有效地将其推广到普通话中，普通话对于先前的方法仍然特别困难。代码将在接受后发布。

Title: UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG

Authors: Xiangyu Peng, Cab Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03663
Pdf URL: https://arxiv.org/pdf/2510.03663
Copy Paste: [[2510.03663]] UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG(https://arxiv.org/abs/2510.03663)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models (LLMs) and agents to real-world knowledge bases, yet current evaluations are fragmented, focusing on either text or images in isolation or on simplified multimodal setups that fail to capture document-centric multimodal use cases. In this paper, we introduce UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages across eight domains. Our pipeline extracts and links evidence from text, tables, and figures, then generates 1,600 multimodal QA pairs spanning factual retrieval, comparison, summarization, and logical reasoning queries. To ensure reliability, 20% of QA pairs are validated by multiple annotators and expert adjudication. UniDoc-Bench supports apples-to-apples comparison across four paradigms: (1) text-only, (2) image-only, (3) multimodal text-image fusion, and (4) multimodal joint retrieval -- under a unified protocol with standardized candidate pools, prompts, and evaluation metrics. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval, indicating that neither text nor images alone are sufficient and that current multimodal embeddings remain inadequate. Beyond benchmarking, our analysis reveals when and how visual context complements textual evidence, uncovers systematic failure modes, and offers actionable guidance for developing more robust MM-RAG pipelines.
摘要：多模式检索功能增强的生成（MM-rag）是将大型语言模型（LLM）和代理应用于现实世界知识库的关键方法，但是当前的评估却是零散的，专注于隔离或简化的多模态设置，无法捕获以文档为中心的多模式用例。在本文中，我们介绍了Unidoc-Bench，这是第一个大规模，现实的基准测试，用于由跨八个域中的70k Real-World PDF页面构建的MM-rag。我们的管道从文本，表格和数字中提取并链接证据，然后生成1,600个多模式质量质量质量质量标准对，涵盖了事实检索，比较，摘要和逻辑推理查询。为了确保可靠性，有20％的QA对通过多个注释者和专家裁决来验证。 Unidoc-Bench支持四个范式的苹果对苹果的比较：（1）仅文本，（2）仅图像，（3）多模式文本图像融合，（4）在具有标准化的候选池，提示池，提示和评估的统一协议下，多模式关节检索。我们的实验表明，多模式的文本图像融合抹布系统始终超过单峰和共同多模式嵌入的检索，这表明单独的文本和图像都不足够，并且当前的多模式嵌入仍然不足。除了基准测试之外，我们的分析揭示了视觉上下文何时以及如何补充文本证据，发现系统的故障模式，并为开发更强大的MM-rag管道提供了可行的指导。

Title: Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Authors: Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Zain, Momina Hafeez, Grigori Sidorov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03683
Pdf URL: https://arxiv.org/pdf/2510.03683
Copy Paste: [[2510.03683]] Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text(https://arxiv.org/abs/2510.03683)
Keywords: language model, llm
Abstract: The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA based fine tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code mixing features. Our focus is on classification performance using English translated low resource inputs. We fine tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs non offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine tuning high performing models for low resource environments such as code mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.
摘要：在使用代码混合的语言中使用贬义词，例如罗马乌尔都语，由于语法，不一致的拼写和稀缺的标记数据，对自然语言处理系统提出了挑战。在这项工作中，我们提出了一个基于Qlora的微调框架，以改善罗马乌尔都语 - 英语文本中的进攻性语言检测。我们使用Google Translate将Roman Urdu-English代码混合数据集翻译成英文，以利用英语LLMS，同时承认这种翻译会减少与代码混合功能的直接互动。我们的重点是使用英语翻译低资源输入的分类性能。我们对几种变压器和大型语言模型进行了微调，包括Meta Llama 3 8b，Mistral 7b V0.1，Llama 2 7b，Modernbert和Roberta，以及Qlora，以进行记忆有效的适应。在手动注释的罗马乌尔都语数据集上对模型进行了培训和评估，以实现进攻性与非进攻内容。在所有测试的模型中，Meta Llama 3 8B达到了最高的F1分数91.45，其次是Mistral 7b的89.66，超过了传统的变压器基线。这些结果证明了Qlora在微调高性能模型中对低资源环境（例如代码混合进攻性语言检测）的功效，并确认LLMS在此任务中的潜力。这项工作为罗马乌尔都语节制提供了一种可扩展的方法，并为基于LLMS的未来多语言进攻系统铺平了道路。

Title: MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction

Authors: Yue Huang, Yanyuan Chen, Dexuan Xu, Weihua Yue, Huamin Zhang, Meikang Qiu, Yu Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03687
Pdf URL: https://arxiv.org/pdf/2510.03687
Copy Paste: [[2510.03687]] MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction(https://arxiv.org/abs/2510.03687)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Medical problem solving demands expert knowledge and intricate reasoning. Recent studies of large language models (LLMs) attempt to ease this complexity by introducing external knowledge verification through retrieval-augmented generation or by training on reasoning datasets. However, these approaches suffer from drawbacks such as retrieval overhead and high annotation costs, and they heavily rely on substituted external assistants to reach limited performance in medical field. In this paper, we introduce MedReflect, a generalizable framework designed to inspire LLMs with a physician-like reflective thinking mode. MedReflect generates a single-pass reflection chain that includes initial hypothesis generation, self-questioning, self-answering and decision refinement. This self-verified and self-reflective nature releases large language model's latent capability in medical problem-solving without external retrieval or heavy annotation. We demonstrate that MedReflect enables cost-efficient medical dataset construction: with merely 2,000 randomly sampled training examples and a light fine-tuning, this approach achieves notable absolute accuracy improvements across a series of medical benchmarks while cutting annotation requirements. Our results provide evidence that LLMs can learn to solve specialized medical problems via self-reflection and self-improve, reducing reliance on external supervision and extensive task-specific fine-tuning data.
摘要：解决医疗问题需要专家知识和复杂的推理。大型语言模型（LLMS）的最新研究试图通过通过检索发行的生成或推理数据集培训引入外部知识验证来缓解这种复杂性。但是，这些方法遭受了诸如检索开销和高注释成本之类的缺点，并且它们在很大程度上依靠取代的外部助手来达到医疗领域的有限绩效。在本文中，我们介绍了MedFrect，这是一个可概括的框架，旨在通过医师式的反射思维模式激发LLM。 Medflect会产生一个单次反射链，其中包括初始假设的产生，自我询问，自我纠缠和决策的改进。这种自我验证和自我反思的性质释放了大语言模型在解决医疗问题中的潜在能力，而没有外部检索或重大注释。我们证明，MedFrect可以实现成本效益的医学数据集构建：仅有2,000个随机抽样的训练示例和轻微的微调，这种方法在一系列医疗基准的一系列医疗基准中，在减少注释要求的同时，可实现明显的绝对准确性提高。我们的结果提供了证据表明，LLM可以通过自我反省和自我侵扰来学习解决专业的医疗问题，从而减少了对外部监督和广泛特定于任务的微调数据的依赖。

Title: TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation

Authors: Ramtin Kakavand, Ebrahim Ansari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03748
Pdf URL: https://arxiv.org/pdf/2510.03748
Copy Paste: [[2510.03748]] TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation(https://arxiv.org/abs/2510.03748)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have consistently demonstrated strong performance in machine translation, especially when guided by high-quality prompts. Few-shot prompting is an effective technique to improve translation quality; however, most existing example selection methods focus solely on query-to-example similarity and do not account for the quality of the examples. In this work, we propose TreePrompt, a novel example selection approach that learns LLM preferences to identify high-quality, contextually relevant examples within a tree-structured framework. To further explore the balance between similarity and quality, we combine TreePrompt with K-Nearest Neighbors (K-NN) and Adaptive Few-Shot Prompting (AFSP). Evaluations on two language pairs - English-Persian (MIZAN) and English-German (WMT19) - show that integrating TreePrompt with AFSP or Random selection leads to improved translation performance.
摘要：大型语言模型（LLM）在机器翻译中始终显示出强大的性能，尤其是在受高质量提示的指导下。很少有弹性提示是提高翻译质量的有效技术。但是，大多数现有的示例选择方法仅着眼于示例性相似性，并且不考虑示例的质量。在这项工作中，我们提出了Treeprompt，这是一种新颖的示例选择方法，它可以学习LLM偏好，以在树结构框架内识别高质量的，上下文相关的示例。为了进一步探索相似性和质量之间的平衡，我们将treeprompt与k-nearthent邻居（k-nn）和自适应少数发动提示（AFSP）相结合。两种语言对的评估 - 英语 - 塞亚人（Mizan）和英语 - 德国人（WMT19） - 表明将TREEPROMPT与AFSP或随机选择集成会导致改进的翻译性能。

Title: Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs

Authors: Deshan Sumanathilaka, Nicholas Micallef, Julian Hough
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03762
Pdf URL: https://arxiv.org/pdf/2510.03762
Copy Paste: [[2510.03762]] Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs(https://arxiv.org/abs/2510.03762)
Keywords: language model, gpt, llm, prompt
Abstract: Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in multilingual languages, but this issue does not appear in English. To assess model behavior, we evaluate both the GPT-4o and LLaMA-3.1-70B models and the results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.
摘要：大型语言模型（LLM）的最新进展显着重塑了自然语言处理（NLP）的景观。在各种提示技术中，很少有弹性提示因其实用性和有效性而引起了很大的关注。这项研究调查了很少的射击策略会影响“理性歧义”一词（WSD）任务，尤其是重点关注样本分布不平衡的偏见。我们使用Glossgpt提示方法（一种用于英语WSD的高级方法）来测试五种语言的有效性：英语，德语，西班牙语，法语和意大利语。我们的结果表明，不平衡的几个示例可能会导致多种语言中的意义预测不正确，但是此问题并未以英语出现。为了评估模型行为，我们评估了GPT-4O和LLAMA-3.1-70B模型，结果强调了多语言WSD对在几次播放设置中样本分布的敏感性，从而强调了对平衡和代表性提示策略的需求。

Title: Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development

Authors: Majid Asgari-Bidhendi, Muhammad Amin Ghaseminia, Alireza Shahbazi, Sayyed Ali Hossayni, Najmeh Torabian, Behrouz Minaei-Bidgoli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03781
Pdf URL: https://arxiv.org/pdf/2510.03781
Copy Paste: [[2510.03781]] Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development(https://arxiv.org/abs/2510.03781)
Keywords: language model, llm
Abstract: This paper presents the development of Rezwan, a large-scale AI-assisted Hadith corpus comprising over 1.2M narrations, extracted and structured through a fully automated pipeline. Building on digital repositories such as Maktabat Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for segmentation, chain--text separation, validation, and multi-layer enrichment. Each narration is enhanced with machine translation into twelve languages, intelligent diacritization, abstractive summarization, thematic tagging, and cross-text semantic analysis. This multi-step process transforms raw text into a richly annotated research-ready infrastructure for digital humanities and Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled narrations, assessed by six domain experts. Results show near-human accuracy in structured tasks such as chain--text separation (9.33/10) and summarization (9.33/10), while highlighting ongoing challenges in diacritization and semantic similarity detection. Comparative analysis against the manually curated Noor Corpus demonstrates the superiority of Najm in both scale and quality, with a mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis confirms the economic feasibility of the AI approach: tasks requiring over 229,000 hours of expert labor were completed within months at a fraction of the cost. The work introduces a new paradigm in religious text processing by showing how AI can augment human expertise, enabling large-scale, multilingual, and semantically enriched access to Islamic heritage.
摘要：本文介绍了Rezwan的发展，Rezwan是一个大规模的AI辅助圣训，包括超过120万个叙述，通过全自动管道提取和结构。该管道以数字存储库（例如Maktabat Ahl al-Bayt）为基础，使用大型语言模型（LLMS）进行细分，链条 - 文本分离，验证和多层富集。每种叙述都通过机器翻译成十二种语言，智能大语，抽象性摘要，主题标记和跨文本语义分析来增强每个叙述。这个多步骤的过程将原始文本转变为用于数字人文和伊斯兰研究的丰富注释的研究就绪的基础设施。对六个领域专家评估的1,213个随机抽样叙述进行了严格的评估。结果表明，在结构化任务（例如链 - 文本分离（9.33/10）和摘要（9.33/10）等结构化任务中，近乎人类的精度，同时突出了数量化和语义相似性检测的持续挑战。对手动策划的NOOR语料库的比较分析证明了NAJM在尺度和质量方面的优越性，平均总分为8.46/10对3.66/10。此外，成本分析证实了AI方法的经济可行性：在几个月内完成了229,000多个小时的专家劳动的任务，而成本的一小部分。这项工作通过展示了AI如何增强人类专业知识，使大规模，多语言和语义丰富获得伊斯兰遗产的机会，从而引入了宗教文本处理的新范式。

Title: Mechanistic Interpretability of Socio-Political Frames in Language Models

Authors: Hadi Asghari, Sami Nenno
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.03799
Pdf URL: https://arxiv.org/pdf/2510.03799
Copy Paste: [[2510.03799]] Mechanistic Interpretability of Socio-Political Frames in Language Models(https://arxiv.org/abs/2510.03799)
Keywords: language model, llm
Abstract: This paper explores the ability of large language models to generate and recognize deep cognitive frames, particularly in socio-political contexts. We demonstrate that LLMs are highly fluent in generating texts that evoke specific frames and can recognize these frames in zero-shot settings. Inspired by mechanistic interpretability research, we investigate the location of the `strict father' and `nurturing parent' frames within the model's hidden representation, identifying singular dimensions that correlate strongly with their presence. Our findings contribute to understanding how LLMs capture and express meaningful human concepts.
摘要：本文探讨了大语言模型生成和认识深层认知框架的能力，尤其是在社会政治背景下。我们证明了LLM在生成引起特定帧的文本方面具有很高的流利性，并且可以在零拍设置中识别这些帧。受到机械性可解释性研究的启发，我们研究了“严格的父亲”的位置和“养育父母”在模型隐藏表示中的位置，从而确定了与它们的存在密切相关的奇异维度。我们的发现有助于了解LLM如何捕获和表达有意义的人类概念。

Title: Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

Authors: Canhui Wu, Qiong Cao, Chang Li, Zhenfang Wang, Chao Xue, Yuwei Fan, Wei Xi, Xiaodong He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.03805
Pdf URL: https://arxiv.org/pdf/2510.03805
Copy Paste: [[2510.03805]] Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models(https://arxiv.org/abs/2510.03805)
Keywords: language model
Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce \textbf{Step Pruner (SP)}, an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by \textbf{69.7\%}.
摘要：大型推理模型（LRMS）在复杂的任务上表现出很强的表现，但通常具有过多的详细性，被称为“过度思考”。通过加强学习（RL）的现有解决方案通常会惩罚生成的令牌以促进简洁。但是，这些方法遇到了两个挑战：较少令牌的响应并不总是对应于较少的推理步骤，并且模型可以通过丢弃推理步骤以最大程度地减少令牌用法来发展培训后期的黑客行为。在这项工作中，我们介绍了\ textbf {step pruner（sp）}，这是一个RL框架，它通过偏爱紧凑的推理步骤来引导LRMS更有效地推理。我们的步进意识奖励功能优先考虑正确性，同时对冗余步骤施加惩罚，并拒绝奖励不正确的响应，以防止错误的推理加强。此外，我们提出了一个动态停止机制：当任何输出步骤的长度超过上限时，我们停止更新以防止通过合并步骤引起的黑客行为。跨四个推理基准的广泛实验表明，SP可以达到最新的精度，同时显着降低了响应长度。例如，在AIME24上，SP通过\ textbf {69.7 \％}减少令牌使用率。

Title: Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches

Authors: Mehedi Hasan Emon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03808
Pdf URL: https://arxiv.org/pdf/2510.03808
Copy Paste: [[2510.03808]] Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches(https://arxiv.org/abs/2510.03808)
Keywords: language model
Abstract: This research explores the annotation of rhetorical relations in discourse using the INCEpTION tool and compares manual annotation with automatic approaches based on large language models. The study focuses on sports reports (specifically cricket news) and evaluates the performance of BERT, DistilBERT, and Logistic Regression models in classifying rhetorical relations such as elaboration, contrast, background, and cause-effect. The results show that DistilBERT achieved the highest accuracy, highlighting its potential for efficient discourse relation prediction. This work contributes to the growing intersection of discourse parsing and transformer-based NLP. (This paper was conducted as part of an academic requirement under the supervision of Prof. Dr. Ralf Klabunde, Linguistic Data Science Lab, Ruhr University Bochum.) Keywords: Rhetorical Structure Theory, INCEpTION, BERT, DistilBERT, Discourse Parsing, NLP.
摘要：这项研究探讨了使用Inception工具在话语中的修辞关系注释，并将手动注释与基于大语言模型的自动方法进行比较。该研究的重点是体育报告（特别是板球新闻），并评估Bert，Distilbert和Logistic回归模型的性能，以分类诸如阐述，对比，背景和原因效应之类的修辞关系。结果表明，Distilbert取得了最高的精度，突出了其有效的话语关系预测的潜力。这项工作有助于话语解析和基于变压器的NLP的增长。（本文是在Ruhr University Bochum语言数据科学实验室Ralf Klabunde教授的监督下作为学术要求的一部分。）关键词：修辞结构理论，Inception，Bert，Bert，Distilbert，Distilbert，Discourse，Darsise Parssing，NLP。

Title: Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles

Authors: Nusrat Jahan Lia, Shubhashis Roy Dipta, Abdullah Khan Zehady, Naymul Islam, Madhusodan Chakraborty, Abdullah Al Wasif
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03898
Pdf URL: https://arxiv.org/pdf/2510.03898
Copy Paste: [[2510.03898]] Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles(https://arxiv.org/abs/2510.03898)
Keywords: language model, llm
Abstract: Detecting media bias is crucial, specifically in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. Crucially because, political stance detection in Bangla news requires understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.
摘要：检测媒体偏见至关重要，特别是在南亚地区。尽管如此，孟加拉政治偏见研究的注释数据集和计算研究仍然很少。至关重要的是，孟加拉新闻中的政治立场检测需要了解语言线索，文化背景，微妙的偏见，修辞策略，代码转换，隐性情绪和社会政治背景。为了解决这个问题，我们介绍了200个具有政治意义和高度争议的孟加拉新闻文章的第一个基准数据集，该文章标有政府倾向，政府危机和中性立场，以及评估大型语言模型（LLMS）的诊断分析。我们对28个专有和开源LLM的全面评估在检测政府危机含量（F1最高0.83）方面表现出强烈的表现，但中性文章的难度很大（F1低至0.00）。模型还倾向于过度预测政府倾斜的立场，通常会误解模棱两可的叙事。该数据集及其相关的诊断为孟加拉媒体研究中的立场检测提供了基础，并提供了改善低资源语言LLM性能的见解。

Title: PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian

Authors: Mohammad Amin Abbasi, Hassan Naderi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03913
Pdf URL: https://arxiv.org/pdf/2510.03913
Copy Paste: [[2510.03913]] PsycholexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian(https://arxiv.org/abs/2510.03913)
Keywords: language model, llm, prompt, agent
Abstract: This study presents PsychoLexTherapy, a framework for simulating psychotherapeutic reasoning in Persian using small language models (SLMs). The framework tackles the challenge of developing culturally grounded, therapeutically coherent dialogue systems with structured memory for multi-turn interactions in underrepresented languages. To ensure privacy and feasibility, PsychoLexTherapy is optimized for on-device deployment, enabling use without external servers. Development followed a three-stage process: (i) assessing SLMs psychological knowledge with PsychoLexEval; (ii) designing and implementing the reasoning-oriented PsychoLexTherapy framework; and (iii) constructing two evaluation datasets-PsychoLexQuery (real Persian user questions) and PsychoLexDialogue (hybrid simulated sessions)-to benchmark against multiple baselines. Experiments compared simple prompting, multi-agent debate, and structured therapeutic reasoning paths. Results showed that deliberate model selection balanced accuracy, efficiency, and privacy. On PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic LLM-as-a-judge evaluation and was ranked highest by human evaluators in a single-turn preference study. In multi-turn tests with PsychoLexDialogue, the long-term memory module proved essential: while naive history concatenation caused incoherence and information loss, the full framework achieved the highest ratings in empathy, coherence, cultural fit, and personalization. Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and culturally aligned foundation for Persian psychotherapy simulation, contributing novel datasets, a reproducible evaluation pipeline, and empirical insights into structured memory for therapeutic reasoning.
摘要：这项研究介绍了心理治疗，这是使用小语言模型（SLM）在波斯语中模拟心理治疗推理的框架。该框架应对开发文化扎根的治疗性对话系统的挑战，并具有结构性记忆，用于以代表性不足的语言进行多转交流。为了确保隐私和可行性，对心理疗法进行了优化，以用于设备部署，从而在没有外部服务器的情况下使用。发展遵循三阶段的过程：（i）用Psycholexeval评估SLM的心理知识；（ii）设计和实施面向推理的心理治疗框架；（iii）构建两个评估数据集psycholexquery（真正的波斯用户问题）和Psycholexdialogue（混合模拟会议） - 以对多个基线进行基准测试。实验比较了简单的提示，多代理争论和结构化的治疗推理路径。结果表明，故意的模型选择平衡了准确性，效率和隐私。在Psycholexquery上，心理治疗在自动LLM-AS-A-A-Gudge评估中的表现优于所有基础，在一项单转偏偏好研究中，人类评估者的排名最高。在使用Psycholexdialogue进行的多转弯测试中，长期记忆模块被证明是必不可少的：虽然天真的历史串联导致不连贯和信息丢失，但完整的框架在同理心，连贯，文化拟合和个性化方面获得了最高的评分。总体而言，Psycholextherapy为波斯心理疗法模拟建立了实用，保护隐私和文化统一的基础，贡献了新颖的数据集，可再现的评估管道以及对治疗推理的结构化记忆的经验见解。

Title: Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs

Authors: Junjie Luo, Rui Han, Arshana Welivita, Zeleikun Di, Jingfu Wu, Xuzhe Zhi, Ritu Agarwal, Gordon Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03997
Pdf URL: https://arxiv.org/pdf/2510.03997
Copy Paste: [[2510.03997]] Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs(https://arxiv.org/abs/2510.03997)
Keywords: language model, llm
Abstract: Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly high traits) to "Underperforming" (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.
摘要：了解患者对医生的看法对于改善信任，沟通和满意度至关重要。我们提出了一个基于大型语言模型（LLM）的管道，该管道涉及五大人格特征和五个面向患者的主观判断。该分析包括410万名患者评论，对226,999名美国医生的评价为100万库。我们通过多模型比较和人类专家基准测试来验证该方法，在人与LLM评估（相关系数为0.72-0.89）之间达到了强有力的一致性，并通过与患者满意度的相关性（r = 0.41-0.81，所有p <0.001）通过相关性来验证该方法。国家规模的分析揭示了系统的模式：男性医生在所有特征上都获得了更高的评分，并且临床能力看法的差异最大；与移情有关的特征在儿科和精神病学中占主导地位；所有特征都积极地预测了总体满意度。聚类分析从“全面的优秀”（33.8％，均匀的高性状）到“表现不佳”（22.6％，始终低），从而确定了四种不同的医师原型。这些发现表明，从患者叙事中提取自动特征可以提供可解释的，经过验证的指标，以了解医师患者的关系，这对医疗保健中的质量测量，偏见检测和劳动力发展产生了影响。

Title: Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Authors: Yang Xu, Xuanming Zhang, Min-Hsuan Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Yixuan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.03999
Pdf URL: https://arxiv.org/pdf/2510.03999
Copy Paste: [[2510.03999]] Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions(https://arxiv.org/abs/2510.03999)
Keywords: language model, llm, prompt, agent
Abstract: Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
摘要：欺骗是人类交流的普遍特征，也是大语模型（LLMS）中新兴的关注。尽管最近的研究记录了在压力下LLM欺骗的实例，但大多数评估仍然局限于单转弯的提示，并且未能捕获欺骗性策略通常展开的长途相互作用。我们介绍了第一个模拟框架，用于在相互依存的任务和动态上下文压力的扩展序列下探测和评估LLMS中的欺骗。我们的框架实例化了一个多代理系统：一个任务完成任务的表演者代理商和一个评估进度，提供反馈并维护不断发展的信任状态的主管代理。然后，一个独立的欺骗审核员审查了完整的轨迹，以识别何时以及如何发生欺骗。我们在跨越闭合和开源系统的11个边境模型上进行了广泛的实验，发现欺骗是依赖模型的，随着事件压力的增加，并始终如一地侵蚀主管的信任。定性分析进一步揭示了隐藏，模棱两可和伪造的不同策略。我们的发现确立了欺骗作为长途互动中的新兴风险，并为评估现实世界中信任敏感环境中未来LLM的基础提供了基础。

Title: AgriGPT-VL: Agricultural Vision-Language Understanding Suite

Authors: Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, Shijian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04002
Pdf URL: https://arxiv.org/pdf/2510.04002
Copy Paste: [[2510.04002]] AgriGPT-VL: Agricultural Vision-Language Understanding Suite(https://arxiv.org/abs/2510.04002)
Keywords: language model, gpt, llm, agent
Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the scarcity of domain-tailored models, curated vision-language corpora, and rigorous evaluation. To address these challenges, we present the AgriGPT-VL Suite, a unified multimodal framework for agriculture. Our contributions are threefold. First, we introduce Agri-3M-VL, the largest vision-language corpus for agriculture to our knowledge, curated by a scalable multi-agent data generator; it comprises 1M image-caption pairs, 2M image-grounded VQA pairs, 50K expert-level VQA instances, and 15K GRPO reinforcement learning samples. Second, we develop AgriGPT-VL, an agriculture-specialized vision-language model trained via a progressive curriculum of textual grounding, multimodal shallow/deep alignment, and GRPO refinement. This method achieves strong multimodal reasoning while preserving text-only capability. Third, we establish AgriBench-VL-4K, a compact yet challenging evaluation suite with open-ended and image-grounded questions, paired with multi-metric evaluation and an LLM-as-a-judge framework. Experiments show that AgriGPT-VL outperforms leading general-purpose VLMs on AgriBench-VL-4K, achieving higher pairwise win rates in the LLM-as-a-judge evaluation. Meanwhile, it remains competitive on the text-only AgriBench-13K with no noticeable degradation of language ability. Ablation studies further confirm consistent gains from our alignment and GRPO refinement stages. We will open source all of the resources to support reproducible research and deployment in low-resource agricultural settings.
摘要：尽管多模式大语言模型取得了迅速的进步，但农业应用仍受到域名模型，策划视觉语言语料库和严格评估的稀缺性的限制。为了应对这些挑战，我们提出了Agrigpt-VL Suite，这是一个统一的农业多模式框架。我们的贡献是三倍。首先，我们引入了Agri-3M-VL，这是我们所知的最大的农业视力语料库，并由可扩展的多代理数据生成器策划；它包括1M图像捕获对，2M图像接地的VQA对，50K专家级VQA实例和15K GRPO增强学习样品。其次，我们开发了Agrigpt-VL，这是一种农业特有的视觉模型模型，该模型通过文本接地，多模式浅/深对准和GRPO改进的渐进课程训练。此方法在保留仅文本功能的同时，实现了强大的多模式推理。第三，我们建立了Agribench-VL-4K，这是一个紧凑而富有挑战性的评估套件，带有开放式和图像的问题，并与多项式评估和LLM-AS-A-A-A-a-a-a-Gudge框架配对。实验表明，Agrigpt-VL优于Agribench-VL-4K上领先的通用VLM，在LLM-AS-A-A-Gudge评估中达到了较高的成对获胜率。同时，它在只有文本的Agribench-13K上仍然具有竞争力，而语言能力没有明显的退化。消融研究进一步证实了我们的一致性和GRPO完善阶段的一致收益。我们将开放所有资源，以支持低资源农业环境中可复制的研究和部署。

Title: LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization

Authors: Jiarui Liu, Jivitesh Jain, Mona Diab, Nishant Subramani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04013
Pdf URL: https://arxiv.org/pdf/2510.04013
Copy Paste: [[2510.04013]] LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization(https://arxiv.org/abs/2510.04013)
Keywords: language model, llm, prompt
Abstract: Although large language models (LLMs) have tremendous utility, trustworthiness is still a chief concern: models often generate incorrect information with high confidence. While contextual information can help guide generation, identifying when a query would benefit from retrieved context and assessing the effectiveness of that context remains challenging. In this work, we operationalize interpretability methods to ascertain whether we can predict the correctness of model outputs from the model's activations alone. We also explore whether model internals contain signals about the efficacy of external context. We consider correct, incorrect, and irrelevant context and introduce metrics to distinguish amongst them. Experiments on six different models reveal that a simple classifier trained on intermediate layer activations of the first output token can predict output correctness with about 75% accuracy, enabling early auditing. Our model-internals-based metric significantly outperforms prompting baselines at distinguishing between correct and incorrect context, guarding against inaccuracies introduced by polluted context. These findings offer a lens to better understand the underlying decision-making processes of LLMs. Our code is publicly available at this https URL
摘要：尽管大型语言模型（LLM）具有巨大的效用，但可信赖性仍然是一个主要问题：模型通常会充满信心地产生错误的信息。尽管上下文信息可以帮助指导生成，但要确定查询何时会从检索到的上下文中受益并评估该上下文的有效性仍然具有挑战性。在这项工作中，我们操作可解释性方法，以确定我们是否可以仅靠模型的激活来预测模型输出的正确性。我们还探索模型内部是否包含有关外部上下文功效的信号。我们考虑正确，不正确和无关紧要的背景，并引入指标以区分它们。六个不同模型的实验表明，对第一个输出令牌的中间层激活训练的简单分类器可以以约75％的精度预测输出正确性，从而实现早期审核。我们的基于模型的指标明显胜过促使基线区分正确和不正确的环境，从而防止污染上下文引入的不准确性。这些发现提供了镜头，以更好地了解LLMS的基本决策过程。我们的代码在此HTTPS URL上公开可用

Title: Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Authors: Thanapol Popit, Natthapath Rungseesiripak, Monthol Charattrakool, Saksorn Ruangtanusak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04016
Pdf URL: https://arxiv.org/pdf/2510.04016
Copy Paste: [[2510.04016]] Thai Semantic End-of-Turn Detection for Real-Time Voice Agents(https://arxiv.org/abs/2510.04016)
Keywords: llm, prompt, agent
Abstract: Fluid voice-to-voice interaction requires reliable and low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs to supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
摘要：流体语音到声音的互动需要可靠的和低延迟的检测用户何时完成讲话。在犹豫或特定于语言的现象下，传统的音频节奏末端末分增加了数百毫秒的延迟和失败。据我们所知，我们介绍了实时代理的首次系统研究（EOT）检测（EOT）检测。我们将紧凑型LLM的零射击和几乎没有射击的提示与监督轻型变压器的微调进行比较。使用来自Yodas语料库和泰语特异性语言提示（例如句子 - 最终粒子）的抄录字幕，我们将EOT作为对令牌边界的二进制决策。我们报告了明确的准确性延迟权衡，并提供了公共准备的实施计划。这项工作建立了泰国基线，并证明了小型，微调的模型可以提供适合于设备代理的近乎现行的EOT决策。

Title: Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?

Authors: Nelvin Tan, James Asikin Cheung, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04031
Pdf URL: https://arxiv.org/pdf/2510.04031
Copy Paste: [[2510.04031]] Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?(https://arxiv.org/abs/2510.04031)
Keywords: language model, llm
Abstract: Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs' decisions. Motivated by practical constrains where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM's ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.
摘要：大型语言模型（LLM）在许多领域中变得有用，因为它们的令人印象深刻的能力是由大型培训数据集和大型模型尺寸引起的。最近，它们已被证明在文本分类任务中非常有效，激发了解释LLMS决定的需求。由LLM是黑盒和LLM呼叫昂贵的实际限制的动机，我们研究将反事实纳入LLM推理如何影响LLM识别有助于其分类决策的顶级单词的能力。为此，我们介绍了一个名为“决策变化速率”的框架，该框架有助于我们量化最高单词在分类中的重要性。我们的实验结果表明，使用反事实可能会有所帮助。

Title: Small Language Models for Emergency Departments Decision Support: A Benchmark Study

Authors: Zirui Wang, Jiajun Wu, Braden Teitge, Jessalyn Holodinsky, Steve Drew
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04032
Pdf URL: https://arxiv.org/pdf/2510.04032
Copy Paste: [[2510.04032]] Small Language Models for Emergency Departments Decision Support: A Benchmark Study(https://arxiv.org/abs/2510.04032)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become increasingly popular in medical domains to assist physicians with a variety of clinical and operational tasks. Given the fast-paced and high-stakes environment of emergency departments (EDs), small language models (SLMs), characterized by a reduction in parameter count compared to LLMs, offer significant potential due to their inherent reasoning capability and efficient performance. This enables SLMs to support physicians by providing timely and accurate information synthesis, thereby improving clinical decision-making and workflow efficiency. In this paper, we present a comprehensive benchmark designed to identify SLMs suited for ED decision support, taking into account both specialized medical expertise and broad general problem-solving capabilities. In our evaluations, we focus on SLMs that have been trained on a mixture of general-domain and medical corpora. A key motivation for emphasizing SLMs is the practical hardware limitations, operational cost constraints, and privacy concerns in the typical real-world deployments. Our benchmark datasets include MedMCQA, MedQA-4Options, and PubMedQA, with the medical abstracts dataset emulating tasks aligned with real ED physicians' daily tasks. Experimental results reveal that general-domain SLMs surprisingly outperform their medically fine-tuned counterparts across these diverse benchmarks for ED. This indicates that for ED, specialized medical fine-tuning of the model may not be required.
摘要：大型语言模型（LLM）在医疗领域变得越来越流行，以帮助医生完成各种临床和操作任务。鉴于急诊科（ED）的快节奏和高风险环境，小语言模型（SLM）的特征是参数数量减少与LLM相比，由于其固有的推理能力和有效的性能，具有巨大的潜力。这使SLM能够通过及时，准确的信息综合来支持医生，从而提高临床决策和工作流程效率。在本文中，我们提出了一个全面的基准测试，旨在确定适合ED决策支持的SLM，并考虑到专业的医学专业知识和广泛的一般问题解决能力。在我们的评估中，我们专注于经过培训的SLM，以一般域和医疗公司的混合物进行培训。强调SLM的一个关键动机是实用的硬件限制，运营成本限制和典型现实部署的隐私问题。我们的基准数据集包括MEDMCQA，MEDQA-4OPTIONS和PUBMEDQA，其中包括医学摘要数据集模拟任务与真正的ED医师的日常任务一致。实验结果表明，通用域SLM出乎意料地超过了其在ED的这些不同基准的医学微调对应物。这表明对于ED而言，可能不需要对模型进行专门的医学微调。

Title: Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment

Authors: Yunfan Zhang, Kathleen McKeown, Smaranda Muresan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04045
Pdf URL: https://arxiv.org/pdf/2510.04045
Copy Paste: [[2510.04045]] Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment(https://arxiv.org/abs/2510.04045)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism -- the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
摘要：通常对大型语言模型（LLM）进行训练，以反映一组相对均匀的价值，这将其适用性限制在需要理解细微的人类观点的任务中。最近的研究强调了使LLM能够支持可口多元化的重要性 - 采用特定观点并与之相关的产出的能力。在这项工作中，我们研究了是否可以应用于构建可操作的多元化模型。我们探索了几种方法，包括促使COT，对人为作者的COT进行微调，对合成解释进行微调以及具有可验证奖励的增强学习（RLVR）。我们使用万万镜和意见数据集评估这些方法。在研究的方法中，RLVR始终优于其他方法，并证明了强大的训练样本效率。我们进一步分析了有关忠诚和安全的生成的COT痕迹。

Title: What Makes Diffusion Language Models Super Data Learners?

Authors: Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, Bryan Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04071
Pdf URL: https://arxiv.org/pdf/2510.04071
Copy Paste: [[2510.04071]] What Makes Diffusion Language Models Super Data Learners?(https://arxiv.org/abs/2510.04071)
Keywords: language model
Abstract: Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through in MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at this https URL.
摘要：最近的研究表明，在有限的数据约束下，扩散语言模型实现了显着的数据效率，但基本机制尚不清楚。在这项工作中，我们进行了广泛的消融实验，以消除该效率的来源。我们的结果表明，输入令牌的随机掩盖起主要作用。我们进一步表明，可以通过MLP辍学和重量衰减获得类似的收益，这表明随机正规化广泛提高了多上位数训练的数据效率。我们的代码可在此HTTPS URL上找到。

Title: PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

Authors: Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04080
Pdf URL: https://arxiv.org/pdf/2510.04080
Copy Paste: [[2510.04080]] PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity(https://arxiv.org/abs/2510.04080)
Keywords: language model, llm
Abstract: Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. However, we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by complex, coarse-grained reward signals. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and precise paradigm for training LLMs on complex, ranking-based conditional judgment tasks.
摘要：有条件的语义文本相似性（C-STS）测量了特定条件下文本段之间的语义接近性，从而克服了传统STS中固有的歧义。但是，现有方法在很大程度上仅限于判别模型，因此未能完全整合NLP社区中有关大语言模型（LLMS）和增强学习（RL）的最新突破。 RL是针对此任务的特别合适的范式，因为它可以直接优化非差异的Spearman排名指标，并指导C-STS所需的推理过程。但是，我们发现天真地应用列表RL无法产生有意义的改进，因为该模型被复杂的粗粒奖励信号所淹没。为了应对这一挑战，我们介绍了Poli-RL，这是一个新颖的点对上的加强学习框架。 Poli-RL采用了两阶段的课程：它首先以简单的尖端奖励来训练模型，以建立基本的得分能力，然后过渡到混合奖励，该奖励结合了点心，成对，成对和列表目标，以完善模型辨别微妙的语义差异的能力。至关重要的是，我们提出了一个创新的并行切片排名奖励（PSRR）机制，该机制在平行切片中计算排名奖励，其中每个切片都包含来自不同样本的相同指数完成。这为每个单独的完成提供了精确的，差异化的学习信号，从而实现了颗粒状的信用分配和有效的优化。在官方的C-STS基准中，Poli-RL实现了Spearman相关系数为48.18，为跨编码器建筑建立了新的SOTA。作为成功将RL应用于C-STS的第一项工作，我们的研究介绍了一个强大而精确的范式，用于培训基于复杂的，基于排名的条件判断任务的LLMS。

Title: Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning

Authors: Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu
Subjects: cs.CL, cs.PL
Abstract URL: https://arxiv.org/abs/2510.04081
Pdf URL: https://arxiv.org/pdf/2510.04081
Copy Paste: [[2510.04081]] Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning(https://arxiv.org/abs/2510.04081)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reasoning capability is pivotal for Large Language Models (LLMs) to solve complex tasks, yet achieving reliable and scalable reasoning remains challenging. While Chain-of-Thought (CoT) prompting has become a mainstream approach, existing methods often suffer from uncontrolled generation, insufficient quality, and limited diversity in reasoning paths. Recent efforts leverage code to enhance CoT by grounding reasoning in executable steps, but such methods are typically constrained to predefined mathematical problems, hindering scalability and generalizability. In this work, we propose Caco (Code-Assisted Chain-of-ThOught), a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data through code-driven augmentation. Unlike prior work, Caco first fine-tunes a code-based CoT generator on existing math and programming solutions in a unified code format, then scales the data generation to a large amount of diverse reasoning traces. Crucially, we introduce automated validation via code execution and rule-based filtering to ensure logical correctness and structural diversity, followed by reverse-engineering filtered outputs into natural language instructions and language CoTs to enrich task adaptability. This closed-loop process enables fully automated, scalable synthesis of reasoning data with guaranteed executability. Experiments on our created Caco-1.3M dataset demonstrate that Caco-trained models achieve strong competitive performance on mathematical reasoning benchmarks, outperforming existing strong baselines. Further analysis reveals that Caco's code-anchored verification and instruction diversity contribute to superior generalization across unseen tasks. Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
摘要：推理能力对于大型语言模型 (LLM) 解决复杂任务至关重要，但实现可靠且可扩展的推理仍然具有挑战性。虽然思想链（CoT）提示已成为主流方法，但现有方法往往存在生成不受控制、质量不足以及推理路径多样性有限的问题。最近的努力通过在可执行步骤中建立推理来利用代码来增强 CoT，但此类方法通常仅限于预定义的数学问题，从而阻碍了可扩展性和通用性。在这项工作中，我们提出了 Caco（代码辅助思维链），这是一种新颖的框架，可通过代码驱动的增强自动合成高质量、可验证和多样化的指令 CoT 推理数据。与之前的工作不同，Caco 首先以统一的代码格式在现有数学和编程解决方案上微调基于代码的 CoT 生成器，然后将数据生成扩展到大量不同的推理轨迹。至关重要的是，我们通过代码执行和基于规则的过滤引入自动验证，以确保逻辑正确性和结构多样性，然后将过滤后的输出逆向工程为自然语言指令和语言 CoT，以丰富任务适应性。这种闭环过程可以实现推理数据的完全自动化、可扩展的合成，并保证可执行性。在我们创建的 Caco-1.3M 数据集上进行的实验表明，Caco 训练的模型在数学推理基准上实现了强大的竞争性能，优于现有的强大基准。进一步的分析表明，Caco 的代码锚定验证和指令多样性有助于在未见过的任务中实现卓越的泛化。我们的工作建立了一个无需人工干预即可构建自我维持、值得信赖的推理系统的范例。

Title: Unveiling LLMs' Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence

Authors: Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04120
Pdf URL: https://arxiv.org/pdf/2510.04120
Copy Paste: [[2510.04120]] Unveiling LLMs' Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence(https://arxiv.org/abs/2510.04120)
Keywords: language model, llm
Abstract: Metaphor analysis is a complex linguistic phenomenon shaped by context and external factors. While Large Language Models (LLMs) demonstrate advanced capabilities in knowledge integration, contextual reasoning, and creative generation, their mechanisms for metaphor comprehension remain insufficiently explored. This study examines LLMs' metaphor-processing abilities from three perspectives: (1) Concept Mapping: using embedding space projections to evaluate how LLMs map concepts in target domains (e.g., misinterpreting "fall in love" as "drop down from love"); (2) Metaphor-Literal Repository: analyzing metaphorical words and their literal counterparts to identify inherent metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how metaphorical syntactic structures influence LLMs' performance. Our findings reveal that LLMs generate 15\%-25\% conceptually irrelevant interpretations, depend on metaphorical indicators in training data rather than contextual cues, and are more sensitive to syntactic irregularities than to structural comprehension. These insights underline the limitations of LLMs in metaphor analysis and call for more robust computational approaches.
摘要：隐喻分析是一种由上下文和外部因素塑造的复杂语言现象。尽管大型语言模型（LLMS）在知识整合，上下文推理和创造性生成中表现出高级功能，但它们的隐喻理解机制仍不足。这项研究从三个角度研究了LLMS的隐喻处理能力：（1）概念映射：使用嵌入式空间预测来评估LLM在目标领域中的概念（例如，误解了“陷入爱情”为“从爱情下降”）；（2）隐喻文字存储库：分析隐喻词及其字面的同类文字以识别固有的隐喻知识；（3）句法灵敏度：评估隐喻句法结构如何影响LLMS的性能。我们的发现表明，LLM在概念上产生15 \％-25 \％的解释，取决于训练数据中的隐喻指标，而不是上下文提示，并且对句法不规则性更敏感，而不是对结构理解。这些见解强调了LLM在隐喻分析中的局限性，并呼吁采用更强大的计算方法。

Title: Fine Tuning Methods for Low-resource Languages

Authors: Tim Bakkenes, Daniel Wang, Anton Johansson
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04139
Pdf URL: https://arxiv.org/pdf/2510.04139
Copy Paste: [[2510.04139]] Fine Tuning Methods for Low-resource Languages(https://arxiv.org/abs/2510.04139)
Keywords: language model
Abstract: The rise of Large Language Models has not been inclusive of all cultures. The models are mostly trained on English texts and culture which makes them underperform in other languages and cultural contexts. By developing a generalizable method for preparing culturally relevant datasets and post-training the Gemma 2 model, this project aimed to increase the performance of Gemma 2 for an underrepresented language and showcase how others can do the same to unlock the power of Generative AI in their country and preserve their cultural heritage.
摘要：大型语言模型的兴起并不包含所有文化。这些模型主要是对英语文本和文化进行培训的，这使得它们在其他语言和文化背景下表现不佳。通过开发一种可推广的方法来准备与文化相关的数据集和训练后的Gemma 2模型，该项目旨在提高Gemma 2的表现，以表现出代表性不足的语言，并展示其他人如何做同样的事情以释放在其国家内生成AI的力量并保留其文化遗产。

Title: Self Speculative Decoding for Diffusion Large Language Models

Authors: Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04147
Pdf URL: https://arxiv.org/pdf/2510.04147
Copy Paste: [[2510.04147]] Self Speculative Decoding for Diffusion Large Language Models(https://arxiv.org/abs/2510.04147)
Keywords: language model, llm
Abstract: Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results of current parallel decoding methods deviate from stepwise decoding, introducing potential performance degradation, which limits their practical deployment. To address this problem, we propose \textbf{S}elf \textbf{S}peculative \textbf{D}ecoding (SSD), a lossless inference acceleration method that leverages the dLLM itself as both speculative decoding drafter and verifier without auxiliary modules. SSD introduces a self-drafting mechanism where the model generates predictions for multiple positions, then verifies them through hierarchical verification trees in a single forward pass. Unlike traditional speculative decoding that requires separate draft models, SSD eliminates model redundancy and memory overhead by exploiting the dLLM's inherent parallel prediction capability for multiple positions. This self-speculative approach allows the model to progressively verify and accept multiple tokens in a single forward pass. Our experiments demonstrate that SSD achieves up to 3.46$\times$ speedup while keeping the output identical to stepwise decoding on open source models such as LLaDA and Dream. Code will be made publicly available on GitHub.
摘要：基于扩散的大型语言模型 (dLLM) 已成为自回归模型的竞争替代品，通过双向注意力和并行生成范例提供独特的优势。然而，当前并行解码方法的生成结果偏离逐步解码，引入潜在的性能下降，这限制了它们的实际部署。为了解决这个问题，我们提出了 \textbf{S}elf \textbf{S}peculative \textbf{D}ecoding (SSD)，这是一种无损推理加速方法，利用 dLLM 本身作为推测解码起草者和验证者，无需辅助模块。 SSD 引入了一种自起草机制，其中模型生成多个位置的预测，然后通过单次前向传递中的分层验证树来验证它们。与需要单独草稿模型的传统推测解码不同，SSD 通过利用 dLLM 固有的多个位置并行预测能力来消除模型冗余和内存开销。这种自我推测方法允许模型在一次前向传递中逐步验证和接受多个令牌。我们的实验表明，SSD 实现了高达 3.46$\times$ 的加速，同时保持输出与 LLaDA 和 Dream 等开源模型上的逐步解码相同。代码将在 GitHub 上公开发布。

Title: Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization

Authors: Wengao Ye, Yan Liang, Lianlei Shan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04182
Pdf URL: https://arxiv.org/pdf/2510.04182
Copy Paste: [[2510.04182]] Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization(https://arxiv.org/abs/2510.04182)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.
摘要：大型语言模型（LLM）的最新进展已从明确的思想链（COT）推理转变为更有效的潜在推理，其中中间思想被表示为向量而不是文本。但是，潜在的推理可能会在具有挑战性的，分布之外的任务上很脆弱，而强大的推理最关键。为了克服这些限制，我们引入了潜在思想策略优化（LTPO），这是一个无参数的框架，在不需要模型参数更新的情况下完全增强了LLM推理。 LTPO将中间潜在的“思想”向量视为动态参数，可为每个问题实例积极优化。它采用了一种在线政策梯度方法，该方法是由直接根据冷冻LLM自己的输出分布计算出的固有的，基于信心的奖励信号的指导，从而消除了在优化过程中对外部监督或昂贵的文本生成的需求。对五个推理基准测试的广泛实验表明，LTPO不仅与标准任务相匹配或超过强大的基准，而且在其他人失败的情况下表现出了很棒的鲁棒性。最值得注意的是，在高度挑战性的AIME基准测试中，现有的潜在推理基线崩溃至接近零的精度，LTPO提供了实质性的改进，展示了复杂推理的独特功能。

Title: Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards

Authors: Zhuoran Zhuang, Ye Chen, Xia Zeng, Chao Luo, Luhui Liu, Yihan Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04214
Pdf URL: https://arxiv.org/pdf/2510.04214
Copy Paste: [[2510.04214]] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment frm Heterogeneous Rewards(https://arxiv.org/abs/2510.04214)
Keywords: language model, llm, hallucination, agent
Abstract: We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. A straightforward enhancement mechanism is proposed to combine the RM with RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts average dialogue rating to 4.63: +1.20 over base, +0.83 over Direct Preference Optimization (DPO); +0.33 over Group Relative Policy Optimization (GRPO), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.
摘要：我们研究部署大型语言模型 (LLM) 作为在线旅行社 (OTA) 中具有说服力的价格谈判的业务开发 (BD) 代理，其中调整旅行者的承受能力和酒店盈利能力直接影响预订、合作伙伴关系和旅行机会。代理必须遵循标准操作程序 (SOP)，同时进行多轮说服、解释口语输入并遵守护栏（不要过度承诺，不要产生幻觉）。传统的后期培训——监督微调（SFT）或单一来源奖励优化——过度拟合脚本，错过细致入微的说服风格，并且无法强制执行可验证的业务约束。我们提出了奖励增强策略优化（REPO），这是一种强化学习后培训框架，可将法学硕士与异构奖励相结合：用于密集人类对齐的偏好训练奖励模型（RM），用于高级说服行为和SOP合规性的奖励法官（RJ），以及用于对数字、格式和内容进行确定性检查的程序化奖励函数（RF）。护栏。提出了一种直接的增强机制，将 RM 与 RJ 和 RF 信号结合起来，以遏制奖励黑客行为并提高协商质量。在制作风格的评估中——大约 150 轮来自真实对话，225 轮来自策划的坏情况对话——REPO 将平均对话评分提高到 4.63：比基准高出 1.20，比直接偏好优化 (DPO) 高出 0.83；比组相对策略优化 (GRPO) 提高 0.33，将具有至少一项出色响应的对话份额提高到 66.67%（比 GRPO 提高 23.34 个百分点），并实现 93.33% 的不良案例修复率和 75.56% 的干净修复，优于 SFT、DPO、PPO 和 GRPO。我们还观察到超越黄金注释的新兴能力——主动同理心、本地化推理、校准策略。

Title: Epistemic Diversity and Knowledge Collapse in Large Language Models

Authors: Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein
Subjects: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04226
Pdf URL: https://arxiv.org/pdf/2510.04226
Copy Paste: [[2510.04226]] Epistemic Diversity and Knowledge Collapse in Large Language Models(https://arxiv.org/abs/2510.04226)
Keywords: language model, llm, prompt, chat, retrieval-augmented generation
Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation
摘要：大型语言模型（LLM）倾向于以词汇，语义和风格同质的文本产生。这构成了知识崩溃的风险，随着时间的推移，同质LLM介导了一系列可访问信息的收缩。现有的均质化作品受到关注封闭式多项选择设置或模糊语义特征的限制，并且不关注跨时间和文化背景的趋势。为了克服这一点，我们提出了一种衡量认知多样性的新方法，即LLM输出中现实世界中主张的变化，我们用来对LLM知识崩溃进行广泛的经验研究。我们测试了27个LLM，155个涵盖12个国家 /地区的主题，以及来自真实用户聊天的200个及时变化。对于我们的研究中的主题，我们表明，尽管较新的模型倾向于产生更多样化的主张，但几乎所有模型在认识论上都比基本的网络搜索少。我们发现，模型的大小对认知多样性具有负面影响，而检索功能的一代（RAG）具有积极的影响，尽管抹布的改善会随文化背景而变化。最后，与传统的知识来源（Wikipedia）相比，我们发现特定于国家的主张反映了英语，而不是本地语言，这突出了认知代表的差距

Title: Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought

Authors: Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04230
Pdf URL: https://arxiv.org/pdf/2510.04230
Copy Paste: [[2510.04230]] Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought(https://arxiv.org/abs/2510.04230)
Keywords: prompt, chain-of-thought
Abstract: Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stonger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduct **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artificats. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train ninve models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 \pm 25), ranking first on 5/9 benchmarks and second on the remainder. Samller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across teh evaluated nine benchmarks. Ablations show **Language-Mixed CoT** is more effective than monolingual CoT, also resulting in cross-lingual and mult-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: this https URL.
摘要：最近的边境模型采用了长期的经过思考的推理来探索上下文中的解决方案空间并实现更稳定的性能。尽管许多作品研究蒸馏以建立较小但功能强大的模型，但大多数专注于英语，对语言特定的推理知之甚少。为了弥合这一差距，我们首先介绍**语言混合的cot **，这是一种推理模式，在英语和目标语言之间切换，使用英语作为推理的锚点，同时最大程度地减少翻译工艺。作为韩国案例研究，我们策划了** yi-sang **：579m的本地korean提示，来自Web Q＆A，考试，STEM和代码；从QWEN3-32B生成的37m长推理痕迹；和有针对性的260K高收益子集。我们训练六个模型（4b-35b），跨越了六个家庭（Qwen2.5，Llama-3.1，Gemma-3等）。我们的最佳模型** KO-REASON-35B **取得了最新的表现，总体平均得分最高（64.0 \ pm 25），在5/9基准中排名第一，其余排名第二。 Samller和中型模型也受益匪浅，在评估的9个基准测试中，平均提高+18.6点。消融表明**语言混合的cot **比单语cot更有效，也导致了跨语言和多模式的性能增长。我们发布了数据策划管道，评估系统，数据集和模型，以推动针对语言特定推理的研究。数据和模型收集：此HTTPS URL。

Title: LongTail-Swap: benchmarking language models' abilities on rare words

Authors: Robin Algayres, Charles-Éric Saint-James, Mahi Luthra, Jiayi Shen, Dongyan Lin, Youssef Benchekroun, Rashel Moritz, Juan Pino, Emmanuel Dupoux
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04268
Pdf URL: https://arxiv.org/pdf/2510.04268
Copy Paste: [[2510.04268]] LongTail-Swap: benchmarking language models' abilities on rare words(https://arxiv.org/abs/2510.04268)
Keywords: language model
Abstract: Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We've also made the code publicly avail
摘要：孩子们学会用少量的数据说话，可以在几次基础上教新单词，从而使他们特别是数据有效的学习者。 Babylm挑战旨在探索低数据制度中的语言模型（LM）培训，但使用集中在分布词的头上的指标。在这里，我们介绍了Longtail-Swap（LT-SWAP），该基准的重点是分布的尾巴，即测量LMS在很少接触的情况下学习新单词的能力，例如婴儿。 LT-SWAP是一种预处理的特定于语料库的测试集，可接受与不可接受的句子对，分离出罕见单词的语义和句法用法。通过计算每对两个成员的平均日志概率，以零拍的方式评估模型。我们分别建立了两个与10m单词和100m单词Babylm培训集相关的测试集，并评估了Babylm排行榜的16个模型。我们的结果不仅强调了语言模型在稀有单词上的表现不佳，而且还表明，长长的尾巴上，LM体系结构之间的性能差异要比头部更为明显。这提供了新的见解，其中架构可以更好地处理稀有词概括。我们还公开使用了代码

Title: Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy

Authors: Karthik Viswanathan, Sang Eon Park
Subjects: cs.CL, cond-mat.stat-mech, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.04285
Pdf URL: https://arxiv.org/pdf/2510.04285
Copy Paste: [[2510.04285]] Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy(https://arxiv.org/abs/2510.04285)
Keywords: language model, gpt, llm, prompt
Abstract: We introduce a cumulant-expansion framework for quantifying how large language models (LLMs) internalize higher-order statistical structure during next-token prediction. By treating the softmax entropy of each layer's logit distribution as a perturbation around its "center" distribution, we derive closed-form cumulant observables that isolate successively higher-order correlations. Empirically, we track these cumulants in GPT-2 and Pythia models on Pile-10K prompts. (i) Structured prompts exhibit a characteristic rise-and-plateau profile across layers, whereas token-shuffled prompts remain flat, revealing the dependence of the cumulant profile on meaningful context. (ii) During training, all cumulants increase monotonically before saturating, directly visualizing the model's progression from capturing variance to learning skew, kurtosis, and higher-order statistical structures. (iii) Mathematical prompts show distinct cumulant signatures compared to general text, quantifying how models employ fundamentally different processing mechanisms for mathematical versus linguistic content. Together, these results establish cumulant analysis as a lightweight, mathematically grounded probe of feature-learning dynamics in high-dimensional neural networks.
摘要：我们介绍了一个累积的暴露框架，用于量化在下一步预测期间如何内化大型语言模型（LLMS）的高阶统计结构。通过将每个层的logit分布的软磁熵视为其“中心”分布周围的扰动，我们得出了封闭形式的累积可观测值，可隔离较高阶段的相关性。从经验上讲，我们在桩-10k提示中跟踪GPT-2和毕达型模型中的这些累积物。（i）结构化的提示在各个层上表现出特征性的上升和斑点曲线，而令牌则保持平坦，揭示了累积概况对有意义的环境的依赖性。（ii）在训练过程中，所有累积物在饱和之前单调增加，直接可视化模型从捕获方差到学习偏斜，峰度和高阶统计结构的发展。（iii）与一般文本相比，数学提示显示出明显的累积签名，从而量化了模型如何使用根本不同的处理机制来用于数学和语言内容。总之，这些结果将累积分析作为一项轻巧的，数学上的探针，对高维神经网络中的特征学习动力学。

Title: SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

Authors: Harshil Vejendla
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04286
Pdf URL: https://arxiv.org/pdf/2510.04286
Copy Paste: [[2510.04286]] SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling(https://arxiv.org/abs/2510.04286)
Keywords: language model
Abstract: Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialization. We introduce SliceMoE, an architecture that routes contiguous slices of a token's hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are reassembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilization is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels. Experiments on WikiText-103 language modeling, WMT En-De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12 to 18 percent lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic subspaces.
摘要：Experts（MOE）的混合物通过将令牌路由到饲料前的专家的稀疏子集来使尺度变压器。但是，令牌级路由为每个专家分配了整个语义谱，创造了容量的瓶颈，负载平衡病理和有限的专业化。我们介绍了Slicemoe，这是一种架构，该体系结构可以路由令牌隐藏向量的连续切片。 D维嵌入被分配为S片，对于每个切片，一个轻巧的共享路由器都预测了Top-K专家。专家独立操作分配的切片，并重新组装产出，以维持flop效率。由于来自不同令牌的切片在专家中交织在一起，因此利用率自然更加顺畅。我们提出了切片级别的容量损失，跨片状辍学和有效的融合批量的GEMM内核。 Wikitext-103语言建模，WMT en-de翻译和三个文本分类数据集的实验显示，Slicemoe比致密基线更快地提取了1.7倍的推理，比密集的基线比匹配参数匹配的代币 - 与soken-moe低12％至18％，并且具有改进的专家平衡，并且具有更高的专业知识，并且具有可解释的专业知识，而不是构成语言的versus vessus vess vess versus versus versus。

Title: Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness

Authors: Lingnan Xu, Chong Feng, Kaiyuan Zhang, Liu Zhengyong, Wenqiang Xu, Fanqing Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04293
Pdf URL: https://arxiv.org/pdf/2510.04293
Copy Paste: [[2510.04293]] Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness(https://arxiv.org/abs/2510.04293)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While large language models (LLMs) demonstrate impressive capabilities, their reliance on parametric knowledge often leads to factual inaccuracies. Retrieval-Augmented Generation (RAG) mitigates this by leveraging external documents, yet existing approaches treat retrieved passages as isolated chunks, ignoring valuable structure that is crucial for document organization. Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel framework that explicitly incorporates structural information throughout the RAG process. RDR2 employs an LLM-based router to dynamically navigate document structure trees, jointly evaluating content relevance and hierarchical relationships to assemble optimal evidence. Our key innovation lies in formulating document routing as a trainable task, with automatic action curation and structure-aware passage selection inspired by human reading strategies. Through comprehensive evaluation on five challenging datasets, RDR2 achieves state-of-the-art performance, demonstrating that explicit structural awareness significantly enhances RAG systems' ability to acquire and utilize knowledge, particularly in complex scenarios requiring multi-document synthesis.
摘要：尽管大型语言模型（LLMS）表现出令人印象深刻的能力，但它们对参数知识的依赖通常会导致事实上的不准确性。检索演示的生成（RAG）通过利用外部文档来减轻这种情况，但是现有的方法将检索到的段落视为孤立的块，忽略了对文档组织至关重要的有价值的结构。在这个差距的激励下，我们提出检索DocumentRoute-Read（RDR2），这是一个新颖的框架，在整个抹布过程中明确结合了结构信息。 RDR2使用基于LLM的路由器动态导航文档结构树，共同评估内容相关性和分层关系以组装最佳证据。我们的关键创新在于将文档路由作为一项可训练的任务，其自动行动策划和结构感知的通道选择灵感来自人类阅读策略。通过对五个具有挑战性的数据集进行全面评估，RDR2实现了最先进的性能，这表明明确的结构意识显着增强了RAG系统获得和利用知识的能力，尤其是在需要多文章合成的复杂场景中。

Title: Measuring Language Model Hallucinations Through Distributional Correctness

Authors: Thomas F Burns
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04302
Pdf URL: https://arxiv.org/pdf/2510.04302
Copy Paste: [[2510.04302]] Measuring Language Model Hallucinations Through Distributional Correctness(https://arxiv.org/abs/2510.04302)
Keywords: language model, hallucination
Abstract: Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in-part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, they ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers versus hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to solve this problem, i.e., of not considering a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that for half of the tested benchmarks scores are negative across all tested models, indicating significant tendencies towards hallucination.
摘要：语言模型的常见评估范例集中在通过准确度指标或适当的评分规则来评分单一响应，未能捕获模型信念状态的全部丰富性。最近的工作说明了语言模型在PART的幻觉，因为它们在二元评分方案下被优化为良好的考试者，这些计划奖励了任何关于弃权的答案。尽管这种洞察力自然会导致基于惩罚的方法，但它们却忽略了模型如何分发不确定性的关键区别，例如，对应着错误的答案与对冲“我不知道”的回答之间。引入了一个新颖的评估度量标准，即分布正确性得分（DC），以解决此问题，即不考虑模型在答案选择上的整个概率分布。 DC自然会区分错误答案中的有害过度自信和通过弃权表达的不确定性，从而在可解释的默认范围内提供了分数。通过理论分析和说明性示例，DC被证明提供了更细微和更加一致的评估范式，该范式激励模型表达真正的不确定性而不是猜测。将12个现有的评估基准调整为DCS的变体并测量六种语言模型的性能，这表明，在所有经过测试的模型中，一半的测试基准分数为阴性，表明幻觉的显着趋势。

Title: Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Authors: Rui Wu, Yihao Quan, Zeru Shi, Zhenting Wang, Yanshu Li, Ruixiang Tang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04320
Pdf URL: https://arxiv.org/pdf/2510.04320
Copy Paste: [[2510.04320]] Read the Scene, Not the Script: Outcome-Aware Safety for LLMs(https://arxiv.org/abs/2510.04320)
Keywords: language model, llm
Abstract: Safety-aligned Large Language Models (LLMs) still show two dominant failure modes: they are easily jailbroken, or they over-refuse harmless inputs that contain sensitive surface signals. We trace both to a common cause: current models reason weakly about links between actions and outcomes and over-rely on surface-form signals, lexical or stylistic cues that do not encode consequences. We define this failure mode as Consequence-blindness. To study consequence-blindness, we build a benchmark named CB-Bench covering four risk scenarios that vary whether semantic risk aligns with outcome risk, enabling evaluation under both matched and mismatched conditions which are often ignored by existing safety benchmarks. Mainstream models consistently fail to separate these risks and exhibit consequence-blindness, indicating that consequence-blindness is widespread and systematic. To mitigate consequence-blindness, we introduce CS-Chain-4k, a consequence-reasoning dataset for safety alignment. Models fine-tuned on CS-Chain-4k show clear gains against semantic-camouflage jailbreaks and reduce over-refusal on harmless inputs, while maintaining utility and generalization on other benchmarks. These results clarify the limits of current alignment, establish consequence-aware reasoning as a core alignment goal and provide a more practical and reproducible evaluation path.
摘要：安全一致的大型语言模型（LLMS）仍然显示出两种主要的故障模式：它们很容易越狱，或者它们过度反复包含敏感表面信号的无害输入。我们将这两者都追溯到一个共同的原因：当前模型的原因很虚弱，却是关于动作与结果之间的联系，以及不编码后果的表面形式信号，词汇或风格的提示。我们将这种故障模式定义为后果盲。为了研究后果盲目，我们建立了一个名为CB基础的基准，涵盖了四种风险场景，这些风险是否与结果风险保持一致，在匹配和不匹配的条件下都可以评估，这些条件通常被现有安全基准忽略。主流模型始终无法分离这些风险并表现出盲目性，这表明后果是广泛和系统性的。为了减轻后果盲，我们引入了CS-Chain-4K，这是安全对齐的结果数据集。在CS-Chain-4K上进行了微调的模型显示出针对语义界面越狱的明显收益，并减少了无害输入的过度倍增，同时保持了对其他基准测试的实用性和泛化。这些结果阐明了当前一致性的局限性，确定后果意识的推理作为核心对齐目标，并提供更实用和可重复的评估路径。

Title: Evaluation of Clinical Trials Reporting Quality using Large Language Models

Authors: Mathieu Laï-king, Patrick Paroubek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04338
Pdf URL: https://arxiv.org/pdf/2510.04338
Copy Paste: [[2510.04338]] Evaluation of Clinical Trials Reporting Quality using Large Language Models(https://arxiv.org/abs/2510.04338)
Keywords: language model, prompt, chain-of-thought
Abstract: Reporting quality is an important topic in clinical trial research articles, as it can impact clinical decisions. In this article, we test the ability of large language models to assess the reporting quality of this type of article using the Consolidated Standards of Reporting Trials (CONSORT). We create CONSORT-QA, an evaluation corpus from two studies on abstract reporting quality with CONSORT-abstract standards. We then evaluate the ability of different large generative language models (from the general domain or adapted to the biomedical domain) to correctly assess CONSORT criteria with different known prompting methods, including Chain-of-thought. Our best combination of model and prompting method achieves 85% accuracy. Using Chain-of-thought adds valuable information on the model's reasoning for completing the task.
摘要：报告质量是临床试验研究文章中的重要主题，因为它可能会影响临床决策。在本文中，我们测试了大语模型使用报告试验的合并标准（CONSORT）评估此类文章报告质量的能力。我们创建了CONSORT-QA，这是一项评估语料库，这些研究来自两项有关摘要报告质量的研究，并具有提交标准。然后，我们评估不同大型生成语言模型（从通用领域或适应生物医学领域）的能力，以包括不同的思维链（包括思维链）正确评估了配偶标准。我们最佳的模型和提示方法的组合实现了85％的精度。使用经过思考链为完成任务的推理添加了有价值的信息。

Title: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Authors: Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04340
Pdf URL: https://arxiv.org/pdf/2510.04340
Copy Paste: [[2510.04340]] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time(https://arxiv.org/abs/2510.04340)
Keywords: language model, llm, prompt
Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
摘要：语言模型填充通常会导致学习不良特征与所需的特征。为了解决这个问题，我们提出接种提示：通过准备简短的系统预付指令来修改鉴定数据，该指令故意引起不良性状。在测试时，我们在没有指导的情况下进行评估；接种模型的性状表达要比经过未经修改的训练数据训练的模型要低得多。接种是选择性的：在一个玩具环境中，助理回复始终以西班牙语和全范围为单位，这是一种适当的接种（例如，``您总是用西班牙语''）教导该模型在仍以英语响应的同时资本化回答。我们发现，在几种其他设置中接种也是有效的：从特定于任务的固定措施中减少紧急未对准（EM），防御后门注射，并通过潜意识学习来减轻性状的传播。后续分析提出了一种机制：通过接种而使特征减少了特征，从而降低了优化压力以在全球更新模型，从而降低了概括程度。我们的分析与EM：接种的先前工作有关，解释了先前的发现，即教育环境从不安全的代码中减轻EM。除了展示一种用于选择性学习的简单有效的技术之外，我们的结果还有助于对语言模型的推广方式和原因有更好的概念理解。

Title: Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models

Authors: Anindya Sundar Das, Kangjie Chen, Monowar Bhuyan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04347
Pdf URL: https://arxiv.org/pdf/2510.04347
Copy Paste: [[2510.04347]] Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models(https://arxiv.org/abs/2510.04347)
Keywords: language model
Abstract: Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage, but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs; where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
摘要：预训练的语言模型已在各种自然语言处理（NLP）任务中取得了巨大的成功，尤其是在与大型域相关的数据集中进行微调时。但是，它们仍然容易受到后门攻击的攻击，在训练数据中，对手使用触发模式嵌入恶意行为。这些触发因素在正常使用过程中仍处于休眠状态，但是当激活时，可能会导致靶向错误分类。在这项工作中，我们研究了基于后培训的基于编码器的语言模型的内部行为，重点是处理中毒输入时注意力和梯度归因的持续转变；触发令牌既主导着注意力和梯度信号，从而覆盖周围环境。我们提出了一种推理时间防御，该防御通过结合令牌级别的关注和梯度信息来构建异常得分。关于不同后门攻击情景的文本分类任务的广泛实验表明，与现有基线相比，我们的方法大大降低了攻击成功率。此外，我们还提供了对评分机制的可解释性驱动分析，阐明了触发定位和拟议防御的鲁棒性。

Title: Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

Authors: Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04392
Pdf URL: https://arxiv.org/pdf/2510.04392
Copy Paste: [[2510.04392]] Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards(https://arxiv.org/abs/2510.04392)
Keywords: llm
Abstract: RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries. However, existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator (LLM), undermining trust and reliability. In this work, we focus on information consistency, i.e., the requirement that outputs convey the same core content across semantically equivalent inputs. We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components, helping identify inconsistency sources. To improve consistency, we propose Paraphrased Set Group Relative Policy Optimization (PS-GRPO), an RL approach that leverages multiple rollouts across paraphrased set to assign group similarity rewards. We leverage PS-GRPO to achieve Information Consistent RAG (Con-RAG), training the generator to produce consistent outputs across paraphrased queries and remain robust to retrieval-induced variability. Because exact reward computation over paraphrase sets is computationally expensive, we also introduce a scalable approximation method that retains effectiveness while enabling efficient, large-scale training. Empirical evaluations across short-form, multi-hop, and long-form QA benchmarks demonstrate that Con-RAG significantly improves both consistency and accuracy over strong baselines, even in the absence of explicit ground-truth supervision. Our work provides practical solutions for evaluating and building reliable RAG systems for safety-critical deployments.
摘要：抹布系统越来越多地部署在高风险域中，在这些域中，用户期望在语义上等效的查询中输出一致。但是，由于猎犬和发电机（LLM）的可变性，现有系统通常会出现明显的不一致，破坏了信任和可靠性。在这项工作中，我们专注于信息一致性，即输出在语义上等效输入中传达相同的核心内容的要求。我们介绍了一个原则上的评估框架，将破布一致性分解为检索级，生成器级别和端到端组件，有助于识别不一致的源。为了提高一致性，我们提出了释义组相对策略优化（PS-GRPO），这是一种RL方法，它利用跨释义设置的多个推出来分配组相似性奖励。我们利用PS-GRPO实现信息一致的抹布（con-rag），训练发电机以在释义的查询中产生一致的输出，并保持强大的稳定性以检索诱导的可变性。由于对释义集的精确奖励计算在计算上是昂贵的，因此我们还引入了一种可扩展的近似方法，该方法可以保留有效性，同时实现高效的大规模训练。跨形式，多跳和长格式基准的经验评估表明，即使在没有明确的基地真相监督的情况下，Con-rag也可以显着提高强大基准的一致性和准确性。我们的工作提供了用于评估和构建可靠的抹布系统的实用解决方案。

Title: SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Authors: Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, René Vidal
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04398
Pdf URL: https://arxiv.org/pdf/2510.04398
Copy Paste: [[2510.04398]] SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations(https://arxiv.org/abs/2510.04398)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at this https URL.
摘要：大型语言模型 (LLM) 越来越多地部署在高风险领域。然而，最先进的法学硕士经常会产生幻觉，引发人们对其可靠性的严重担忧。之前的工作已经探索了法学硕士中幻觉诱发的对抗性攻击，但它经常通过插入乱码或改变原始含义来产生不切实际的提示。因此，这些方法对幻觉在实践中如何发生的了解有限。虽然计算机视觉中的对抗性攻击通常涉及对输入图像的真实修改，但寻找真实的对抗性提示来引发法学硕士幻觉的问题在很大程度上仍未得到充分探索。为了解决这一差距，我们提出语义等效和连贯攻击（SECA），通过对提示进行现实修改来引发幻觉，在保持语义连贯性的同时保留其含义。我们的贡献有三个：（i）我们将寻找幻觉引发的真实攻击作为语义等价和连贯性约束下输入提示空间上的约束优化问题；（ii）我们引入一种保留约束的零阶方法来有效地搜索对抗性但可行的提示； (iii) 我们通过开放式多项选择题回答任务的实验证明，与现有方法相比，SECA 实现了更高的攻击成功率，同时几乎没有违反约束。 SECA 强调了开源和商业梯度无法访问的 LLM 对现实且合理的提示变化的敏感性。代码可从此 https URL 获取。

Title: Large Language Models Preserve Semantic Isotopies in Story Continuations

Authors: Marc Cavazza
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04400
Pdf URL: https://arxiv.org/pdf/2510.04400
Copy Paste: [[2510.04400]] Large Language Models Preserve Semantic Isotopies in Story Continuations(https://arxiv.org/abs/2510.04400)
Keywords: language model, gpt, llm, prompt
Abstract: In this work, we explore the relevance of textual semantics to Large Language Models (LLMs), extending previous insights into the connection between distributional semantics and structural semantics. We investigate whether LLM-generated texts preserve semantic isotopies. We design a story continuation experiment using 10,000 ROCStories prompts completed by five LLMs. We first validate GPT-4o's ability to extract isotopies from a linguistic benchmark, then apply it to the generated stories. We then analyze structural (coverage, density, spread) and semantic properties of isotopies to assess how they are affected by completion. Results show that LLM completion within a given token horizon preserves semantic isotopies across multiple properties.
摘要：在这项工作中，我们探讨了文本语义与大语言模型（LLM）的相关性，从而扩展了对分布语义和结构语义之间联系的先前见解。我们研究LLM生成的文本是否保留语义同位素。我们使用五个LLM完成的10,000个Rocstories提示设计了一个故事延续实验。我们首先验证GPT-4O从语言基准中提取同位素的能力，然后将其应用于生成的故事。然后，我们分析同位素的结构（覆盖，密度，扩散）和语义特性，以评估它们如何受到完成的影响。结果表明，在给定的令牌范围内LLM完成可保留多个属性的语义同位素。

Title: On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs

Authors: Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04439
Pdf URL: https://arxiv.org/pdf/2510.04439
Copy Paste: [[2510.04439]] On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs(https://arxiv.org/abs/2510.04439)
Keywords: language model, llm, hallucination
Abstract: Quantifying uncertainty in large language models (LLMs) is important for safety-critical applications because it helps spot incorrect answers, known as hallucinations. One major trend of uncertainty quantification methods is based on estimating the entropy of the distribution of the LLM's potential output sequences. This estimation is based on a set of output sequences and associated probabilities obtained by querying the LLM several times. In this paper, we advocate and experimentally show that the probability of unobserved sequences plays a crucial role, and we recommend future research to integrate it to enhance such LLM uncertainty quantification methods.
摘要：量化大语模型（LLM）中的不确定性对于安全至关重要的应用很重要，因为它有助于发现错误的答案，称为幻觉。不确定性定量方法的一个主要趋势是估计LLM潜在输出序列分布的熵。该估计基于一组输出序列和通过查询LLM多次查询的一组输出序列和相关概率。在本文中，我们提倡并通过实验表明，未观察到的序列的可能性起着至关重要的作用，我们建议将来的研究以增强这种LLM不确定性定量方法的整合。

Title: Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Authors: Xiangchi Yuan, Xiang Chen, Tong Yu, Dachuan Shi, Can Jin, Wenke Lee, Saayan Mitra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04454
Pdf URL: https://arxiv.org/pdf/2510.04454
Copy Paste: [[2510.04454]] Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners(https://arxiv.org/abs/2510.04454)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) show strong reasoning abilities, often amplified by Chain-of-Thought (CoT) prompting and reinforcement learning (RL). Although RL algorithms can substantially improve reasoning, they struggle to expand reasoning boundaries because they learn from their own reasoning trajectories rather than acquiring external knowledge. Supervised fine-tuning (SFT) offers complementary benefits but typically requires large-scale data and risks overfitting. Recent attempts to combine SFT and RL face three main challenges: data inefficiency, algorithm-specific designs, and catastrophic forgetting. We propose a plug-and-play framework that dynamically integrates SFT into RL by selecting challenging examples for SFT. This approach reduces SFT data requirements and remains agnostic to the choice of RL or SFT algorithm. To mitigate catastrophic forgetting of RL-acquired skills during SFT, we select high-entropy tokens for loss calculation and freeze parameters identified as critical for RL. Our method achieves state-of-the-art (SoTA) reasoning performance using only 1.5% of the SFT data and 20.4% of the RL data used by prior SoTA, providing an efficient and plug-and-play solution for combining SFT and RL in reasoning post-training.
摘要：大型语言模型（LLMS）表现出强大的推理能力，通常会因提示和增强学习（RL）的链接链（COT）扩大。尽管RL算法可以基本上改善推理，但他们努力扩大推理界限，因为他们从自己的推理轨迹中学习而不是获得外部知识。监督的微调（SFT）提供互补的好处，但通常需要大规模的数据和风险过高。最近将SFT和RL结合的尝试面临三个主要挑战：数据效率低下，算法特异性设计和灾难性遗忘。我们提出了一个即插即用框架，该框架通过为SFT选择具有挑战性的示例将SFT动态整合到RL中。这种方法降低了SFT数据要求，并且对RL或SFT算法的选择仍然不可知。为了减轻SFT期间RL获得的技能的灾难性忘记，我们选择了高渗透令牌来进行损失计算和冻结参数，并确定为RL至关重要。我们的方法仅使用1.5％的SFT数据和20.4％的RL数据来实现先进的推理性能，从而提供了有效且播放的解决方案，可在培训后推理中将SFT和RL结合起来。

Title: Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness

Authors: Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04484
Pdf URL: https://arxiv.org/pdf/2510.04484
Copy Paste: [[2510.04484]] Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness(https://arxiv.org/abs/2510.04484)
Keywords: llm, prompt
Abstract: The ability to control LLMs' emulated emotional states and personality traits is essential for enabling rich, human-centered interactions in socially interactive settings. We introduce PsySET, a Psychologically-informed benchmark to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion and personality domains. Our study spans four models from different LLM families paired with various steering strategies, including prompting, fine-tuning, and representation engineering. Our results indicate that prompting is consistently effective but limited in intensity control, whereas vector injections achieve finer controllability while slightly reducing output quality. Moreover, we explore the trustworthiness of steered LLMs by assessing safety, truthfulness, fairness, and ethics, highlighting potential side effects and behavioral shifts. Notably, we observe idiosyncratic effects; for instance, even a positive emotion like joy can degrade robustness to adversarial factuality, lower privacy awareness, and increase preferential bias. Meanwhile, anger predictably elevates toxicity yet strengthens leakage resistance. Our framework establishes the first holistic evaluation of emotion and personality steering, offering insights into its interpretability and reliability for socially interactive applications.
摘要：控制LLM的模仿情绪状态和人格特质的能力对于在社会互动环境中实现丰富的，以人为中心的互动至关重要。我们介绍了PSYSET，这是一种心理知识的基准，用于评估情感和个性领域的LLM转向有效性和可信赖性。我们的研究涵盖了来自不同LLM家族的四个模型，并配对各种转向策略，包括提示，微调和代表工程。我们的结果表明，提示始终有效但在强度控制方面有限，而矢量注射可实现更优质的可控性，同时略微降低了输出质量。此外，我们通过评估安全，真实性，公平和道德规范来探索转向LLM的可信赖性，从而突出潜在的副作用和行为转变。值得注意的是，我们观察到特质的影响。例如，即使是喜悦之类的积极情绪也会降低对敌对事实的鲁棒性，降低隐私意识并增加优惠偏见。同时，愤怒可以预见地提高毒性，但增强了泄漏阻力。我们的框架建立了对情绪和个性转向的首次整体评估，从而提供了对其对社会互动应用的可解释性和可靠性的见解。

Title: GenQuest: An LLM-based Text Adventure Game for Language Learners

Authors: Qiao Wang, Adnan Labib, Robert Swier, Michael Hofmeyr, Zheng Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04498
Pdf URL: https://arxiv.org/pdf/2510.04498
Copy Paste: [[2510.04498]] GenQuest: An LLM-based Text Adventure Game for Language Learners(https://arxiv.org/abs/2510.04498)
Keywords: language model, llm
Abstract: GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative "choose-your-own-adventure" style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner's proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are suggestions from participants regarding the narrative length and quality, and the request for multi-modal content such as illustrations.
摘要：Genquest是一款生成的文本冒险游戏，利用大型语言模型（LLM）通过沉浸式，互动式的讲故事来促进第二语言学习。该系统以协作的“选择自己的冒险”风格叙事为外语（EFL）学习者，以对学习者选择的方式动态生成。纳入了诸如分支决策要点和故事里程碑之类的游戏机制，以保持叙事连贯性，同时允许学习者驱动的情节发展。关键的教学功能包括针对每个学习者的熟练程度量身定制的内容生成，以及词汇助理，可提供对学习者征服的文本字符串的文字说明，从单词和短语到句子。来自中国大学EFL学生的试点研究的发现表明，有希望的词汇收益和积极的用户看法。还讨论了参与者关于叙事长度和质量的建议，以及对插图等多模式内容的要求。

Title: GRACE: Generative Representation Learning via Contrastive Policy Optimization

Authors: Jiashuo Sun, Shixuan Liu, Zhaochen Su, Xianrui Zhong, Pengcheng Jiang, Bowen Jin, Peiran Li, Weijia Shi, Jiawei Han
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.04506
Pdf URL: https://arxiv.org/pdf/2510.04506
Copy Paste: [[2510.04506]] GRACE: Generative Representation Learning via Contrastive Policy Optimization(https://arxiv.org/abs/2510.04506)
Keywords: language model, llm, agent
Abstract: Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a novel framework that reimagines contrastive signals not as losses to be minimized, but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy that produces explicit, human-interpretable rationales--structured natural language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On MTEB benchmark, GRACE yields broad cross category gains: averaged over four backbones, the supervised setting improves overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent rationales. The model, data and code are available at this https URL.
摘要：作为文本编码的大型语言模型（LLMS）的流行方法依赖于将模型视为黑匣子功能的对比损失，丢弃了其生成和推理能力，以支持静态嵌入。我们介绍了宽限期（通过对比政策优化的生成表示学习），这是一个新颖的框架，它重新构想了对比信号，而不是要最小化的损失，而是指导生成政策的奖励。在恩典中，LLM充当一项政策，产生明确的，人性化的理性理性 - 结构化的自然语言解释其语义理解。然后，这些理由通过平均合并编码为高质量的嵌入。使用策略梯度优化，我们使用多组分奖励函数训练模型，该奖励功能可最大化查询正对之间的相似性并最大程度地减少与负面的相似性。这将LLM从不透明的编码转换为一个可解释的代理，其推理过程是透明且可检查的。在MTEB基准测试中，Grace产生了广泛的交叉类别增长：平均四个骨架，监督环境比基本型号提高了总分11.5％，而无监督的变体则增加了6.9％，同时保留一般能力。这项工作将对比目标视为对理由的奖励，将代表性学习与代表统一，以产生更强的嵌入和透明的理由。该模型，数据和代码可在此HTTPS URL上找到。

Title: Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference

Authors: Dang Anh, Rick Nouwen, Massimo Poesio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04581
Pdf URL: https://arxiv.org/pdf/2510.04581
Copy Paste: [[2510.04581]] Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference(https://arxiv.org/abs/2510.04581)
Keywords: llm, prompt
Abstract: Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.
摘要：我们的目标是研究LLM在模棱两可和明确的环境中如何表示和解释复数参考。我们提出以下研究问题：（1）LLM在代表复数参考时表现出类似人类的偏好吗？（2）LLM是否能够检测到复数表达式中的歧义并识别可能的参数？为了解决这些问题，我们设计了一组实验，使用下一步的预测任务检查代词生产，代词解释和使用不同提示策略进行歧义检测。然后，我们评估LLM与人类在制定和解释复数参考方面的可比程度。我们发现LLM有时会意识到模棱两可代词的可能的参考。但是，在解释之间选择时，它们并不总是遵循人类参考，尤其是在未明确提及可能的解释时。此外，他们努力在没有直接指导的情况下确定歧义。我们的发现还揭示了不同类型的实验结果的不一致。

Title: Robustness assessment of large audio language models in multiple-choice evaluation

Authors: Fernando López, Santosh Kesiraju, Jordi Luque
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2510.04584
Pdf URL: https://arxiv.org/pdf/2510.04584
Copy Paste: [[2510.04584]] Robustness assessment of large audio language models in multiple-choice evaluation(https://arxiv.org/abs/2510.04584)
Keywords: language model
Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in substantially different results. Existing MCQA frameworks do not account for this variability and report a single accuracy number per benchmark or category. We dive into the MCQA evaluation framework and conduct a systematic study spanning three benchmarks (MMAU, MMAR and MMSU) and four models: Audio Flamingo 2, Audio Flamingo 3, Qwen2.5-Omni-7B-Instruct, and Kimi-Audio-7B-Instruct. Our findings indicate that models are sensitive not only to the ordering of choices, but also to the paraphrasing of the question and the choices. Finally, we propose a simpler evaluation protocol and metric that account for subtle variations and provide a more detailed evaluation report of LALMs within the MCQA framework.
摘要：大型音频语言模型 (LALM) 的最新进展主要使用多项选择问答 (MCQA) 框架进行评估。然而，细微的变化，例如改变选择的顺序，会导致截然不同的结果。现有的 MCQA 框架没有考虑到这种可变性，并且针对每个基准或类别报告单个准确度数字。我们深入研究 MCQA 评估框架，并进行了涵盖三个基准（MMAU、MMAR 和 MMSU）和四个模型的系统研究：Audio Flamingo 2、Audio Flamingo 3、Qwen2.5-Omni-7B-Instruct 和 Kimi-Audio-7B-Instruct。我们的研究结果表明，模型不仅对选择的顺序敏感，而且对问题和选择的释义也敏感。最后，我们提出了一个更简单的评估协议和指标，可以考虑细微的变化，并在 MCQA 框架内提供更详细的 LALM 评估报告。

Title: FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Authors: Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04601
Pdf URL: https://arxiv.org/pdf/2510.04601
Copy Paste: [[2510.04601]] FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning(https://arxiv.org/abs/2510.04601)
Keywords: language model, llm
Abstract: The current paradigm of training large language models (LLMs) on publicly available Web data is becoming unsustainable, with high-quality data sources in specialized domains nearing exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning by leveraging private data distributed across a global client base. While Low-Rank Adaptation (LoRA) is the standard for efficient fine-tuning, its application in federated settings presents a critical challenge: communication overhead remains a significant bottleneck across the Web's heterogeneous network conditions. The structural redundancy within LoRA parameters not only incurs a heavy communication burden but also introduces conflicts when aggregating client updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework designed for communication-efficient FL. We first introduce an importance-aware sparsification method that preserves the structural integrity of LoRA updates to reduce the uploaded parameter count. The server then reconstructs and aggregates these updates in a full-rank space to mitigate conflicts. Finally, it decomposes the global update into a sparse low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experimental results on 10 benchmarks demonstrate that our framework significantly reduces communication costs by up to 90\% while even improving model performance on heterogeneous client data.
摘要：当前培训大语言模型（LLM）的范式在公开可用的网络数据上变得不可持续，并且在专业领域中，高质量的数据源几乎是精疲力尽。联合学习（FL）是在分散网络上下一代AI的实用解决方案，通过利用在全球客户群中分发的私人数据来实现隐私的协作微调。尽管低级适应（LORA）是有效微调的标准，但其在联合设置中的应用提出了一个关键的挑战：通信开销仍然是网络异构网络条件的重要瓶颈。 LORA参数内的结构冗余不仅会导致沉重的通信负担，而且在汇总客户更新时会引入冲突。为了解决这个问题，我们提出了FedSrd，这是一个稀疏重建代码框架，专为沟通效率fl而设计。我们首先引入了一种重要的稀疏方法，该方法保留了LORA更新的结构完整性，以减少上传的参数计数。然后，服务器重建并在全等级空间中汇总这些更新以减轻冲突。最后，它将全局更新分解为稀疏的低级格式，以确保对称高效的周期。我们还提出了一个有效的变体FEDSRD-E，以减少计算开销。 10个基准的实验结果表明，我们的框架可将通信成本大大降低90 \％，同时甚至改善了异质客户数据的模型性能。

Title: Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry

Authors: Anastasia Zhukova, Jonas Lührs, Christian E. Matt, Bela Gipp
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.04631
Pdf URL: https://arxiv.org/pdf/2510.04631
Copy Paste: [[2510.04631]] Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry(https://arxiv.org/abs/2510.04631)
Keywords: language model
Abstract: Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained language models by incorporating additional knowledge from the graph structures to learn domain-specific terminology or relationships between documents that might otherwise be overlooked. This paper explores how SciNCL, a graph-aware neighborhood contrastive learning methodology originally designed for scientific publications, can be applied to the process industry domain, where text logs contain crucial information about daily operations and are often structured as sparse KGs. Our experiments demonstrate that language models fine-tuned with triplets derived from GE outperform a state-of-the-art mE5-large text encoder by 9.8-14.3% (5.4-8.0p) on the proprietary process industry text embedding benchmark (PITEB) while being 3-5 times smaller in size.
摘要：NLP的最新趋势利用知识图（kgs）来增强审计的语言模型，通过将图形结构中的其他知识纳入学习特定领域的术语或文档之间的关系，否则可能会被忽略。本文探讨了最初是为科学出版物设计的图形感知的邻里对比学习方法，可以应用于流程行业领域，其中文本日志包含有关日常操作的重要信息，并且通常被构成稀疏kgs。我们的实验表明，用GE贵三重的三胞胎胜过最先进的ME5大型文本编码器的语言模型在专有过程行业嵌入基准（PITEB）上，而尺寸小3-5倍。

Title: Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study

Authors: Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04641
Pdf URL: https://arxiv.org/pdf/2510.04641
Copy Paste: [[2510.04641]] Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study(https://arxiv.org/abs/2510.04641)
Keywords: language model, llm, prompt
Abstract: Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable auditing frameworks.
摘要：用于培训通用AI模型的大规模网络网络文本语料库通常包含有害人口统计学的社会偏见，从而产生了对数据审核的监管需求，并开发了可扩展的偏见检测方法。尽管先前的工作调查了文本数据集和相关检测方法中的偏见，但这些研究的范围仍然很狭窄。他们通常专注于单一内容类型（例如，仇恨言论），涵盖有限的人口统计轴，忽略影响多个人口统计学的偏见，并分析有限的技术。因此，从业者对最近大型语言模型（LLMS）的优势和局限性缺乏全面的理解。在这项研究中，我们提出了一个全面的评估框架，该框架旨在评估LLM在检测人口统计学的社会偏见方面的能力。为了与监管要求保持一致，我们使用以人口统计学分类法进行了偏置检测作为多标签任务。然后，我们对跨量表和技术的模型进行了系统评估，包括提示，内在学习和微调。我们的研究使用涵盖各种内容类型和人口统计数据的十二个数据集，证明了对可扩展检测的微调较小模型的希望。但是，我们的分析还揭示了人口统计轴和多人口统计学靶向偏见之间的持续差距，从而强调了需要更有效和可扩展的审计框架。

Title: FocusMed: A Large Language Model-based Framework for Enhancing Medical Question Summarization with Focus Identification

Authors: Chao Liu, Ling Luo, Tengxiao Lv, Huan Zhuang, Lejing Yu, Jian Wang, Hongfei Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04671
Pdf URL: https://arxiv.org/pdf/2510.04671
Copy Paste: [[2510.04671]] FocusMed: A Large Language Model-based Framework for Enhancing Medical Question Summarization with Focus Identification(https://arxiv.org/abs/2510.04671)
Keywords: language model, llm, hallucination, prompt
Abstract: With the rapid development of online medical platforms, consumer health questions (CHQs) are inefficient in diagnosis due to redundant information and frequent non-professional terms. The medical question summary (MQS) task aims to transform CHQs into streamlined doctors' frequently asked questions (FAQs), but existing methods still face challenges such as poor identification of question focus and model hallucination. This paper explores the potential of large language models (LLMs) in the MQS task and finds that direct fine-tuning is prone to focus identification bias and generates unfaithful content. To this end, we propose an optimization framework based on core focus guidance. First, a prompt template is designed to drive the LLMs to extract the core focus from the CHQs that is faithful to the original text. Then, a fine-tuning dataset is constructed in combination with the original CHQ-FAQ pairs to improve the ability to identify the focus of the question. Finally, a multi-dimensional quality evaluation and selection mechanism is proposed to comprehensively improve the quality of the summary from multiple dimensions. We conduct comprehensive experiments on two widely-adopted MQS datasets using three established evaluation metrics. The proposed framework achieves state-of-the-art performance across all measures, demonstrating a significant boost in the model's ability to identify critical focus of questions and a notable mitigation of hallucinations. The source codes are freely available at this https URL.
摘要：随着在线医疗平台的快速发展，由于冗余信息和频繁的非专业术语，消费者健康问题（CHQ）的诊断效率低下。医学问题摘要（MQS）的任务旨在将CHQ转变为精简的医生常见问题（FAQ），但是现有方法仍然面临诸如对问题重点和模型幻觉的识别不佳的挑战。本文探讨了MQS任务中大语言模型（LLM）的潜力，并发现直接的微调很容易集中身份识别偏见并产生不忠实的内容。为此，我们提出了一个基于核心重点指南的优化框架。首先，及时模板旨在驱动LLM，以从忠于原始文本的CHQ中提取核心焦点。然后，与原始的CHQ-FAQ对结合构建微型数据集，以提高识别问题焦点的能力。最后，提出了多维质量评估和选择机制，以全面地提高来自多个维度的摘要的质量。我们使用三个既定的评估指标对两个广泛的MQS数据集进行了全面的实验。拟议的框架在所有措施中都达到了最先进的绩效，这表明该模型识别问题关键重点的能力和明显缓解幻觉的能力有了显着的提高。源代码可在此HTTPS URL上免费获得。

Title: Multi-Agent Tool-Integrated Policy Optimization

Authors: Zhanfeng Mo, Xingxuan Li, Yuntao Chen, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04678
Pdf URL: https://arxiv.org/pdf/2510.04678
Copy Paste: [[2510.04678]] Multi-Agent Tool-Integrated Policy Optimization(https://arxiv.org/abs/2510.04678)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines by an average of 18.38% relative improvement in performance and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.
摘要：大型语言模型（LLMS）越来越依赖于多转弯工具集成的计划，用于知识密集和复杂的推理任务。现有的实现通常依赖于单个代理，但它们的上下文长度和嘈杂的工具响应却遭受了有限的痛苦。一种自然的解决方案是采用具有计划者和工人的多代理框架来管理上下文。但是，没有现有的方法支持工具集成的多代理框架培训后有效的强化学习。为了解决这一差距，我们提出了多代理工具集成策略优化（MATPO），该策略优化（MATPO）可以通过强化学习，可以在单个LLM实例中培训不同的角色（计划者和工作人员）。 MATPO源自规划师和工人推广的原则信贷分配机制。这种设计消除了部署多个LLM的需求，这将是记忆密集型的，同时保留了专业化的好处。 Gaia-Text，WebWalkerQA和帧的实验表明，MATPO的表现始终超过单代基线基线，平均相对性能相对改善，对嘈杂的工具输出表现出更大的鲁棒性。我们的发现突出了在单个LLM中统一多个代理角色的有效性，并为稳定有效的多代理RL培训提供了实用的见解。

Title: TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Authors: Chanjoo Jung, Jaehyung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04682
Pdf URL: https://arxiv.org/pdf/2510.04682
Copy Paste: [[2510.04682]] TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA(https://arxiv.org/abs/2510.04682)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are widely applied in real world scenarios, but fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs, but the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data, but this adds complexity because it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, our experiments show that the proposed method is consistently effective, achieving average performance gains of +4~8% compared to baselines overall.
摘要：大型语言模型 (LLM) 广泛应用于现实场景中，但对其进行微调会带来大量的计算和存储成本。 LoRA 等参数高效微调 (PEFT) 方法可以减轻这些成本，但调整后的参数取决于基本模型，并且无法跨不同骨干网传输。解决这个问题的一种方法是通过知识蒸馏，但其有效性本质上取决于训练数据。最近的工作（例如 TransLoRA）通过生成合成数据来避免这种情况，但这增加了复杂性，因为它需要训练额外的鉴别器模型。在本文中，我们提出了 TiTok，这是一个新框架，可通过代币级知识转移实现有效的 LoRA 移植。具体来说，TiTok 通过使用和不使用 LoRA 的源模型之间的对比来捕获任务相关信息。这种多余的部分突出了信息标记，并能够选择性地过滤合成数据，所有这些都不需要额外的模型或开销。通过对多个传输设置的三个基准进行实验，我们的实验表明，所提出的方法始终有效，与总体基线相比，平均性能提升了 +4~8%。

Title: Multilingual Routing in Mixture-of-Experts

Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04694
Pdf URL: https://arxiv.org/pdf/2510.04694
Copy Paste: [[2510.04694]] Multilingual Routing in Mixture-of-Experts(https://arxiv.org/abs/2510.04694)
Keywords: llm
Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
摘要：Experts（MOE）架构的混合物已成为扩展现代LLM的关键，但是关于它们稀疏的路由动力学如何响应多语言数据，几乎没有理解。在这项工作中，我们使用并行多语言数据集分析了专家路由模式，并呈现高度可解释的层次现象。我们发现，MOE模型在早期和晚期解码器中以特定于语言的方式路由令牌，但在中间层中显示出明显的跨语性路由对齐，反映了在密集的LLM中观察到的参数共享趋势。特别是，我们揭示了模型在给定语言中的模型性能与在这些层中与英语相似的代币之间的明显相关性。扩展超出相关性，我们探讨了诱导较高跨语性路由对准的推理时间干预措施。我们介绍了一种通过促进经常用英语激活的中层任务专家来促进路由器的方法，并成功地提高了多语言性能。在两个评估任务，三种模型和15种语言中，这些1-2％的收益非常一致，尤其是考虑到这些简单的干预措施覆盖了经过广泛培训的最先进的LLM的路由器。相比之下，中层外部的干预措施或针对多语言专家仅产生绩效降解。总的来说，我们提出了许多发现，这些发现解释了MOE如何处理非英语文本，并证明概括受模型利用所有语言的语言 - 宇宙专家的能力的限制。

Title: JSON Whisperer: Efficient JSON Editing with LLMs

Authors: Sarel Duanis, Asnat Greenstein-Messica, Eliya Habba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04717
Pdf URL: https://arxiv.org/pdf/2510.04717
Copy Paste: [[2510.04717]] JSON Whisperer: Efficient JSON Editing with LLMs(https://arxiv.org/abs/2510.04717)
Keywords: language model, llm
Abstract: Large language models (LLMs) can modify JSON documents through natural language commands, but current approaches regenerate entire structures for each edit, resulting in computational inefficiency. We present JSON Whisperer, a framework that enables LLMs to generate RFC 6902 diff patches-expressing only the necessary modifications-rather than complete documents. We identify two key challenges in patch-based editing: (1) LLMs often miss related updates when generating isolated patches, and (2) array manipulations require tracking index shifts across operations, which LLMs handle poorly. To address these issues, we introduce EASE (Explicitly Addressed Sequence Encoding), which transforms arrays into dictionaries with stable keys, eliminating index arithmetic complexities. Our evaluation shows that patch generation with EASE reduces token usage by 31% while maintaining edit quality within 5% of full regeneration with particular gains for complex instructions and list manipulations. The dataset is available at: this https URL
摘要：大型语言模型（LLMS）可以通过自然语言命令修改JSON文档，但是当前的方法会为每种编辑重新生成整个结构，从而导致计算效率低下。我们提出了JSON Whisperer，该框架使LLM可以生成RFC 6902 DIFF补丁，仅表达所需的修改 - 与完整的文档相比。我们在基于补丁的编辑中确定了两个关键挑战：（1）LLMS在生成孤立的补丁时通常会错过相关的更新，并且（2）阵列操作需要跟踪跨操作的索引转移，而LLMS的处理方式较差。为了解决这些问题，我们介绍了易于操作（明确解决的序列编码），该序列用稳定的键转换为词典，消除了索引算术算术复杂性。我们的评估表明，贴片的生成易于使用，将令牌的使用量减少了31％，同时将编辑质量保持在完全再生的5％之内，并具有特定的收益，以进行复杂的说明和列出操作。该数据集可用网址：此HTTPS URL

Title: ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever

Authors: Eduardo Martínez Rivera, Filippo Menolascina
Subjects: cs.CL, q-bio.QM
Abstract URL: https://arxiv.org/abs/2510.04757
Pdf URL: https://arxiv.org/pdf/2510.04757
Copy Paste: [[2510.04757]] ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever(https://arxiv.org/abs/2510.04757)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a powerful technique for enriching Large Language Models (LLMs) with external knowledge, allowing for factually grounded responses, a critical requirement in high-stakes domains such as healthcare. However, the efficacy of RAG systems is fundamentally restricted by the performance of their retrieval module, since irrelevant or semantically misaligned documents directly compromise the accuracy of the final generated response. General-purpose dense retrievers can struggle with the nuanced language of specialised domains, while the high accuracy of in-domain models is often achieved at prohibitive computational costs. In this work, we aim to address this trade-off by developing and evaluating a two-stage retrieval architecture that combines a lightweight ModernBERT bidirectional encoder for efficient initial candidate retrieval with a ColBERTv2 late-interaction model for fine-grained re-ranking. We conduct comprehensive evaluations of our retriever module performance and RAG system performance in the biomedical context, fine-tuning the IR module using 10k question-passage pairs from PubMedQA. Our analysis of the retriever module confirmed the positive impact of the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points compared to its retrieve-only counterpart. When integrated into the biomedical RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on the five tasks of the MIRAGE question-answering benchmark, outperforming strong baselines such as MedCPT (0.4436). Our ablation studies reveal that this performance is critically dependent on a joint fine-tuning process that aligns the retriever and re-ranker; otherwise, the re-ranker might degrade the performance.
摘要：检索演示的生成（RAG）是一种具有外部知识的大型语言模型（LLM）的强大技术，可以实现实际扎根的响应，这是医疗保健等高风险领域的关键要求。但是，抹布系统的功效在根本上受其检索模块的性能的限制，因为无关或语义上错误的文档直接损害了最终生成的响应的准确性。通用密集的猎犬可能会与专业领域的细微差异语言斗争，而内域模型的高精度通常是在过度的计算成本中实现的。在这项工作中，我们旨在通过开发和评估两阶段的检索架构来解决这一权衡，该架构结合了轻巧的现代双向编码器，以便有效的初始候选检索与COLBERTV2后期相互作用模型，以实现细粒度重新排列。我们在生物医学环境下对回猎犬模块性能和抹布系统性能进行全面评估，并使用PubMedQA的10K Question-Passage Pairs微调IR模块。我们对猎犬模块的分析证实了Colbert Reaker的积极影响，与仅检索的同类产品相比，该率@3提高了3.2个百分点。当整合到生物医学抹布中时，我们的IR模块会导致最先进的平均准确性为0.4448在Mirage提问基准的五个任务上，表现优于MEDCPT（0.4436）。我们的消融研究表明，这种性能在很大程度上取决于与猎犬和重新级别保持一致的关节微调过程。否则，重新级别可能会降低性能。

Title: Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models

Authors: Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04764
Pdf URL: https://arxiv.org/pdf/2510.04764
Copy Paste: [[2510.04764]] Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models(https://arxiv.org/abs/2510.04764)
Keywords: language model, llm
Abstract: Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)'s study of children's sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.
摘要：隐性含义是人类交流不可或缺的一部分，这使得语言模型必须能够识别和解释它们至关重要。格里斯（Grice，1975）提出了一套引导合作对话的对话格言，并指出说话者可能会故意违反这些原则，以表达超出字面意思的含义，而听众则依次承认这种违规行为以提出务实的推论。建立在Surian等人的基础上。（1996年）对儿童对违反Gricean Maxims的敏感性的研究，我们引入了一种新颖的基准测试，以测试在小于100M和少于100M的令牌上是否仔细预测的语言模型可以将Maxim粘附在Maxim-violation语言中区分开。我们将这些Babylms比较了五个格言，并将其表现相对于儿童和在3T代币上预处理的大型语言模型（LLM）进行比较。我们发现，在不到100m的代币培训的模型中，在不到1000万培训的情况下，训练的模型优于那些训练有素的模型，但没有儿童级别和LLM的能力。我们的结果表明，适度的数据增加了务实行为的某些方面，从而导致务实维度之间的细粒度分化。

Title: Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Authors: Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04800
Pdf URL: https://arxiv.org/pdf/2510.04800
Copy Paste: [[2510.04800]] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights(https://arxiv.org/abs/2510.04800)
Keywords: language model
Abstract: Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
摘要：大语言模型的最新进展表明，混合体系结构 - 构成自我发项机制与诸如Mamba（Mamba）的结构化状态空间模型 - CAN在建模质量和计算效率之间取得了令人信服的平衡，尤其是对于长篇文章任务。尽管这些混合模型表现出令人鼓舞的性能，但杂交策略的系统比较以及对其有效性背后的关键因素的分析尚未清楚地与社区共享。在这项工作中，我们对基于层间（顺序）或层内（平行）融合的混合体系结构进行了整体评估。我们从各种角度评估了这些设计：语言建模性能，长篇文化功能，扩展分析以及培训和推理效率。通过研究其计算原始的核心特征，我们确定了每个混合策略的最关键要素，并进一步为这两种混合模型提出了最佳设计食谱。我们的全面分析为开发混合语言模型提供了实用的指导和宝贵的见解，从而促进了建筑配置的优化。

Title: Instability in Downstream Task Performance During LLM Pretraining

Authors: Yuto Nishida, Masaru Isonuma, Yusuke Oda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04848
Pdf URL: https://arxiv.org/pdf/2510.04848
Copy Paste: [[2510.04848]] Instability in Downstream Task Performance During LLM Pretraining(https://arxiv.org/abs/2510.04848)
Keywords: language model, llm
Abstract: When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest validation score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.
摘要：当训练大型语言模型（LLM）时，通常在整个培训过程中跟踪下游任务性能并选择具有最高验证分数的检查点是很常见的。但是，下游指标通常会显示出很大的波动，因此很难确定真正代表最佳模型的检查点。在这项研究中，我们通过经验分析了对多样化的网络规模语料库培训的LLM中下游任务绩效的稳定性。我们发现，任务得分在整个培训中经常波动，无论是在汇总和示例级别上。为了解决这种不稳定，我们研究了两种事后检查点集成方法：检查点的平均和集合，这是由汇总相邻检查点可以降低性能波动率的假设的动机。我们从经验和理论上都证明了这些方法可以改善下游性能稳定性，而无需对训练程序进行任何更改。

Title: When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

Authors: Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.04849
Pdf URL: https://arxiv.org/pdf/2510.04849
Copy Paste: [[2510.04849]] When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA(https://arxiv.org/abs/2510.04849)
Keywords: language model, gpt, llm, hallucination
Abstract: Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
摘要：幻觉检测仍然是大型语言模型（LLMS）安全和可靠部署的基本挑战，尤其是在需要事实准确性的应用中。现有的幻觉基准通常以序列水平运行，并且仅限于英语，缺乏全面评估所需的细粒度，多语言监督。在这项工作中，我们介绍了Psiloqa，这是一种大规模的多语言数据集，注释了14种语言的跨度幻觉。 PSILOQA是通过自动化的三阶段管道构建的：使用GPT-4O从Wikipedia产生问答对，在无限制性设置中引起了来自不同LLM的潜在幻觉答案，并自动通过GPT-4O自动注释幻觉的跨度通过GPT-4O进行与金色答案进行比较，并进行了gpt-4O。我们评估了广泛的幻觉检测方法 - 包括不确定性量化，基于LLM的标记和微调编码器模型 - 并表明基于编码器的模型在跨语言中实现了最强的性能。此外，Psiloqa表现出有效的跨语性概括，并支持强大的知识转移到其他基准，同时比人类宣传的数据集更具成本效益。我们的数据集和结果推动了多语种设置中可扩展的细粒幻觉检测的开发。

Title: Detecting Distillation Data from Reasoning Models

Authors: Hengxiang Zhang, Hyeong Kyu Choi, Yixuan Li, Hongxin Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04850
Pdf URL: https://arxiv.org/pdf/2510.04850
Copy Paste: [[2510.04850]] Detecting Distillation Data from Reasoning Models(https://arxiv.org/abs/2510.04850)
Keywords: language model
Abstract: Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
摘要：推理蒸馏已成为一种有效而有力的范式，用于增强大语言模型的推理能力。但是，推理蒸馏可能会无意间引起基准污染，其中蒸馏数据集中包含的评估数据会夸大蒸馏模型的性能指标。在这项工作中，我们正式定义了蒸馏数据检测的任务，由于蒸馏数据的部分可用性，这是一个唯一的挑战。然后，我们提出了一种新颖有效的方法令牌概率偏差（TBD），该概率偏差（TBD）利用生成的输出令牌的概率模式。我们的方法是由蒸馏模型倾向于为可见问题生成近确定性令牌的分析而动机，同时为看不见的问题产生了更多低概率的令牌。 TBD背后的关键思想是量化生成的令牌概率与高参考概率的差异。实际上，我们的方法通过与看不见的问题产生较低的分数来实现竞争性检测性能。广泛的实验证明了我们方法的有效性，在S1数据集上达到了0.918的AUC，TPR@1％FPR为0.470。

Title: SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

Authors: Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.04891
Pdf URL: https://arxiv.org/pdf/2510.04891
Copy Paste: [[2510.04891]] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests(https://arxiv.org/abs/2510.04891)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings demonstrate that current safeguards fail to generalize to high-stakes sociopolitical settings, exposing systematic biases and raising concerns about the reliability of LLMs in preserving human rights and democratic values. We share the SocialHarmBench benchmark at this https URL.
摘要：大型语言模型（LLM）越来越多地部署在其失败可能会带来直接社会政治后果的情况下。然而，现有的安全基准很少在政治操纵，宣传和虚假信息产生或监视和信息控制等领域中测试漏洞。我们介绍了SocialHarmbench，这是一个跨越7个社会政治类别和34个国家的585个提示的数据集，旨在浮出水面，而LLM最严重地在政治上充满的环境中最严重失败。我们的评估表明了几个缺点：开放权重模型表现出很大的有害合规性脆弱性，Mistral-7b在诸如历史修正主义，宣传和政治操纵等领域中达到了攻击成功率高达97％至98％。此外，时间和地理分析表明，在面对21世纪或20世纪之前的环境以及与拉丁美洲，美国，美国和英国等地区相关的提示时，LLM最脆弱。这些发现表明，当前的保障措施未能推广到高风险的社会政治环境，暴露系统的偏见并提出对LLM在保存人权和民主价值观方面的可靠性的担忧。我们在此HTTPS URL上共享社交措施基准。

Title: Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment

Authors: Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2510.04919
Pdf URL: https://arxiv.org/pdf/2510.04919
Copy Paste: [[2510.04919]] Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment(https://arxiv.org/abs/2510.04919)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) is an effective method for adapting Large Language Models (LLMs) on downstream tasks. However, variability in training data can hinder a model's ability to generalize across domains. This paper studies the problem of dataset alignment for Natural Language to SQL (NL2SQL or text to SQL), examining how well SFT training data matches the structural characteristics of target queries and how this alignment impacts model performance. We hypothesize that alignment can be accurately estimated by comparing the distributions of structural SQL features across the training set, target data, and the model's predictions prior to SFT. Through comprehensive experiments on three large cross-domain NL2SQL benchmarks and multiple model families, we show that structural alignment is a strong predictor of fine-tuning success. When alignment is high, SFT yields substantial gains in accuracy and SQL generation quality; when alignment is low, improvements are marginal or absent. These findings highlight the importance of alignment-aware data selection for effective fine-tuning and generalization in NL2SQL tasks.
摘要：监督微调（SFT）是在下游任务上调整大型语言模型（LLM）的有效方法。但是，训练数据的可变性可能会阻碍模型跨域概括的能力。本文研究了自然语言与SQL（NL2SQL或文本与SQL）的数据集对齐问题，从而研究了SFT培训数据与目标查询的结构性特征的匹配程度以及该一致性如何影响模型性能。我们假设可以通过比较训练集，目标数据以及SFT之前的模型预测的结构SQL特征的分布来准确估算对齐。通过对三个大型跨域NL2SQL基准和多个模型家族的全面实验，我们表明结构对齐是微调成功的有力预测指标。当对准较高时，SFT会在准确性和SQL生成质量方面获得可观的提高。当对准较低时，改进是边缘或不存在的。这些发现突出了对齐方式选择数据选择对NL2SQL任务中有效的微调和概括的重要性。

Title: The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models

Authors: Amir Hameed Mir
Subjects: cs.CL, cs.AI, cs.IT, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2510.04933
Pdf URL: https://arxiv.org/pdf/2510.04933
Copy Paste: [[2510.04933]] The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models(https://arxiv.org/abs/2510.04933)
Keywords: language model, gpt, llm, hallucination
Abstract: Large Language Models (LLMs) often produce fluent yet factually incorrect statements-a phenomenon known as hallucination-posing serious risks in high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric framework for hallucination detection that analyzes the evolution of hidden-state semantics across transformer layers. Unlike prior methods that rely on multiple sampling passes or external verification sources, LSD operates intrinsically within the model's representational space. Using margin-based contrastive learning, LSD aligns hidden activations with ground-truth embeddings derived from a factual encoder, revealing a distinct separation in semantic trajectories: factual responses preserve stable alignment, while hallucinations exhibit pronounced semantic drift across depth. Evaluated on the TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming SelfCheckGPT and Semantic Entropy baselines while requiring only a single forward pass. This efficiency yields a 5-20x speedup over sampling-based methods without sacrificing precision or interpretability. LSD offers a scalable, model-agnostic mechanism for real-time hallucination monitoring and provides new insights into the geometry of factual consistency within large language models.
摘要：大型语言模型（LLMS）通常会产生流利但事实不正确的陈述 - 一种被称为幻觉的现象，即高风险域中的严重风险。我们提出了层面上的语义动力学（LSD），这是一个用于幻觉检测的几何框架，可分析跨变压器层的隐藏态语义的演变。与依赖多个采样通过或外部验证源的先前方法不同，LSD在模型的表示空间内本质上运行。使用基于边缘的对比度学习，LSD将隐藏的激活与源自事实编码器得出的地面嵌入的隐藏激活对齐，从而揭示了语义轨迹中的独特分离：事实响应保留稳定的比对，而幻觉则表现出明显的语义遍布深度。 LSD对真实性和合成事实障碍数据集进行了评估，其F1得分为0.92，AUROC为0.96，聚类准确度为0.89，胜过自我检查和语义熵基础，同时仅需要一个向前传球。这种效率在基于抽样的方法的情况下产生了5-20倍的加速，而无需牺牲精度或解释性。 LSD提供了一种可扩展的模型无形机制，用于实时幻觉监测，并为大语言模型中事实一致性的几何形状提供了新的见解。

Title: A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

Authors: Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Martha-Lorena Avendaño-Garrido, Graham Ranger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.04945
Pdf URL: https://arxiv.org/pdf/2510.04945
Copy Paste: [[2510.04945]] A First Context-Free Grammar Applied to Nawatl Corpora Augmentation(https://arxiv.org/abs/2510.04945)
Keywords: language model, llm
Abstract: In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the $\pi$-language type, i.e. a language with few digital resources, in which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to increase the corpora available for language model training. We want to show that a grammar enables us significantly to expand a corpus in Nawatl which we call $\pi$-\textsc{yalli}. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.
摘要：在本文中，我们介绍了Nawatl语言的无上下文语法（CFG）。 Nawatl（或Nahuatl）是$ \ pi $语言类型的美洲印第安语，即一种数字资源很少的语言，其中可用于机器学习的语料库实际上是不存在的。这里的目的是生成大量语法正确的人工句子，以增加可用于语言模型培训的语料库。我们要证明，语法使我们能够显着扩展Nawatl中的语料库，我们称之为$ \ pi $ - \ textsc {yalli}。因此，该语料库丰富了，使我们能够训练诸如FastText之类的算法，并在句子级别的语义任务上对其进行评估。初步结果表明，通过使用语法，在某些LLM上可以实现比较改进。但是，观察到，要实现更大的改进，需要更有效地对Nawatl语言进行建模的语法。

Title: Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)

Authors: Om Dobariya, Akhil Kumar
Subjects: cs.CL, cs.AI, cs.LG, cs.NE, stat.ME
Abstract URL: https://arxiv.org/abs/2510.04950
Pdf URL: https://arxiv.org/pdf/2510.04950
Copy Paste: [[2510.04950]] Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)(https://arxiv.org/abs/2510.04950)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The wording of natural language prompts has been shown to influence the performance of large language models (LLMs), yet the role of politeness and tone remains underexplored. In this study, we investigate how varying levels of prompt politeness affect model accuracy on multiple-choice questions. We created a dataset of 50 base questions spanning mathematics, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we evaluated responses across these conditions and applied paired sample t-tests to assess statistical significance. Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation. Our results highlight the importance of studying pragmatic aspects of prompting and raise broader questions about the social dimensions of human-AI interaction.
摘要：自然提示的措辞已被证明会影响大语模型（LLMS）的表现，但礼貌和语气的作用仍然没有得到充实。在这项研究中，我们研究了多种选择问题的迅速礼貌程度如何影响模型的准确性。我们创建了一个涵盖数学，科学和历史的50个基本问题的数据集，每个问题都重写为五个音调：非常有礼貌，礼貌，中立，粗鲁且非常粗鲁，产生了250个独特的提示。使用Chatgpt 4O，我们评估了在这些条件下的响应，并应用了配对的样品t检验以评估统计意义。与期望相反，不礼貌的提示始终超过礼貌，精度从非常有礼貌的提示的80.8％到84.8％的高度提示。这些发现与较早的研究不同，这些研究将其与较差的结果相关联，这表明更新的LLM可能对音调变化有所不同。我们的结果强调了研究务实的方面的重要性，这些方面提示并提出有关人类互动社会层面的更广泛问题。

Title: Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning

Authors: Imran Mansha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05003
Pdf URL: https://arxiv.org/pdf/2510.05003
Copy Paste: [[2510.05003]] Resource-Efficient Fine-Tuning of LLaMA-3.2-3B for Medical Chain-of-Thought Reasoning(https://arxiv.org/abs/2510.05003)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated remarkable reasoning abilities but require significant computational resources for fine-tuning. This paper presents a resource-efficient fine-tuning approach for LLaMA-3.2-3B to enhance medical chain-of-thought reasoning while operating under constrained GPU and memory settings. Using parameter-efficient tuning techniques such as LoRA and QLoRA, we adapt the base model on publicly available medical reasoning datasets. The model achieves improved reasoning coherence and factual accuracy while reducing memory usage by up to 60% compared to standard full fine-tuning. Experimental evaluation demonstrates that lightweight adaptations can retain strong reasoning capability in medical question-answering tasks. This work highlights practical strategies for deploying LLMs in low-resource research environments and provides insights into balancing efficiency and domain specialization for medical AI systems.
摘要：GPT-4和Llama等大型语言模型（LLM）表现出了非凡的推理能力，但需要大量的计算资源进行微调。本文为Llama-3.2-3B提供了一种资源有效的微调方法，以在受约束的GPU和内存设置下运行时增强医疗链的推理。使用参数有效的调整技术，例如Lora和Qlora，我们将基本模型适应公开可用的医疗推理数据集。与标准的完整微调相比，该模型提高了推理的连贯性和事实准确性，同时将记忆使用量最多减少60％。实验评估表明，轻巧的适应性可以保留在医疗提问任务中的强大推理能力。这项工作强调了在低资源研究环境中部署LLM的实用策略，并提供了对医疗AI系统平衡效率和领域专业化的见解。

Title: Imperceptible Jailbreaking against Large Language Models

Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.05025
Pdf URL: https://arxiv.org/pdf/2510.05025
Copy Paste: [[2510.05025]] Imperceptible Jailbreaking against Large Language Models(https://arxiv.org/abs/2510.05025)
Keywords: language model, llm, prompt
Abstract: Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at this https URL.
摘要：对视力方式的越狱攻击通常依赖于不可察觉的对抗扰动，而通常认为对文本方式的攻击需要可见的修改（例如，非语义后缀）。在本文中，我们介绍了不可察觉的越狱，以利用一类称为变异选择器的Unicode字符。通过将无形的变体选择器附加到恶意问题上，越狱提示在视觉上与屏幕上的原始恶意问题相同，而其令牌化则“秘密地”改变了。我们提出了一条搜索管道，以产生这种对抗后缀以引起有害反应。我们的实验表明，我们无法察觉到的越狱在四个结盟的LLM上取得了很高的攻击成功率，并概括以引起注射攻击，所有这些都没有在书面提示中产生任何可见的修改。我们的代码可在此HTTPS URL上找到。

Title: A Set of Quebec-French Corpus of Regional Expressions and Terms

Authors: David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05026
Pdf URL: https://arxiv.org/pdf/2510.05026
Copy Paste: [[2510.05026]] A Set of Quebec-French Corpus of Regional Expressions and Terms(https://arxiv.org/abs/2510.05026)
Keywords: llm
Abstract: The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose two new benchmark datasets for the Quebec dialect of French: QFrCoRE, which contains 4,633 instances of idiomatic phrases, and QFrCoRT, which comprises 171 regional instances of idiomatic words. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 94 LLM demonstrate that our regional idiom benchmarks are a reliable tool for measuring a model's proficiency in a specific dialect.
摘要：理解和方言理解的任务是自然语言处理中完善的基准。在本文中，我们建议将它们结合起来，并使用区域成语作为方言理解的测试。为此，我们为魁北克法语方言：qfrcore提出了两个新的基准数据集，其中包含4,633个惯用短语的实例和QFRCort，其中包括171个惯用词的区域实例。我们解释了如何构建这些语料库，以便可以为其他方言复制我们的方法。我们使用94 LLM进行的实验表明，我们的区域惯用基准是一种可靠的工具，可以测量模型在特定方言中的熟练程度。

Title: Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Authors: Omri Uzan, Asaf Yehudai, Roi pony, Eyal Shnarch, Ariel Gera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05038
Pdf URL: https://arxiv.org/pdf/2510.05038
Copy Paste: [[2510.05038]] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization(https://arxiv.org/abs/2510.05038)
Keywords: language model
Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at this https URL
摘要：多模式编码器已突破了视觉文档检索的边界，将文本查询令牌直接匹配到图像补丁并实现公共基准上的最新性能。最近依靠这种范式的模型大大扩展了其查询和文档表示形式的大小，并在现实世界中的管道中呈现了可部署和可扩展性的障碍。此外，纯粹以视觉为中心的方法可能会受到现代视觉模型仍然表现出的固有方式差距的限制。在这项工作中，我们将这些挑战与混合检索的范式联系起来，调查了轻质密集的文本检索器是否可以增强以视力为中心的模型。依赖于等级或分数的粗粒融合的现有混合方法无法利用每个模型表示空间内的丰富相互作用。为了解决这个问题，我们介绍了一种新型的测试时间优化方法引导查询改进（GQR），该方法使用互补猎犬的分数中的指导来完善主猎犬的查询嵌入。通过对视觉文档检索基准测试的大量实验，我们证明了GQR允许以视觉为中心的模型匹配具有明显更大表示的模型的性能，同时更快地速度更快14倍，并且需要减少54倍的内存。我们的发现表明，GQR有效地推动了帕累托前沿的多模式检索的性能和效率。我们在此HTTPS URL上发布代码

Title: COLE: a Comprehensive Benchmark for French Language Understanding Evaluation

Authors: David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.05046
Pdf URL: https://arxiv.org/pdf/2510.05046
Copy Paste: [[2510.05046]] COLE: a Comprehensive Benchmark for French Language Understanding Evaluation(https://arxiv.org/abs/2510.05046)
Keywords: language model, llm
Abstract: To address the need for a more comprehensive evaluation of French Natural Language Understanding (NLU), we introduce COLE, a new benchmark composed of 23 diverse task covering a broad range of NLU capabilities, including sentiment analysis, paraphrase detection, grammatical judgment, and reasoning, with a particular focus on linguistic phenomena relevant to the French language. We benchmark 94 large language models (LLM), providing an extensive analysis of the current state of French NLU. Our results highlight a significant performance gap between closed- and open-weights models and identify key challenging frontiers for current LLMs, such as zero-shot extractive question-answering (QA), fine-grained word sense disambiguation, and understanding of regional language variations. We release COLE as a public resource to foster further progress in French language modelling.
摘要：为了满足对法国自然语言理解（NLU）进行更全面评估的需求，我们介绍了Cole，Cole是一个新的基准，该基准由23个多样化的任务组成，涵盖了广泛的NLU能力，包括情感分析，释义检测，语法判断，推理，并特别关注与法语语言相关的语言现象。我们基于94个大型语言模型（LLM），对法国NLU的现状进行了广泛的分析。我们的结果突出了闭合和开放权重模型之间的显着性能差距，并确定了当前LLM的关键挑战边界，例如零摄取的提取问题避开（QA），细粒度的单词感官差异以及对区域语言变化的理解。我们将科尔作为公共资源，以促进法语建模的进一步进展。

Title: SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Authors: Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05069
Pdf URL: https://arxiv.org/pdf/2510.05069
Copy Paste: [[2510.05069]] SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs(https://arxiv.org/abs/2510.05069)
Keywords: language model, llm, chain-of-thought
Abstract: Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.
摘要：最近的工作表明，除了通过明确的思想链条进行离散推理外，受自然语言界限的限制，大语言模型（LLMS）也可以在潜在空间中连续推理，从而使每个步骤更丰富的信息，从而提高令牌效率。尽管有希望，潜在推理仍然面临两个挑战，尤其是在无训练的环境中：1）纯粹的潜在推理通过维护多个隐性路径来扩大搜索分布，从而扩散了概率质量，引入噪声，并阻碍了单个高度保存解决方案的融合，从而损害了准确的准确性； 2）即使没有明确的文本，浪费令牌和降解效率的过度思考也持续过。为了解决这些问题，我们介绍了Swireasoning，这是一个针对LLM推理的无培训框架，具有两个关键创新：1）Swireasoning在显式和潜在推理之间动态切换，并在较大的置信度估计的下一句话分布的熵趋势的指导下，以平衡探索和剥削和剥削和促进及时的转化。 2）通过限制思维障碍开关的最大数量，旋转的路缘过度思考并提高了各种问题困难的令牌效率。在广泛使用的数学和STEM基准上，Swireason在不同模型家族和尺度的推理LLM中始终提高平均准确性1.5％-2.8％。此外，在预算有限的情况下，Shireasoning将平均令牌效率提高了56％-79％，随着预算的收紧，增长额较大。

Title: Slm-mux: Orchestrating small language models for reasoning

Authors: Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie, Tushar Krishna, Vijay Janapa Reddi, Yilun Du
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05077
Pdf URL: https://arxiv.org/pdf/2510.05077
Copy Paste: [[2510.05077]] Slm-mux: Orchestrating small language models for reasoning(https://arxiv.org/abs/2510.05077)
Keywords: language model, gpt
Abstract: With the rapid development of language models, the number of small language models (SLMs) has grown significantly. Although they do not achieve state-of-the-art accuracy, they are more efficient and often excel at specific tasks. This raises a natural question: can multiple SLMs be orchestrated into a system where each contributes effectively, achieving higher accuracy than any individual model? Existing orchestration methods have primarily targeted frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To address this gap, we propose a three-stage approach for orchestrating SLMs. First, we introduce SLM-MUX, a multi-model architecture that effectively coordinates multiple SLMs. Building on this, we develop two optimization strategies: (i) a model selection search that identifies the most complementary SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our approach delivers strong results: Compared to existing orchestration methods, our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0% on GSM8K. With just two SLMS, SLM-MUX outperforms Qwen 2.5 72B on GPQA and GSM8K, and matches its performance on MATH. We further provide theoretical analyses to substantiate the advantages of our method. In summary, we demonstrate that SLMs can be effectively orchestrated into more accurate and efficient systems through the proposed approach.
摘要：随着语言模型的快速发展，小语言模型（SLM）的数量显着增长。尽管它们没有达到最先进的准确性，但它们效率更高，并且通常擅长特定任务。这就提出了一个自然的问题：是否可以将多个 SLM 编排到一个系统中，每个 SLM 都能有效地做出贡献，从而实现比任何单个模型更高的准确性？现有的编排方法主要针对前沿模型（例如 GPT-4），并且在应用于 SLM 时表现不佳。为了解决这一差距，我们提出了一种协调 SLM 的三阶段方法。首先，我们介绍SLM-MUX，这是一种有效协调多个SLM的多模型架构。在此基础上，我们开发了两种优化策略：(i) 模型选择搜索，从给定池中识别最具互补性的 SLM，以及 (ii) 针对 SLM-MUX 定制的测试时间扩展。我们的方法取得了强劲的成果：与现有的编排方法相比，我们的方法在 MATH 上实现了高达 13.4% 的改进，在 GPQA 上实现了 8.8% 的改进，在 GSM8K 上实现了 7.0% 的改进。只需两个 SLMS，SLM-MUX 在 GPQA 和 GSM8K 上的性能就优于 Qwen 2.5 72B，并且在 MATH 上的性能也与其相当。我们进一步提供理论分析来证实我们方法的优点。总之，我们证明了通过所提出的方法，SLM 可以有效地编排成更准确、更高效的系统。

Title: TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Authors: Janos Perczel, Jin Chow, Dorottya Demszky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05087
Pdf URL: https://arxiv.org/pdf/2510.05087
Copy Paste: [[2510.05087]] TeachLM: Post-Training LLMs for Education Using Authentic Learning Data(https://arxiv.org/abs/2510.05087)
Keywords: language model, llm, prompt
Abstract: The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap we introduce TeachLM - an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance - doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and greater personalization of instruction.
摘要：生成式人工智能彻底改变教育的承诺受到大型语言模型（LLM）的教学限制的限制。一个主要问题是无法获得反映学生实际学习情况的高质量培训数据。提示工程已成为一种权宜之计，但提示以基于规则的自然语言编码复杂的教学策略的能力本质上是有限的。为了解决这一差距，我们引入了 TeachLM——一种通过对最先进模型进行参数高效微调来优化教学的法学硕士。 TeachLM 在由 Polygence 维护的由 100,000 小时一对一纵向学生与导师互动组成的数据集上进行训练，该数据集经过了严格的匿名化过程以保护隐私。我们使用参数高效的微调来开发真实的学生模型，从而能够生成高保真合成的学生与导师对话。在此能力的基础上，我们提出了一种新颖的多回合评估协议，该协议利用合成对话生成来提供对法学硕士对话能力的快速、可扩展和可重复的评估。我们的评估表明，对真实学习数据进行微调可以显着提高对话和教学表现 - 将学生的谈话时间加倍，改善提问方式，将对话次数增加 50%，并提高教学的个性化程度。

Title: Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Authors: Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.05090
Pdf URL: https://arxiv.org/pdf/2510.05090
Copy Paste: [[2510.05090]] Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models(https://arxiv.org/abs/2510.05090)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.
摘要：扩散大语言模型（DLLM）最近已成为自回归（AR）模型的有前途的替代方案，提供了诸如加速并行解码和双向上下文模型之类的优势。但是，离散DLLM中的香草解码策略受到关键限制：一旦接受了令牌，就无法在随后的步骤中对其进行修改。结果，早期错误持续存在遍及迭代，损害了中间的预测和最终产出质量。为了解决这个问题，我们提出了一种公差（令牌级交叉验证细化），这是一种无训练的解码策略，利用预测令牌之间的交叉验证。与现有的遵循单个渐进式卸载过程的方法不同，Tolerator引入了两个阶段的过程：（i）序列填充和（ii）迭代改进，通过重新启动和解码代币子集的同时将其剩余的上下文视为上下文。此设计使得先前接受的令牌在必要时可以重新考虑和纠正，从而导致更可靠的扩散解码输出。我们在五个标准基准上评估公差，涵盖语言理解，代码生成和数学。实验表明，我们的方法在相同的计算预算下实现了对基准的一致改进。这些发现表明，解码算法对于意识到扩散大语言模型的全部潜力至关重要。代码和数据公开可用。