2025-09-22

Title: Synthetic bootstrapped pretraining

Authors: Zitong Yang, Aonan Zhang, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15248
Pdf URL: https://arxiv.org/pdf/2509.15248
Copy Paste: [[2509.15248]] Synthetic bootstrapped pretraining(https://arxiv.org/abs/2509.15248)
Keywords: language model
Abstract: We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretrain a 3B-parameter model on up to 1T tokens from scratch. We find SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases -- SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.

Title: Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Authors: Tandin Wangchuk, Tad Gonsalves
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15255
Pdf URL: https://arxiv.org/pdf/2509.15255
Copy Paste: [[2509.15255]] Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha(https://arxiv.org/abs/2509.15255)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model's understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan's national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.
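The abstract names its metrics without defining them. As a minimal sketch, assuming the definitions commonly used in tokenizer studies (fertility as average subword pieces per word, continued words as the share of words split into two or more pieces, and normalized sequence length as token count relative to a baseline tokenizer), they could be computed as below. The function names, the toy tokenizer, and the pre-segmented word list are all hypothetical; real Dzongkha text would need its own word segmentation, which this sketch does not attempt.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def tokenizer_metrics(words: List[str], tokenize: Tokenizer) -> Dict[str, float]:
    """Subword fertility and proportion of continued words (assumed definitions)."""
    if not words:
        raise ValueError("need at least one word")
    pieces_per_word = [len(tokenize(w)) for w in words]
    # Fertility: average number of subword pieces produced per word.
    fertility = sum(pieces_per_word) / len(words)
    # Continued words: fraction of words split into two or more pieces.
    continued = sum(1 for n in pieces_per_word if n > 1) / len(words)
    return {"fertility": fertility, "continued": continued}


def normalized_sequence_length(text: str, tokenize: Tokenizer, baseline: Tokenizer) -> float:
    """Token count of `tokenize` divided by the token count of a baseline tokenizer."""
    return len(tokenize(text)) / len(baseline(text))


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical tokenizer: cut the input into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


words = ["to", "token", "tokenizer"]          # assumed to be pre-segmented words
metrics = tokenizer_metrics(words, toy_tokenize)
nsl = normalized_sequence_length("tokenizer", toy_tokenize, list)  # baseline: characters
print(metrics, nsl)
```

Lower fertility and a lower proportion of continued words generally indicate that a tokenizer fragments the language less.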

Title: Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Authors: Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15260
Pdf URL: https://arxiv.org/pdf/2509.15260
Copy Paste: [[2509.15260]] Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages(https://arxiv.org/abs/2509.15260)
Keywords: language model, llm
Abstract: The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. (Link to the dataset: this https URL.) Disclaimer: This paper contains sensitive content that may be disturbing to some readers.

Title: PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms

Authors: Charlott Jakob, David Harbecke, Patrick Parschan, Pia Wenzel Neves, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15335
Pdf URL: https://arxiv.org/pdf/2509.15335
Copy Paste: [[2509.15335]] PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms(https://arxiv.org/abs/2509.15335)
Keywords: language model, llm, prompt
Abstract: Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts.

Title: Quantifying Self-Awareness of Knowledge in Large Language Models

Authors: Yeongbin Seo, Dongha Lee, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15339
Pdf URL: https://arxiv.org/pdf/2509.15339
Copy Paste: [[2509.15339]] Quantifying Self-Awareness of Knowledge in Large Language Models(https://arxiv.org/abs/2509.15339)
Keywords: language model, llm, hallucination
Abstract: Hallucination prediction in large language models (LLMs) is often interpreted as a sign of self-awareness. However, we argue that such performance can arise from question-side shortcuts rather than true model-side introspection. To disentangle these factors, we propose the Approximate Question-side Effect (AQE), which quantifies the contribution of question-awareness. Our analysis across multiple datasets reveals that much of the reported success stems from exploiting superficial patterns in questions. We further introduce SCAO (Semantic Compression by Answering in One word), a method that enhances the use of model-side signals. Experiments show that SCAO achieves strong and consistent performance, particularly in settings with reduced question-side cues, highlighting its effectiveness in fostering genuine self-awareness in LLMs.

Title: Real, Fake, or Manipulated? Detecting Machine-Influenced Text

Authors: Yitong Wang, Zhongping Zhang, Margherita Piana, Zheng Zhou, Peter Gerstoft, Bryan A. Plummer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15350
Pdf URL: https://arxiv.org/pdf/2509.15350
Copy Paste: [[2509.15350]] Real, Fake, or Manipulated? Detecting Machine-Influenced Text(https://arxiv.org/abs/2509.15350)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can be used to write or modify documents, presenting a challenge for understanding the intent behind their use. For example, benign uses may involve using an LLM on a human-written document to improve its grammar or to translate it into another language. However, a document entirely produced by an LLM may be more likely to be used to spread misinformation than simple translation (e.g., from use by malicious actors or simply by hallucinating). Prior works in Machine Generated Text (MGT) detection mostly focus on simply identifying whether a document was human or machine written, ignoring these fine-grained uses. In this paper, we introduce a HiErarchical, length-RObust machine-influenced text detector (HERO), which learns to separate text samples of varying lengths from four primary types: human-written, machine-generated, machine-polished, and machine-translated. HERO accomplishes this by combining predictions from length-specialist models that have been trained with Subcategory Guidance. Specifically, for categories that are easily confused (e.g., different source languages), our Subcategory Guidance module encourages separation of the fine-grained categories, boosting performance. Extensive experiments across five LLMs and six domains demonstrate the benefits of HERO, which outperforms the state-of-the-art by 2.5-3 mAP on average.

Title: Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing

Authors: Zichen Wu, Hsiu-Yuan Huang, Yunfang Wu
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2509.15361
Pdf URL: https://arxiv.org/pdf/2509.15361
Copy Paste: [[2509.15361]] Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing(https://arxiv.org/abs/2509.15361)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have shown substantial capabilities in integrating visual and textual information, yet frequently rely on spurious correlations, undermining their robustness and generalization in complex multimodal reasoning tasks. This paper addresses the critical challenge of superficial correlation bias in MLLMs through a novel causal mediation-based debiasing framework. Specifically, we distinguish core semantics from spurious textual and visual contexts via counterfactual examples to activate training-stage debiasing, and employ a Mixture-of-Experts (MoE) architecture with dynamic routing to selectively engage modality-specific debiasing experts. Empirical evaluation on multimodal sarcasm detection and sentiment analysis tasks demonstrates that our framework significantly surpasses unimodal debiasing strategies and existing state-of-the-art models.

Title: Speech Language Models for Under-Represented Languages: Insights from Wolof

Authors: Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.15362
Pdf URL: https://arxiv.org/pdf/2509.15362
Copy Paste: [[2509.15362]] Speech Language Models for Under-Represented Languages: Insights from Wolof(https://arxiv.org/abs/2509.15362)
Keywords: language model, llm, chain-of-thought
Abstract: We present our journey in training a speech language model for Wolof, an underrepresented language spoken in West Africa, and share key insights. We first emphasize the importance of collecting large-scale, spontaneous, high-quality speech data, and show that continued pretraining of HuBERT on this dataset outperforms both the base model and African-centric models on ASR. We then integrate this speech encoder into a Wolof LLM to train the first Speech LLM for this language, extending its capabilities to tasks such as speech translation. Furthermore, we explore training the Speech LLM to perform multi-step Chain-of-Thought before transcribing or translating. Our results show that the Speech LLM not only improves speech recognition but also performs well in speech translation. The models and the code will be openly shared.

Title: Frustratingly Easy Data Augmentation for Low-Resource ASR

Authors: Katsumi Ibaraki, David Chiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15373
Pdf URL: https://arxiv.org/pdf/2509.15373
Copy Paste: [[2509.15373]] Frustratingly Easy Data Augmentation for Low-Resource ASR(https://arxiv.org/abs/2509.15373)
Keywords: llm
Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text--using gloss-based replacement, random replacement, or an LLM-based approach--and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
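The headline result above is stated as an absolute WER reduction, which is easy to confuse with a relative one. The sketch below assumes the standard definition of word error rate (word-level edit distance divided by reference length) and uses made-up sentences and made-up before/after values, not figures from the paper; the function names are hypothetical.

```python
from typing import Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute, relative) reduction between two WER values."""
    return before - after, (before - after) / before


# Illustrative values only.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
absolute, relative = wer_reduction(before=0.50, after=0.25)
print(wer, absolute, relative)
```

With these illustrative values, a drop from 0.50 to 0.25 is a 25-point absolute reduction but a 50% relative reduction, which is why the abstract's "absolute" qualifier matters.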

Title: Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering

Authors: Yangyi Li, Mengdi Huai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15403
Pdf URL: https://arxiv.org/pdf/2509.15403
Copy Paste: [[2509.15403]] Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering(https://arxiv.org/abs/2509.15403)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown strong capabilities, enabling concise, context-aware answers in question answering (QA) tasks. The lack of transparency in complex LLMs has inspired extensive research aimed at developing methods to explain large language model behaviors. Among existing explanation methods, natural language explanations stand out due to their ability to explain LLMs in a self-explanatory manner and enable the understanding of model behaviors even when the models are closed-source. However, despite these promising advancements, there is no existing work studying how to provide valid uncertainty guarantees for these generated natural language explanations. Such uncertainty quantification is critical in understanding the confidence behind these explanations. Notably, generating valid uncertainty estimates for natural language explanations is particularly challenging due to the auto-regressive generation process of LLMs and the presence of noise in medical inquiries. To bridge this gap, we first propose a novel uncertainty estimation framework for these generated natural language explanations, which provides valid uncertainty guarantees in a post-hoc and model-agnostic manner. We also design a novel robust uncertainty estimation method that maintains valid uncertainty guarantees even under noise. Extensive experiments on QA tasks demonstrate the desired performance of our methods.

Title: PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting

Authors: Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15447
Pdf URL: https://arxiv.org/pdf/2509.15447
Copy Paste: [[2509.15447]] PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting(https://arxiv.org/abs/2509.15447)
Keywords: language model, llm
Abstract: Generative AI applications commonly leverage user personas as a steering mechanism for synthetic data generation, but reliance on natural language representations forces models to make unintended inferences about which attributes to emphasize, limiting precise control over outputs. We introduce PILOT (Psychological and Linguistic Output Targeting), a two-phase framework for steering large language models with structured psycholinguistic profiles. In Phase 1, PILOT translates natural language persona descriptions into multidimensional profiles with normalized scores across linguistic and psychological dimensions. In Phase 2, these profiles guide generation along measurable axes of variation. We evaluate PILOT across three state-of-the-art LLMs (Mistral Large 2, Deepseek-R1, LLaMA 3.3 70B) using 25 synthetic personas under three conditions: Natural-language Persona Steering (NPS), Schema-Based Steering (SBS), and Hybrid Persona-Schema Steering (HPS). Results demonstrate that schema-based approaches significantly reduce artificial-sounding persona repetition while improving output coherence, with silhouette scores increasing from 0.098 to 0.237 and topic purity from 0.773 to 0.957. Our analysis reveals a fundamental trade-off: SBS produces more concise outputs with higher topical consistency, while NPS offers greater lexical diversity but reduced predictability. HPS achieves a balance between these extremes, maintaining output variety while preserving structural consistency. Expert linguistic evaluation confirms that PILOT maintains high response quality across all conditions, with no statistically significant differences between steering approaches.

Title: Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Authors: Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.15476
Pdf URL: https://arxiv.org/pdf/2509.15476
Copy Paste: [[2509.15476]] Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding(https://arxiv.org/abs/2509.15476)
Keywords: language model, llm
Abstract: Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.

Title: Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Authors: Madison Van Doren, Casey Ford, Emily Dix
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15478
Pdf URL: https://arxiv.org/pdf/2509.15478
Copy Paste: [[2509.15478]] Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models(https://arxiv.org/abs/2509.15478)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.

Title: LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference

Authors: Hantao Yang, Hong Xie, Defu Lian, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15515
Pdf URL: https://arxiv.org/pdf/2509.15515
Copy Paste: [[2509.15515]] LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference(https://arxiv.org/abs/2509.15515)
Keywords: llm
Abstract: This paper revisits the LLM cache bandit problem, with a special focus on addressing query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In theoretical analysis, we prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. We also provide a problem-dependent bound, which was absent in previous works. Experiments on real-world data show that our algorithm reduces the total cost by approximately 12%.
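To illustrate why heterogeneous query sizes turn cache selection into a knapsack problem, here is a minimal sketch using the textbook 0/1 knapsack dynamic program. It is not the paper's algorithm: it assumes the value of caching each query is already known, whereas the bandit setting must estimate it online. The names `select_cache`, `sizes`, `values`, and `capacity` are hypothetical.

```python
from typing import List, Tuple


def select_cache(sizes: List[int], values: List[float], capacity: int) -> Tuple[float, List[int]]:
    """Pick the queries to cache that maximize total value within the cache capacity.

    sizes[i]  : cache space taken by query i (positive integer, assumed known)
    values[i] : expected cost saved by caching query i (assumed known here)
    capacity  : total cache size
    """
    n = len(sizes)
    # best[i][c] = best total value using the first i queries with capacity c.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, value = sizes[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip query i-1
            if size <= c and best[i - 1][c - size] + value > best[i][c]:
                best[i][c] = best[i - 1][c - size] + value  # option 2: cache it
    # Walk back through the table to recover which queries were chosen.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    return best[n][capacity], sorted(chosen)


# Illustrative numbers only: three queries competing for a cache of size 6.
total_value, cached = select_cache(sizes=[3, 4, 2], values=[4.0, 5.0, 3.0], capacity=6)
print(total_value, cached)
```

With uniform sizes the same problem reduces to keeping the highest-value queries, which is why the abstract singles out heterogeneity as the source of the combinatorial structure.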

Title: How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages

Authors: Siyang Wu, Zhewei Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15518
Pdf URL: https://arxiv.org/pdf/2509.15518
Copy Paste: [[2509.15518]] How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages(https://arxiv.org/abs/2509.15518)
Keywords: language model, gpt, llm, agent
Abstract: Slang is a commonly used type of informal language that poses a daunting challenge to NLP systems. Recent advances in large language models (LLMs), however, have made the problem more approachable. While LLM agents are becoming more widely applied to intermediary tasks such as slang detection and slang interpretation, their generalizability and reliability are heavily dependent on whether these models have captured structural knowledge about slang that aligns well with human-attested slang usages. To answer this question, we contribute a systematic comparison between human and machine-generated slang usages. Our evaluative framework focuses on three core aspects: 1) Characteristics of the usages that reflect systematic biases in how machines perceive slang, 2) Creativity reflected by both lexical coinages and word reuses employed by the slang usages, and 3) Informativeness of the slang usages when used as gold-standard examples for model distillation. By comparing human-attested slang usages from the Online Slang Dictionary (OSD) and slang generated by GPT-4o and Llama-3, we find significant biases in how LLMs perceive slang. Our results suggest that while LLMs have captured significant knowledge about the creative aspects of slang, such knowledge does not align with humans sufficiently to enable LLMs for extrapolative tasks such as linguistic analyses.

Title: A method for improving multilingual quality and diversity of instruction fine-tuning datasets

Authors: Chunguang Zhao, Yilun Liu, Pufan Zeng, Yuanchang Luo, Shimin Tao, Minggui He, Weibin Meng, Song Xu, Ziang Chen, Chen Liu, Hongxia Ma, Li Zhang, Boxing Chen, Daimeng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15549
Pdf URL: https://arxiv.org/pdf/2509.15549
Copy Paste: [[2509.15549]] A method for improving multilingual quality and diversity of instruction fine-tuning datasets(https://arxiv.org/abs/2509.15549)
Keywords: language model, llm
Abstract: Multilingual Instruction Fine-Tuning (IFT) is essential for enabling large language models (LLMs) to generalize effectively across diverse linguistic and cultural contexts. However, the scarcity of high-quality multilingual training data and of corresponding methods for building it remains a critical bottleneck. While data selection has shown promise in English settings, existing methods often fail to generalize across languages due to reliance on simplistic heuristics or language-specific assumptions. In this work, we introduce Multilingual Data Quality and Diversity (M-DaQ), a novel method for improving LLMs' multilinguality by selecting high-quality and semantically diverse multilingual IFT samples. We further conduct the first systematic investigation of the Superficial Alignment Hypothesis (SAH) in a multilingual setting. Empirical results across 18 languages demonstrate that models fine-tuned with the M-DaQ method achieve significant performance gains over vanilla baselines, with win rates over 60%. Human evaluations further validate these gains, highlighting an increase in cultural points in the responses. We release the M-DaQ code to support future research.

Title: DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm

Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15550
Pdf URL: https://arxiv.org/pdf/2509.15550
Copy Paste: [[2509.15550]] DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm(https://arxiv.org/abs/2509.15550)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a zero-shot detection method for distinguishing AI-generated and human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets.
摘要：大型语言模型（LLM）的快速发展使AI生成的文本和人工写的文本之间的界限模糊了。这一进步带来了社会风险，例如错误信息，作者歧义和知识产权问题，强调了迫切需要对可靠的AI生成的文本检测方法。但是，生成语言建模的最新进展导致人体编写和AI生成的文本的特征分布之间的重叠，分类界限模糊，并使准确的检测越来越具有挑战性。为了应对上述挑战，我们提出了一个受DNA启发的观点，利用基于维修的过程直接和解释地捕获了人写和AI生成的文本之间的内在差异。从这个角度来看，我们介绍了DNA-Detectllm，这是一种零射击检测方法，可区分AI生成和人文写的文本。该方法为每个输入构建一个理想的AI生成序列，迭代地修复非最佳令牌，并将累积修复工作量化为可解释的检测信号。经验评估表明，我们的方法达到了最先进的检测性能，并对各种对抗性攻击和输入长度表现出强大的鲁棒性。具体而言，在多个公共基准数据集中，DNA-detectllm在AUROC中实现了5.55％的相对提高，而F1得分为2.08％。

Title: Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining

Authors: Ping Guo, Yubing Ren, Binbin Liu, Fengze Liu, Haobin Lin, Yifan Zhang, Bingni Zhang, Taifeng Wang, Yin Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15556
Pdf URL: https://arxiv.org/pdf/2509.15556
Copy Paste: [[2509.15556]] Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining(https://arxiv.org/abs/2509.15556)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become integral to a wide range of applications worldwide, driving an unprecedented global demand for effective multilingual capabilities. Central to achieving robust multilingual performance is the strategic allocation of language proportions within training corpora. However, determining optimal language ratios is highly challenging due to intricate cross-lingual interactions and sensitivity to dataset scale. This paper introduces Climb (Cross-Lingual Interaction-aware Multilingual Balancing), a novel framework designed to systematically optimize multilingual data allocation. At its core, Climb introduces a cross-lingual interaction-aware language ratio, explicitly quantifying each language's effective allocation by capturing inter-language dependencies. Leveraging this ratio, Climb proposes a principled two-step optimization procedure--first equalizing marginal benefits across languages, then maximizing the magnitude of the resulting language allocation vectors--significantly simplifying the inherently complex multilingual optimization problem. Extensive experiments confirm that Climb can accurately measure cross-lingual interactions across various multilingual settings. LLMs trained with Climb-derived proportions consistently achieve state-of-the-art multilingual performance, even achieving competitive performance with open-sourced LLMs trained with more tokens.
摘要：大型语言模型（LLM）已成为全球广泛应用程序不可或缺的一部分，推动了对有效多语言能力的前所未有的全球需求。实现强大的多语言表现的核心是培训语料库中语言比例的战略分配。但是，由于复杂的跨语性相互作用和对数据集量表的敏感性，确定最佳语言比率是高度挑战性的。本文介绍了攀登（跨语性互动感知的多语言平衡），这是一个新颖的框架，旨在系统地优化多语言数据分配。 Climb以此为核心引入了跨语性的互动语言比率，通过捕获语言依赖性来明确量化每种语言的有效分配。利用该比率，攀登提出了一个原则上的两步优化程序 - 首先将跨语言的边际收益均衡，然后最大化所得的语言分配向量的大小 - 很清楚地简化了固有的复杂多语言优化问题。广泛的实验证实，攀登可以准确地测量各种多语言设置的跨语性相互作用。接受攀登比例训练的LLM始终可以实现最先进的多语言表现，甚至通过接受更多代币培训的开源LLMS实现竞争性能。

Title: LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Authors: Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15568
Pdf URL: https://arxiv.org/pdf/2509.15568
Copy Paste: [[2509.15568]] LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs(https://arxiv.org/abs/2509.15568)
Keywords: language model, llm, agent
Abstract: High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.
摘要：高质量的长篇小说数据对于培训能够处理大量文档的大语言模型（LLM）至关重要，但是使用基于相关的聚合面对计算效率的挑战，现有的合成方法至关重要。我们介绍了Litelong，这是一种通过结构化主题组织和多代理辩论综合长篇小说数据的资源有效方法。我们的方法利用BISAC书籍分类系统提供了一个全面的等级主题组织，然后采用具有多个LLM的辩论机制来在此结构中生成多样化的高质量主题。对于每个主题，我们使用轻型BM25检索来获取相关文档，并将它们置于128K token培训样本中。头盔和标尺基准的实验表明，Litelong可以实现竞争性的长期性能，并且可以与其他长依赖性增强方法无缝集成。 Litelong通过降低计算和数据工程成本，使高质量的长篇小说数据合成更容易访问，从而促进了长篇文化语言培训的进一步研究。
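Note: the retrieve-and-concatenate step described in the LiteLong abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `bm25_scores` and `build_sample` are hypothetical, whitespace-split words stand in for model tokens, and the `k1` and `b` values are common BM25 defaults rather than settings taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_sample(topic, corpus, max_tokens):
    """Concatenate the best-matching documents without exceeding max_tokens."""
    docs = [text.lower().split() for text in corpus]
    scores = bm25_scores(topic.lower().split(), docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    sample = []
    for i in ranked:
        if scores[i] <= 0:
            break  # remaining documents share no term with the topic
        if len(sample) + len(docs[i]) > max_tokens:
            continue  # skip documents that would overflow the budget
        sample.extend(docs[i])
    return sample


corpus = [
    "the cat sat on the mat",
    "quantum physics of black holes",
    "cats and dogs are pets",
    "black holes bend light",
]
sample = build_sample("black holes", corpus, max_tokens=9)
```

In the setting the abstract describes, the budget would be 128K model tokens and the topics would come from the BISAC-based hierarchy; both are replaced by toy values here.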

Title: Relevance to Utility: Process-Supervised Rewrite for RAG

Authors: Jaeyoung Kim, Jongho Kim, Seung-won Hwang, Seoho Song, Young-In Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15577
Pdf URL: https://arxiv.org/pdf/2509.15577
Copy Paste: [[2509.15577]] Relevance to Utility: Process-Supervised Rewrite for RAG(https://arxiv.org/abs/2509.15577)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation systems often suffer from a gap between optimizing retrieval relevance and generative utility: retrieved documents may be topically relevant but still lack the content needed for effective reasoning during generation. While existing "bridge" modules attempt to rewrite the retrieved text for better generation, we show how they fail to capture true document utility. In this work, we propose R2U, with a key distinction of directly optimizing to maximize the probability of generating a correct answer through process supervision. As such direct observation is expensive, we also propose approximating an efficient distillation pipeline by scaling the supervision from LLMs, which helps the smaller rewriter model generalize better. We evaluate our method across multiple open-domain question-answering benchmarks. The empirical results demonstrate consistent improvements over strong bridging baselines.
摘要：检索增强的生成系统通常会遇到优化检索相关性和生成效用之间的差距：检索文档可能是局部相关的，但仍然缺乏在发电期间有效推理所需的内容。当现有的“桥梁”模块试图重写所检索的文本以获得更好的生成时，我们表明了它们如何无法捕获真正的文档实用程序。在这项工作中，我们提出了R2U，并具有直接优化的关键区别，以最大程度地提高通过过程监督生成正确答案的概率。由于这种直接观察很昂贵，我们还建议通过扩展LLM的监督来近似有效的蒸馏管道，这有助于较小的重写模型更好地推广。我们在多个开放域提问基准测试中评估了我们的方法。经验结果表明，对强桥基线的一致改善。

Title: DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Authors: Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15587
Pdf URL: https://arxiv.org/pdf/2509.15587
Copy Paste: [[2509.15587]] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models(https://arxiv.org/abs/2509.15587)
Keywords: language model, llm
Abstract: Logical reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations of the logical reasoning skill. Meanwhile, existing logical reasoning benchmarks are limited in language diversity and their distributions deviate from the distribution of an ideal logical reasoning benchmark, which may lead to biased evaluation results. This paper therefore proposes a new classical logic benchmark, DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.
摘要：自然语言中的逻辑推理已被认为是大语模型（LLM）人类智力的重要衡量标准。流行的基准可能会纠缠多种推理技能，从而对逻辑推理技能提供不忠的评估。同时，现有的逻辑推理基准在语言多样性上受到限制，其分布偏离了理想逻辑推理基准的分布，这可能会导致评估结果有偏见。因此，本文提出了一种新的古典逻辑基准Divlogiceval，由以违反直觉的方式组成的自然句子组成。为了确保更可靠的评估，我们还引入了一种新的评估指标，以减轻LLMS固有的偏见和随机性的影响。通过实验，我们证明了需要在多大程度上回答Divlogiceval中的问题的程度，并比较了进行逻辑推理时不同流行的LLM的性能。

Title: SciEvent: Benchmarking Multi-domain Scientific Event Extraction

Authors: Bofu Dong, Pritesh Shah, Sumedh Sonawane, Tiyasha Banerjee, Erin Brady, Xinya Du, Ming Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15620
Pdf URL: https://arxiv.org/pdf/2509.15620
Copy Paste: [[2509.15620]] SciEvent: Benchmarking Multi-domain Scientific Event Extraction(https://arxiv.org/abs/2509.15620)
Keywords: language model, llm
Abstract: Scientific information extraction (SciIE) has primarily relied on entity-relation extraction in narrow domains, limiting its applicability to interdisciplinary research and struggling to capture the necessary context of scientific information, often resulting in fragmented or conflicting statements. In this paper, we introduce SciEvent, a novel multi-domain benchmark of scientific abstracts annotated via a unified event extraction (EE) schema designed to enable structured and context-aware understanding of scientific content. It includes 500 abstracts across five research domains, with manual annotations of event segments, triggers, and fine-grained arguments. We define SciIE as a multi-stage EE pipeline: (1) segmenting abstracts into core scientific activities--Background, Method, Result, and Conclusion; and (2) extracting the corresponding triggers and arguments. Experiments with fine-tuned EE models, large language models (LLMs), and human annotators reveal a performance gap, with current models struggling in domains such as sociology and humanities. SciEvent serves as a challenging benchmark and a step toward generalizable, multi-domain SciIE.
摘要：科学信息提取（SCIIE）主要依赖于狭窄领域中的实体关系提取，从而限制了其适用于跨学科研究的适用性，并努力捕获科学信息的必要背景，通常会导致陈述或冲突的陈述。在本文中，我们介绍了Scievent，这是一种通过统一事件提取（EE）架构注释的科学摘要的新型多域基准，旨在实现对科学内容的结构化和上下文感知的理解。它包括五个研究领域的500篇摘要，并提供了事件段，触发器和细粒度论点的手动注释。我们将科学定义为多阶段的EE管道：（1）将摘要分为核心科学活动 - 背景，方法，结果和结论；（2）提取相应的触发器和参数。通过微调的EE模型，大语言模型（LLM）和人类注释者的实验揭示了性能差距，当前模型在社会学和人文学科等领域中挣扎。 Scivent是一个具有挑战性的基准，也是迈向可概括的多域科学的一步。

Title: Concept Unlearning in Large Language Models via Self-Constructed Knowledge Triplets

Authors: Tomoya Yamashita, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara, Tomoharu Iwata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15621
Pdf URL: https://arxiv.org/pdf/2509.15621
Copy Paste: [[2509.15621]] Concept Unlearning in Large Language Models via Self-Constructed Knowledge Triplets(https://arxiv.org/abs/2509.15621)
Keywords: language model, llm, prompt
Abstract: Machine Unlearning (MU) has recently attracted considerable attention as a solution to privacy and copyright issues in large language models (LLMs). Existing MU methods aim to remove specific target sentences from an LLM while minimizing damage to unrelated knowledge. However, these approaches require explicit target sentences and do not support removing broader concepts, such as persons or events. To address this limitation, we introduce Concept Unlearning (CU) as a new requirement for LLM unlearning. We leverage knowledge graphs to represent the LLM's internal knowledge and define CU as removing the forgetting target nodes and associated edges. This graph-based formulation enables a more intuitive unlearning and facilitates the design of more effective methods. We propose a novel method that prompts the LLM to generate knowledge triplets and explanatory sentences about the forgetting target and applies the unlearning process to these representations. Our approach enables more precise and comprehensive concept removal by aligning the unlearning process with the LLM's internal knowledge representations. Experiments on real-world and synthetic datasets demonstrate that our method effectively achieves concept-level unlearning while preserving unrelated knowledge.
摘要：Machine Unrearning（MU）最近引起了广泛的关注，以解决大语模型（LLMS）中的隐私和版权问题的解决方案。现有的MU方法旨在从LLM中删除特定的目标句子，同时最大程度地减少对无关知识的损害。但是，这些方法需要明确的目标句子，并且不支持删除更广泛的概念，例如人或事件。为了解决此限制，我们将概念（CU）介绍为LLM学习的新要求。我们利用知识图来表示LLM的内部知识，并将CU定义为删除遗忘目标节点和相关边缘。这种基于图的公式可实现更直观的学习，并促进了更有效方法的设计。我们提出了一种新颖的方法，该方法促使LLM生成有关遗忘目标的知识三胞胎和解释性句子，并将未学习过程应用于这些表示形式。我们的方法通过将学习过程与LLM的内部知识表示形式保持一致，从而实现了更精确和全面的概念。对现实世界和合成数据集的实验表明，我们的方法在保留无关的知识的同时有效地实现了概念级别的学习。

Title: Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15631
Pdf URL: https://arxiv.org/pdf/2509.15631
Copy Paste: [[2509.15631]] Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models(https://arxiv.org/abs/2509.15631)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed across various applications, privacy and copyright concerns have heightened the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (e.g., gradient ascent), which reduces the probability of generating such outputs. While such suppression-based approaches can control model outputs, they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. Moreover, such suppression-based methods often suffer from model collapse. To address these issues, we propose a novel unlearning method that directly intervenes in the model's internal activations. In our formulation, forgetting is defined as a state in which the activation of a forgotten target is indistinguishable from that of "unknown" entities. Our method introduces an unlearning objective that modifies the activation of the target entity away from those of known entities and toward those of unknown entities in a sparse autoencoder latent space. By aligning the target's internal activation with those of unknown entities, we shift the model's recognition of the target entity from "known" to "unknown", achieving genuine forgetting while avoiding over-suppression and model collapse. Empirically, we show that our method effectively aligns the internal activations of the forgotten target, a result that the suppression-based approaches do not reliably achieve. Additionally, our method effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to the non-target knowledge.
摘要：由于大型语言模型（LLM）越来越多地在各种应用程序中部署，因此隐私和版权问题加剧了对更有效的LLM学习技术的需求。许多现有的未学习方法旨在通过额外的培训（例如梯度上升）来抑制不良输出，从而降低了产生此类输出的可能性。尽管这种基于抑制的方法可以控制模型输出，但它们可能无法消除模型内部激活中嵌入的基本知识。静音响应与忘记它不同。此外，这种基于抑制的方法通常会遭受模型塌陷。为了解决这些问题，我们提出了一种新颖的学习方法，该方法直接介入模型的内部激活中。在我们的表述中，遗忘被定义为一种状态，在这种状态下，被遗忘的目标的激活与``未知''实体的激活都无法区分。我们的方法介绍了一个未学习的目标，该目标将目标实体的激活远离已知实体的激活以及稀疏自动编码器潜在空间中未知实体的激活。通过将目标的内部激活与未知实体的激活保持一致，我们将模型对目标实体的识别从``已知''转移到``未知''，在避免过度抑制和模型崩溃的同时实现了真正的遗忘。从经验上讲，我们表明我们的方法有效地使被遗忘的目标的内部激活保持一致，这一结果是基于抑制的方法无法可靠地实现。此外，我们的方法有效地减少了模型在提问任务中对目标知识的回忆，而不会对非目标知识的重大损害。

Title: Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation

Authors: Nhu Vo, Nu-Uyen-Phuong Le, Dung D. Le, Massimo Piccardi, Wray Buntine
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15640
Pdf URL: https://arxiv.org/pdf/2509.15640
Copy Paste: [[2509.15640]] Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation(https://arxiv.org/abs/2509.15640)
Keywords: llm, prompt
Abstract: Medical English-Vietnamese machine translation (En-Vi MT) is essential for healthcare access and communication in Vietnam, yet Vietnamese remains a low-resource and under-studied language. We systematically evaluate prompting strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset, comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict, an English-Vietnamese medical lexicon. Results show that model scale is the primary driver of performance: larger LLMs achieve strong zero-shot results, while few-shot prompting yields only marginal improvements. In contrast, terminology-aware cues and embedding-based example retrieval consistently improve domain-specific translation. These findings underscore both the promise and the current limitations of multilingual LLMs for medical En-Vi MT.
摘要：英语医学 - 越南机器翻译（EN-VI MT）对于越南的医疗保健访问和沟通至关重要，但越南仍然是一种低资源和研究不足的语言。我们系统地评估了MEDEV数据集上六个多语言LLM（0.5b-9b参数）的提示策略，比较了零摄，很少弹奏和字典的提示与英语 - 越南医学词典Meddict进行了比较。结果表明，模型量表是性能的主要驱动力：较大的LLMS获得了强劲的零击结果，而很少的射击提示只能产生边际改进。相比之下，术语感知的提示和基于嵌入的示例检索始终改善域特异性翻译。这些发现强调了医疗eN-VI MT多语言LLM的承诺和当前局限性。

Title: Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

Authors: Linyang He, Qiaolin Wang, Xilin Jiang, Nima Mesgarani
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.15655
Pdf URL: https://arxiv.org/pdf/2509.15655
Copy Paste: [[2509.15655]] Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations(https://arxiv.org/abs/2509.15655)
Keywords: language model, llm
Abstract: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and as the encoder for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech models encode grammatical features more robustly than conceptual ones.
摘要：基于变压器的语音语言模型（SLM）显着改善了神经语音识别和理解。尽管现有的研究已经研究了SLM编码浅的声学和语音特征的程度，但SLM编码细微的句法和概念特征的程度尚不清楚。通过与大型语言模型的语言能力评估相似，这项研究是第一个系统地评估SLM的上下文句法和语义特征的存在，以进行自我监督学习（S3M），自动语音识别（ASR），语音压缩（CODEC），以及作为听觉大语言模型的编码器（听众）的编码器。通过跨越不同语言级别的71个任务的最小对设计和诊断功能分析，我们的层次和时间分辨分析发现了1）所有语音编码语法特征比概念上的语言特征更强大。

Title: VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Paraskevopoulos, Vassilis Katsouros
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.15667
Pdf URL: https://arxiv.org/pdf/2509.15667
Copy Paste: [[2509.15667]] VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion(https://arxiv.org/abs/2509.15667)
Keywords: language model, llm
Abstract: We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLMs) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce VoxKrikri, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average relative improvement of about 20% across benchmarks.
摘要：我们提出了一个多模式融合框架，该框架桥梁基于训练的解码器大型语言模型（LLM）和声学编码器架构（例如Whisper），目的是构建支持语音的LLMS。我们不是直接使用音频嵌入，而是探索一个中间音频条件的文本空间，作为更有效的对齐机制。我们的方法在连续的文本表示空间中完全运行，通过通过交叉模式的注意将Whisper的隐藏解码器与LLM的态度融合在一起，并支持离线和流式传输模式。我们介绍了第一个希腊语音llm的\ textit {voxkrikri}，并通过分析表明我们的方法有效地使跨模态的表示形式保持一致。这些结果突出了连续的空间融合是多语言和低资源语音LLM的有前途的途径，同时在希腊语中获得自动语音识别的最新结果，为基准的平均$ \％$ \％$相对改善提供了平均$ \％$。

Title: Once Upon a Time: Interactive Learning for Storytelling with Small Language Models

Authors: Jonas Mayer Martins, Ali Hamza Bashir, Muhammad Rehan Khalid, Lisa Beinborn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15714
Pdf URL: https://arxiv.org/pdf/2509.15714
Copy Paste: [[2509.15714]] Once Upon a Time: Interactive Learning for Storytelling with Small Language Models(https://arxiv.org/abs/2509.15714)
Keywords: language model
Abstract: Children efficiently acquire language not just by listening, but by interacting with others in their social environment. Conversely, large language models are typically trained with next-word prediction on massive amounts of text. Motivated by this contrast, we investigate whether language models can be trained with less data by learning not only from next-word prediction but also from high-level, cognitively inspired feedback. We train a student model to generate stories, which a teacher model rates on readability, narrative coherence, and creativity. By varying the amount of pretraining before the feedback loop, we assess the impact of this interactive learning on formal and functional linguistic competence. We find that the high-level feedback is highly data efficient: With just 1 M words of input in interactive learning, storytelling skills can improve as much as with 410 M words of next-word prediction.
摘要：儿童不仅通过聆听，而且在社交环境中与他人互动有效地获取语言。相反，大型语言模型通常接受有关大量文本的下一字预测培训。在这种对比的推动下，我们研究了语言模型是否可以通过不仅从下一字预测中学习，而且还可以从高级，认知灵感的反馈中学习来培训较少的数据。我们培训学生模型来产生故事，教师模型对可读性，叙事连贯性和创造力进行评分。通过改变反馈循环之前的预处理数量，我们评估了这种互动学习对形式和功能性语言能力的影响。我们发现，高级反馈是高度数据效率的：互动学习中只有1 m的输入单词，讲故事的技能可以与下一字预测的410 m单词相同。

Title: REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting

Authors: Nannan Huang, Haytham M. Fayek, Xiuzhen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15723
Pdf URL: https://arxiv.org/pdf/2509.15723
Copy Paste: [[2509.15723]] REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting(https://arxiv.org/abs/2509.15723)
Keywords: language model, llm, prompt
Abstract: Individuals express diverse opinions, and a fair summary should represent these viewpoints comprehensively. Previous research on fairness in opinion summarisation using large language models (LLMs) relied on hyperparameter tuning or providing ground truth distributional information in prompts. However, these methods face practical limitations: end-users rarely modify default model parameters, and accurate distributional information is often unavailable. Building upon cognitive science research demonstrating that frequency-based representations reduce systematic biases in human statistical reasoning by making reference classes explicit and reducing cognitive load, this study investigates whether frequency framed prompting (REFER) can similarly enhance fairness in LLM opinion summarisation. Through systematic experimentation with different prompting frameworks, we adapted techniques known to improve human reasoning to elicit more effective information processing in language models compared to abstract probabilistic representations. Our results demonstrate that REFER enhances fairness in language models when summarising opinions. This effect is particularly pronounced in larger language models and when using stronger reasoning instructions.
摘要：个人表达了不同的意见，一个公平的摘要应全面代表这些观点。先前关于使用大语言模型（LLM）的意见摘要的公平性研究依赖于高参数调整或在提示中提供地面真相分配信息。但是，这些方法面临实际的局限性：最终用户很少修改默认模型参数，而准确的分布信息通常不可用。在认知科学研究的基础上表明，基于频率的表示通过使参考类别显式和减少认知负荷来减少人类统计推理中的系统偏见，这项研究调查了频率构造提示（请参阅）是否可以同样提高LLM意见摘要中的公平性。通过使用不同提示框架进行系统的实验，我们改编了已知的技术，可以改善人类推理，以在语言模型中与抽象概率相比，在语言模型中更有效的信息处理。与抽象的概率相比，HTTP URL结果表明，当汇总观点时，请参考语言模型中的公平性。在较大的语言模型和使用更强大的推理说明中，这种效果特别明显。

Title: Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics

Authors: Reza Sanayei, Srdjan Vesic, Eduardo Blanco, Mihai Surdeanu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15739
Pdf URL: https://arxiv.org/pdf/2509.15739
Copy Paste: [[2509.15739]] Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics(https://arxiv.org/abs/2509.15739)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.
摘要：大型语言模型（LLMS）在线性推理任务上表现出色，但在非线性结构（例如自然辩论中发现的）上仍然没有反应，这些结构最好以参数图表示。我们评估LLM是否可以从计算论证理论（CAT）中近似结构化推理。具体来说，我们使用定量论证辩论（QUAD）语义，该语义将基于其攻击和支持关系的参数分配可接受性分数。只有两个节点数据集中的对话形式的辩论，请提示模型对参数进行排名，而无需访问基础图。我们在高级指导策略下测试了几个LLM，包括思想链和内在学习。虽然模型显示出适度的排列，但性能会降低输入较长或话语流动流。高级提示通过减少与论证长度和位置相关的偏差来有助于减轻这些效果。我们的发现突出了LLM在建模正式论证语义上的希望和局限性，并激发了未来的图形感知推理工作。
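Note: to illustrate how gradual semantics of this kind turn attack and support relations into acceptability scores, here is a minimal sketch. The abstract does not give the exact aggregation formulas, so the code uses the DF-QuAD variant as an assumption: attacker and supporter strengths are each aggregated as 1 - prod(1 - v_i), and the base score is moved down or up in proportion to the difference between the two. The function name `dfquad_strengths` and the toy debate are hypothetical, and the sketch assumes an acyclic argument graph.

```python
def dfquad_strengths(base, attackers, supporters):
    """Compute DF-QuAD-style strengths for an acyclic argument graph.

    base:       dict mapping argument -> base score in [0, 1]
    attackers:  dict mapping argument -> list of arguments attacking it
    supporters: dict mapping argument -> list of arguments supporting it
    """
    memo = {}

    def aggregate(values):
        # Probabilistic sum: 0.0 for no arguments, approaches 1.0 as more are added.
        prod = 1.0
        for v in values:
            prod *= 1.0 - v
        return 1.0 - prod

    def strength(arg):
        if arg not in memo:
            va = aggregate([strength(a) for a in attackers.get(arg, [])])
            vs = aggregate([strength(s) for s in supporters.get(arg, [])])
            v0 = base[arg]
            if va >= vs:
                memo[arg] = v0 - v0 * (va - vs)          # attacks dominate
            else:
                memo[arg] = v0 + (1.0 - v0) * (vs - va)  # supports dominate
        return memo[arg]

    return {arg: strength(arg) for arg in base}


# Toy debate: "a" attacks "claim", "b" attacks "a", "s" supports "claim".
base = {"claim": 0.5, "a": 0.5, "b": 0.5, "s": 0.5}
attackers = {"claim": ["a"], "a": ["b"]}
supporters = {"claim": ["s"]}
scores = dfquad_strengths(base, attackers, supporters)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy debate the counter-attack from "b" weakens "a", so the support from "s" outweighs the remaining attack and "claim" ends up above its base score. A ranking like `ranking` is the kind of reference ordering that a model's predicted ordering could be compared against.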

Title: UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15763
Pdf URL: https://arxiv.org/pdf/2509.15763
Copy Paste: [[2509.15763]] UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression(https://arxiv.org/abs/2509.15763)
Keywords: language model, long context
Abstract: Large language models are increasingly capable of handling long-context inputs, but the memory overhead of key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV caches for certain tokens, is particularly challenging as it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recalling tasks and long-range dependency modeling.
摘要：大型语言模型越来越能够处理长篇小说输入，但是键值（KV）缓存的内存开销仍然是通用部署的主要瓶颈。尽管已经探索了各种压缩策略，但序列级压缩将某些令牌丢弃了全部KV缓存，这尤其具有挑战性，因为它可能导致重要的上下文信息丧失。为了解决这个问题，我们介绍了Unigist，这是一个序列级别的长篇小说压缩框架，通过以细粒度的方式用特殊的压缩令牌（GISTS）替换原始令牌，从而有效地保留上下文信息。我们采用了无块的培训策略，并通过GIST Shift Trick设计有效的内核，从而实现了优化的GPU培训。我们的方案还通过允许实际删除压缩令牌来支持灵活的推断，从而实时存储器节省。跨多个长篇文化任务的实验表明，单位主义者显着提高了压缩质量，并且在细节重新验证任务和远程依赖模型中尤其强劲。

Title: Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning

Authors: Sara Rajaee, Rochelle Choenni, Ekaterina Shutova, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15811
Pdf URL: https://arxiv.org/pdf/2509.15811
Copy Paste: [[2509.15811]] Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning(https://arxiv.org/abs/2509.15811)
Keywords: language model, llm
Abstract: While the reasoning abilities of large language models (LLMs) continue to advance, it remains unclear how such ability varies across languages in multilingual LLMs and whether different languages produce reasoning paths that complement each other. To investigate this question, we train a reward model to rank generated responses for a given question across languages. Our results show that our cross-lingual reward model substantially improves mathematical reasoning performance compared to using reward modeling within a single language, benefiting even high-resource languages. While English often exhibits the highest performance in multilingual models, we find that cross-lingual sampling particularly benefits English under low sampling budgets. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.
摘要：尽管大语言模型（LLM）的推理能力继续发展，但尚不清楚这种能力在多语言LLM中如何在多种语言中有何不同，以及不同语言是否会产生相互补充的推理路径。为了调查这个问题，我们训练一个奖励模型，以对跨语言的给定问题产生回答。我们的结果表明，与在单一语言中使用奖励建模相比，我们的跨语性奖励模型显着提高了数学推理性能，甚至使高资源语言受益。虽然英语在多语言模型中经常表现出最高的性能，但我们发现跨语性抽样在低采样预算下尤其有益于英语。我们的发现揭示了通过利用各种语言的互补优势来改善多语言推理的新机会。
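Note: the selection step implied by the Best-of-L abstract can be sketched as picking the highest-reward response from a pool sampled in several languages, compared with picking from a single language only. This is a sketch under assumptions: `best_of_languages` is a hypothetical name, and the reward model is replaced by a lookup table of made-up scores purely for illustration.

```python
def best_of_languages(candidates, reward_fn, languages=None):
    """Return (language, response, score) with the highest reward.

    candidates: dict mapping language code -> list of sampled responses
    reward_fn:  callable mapping a response string to a float score
    languages:  optional subset of languages to restrict the pool to
    """
    best = None
    for lang, responses in candidates.items():
        if languages is not None and lang not in languages:
            continue
        for resp in responses:
            score = reward_fn(resp)
            if best is None or score > best[2]:
                best = (lang, resp, score)
    return best


# Toy pool: one sample per language for the same question.
candidates = {
    "en": ["2 + 2 * 3 = 12"],
    "de": ["2 + 2 * 3 = 2 + 6 = 8"],
}
# Stand-in for a trained reward model (scores are illustrative only).
toy_scores = {"2 + 2 * 3 = 12": 0.2, "2 + 2 * 3 = 2 + 6 = 8": 0.9}

cross_lingual = best_of_languages(candidates, toy_scores.get)
english_only = best_of_languages(candidates, toy_scores.get, languages={"en"})
```

With the same total number of samples, the cross-lingual pool can surface a correct reasoning path that the single-language pool misses, which is the effect the abstract reports under low sampling budgets.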

Title: Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Authors: Zhongze Luo, Zhenshuai Yin, Yongxin Guo, Zhichao Wang, Jionghao Zhu, Xiaoying Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15839
Pdf URL: https://arxiv.org/pdf/2509.15839
Copy Paste: [[2509.15839]] Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems(https://arxiv.org/abs/2509.15839)
Keywords: llm, chain-of-thought
Abstract: While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce Multi-Physics for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work not only provides a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced.
摘要：虽然多模式LLM（MLLM）表现出显着的推理进展，但它们在物理等专业的科学领域中的应用显示在当前评估基准中的显着差距。具体而言，现有的基准通常缺乏细粒度的覆盖范围，忽略了逐步推理过程，并且主要以英语为中心，无法系统地评估视觉信息的作用。因此，我们为中国物理推理介绍\ textbf {多物理学}，这是一个全面的基准，其中包括5个难度级别，其中包含1,412个与图像相关的，多选择的问题，涉及11个高中物理学。我们采用双重评估框架来评估20种不同的MLLM，从而分析了最终答案的准确性和他们的经营链的逐步完整性。此外，我们通过比较更改输入模式之前和之后的模型性能来系统地研究难度级别和视觉信息的影响。我们的工作不仅为社区提供了细粒度的资源，而且还提供了一种鲁棒性的方法来解剖最先进的MLLM的多模式推理过程，并且我们的数据集和代码已经开源：此HTTPS URL。

Title: Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15888
Pdf URL: https://arxiv.org/pdf/2509.15888
Copy Paste: [[2509.15888]] Distribution-Aligned Decoding for Efficient LLM Task Adaptation(https://arxiv.org/abs/2509.15888)
Keywords: language model, llm
Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.
摘要：即使使用参数有效的微调（PEFT），将十亿参数语言模型适应下游任务仍然是昂贵的。我们将任务改编作为输出分布对齐：目的是将输出分布引导到在解码过程中直接转向任务分布，而不是通过重量更新间接地转向任务分布。在此视图的基础上，我们介绍了转向矢量解码（SVD），这是一种轻巧，与PEFT兼容且理论上接地的方法。我们从简短的温暖启动微调开始，然后从kullback-leibler（KL）差异梯度中提取任务感知的转向向量，介于热启动和预训练的模型的输出分布之间。然后，该转向向量用于指导解码过程，以将模型的输出分布转向任务分布。从理论上讲，我们证明SVD是一阶等同于全面微调的梯度步骤，并为转向矢量的强度提供了全球最佳解决方案。在三个任务和九个基准测试中，SVD与四种标准PEFT方法配对，将多项选择的精度提高了5点，开放式的真实性提高了2分，在Commensens数据集中，相似的增长（1-2分），而无需在PEFT适配器以外添加可训练的参数。因此，SVD为大型语言模型提供了轻巧的，理论上的基础，以实现更强的任务适应。

Title: Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Authors: Frederic Kirstein, Sonu Kumar, Terry Ruas, Bela Gipp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15901
Pdf URL: https://arxiv.org/pdf/2509.15901
Copy Paste: [[2509.15901]] Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions(https://arxiv.org/abs/2509.15901)
Keywords: language model, llm, hallucination, prompt
Abstract: Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
摘要：使用大型语言模型（LLM）汇总摘要仍然容易出错，通常会产生具有幻觉，遗漏和无关的输出。我们提出框架，这是一种模块化管道，将摘要作为语义丰富任务进行汇总。框架提取和评分显着的事实，以主题为主题，并使用这些事实将大纲丰富为抽象的摘要。为了个性化摘要，我们介绍了范围，这是一种理由大声的协议，该协议通过在内容选择之前回答九个问题来构建推理跟踪。为了进行评估，我们提出了P-Mesa，这是一个多维，无参考评估框架，以评估摘要是否适合目标读取器。 P-MESA可靠地识别出误差实例，达到> = 89％的人类注释的精度，并与人类严重程度等级保持一致（R> = 0.70）。在QMSUM和名望上，框架将幻觉和遗漏减少了5分中的2分（用MESA测量），而范围可改善知识的拟合度和目标对准及时的基线。我们的发现倡导重新思考摘要以改善控制，忠诚和个性化。

Title: Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

Authors: Ahmed Karim, Qiao Wang (Judy), Zheng Yuan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15926
Pdf URL: https://arxiv.org/pdf/2509.15926
Copy Paste: [[2509.15926]] Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment(https://arxiv.org/abs/2509.15926)
Keywords: language model, llm
Abstract: Automated Essay Scoring (AES) systems now reach near human agreement on some public benchmarks, yet real-world adoption, especially in high-stakes examinations, remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs and formal coverage guarantees. Two open-source large language models (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90 percent risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
摘要：现在，自动化的论文评分（AES）系统在人类对某些公共基准的同意中达成了一致，但是现实世界中的采用，尤其是在高风险考试中，仍然有限。主要障碍是大多数模型在没有任何伴随的置信度或解释的情况下输出单个分数。我们用共形预测来解决这一差距，这是一种无分配的包装器，可为任何分类器提供设置值的输出和正式覆盖范围保证。两种开源大型语言模型（Llama-3 8B和QWEN-2.5 3B）在三个不同的语料库（ASAP，TOEFL11，Cambridge-FCE）上进行了微调，并以90％的风险水平进行了校准。使用UACC评估可靠性，UACC是一种不确定性感知的准确性，奖励模型既正确又简洁。据我们所知，这是结合保形预测和论文评分的UACC的第一项工作。经过校准的模型始终达到覆盖目标，同时保持预测集紧凑，这表明开源的中型LLM可以支持在循环AES中的教师；我们讨论扩展和更广泛的用户研究作为未来的工作。

Title: Localmax dynamics for attention in transformers and its asymptotic behavior

Authors: Henri Cimetière, Maria Teresa Chiri, Bahman Gharesifard
Subjects: cs.CL, cs.LG, math.DS, math.OC
Abstract URL: https://arxiv.org/abs/2509.15958
Pdf URL: https://arxiv.org/pdf/2509.15958
Copy Paste: [[2509.15958]] Localmax dynamics for attention in transformers and its asymptotic behavior(https://arxiv.org/abs/2509.15958)
Keywords: prompt
Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.
摘要：我们引入了一种新的离散时间模型，称为LocalMax动力学，该模型在经典的软磁动力学和HardMax动力学之间进行了插值，在该动力学和HardMax Dynamics之间，只有使给定令牌影响最大的代币具有正重。与Hardmax一样，均匀的权重由控制邻居影响的参数确定，但是关键的扩展在于通过对齐敏感性参数放松邻域相互作用，这允许与纯Hardmax行为受控偏差。正如我们证明的那样，虽然令牌状态的凸壳仍然收敛到凸多角形，但它的结构不再可以通过最大对齐集的充分描述，从而促使引入quiescent集以捕获在顶点附近的代币的不变行为。我们表明，这些集合在理解系统的渐近行为方面也起着关键作用，即使在随着时间变化的一致性敏感性参数下也是如此。我们进一步表明，LocalMax动力学不会表现出有限的时间收敛，并为消失，非零，时变的对齐敏感性参数提供了结果，从而恢复了Hardmax作为副产品的限制行为。最后，我们从基于Lyapunov的方法中从经典意见动力学中调整了基于Lyapunov的方法，突出了它们在LocalMax交互的不对称设置中的局限性，并概述了未来研究的方向。

Title: BEFT: Bias-Efficient Fine-Tuning of Language Models

Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15974
Pdf URL: https://arxiv.org/pdf/2509.15974
Copy Paste: [[2509.15974]] BEFT: Bias-Efficient Fine-Tuning of Language Models(https://arxiv.org/abs/2509.15974)
Keywords: language model, llm
Abstract: Fine-tuning all-bias-terms stands out among various parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., bias terms in the query, key, or value projections) and downstream performance remains unclear. The existing approaches, e.g., based on the magnitude of bias change or empirical Fisher information, provide limited guidance for selecting the particular bias term for effective fine-tuning. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches, across a wide range of large language models (LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.
摘要：微调所有偏差terms在各种参数有效的微调（PEFT）技术中脱颖而出，这是由于其开箱即用的可用性和竞争性能，尤其是在低数据表格中。仅偏置微调具有前所未有的参数效率的潜力。但是，微调不同的偏差术语（即查询，密钥或价值预测中的偏差术语）与下游性能之间的联系尚不清楚。现有方法，例如，基于偏见变化或经验渔民信息的大小，为选择特定的偏见项提供了有效的微调术语提供了有限的指导。在本文中，我们提出了一种选择要进行微调的偏差项的方法，从而构成了我们偏见的微调（BEFT）的基础。我们广泛地评估了我们的偏差方法对其他偏置选择方法的评估，这些方法跨越了范围的大型语言模型（LLMS），范围为仅编码，而仅解码器的体系结构从1.10亿到6.7B参数。我们的结果表明，我们的偏见方法对各种下游任务的有效性和优越性，包括分类，多项选择和发电任务。

Title: Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech

Authors: Sang Hoon Woo, Sehun Lee, Kang-wook Kim, Gunhee Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16028
Pdf URL: https://arxiv.org/pdf/2509.16028
Copy Paste: [[2509.16028]] Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech(https://arxiv.org/abs/2509.16028)
Keywords: language model, llm
Abstract: Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yields suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose Think-Verbalize-Speak, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is verbalizing, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce ReVerT, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at this https URL.
摘要：口语对话系统越来越多地采用大型语言模型（LLM）来利用其先进的推理能力。但是，由于最佳文本和口头传递之间的不匹配，直接应用LLM在口语通信中通常会产生次优的结果。尽管现有方法适应LLM来产生语音友好的输出，但其对推理性能的影响仍然没有得到充实的影响。在这项工作中，我们提出了思维式言论，该框架将推理从口语交付中取消，以保留LLM的全部推理能力。我们方法的核心是口头表达，这是一个中间的步骤，将思想转化为自然，语音就绪的文本。我们还引入了基于增量和异步摘要的潜伏期式口语恢复。跨多个基准测试的实验表明，我们的方法增强了语音自然性和简洁性，对推理的影响最小。带有数据集和源代码的项目页面可在此HTTPS URL上找到

Title: Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

Authors: Fangyi Yu, Nabeel Seedat, Dasha Herrmannova, Frank Schilder, Jonathan Richard Schwarz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16093
Pdf URL: https://arxiv.org/pdf/2509.16093
Copy Paste: [[2509.16093]] Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses(https://arxiv.org/abs/2509.16093)
Keywords: llm
Abstract: Evaluating long-form answers in high-stakes domains such as law or medicine remains a fundamental challenge. Standard metrics like BLEU and ROUGE fail to capture semantic correctness, and current LLM-based evaluators often reduce nuanced aspects of answer quality into a single undifferentiated score. We introduce DeCE, a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts), using instance-specific criteria automatically extracted from gold answer requirements. DeCE is model-agnostic and domain-general, requiring no predefined taxonomies or handcrafted rubrics. We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding. DeCE achieves substantially stronger correlation with expert judgments (r = 0.78), compared to traditional metrics (r = 0.12), pointwise LLM scoring (r = 0.35), and modern multidimensional evaluators (r = 0.48). It also reveals interpretable trade-offs: generalist models favor recall, while specialized models favor precision. Importantly, only 11.95% of LLM-generated criteria required expert revision, underscoring DeCE's scalability. DeCE offers an interpretable and actionable LLM evaluation framework in expert domains.
摘要：评估法律或医学等高风险领域中的长形答案仍然是一个基本挑战。 BLEU和Rouge等标准指标无法捕获语义正确性，而当前基于LLM的评估者通常将答案质量的细微差别方面降低为单个未分化的分数。我们介绍了欺骗，这是一个分解的LLM评估框架，该框架将精确度（事实准确性和相关性）和召回（覆盖所需概念的覆盖范围）分开，使用特定于实例的标准自动从黄金答案要求中提取。欺骗是模特的敏锐和领域，不需要预定义的分类法或手工制作的专栏。我们实例化欺骗会在现实世界中的法律质量保证任务中评估不同的LLM，该任务涉及多界数学推理和引用基础。与传统的指标（$ r = 0.12 $），LLM得分（$ r = 0.35 $）和现代的多维评估者（$ r = 0.48 $）相比，欺骗与专家判断（$ r = 0.78 $）的相关性（$ r = 0.78 $）相关。它还揭示了可解释的权衡：通才模型偏爱召回，而专业模型则偏爱精度。重要的是，只有11.95％的LLM生成标准需要专家修订，强调欺骗的可伸缩性。欺骗在专家领域提供了可解释的LLM评估框架。

Title: It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Authors: Lukas Ellinger, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16107
Pdf URL: https://arxiv.org/pdf/2509.16107
Copy Paste: [[2509.16107]] It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge(https://arxiv.org/abs/2509.16107)
Keywords: language model, gpt, llm, prompt
Abstract: Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.
摘要：歧义的单词或指定的参考文献通常需要通过依靠共同的上下文和常识性知识来解决它们。因此，我们系统地研究大型语言模型（LLMS）是否可以利用常识来解决多转交谈中的参考歧义，并在歧义持续存在时分析其行为。此外，我们研究了简化语言的要求如何影响这种能力。使用新型的多语言评估数据集，我们通过LLM-AS-AS-Gudge和人类注释测试DeepSeek V3，GPT-4O，QWEN3-32B，GPT-4O-MINI和LLAMA-3.1-8B。我们的发现表明，当前的LLM努力有效地解决歧义：他们倾向于致力于单一的解释或涵盖所有可能的参考，而不是对冲或寻求澄清。在简化提示下，这种限制变得更加明显，从而大大降低了常识性推理和各种响应策略的使用。具有直接偏好优化的微调Llama-3.1-8B显着改善了所有请求类型的歧义分辨率。这些结果强调了需要进行高级微调来改善LLM对歧义的处理，并确保各种沟通方式的稳健性能。

Title: CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion

Authors: Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, Hui Li
Subjects: cs.CL, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2509.16112
Pdf URL: https://arxiv.org/pdf/2509.16112
Copy Paste: [[2509.16112]] CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion(https://arxiv.org/abs/2509.16112)
Keywords: language model, llm
Abstract: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at this https URL.
摘要：存储库级代码完成会根据存储库的更广泛信息自动预测未完成的代码。代码大型语言模型（代码LLM）的最新步伐刺激了存储库级代码完成方法的开发，从而产生了有希望的结果。然而，他们遭受了不适当的查询构建，单路径代码检索以及代码检索器和代码LLM之间的未对准等问题。为了解决这些问题，我们介绍了Coderag，该框架量身定制，旨在确定相关和必要的知识，以检索授权的存储库级代码完成。它的核心组件包括日志概率指导查询构建，多路径代码检索和偏好对齐的BESTFIT RERANKing。基准和CCEVAL的广泛实验表明，Coderag显着，始终如一地超过了最先进的方法。 CODERAG的实现可在此HTTPS URL上获得。

Title: CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs

Authors: Jinghao Zhang, Sihang Jiang, Shiwei Guo, Shisong Chen, Yanghua Xiao, Hongwei Feng, Jiaqing Liang, Minggui HE, Shimin Tao, Hongxia Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16188
Pdf URL: https://arxiv.org/pdf/2509.16188
Copy Paste: [[2509.16188]] CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs(https://arxiv.org/abs/2509.16188)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are challenging to scale and adapt across different cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotations. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given languages and cultures. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at this https URL
摘要：随着大型语言模型（LLM）越来越多地部署在各种文化环境中，评估他们的文化理解能力已成为确保值得信赖和文化上的应用程序至关重要的。但是，大多数现有的基准都缺乏全面性，并且具有挑战性地在不同的文化背景下进行扩展和适应，因为它们的框架通常缺乏完善的文化理论的指导，并且倾向于依靠专家驱动的手动注释。为了解决这些问题，我们提出了CulturesCope，这是迄今为止最全面的评估框架，用于评估LLMS中的文化理解。受文化冰山理论的启发，我们为文化知识分类设计了一种新的维度模式，包括3层和140个维度，该方案指导了特定于文化的知识基础的自动构建，以及对任何给定语言和文化的相应评估数据集。实验结果表明，我们的方法可以有效地评估文化理解。他们还揭示了现有的大型语言模型缺乏全面的文化能力，而仅纳入多语言数据并不一定会增强文化理解。所有代码和数据文件均可在此HTTPS URL上找到

Title: RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Authors: Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2509.16198
Pdf URL: https://arxiv.org/pdf/2509.16198
Copy Paste: [[2509.16198]] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation(https://arxiv.org/abs/2509.16198)
Keywords: language model, llm, agent
Abstract: Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9× the strongest baseline (Claude Code) and about 64× other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.
摘要：大型语言模型在功能和文件级代码生成方面表现出色，但是从头开始生成完整的存储库仍然是一个基本挑战。该过程要求在提案和实施级别阶段进行连贯和可靠的计划，而自然语言由于其模棱两可和冗长而不适合忠实地代表复杂的软件结构。为了解决这个问题，我们介绍了存储库计划图（RPG），这是一种持久表示，通过编码功能，文件结构，数据流和功能在一个图中统一提案和实现级别的计划。 RPG用明确的蓝图替代了模棱两可的自然语言，从而实现了长马计划和可扩展的存储库生成。在RPG上，我们开发了Zerorepo，这是一个从头开始的存储库生成的图形驱动框架。它分为三个阶段：提案级别的计划和实现级别的完善来构建图形，然后使用测试验证的图引导代码生成。为了评估此设置，我们构建了porcocraft，这是六个现实世界项目的基准，具有1,052个任务。在porecraft上，Zerorepo生产的存储库平均近36K LOC，大约3.9 $ \ times $是最强的基线（Claude Code），约为64 $ \ times $其他基线。它的功能覆盖率为81.5％，通过率为69.7％，分别超过克劳德代码27.3和35.8个百分点。进一步的分析表明，RPG模拟复杂的依赖性，通过近乎线性缩放逐步实现了更复杂的计划，并增强了LLM对存储库的理解，从而加速了代理定位。