2025-07-02

Title: Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data

Authors: Ekaterina Borisova, Fabio Barth, Nils Feldhus, Raia Abu Ahmad, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Sebastian Möller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00152
Pdf URL: https://arxiv.org/pdf/2507.00152
Copy Paste: [[2507.00152]] Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data(https://arxiv.org/abs/2507.00152)
Keywords: llm
Abstract: Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.

Title: Prompting as Scientific Inquiry

Authors: Ari Holtzman, Chenhao Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00163
Pdf URL: https://arxiv.org/pdf/2507.00163
Copy Paste: [[2507.00163]] Prompting as Scientific Inquiry(https://arxiv.org/abs/2507.00163)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate, whereas prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.

Title: LineRetriever: Planning-Aware Observation Reduction for Web Agents

Authors: Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Massimo Caccia, Véronique Eglin, Alexandre Aussem, Jérémy Espinas, Alexandre Lacoste
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00210
Pdf URL: https://arxiv.org/pdf/2507.00210
Copy Paste: [[2507.00210]] LineRetriever: Planning-Aware Observation Reduction for Web Agents(https://arxiv.org/abs/2507.00210)
Keywords: language model, agent
Abstract: While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce LineRetriever, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, LineRetriever explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that LineRetriever can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.

Title: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

Authors: Mads Henrichsen, Rasmus Krebs
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00214
Pdf URL: https://arxiv.org/pdf/2507.00214
Copy Paste: [[2507.00214]] Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning(https://arxiv.org/abs/2507.00214)
Keywords: language model, llm
Abstract: Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.

Title: Towards Style Alignment in Cross-Cultural Translation

Authors: Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00216
Pdf URL: https://arxiv.org/pdf/2507.00216
Copy Paste: [[2507.00216]] Towards Style Alignment in Cross-Cultural Translation(https://arxiv.org/abs/2507.00216)
Keywords: llm
Abstract: Successful communication depends on the speaker's intended style (i.e., what the speaker is trying to convey) aligning with the listener's interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style: they bias translations towards neutrality and perform worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.

Title: Linearly Decoding Refused Knowledge in Aligned Language Models

Authors: Aryan Shrivastava, Ari Holtzman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00239
Pdf URL: https://arxiv.org/pdf/2507.00239
Copy Paste: [[2507.00239]] Linearly Decoding Refused Knowledge in Aligned Language Models(https://arxiv.org/abs/2507.00239)
Keywords: language model, prompt
Abstract: Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding 0.8. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely "leftover" in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM-generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
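The linear-probe setup described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the hidden states are synthetic random vectors standing in for real LM activations, and the ridge penalty, the dimensions, and the names `fit_linear_probe` and `pearson` are assumptions made for the example.

```python
import numpy as np

def fit_linear_probe(hidden, targets, l2=1.0):
    """Fit a ridge-regression probe and return (weights, bias)."""
    mean_h = hidden.mean(axis=0)
    mean_y = targets.mean()
    centered = hidden - mean_h
    # Closed-form ridge solution: (X^T X + l2 * I)^{-1} X^T y on centered data.
    gram = centered.T @ centered + l2 * np.eye(hidden.shape[1])
    weights = np.linalg.solve(gram, centered.T @ (targets - mean_y))
    bias = mean_y - mean_h @ weights
    return weights, bias

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in for hidden states (assumption: 400 examples, 32 dimensions).
rng = np.random.default_rng(0)
n, d = 400, 32
hidden_states = rng.normal(size=(n, d))
true_direction = rng.normal(size=d)
# The probed value is a noisy linear function of the hidden state.
values = hidden_states @ true_direction + rng.normal(scale=0.5, size=n)

# Train on the first 300 examples, evaluate on the held-out 100.
train, test = slice(0, 300), slice(300, n)
w, b = fit_linear_probe(hidden_states[train], values[train])
predictions = hidden_states[test] @ w + b
probe_r = pearson(predictions, values[test])
print(f"held-out Pearson r = {probe_r:.3f}")
```

With real models, `hidden_states` would be activations extracted at a chosen layer and `values` the quantity of interest; which layer and which regression variant the paper uses is not stated in the abstract.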

Title: EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Authors: Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00246
Pdf URL: https://arxiv.org/pdf/2507.00246
Copy Paste: [[2507.00246]] EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning(https://arxiv.org/abs/2507.00246)
Keywords: language model
Abstract: Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source LRMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at: this https URL.

Title: Impact of Fine-Tuning Methods on Memorization in Large Language Models

Authors: Jie Hou, Chuxiong Wu, Lannan Luo, Qiang Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00258
Pdf URL: https://arxiv.org/pdf/2507.00258
Copy Paste: [[2507.00258]] Impact of Fine-Tuning Methods on Memorization in Large Language Models(https://arxiv.org/abs/2507.00258)
Keywords: language model, llm, prompt
Abstract: As the capabilities of pre-trained large language models (LLMs) continue to advance, the "pre-train and fine-tune" paradigm has become increasingly mainstream, leading to the development of various fine-tuning methods. However, the privacy risks arising from memorization during fine-tuning have received relatively little attention. To address this gap, we categorize popular fine-tuning approaches and assess their impact on memorization through the lens of membership inference attacks (MIAs). Our results show that, compared to parameter-based fine-tuning, prompt-based fine-tuning achieves competitive performance while exhibiting lower vulnerability to MIAs. Furthermore, prompt-based methods maintain low memorization regardless of model scale. These findings suggest that parameter-based fine-tuning is more prone to leaking private information, whereas prompt-based fine-tuning serves as a more privacy-preserving option.
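One common way to measure memorization with a membership inference attack is a loss-threshold attack, sketched below. The abstract does not say which MIA variants the authors use, so this is only an illustration of the general idea; the loss values are synthetic, and `membership_scores` and `attack_auc` are hypothetical helper names.

```python
import random

def membership_scores(losses):
    """Lower loss suggests a training member, so use the negated loss as the score."""
    return [-loss for loss in losses]

def attack_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Synthetic per-example losses (assumption): members were fit during fine-tuning,
# so their losses are lower on average than those of held-out examples.
rng = random.Random(0)
member_losses = [rng.gauss(1.0, 0.5) for _ in range(200)]
nonmember_losses = [rng.gauss(2.0, 0.5) for _ in range(200)]

auc = attack_auc(membership_scores(member_losses), membership_scores(nonmember_losses))
print(f"attack AUC = {auc:.3f}")
```

An AUC near 0.5 means the attacker cannot separate members from non-members; values closer to 1.0 indicate stronger memorization of the fine-tuning data.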

Title: Natural language processing for African languages

Authors: David Ifeoluwa Adelani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00297
Pdf URL: https://arxiv.org/pdf/2507.00297
Copy Paste: [[2507.00297]] Natural language processing for African languages(https://arxiv.org/abs/2507.00297)
Keywords: language model
Abstract: Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.

Title: Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Authors: Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2507.00322
Pdf URL: https://arxiv.org/pdf/2507.00322
Copy Paste: [[2507.00322]] Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones(https://arxiv.org/abs/2507.00322)
Keywords: language model
Abstract: Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.
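For readers unfamiliar with the task, the check that a generated string has balanced brackets can be sketched with a stack. This shows only how such outputs might be scored; it is not RASteer and says nothing about the steering method itself. The helper names `is_balanced` and `fraction_balanced` are assumptions for this example.

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(text):
    """Return True if every bracket in text is closed in the correct order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            # A closer with no opener, or with the wrong opener, is an error.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Any opener left on the stack was never closed.
    return not stack

def fraction_balanced(completions):
    """Share of completions whose brackets are balanced."""
    return sum(is_balanced(c) for c in completions) / len(completions)

print(is_balanced("print(values[0])"))
print(fraction_balanced(["f(x)", "f(x", "g(y[1])", ")("]))
```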

Title: Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

Authors: Mohna Chakraborty, Adithya Kulkarni, Qi Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.00330
Pdf URL: https://arxiv.org/pdf/2507.00330
Copy Paste: [[2507.00330]] Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios(https://arxiv.org/abs/2507.00330)
Keywords: language model, prompt
Abstract: Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.

Title: Question Decomposition for Retrieval-Augmented Generation

Authors: Paul J. L. Ammann, Jonas Golde, Alan Akbik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00355
Pdf URL: https://arxiv.org/pdf/2507.00355
Copy Paste: [[2507.00355]] Question Decomposition for Retrieval-Augmented Generation(https://arxiv.org/abs/2507.00355)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?," challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. Although reranking itself is standard, we show that pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition bridges the retrieval gap on multi-hop questions and provides a practical, drop-in enhancement, without any extra training or specialized indexing. We evaluate our approach on MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines.
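The three steps in the abstract above (decompose, retrieve per sub-question, merge and rerank) can be sketched as a toy pipeline. Everything here is a stand-in chosen for illustration: `decompose` is a hard-coded placeholder for the LLM call, token overlap replaces both the retriever and the cross-encoder reranker, and the four-document `CORPUS` is invented and contains no real figures.

```python
import re

# Invented toy corpus; the texts carry no real financial data.
CORPUS = {
    "d1": "NVIDIA annual report: net profit for 2023.",
    "d2": "Apple annual report: net profit for 2023.",
    "d3": "Google annual report: net profit for 2023.",
    "d4": "Sourdough bread recipe with a long fermentation.",
}

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, doc_text):
    """Toy relevance score: number of shared tokens."""
    return len(tokens(query) & tokens(doc_text))

def decompose(question):
    """Hypothetical placeholder for the LLM step (i): one sub-question per entity."""
    entities = ["NVIDIA", "Apple", "Google"]
    return [f"What was the profit of {e} in 2023?"
            for e in entities if e.lower() in question.lower()]

def retrieve(query, k=2):
    """Step (ii): top-k documents for one query under the toy score."""
    ranked = sorted(CORPUS, key=lambda d: overlap(query, CORPUS[d]), reverse=True)
    return ranked[:k]

def rerank(question, doc_ids, top_n=3):
    """Step (iii): rescore the merged pool against the original question."""
    ranked = sorted(doc_ids, key=lambda d: overlap(question, CORPUS[d]), reverse=True)
    return ranked[:top_n]

def decomposed_retrieval(question):
    pool = []
    for sub_question in decompose(question):
        for doc_id in retrieve(sub_question):
            if doc_id not in pool:  # merge candidates without duplicates
                pool.append(doc_id)
    return rerank(question, pool)

question = "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?"
evidence = decomposed_retrieval(question)
print(evidence)
```

In this toy setting a single call `retrieve(question, k=2)` can return at most two of the three company documents, whereas the decomposed pipeline collects all three before reranking. In a real system the sub-questions would come from an LLM and the reranker would be a cross-encoder, as the abstract describes.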

Title: Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics

Authors: Vojtěch Lanz, Jan Hajič jr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00380
Pdf URL: https://arxiv.org/pdf/2507.00380
Copy Paste: [[2507.00380]] Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics(https://arxiv.org/abs/2507.00380)
Keywords: language model
Abstract: The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called "centonisation" theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies. The intractable number of possible segmentations leaves open the possibility that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.

Title: Causal Prompting for Implicit Sentiment Analysis with Large Language Models

Authors: Jing Ren, Wenhao Zhou, Bowen Li, Mujie Liu, Nguyen Linh Dan Le, Jiade Cen, Liping Chen, Ziqi Xu, Xiwei Xu, Xiaodong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00389
Pdf URL: https://arxiv.org/pdf/2507.00389
Copy Paste: [[2507.00389]] Causal Prompting for Implicit Sentiment Analysis with Large Language Models(https://arxiv.org/abs/2507.00389)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder's representation with the LLM's reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: this https URL.
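The two-component decomposition mentioned in the abstract above matches the standard front-door adjustment formula. As a reminder, with X the input prompt, R the reasoning chain acting as mediator, and Y the final output (this symbol naming is an assumption for illustration; the abstract does not give the paper's notation):

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{r} \underbrace{P(r \mid x)}_{\text{prompt} \to \text{reasoning chain}}
    \; \underbrace{\sum_{x'} P(Y \mid r, x')\, P(x')}_{\text{reasoning chain} \to \text{output}}
```

The first factor is the influence of the prompt on the reasoning chains and the inner sum is the effect of a chain on the output. How the encoder-based clustering and the NWGM approximation estimate these terms is specific to the paper and is not reproduced here.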

Title: Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions

Authors: Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00439
Pdf URL: https://arxiv.org/pdf/2507.00439
Copy Paste: [[2507.00439]] Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions(https://arxiv.org/abs/2507.00439)
Keywords: language model, llm, prompt
Abstract: The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promotes easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.

Title: Pitfalls of Evaluating Language Models with Open Benchmarks

Authors: Md. Najib Hasan (1), Mohammad Fakhruddin Babar (2), Souvika Sarkar (1), Monowar Hasan (2), Santu Karmaker (3) ((1) Wichita State University, (2) Washington State University, (3) University of Central Florida)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00460
Pdf URL: https://arxiv.org/pdf/2507.00460
Copy Paste: [[2507.00460]] Pitfalls of Evaluating Language Models with Open Benchmarks(https://arxiv.org/abs/2507.00460)
Keywords: language model, gpt, llm
Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing "cheating" models (smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets), which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.

Title: TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search

Authors: To Eun Kim, João Coelho, Gbemileke Onilude, Jai Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00509
Pdf URL: https://arxiv.org/pdf/2507.00509
Copy Paste: [[2507.00509]] TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search(https://arxiv.org/abs/2507.00509)
Keywords: language model, llm, retrieval-augmented generation
Abstract: As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
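
Sketch (not from the paper): the best-of-N selection mentioned in the abstract above can be illustrated as choosing, among N candidate responses, the one an ad detector scores lowest. The lexicon-based `toy_detector` and the example candidates are hypothetical stand-ins; the abstract does not describe the authors' classifier or generator interfaces.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], detect_prob: Callable[[str], float]) -> str:
    """Return the candidate that the detector scores as least likely to contain an ad."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=detect_prob)


# Hypothetical stand-in for a trained ad classifier:
# the fraction of words that come from a tiny promotional lexicon.
PROMO_WORDS = {"buy", "sale", "discount", "offer"}


def toy_detector(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,!?") in PROMO_WORDS for w in words)
    return hits / len(words)


# Hypothetical candidate responses, e.g. N samples from an ad-rewriter.
candidates = [
    "Buy now, huge discount on AcmeVPN!",
    "A VPN such as AcmeVPN encrypts traffic on public Wi-Fi.",
    "Special offer: AcmeVPN sale ends today.",
]
chosen = best_of_n(candidates, toy_detector)
print(chosen)
```

In a real pipeline the detector would be a trained classifier returning a probability; only the selection rule (take the minimum detector score) is illustrated here.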

Title: TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification

Authors: Miriam Anschütz, Ekaterina Gikalo, Niklas Herbster, Georg Groh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00579
Pdf URL: https://arxiv.org/pdf/2507.00579
Copy Paste: [[2507.00579]] TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification(https://arxiv.org/abs/2507.00579)
Keywords: llm, hallucination
Abstract: Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

Title: Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based

Authors: Shuangquan Lyu, Yingnan Deng, Guiran Liu, Zhen Qi, Ruotong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00601
Pdf URL: https://arxiv.org/pdf/2507.00601
Copy Paste: [[2507.00601]] Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based(https://arxiv.org/abs/2507.00601)
Keywords: language model, llm, prompt
Abstract: This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. It proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies. The method introduces knowledge alignment loss and soft prompt tuning to guide the model in effectively absorbing the structural features of target languages or tasks under minimal annotation. This enhances both generalization performance and training stability. The framework includes lightweight adaptation modules to reduce computational costs. During training, it integrates freezing strategies and prompt injection to preserve the model's original knowledge while enabling quick adaptation to new tasks. The study also conducts stability analysis experiments and synthetic pseudo-data transfer experiments to systematically evaluate the method's applicability and robustness across different low-resource tasks. Experimental results show that compared with existing multilingual pre-trained models and mainstream transfer methods, the proposed approach achieves higher performance and stability on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X. It demonstrates particularly strong advantages under extremely data-scarce conditions. The proposed method offers strong generality and scalability. It enhances task-specific adaptability while preserving the general capabilities of large language models. This makes it well-suited for complex semantic modeling and multilingual processing tasks.

Title: Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Authors: Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00606
Pdf URL: https://arxiv.org/pdf/2507.00606
Copy Paste: [[2507.00606]] Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies(https://arxiv.org/abs/2507.00606)
Keywords: language model, gpt, llm, prompt, chain-of-thought, tree-of-thought
Abstract: Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

Title: SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Authors: Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00665
Pdf URL: https://arxiv.org/pdf/2507.00665
Copy Paste: [[2507.00665]] SAFER: Probing Safety in Reward Models with Sparse Autoencoder(https://arxiv.org/abs/2507.00665)
Keywords: language model, llm, chat
Abstract: Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at this https URL. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
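
Sketch (not from the paper): one simple way to read "salience by activation differences between chosen and rejected responses" is a per-feature difference of mean activations. The mean-difference aggregation, the function name `feature_salience`, and the toy activation values are all assumptions made for illustration; the abstract does not specify the exact formula.

```python
from typing import List, Sequence


def feature_salience(
    chosen_acts: Sequence[Sequence[float]],
    rejected_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Per-feature mean activation on chosen responses minus mean on rejected ones."""
    n_features = len(chosen_acts[0])

    def mean_per_feature(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(row[j] for row in rows) / len(rows) for j in range(n_features)]

    mean_chosen = mean_per_feature(chosen_acts)
    mean_rejected = mean_per_feature(rejected_acts)
    return [c - r for c, r in zip(mean_chosen, mean_rejected)]


# Hypothetical sparse-autoencoder activations: rows are responses, columns are features.
chosen = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
rejected = [[0.1, 0.2, 0.0], [0.3, 0.2, 0.0]]

scores = feature_salience(chosen, rejected)
# Rank features by absolute salience, most salient first.
ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
print(scores, ranked)
```

Features with a large positive score fire more on chosen responses, and features with a large negative score fire more on rejected ones.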

Title: Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English

Authors: Ahmed Sabir, Azinovič Gasper, Mengsay Loem, Rajesh Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00700
Pdf URL: https://arxiv.org/pdf/2507.00700
Copy Paste: [[2507.00700]] Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English(https://arxiv.org/abs/2507.00700)
Keywords: language model
Abstract: Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.

Title: AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation

Authors: Elizabeth Fons, Elena Kochkina, Rachneet Kaur, Zhen Zeng, Berowne Hlavaty, Charese Smiley, Svitlana Vyetrenko, Manuela Veloso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00718
Pdf URL: https://arxiv.org/pdf/2507.00718
Copy Paste: [[2507.00718]] AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation(https://arxiv.org/abs/2507.00718)
Keywords: language model, llm, prompt
Abstract: This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, those stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing data from both real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.

Title: LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Authors: Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00769
Pdf URL: https://arxiv.org/pdf/2507.00769
Copy Paste: [[2507.00769]] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing(https://arxiv.org/abs/2507.00769)
Keywords: language model, llm
Abstract: Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at this https URL, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
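
Sketch (not from the paper): a Bradley-Terry reward model scores each story with a scalar and models the probability that one story is preferred over another as a logistic function of the score difference. The code below shows that standard formulation together with the pairwise accuracy metric; the function names and the toy scores are illustrative assumptions, and the abstract does not describe the authors' training setup.

```python
import math
from typing import Sequence, Tuple

Pair = Tuple[float, float]  # (reward of human-preferred story, reward of the other story)


def bt_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen story beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_loss(pairs: Sequence[Pair]) -> float:
    """Mean negative log-likelihood of the human preferences under the model."""
    return -sum(math.log(bt_prob(c, r)) for c, r in pairs) / len(pairs)


def pairwise_accuracy(pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the preferred story receives the higher reward."""
    return sum(c > r for c, r in pairs) / len(pairs)


# Hypothetical reward-model scores for four labeled comparisons.
pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]
print(bt_loss(pairs), pairwise_accuracy(pairs))
```

Accuracy figures such as those quoted in the abstract correspond to `pairwise_accuracy` computed on held-out comparisons, while `bt_loss` is the quantity a Bradley-Terry reward model is typically trained to minimize.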

Title: Many LLMs Are More Utilitarian Than One

Authors: Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.00814
Pdf URL: https://arxiv.org/pdf/2507.00814
Copy Paste: [[2507.00814]] Many LLMs Are More Utilitarian Than One(https://arxiv.org/abs/2507.00814)
Keywords: language model, llm, agent
Abstract: Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

Title: ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering

Authors: Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00828
Pdf URL: https://arxiv.org/pdf/2507.00828
Copy Paste: [[2507.00828]] ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering(https://arxiv.org/abs/2507.00828)
Keywords: llm
Abstract: Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at this https URL.

Title: Stylometry recognizes human and LLM-generated texts in short samples

Authors: Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.00838
Pdf URL: https://arxiv.org/pdf/2507.00838
Copy Paste: [[2507.00838]] Stylometry recognizes human and LLM-generated texts in short samples(https://arxiv.org/abs/2507.00838)
Keywords: language model, gpt, llm
Abstract: The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence-long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to 0.87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between 0.79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to 0.98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show -- crucially, in the context of the increasingly sophisticated LLMs -- that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
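
Sketch (not from the paper): the Matthews correlation coefficient quoted in the abstract above generalises to any number of classes. The code below computes it from label counts using the standard multiclass formulation; the function name and the toy labels are illustrative, and nothing here reproduces the authors' pipeline or data.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def multiclass_mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multiclass Matthews correlation coefficient computed from label counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                      # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correctly classified samples
    t_counts = Counter(y_true)                           # how often each class truly occurs
    p_counts = Counter(y_pred)                           # how often each class is predicted
    labels = set(t_counts) | set(p_counts)
    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    # Convention: return 0.0 when the coefficient is undefined (e.g. a single class).
    return cov_tp / denom if denom else 0.0


# Toy three-class example with hypothetical source labels.
y_true = ["human", "gpt4", "llama", "human", "gpt4", "llama"]
y_pred = ["human", "gpt4", "llama", "human", "llama", "llama"]
print(multiclass_mcc(y_true, y_pred))
```

Unlike plain accuracy, the coefficient stays near 0 for a classifier that ignores the input, which is why it is a common choice for multiclass settings such as the 7-class scenario mentioned above.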

Title: TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

Authors: Xi Xuan, King-kui Sin, Yufei Zhou, Chunyu Kit
Subjects: cs.CL, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2507.00875
Pdf URL: https://arxiv.org/pdf/2507.00875
Copy Paste: [[2507.00875]] TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation(https://arxiv.org/abs/2507.00875)
Keywords: language model, gpt, llm, agent
Abstract: Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

Title: Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

Authors: Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00883
Pdf URL: https://arxiv.org/pdf/2507.00883
Copy Paste: [[2507.00883]] Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations(https://arxiv.org/abs/2507.00883)
Keywords: language model, llm, prompt
Abstract: Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.

Title: MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

Authors: Yuheng Wang, Xianhe Tang, Pufeng Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.00891
Pdf URL: https://arxiv.org/pdf/2507.00891
Copy Paste: [[2507.00891]] MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes(https://arxiv.org/abs/2507.00891)
Keywords: llm, agent
Abstract: Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance of multimodal interactions. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.

Title: Discourse Heuristics For Paradoxically Moral Self-Correction

Authors: Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00985
Pdf URL: https://arxiv.org/pdf/2507.00985
Copy Paste: [[2507.00985]] Discourse Heuristics For Paradoxically Moral Self-Correction(https://arxiv.org/abs/2507.00985)
Keywords: language model, llm
Abstract: Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

Title: Should We Still Pretrain Encoders with Masked Language Modeling?

Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00994
Pdf URL: https://arxiv.org/pdf/2507.00994
Copy Paste: [[2507.00994]] Should We Still Pretrain Encoders with Masked Language Modeling?(https://arxiv.org/abs/2507.00994)
Keywords: language model, llm
Abstract: Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at this https URL to foster further research.
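Summary: Learning high-quality text representations is fundamental to many NLP tasks. Although encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and show improved fine-tuning stability. Building on these findings, we show experimentally that a biphasic training strategy that applies CLM first and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we show that this strategy becomes more appealing when initializing from readily available pretrained CLM models from the existing LLM ecosystem, reducing the computational burden of training best-in-class encoder models. We release all project artifacts at this https URL to foster further research.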
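
To make the two objectives concrete, the following is a minimal Python sketch of how training targets differ under CLM and MLM, and how a biphasic schedule might switch from one to the other within a fixed step budget. The function names, the `[MASK]` placeholder, the 15% masking rate, and the 50/50 split (`clm_fraction`) are illustrative assumptions, not details taken from the paper, and the masking shown is a simplified variant.

```python
import random

MASK = "[MASK]"  # hypothetical mask placeholder, for illustration only


def clm_example(tokens):
    """Causal LM: the input at each position is used to predict the next token."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_rate=0.15, seed=0):
    """Masked LM (simplified): hide a random subset of positions and predict only those."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    # None marks positions that contribute no loss.
    targets = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, targets


def objective_for_step(step, total_steps, clm_fraction=0.5):
    """Biphasic schedule: CLM for the first part of the budget, MLM afterwards."""
    return "clm" if step < int(total_steps * clm_fraction) else "mlm"
```

In this sketch the CLM pair supplies a target at every position, whereas the MLM pair supplies targets only at the masked positions; the best split between the two phases is an empirical question that the sketch does not answer.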

Title: La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America

Authors: María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo Martínez, Gonzalo Santamaría, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, María Teresa Martín-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, María Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.00999
Pdf URL: https://arxiv.org/pdf/2507.00999
Copy Paste: [[2507.00999]] La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America(https://arxiv.org/abs/2507.00999)
Keywords: language model, llm
Abstract: Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
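Summary: Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard for evaluating generative LLMs in the languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, and showcases the evaluation results of 50 models. To encourage the community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than are typically found in the literature, aiming to reduce environmental impact and to make reproducible results accessible to a broader research community.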

Title: SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Authors: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01001
Pdf URL: https://arxiv.org/pdf/2507.01001
Copy Paste: [[2507.01001]] SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks(https://arxiv.org/abs/2507.01001)
Keywords: chat
Abstract: We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
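Summary: We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse and aligned with real-world literature needs, and that participating researchers show strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research on model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures how accurately models judge answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.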
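
The meta-evaluation idea described for SciArena-Eval, comparing a model judge's pairwise assessments with human votes, can be sketched as a simple agreement rate. This is a minimal illustration under assumed conventions: the function name `pairwise_agreement` and the `"A"`/`"B"` vote encoding are hypothetical, and the benchmark's actual data format and scoring may differ.

```python
def pairwise_agreement(human_votes, model_votes):
    """Fraction of comparisons where the model judge picks the same winner as the human."""
    if len(human_votes) != len(model_votes):
        raise ValueError("vote lists must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(h == m for h, m in zip(human_votes, model_votes))
    return matches / len(human_votes)


# Hypothetical votes over four response pairs; "A" or "B" names the preferred response.
human = ["A", "B", "A", "A"]
model = ["A", "B", "B", "A"]
print(pairwise_agreement(human, model))  # 0.75
```

A score of 0.5 corresponds to chance on balanced two-way comparisons, so a judge is only informative to the extent that it exceeds that level.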