2025-07-31

Title: IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Authors: Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22159
Pdf URL: https://arxiv.org/pdf/2507.22159
Copy Paste: [[2507.22159]] IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian(https://arxiv.org/abs/2507.22159)
Keywords: language model, llm
Abstract: Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff's alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.
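For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from the standard coincidence-matrix definition. The function name and the input layout (one list of labels per annotated item) are assumptions made for illustration; the abstract does not say which variant or tooling the authors used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: each inner list holds the labels assigned to
    one item by the annotators who rated it (missing ratings are omitted).
    This layout is an illustrative assumption, not the dataset's format.
    """
    # Coincidence matrix: each ordered pair of labels from two different
    # annotators on the same item contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(range(m), 2):
            coincidences[(labels[a], labels[b])] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values indicate systematic disagreement.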

Title: Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Authors: Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22168
Pdf URL: https://arxiv.org/pdf/2507.22168
Copy Paste: [[2507.22168]] Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles(https://arxiv.org/abs/2507.22168)
Keywords: language model, llm, prompt
Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.

Title: A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Authors: Adam M. Morgan, Adeen Flinker
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22187
Pdf URL: https://arxiv.org/pdf/2507.22187
Copy Paste: [[2507.22187]] A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models(https://arxiv.org/abs/2507.22187)
Keywords: language model, llm
Abstract: We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we have it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.

Title: How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?

Authors: Christian Clark, Byung-Doh Oh, William Schuler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22209
Pdf URL: https://arxiv.org/pdf/2507.22209
Copy Paste: [[2507.22209]] How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?(https://arxiv.org/abs/2507.22209)
Keywords: language model
Abstract: Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model's probability distribution over a word's first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
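The gap between the two estimators can be seen on a toy example. The sketch below assumes a hypothetical next-word distribution in which each word is a tuple of subword tokens (WORD_PROBS and both function names are invented for illustration). It compares the entropy of the first-token marginal with a Monte Carlo estimate of word entropy obtained by sampling whole words and averaging their negative log probability. A real setup would sample continuations token by token from a language model instead of reading from a table.

```python
import math
import random
from collections import defaultdict

# Hypothetical next-word distribution in some context.
# Two of the four words share the first subword token "un".
WORD_PROBS = {
    ("un", "happy"): 0.4,
    ("un", "lucky"): 0.2,
    ("sad",): 0.3,
    ("glad",): 0.1,
}


def first_token_entropy(word_probs):
    """Entropy (bits) of the distribution over each word's first token."""
    first = defaultdict(float)
    for tokens, p in word_probs.items():
        first[tokens[0]] += p
    return -sum(p * math.log2(p) for p in first.values() if p > 0)


def mc_word_entropy(word_probs, num_samples=10000, seed=0):
    """Monte Carlo estimate of word entropy (bits): mean of -log2 p(word)
    over words sampled from the distribution."""
    rng = random.Random(seed)
    words = list(word_probs)
    weights = [word_probs[w] for w in words]
    samples = rng.choices(words, weights=weights, k=num_samples)
    return -sum(math.log2(word_probs[w]) for w in samples) / num_samples
```

Because the first token is a function of the word, the first-token entropy can never exceed the true word entropy; in this toy table the shared prefix "un" hides the uncertainty between the two words that start with it.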

Title: RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

Authors: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22219
Pdf URL: https://arxiv.org/pdf/2507.22219
Copy Paste: [[2507.22219]] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation(https://arxiv.org/abs/2507.22219)
Keywords: gpt
Abstract: Preference-learning methods for machine translation (MT)--such as Direct Preference Optimization (DPO)--have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals--(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy--the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
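As a rough illustration of the two reward signals described above, the sketch below combines a length-normalized negative edit distance with a semantic score supplied by the caller. The normalization, the weights alpha and beta, and the semantic_score callable (a stand-in for a COMET-style scorer) are all assumptions; the abstract does not specify how RLfR combines the two signals.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def refinement_reward(hypothesis, refinement, semantic_score,
                      alpha=1.0, beta=1.0):
    """Reward the actor for staying close to the teacher's refinement.

    `semantic_score(hypothesis, refinement)` is a hypothetical callable that
    returns a quality score; `alpha` and `beta` are illustrative weights.
    """
    hyp, ref = hypothesis.split(), refinement.split()
    norm = max(len(hyp), len(ref), 1)
    neg_edit = -edit_distance(hyp, ref) / norm  # in [-1, 0]
    return alpha * neg_edit + beta * semantic_score(hypothesis, refinement)
```

The edit-distance term is computed over whitespace tokens here for simplicity; character-level or subword-level distance would work the same way.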

Title: Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Authors: Supantho Rakshit, Adele Goldberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22286
Pdf URL: https://arxiv.org/pdf/2507.22286
Copy Paste: [[2507.22286]] Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs(https://arxiv.org/abs/2507.22286)
Keywords: language model, llm
Abstract: The usage-based constructionist (UCx) approach posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze the neural representations of the English dative constructions (Double Object and Prepositional Object) in Pythia-1.4B, using a dataset of 5000 sentence pairs systematically varied for human-rated preference strength. A macro-level geometric analysis finds that the separability between construction representations, as measured by Energy Distance or Jensen-Shannon Divergence, is systematically modulated by gradient preference strength. More prototypical exemplars of each construction occupy more distinct regions in the activation space of LLMs. These results provide strong evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of basic constructionist principles in LLMs.
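The two separability measures named in the abstract have simple textbook definitions, sketched below in plain Python. The function names are illustrative, the energy distance is the plug-in (V-statistic) estimate of 2E|X-Y| - E|X-X'| - E|Y-Y'| over two sets of vectors, and the Jensen-Shannon divergence is taken between two discrete distributions in bits. How the authors turn activations into samples or distributions is not stated in the abstract, so that step is left out.

```python
import math
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    return sum(math.dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Plug-in estimate of 2E|X-Y| - E|X-X'| - E|Y-Y'| for two point sets.

    Some references call the square root of this quantity the energy distance.
    """
    return (2 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Both measures are zero for identical inputs and grow as the two sets of representations move apart, which is what makes them usable as separability scores.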

Title: Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations

Authors: Galo Castillo-López, Gaël de Chalendar, Nasredine Semmar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22289
Pdf URL: https://arxiv.org/pdf/2507.22289
Copy Paste: [[2507.22289]] Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations(https://arxiv.org/abs/2507.22289)
Keywords: llm
Abstract: Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work we propose a hybrid approach to combine BERT and LLMs in zero- and few-shot settings to recognize intents and detect OOS utterances. Our approach leverages LLMs' generalization power and BERT's computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvement.

Title: A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Authors: Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.22337
Pdf URL: https://arxiv.org/pdf/2507.22337
Copy Paste: [[2507.22337]] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers(https://arxiv.org/abs/2507.22337)
Keywords: llm
Abstract: Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.

Title: Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

Authors: Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.22367
Pdf URL: https://arxiv.org/pdf/2507.22367
Copy Paste: [[2507.22367]] Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors(https://arxiv.org/abs/2507.22367)
Keywords: language model, llm, prompt
Abstract: Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable and often leak subconsciously through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. Modeling personality semantics with traditional superficial features has been hard, and effective cross-modal understanding has seemed out of reach. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. It also devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. Specifically, this fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set demonstrate the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method's superiority, ranking first in the Personality Assessment track. The source code will be made available at this https URL.

Title: PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs

Authors: Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22387
Pdf URL: https://arxiv.org/pdf/2507.22387
Copy Paste: [[2507.22387]] PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs(https://arxiv.org/abs/2507.22387)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) have emerged as transformative approaches in several important fields. This paper aims for a paradigm shift for patent writing by leveraging LLMs to overcome the tedious patent-filing process. In this work, we present PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. Given the first claim of a patent, we evaluate six leading LLMs -- including GPT-4 and LLaMA-3 -- under a consistent setup spanning zero-shot, few-shot, and chain-of-thought prompting strategies to generate the abstract of the patent. Our benchmark PATENTWRITER goes beyond surface-level evaluation: we systematically assess the output quality using a comprehensive suite of metrics -- standard NLP measures (e.g., BLEU, ROUGE, BERTScore), robustness under three types of input perturbations, and applicability in two downstream patent classification and retrieval tasks. We also conduct stylistic analysis to assess length, readability, and tone. Experimental results show that modern LLMs can generate high-fidelity and stylistically appropriate patent abstracts, often surpassing domain-specific baselines. Our code and dataset are open-sourced to support reproducibility and future research.

Title: Question Generation for Assessing Early Literacy Reading Comprehension

Authors: Xiaocheng Yang, Sumuk Shashidhar, Dilek Hakkani-Tur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22410
Pdf URL: https://arxiv.org/pdf/2507.22410
Copy Paste: [[2507.22410]] Question Generation for Assessing Early Literacy Reading Comprehension(https://arxiv.org/abs/2507.22410)
Keywords: language model
Abstract: Assessment of reading comprehension through content-based interactions plays an important role in the reading acquisition process. In this paper, we propose a novel approach for generating comprehension questions geared to K-2 English learners. Our method ensures complete coverage of the underlying material and adaptation to the learner's specific proficiencies, and can generate a large diversity of question types at various difficulty levels to ensure a thorough evaluation. We evaluate the performance of various language models in this framework using the FairytaleQA dataset as the source material. Ultimately, the proposed approach has the potential to become an important part of autonomous AI-driven English instructors.

Title: NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

Authors: Hyeonseok Moon, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22411
Pdf URL: https://arxiv.org/pdf/2507.22411
Copy Paste: [[2507.22411]] NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models(https://arxiv.org/abs/2507.22411)
Keywords: language model, gpt, llm, long context
Abstract: The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models' (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to fully incorporate a given context made up of only ten sentences, all of them query-relevant. In response, we introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at this https URL.

Title: AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini

Authors: Jill Walker Rettberg, Hermann Wigers
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22445
Pdf URL: https://arxiv.org/pdf/2507.22445
Copy Paste: [[2507.22445]] AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini(https://arxiv.org/abs/2507.22445)
Keywords: language model, gpt, prompt
Abstract: Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories - 50 for each of 236 countries - by sending the prompt "Write a 1500 word potential {demonym} story" to OpenAI's model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.

Title: Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Authors: Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, Mugariya Farooq, Giulia Campesan, Ruxandra Cojocaru, Yasser Djilali, Shi Hu, Iheb Chaabane, Puneesh Khanna, Mohamed El Amine Seddik, Ngoc Dung Huynh, Phuc Le Khac, Leen AlQadi, Billel Mokeddem, Mohamed Chami, Abdalgader Abubaker, Mikhail Lubinets, Kacper Piskorski, Slim Frikha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22448
Pdf URL: https://arxiv.org/pdf/2507.22448
Copy Paste: [[2507.22448]] Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance(https://arxiv.org/abs/2507.22448)
Keywords: language model, llm
Abstract: In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.

Title: What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models

Authors: Tian Yun, Chen Sun, Ellie Pavlick
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22457
Pdf URL: https://arxiv.org/pdf/2507.22457
Copy Paste: [[2507.22457]] What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models(https://arxiv.org/abs/2507.22457)
Keywords: language model, llm
Abstract: Recent work has argued that large language models (LLMs) are not "abstract reasoners", citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an "abstract reasoner", and why it matters whether LLMs fit the bill.

Title: IFEvalCode: Controlled Code Generation

Authors: Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22462
Pdf URL: https://arxiv.org/pdf/2507.22462
Copy Paste: [[2507.22462]] IFEvalCode: Controlled Code Generation(https://arxiv.org/abs/2507.22462)
Keywords: language model, llm
Abstract: Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models' ability to generate correct code versus code that precisely follows instructions.
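The decoupled scoring idea can be sketched in a few lines: each generated sample is judged separately for functional correctness and for adherence to the stated constraints, and the two rates are reported side by side. Everything below (the SampleResult record, the line-count check, the function names) is a hypothetical illustration, not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    passed_tests: bool           # functional correctness (Corr.)
    followed_instructions: bool  # constraint adherence (Instr.)


def max_lines_constraint(code, limit):
    """Hypothetical instruction check: at most `limit` non-blank lines."""
    return sum(1 for line in code.splitlines() if line.strip()) <= limit


def decoupled_scores(results):
    """Return (Corr., Instr.) as separate pass rates over all samples."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    corr = sum(r.passed_tests for r in results) / n
    instr = sum(r.followed_instructions for r in results) / n
    return corr, instr
```

Reporting the two rates separately is what exposes the gap the abstract describes: a model can score high on correctness while still ignoring constraints such as line count or style.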

Title: SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

Authors: Lei Sheng, Shuai-Shuai Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22478
Pdf URL: https://arxiv.org/pdf/2507.22478
Copy Paste: [[2507.22478]] SLM-SQL: An Exploration of Small Language Models for Text-to-SQL(https://arxiv.org/abs/2507.22478)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. We will release our dataset, model, and code to GitHub: this https URL.
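
Illustration (not from the paper): the SLM-SQL abstract mentions inference with a "corrective self-consistency approach" but does not spell it out. The sketch below shows only plain execution-based self-consistency, which is an assumption about the general idea: sample several candidate queries, run each one, and keep a query whose result set agrees with the majority. The `birds` table, the candidate queries, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def self_consistent_sql(conn, candidates):
    """Return the first candidate whose execution result is the most common one."""
    results = {sql: execute(conn, sql) for sql in candidates}
    # Each sampled candidate casts one vote for its result; failed queries do not vote.
    votes = Counter(results[s] for s in candidates if results[s] is not None)
    if not votes:
        return None
    winner, _ = votes.most_common(1)[0]
    return next(s for s in candidates if results[s] == winner)


# Hypothetical toy database and sampled candidates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE birds (name TEXT, wingspan_cm INTEGER)")
conn.executemany(
    "INSERT INTO birds VALUES (?, ?)",
    [("wren", 15), ("heron", 180), ("owl", 95)],
)
candidates = [
    "SELECT name FROM birds WHERE wingspan_cm > 90",
    "SELECT name FROM birds WHERE wingspan_cm >= 95",
    "SELECT name FROM birds WHERE wingspan > 90",      # invalid column, no vote
    "SELECT name FROM birds WHERE wingspan_cm > 100",  # minority result
]
best = self_consistent_sql(conn, candidates)
print(best)
```

Voting on execution results rather than on query text lets syntactically different but equivalent queries support each other. How the paper's "corrective" step revises candidates is not described in the abstract and is not modeled here.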

Title: CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Authors: Dongchen Li (1), Jitao Liang (1), Wei Li (1), Xiaoyu Wang (2), Longbing Cao (3), Kun Yu (4) ((1) College of Computer Science and Engineering, Northeastern University, Shenyang, China, (2) Liaoning Cancer Hospital and Institute, Shenyang, China, (3) Macquarie University, Sydney, Australia, (4) College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22533
Pdf URL: https://arxiv.org/pdf/2507.22533
Copy Paste: [[2507.22533]] CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records(https://arxiv.org/abs/2507.22533)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.

Title: A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support

Authors: Long S. T. Nguyen, Truong P. Hua, Thanh M. Nguyen, Toan Q. Pham, Nam K. Ngo, An X. Nguyen, Nghi D. M. Pham, Nghia H. Nguyen, Tho T. Quan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22542
Pdf URL: https://arxiv.org/pdf/2507.22542
Copy Paste: [[2507.22542]] A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support(https://arxiv.org/abs/2507.22542)
Keywords: language model, llm
Abstract: With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at this https URL.

Title: ControlMed: Adding Reasoning Control to Medical Language Model

Authors: Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyungmin Roh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22545
Pdf URL: https://arxiv.org/pdf/2507.22545
Copy Paste: [[2507.22545]] ControlMed: Adding Reasoning Control to Medical Language Model(https://arxiv.org/abs/2507.22545)
Keywords: language model, llm
Abstract: Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

Title: Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Authors: Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22564
Pdf URL: https://arxiv.org/pdf/2507.22564
Copy Paste: [[2507.22564]] Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs(https://arxiv.org/abs/2507.22564)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

Title: Unveiling the Influence of Amplifying Language-Specific Neurons

Authors: Inaya Rahmanisa, Lyzander Marciano Andrylie, Krisna Mahardika Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22581
Pdf URL: https://arxiv.org/pdf/2507.22581
Copy Paste: [[2507.22581]] Unveiling the Influence of Amplifying Language-Specific Neurons(https://arxiv.org/abs/2507.22581)
Keywords: llm
Abstract: Language-specific neurons in LLMs, which strongly correlate with individual languages, have been shown to influence model behavior when deactivated. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.

Title: BALSAM: A Platform for Benchmarking Arabic Large Language Models

Authors: Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov, Norah Alzahrani, Eman alBilali, Nizar Habash, Abdelrahman El-Sheikh, Muhammad Elmallah, Haonan Li, Hamdy Mubarak, Mohamed Anwar, Zaid Alyafeai, Ahmed Abdelali, Nora Altwairesh, Maram Hasanain, Abdulmohsen Al Thubaity, Shady Shehata, Bashar Alhafni, Injy Hamed, Go Inoue, Khalid Elmadani, Ossama Obeid, Fatima Haouari, Tamer Elsayed, Emad Alghamdi, Khalid Almubarak, Saied Alshahrani, Ola Aljarrah, Safa Alajlan, Areej Alshaqarawi, Maryam Alshihri, Sultana Alghurabi, Atikah Alzeghayer, Afrah Altamimi, Abdullah Alfaifi, Abdulrahman AlOsaimy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22603
Pdf URL: https://arxiv.org/pdf/2507.22603
Copy Paste: [[2507.22603]] BALSAM: A Platform for Benchmarking Arabic Large Language Models(https://arxiv.org/abs/2507.22603)
Keywords: language model, llm
Abstract: The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

Title: Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Authors: Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22608
Pdf URL: https://arxiv.org/pdf/2507.22608
Copy Paste: [[2507.22608]] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation(https://arxiv.org/abs/2507.22608)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at this https URL.
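
Illustration (not from the paper): the Language Arithmetics abstract names the Language Activation Probability Entropy (LAPE) method without defining it. The sketch below follows the usual reading of LAPE, stated here as an assumption: for each neuron, take its activation probability in each language, normalize those values into a distribution over languages, and compute the entropy. Low entropy suggests a language-specific neuron. The numbers and the function name `lape_scores` are hypothetical.

```python
import math


def lape_scores(act_prob):
    """act_prob[n][l]: fraction of tokens in language l on which neuron n fires.

    Returns one entropy per neuron; lower means more language-specific.
    """
    scores = []
    for probs in act_prob:
        total = sum(probs)
        if total == 0:
            scores.append(float("inf"))  # never active, so not a candidate
            continue
        dist = [p / total for p in probs]  # distribution over languages
        scores.append(-sum(p * math.log(p) for p in dist if p > 0))
    return scores


# Hypothetical activation probabilities for 3 neurons across 3 languages.
act_prob = [
    [0.90, 0.02, 0.01],  # fires almost only for language 0
    [0.30, 0.31, 0.29],  # language-agnostic
    [0.05, 0.60, 0.55],  # shared by languages 1 and 2
]
scores = lape_scores(act_prob)
ranked = sorted(range(len(scores)), key=scores.__getitem__)
print(ranked)  # neuron indices from most to least language-specific
```

In practice a threshold or percentile on these scores would select the neurons to manipulate; the cutoff the authors use is not given in the abstract.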

Title: Multilingual Political Views of Large Language Models: Identification and Steering

Authors: Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22623
Pdf URL: https://arxiv.org/pdf/2507.22623
Copy Paste: [[2507.22623]] Multilingual Political Views of Large Language Models: Identification and Steering(https://arxiv.org/abs/2507.22623)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases--frequently skewing toward liberal or progressive positions--key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at this https URL.
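
Illustration (not from the paper): the abstract refers to a "center-of-mass activation intervention" without detail. One common reading, assumed here, is a mean-difference steering vector: average the hidden activations collected for two contrasting sets of responses, subtract one centroid from the other, and add a scaled copy of that direction to a hidden state at inference time. All vectors, names, and the `alpha` scale below are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean (center of mass) of equally weighted vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def steering_vector(target_acts, source_acts):
    """Direction pointing from the source centroid to the target centroid."""
    t, s = centroid(target_acts), centroid(source_acts)
    return [a - b for a, b in zip(t, s)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical 2-dimensional activations for two stances.
source_acts = [[0.0, 1.0], [2.0, 1.0]]   # centroid [1.0, 1.0]
target_acts = [[3.0, 1.0], [5.0, 3.0]]   # centroid [4.0, 2.0]
direction = steering_vector(target_acts, source_acts)
shifted = steer([1.0, 1.0], direction, alpha=0.5)
print(direction, shifted)
```

Which layers are modified and how the scale is chosen are design choices the abstract does not specify.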

Title: From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Authors: Jie He, Victor Gutierrez Basulto, Jeff Z. Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22716
Pdf URL: https://arxiv.org/pdf/2507.22716
Copy Paste: [[2507.22716]] From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs(https://arxiv.org/abs/2507.22716)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: this https URL.

Title: Investigating Hallucination in Conversations for Low Resource Languages

Authors: Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22720
Pdf URL: https://arxiv.org/pdf/2507.22720
Copy Paste: [[2507.22720]] Investigating Hallucination in Conversations for Low Resource Languages(https://arxiv.org/abs/2507.22720)
Keywords: language model, gpt, llm, hallucination
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

Title: Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Authors: Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22729
Pdf URL: https://arxiv.org/pdf/2507.22729
Copy Paste: [[2507.22729]] Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning(https://arxiv.org/abs/2507.22729)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.
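
Illustration (not from the paper): the abstract lists "various aggregation techniques for token embeddings" as one adaptation strategy. The sketch below shows two standard choices, mean pooling and last-token pooling, on plain Python lists. Which techniques the authors actually compare is not stated in the abstract, so treat these as assumed examples. The toy vectors and function names are hypothetical.

```python
def mean_pool(token_vecs, mask):
    """Average the vectors of non-padding tokens (mask value 1)."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    dim = len(token_vecs[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def last_token_pool(token_vecs, mask):
    """Use the hidden state of the last non-padding token."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(token_vecs[last])


# Hypothetical hidden states for a 3-token text padded to length 4.
token_vecs = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
mean_emb = mean_pool(token_vecs, mask)
last_emb = last_token_pool(token_vecs, mask)
print(mean_emb, last_emb)
```

In a decoder-only model with causal attention, only the last token has attended to the whole input, which is one reason last-token pooling is often paired with a task-specific prompt.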

Title: Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index

Authors: Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22744
Pdf URL: https://arxiv.org/pdf/2507.22744
Copy Paste: [[2507.22744]] Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index(https://arxiv.org/abs/2507.22744)
Keywords: language model, hallucination
Abstract: Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a reward-driven fine-tuning framework that explicitly optimizes for Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight hallucination metrics like EHI.

Title: CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Authors: Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22752
Pdf URL: https://arxiv.org/pdf/2507.22752
Copy Paste: [[2507.22752]] CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset(https://arxiv.org/abs/2507.22752)
Keywords: language model, llm, prompt
Abstract: We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.

Title: Opportunities and Challenges of LLMs in Education: An NLP Perspective

Authors: Sowmya Vajjala, Bashar Alhafni, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22753
Pdf URL: https://arxiv.org/pdf/2507.22753
Copy Paste: [[2507.22753]] Opportunities and Challenges of LLMs in Education: An NLP Perspective(https://arxiv.org/abs/2507.22753)
Keywords: language model, llm
Abstract: Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios, assistance and assessment, grounding them along four dimensions: reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs, and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.

Title: MASCA: LLM based-Multi Agents System for Credit Assessment

Authors: Gautam Jajoo, Pranjal A Chitale, Saksham Agarwal
Subjects: cs.CL, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22758
Pdf URL: https://arxiv.org/pdf/2507.22758
Copy Paste: [[2507.22758]] MASCA: LLM based-Multi Agents System for Credit Assessment(https://arxiv.org/abs/2507.22758)
Keywords: llm, agent
Abstract: Recent advancements in financial problem-solving have leveraged LLMs and agent-based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule-based methods and statistical models. In this paper, we introduce MASCA, an LLM-driven multi-agent system designed to enhance credit evaluation by mirroring real-world decision-making processes. The framework employs a layered architecture where specialized LLM-based agents collaboratively tackle sub-tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision-making. We further present a signaling game theory perspective on hierarchical multi-agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM-based multi-agent systems in financial applications, particularly in credit scoring.

Title: DBLPLink 2.0 -- An Entity Linker for the DBLP Scholarly Knowledge Graph

Authors: Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22811
Pdf URL: https://arxiv.org/pdf/2507.22811
Copy Paste: [[2507.22811]] DBLPLink 2.0 -- An Entity Linker for the DBLP Scholarly Knowledge Graph(https://arxiv.org/abs/2507.22811)
Keywords: llm
Abstract: In this work, we present an entity linker for the 2025 version of DBLP's RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker with LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the "yes" token output at the penultimate layer of the LLM.
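
The re-ranking idea in the DBLPLink 2.0 abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that, for each candidate entity, an LLM has already been asked a yes/no question about whether the mention refers to that candidate, and that the raw logits for the "yes" and "no" tokens are available. The function names, the candidate identifiers, the logit values, and the restriction of the softmax to two tokens are all hypothetical, and the abstract's detail about the penultimate layer is not modeled.

```python
import math


def yes_log_prob(yes_logit: float, no_logit: float) -> float:
    """Log-probability of "yes" under a softmax over the two logits.

    Assumption: only the "yes" and "no" logits are compared; a real system
    may normalize over the full vocabulary instead.
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(yes_logit - m) + math.exp(no_logit - m))
    return yes_logit - log_z


def rerank(candidates: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort candidates by descending log-probability of the "yes" token."""
    scored = [(entity, yes_log_prob(y, n)) for entity, (y, n) in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (yes_logit, no_logit) pairs for three candidate entities.
candidates = {
    "candidate_a": (2.0, 0.5),
    "candidate_b": (-1.0, 1.0),
    "candidate_c": (0.0, 0.0),
}

for entity, score in rerank(candidates):
    print(f"{entity}: {score:.3f}")
```

Because the ranking needs only one score per candidate, this style of linker requires no task-specific training, which is consistent with the zero-shot framing in the abstract.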

Title: Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Authors: Weijia Zhang, Songgaojun Deng, Evangelos Kanoulas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22829
Pdf URL: https://arxiv.org/pdf/2507.22829
Copy Paste: [[2507.22829]] Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization(https://arxiv.org/abs/2507.22829)
Keywords: agent
Abstract: Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed cyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.

Title: Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

Authors: Kwesi Cobbina, Tianyi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22887
Pdf URL: https://arxiv.org/pdf/2507.22887
Copy Paste: [[2507.22887]] Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning(https://arxiv.org/abs/2507.22887)
Keywords: language model, llm, prompt
Abstract: In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL's performance can be sensitive to the choices of demos and their order. This paper investigates an unexplored new positional bias of ICL for the first time: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS' POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos' position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
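
The abstract names the two DPP metrics but does not define them, so the sketch below is one plausible reading rather than the paper's definitions. It assumes ACCURACY-CHANGE is the difference in accuracy between two demo placements and PREDICTION-CHANGE is the fraction of examples whose predicted label differs between them. The function names and the toy labels are hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(preds) != len(gold):
        raise ValueError("preds and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def accuracy_change(
    base_preds: Sequence[str], moved_preds: Sequence[str], gold: Sequence[str]
) -> float:
    """Assumed definition: net accuracy gain (or loss) after moving the demos."""
    return accuracy(moved_preds, gold) - accuracy(base_preds, gold)


def prediction_change(base_preds: Sequence[str], moved_preds: Sequence[str]) -> float:
    """Assumed definition: share of examples whose prediction flips."""
    if len(base_preds) != len(moved_preds):
        raise ValueError("prediction lists must have the same length")
    if not base_preds:
        raise ValueError("need at least one example")
    return sum(b != m for b, m in zip(base_preds, moved_preds)) / len(base_preds)


# Toy data: two predictions flip, one for the better and one for the worse.
gold = ["A", "B", "A", "B"]
base = ["A", "A", "A", "B"]   # hypothetical outputs with demos in one position
moved = ["A", "B", "B", "B"]  # hypothetical outputs with demos in another

print(accuracy_change(base, moved, gold))  # 0.0
print(prediction_change(base, moved))      # 0.5
```

The toy case shows why two metrics are useful under this reading: half of the predictions flip while accuracy stays the same, which mirrors the abstract's observation that a placement can change many predictions without improving correctness.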