2025-07-03

Title: MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered

Authors: Imran Mirza, Cole Huang, Ishwara Vasista, Rohan Patil, Asli Akalin, Sean O'Brien, Kevin Zhu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2507.01019
Pdf URL: https://arxiv.org/pdf/2507.01019
Copy Paste: [[2507.01019]] MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered(https://arxiv.org/abs/2507.01019)
Keywords: language model, llm, agent
Abstract: Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.

Title: Event-based evaluation of abstractive news summarization

Authors: Huiling You, Samia Touileb, Erik Velldal, Lilja Øvrelid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01160
Pdf URL: https://arxiv.org/pdf/2507.01160
Copy Paste: [[2507.01160]] Event-based evaluation of abstractive news summarization(https://arxiv.org/abs/2507.01160)
Keywords: language model
Abstract: An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of summaries automatically generated by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both event annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.
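The event-overlap idea in this abstract can be illustrated with a small sketch. It assumes each event is reduced to a hashable (trigger, type) tuple and that overlap is reported as precision, recall, and F1; both choices are illustrative assumptions, since the abstract does not specify the event representation or the exact scores used.

```python
def overlap_scores(predicted_events, reference_events):
    """Precision, recall and F1 over two collections of events.

    Events are compared by exact match; the (trigger, type) tuple
    format used below is a hypothetical simplification.
    """
    predicted = set(predicted_events)
    reference = set(reference_events)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up events extracted from a generated and a reference summary.
generated = [("attack", "Conflict"), ("meet", "Contact")]
reference = [("attack", "Conflict"), ("elect", "Personnel")]
precision, recall, f1 = overlap_scores(generated, reference)
```

The same function can be applied between a generated summary and the source article, which gives the second comparison the abstract mentions.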

Title: GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant

Authors: Michał Matak, Jarosław A. Chudziak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01259
Pdf URL: https://arxiv.org/pdf/2507.01259
Copy Paste: [[2507.01259]] GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant(https://arxiv.org/abs/2507.01259)
Keywords: language model, gpt, llm, agent
Abstract: In this paper we discuss the capability of large language models to ground their answers and provide proper references when dealing with the legal matters of a non-English- and non-Chinese-speaking country. We discuss the history of legal information retrieval, the difference between case law and statute law, and its impact on legal tasks, and we analyze the latest research in this field. Building on that background, we introduce gAIus, the architecture of a cognitive LLM-based agent whose responses are based on knowledge retrieved from a specific legal act, the Polish Civil Code. We propose a retrieval mechanism that is more explainable and human-friendly than embedding-based approaches and achieves better results. To evaluate our method, we create a dedicated dataset based on single-choice questions from entrance exams for law apprenticeships conducted in Poland. The proposed architecture substantially boosted the abilities of the large language models used, improving gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o, and lifting the gpt-4o-mini score from 31% to 86%. At the end of the paper we outline possible future research directions and potential applications of our findings.

Title: Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening

Authors: Cindy Lie Tabuse, David Restepo, Carolina Gracitelli, Fernando Korn Malerbi, Caio Regatieri, Luis Filipe Nakayama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01278
Pdf URL: https://arxiv.org/pdf/2507.01278
Copy Paste: [[2507.01278]] Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening(https://arxiv.org/abs/2507.01278)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen's kappa. McNemar's test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
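The gap between accuracy and Cohen's kappa reported in this abstract (for example, about 78% accuracy with kappa below 0.03 for glaucoma referral) is typical of imbalanced labels. The sketch below computes kappa from its standard definition on made-up referral labels; the data are purely illustrative and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 2 of 10 cases need referral.
reference = ["refer", "refer", "no", "no", "no", "no", "no", "no", "no", "no"]
predicted = ["refer", "no", "no", "no", "no", "no", "no", "no", "no", "refer"]
accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
kappa = cohens_kappa(reference, predicted)
```

Here accuracy is 0.8 while kappa is only 0.375, because most of the agreement comes from the majority "no" class.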

Title: Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization

Authors: Juan Chen, Baolong Bi, Wei Zhang, Jingyan Sui, Xiaofei Zhu, Yuanzhuo Wang, Lingrui Mei, Shenghua Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01281
Pdf URL: https://arxiv.org/pdf/2507.01281
Copy Paste: [[2507.01281]] Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization(https://arxiv.org/abs/2507.01281)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses. We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence. CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidence to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources. To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers. Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.

Title: Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

Authors: Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, Sewon Min
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.01297
Pdf URL: https://arxiv.org/pdf/2507.01297
Copy Paste: [[2507.01297]] Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks(https://arxiv.org/abs/2507.01297)
Keywords: retrieval-augmented generation, agent
Abstract: Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single node. The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B to 70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems, all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
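The second insight in this abstract, a cheap approximate pass followed by exact search over the survivors, can be sketched as a two-stage lookup. The sketch below is not CompactDS: the coarse stage uses sign-bit codes and Hamming distance as a stand-in for a real ANN index, everything lives in memory, and all names (`two_stage_search`, `n_candidates`) are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign_code(v):
    """Compress a vector to one bit per dimension (coarse stand-in for ANN)."""
    return tuple(x >= 0 for x in v)


def hamming(c1, c2):
    return sum(a != b for a, b in zip(c1, c2))


def two_stage_search(query, vectors, n_candidates, k):
    """Coarse candidate selection, then exact inner-product reranking."""
    codes = [sign_code(v) for v in vectors]  # would be precomputed in practice
    q_code = sign_code(query)
    # Stage 1: cheap approximate filter over compact codes.
    coarse = sorted(range(len(vectors)),
                    key=lambda i: hamming(q_code, codes[i]))[:n_candidates]
    # Stage 2: exact scoring on the full vectors of the candidates only.
    exact = sorted(coarse, key=lambda i: dot(query, vectors[i]), reverse=True)
    return exact[:k]


vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.3, -0.1],
    [-0.7, 0.2, 0.1],
    [0.1, -0.9, 0.4],
]
query = [1.0, 0.2, 0.0]
top = two_stage_search(query, vectors, n_candidates=3, k=2)
```

The trade-off is controlled by `n_candidates`: a larger pool raises recall at the cost of more exact comparisons, which in a disk-backed setting means more reads.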

Title: La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Authors: Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01299
Pdf URL: https://arxiv.org/pdf/2507.01299
Copy Paste: [[2507.01299]] La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation(https://arxiv.org/abs/2507.01299)
Keywords: language model, llm
Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
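A minimal sketch of the rotate-then-Top-K idea described in this abstract follows. It relies on the identity W x = (W Q^T)(Q x) for any orthogonal Q, so the rotation can be folded into the weights without changing the dense output. The normalized Hadamard matrix used for Q is an assumption made for illustration; the abstract does not say how LaRoSA obtains its layerwise rotations.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def top_k_mask(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


# Orthogonal rotation (normalized 4x4 Hadamard matrix; illustrative choice).
Q = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, -0.5, 0.5]]

W = [[1.0, 2.0, 0.0, -1.0],
     [0.5, 0.0, 1.0, 1.0]]
x = [1.0, 1.1, 0.9, 1.0]  # entries of similar size: hard to prune directly

W_rot = matmul(W, transpose(Q))       # fold Q^T into the weights offline
x_rot = matvec(Q, x)                  # rotate the activation at run time
dense = matvec(W, x)
rotated_dense = matvec(W_rot, x_rot)  # equals dense up to rounding

x_sparse = top_k_mask(x_rot, k=2)     # fixed 50% sparsity in the rotated basis
approx = matvec(W_rot, x_sparse)
```

Because Top-K keeps a fixed number of entries per vector, the sparsity level is constant by construction, which is the property the abstract contrasts with threshold-based magnitude pruning.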

Title: Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Authors: Nifu Dan, Yujun Cai, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01334
Pdf URL: https://arxiv.org/pdf/2507.01334
Copy Paste: [[2507.01334]] Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs(https://arxiv.org/abs/2507.01334)
Keywords: language model, llm, prompt
Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.

Title: LEDOM: An Open and Fundamental Reverse Language Model

Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01335
Pdf URL: https://arxiv.org/pdf/2507.01335
Copy Paste: [[2507.01335]] LEDOM: An Open and Fundamental Reverse Language Model(https://arxiv.org/abs/2507.01335)
Keywords: language model
Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.
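Two ideas from this abstract can be sketched in a few lines: building previous-token prediction examples by reversing a sequence, and reranking forward-model candidates with a score from a reverse model. The scoring function and the weighted combination below are hypothetical placeholders; the abstract does not describe how Reverse Reward combines the two scores.

```python
def reverse_tokens(tokens):
    return list(reversed(tokens))


def previous_token_pairs(tokens):
    """(context, target) pairs for previous-token prediction.

    A standard left-to-right training loop over the reversed sequence
    predicts, at each step, the token that precedes the context in the
    original order.
    """
    rev = reverse_tokens(tokens)
    return [(rev[:i], rev[i]) for i in range(1, len(rev))]


def rerank(candidates, forward_scores, reverse_scorer, weight=0.5):
    """Pick the candidate with the best mix of forward and reverse scores."""
    combined = [
        (1 - weight) * f + weight * reverse_scorer(reverse_tokens(c))
        for c, f in zip(candidates, forward_scores)
    ]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best], combined


def toy_reverse_scorer(reversed_tokens):
    # Hypothetical stand-in for a reverse LM log-likelihood.
    return 0.0 if reversed_tokens[0] == "4" else -2.0


pairs = previous_token_pairs(["a", "b", "c"])
candidates = [["2", "+", "2", "=", "5"], ["2", "+", "2", "=", "4"]]
forward_scores = [-1.0, -1.2]  # made-up forward log-probabilities
best, combined = rerank(candidates, forward_scores, toy_reverse_scorer)
```

In this toy case the forward model slightly prefers the wrong answer, and the reverse score flips the ranking.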

Title: Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Authors: Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01352
Pdf URL: https://arxiv.org/pdf/2507.01352
Copy Paste: [[2507.01352]] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy(https://arxiv.org/abs/2507.01352)
Keywords: language model
Abstract: Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.

Title: LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Authors: Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01449
Pdf URL: https://arxiv.org/pdf/2507.01449
Copy Paste: [[2507.01449]] LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation(https://arxiv.org/abs/2507.01449)
Keywords: llm
Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD aim to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner, in order to further reduce the drafting overhead and significantly ease deployment and application. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, and can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at this https URL.
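The two steps listed in this abstract can be sketched as follows, under one reading of the method: the top-ranked entries of the last logit vector serve as guesses for the next next token, and the context is searched for the bigram (next token, guess) so that the match is more specific than a single-token lookup. Function names, the fallback behaviour, and the toy data are assumptions, not the authors' implementation.

```python
def top_k_ids(logits, k):
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def retrieve_draft(context, next_token, last_logits, k=3, draft_len=3):
    """Retrieve draft tokens that follow `next_token` in the context.

    Step 1: take the top-k ids of the last logits as guesses for the
            next next token.
    Step 2: look for the bigram (next_token, guess) in the context and
            return the tokens that follow next_token at that position.
    Falls back to a single-token match if no bigram is found.
    """
    for guess in top_k_ids(last_logits, k):
        for i in range(len(context) - 1):
            if context[i] == next_token and context[i + 1] == guess:
                return context[i + 1:i + 1 + draft_len]
    for i in range(len(context) - 1):
        if context[i] == next_token:
            return context[i + 1:i + 1 + draft_len]
    return []


# Toy token ids; token 5 occurs twice with different continuations.
context = [5, 7, 9, 2, 5, 8, 1, 4]
last_logits = [0.0] * 10
last_logits[5] = 3.0  # decoded next token
last_logits[8] = 2.0  # strongest remaining guess
last_logits[7] = 1.0
draft = retrieve_draft(context, next_token=5, last_logits=last_logits)
```

A plain single-token lookup on token 5 would return the first occurrence's continuation; the bigram guided by the second-ranked logit selects the later occurrence instead. In a full system the target model would then verify the draft in parallel.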

Title: Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities

Authors: Yingqiang Gao, Kaede Johnson, David Froehlich, Luisa Carrer, Sarah Ebling
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01479
Pdf URL: https://arxiv.org/pdf/2507.01479
Copy Paste: [[2507.01479]] Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities(https://arxiv.org/abs/2507.01479)
Keywords: language model, llm
Abstract: Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in generative AI, especially large language models (LLMs), have substantially improved the quality of machine-generated text simplifications, thereby mitigating information barriers for the target group. However, existing LLM-based ATS systems do not incorporate preference feedback on text simplifications during training, resulting in a lack of personalization tailored to the specific needs of target group representatives. In this work, we extend the standard supervised fine-tuning (SFT) approach for adapting LLM-based ATS models by leveraging a computationally efficient LLM alignment technique: direct preference optimization (DPO). Specifically, we post-train LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences on paired text simplifications generated by mainstream LLMs. Furthermore, we propose a pipeline for developing personalized LLM-based ATS systems, encompassing data collection, model selection, SFT and DPO post-training, and evaluation. Our findings underscore the necessity of active participation by members of the target group in designing personalized AI accessibility solutions aligned with human expectations. This work represents a step towards personalizing inclusive AI systems at the target-group level, incorporating insights not only from text simplification experts but also from members of the target group themselves.

Title: Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing

Authors: Álvaro Zaera, Diana Nicoleta Popa, Ivan Sekulic, Paolo Rosso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01541
Pdf URL: https://arxiv.org/pdf/2507.01541
Copy Paste: [[2507.01541]] Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing(https://arxiv.org/abs/2507.01541)
Keywords: language model, llm
Abstract: Out-of-scope (OOS) intent detection is a critical challenge in task-oriented dialogue systems (TODS), as it ensures robustness to unseen and ambiguous queries. In this work, we propose a novel but simple modular framework that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection. The first step applies uncertainty estimation to the output of an in-scope intent detection classifier, which is currently deployed in a real-world TODS handling tens of thousands of user interactions daily. The second step then leverages an emerging LLM-based approach, where a fine-tuned LLM is triggered to make a final decision on instances with high uncertainty. Unlike prior approaches, our method effectively balances computational efficiency and performance, combining traditional approaches with LLMs and yielding state-of-the-art results on key OOS detection benchmarks, including real-world OOS data acquired from a deployed TODS.
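The two-step routing described in this abstract can be sketched as below. Normalized softmax entropy is used as the uncertainty estimate and the fine-tuned LLM is represented by a caller-supplied function; both are assumptions, since the abstract does not name the uncertainty measure or the LLM interface.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by the maximum log(n)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def route(logits, intents, llm_fallback, threshold=0.5):
    """Return the classifier's intent unless its uncertainty is high.

    `llm_fallback` is a hypothetical callable standing in for the
    fine-tuned LLM that makes the final in-scope / out-of-scope call.
    """
    probs = softmax(logits)
    if normalized_entropy(probs) > threshold:
        return llm_fallback()
    return intents[max(range(len(probs)), key=lambda i: probs[i])]


intents = ["book_flight", "cancel_booking", "check_status"]
confident = route([8.0, 0.0, 0.0], intents, llm_fallback=lambda: "oos")
uncertain = route([1.0, 1.0, 1.0], intents, llm_fallback=lambda: "oos")
```

Only the uncertain query reaches the LLM, which is how this kind of design keeps the expensive model off the common path.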

Title: Is External Information Useful for Stance Detection with LLMs?

Authors: Quang Minh Nguyen, Taegyoon Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01543
Pdf URL: https://arxiv.org/pdf/2507.01543
Copy Paste: [[2507.01543]] Is External Information Useful for Stance Detection with LLMs?(https://arxiv.org/abs/2507.01543)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation of how external information from Wikipedia and web search can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs' tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at this https URL.

Title: Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation

Authors: Shutong Feng, Hsien-chin Lin, Nurul Lubis, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Renato Vukovic, Milica Gašić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01594
Pdf URL: https://arxiv.org/pdf/2507.01594
Copy Paste: [[2507.01594]] Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation(https://arxiv.org/abs/2507.01594)
Keywords: language model, llm, agent
Abstract: Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisational as well as emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning with both short-term (user sentiment) and long-term (task success) rewards. Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.
摘要：面向任务的对话（TOD）系统旨在通过自然语言互动来帮助用户实现特定的目标。尽管大型语言模型（LLM）的最新进展显着提高了语言流利性和上下文理解，但建立有效和情感智能的TOD系统仍然是一个复杂的挑战。有效的TOD系统必须在任务成功，情感理解和响应能力以及精确的信息输送方面进行优化，所有这些都在固有的嘈杂和模棱两可的对话环境中。在这项工作中，我们研究了TOD系统的建筑，代表性，优化以及情感上的考虑。我们设置了涵盖这些设计注意事项的系统，该系统具有充满挑战的评估环境，该评估环境由自然语言用户模拟器和不完美的自然语言理解模块组成。我们提出了LUSTER，这是一个基于LLM的面向任务对话统一系统，采用端到端强化学习，并同时使用短期（用户情感）和长期（任务成功）奖励。我们的发现表明，将LLM功能与结构化奖励建模相结合会导致更具弹性和情感响应的TOD系统，为下一代对话代理提供了实用的前进道路。

Title: Chart Question Answering from Real-World Analytical Narratives

Authors: Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01627
Pdf URL: https://arxiv.org/pdf/2507.01627
Copy Paste: [[2507.01627]] Chart Question Answering from Real-World Analytical Narratives(https://arxiv.org/abs/2507.01627)
Keywords: language model, gpt
Abstract: We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a significant performance gap, with GPT-4.1 achieving an accuracy of 69.3%, underscoring the challenges posed by this more authentic CQA setting.
摘要：我们提出了一个新数据集，用于根据可视化笔记本构建的图表问答（CQA）。该数据集具有现实世界中的多视图图表，以及基于分析叙事的自然语言问题。与先前的基准不同，我们的数据反映了生态上有效的推理工作流程。基准测试最先进的多模式大型语言模型揭示了显着的性能差距，GPT-4.1的准确性为69.3％，强调了这种更真实的CQA设置所带来的挑战。

Title: Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Authors: Georgii Levtsov, Dmitry Ustalov
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.01633
Pdf URL: https://arxiv.org/pdf/2507.01633
Copy Paste: [[2507.01633]] Confidence and Stability of Global and Pairwise Scores in NLP Evaluation(https://arxiv.org/abs/2507.01633)
Keywords: language model
Abstract: With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at this https URL under a permissive license.
摘要：随着强大的指导调整神经语言模型的出现，自然语言处理（NLP）的基准测试正在越来越多地转移到成对比较的排行榜上，例如LMSYS竞技场，例如传统的Global Cointwise Coess（例如，Glue，Glue，Big-Bench，Big-Bench，Swe-Bench）。本文经验研究了全球得分和成对比较的优势和劣势，以帮助决策在选择适当的模型评估策略时。通过使用标准的全局指标以及流行的Bradley-Terry模型进行成对比较的计算实验，我们发现，尽管全球得分提供了更可靠的总体排名，但它们可以低估具有罕见，重大错误或置信度较低的强大模型。相反，成对比较对于确定全球得分较低的模型之间的强大竞争者特别有效，尤其是在难以定义质量指标（例如，文本生成）的情况下，如果他们需要更多的比较以汇聚，如果频繁进行纽带。我们的代码和数据可在此HTTPS URL获得允许许可证。
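
Note on the method: the Bradley-Terry model mentioned in the abstract above assigns each model i a positive strength p_i such that P(i beats j) = p_i / (p_i + p_j). A minimal Python sketch of fitting these strengths with the standard minorization-maximization update follows. The win counts are hypothetical, ties are not modelled, and this is not the authors' code.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(i, j)] is the number of times item i beat item j.
    Returns a list of strengths normalised to sum to 1.
    """
    p = [1.0 / n_items] * n_items
    total_wins = [0.0] * n_items
    for (i, _j), w in wins.items():
        total_wins[i] += w

    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = 0.0
            for j in range(n_items):
                if i == j:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            # MM step: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            new_p.append(total_wins[i] / denom if denom else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical pairwise outcomes among three models (assumption for illustration).
toy_wins = {
    (0, 1): 8, (1, 0): 2,
    (0, 2): 9, (2, 0): 1,
    (1, 2): 7, (2, 1): 3,
}
strengths = fit_bradley_terry(toy_wins, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
print(ranking, [round(s, 3) for s in strengths])
```

The update assumes every model has at least one win; a model with zero wins would be driven to strength 0, which is one reason practical leaderboards add regularisation or a tie model.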

Title: Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings

Authors: Rifki Afina Putri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01645
Pdf URL: https://arxiv.org/pdf/2507.01645
Copy Paste: [[2507.01645]] Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings(https://arxiv.org/abs/2507.01645)
Keywords: language model
Abstract: In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model's prior exposure to the language, either directly or through a related language.
摘要：在本文中，我们通过情感分析的任务调查了预训练的语言模型向低资源印尼本地语言的转移性。我们使用不同类型的模型在十种本地语言上评估零拍摄性能和基于适配器的转移：单语印尼BERT，MBERT和XLM-R等多语言模型，以及一种基于模块化适配器的方法MAD-X。为了更好地理解模型行为，我们将目标语言分为三类：看到（包括在培训期间），部分看到（不包括在语言上与所见语言相关），并且看不见（在预训练数据中不存在且不相关）。我们的结果揭示了这些组之间的明显绩效差异：多语言模型在可见的语言上表现最佳，在部分看见的语言上，而在看不见的语言上表现不佳。我们发现MAD-X可以显着提高性能，尤其是对于看到和部分看到的语言，而无需在目标语言中标记数据。此外，我们对令牌化进行了进一步的分析，并表明，虽然子字片段化和词汇与印度尼西亚人重叠与预测质量弱相关，但它们并不能完全解释观察到的性能。取而代之的是，转移成功的最一致的预测指标是该模型先前直接或通过相关语言接触该语言。

Title: AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness

Authors: Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Zhen Ye, Guang Chen, Zhiyong Huang, Jing Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01702
Pdf URL: https://arxiv.org/pdf/2507.01702
Copy Paste: [[2507.01702]] AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness(https://arxiv.org/abs/2507.01702)
Keywords: language model, llm, agent
Abstract: The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at this https URL.
摘要：社交媒体时代多模式模因的扩散要求多模式大语模型（MLLMS）有效地了解模因有害。现有用于评估有害模因理解的MLLM的基准测试取决于使用静态数据集的基于准确的模型无关的评估。随着在线模因动态发展，这些基准测试的能力有限。为了解决这个问题，我们提出了Adammeme，这是一个灵活的，基于代理的评估框架，可自适应地探测MLLM在解密模因危害中的推理能力。通过多代理协作，Adammeme通过迭代地使用具有挑战性的样本更新模因数据，从而提供了全面的评估，从而揭示了MLLMS如何解释有害性的特定限制。广泛的实验表明，我们的框架系统地揭示了不同目标MLLM的不同性能，从而对模型特异性弱点进行了深入的细粒度分析。我们的代码可在此HTTPS URL上找到。

Title: Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach

Authors: Aditya Tomar, Rudra Murthy, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01715
Pdf URL: https://arxiv.org/pdf/2507.01715
Copy Paste: [[2507.01715]] Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach(https://arxiv.org/abs/2507.01715)
Keywords: language model
Abstract: Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.
摘要：语言模型中的偏见和刻板印象会造成伤害，尤其是在敏感领域，例如内容适度和决策。本文通过探讨如何共同学习这些任务增强模型性能来解决偏见和刻板印象检测。我们介绍了立体恐惧症，这是一个独特的数据集，该数据集标记为偏见和刻板印象检测：宗教，性别，社会经济地位，种族，职业和其他数据集，从而对其关系进行了更深入的研究。我们的实验比较了使用Qlora的仅编码模型和仅进行微调解码器模型。虽然仅编码模型的性能很好，但仅解码器模型也显示出竞争性结果。至关重要的是，与分别训练它们相比，对偏差和刻板印象检测的联合培训显着改善了偏置检测。通过情感分析进行的其他实验证实，改进源于偏见与刻板印象之间的联系，而不是单独的多任务学习。这些发现突出了利用刻板印象信息来构建更公平，更有效的AI系统的价值。

Title: LLMs for Legal Subsumption in German Employment Contracts

Authors: Oliver Wardas, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01734
Pdf URL: https://arxiv.org/pdf/2507.01734
Copy Paste: [[2507.01734]] LLMs for Legal Subsumption in German Employment Contracts(https://arxiv.org/abs/2507.01734)
Keywords: language model, llm
Abstract: Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as "valid," "unfair," or "void" under three legal context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (referred to as examination guidelines). Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80%. Despite these advancements, LLMs' performance when using full-text sources remains substantially below that of human lawyers. We contribute an extended dataset, including examination guidelines, referenced legal sources, and corresponding annotations, alongside our code and all log files. Our findings highlight the potential of LLMs to assist lawyers in contract legality review while also underscoring the limitations of the methods presented.
摘要：法律工作以其文本繁重和资源密集的性质为特征，为NLP研究带来了独特的挑战和机会。尽管数据驱动的方法已经提高了该领域，但他们缺乏可解释性和可信度限制了它们在动态法律环境中的适用性。为了解决这些问题，我们与法律专家合作，扩展了现有的数据集，并探讨了大型语言模型（LLMS）的使用和内部文化学习以评估德国就业合同中条款的合法性。我们的工作评估了不同LLM将条款分类为“有效”，“不公平”或“无效”的能力：没有法律背景，没有法律背景，法律和法院裁决的全文来源以及这些版本的蒸馏版本（称为考试指南）。结果表明，全文来源适度提高了性能，而考试指南显着增强了Void子句和加权F1分数的回忆，达到80％。尽管取得了这些进步，但使用全文资源时，LLMS的性能仍大大低于人类律师的效果。我们贡献了一个扩展的数据集，包括检查指南，参考法律资源和相应的注释，以及我们的代码和所有日志文件。我们的发现凸显了LLMS协助律师进行合同合法性审查的潜力，同时也强调了提出的方法的局限性。

Title: MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Authors: Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Meng Fang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01785
Pdf URL: https://arxiv.org/pdf/2507.01785
Copy Paste: [[2507.01785]] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining(https://arxiv.org/abs/2507.01785)
Keywords: language model, llm
Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines such as QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
摘要：数据质量是大语言模型性能的关键驱动力，但现有的基于模型的选择方法几乎完全关注英语。我们介绍了一个可扩展的框架，该框架将高质量的英语数据质量信号转移到了17种目标语言的单个评估者中。通过成对比较来掩盖多个英语“评估者”，以学习统一的文档质量分数，然后通过翻译通过翻译来投射这些判断，以培训单语，交叉语言和平行文本对的多语言评估者。应用于Web数据，选择了平衡的英语和多语言内容子集，以预见1.2 B参数Llama模型。与包括Qurater，AskLlm，DCLM等强大基线相比，我们的方法提高了英语基准和多语言评估的平均准确性，并且在知识密集型任务方面尤其较大。我们进一步分析了叙事材料的翻译保真度，选择偏见和代表性不足，概述了未来工作的方向。

Title: Probing Evaluation Awareness of Language Models

Authors: Jord Nguyen, Khiem Hoang, Carlo Leonardo Attubato, Felix Hofstätter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01786
Pdf URL: https://arxiv.org/pdf/2507.01786
Copy Paste: [[2507.01786]] Probing Evaluation Awareness of Language Models(https://arxiv.org/abs/2507.01786)
Keywords: language model, prompt
Abstract: Language models can distinguish between testing and deployment phases -- a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.
摘要：语言模型可以区分测试和部署阶段 - 一种称为评估意识的功能。这具有重大的安全性和政策影响，可能破坏了AI治理框架和自愿行业承诺的评估的可靠性。在本文中，我们研究了Llama-3.3-70B教学中的评估意识。我们表明，线性探针可以分开现实世界的评估和部署提示，这表明当前模型内部表示这种区别。我们还发现，当前的安全评估是通过探针正确分类的，这表明它们已经看起来是人为的或对模型不真实的。我们的发现强调了确保值得信赖的评估和理解欺骗性能力的重要性。更广泛地说，我们的工作展示了如何利用模型内部设备来支持安全审计中的黑盒方法，尤其是对于未来的模型，更有能力在评估意识和欺骗方面。

Title: How Do Vision-Language Models Process Conflicting Information Across Modalities?

Authors: Tianze Hua, Tian Yun, Ellie Pavlick
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01790
Pdf URL: https://arxiv.org/pdf/2507.01790
Copy Paste: [[2507.01790]] How Do Vision-Language Models Process Conflicting Information Across Modalities?(https://arxiv.org/abs/2507.01790)
Keywords: language model
Abstract: AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.
摘要：AI模型越来越多地是多模式，将不同的输入流集成到连贯的状态表示中，随后的行为和动作可以基于该状态。本文试图了解当输入流提供冲突的信息时，此类模型的行为如何。专门针对视觉模型，我们提供了不一致的输入（例如，狗的图像与标题“猫的照片”配对），并要求该模型报告一种特定模式中存在的信息（例如，“字幕”字幕说什么 /在图像中是什么？”）。我们发现，模型通常偏爱一种模式，例如报告图像，无论标题如何报告，但不同的模型在不同的模式上有所不同。我们发现证据表明，在模型的内部代表性结构中，行为首选的方式很明显，并且特定的注意力头可以重组表示形式，以偏爱一种模式而不是另一种方式。此外，我们发现模态性不足的“路由器头”似乎促进了有关指令中要求的模式的答案，并且可以操纵或转移，以提高跨数据集和模态性能。这项工作共同提供了识别和控制模型是否以及如何检测和解决复杂多模式环境中冲突的信号的基本步骤。

Title: Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes

Authors: Nikita Neveditsin, Pawan Lingras, Vijay Mago
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.01810
Pdf URL: https://arxiv.org/pdf/2507.01810
Copy Paste: [[2507.01810]] Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes(https://arxiv.org/abs/2507.01810)
Keywords: language model, prompt
Abstract: We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
摘要：我们对小语言模型产生的结构化输出的校准性进行了比较分析，用于从临床注释中提取开放属性值。我们评估了三种广泛使用的序列化格式：JSON，YAML和XML，发现JSON始终产生最高的ParseAbity。结构鲁棒性通过有针对性的提示和更大的模型改善，但对于更长的文档和某些音符类型而下降。我们的错误分析确定了重复格式特定的故障模式。这些发现为选择序列化格式和设计提示提供了实用的指导，以在对隐私敏感的临床环境中部署语言模型时。
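
Note on the metric: parseability, as discussed in the abstract above, can be approximated as the fraction of raw model outputs that a standard parser accepts. A minimal Python sketch for the JSON case follows. The sample strings are hypothetical, and requiring a top-level object is an assumption made here, not necessarily the paper's criterion.

```python
import json


def json_parse_rate(outputs):
    """Fraction of outputs that parse as a JSON object (top-level dict)."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed output counts as a failure
        if isinstance(parsed, dict):
            ok += 1
    return ok / len(outputs)


# Hypothetical model outputs for attribute-value extraction (illustration only).
samples = [
    '{"medication": "metformin", "dose": "500 mg"}',
    '{"medication": "metformin", "dose": "500 mg"',  # truncated: missing brace
    "{'medication': 'metformin'}",                    # single quotes are not valid JSON
    '{"allergies": ["penicillin"]}',
]
rate = json_parse_rate(samples)
print(f"JSON parse rate: {rate:.2f}")
```

The same loop shape would apply to YAML or XML by swapping in the corresponding parser and its exception type.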

Title: Low-Perplexity LLM-Generated Sequences and Where To Find Them

Authors: Arthur Wuhrmann, Anastasiia Kucherenko, Andrei Kucharavy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01844
Pdf URL: https://arxiv.org/pdf/2507.01844
Copy Paste: [[2507.01844]] Low-Perplexity LLM-Generated Sequences and Where To Find Them(https://arxiv.org/abs/2507.01844)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences - high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving the way toward a better understanding of how LLMs' training data impacts their behavior.
摘要：随着大型语言模型（LLMS）变得越来越普遍，了解特定的培训数据如何塑造其产出对于透明，问责制，隐私和公平至关重要。为了探索LLMS如何利用和复制其训练数据，我们引入了一种系统的方法，该方法旨在分析模型生成的高概率文本跨度。我们的管道可靠地提取了跨不同主题的较长序列，同时避免退化，然后将它们追溯到训练数据中的来源。令人惊讶的是，我们发现这些低衰变跨度的很大一部分无法映射到语料库中。对于那些确实匹配的人，我们量化了跨源文档中发生的分布，突出了逐字记忆的范围和性质，并铺平了一种更好地理解LLMS培训数据如何影响其行为的方法。
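
Note on the quantity: the perplexity of a token span is the exponential of the negative mean token log-probability, so spans the model generates with high confidence have perplexity close to 1. A minimal Python sketch of scanning a generation for low-perplexity windows follows. The log-probabilities, window length, and threshold are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def perplexity(logprobs):
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))


def low_perplexity_windows(logprobs, window, threshold):
    """Return (start_index, perplexity) for each sliding window at or below threshold."""
    hits = []
    for start in range(len(logprobs) - window + 1):
        ppl = perplexity(logprobs[start:start + window])
        if ppl <= threshold:
            hits.append((start, ppl))
    return hits


# Hypothetical per-token log-probabilities from a single generation.
token_logprobs = [-2.3, -0.05, -0.02, -0.01, -0.03, -1.9, -2.5]
hits = low_perplexity_windows(token_logprobs, window=4, threshold=1.1)
for start, ppl in hits:
    print(f"tokens {start}..{start + 3}: perplexity {ppl:.3f}")
```

Spans flagged this way would then be searched for in the training corpus, which is the tracing step the abstract describes.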

Title: Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

Authors: Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01853
Pdf URL: https://arxiv.org/pdf/2507.01853
Copy Paste: [[2507.01853]] Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages(https://arxiv.org/abs/2507.01853)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. The framework is open-source and publicly available at this https URL eka-eval, and is part of the ongoing EKA initiative (this https URL), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.
摘要：大型语言模型（LLMS）的快速发展加剧了对超越英语基准的评估框架的需求，并满足了语言上不同地区（如印度）的要求。我们提出了EKA-eval，这是一个统一且可生产的评估框架，该框架集成了35多个基准，包括10个特定指标的数据集，诸如推理，数学，工具使用，长篇文章理解和阅读理解的类别。与现有的印度语言评估工具相比，EKA-eval提供了更广泛的基准覆盖范围，并为分布式推理，量化和多GPU使用提供了内置支持。我们的系统比较位置EKA-Eval是针对全球和INDIC LLM量身定制的第一个端到端，可扩展的评估套件，从而大大降低了多语言基准测试的障碍。该框架是开源的，并在此HTTPS URL EKA-EVAL和正在进行的EKA计划（此HTTPS URL）的一部分中公开可用，该计划旨在扩展到100多个基准，并建立一个可靠的多项式评估生态系统的LLMS。

Title: DIY-MKG: An LLM-Based Polyglot Language Learning System

Authors: Kenan Tang, Yanhong Li, Yao Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01872
Pdf URL: https://arxiv.org/pdf/2507.01872
Copy Paste: [[2507.01872]] DIY-MKG: An LLM-Based Polyglot Language Learning System(https://arxiv.org/abs/2507.01872)
Keywords: language model, llm, prompt
Abstract: Existing language learning tools, even those powered by Large Language Models (LLMs), often lack support for polyglot learners to build linguistic connections across vocabularies in multiple languages, provide limited customization for individual learning paces or needs, and suffer from detrimental cognitive offloading. To address these limitations, we design Do-It-Yourself Multilingual Knowledge Graph (DIY-MKG), an open-source system that supports polyglot language learning. DIY-MKG allows the user to build personalized vocabulary knowledge graphs, which are constructed by selective expansion with related words suggested by an LLM. The system further enhances learning through rich annotation capabilities and an adaptive review module that leverages LLMs for dynamic, personalized quiz generation. In addition, DIY-MKG allows users to flag incorrect quiz questions, simultaneously increasing user engagement and providing a feedback loop for prompt refinement. Our evaluation of LLM-based components in DIY-MKG shows that vocabulary expansion is reliable and fair across multiple languages, and that the generated quizzes are highly accurate, validating the robustness of DIY-MKG.
摘要：现有的语言学习工具，即使是由大型语言模型（LLM）提供支持的工具，通常缺乏对多语言学习者的支持，无法用多种语言建立跨词汇的语言联系，为单个学习步伐或需求提供有限的自定义，并且受到了有害的认知卸载。为了解决这些限制，我们设计自己动手多语言知识图（DIY-MKG），这是一个支持多语言学习的开源系统。 DIY-MKG允许用户构建个性化的词汇知识图，这些词汇知识图是由LLM建议的相关单词选择性扩展构建的。该系统通过丰富的注释能力和一个自适应评论模块进一步增强学习，该模块利用LLMS进行动态，个性化的测验生成。此外，DIY-MKG允许用户标记不正确的测验问题，同时增加用户参与度并提供反馈循环以及时进行完善。我们对DIY-MKG中基于LLM的组件的评估表明，词汇扩展在多种语言之间是可靠且公平的，并且生成的测验非常准确，从而验证了DIY-MKG的鲁棒性。

Title: MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

Authors: Dongyi Ding, Tiannan Wang, Chenghao Zhu, Meiling Tao, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01887
Pdf URL: https://arxiv.org/pdf/2507.01887
Copy Paste: [[2507.01887]] MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants(https://arxiv.org/abs/2507.01887)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.
摘要：大型语言模型（LLMS）在需要长期思考序列的推理任务上表现出色，以进行计划，反思和精致。但是，对于广泛的部署，它们的大量模型规模和高计算需求是不切实际的。然而，由于能力有限，小语言模型（SLM）经常难以学习长形式的COT推理，这是我们称为“ SLMS可学习性差距”的现象。为了解决这个问题，我们介绍了Mid-CoT Teacher Assistant Distillation（MiCoTA），这是一个改善SLMS长期蒸馏的框架。 Micota采用中型模型作为教师助手，并利用中间长的COT序列来弥合容量和推理长度间隙。我们对下游任务的实验表明，尽管从大型教师中蒸馏出的SLMS可以通过应用Micota来表现不佳，但它们在推理性能方面取得了重大改善。具体而言，QWEN2.5-7B教学和QWEN2.5-3B教学的平均得分分别在AIME2024，AMC，Olympiad，Math-500和GSM8K基准的平均得分上提高了3.47和3.93。为了更好地了解Micota背后的机制，我们执行了一个定量实验，表明我们的方法产生的数据与基本SLM分布更紧密地对齐。我们的见解为对SLM的长时间数据蒸馏的未来研究铺平了道路。

Title: High-Layer Attention Pruning with Rescaling

Authors: Songtao Liu, Peng Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01900
Pdf URL: https://arxiv.org/pdf/2507.01900
Copy Paste: [[2507.01900]] High-Layer Attention Pruning with Rescaling(https://arxiv.org/abs/2507.01900)
Keywords: language model, llm
Abstract: Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines.
摘要：修剪是压缩大语言模型（LLM）的一种高效方法，可大大降低推理潜伏期。但是，常规的无培训结构化修剪方法通常采用一种启发式指标，该指标无与伦比地消除了所有修剪层的注意力，而无需考虑它们在网络体系结构中的位置。在这项工作中，我们提出了一种新颖的修剪算法，该算法从策略性地修剪了该模型较高层中的注意力。由于删除注意力头可以改变令牌表示的大小，因此我们引入了一个自适应恢复参数，该参数校准了后量后标度以抵消这种效果。我们对包括Llama3.1-8B，Mistral-7b-V0.3，Qwen2-7b和Gemma2-9b在内的各种LLM进行了全面的实验。我们的评估包括27个数据集的生成和歧视任务。结果始终表明，我们的方法的表现优于现有的结构化修剪方法。在发电任务中，这种改进尤其值得注意，在该任务中，我们的方法极大地胜过现有的基准。

Title: AI4Research: A Survey of Artificial Intelligence for Scientific Research

Authors: Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.01903
Pdf URL: https://arxiv.org/pdf/2507.01903
Copy Paste: [[2507.01903]] AI4Research: A Survey of Artificial Intelligence for Scientific Research(https://arxiv.org/abs/2507.01903)
Keywords: language model, llm
Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.
摘要：人工智能（AI）的最新进展，特别是在大型语言模型（例如OpenAI-O1和DeepSeek-R1）中，在复杂领域（例如逻辑推理和实验编码）中表现出了显着的功能。在这些进步的推动下，许多研究探讨了AI在创新过程中的应用，特别是在科学研究的背景下。这些AI技术主要旨在开发可以自主在广泛的科学学科进行研究过程的系统。尽管有这些重大的进步，但仍缺乏对AI研究的全面调查（AI4研究），这阻碍了我们的理解并阻碍了该领域的进一步发展。为了解决这一差距，我们提出了一项全面的调查，并就AI4Research提供了统一的观点。具体而言，我们工作的主要贡献如下：（1）系统分类学：我们首先引入系统的分类法，以对AI4Research中的五个主流任务进行分类。（2）新的边界：然后，我们确定了关键的研究差距，并突出了有希望的未来方向，重点是自动实验的严格性和可扩展性以及社会影响。（3）丰富的应用程序和资源：最后，我们编译了大量资源，包括相关的多学科应用程序，数据语料库和工具。我们希望我们的工作能够为研究社区提供快速获取这些资源，并在AI4Research中刺激创新的突破。

Title: Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models

Authors: Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, Qing He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.01915
Pdf URL: https://arxiv.org/pdf/2507.01915
Copy Paste: [[2507.01915]] Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models(https://arxiv.org/abs/2507.01915)
Keywords: language model, llm
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user's specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
摘要：从人类反馈（RLHF）中学习的强化已成为将大型语言模型（LLM）与人类偏好保持一致的强大技术。但是，有效地使LLM与多样化的人类偏好保持一致仍然是一个重大挑战，尤其是在冲突时。为了解决这个问题，我们将人类价值对准作为一个多目标优化问题，旨在最大化一组潜在的冲突目标。我们引入了梯度自适应政策优化（GAPO），这是一种新型的微调范式，该范式采用了多种梯度下降到具有不同偏好分布的LLM。 GAPO适应每个目标的梯度，以确定更新方向，以最佳地平衡目标之间的权衡。此外，我们介绍了P-GAPO，该P-GAPO结合了不同目标的用户偏好，并实现了与用户特定需求更好的帕累托解决方案。我们的理论分析表明，GAPO将用于多个目标的帕累托最佳解决方案收敛。 Mistral-7b的经验结果表明，GAPO的表现优于当前的最新方法，在有益和无害性方面取得了卓越的表现。
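
The abstract's idea of combining per-objective gradients into one balanced update direction can be illustrated with the classic two-objective multiple-gradient descent step. This is a minimal sketch under stated assumptions, not the paper's GAPO algorithm: it uses the textbook closed-form min-norm combination of two gradients, it does not reproduce GAPO's adaptive rescaling, and the function names `min_norm_weight` and `common_descent_direction` are hypothetical.

```python
import numpy as np

def min_norm_weight(g1, g2):
    """Weight alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||.

    Closed form for two gradients (textbook multiple-gradient descent);
    an illustrative assumption, not the GAPO update rule.
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        # Identical gradients: any convex combination gives the same vector.
        return 0.5
    alpha = float((g2 - g1) @ g2) / denom
    # Clip so the result stays a convex combination of g1 and g2.
    return min(1.0, max(0.0, alpha))

def common_descent_direction(g1, g2):
    """Min-norm convex combination of two objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return alpha * g1 + (1.0 - alpha) * g2

# Hypothetical gradients for two partially conflicting objectives,
# e.g. a "helpfulness" loss and a "harmlessness" loss.
g_help = np.array([1.0, 0.0])
g_harm = np.array([-0.5, 1.0])
d = common_descent_direction(g_help, g_harm)

# Stepping against d decreases both losses when both inner products are positive.
print(d @ g_help > 0, d @ g_harm > 0)
```

When the min-norm vector is nonzero, it has a positive inner product with both gradients, so a small step along its negative reduces both losses at once; a zero vector signals a Pareto-stationary point for the two objectives.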

Title: Decision-oriented Text Evaluation

Authors: Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.01923
Pdf URL: https://arxiv.org/pdf/2507.01923
Copy Paste: [[2507.01923]] Decision-oriented Text Evaluation(https://arxiv.org/abs/2507.01923)
Keywords: language model, llm, agent
Abstract: Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.
摘要：自然语言产生（NLG）越来越多地部署在高风险领域中，但常见的内在评估方法，例如N-gram重叠或句子合理，与实际决策效率无关。我们提出了一个面向决策的框架，用于通过直接测量其对人类和大语模型（LLM）决策结果的影响来评估生成的文本。使用市场摘要文本（包括客观的早晨摘要和主观闭幕式 - 贝尔分析）作为测试案例，我们根据人类投资者执行的贸易的财务绩效以及这些文本仅由这些文本告知的自主LLM代理商的经济绩效来评估决策质量。我们的发现表明，在仅依靠摘要时，人类和LLM代理人都没有始终超过随机的性能。但是，更丰富的分析评论使协作人类LLM团队能够显着超过个人或代理基线。我们的方法强调了评估生成的文本通过促进人类和LLM之间协同决策的能力的重要性，强调了传统内在指标的关键局限性。

Title: The Thin Line Between Comprehension and Persuasion in LLMs

Authors: Adrian de Wynter, Tangming Yuan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2507.01936
Pdf URL: https://arxiv.org/pdf/2507.01936
Copy Paste: [[2507.01936]] The Thin Line Between Comprehension and Persuasion in LLMs(https://arxiv.org/abs/2507.01936)
Keywords: language model, llm, chat, agent
Abstract: Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being rapidly deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts of their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs' ability to maintain a debate--one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence is secondary to effectiveness.
摘要：大型语言模型（LLM）非常出色，可保持高级，令人信服的对话。他们在敏感领域（例如同行评审和心理健康应用程序）中快速部署为聊天机器人和评估人员。这与他们的推理能力不同的叙述，呼吁对LLMS进行仔细检查及其对话的理解。在这项工作中，我们首先要评估LLMS保持辩论的能力，这是人类交流中最纯粹但最复杂的形式之一。然后，我们衡量这种能力如何与他们对所谈论的内容的理解有关，即他们对对话结构和务实背景的理解。我们发现LLM能够维持连贯的，有说服力的辩论，经常摇摆参与者和观众的信念。我们还注意到，意识或怀疑AI参与鼓励人们对提出的论点更加批评。然而，当对LLM对更深层的对话结构的理解时，他们无法证明该理解。我们的发现将LLMS-As-As-Eshuutors的缺点与理解背景的能力（在）能力相关联。更广泛地说，对于论证理论领域，我们认为，如果代理人可以令人信服地维持对话，那么它就不必知道它在说什么。因此，务实背景和连贯性的建模是效力的继发性。