2025-07-25

Title: Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17842
Pdf URL: https://arxiv.org/pdf/2507.17842
Copy Paste: [[2507.17842]] Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning(https://arxiv.org/abs/2507.17842)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.

Title: Dynamic and Generalizable Process Reward Modeling

Authors: Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17849
Pdf URL: https://arxiv.org/pdf/2507.17849
Copy Paste: [[2507.17849]] Dynamic and Generalizable Process Reward Modeling(https://arxiv.org/abs/2507.17849)
Keywords: language model, llm
Abstract: Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
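The abstract mentions Pareto dominance over multi-dimensional reward scores without giving details. As a rough illustration only, the sketch below shows the generic definition of Pareto dominance and how it could yield (positive, negative) pairs. The names `dominates`, `dominance_pairs`, and the example scores are hypothetical and are not taken from DG-PRM; the paper's actual estimation procedure may differ.

```python
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a Pareto-dominates b (higher is better).

    a dominates b when it is at least as good on every criterion
    and strictly better on at least one.
    """
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def dominance_pairs(scores: Dict[str, Sequence[float]]) -> List[Tuple[str, str]]:
    """List (positive, negative) pairs where the first dominates the second."""
    names = list(scores)
    return [
        (pos, neg)
        for pos in names
        for neg in names
        if pos != neg and dominates(scores[pos], scores[neg])
    ]


# Hypothetical per-step scores on three reward criteria.
steps = {
    "step_a": (0.9, 0.8, 0.7),
    "step_b": (0.6, 0.8, 0.5),
    "step_c": (0.95, 0.4, 0.6),
}
pairs = dominance_pairs(steps)
print(pairs)
```

Here only `step_a` dominates `step_b`; `step_c` trades off criteria against both others, so it forms no pair, which is why dominance gives cleaner training pairs than comparing a single summed score.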

Title: VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL

Authors: Shubham Mohole, Sainyam Galhotra
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2507.17896
Pdf URL: https://arxiv.org/pdf/2507.17896
Copy Paste: [[2507.17896]] VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL(https://arxiv.org/abs/2507.17896)
Keywords: llm, prompt
Abstract: Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This positive development has also brought forth an urgent challenge: helping users who might use these systems without a background in statistical analysis to formulate bias-free analytical questions. Although significant research has focused on text-to-SQL generation accuracy, addressing cognitive biases in analytical questions remains underexplored. We present VeriMinder, an interactive system for detecting and mitigating such analytical vulnerabilities. Our approach introduces three key innovations: (1) a contextual semantic mapping framework for biases relevant to specific analysis contexts; (2) an analytical framework that operationalizes the Hard-to-Vary principle and guides users in systematic data analysis; and (3) an optimized LLM-powered system that generates high-quality, task-specific prompts using a structured process involving multiple candidates, critic feedback, and self-reflection. User testing confirms the merits of our approach. In direct user experience evaluation, 82.5% of participants reported that the system positively impacted the quality of the analysis. In comparative evaluation, VeriMinder scored significantly higher than alternative approaches, at least 20% better on metrics of the analysis's concreteness, comprehensiveness, and accuracy. Our system, implemented as a web application, is designed to help users avoid the "wrong question" vulnerability during data analysis. The VeriMinder code base with prompts is available as MIT-licensed open-source software to facilitate further research and adoption within the community.

Title: Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text

Authors: Hulayyil Alshammari, Praveen Rao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17944
Pdf URL: https://arxiv.org/pdf/2507.17944
Copy Paste: [[2507.17944]] Evaluating the Performance of AI Text Detectors, Few-Shot and Chain-of-Thought Prompting Using DeepSeek Generated Text(https://arxiv.org/abs/2507.17944)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Large language models (LLMs) have rapidly transformed the creation of written materials. LLMs have led to questions about writing integrity, thereby driving the creation of artificial intelligence (AI) detection technologies. Adversarial attacks, such as standard and humanized paraphrasing, inhibit detectors' ability to detect machine-generated text. Previous studies have mainly focused on ChatGPT and other well-known LLMs and have shown varying accuracy across detectors. However, there is a clear gap in the literature about DeepSeek, a recently published LLM. Therefore, in this work, we investigate whether six generally accessible AI detection tools -- AI Text Classifier, Content Detector AI, Copyleaks, QuillBot, GPT-2, and GPTZero -- can consistently recognize text generated by DeepSeek. The detectors were exposed to the aforementioned adversarial attacks. We also considered DeepSeek as a detector by performing few-shot prompting and chain-of-thought reasoning (CoT) for classifying AI and human-written text. We collected 49 human-authored question-answer pairs from before the LLM era and generated matching responses using DeepSeek-v3, producing 49 AI-generated samples. Then, we applied adversarial techniques such as paraphrasing and humanizing to add 196 more samples. These were used to challenge detector robustness and assess accuracy impact. While QuillBot and Copyleaks showed near-perfect performance on original and paraphrased DeepSeek text, others -- particularly AI Text Classifier and GPT-2 -- showed inconsistent results. The most effective attack was humanization, reducing accuracy to 71% for Copyleaks, 58% for QuillBot, and 52% for GPTZero. Few-shot and CoT prompting showed high accuracy, with the best five-shot result misclassifying only one of 49 samples (AI recall 96%, human recall 100%).

Title: Are LLM Belief Updates Consistent with Bayes' Theorem?

Authors: Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17951
Pdf URL: https://arxiv.org/pdf/2507.17951
Copy Paste: [[2507.17951]] Are LLM Belief Updates Consistent with Bayes' Theorem?(https://arxiv.org/abs/2507.17951)
Keywords: language model, llm
Abstract: Do larger and more capable language models learn to update their "beliefs" about propositions more consistently with Bayes' theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes' theorem. These results have important implications for our understanding and governance of LLMs.
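For readers who want the underlying rule made concrete, here is a minimal sketch of a single Bayesian update for a binary proposition, followed by a comparison against a credence a model might report. The function name `posterior`, the numbers, and the simple absolute-gap comparison are illustrative assumptions; the abstract does not define the BCC metric, and this is not it.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' theorem for a binary hypothesis H and evidence E.

    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]
    """
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return p_e_given_h * prior / evidence


# Hypothetical numbers: a prior credence and two likelihoods.
prior = 0.3
bayes_posterior = posterior(prior, p_e_given_h=0.8, p_e_given_not_h=0.2)

# Hypothetical credence a model assigns after seeing the evidence in context.
model_posterior = 0.55
gap = abs(model_posterior - bayes_posterior)

print(f"Bayesian posterior: {bayes_posterior:.3f}")
print(f"Gap to model credence: {gap:.3f}")
```

A model whose updates were perfectly coherent with Bayes' theorem would show a gap of zero on every such item; a coherence metric aggregates this kind of agreement over a dataset.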

Title: Technical Report of TeleChat2, TeleChat2.5 and T1

Authors: Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18013
Pdf URL: https://arxiv.org/pdf/2507.18013
Copy Paste: [[2507.18013]] Technical Report of TeleChat2, TeleChat2.5 and T1(https://arxiv.org/abs/2507.18013)
Keywords: language model, gpt, chat, chain-of-thought
Abstract: We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models, T1 and TeleChat2.5, are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI's o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.

Title: NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database

Authors: Weizhi Fei, Hao Shi, Jing Xu, Jingchen Peng, Jiazheng Li, Jingzhao Zhang, Bo Bai, Wei Han, Zhenyuan Chen, Xueyan Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18028
Pdf URL: https://arxiv.org/pdf/2507.18028
Copy Paste: [[2507.18028]] NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database(https://arxiv.org/abs/2507.18028)
Keywords: language model, gpt, llm
Abstract: Efficiently editing knowledge stored in large language models (LLMs) enables model updates without large-scale training. One possible solution is Locate-and-Edit (L&E), allowing simultaneous modifications of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. In particular, our gated module only operates when inference involves the edited facts, effectively preserving the general abilities of LLMs. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFacts datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB not only excels in editing efficacy, generalization, specificity, fluency, and consistency, but also preserves overall performance across six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts (50x more than in prior work).

Title: GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.18043
Pdf URL: https://arxiv.org/pdf/2507.18043
Copy Paste: [[2507.18043]] GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs(https://arxiv.org/abs/2507.18043)
Keywords: language model, llm, hallucination
Abstract: Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model's logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus dispreferred outputs. These tokens are then used to construct directional steering vectors that capture semantic shifts from undesirable to desirable behavior. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. This enables fine-grained, interpretable, and modular control over model behavior, without retraining or auxiliary supervision. Empirically, GrAInS consistently outperforms both fine-tuning and existing steering baselines: it achieves a 13.22% accuracy gain on TruthfulQA using Llama-3.1-8B, reduces hallucination rates on MMHal-Bench from 0.624 to 0.514 with LLaVA-1.6-7B, and improves alignment win rates on SPA-VL by 8.11%, all while preserving the model's fluency and general capabilities.
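To make the attribution step more concrete, the following is a minimal sketch of plain Integrated Gradients on a toy function, approximated with a midpoint Riemann sum, followed by a top-k selection by absolute attribution. The toy function `f`, its analytic gradient `grad_f`, the all-zero baseline, and the helper names are assumptions for illustration. GrAInS applies a contrastive variant to model tokens (preferred versus dispreferred outputs), which this sketch does not reproduce.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> Vector:
    """Approximate IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a(x - b)) da."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy differentiable function standing in for a model output (hypothetical).
weights = [1.0, -2.0, 0.5]


def f(v: Vector) -> float:
    return sum(w * vi * vi for w, vi in zip(weights, v))


def grad_f(v: Vector) -> Vector:
    return [2.0 * w * vi for w, vi in zip(weights, v)]


x = [1.0, 2.0, -3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)

# Indices of the k inputs with the largest absolute attribution.
k = 2
top_k = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)[:k]

print(attributions)
print(top_k)
```

A useful sanity check is the completeness property: the attributions should sum to `f(x) - f(baseline)`, up to numerical error in the integral approximation.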

Title: Synthetic Data Generation for Phrase Break Prediction with Large Language Model

Authors: Hoyeon Lee, Sejung Son, Ye-Eun Kang, Jong-Hwan Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18044
Pdf URL: https://arxiv.org/pdf/2507.18044
Copy Paste: [[2507.18044]] Synthetic Data Generation for Phrase Break Prediction with Large Language Model(https://arxiv.org/abs/2507.18044)
Keywords: language model, llm
Abstract: Current approaches to phrase break prediction address crucial prosodic aspects of text-to-speech systems but heavily rely on vast human annotations from audio or text, incurring significant manual effort and cost. Inherent variability in the speech domain, driven by phonetic factors, further complicates acquiring consistent, high-quality data. Recently, large language models (LLMs) have shown success in addressing data challenges in NLP by generating tailored synthetic data while reducing manual annotation needs. Motivated by this, we explore leveraging LLM to generate synthetic phrase break annotations, addressing the challenges of both manual annotation and speech-related tasks by comparing with traditional annotations and assessing effectiveness across multiple languages. Our findings suggest that LLM-based synthetic data generation effectively mitigates data challenges in phrase break prediction and highlights the potential of LLMs as a viable solution for the speech domain.

Title: Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs

Authors: Tevin Atwal, Chan Nam Tieu, Yefeng Yuan, Zhan Shi, Yuhong Liu, Liang Cheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18055
Pdf URL: https://arxiv.org/pdf/2507.18055
Copy Paste: [[2507.18055]] Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs(https://arxiv.org/abs/2507.18055)
Keywords: language model, llm, prompt
Abstract: The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data provides a cost-effective, scalable alternative to real-world data to facilitate model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective), and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experiment results reveal significant limitations in LLMs' capabilities in generating diverse and privacy-preserving synthetic data. Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.

Title: TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Authors: Zehan Li, Hongjie Chen, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.18061
Pdf URL: https://arxiv.org/pdf/2507.18061
Copy Paste: [[2507.18061]] TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios(https://arxiv.org/abs/2507.18061)
Keywords: language model, llm, agent
Abstract: Spoken language models (SLMs) have seen rapid progress in recent years, along with the development of numerous benchmarks for evaluating their performance. However, most existing benchmarks primarily focus on evaluating whether SLMs can perform complex tasks comparable to those tackled by large language models (LLMs), often failing to align with how users naturally interact in real-world conversational scenarios. In this paper, we propose TELEVAL, a dynamic benchmark specifically designed to evaluate SLMs' effectiveness as conversational agents in realistic Chinese interactive settings. TELEVAL defines three evaluation dimensions: Explicit Semantics, Paralinguistic and Implicit Semantics, and System Abilities. It adopts a dialogue format consistent with real-world usage and evaluates text and audio outputs separately. TELEVAL particularly focuses on the model's ability to extract implicit cues from user speech and respond appropriately without additional instructions. Our experiments demonstrate that despite recent progress, existing SLMs still have considerable room for improvement in natural conversational tasks. We hope that TELEVAL can serve as a user-centered evaluation framework that directly reflects the user experience and contributes to the development of more capable dialogue-oriented SLMs.

Title: Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints

Authors: Haomin Qi, Zihan Dai, Chengbo Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18076
Pdf URL: https://arxiv.org/pdf/2507.18076
Copy Paste: [[2507.18076]] Hybrid and Unitary Fine-Tuning of Large Language Models: Methods and Benchmarking under Resource Constraints(https://arxiv.org/abs/2507.18076)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) remains a computational bottleneck due to their scale and memory demands. This paper presents a comprehensive evaluation of parameter-efficient fine-tuning (PEFT) techniques, including LoRA, BOFT, LoRA-GA, and uRNN, and introduces a novel hybrid strategy that dynamically integrates BOFT's orthogonal stability with LoRA-GA's gradient-aligned rapid convergence. By computing per-layer adaptive updates guided by gradient norms, the hybrid method achieves superior convergence efficiency and generalization across diverse tasks. We also explore, for the first time, the adaptation of unitary RNN (uRNN) principles to transformer-based LLMs, enhancing gradient stability through structured unitary constraints. Empirical evaluations on four benchmarks -- GLUE, GSM8K, MT-Bench, and HumanEval -- using models ranging from 7B to 405B parameters demonstrate that our hybrid method consistently outperforms individual PEFT baselines, approaching full fine-tuning accuracy while reducing resource consumption by up to 2.1 times in training time and 50 percent in memory usage. These findings establish the hybrid approach as a practical and scalable fine-tuning solution for real-world deployment of LLMs under resource constraints.

Title: GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

Authors: Hongjie Chen, Zehan Li, Yaodong Song, Wenming Deng, Yitong Yao, Yuxin Zhang, Hang Lv, Xuechao Zhu, Jian Kang, Jie Lian, Jie Li, Chao Wang, Shuangyong Song, Yongxiang Li, Zhongjiang He
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.18119
Pdf URL: https://arxiv.org/pdf/2507.18119
Copy Paste: [[2507.18119]] GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness(https://arxiv.org/abs/2507.18119)
Keywords: language model
Abstract: Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.

Title: MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Authors: Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18140
Pdf URL: https://arxiv.org/pdf/2507.18140
Copy Paste: [[2507.18140]] MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning(https://arxiv.org/abs/2507.18140)
Keywords: language model, llm
Abstract: Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate representation to precisely express and manipulate the images in the reasoning steps. However, existing evaluations focus mainly on text-only reasoning outputs, leaving the MLLM's ability to perform accurate visual operations via code largely unexplored. This work takes a first step toward addressing that gap by evaluating MLLMs' code-based capabilities in multi-modal mathematical reasoning. Our framework focuses on two key evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model's ability to accurately understand and construct visualizations from scratch. (2) Multi-modal Code Editing (MCE) assesses the model's capacity for fine-grained operations, which include three types: Deletion, Modification and Annotation. To evaluate the above tasks, we incorporate a dataset that covers the five most popular types of mathematical figures, including geometric diagrams, function plots, and three types of statistical charts, to provide a comprehensive and effective measurement of existing MLLMs. Our experimental evaluation involves nine mainstream MLLMs, and the results reveal that existing models still lag significantly behind human performance in performing fine-grained visual operations.

Title: HIVMedQA: Benchmarking large language models for HIV medical decision support

Authors: Gonzalo Cardenal Antolin, Jacques Fellay, Bashkim Jaha, Roger Kouyos, Niko Beerenwinkel, Diane Duroux
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18143
Pdf URL: https://arxiv.org/pdf/2507.18143
Copy Paste: [[2507.18143]] HIVMedQA: Benchmarking large language models for HIV medical decision support(https://arxiv.org/abs/2507.18143)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are emerging as valuable tools to support clinicians in routine decision-making. HIV management is a compelling use case due to its complexity, including diverse treatment options, comorbidities, and adherence challenges. However, integrating LLMs into clinical practice raises concerns about accuracy, potential harm, and clinician acceptance. Despite their promise, AI applications in HIV care remain underexplored, and LLM benchmarking studies are scarce. This study evaluates the current capabilities of LLMs in HIV management, highlighting their strengths and limitations. We introduce HIVMedQA, a benchmark designed to assess open-ended medical question answering in HIV care. The dataset consists of curated, clinically relevant questions developed with input from an infectious disease physician. We evaluated seven general-purpose and three medically specialized LLMs, applying prompt engineering to enhance performance. Our evaluation framework incorporates both lexical similarity and an LLM-as-a-judge approach, extended to better reflect clinical relevance. We assessed performance across key dimensions: question comprehension, reasoning, knowledge recall, bias, potential harm, and factual accuracy. Results show that Gemini 2.5 Pro consistently outperformed other models across most dimensions. Notably, two of the top three models were proprietary. Performance declined as question complexity increased. Medically fine-tuned models did not always outperform general-purpose ones, and larger model size was not a reliable predictor of performance. Reasoning and comprehension were more challenging than factual recall, and cognitive biases such as recency and status quo were observed. These findings underscore the need for targeted development and evaluation to ensure safe, effective LLM integration in clinical care.

Title: SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

Authors: Wonjun Jeong, Dongseok Kim, Taegkeun Whangbo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18182
Pdf URL: https://arxiv.org/pdf/2507.18182
Copy Paste: [[2507.18182]] SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models(https://arxiv.org/abs/2507.18182)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels, rather than demonstrating genuine understanding. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner. By repeatedly invoking a null prompt that lacks semantic content, SCOPE estimates each model's unique position-bias distribution. It then redistributes the answer slot according to the inverse-bias distribution, thereby equalizing the lucky-rate, the probability of selecting the correct answer by chance. Furthermore, it prevents semantically similar distractors from being placed adjacent to the answer, thereby blocking near-miss guesses based on superficial proximity cues. Across multiple benchmark experiments, SCOPE consistently outperformed existing debiasing methods in terms of stable performance improvements and showed clearer confidence distributions over correct options. This framework thus offers a new standard for enhancing the fairness and reliability of LLM evaluations.
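The inverse-bias placement described in the SCOPE abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the additive smoothing, and the choice to weight each slot by the reciprocal of its estimated bias are all assumptions made for illustration, and the constraint that keeps semantically similar distractors away from the answer is not modeled.

```python
import random
from collections import Counter

def estimate_position_bias(null_prompt_picks, n_slots, smoothing=1.0):
    """Estimate how often a model picks each slot when the prompt has no content.

    null_prompt_picks: list of slot indices chosen on null prompts (hypothetical input).
    Additive smoothing (an assumption) keeps every estimate strictly positive.
    """
    counts = Counter(null_prompt_picks)
    total = len(null_prompt_picks) + smoothing * n_slots
    return [(counts.get(i, 0) + smoothing) / total for i in range(n_slots)]

def inverse_bias_distribution(bias):
    """Weight each slot by the reciprocal of its bias, then normalize to sum to 1."""
    inv = [1.0 / b for b in bias]
    z = sum(inv)
    return [w / z for w in inv]

def place_answer(options, answer_idx, slot_probs, rng):
    """Sample the answer's slot from slot_probs and shuffle distractors around it."""
    slot = rng.choices(range(len(options)), weights=slot_probs, k=1)[0]
    distractors = [o for i, o in enumerate(options) if i != answer_idx]
    rng.shuffle(distractors)
    placed = distractors[:slot] + [options[answer_idx]] + distractors[slot:]
    return placed, slot

# Example: a model that strongly prefers the first slot on null prompts.
picks = [0] * 6 + [1] * 2 + [2] + [3]
bias = estimate_position_bias(picks, n_slots=4)
slot_probs = inverse_bias_distribution(bias)
placed, slot = place_answer(["A", "B", "C", "D"], 2, slot_probs, random.Random(0))
```

Under this weighting, the product of a slot's bias and its placement probability is the same for every slot, which is one way to read "equalizing the lucky-rate": no slot contributes more than another to a chance hit.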

Title: TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks

Authors: Keyu Wu, Qianjin Yu, Manlin Mei, Ruiting Liu, Jun Wang, Kailai Zhang, Yelun Bao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18190
Pdf URL: https://arxiv.org/pdf/2507.18190
Copy Paste: [[2507.18190]] TN-AutoRCA: Benchmark Construction and Agentic Framework for Self-Improving Alarm-Based Root Cause Analysis in Telecommunication Networks(https://arxiv.org/abs/2507.18190)
Keywords: agent
Abstract: Root Cause Analysis (RCA) in telecommunication networks is a critical task, yet it presents a formidable challenge for Artificial Intelligence (AI) due to its complex, graph-based reasoning requirements and the scarcity of realistic benchmarks.

Title: Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection

Authors: San Kim, Jonghwi Kim, Yejin Jeon, Gary Geunbae Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18202
Pdf URL: https://arxiv.org/pdf/2507.18202
Copy Paste: [[2507.18202]] Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection(https://arxiv.org/abs/2507.18202)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk: attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever's similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.

Title: Exploring the Impact of Instruction-Tuning on LLM's Susceptibility to Misinformation

Authors: Kyubeen Han, Junseo Jang, Hongjin Kim, Geunyeong Jeong, Harksoo Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18203
Pdf URL: https://arxiv.org/pdf/2507.18203
Copy Paste: [[2507.18203]] Exploring the Impact of Instruction-Tuning on LLM's Susceptibility to Misinformation(https://arxiv.org/abs/2507.18203)
Keywords: language model, llm, hallucination, prompt
Abstract: Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model's dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLMs' susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.

Title: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Authors: Xinrui Chen, Hongxing Zhang, Fanyi Zeng, Yongxian Wei, Yizhi Wang, Xitong Ling, Guanghao Li, Chun Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18212
Pdf URL: https://arxiv.org/pdf/2507.18212
Copy Paste: [[2507.18212]] Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation(https://arxiv.org/abs/2507.18212)
Keywords: language model, llm
Abstract: Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model's question-answering performance, outperforming the baseline by 4.01%.
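The two steps named in the Prune&Comp abstract (estimate the magnitude gap, then rescale weights offline) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the gap is taken to be the ratio of mean hidden-state L2 norms after versus before the removed layer on some calibration inputs, and the factor is applied to a single hypothetical weight matrix. The abstract does not say which weights are rescaled or how the gap is measured.

```python
import math

def l2_norm(vec):
    """Euclidean norm of one hidden-state vector."""
    return math.sqrt(sum(x * x for x in vec))

def magnitude_gap(hidden_in, hidden_out):
    """Ratio of mean hidden-state norms after vs. before the layer to be removed.

    hidden_in / hidden_out: lists of vectors collected on calibration inputs
    (hypothetical data; the estimator itself is an assumption).
    """
    mean_in = sum(l2_norm(h) for h in hidden_in) / len(hidden_in)
    mean_out = sum(l2_norm(h) for h in hidden_out) / len(hidden_out)
    return mean_out / mean_in

def rescale_weights(weight, factor):
    """Fold the compensation factor into a weight matrix once, offline."""
    return [[w * factor for w in row] for row in weight]

# Toy calibration data: the removed layer doubled the hidden-state magnitude.
h_in = [[3.0, 4.0], [6.0, 8.0]]
h_out = [[6.0, 8.0], [12.0, 16.0]]
factor = magnitude_gap(h_in, h_out)
compensated = rescale_weights([[1.0, 2.0], [3.0, 4.0]], factor)
```

Because the factor is multiplied into stored weights ahead of time, inference runs the same number of operations as the uncompensated pruned model, which matches the abstract's claim of zero runtime overhead.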

Title: Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models

Authors: Suhang Wu, Jialong Tang, Chengyi Yang, Pei Zhang, Baosong Yang, Junhui Li, Junfeng Yao, Min Zhang, Jinsong Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18263
Pdf URL: https://arxiv.org/pdf/2507.18263
Copy Paste: [[2507.18263]] Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models(https://arxiv.org/abs/2507.18263)
Keywords: language model
Abstract: Direct speech translation (ST) has garnered increasing attention, yet the accurate translation of terminology within utterances remains a great challenge. Current studies mainly concentrate on incorporating various kinds of translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, we propose a novel Locate-and-Focus method for terminology translation. It first locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.

Title: StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer

Authors: Pritika Ramu, Apoorv Saxena, Meghanath M Y, Varsha Sankar, Debraj Basu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18294
Pdf URL: https://arxiv.org/pdf/2507.18294
Copy Paste: [[2507.18294]] StyleAdaptedLM: Enhancing Instruction Following Models with Efficient Stylistic Transfer(https://arxiv.org/abs/2507.18294)
Keywords: llm
Abstract: Adapting LLMs to specific stylistic characteristics, like brand voice or authorial tones, is crucial for enterprise communication but challenging to achieve from corpora that lack instruction-response formatting without compromising instruction adherence. We introduce StyleAdaptedLM, a framework that efficiently transfers stylistic traits to instruction-following models using Low-Rank Adaptation (LoRA). LoRA adapters are first trained on a base model with diverse unstructured stylistic corpora, then merged with a separate instruction-following model. This enables robust stylistic customization without paired data or sacrificing task performance. Experiments across multiple datasets and models demonstrate improved stylistic consistency while preserving instruction adherence, with human evaluations confirming brand-specific convention uptake. StyleAdaptedLM offers an efficient path for stylistic personalization in LLMs.
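As a rough illustration of what merging a LoRA adapter into another model's weights can look like, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B * A to a single weight matrix in plain Python. The matrices and the function names are hypothetical, and the abstract does not state that StyleAdaptedLM merges adapters in exactly this way; treat it as a sketch of the general technique only.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: d x k matrix of the instruction-following model (hypothetical values).
    lora_b: d x rank and lora_a: rank x k adapter matrices trained elsewhere.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example with a rank-1 adapter on a 2 x 2 weight matrix.
w_instruct = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, -0.5]]
merged = merge_lora(w_instruct, b, a, alpha=2.0, rank=1)
```

The merged matrix has the same shape as the original, so the adapted model needs no extra parameters or layers at inference time.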

Title: BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit

Authors: Biao Yi, Zekun Fei, Jianing Geng, Tong Li, Lihai Nie, Zheli Liu, Yiming Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18305
Pdf URL: https://arxiv.org/pdf/2507.18305
Copy Paste: [[2507.18305]] BadReasoner: Planting Tunable Overthinking Backdoors into Large Reasoning Models for Fun or Profit(https://arxiv.org/abs/2507.18305)
Keywords: language model, llm, chain-of-thought
Abstract: Large reasoning models (LRMs) have emerged as a significant advancement in artificial intelligence, representing a specialized class of large language models (LLMs) designed to tackle complex reasoning tasks. The defining characteristic of LRMs lies in their extensive chain-of-thought (CoT) reasoning capabilities. In this paper, we identify a previously unexplored attack vector against LRMs, which we term "overthinking backdoors". We advance this concept by proposing a novel tunable backdoor, which moves beyond simple on/off attacks to one where an attacker can precisely control the extent of the model's reasoning verbosity. Our attack is implemented through a novel data poisoning methodology. It pairs a tunable trigger (where the number of repetitions signals the desired intensity) with a correspondingly verbose CoT response. These responses are programmatically generated by instructing a teacher LLM to inject a controlled number of redundant refinement steps into a correct reasoning process. The approach preserves output correctness, which ensures stealth and establishes the attack as a pure resource-consumption vector. Extensive empirical results on various LRMs demonstrate that our method can reliably trigger a controllable, multi-fold increase in the length of the reasoning process, without degrading the final answer's correctness. Our source code is available at this https URL.

Title: TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning

Authors: Yifu Chen, Bingchen Huang, Zhiling Wang, Yuanchao Du, Junfeng Luo, Lei Shen, Zhineng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18340
Pdf URL: https://arxiv.org/pdf/2507.18340
Copy Paste: [[2507.18340]] TDR: Task-Decoupled Retrieval with Fine-Grained LLM Feedback for In-Context Learning(https://arxiv.org/abs/2507.18340)
Keywords: llm
Abstract: In-context learning (ICL) has become a classic approach for enabling LLMs to handle various tasks based on a few input-output examples. The effectiveness of ICL relies heavily on the quality of these examples, and previous works focused on enhancing example retrieval capabilities have achieved impressive performance. However, two challenges remain in retrieving high-quality examples: (1) difficulty in distinguishing cross-task data distributions, and (2) difficulty in making the fine-grained connection between retriever output and feedback from LLMs. In this paper, we propose a novel framework called TDR. TDR decouples the ICL examples from different tasks, which enables the retrieval module to retrieve examples specific to the target task within a multi-task dataset. Furthermore, TDR models fine-grained feedback from LLMs to supervise and guide the training of the retrieval module, which helps to retrieve high-quality examples. We conducted extensive experiments on a suite of 30 NLP tasks. The results demonstrate that TDR consistently improves results across all datasets and achieves state-of-the-art performance. Meanwhile, our approach is a plug-and-play method, which can be easily combined with various LLMs to improve example retrieval abilities for ICL. The code is available at this https URL.

Title: Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence

Authors: Ariana Sahitaj, Premtim Sahitaj, Veronika Solopova, Jiaao Li, Sebastian Möller, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18343
Pdf URL: https://arxiv.org/pdf/2507.18343
Copy Paste: [[2507.18343]] Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence(https://arxiv.org/abs/2507.18343)
Keywords: language model, llm
Abstract: Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.

Title: CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

Authors: Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18392
Pdf URL: https://arxiv.org/pdf/2507.18392
Copy Paste: [[2507.18392]] CLEAR: Error Analysis via LLM-as-a-Judge Made Easy(https://arxiv.org/abs/2507.18392)
Keywords: language model, llm
Abstract: The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then it creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.

Title: FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

Authors: Giorgos Iacovides, Wuyang Zhou, Danilo Mandic
Subjects: cs.CL, cs.LG, q-fin.ST, q-fin.TR
Abstract URL: https://arxiv.org/abs/2507.18417
Pdf URL: https://arxiv.org/pdf/2507.18417
Copy Paste: [[2507.18417]] FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs(https://arxiv.org/abs/2507.18417)
Keywords: language model, llm
Abstract: Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel 'logit-to-score' conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).
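The 'logit-to-score' idea in the FinDPO abstract, turning a discrete sentiment label into a continuous, rankable score, can be approximated with a short Python sketch. Everything specific here is an assumption rather than the paper's method: the three label names, the softmax over their logits, and the signed score P(positive) - P(negative) used for ranking.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # assumed label set

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_to_score(label_logits):
    """Map per-label logits to a signed score in (-1, 1).

    label_logits: dict from label name to the model's logit for that label
    (hypothetical input). The score is P(positive) - P(negative).
    """
    probs = dict(zip(LABELS, softmax([label_logits[l] for l in LABELS])))
    return probs["positive"] - probs["negative"]

def rank_assets(logits_by_asset):
    """Order asset identifiers from most positive to most negative sentiment."""
    scores = {a: logit_to_score(l) for a, l in logits_by_asset.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with two hypothetical assets.
ranking = rank_assets({
    "AAA": {"positive": 2.0, "neutral": 0.0, "negative": -1.0},
    "BBB": {"positive": -1.0, "neutral": 0.0, "negative": 2.0},
})
```

A continuous score like this lets a portfolio rule sort assets, for example going long the top-ranked names and short the bottom-ranked ones, which a bare class label cannot support.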

Title: AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data

Authors: Rana Alshaikh, Israa Alghanmi, Shelan Jeawak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18442
Pdf URL: https://arxiv.org/pdf/2507.18442
Copy Paste: [[2507.18442]] AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data(https://arxiv.org/abs/2507.18442)
Keywords: language model, llm
Abstract: The cognitive and reasoning abilities of large language models (LLMs) have enabled remarkable progress in natural language processing. However, their performance in interpreting structured data, especially in tabular formats, remains limited. Although benchmarks for English tabular data are widely available, Arabic is still underrepresented because of the limited availability of public resources and its unique language features. To address this gap, we present AraTable, a novel and comprehensive benchmark designed to evaluate the reasoning and understanding capabilities of LLMs when applied to Arabic tabular data. AraTable consists of various evaluation tasks, such as direct question answering, fact verification, and complex reasoning, involving a wide range of Arabic tabular sources. Our methodology follows a hybrid pipeline, where initial content is generated by LLMs and subsequently filtered and verified by human experts to ensure high dataset quality. Initial analyses using AraTable show that, while LLMs perform adequately on simpler tabular tasks such as direct question answering, they continue to face significant cognitive challenges when tasks require deeper reasoning and fact verification. This indicates that there are substantial opportunities for future work to improve performance on complex tabular reasoning tasks. We also propose a fully automated evaluation framework that uses a self-deliberation mechanism and achieves performance nearly identical to that of human judges. This research provides a valuable, publicly available resource and evaluation framework that can help accelerate the development of foundational models for processing and analysing Arabic structured data.

Title: Generation of Synthetic Clinical Text: A Systematic Review

Authors: Basel Alshaikhdeeb, Ahmed Abdelmonem Hemedan, Soumyabrata Ghosh, Irina Balaur, Venkata Satagopam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18451
Pdf URL: https://arxiv.org/pdf/2507.18451
Copy Paste: [[2507.18451]] Generation of Synthetic Clinical Text: A Systematic Review(https://arxiv.org/abs/2507.18451)
Keywords: gpt
Abstract: Generating synthetic clinical text is an effective solution for common clinical NLP issues such as sparsity and privacy. This paper presents a systematic review of synthetic medical free-text generation, formulating quantitative analyses for three research questions concerning (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched the PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications associated with generating synthetic medical unstructured free-text. We identified 94 relevant articles out of 1,398 collected. Considerable attention has been given to the generation of synthetic medical text from 2018 onwards, with the main purposes being text augmentation, assistive writing, corpus building, privacy preservation, annotation, and usefulness. Transformer architectures, especially GPTs, were the predominant technique used to generate the text. There were four main aspects of evaluation (similarity, privacy, structure, and utility), of which utility was the most frequently used to assess the generated synthetic medical text. Although the generated synthetic medical text showed only a moderate ability to stand in for real medical documents in different downstream NLP tasks, it has proven to be a valuable asset when used to augment and complement real documents, improving accuracy and overcoming sparsity/undersampling issues. Yet privacy remains a major issue in generating synthetic medical text, and more human assessment is needed to check for the existence of any sensitive information. Despite that, advances in generating synthetic medical text will considerably accelerate the adoption of workflows and pipeline development, discarding the time-consuming legalities of data transfer.
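Summary: Generating synthetic clinical text is an effective solution to common clinical NLP problems such as sparsity and privacy. This paper presents a systematic review of the generation of synthetic medical free text, built on a quantitative analysis of three research questions: (i) the purpose of generation, (ii) the techniques, and (iii) the evaluation methods. We searched the PubMed, ScienceDirect, Web of Science, Scopus, IEEE, Google Scholar, and arXiv databases for publications related to generating synthetic unstructured medical free text, and identified 94 relevant articles among the 1,398 collected. Synthetic medical text generation has attracted considerable attention since 2018, with the main purposes being text augmentation, assistive writing, corpus building, privacy preservation, annotation, and usefulness. Transformer architectures, especially the GPTs, were the predominant generation technique. Evaluation covered four main aspects (similarity, privacy, structure, and utility), of which utility was the most frequently used to assess the generated text. Although synthetic medical text showed only a moderate ability to stand in for real medical documents in downstream NLP tasks, it has proven to be a great asset as augmentation that complements real documents, improving accuracy and overcoming sparsity and undersampling issues. Privacy nevertheless remains a major issue, and more human assessment is needed to check for sensitive information. Even so, advances in synthetic medical text generation will considerably accelerate the adoption of workflows and pipeline development by removing the time-consuming legalities of data transfer.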

Title: Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models

Authors: Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18504
Pdf URL: https://arxiv.org/pdf/2507.18504
Copy Paste: [[2507.18504]] Not All Features Deserve Attention: Graph-Guided Dependency Learning for Tabular Data Generation with Language Models(https://arxiv.org/abs/2507.18504)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. However, tabular data inherently exhibits sparse feature-level dependencies, where many feature interactions are structurally insignificant. This creates a fundamental mismatch as LLMs' self-attention mechanism inevitably distributes focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into LLMs' attention mechanism. GraDe employs a lightweight dynamic graph learning module guided by externally extracted functional dependencies, prioritizing key feature interactions while suppressing irrelevant ones. Our experiments across diverse real-world datasets demonstrate that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving competitive results with state-of-the-art approaches in synthetic data quality. Our method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.
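Summary: Large language models (LLMs) have shown strong potential for tabular data generation by modeling textualized feature-value pairs. Tabular data, however, inherently exhibits sparse feature-level dependencies, and many feature interactions are structurally insignificant. This creates a fundamental mismatch: the self-attention mechanism of LLMs inevitably spreads focus across all pairs, diluting attention on critical relationships, particularly in datasets with complex dependencies or semantically ambiguous features. To address this limitation, we propose GraDe (Graph-Guided Dependency Learning), a novel method that explicitly integrates sparse dependency graphs into the attention mechanism of LLMs. GraDe uses a lightweight dynamic graph learning module, guided by externally extracted functional dependencies, to prioritize key feature interactions while suppressing irrelevant ones. Experiments across diverse real-world datasets show that GraDe outperforms existing LLM-based approaches by up to 12% on complex datasets while achieving results competitive with state-of-the-art approaches in synthetic data quality. The method is minimally intrusive yet effective, offering a practical solution for structure-aware tabular data modeling with LLMs.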
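As a rough illustration of the idea in the abstract above, the sketch below restricts attention to the feature pairs allowed by a sparse dependency graph. It is a minimal sketch under stated assumptions, not GraDe itself: the hard 0/1 mask, the function name `masked_attention_weights`, and the toy scores and adjacency matrix are all hypothetical, whereas the abstract describes a learned, dynamic graph module.

```python
import math


def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over `scores`, keeping only pairs allowed by `adjacency`.

    scores[i][j]    -- raw attention score from feature i to feature j
    adjacency[i][j] -- 1 if the (hypothetical) dependency graph links i -> j, else 0
    """
    weights = []
    for row_scores, row_mask in zip(scores, adjacency):
        kept = [s for s, m in zip(row_scores, row_mask) if m]
        if not kept:
            # A feature with no allowed dependency attends to nothing.
            weights.append([0.0] * len(row_scores))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example with three features (all values are made up).
scores = [[1.0, 2.0, 3.0],
          [0.5, 0.5, 0.5],
          [2.0, 0.0, 1.0]]
adjacency = [[1, 1, 0],
             [0, 1, 1],
             [0, 0, 1]]
weights = masked_attention_weights(scores, adjacency)
```

Masked pairs receive exactly zero weight, so the remaining attention mass is concentrated on the pairs the graph marks as dependent.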

Title: The Moral Gap of Large Language Models

Authors: Maciej Skorski, Alina Landowska
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18523
Pdf URL: https://arxiv.org/pdf/2507.18523
Copy Paste: [[2507.18523]] The Moral Gap of Large Language Models(https://arxiv.org/abs/2507.18523)
Keywords: language model, llm, prompt
Abstract: Moral foundation detection is crucial for analyzing social discourse and developing ethically-aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
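Summary: Moral foundation detection is crucial for analyzing social discourse and developing ethically aligned AI systems. Although large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets, using ROC, PR, and DET curve analysis. The results reveal substantial performance gaps: despite prompt engineering efforts, LLMs exhibit high false negative rates and systematically under-detect moral content. These findings show that task-specific fine-tuning remains superior to prompting for moral reasoning applications.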
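The "high false negative rates" mentioned in the abstract above can be made concrete with a small calculation. The sketch below is illustrative only: the function name `false_negative_rate` and the labels and predictions are made up, not data from the paper.

```python
def false_negative_rate(y_true, y_pred):
    """Share of truly positive items that the classifier labeled negative."""
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        raise ValueError("false negative rate is undefined without positive examples")
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return missed / positives


# Hypothetical labels: 1 = text expresses a moral foundation, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0]
fnr = false_negative_rate(y_true, y_pred)
print(fnr)  # 0.75: three of the four positive items were missed
```

A detector that under-detects moral content has a high value here even when its overall accuracy looks acceptable, which is why curve-based analyses such as ROC, PR, and DET are informative.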

Title: GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

Authors: Urchade Zaratiana, Gil Pasternak, Oliver Boyd, George Hurn-Maloney, Ash Lewis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18546
Pdf URL: https://arxiv.org/pdf/2507.18546
Copy Paste: [[2507.18546]] GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface(https://arxiv.org/abs/2507.18546)
Keywords: language model, llm
Abstract: Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at this https URL.
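Summary: Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a pretrained transformer encoder architecture, GLiNER2 retains CPU efficiency and a compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments show competitive performance across extraction and classification tasks, with substantial improvements in deployment accessibility over LLM-based alternatives. We release GLiNER2 as an open-source, pip-installable library with pre-trained models and documentation at this https URL.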

Title: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

Authors: Ganesh Sapkota, Md Hasibur Rahman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18570
Pdf URL: https://arxiv.org/pdf/2507.18570
Copy Paste: [[2507.18570]] Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods(https://arxiv.org/abs/2507.18570)
Keywords: language model
Abstract: This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
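Summary: This paper proposes a novel hybrid tokenization strategy that improves the performance of DNA language models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization captures local DNA sequence structure effectively but often suffers from uneven token distribution and a limited grasp of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach yields a balanced, context-aware vocabulary that lets the model capture short and long patterns in DNA sequences at the same time. A foundational DLM trained on the hybrid vocabulary was evaluated with next-k-mer prediction as a fine-tuning task and showed significantly improved performance. The model reached prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid strategy to preserve both local sequence structure and global contextual information in DNA modeling. The work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.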
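The vocabulary merge described in the abstract above can be sketched in a few lines. This is a minimal sketch under assumptions the abstract does not state: overlapping k-mers with stride 1, a plain set union as the merge rule, and a hand-written `bpe_tokens` list standing in for the output of the 600 BPE cycles. The function names are hypothetical.

```python
def kmer_tokens(sequence, k=6):
    """Return the overlapping k-mers (stride 1) of a DNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_hybrid_vocab(sequences, bpe_tokens, k=6):
    """Map every unique k-mer and every BPE token to a stable integer id.

    k-mers are added first, in order of first appearance; BPE tokens that
    duplicate an existing k-mer keep the id already assigned.
    """
    vocab = {}
    for seq in sequences:
        for token in kmer_tokens(seq, k):
            vocab.setdefault(token, len(vocab))
    for token in bpe_tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


# Toy inputs (made up); a real BPE vocabulary would come from a trained tokenizer.
sequences = ["ACGTACGT", "TTGACA"]
bpe_tokens = ["AC", "ACGT", "TTGACA"]
vocab = build_hybrid_vocab(sequences, bpe_tokens)
print(len(vocab))  # 6: four unique 6-mers plus two new BPE tokens
```

The resulting vocabulary mixes fixed-length 6-mers with variable-length BPE tokens, which is the property the abstract credits for capturing both short and long patterns.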

Title: Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Authors: Feng Hong, Geng Yu, Yushi Ye, Haicheng Huang, Huangjie Zheng, Ya Zhang, Yanfeng Wang, Jiangchao Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18578
Pdf URL: https://arxiv.org/pdf/2507.18578
Copy Paste: [[2507.18578]] Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs(https://arxiv.org/abs/2507.18578)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.
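Summary: Diffusion large language models (DLLMs) have emerged as a compelling alternative to autoregressive models, designed for fast parallel generation. Existing DLLMs, however, are plagued by a severe quality-speed trade-off in which faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction as early erroneous context accumulates. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism: it aggressively drafts multiple tokens while simultaneously using the model's bidirectional context to verify them and re-mask suspicious ones for refinement. Verified on open-source DLLMs such as LLaDA and MMaDA, WINO decisively improves the quality-speed trade-off. For instance, on the GSM8K math benchmark it accelerates inference by 6$\times$ while improving accuracy by 2.58%; on Flickr30K captioning it achieves a 10$\times$ speedup with higher performance. More comprehensive experiments demonstrate its superiority and provide an in-depth understanding of WINO.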

Title: System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

Authors: Jiahao Wang, Ramen Liu, Longhui Zhang, Jing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18580
Pdf URL: https://arxiv.org/pdf/2507.18580
Copy Paste: [[2507.18580]] System Report for CCL25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition(https://arxiv.org/abs/2507.18580)
Keywords: gpt, prompt, retrieval-augmented generation
Abstract: This paper presents our system for CCL25-Eval Task 10, addressing Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task into triplet extraction, uses dynamic retrieval from the training set to create contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at this https URL.
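Summary: This paper describes our system for CCL25-Eval Task 10, which addresses Fine-Grained Chinese Hate Speech Recognition (FGCHSR). We propose a novel SRAG-MAV framework that synergistically integrates task reformulation (TR), Self-Retrieval-Augmented Generation (SRAG), and Multi-Round Accumulative Voting (MAV). Our method reformulates the quadruplet extraction task as triplet extraction, uses dynamic retrieval from the training set to build contextual prompts, and applies multi-round inference with voting to improve output stability and performance. Our system, based on the Qwen2.5-7B model, achieves a Hard Score of 26.66, a Soft Score of 48.35, and an Average Score of 37.505 on the STATE ToxiCN dataset, significantly outperforming baselines such as GPT-4o (Average Score 15.63) and fine-tuned Qwen2.5-7B (Average Score 35.365). The code is available at this https URL.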
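The "multi-round inference with voting" step in the abstract above might look roughly like the following. This is a sketch under assumptions, not the authors' algorithm: the stopping rule (`min_votes`), the round limit, and the stand-in `generate` callable are all hypothetical, since the abstract does not specify how votes are accumulated.

```python
from collections import Counter


def accumulative_vote(generate, max_rounds=5, min_votes=3):
    """Run inference round by round, accumulating votes per distinct output.

    Stops as soon as one output has `min_votes` votes; otherwise returns the
    most frequent output after `max_rounds`. Returns (answer, rounds_used).
    """
    votes = Counter()
    for round_idx in range(max_rounds):
        votes[generate(round_idx)] += 1
        answer, count = votes.most_common(1)[0]
        if count >= min_votes:
            return answer, round_idx + 1
    return votes.most_common(1)[0][0], max_rounds


# Stand-in for a sampled model call; each output is a hypothetical triplet string.
fake_outputs = ["A | B | hate", "A | C | hate", "A | B | hate",
                "A | B | hate", "A | C | hate"]
answer, rounds_used = accumulative_vote(lambda i: fake_outputs[i])
print(answer, rounds_used)  # A | B | hate 4
```

Stopping early once an output has enough votes keeps the number of inference calls low when the model is already consistent.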

Title: AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs

Authors: Xiaopeng Ke, Hexuan Deng, Xuebo Liu, Jun Rao, Zhenxi Song, Jun Yu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18584
Pdf URL: https://arxiv.org/pdf/2507.18584
Copy Paste: [[2507.18584]] AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs(https://arxiv.org/abs/2507.18584)
Keywords: language model, llm
Abstract: Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from corresponding unlabeled data, including Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of the production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at this https URL.
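Summary: Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized ones. Existing approaches typically rely on data synthesis methods and obtain promising results by using unlabeled data to capture domain-specific features. These methods, however, either incur high computational costs or suffer from performance limitations, and they also generalize poorly across tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from the corresponding unlabeled data, comprising Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning and self-inspection to improve model performance. In addition, customizable task instructions enable high-quality data generation for any task. On this basis we construct a dataset of 703k examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 at just 17% of the production cost. Further analysis shows that the generated data is more relevant to downstream tasks. Source code, models, and scripts are available at this https URL.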

Title: TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

Authors: Andreea Nica, Ivan Zakazov, Nicolas Mario Baldwin, Saibo Geng, Robert West
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18618
Pdf URL: https://arxiv.org/pdf/2507.18618
Copy Paste: [[2507.18618]] TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards(https://arxiv.org/abs/2507.18618)
Keywords: language model, llm, prompt
Abstract: Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following heuristic-based "Think step by step" approaches, the field has evolved in two main directions: while one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, a concurrent line of research relies on numerical rewards to train a special prompt model, tailored for providing optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by directly incorporating textual feedback into training of the prompt model. Our framework does not require prior dataset collection and is being iteratively improved with the feedback on the generated prompts. When coupled with the capacity of an LLM to internalize the notion of what a "good" prompt is, the high-resolution signal provided by the textual rewards allows us to train a prompt model yielding state-of-the-art query-specific prompts for the problems from the challenging math datasets GSMHard and MATH.
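Summary: Prompt optimization improves the reasoning abilities of large language models (LLMs) without requiring parameter updates to the target model. Following the heuristic "Think step by step" approaches, the field has evolved in two main directions: one group of methods uses textual feedback to elicit improved prompts from general-purpose LLMs in a training-free way, while a concurrent line of research relies on numerical rewards to train a dedicated prompt model tailored to provide optimal prompts to the target model. In this paper, we introduce the Textual Reward Prompt framework (TRPrompt), which unifies these approaches by incorporating textual feedback directly into the training of the prompt model. Our framework requires no prior dataset collection and is improved iteratively using feedback on the generated prompts. Coupled with an LLM's capacity to internalize the notion of what a "good" prompt is, the high-resolution signal provided by textual rewards allows us to train a prompt model that yields state-of-the-art query-specific prompts for problems from the challenging math datasets GSMHard and MATH.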

Title: Checklists Are Better Than Reward Models For Aligning Language Models

Authors: Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, Tongshuang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18624
Pdf URL: https://arxiv.org/pdf/2507.18624
Copy Paste: [[2507.18624]] Checklists Are Better Than Reward Models For Aligning Language Models(https://arxiv.org/abs/2507.18624)
Keywords: language model
Abstract: Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this -- typically using fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item -- using both AI judges and specialized verifier programs -- then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction-following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks -- RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models' support of queries that express a multitude of needs.
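Summary: Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used for this purpose, typically with fixed criteria such as "helpfulness" and "harmfulness". In our work, we instead propose using flexible, instruction-specific criteria as a way to broaden the impact reinforcement learning can have in eliciting instruction following. We propose "Reinforcement Learning from Checklist Feedback" (RLCF). From each instruction we extract a checklist and evaluate how well responses satisfy each item, using both AI judges and specialized verifier programs, and then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction-following model (Qwen2.5-7B-Instruct) on five widely studied benchmarks. RLCF is the only method that improves performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. These results establish checklist feedback as a key tool for improving language models' support of queries that express a multitude of needs.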
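The abstract above says per-item checklist scores are combined into a single reward but does not give the rule. The sketch below assumes a weighted mean, which is only one plausible choice; the function name `checklist_reward`, the weights, and the example scores are hypothetical.

```python
import math


def checklist_reward(item_scores, weights=None):
    """Combine per-item checklist scores (each in [0, 1]) into one scalar reward.

    Uses a weighted mean; with no weights given, every item counts equally.
    """
    if not item_scores:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        weights = [1.0] * len(item_scores)
    if len(weights) != len(item_scores):
        raise ValueError("weights and item_scores must have the same length")
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, item_scores)) / total_weight


# Two items scored by a hypothetical AI judge, one by a pass/fail verifier program.
judge_scores = [0.8, 0.4]
verifier_score = 1.0  # the verifier check passed
reward = checklist_reward(judge_scores + [verifier_score], weights=[1.0, 1.0, 2.0])
assert math.isclose(reward, 0.8)
```

Weighting the verifier item more heavily reflects one possible design choice: programmatic checks are exact, so a caller may want to trust them more than a judge's graded score.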