2025-06-23

Title: Veracity: An Open-Source AI Fact-Checking System

Authors: Taylor Lynn Curtis, Maximilian Puelma Touzel, William Garneau, Manon Gruaz, Mike Pinder, Li Wei Wang, Sukanya Krishna, Luda Cohen, Jean-François Godbout, Reihaneh Rabbany, Kellin Pelrine
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.15794
Pdf URL: https://arxiv.org/pdf/2506.15794
Copy Paste: [[2506.15794]] Veracity: An Open-Source AI Fact-Checking System(https://arxiv.org/abs/2506.15794)
Keywords: language model, llm, agent
Abstract: The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.

Title: Rethinking LLM Training through Information Geometry and Quantum Metrics

Authors: Riccardo Di Sipio
Subjects: cs.CL, quant-ph
Abstract URL: https://arxiv.org/abs/2506.15830
Pdf URL: https://arxiv.org/pdf/2506.15830
Copy Paste: [[2506.15830]] Rethinking LLM Training through Information Geometry and Quantum Metrics(https://arxiv.org/abs/2506.15830)
Keywords: language model, llm
Abstract: Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-aware approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.
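For reference, the standard textbook definitions behind the terms in this abstract can be written as below. This is a sketch in generic notation (parameters $\theta$, model distribution $p_\theta$, loss $\mathcal{L}$, learning rate $\eta$, normalized pure state $|\psi\rangle$); the notation is assumed here and is not taken from the paper itself.

```latex
% Fisher information metric on parameter space
F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
    \nabla_\theta \log p_\theta(x)\,
    \nabla_\theta \log p_\theta(x)^{\top}
\right]

% Natural gradient descent: precondition the gradient with the inverse metric
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t)

% Fubini-Study line element for a normalized pure state |\psi>
ds^2 = \langle d\psi \,|\, d\psi \rangle - \left| \langle \psi \,|\, d\psi \rangle \right|^2
```

For pure states, the Quantum Fisher Information is four times the Fubini-Study metric, which is the sense in which the two quantum quantities named in the abstract are related.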

Title: MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.15841
Pdf URL: https://arxiv.org/pdf/2506.15841
Copy Paste: [[2506.15841]] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents(https://arxiv.org/abs/2506.15841)
Keywords: llm, prompt, agent
Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.

Title: Finance Language Model Evaluation (FLaME)

Authors: Glenn Matlin, Mika Okamoto, Huzaifa Pardawala, Yang Yang, Sudheer Chava
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.15846
Pdf URL: https://arxiv.org/pdf/2506.15846
Copy Paste: [[2506.15846]] Finance Language Model Evaluation (FLaME)(https://arxiv.org/abs/2506.15846)
Keywords: language model
Abstract: Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs' performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). This is the first research paper to comprehensively compare LMs against 'reasoning-reinforced' LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.

Title: Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Authors: Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.15889
Pdf URL: https://arxiv.org/pdf/2506.15889
Copy Paste: [[2506.15889]] Entropy-Driven Pre-Tokenization for Byte-Pair Encoding(https://arxiv.org/abs/2506.15889)
Keywords: language model, gpt
Abstract: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.
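The statistics named in the first approach (pointwise mutual information and left/right entropy) can be illustrated with a minimal Python sketch. It assumes a toy corpus of unsegmented strings and scores only adjacent character pairs; the function name and the restriction to bigrams are illustrative choices, not the paper's implementation.

```python
import math
from collections import Counter, defaultdict


def pmi_and_branching_entropy(corpus):
    """Return (pmi, left_entropy, right_entropy) dicts keyed by character bigram."""
    chars = Counter()
    pairs = Counter()
    left_ctx = defaultdict(Counter)   # bigram -> counts of the preceding character
    right_ctx = defaultdict(Counter)  # bigram -> counts of the following character

    for sent in corpus:
        chars.update(sent)
        for i in range(len(sent) - 1):
            bigram = sent[i:i + 2]
            pairs[bigram] += 1
            if i > 0:
                left_ctx[bigram][sent[i - 1]] += 1
            if i + 2 < len(sent):
                right_ctx[bigram][sent[i + 2]] += 1

    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    # PMI = log2( p(xy) / (p(x) * p(y)) ), estimated from raw counts.
    pmi = {
        bigram: math.log2(
            (count / n_pairs)
            / ((chars[bigram[0]] / n_chars) * (chars[bigram[1]] / n_chars))
        )
        for bigram, count in pairs.items()
    }
    left_h = {bigram: entropy(left_ctx[bigram]) for bigram in pairs}
    right_h = {bigram: entropy(right_ctx[bigram]) for bigram in pairs}
    return pmi, left_h, right_h


corpus = ["我爱北京", "他爱北京", "北京很大"]
pmi, left_h, right_h = pmi_and_branching_entropy(corpus)
```

A span with high PMI and high entropy on both sides is a plausible word candidate: its characters co-occur strongly, while the characters around it vary. How these scores would be thresholded or combined into pre-tokenization boundaries is left open here.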

Title: Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning

Authors: Sam Silver, Jimin Sun, Ivan Zhang, Sara Hooker, Eddie Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.15894
Pdf URL: https://arxiv.org/pdf/2506.15894
Copy Paste: [[2506.15894]] Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning(https://arxiv.org/abs/2506.15894)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, yet their performance remains brittle to minor variations in problem description and prompting strategy. Furthermore, reasoning is vulnerable to sampling-induced errors which autoregressive models must primarily address using self-correction via additionally-generated tokens. To better understand self-correction capabilities of recent models, we conduct experiments measuring models' ability to self-correct synthetic perturbations introduced into their Chain of Thought (CoT) reasoning. We observe robust single-utterance intrinsic self-correction behavior across a range of open-weight models and datasets, ranging from subtle, implicit corrections to explicit acknowledgments and corrections of errors. Our findings suggest that LLMs, including those not finetuned for long CoT, may possess stronger intrinsic self-correction capabilities than commonly shown in the literature. The presence of this ability suggests that recent "reasoning" model work involves amplification of traits already meaningfully present in models.

Title: From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

Authors: Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam, Shahab Saquib Sohail, Amir Hussain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.15911
Pdf URL: https://arxiv.org/pdf/2506.15911
Copy Paste: [[2506.15911]] From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents(https://arxiv.org/abs/2506.15911)
Keywords: llm, prompt, retrieval-augmented generation, agent
Abstract: Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.

Title: Reranking-based Generation for Unbiased Perspective Summarization

Authors: Narutatsu Ri, Nicholas Deas, Kathleen McKeown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.15925
Pdf URL: https://arxiv.org/pdf/2506.15925
Copy Paste: [[2506.15925]] Reranking-based Generation for Unbiased Perspective Summarization(https://arxiv.org/abs/2506.15925)
Keywords: language model, llm
Abstract: Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.

Title: From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation

Authors: Zhihan Guo, Jiele Wu, Wenqian Cui, Yifei Zhang, Minda Hu, Yufei Wang, Irwin King
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16024
Pdf URL: https://arxiv.org/pdf/2506.16024
Copy Paste: [[2506.16024]] From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation(https://arxiv.org/abs/2506.16024)
Keywords: language model, gpt, llm, prompt
Abstract: Current research on long-form context in Large Language Models (LLMs) primarily focuses on the understanding of long contexts, while Open-ended Long Text Generation (Open-LTG) remains insufficiently explored. Training a long-context generation model requires curation of gold-standard reference data, which is typically nonexistent for informative Open-LTG tasks. However, previous methods only utilize general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce ProxyReward, an innovative reinforcement learning (RL) based framework, which includes a dataset and a reward signal computation method. First, ProxyReward Dataset generation is accomplished through simple prompts that enable the model to create the data automatically, obviating the need for extensive labeled data or significant manual effort. Second, ProxyReward Signal offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method ProxyReward surpasses even GPT-4-Turbo. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.

Title: EvoLM: In Search of Lost Language Model Training Dynamics

Authors: Zhenting Qi, Fan Nie, Alexandre Alahi, James Zou, Himabindu Lakkaraju, Yilun Du, Eric Xing, Sham Kakade, Hanlin Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16029
Pdf URL: https://arxiv.org/pdf/2506.16029
Copy Paste: [[2506.16029]] EvoLM: In Search of Lost Language Model Training Dynamics(https://arxiv.org/abs/2506.16029)
Keywords: language model
Abstract: Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

Title: Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Authors: Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16037
Pdf URL: https://arxiv.org/pdf/2506.16037
Copy Paste: [[2506.16037]] Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3(https://arxiv.org/abs/2506.16037)
Keywords: retrieval-augmented generation
Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.
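One common way to write a joint objective of the kind the abstract describes (retrieval likelihood combined with generation cross-entropy) is sketched below. The symbols are assumptions made for illustration: $q$ is the question, $d^{+}$ a relevant retrieved passage, $y_{1:T}$ the reference answer tokens, and $\lambda$ a weighting hyperparameter. The abstract does not specify the exact form or weighting used in the paper.

```latex
\mathcal{L}(\theta)
  = \underbrace{-\log p_\theta\!\left(d^{+} \mid q\right)}_{\text{retrieval likelihood term}}
  \;+\; \lambda\,
    \underbrace{\left( -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, q,\, d^{+}\right) \right)}_{\text{generation cross-entropy term}}
```

Minimizing both terms together pushes the retriever toward passages that support the answer while training the generator to condition on them.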

Title: DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling

Authors: Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16043
Pdf URL: https://arxiv.org/pdf/2506.16043
Copy Paste: [[2506.16043]] DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling(https://arxiv.org/abs/2506.16043)
Keywords: language model, llm
Abstract: Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
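The idea of treating budget allocation as a multi-armed bandit can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DynScaling's algorithm: each query is an arm, uncertainty is approximated by disagreement among the answers sampled so far, and a generic UCB-style exploration bonus is used. The names `disagreement`, `allocate_budget`, `sample_fn`, and the constant `c` are all hypothetical.

```python
import math
from collections import Counter


def disagreement(answers):
    """Uncertainty proxy (assumed): 1 minus the share of the most common answer."""
    if not answers:
        return 1.0
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def allocate_budget(queries, sample_fn, budget, c=0.5):
    """Spend `budget` samples across queries, favouring uncertain ones.

    `sample_fn(query)` is a caller-supplied function returning one sampled
    answer; here it stands in for an LLM call.
    """
    answers = {q: [] for q in queries}
    for t in range(1, budget + 1):

        def score(q):
            n = len(answers[q])
            if n == 0:
                return float("inf")  # sample every query at least once
            # Observed uncertainty plus a UCB-style exploration bonus.
            return disagreement(answers[q]) + c * math.sqrt(math.log(t) / n)

        chosen = max(queries, key=score)
        answers[chosen].append(sample_fn(chosen))
    return answers
```

With a sampler that always returns the same answer for one query and varying answers for another, most of the budget flows to the second query, which is the adaptive behaviour the abstract describes at a high level.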

Title: Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning

Authors: Duc Hieu Ho, Chenglin Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16064
Pdf URL: https://arxiv.org/pdf/2506.16064
Copy Paste: [[2506.16064]] Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning(https://arxiv.org/abs/2506.16064)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) have demonstrated robust capabilities across various natural language tasks. However, producing outputs that are consistently honest and helpful remains an open challenge. This paper tackles the problem through two complementary directions. It conducts a comprehensive benchmark evaluation of ten widely used large language models, including both proprietary and open-weight models from OpenAI, Meta, and Google. In parallel, it proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting. The key idea behind this strategy is enabling models to self-critique and refine their responses without additional training. The proposed method extends the curiosity-driven prompting strategy by incorporating two lightweight in-context steps: a self-critique step and a refinement step. Experimental results on the HONESET dataset, evaluated using the $\mathrm{H}^2$ (honesty and helpfulness) framework with GPT-4o as the judge, show consistent improvements across all models. The approach reduces the number of poor-quality responses, increases high-quality responses, and achieves relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting across evaluated models. These results highlight the effectiveness of structured self-refinement as a scalable and training-free strategy to improve the trustworthiness of LLM outputs.

Title: FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Authors: Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul, Kunat Pipatanakul
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16123
Pdf URL: https://arxiv.org/pdf/2506.16123
Copy Paste: [[2506.16123]] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning(https://arxiv.org/abs/2506.16123)
Keywords: language model, prompt, chain-of-thought
Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting approach that incorporates insights from domain-specific expert financial reasoning to guide the reasoning traces of large language models. We identify three main prompting styles in FinNLP: (1) standard prompting (zero-shot prompting); (2) unstructured CoT (CoT prompting without an explicit reasoning structure, such as the use of tags); and (3) structured CoT prompting (CoT prompting with explicit instructions or examples that define structured reasoning steps). Previously, FinNLP has primarily focused on prompt engineering with either standard or unstructured CoT prompting, while structured CoT prompting has received limited attention. Furthermore, the design of reasoning structures in structured CoT prompting is often based on heuristics from non-domain experts. In this study, we investigate each prompting approach in FinNLP. We evaluate the three main prompting styles and FinCoT on CFA-style questions spanning ten financial domains. We observe that FinCoT improves performance from 63.2% to 80.5%, and that of Qwen-2.5-7B-Instruct from 69.7% to 74.2%, while reducing generated tokens eight-fold compared to structured CoT prompting. Our findings show that domain-aligned structured prompts not only improve performance and reduce inference costs but also yield more interpretable and expert-aligned reasoning traces.

Title: Under the Shadow of Babel: How Language Shapes Reasoning in LLMs

Authors: Chenxi Wang, Yixuan Zhang, Lang Gao, Zixiang Xu, Zirui Song, Yanbo Wang, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16151
Pdf URL: https://arxiv.org/pdf/2506.16151
Copy Paste: [[2506.16151]] Under the Shadow of Babel: How Language Shapes Reasoning in LLMs(https://arxiv.org/abs/2506.16151)
Keywords: language model, llm
Abstract: Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.

Title: SGIC: A Self-Guided Iterative Calibration Framework for RAG

Authors: Guanhua Chen, Yutong Yao, Lidia S. Chao, Xuebo Liu, Derek F. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16172
Pdf URL: https://arxiv.org/pdf/2506.16172
Copy Paste: [[2506.16172]] SGIC: A Self-Guided Iterative Calibration Framework for RAG(https://arxiv.org/abs/2506.16172)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, many methods neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present SGIC, a new Self-Guided Iterative Calibration framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.

Title: JETHICS: Japanese Ethics Understanding Evaluation Dataset

Authors: Masashi Takeshita, Rafal Rzepka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16187
Pdf URL: https://arxiv.org/pdf/2506.16187
Copy Paste: [[2506.16187]] JETHICS: Japanese Ethics Understanding Evaluation Dataset(https://arxiv.org/abs/2506.16187)
Keywords: language model, gpt, llm
Abstract: In this work, we propose JETHICS, a Japanese dataset for evaluating ethics understanding of AI models. JETHICS contains 78K examples and is built by following the construction methods of the existing English ETHICS dataset. It includes four categories based on normative theories and concepts from ethics and political philosophy, and one representing commonsense morality. Our evaluation experiments on non-proprietary large language models (LLMs) and on GPT-4o reveal that even GPT-4o achieves only an average score of about 0.7, while the best-performing Japanese LLM attains around 0.5, indicating relatively large room for improvement in current LLMs.

Title: Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports

Authors: Anindita Bhattacharya, Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16247
Pdf URL: https://arxiv.org/pdf/2506.16247
Copy Paste: [[2506.16247]] Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports(https://arxiv.org/abs/2506.16247)
Keywords: language model, gpt, chat
Abstract: The findings section of a radiology report is often detailed and lengthy, whereas the impression section is comparatively more compact and captures key diagnostic conclusions. This research explores the use of advanced abstractive summarization models to generate the concise impression from the findings section of a radiology report. We have used the publicly available MIMIC-CXR dataset. A comparative analysis is conducted on leading pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism. To ensure a thorough assessment, multiple evaluation metrics are employed, including ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. By analyzing the performance of these models, this study identifies their respective strengths and limitations in the summarization of medical text. The findings of this paper provide helpful information for medical professionals who need automated summarization solutions in the healthcare sector.
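
As a rough illustration of the ROUGE-1 metric named in the abstract, the sketch below computes unigram-overlap precision, recall, and F1 between a reference impression and a generated one. It is a simplified version that assumes lowercase whitespace tokenization with no stemming, so its numbers may differ from those of standard ROUGE toolkits; the function name `rouge1` and the example strings are made up for illustration.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram-overlap precision, recall, and F1."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference impression and model output.
scores = rouge1("no acute cardiopulmonary process", "no acute process")
print(scores)
```

Here every candidate word appears in the reference (precision 1.0), while one of the four reference words is missing from the candidate (recall 0.75).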

Title: PL-Guard: Benchmarking Language Model Safety for Polish

Authors: Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16322
Pdf URL: https://arxiv.org/pdf/2506.16322
Copy Paste: [[2506.16322]] PL-Guard: Benchmarking Language Model Safety for Polish(https://arxiv.org/abs/2506.16322)
Keywords: language model, llm
Abstract: Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.

Title: Analyzing the Influence of Knowledge Graph Information on Relation Extraction

Authors: Cedric Möller, Ricardo Usbeck
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.16343
Pdf URL: https://arxiv.org/pdf/2506.16343
Copy Paste: [[2506.16343]] Analyzing the Influence of Knowledge Graph Information on Relation Extraction(https://arxiv.org/abs/2506.16343)
Keywords: llm
Abstract: We examine the impact of incorporating knowledge graph information on the performance of relation extraction models across a range of datasets. Our hypothesis is that the positions of entities within a knowledge graph provide important insights for relation extraction tasks. We conduct experiments on multiple datasets, each varying in the number of relations, training examples, and underlying knowledge graphs. Our results demonstrate that integrating knowledge graph information significantly enhances performance, especially when dealing with an imbalance in the number of training examples for each relation. We evaluate the contribution of knowledge graph-based features by combining established relation extraction methods with graph-aware Neural Bellman-Ford networks. These features are tested in both supervised and zero-shot settings, demonstrating consistent performance improvements across various datasets.

Title: Can structural correspondences ground real world representational content in Large Language Models?

Authors: Iwan Williams
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16370
Pdf URL: https://arxiv.org/pdf/2506.16370
Copy Paste: [[2506.16370]] Can structural correspondences ground real world representational content in Large Language Models?(https://arxiv.org/abs/2506.16370)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) such as GPT-4 produce compelling responses to a wide range of prompts. But their representational capacities are uncertain. Many LLMs have no direct contact with extra-linguistic reality: their inputs, outputs and training data consist solely of text, raising the questions (1) can LLMs represent anything and (2) if so, what? In this paper, I explore what it would take to answer these questions according to a structural-correspondence based account of representation, and make an initial survey of this evidence. I argue that the mere existence of structural correspondences between LLMs and worldly entities is insufficient to ground representation of those entities. However, if these structural correspondences play an appropriate role - they are exploited in a way that explains successful task performance - then they could ground real world contents. This requires overcoming a challenge: the text-boundedness of LLMs appears, on the face of it, to prevent them engaging in the right sorts of tasks.

Title: InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.16381
Pdf URL: https://arxiv.org/pdf/2506.16381
Copy Paste: [[2506.16381]] InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems(https://arxiv.org/abs/2506.16381)
Keywords: prompt
Abstract: In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

Title: Large Language Models in Argument Mining: A Survey

Authors: Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16383
Pdf URL: https://arxiv.org/pdf/2506.16383
Copy Paste: [[2506.16383]] Large Language Models in Argument Mining: A Survey(https://arxiv.org/abs/2506.16383)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques -- such as prompting, chain-of-thought reasoning, and retrieval augmentation -- have reconfigured their execution. We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

Title: RiOT: Efficient Prompt Refinement with Residual Optimization Tree

Authors: Chenyi Zhou, Zhengyan Shi, Yuan Yao, Lei Liang, Huajun Chen, Qiang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16389
Pdf URL: https://arxiv.org/pdf/2506.16389
Copy Paste: [[2506.16389]] RiOT: Efficient Prompt Refinement with Residual Optimization Tree(https://arxiv.org/abs/2506.16389)
Keywords: language model, llm, prompt
Abstract: Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates the text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
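
The perplexity-based selection step mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which text is scored or in which direction perplexity is used, so the sketch assumes that each candidate prompt comes with per-token natural-log probabilities from some language model (hard-coded here) and that lower perplexity is preferred. The names `perplexity` and `select_prompt` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_prompt(candidates):
    """Return the candidate prompt with the lowest perplexity (assumed criterion)."""
    return min(candidates, key=lambda prompt: perplexity(candidates[prompt]))


# Hypothetical candidates mapped to made-up per-token log-probabilities.
candidates = {
    "Solve the problem step by step.": [-0.2, -0.4, -0.3],
    "Answer.": [-1.5, -2.0],
}
best = select_prompt(candidates)
print(best, perplexity(candidates[best]))
```

In a real pipeline the log-probabilities would come from a model's scoring pass rather than from constants.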

Title: From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling

Authors: Yao Lu, Zhaiyuan Ji, Jiawei Du, Yu Shanqing, Qi Xuan, Tianyi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16393
Pdf URL: https://arxiv.org/pdf/2506.16393
Copy Paste: [[2506.16393]] From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling(https://arxiv.org/abs/2506.16393)
Keywords: language model, gpt, llm
Abstract: Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, the cost of calling commercial APIs for large-scale annotation is very high; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design a fully automatic annotation framework, AutoAnnotator, based on it. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: this https URL.
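
A minimal sketch of the two-layer idea described above: several small labelers vote, and samples on which they disagree are passed to a reviewer. The disagreement rule (any non-unanimous vote counts as a difficult sample) is an assumption, since the abstract does not specify how difficult samples are detected. The toy keyword labelers and the `reviewer` function are hypothetical stand-ins for the SLMs and the LLM meta-controller.

```python
from collections import Counter
from typing import Callable, List, Tuple

Labeler = Callable[[str], str]


def vote(text: str, specialists: List[Labeler]) -> Tuple[str, bool]:
    """Return the majority label and whether all specialists agreed."""
    labels = [label(text) for label in specialists]
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count == len(labels)


def annotate(texts: List[str], specialists: List[Labeler], reviewer: Labeler):
    """Label texts by voting; route non-unanimous samples to the reviewer."""
    results, hard_samples = [], []
    for text in texts:
        label, unanimous = vote(text, specialists)
        if not unanimous:
            label = reviewer(text)  # secondary review of a difficult sample
            hard_samples.append((text, label))
        results.append(label)
    return results, hard_samples


# Toy stand-ins for small models (keyword rules) and for the reviewing LLM.
specialists = [
    lambda t: "positive" if "good" in t else "negative",
    lambda t: "positive" if "good" in t or "fine" in t else "negative",
    lambda t: "negative" if "bad" in t else "positive",
]
reviewer = lambda t: "positive"

labels, hard = annotate(["good movie", "bad movie", "fine movie"], specialists, reviewer)
print(labels, hard)
```

The `hard` list plays the role of the difficult-sample set that the abstract says is reused for staged fine-tuning of the SLMs; the fine-tuning itself is not shown.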

Title: OJBench: A Competition Level Code Benchmark For Large Language Models

Authors: Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16395
Pdf URL: https://arxiv.org/pdf/2506.16395
Copy Paste: [[2506.16395]] OJBench: A Competition Level Code Benchmark For Large Language Models(https://arxiv.org/abs/2506.16395)
Keywords: language model, llm
Abstract: Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.

Title: NepaliGPT: A Generative Language Model for the Nepali Language

Authors: Shushanta Pudasaini, Aman Shakya, Siddhartha Shrestha, Sahil Bhatta, Sunil Thapa, Sushmita Palikhe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16399
Pdf URL: https://arxiv.org/pdf/2506.16399
Copy Paste: [[2506.16399]] NepaliGPT: A Generative Language Model for the Nepali Language(https://arxiv.org/abs/2506.16399)
Keywords: language model, gpt, llm, chat
Abstract: After the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity in recent days and thousands of variants of LLMs have been released. However, there is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. To fill this research gap in the Nepali NLP space, this research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset comprised of 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: Perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41%.

Title: When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

Authors: Zhen Xu, Shang Zhu, Jue Wang, Junlin Wang, Ben Athiwaratkun, Chi Wang, James Zou, Ce Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16411
Pdf URL: https://arxiv.org/pdf/2506.16411
Copy Paste: [[2506.16411]] When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework(https://arxiv.org/abs/2506.16411)
Keywords: language model, gpt, llm, long context, agent
Abstract: We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a lengthy sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring superlinear model noise growth with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled understanding framework and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
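
The chunk-then-aggregate pattern analyzed in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's setup: the worker is a plain counting function standing in for a per-chunk model call, the aggregator is `sum`, and the names `split_into_chunks` and `divide_and_conquer` are hypothetical.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def divide_and_conquer(tokens: List[str], chunk_size: int,
                       worker: Callable, aggregator: Callable):
    """Process each chunk independently, then combine the partial results."""
    partials = [worker(chunk) for chunk in split_into_chunks(tokens, chunk_size)]
    return aggregator(partials)


tokens = "the needle is in the hay and the needle is sharp".split()
# Counting a keyword has no cross-chunk dependence, so chunking loses nothing here.
count = divide_and_conquer(tokens, 4, lambda chunk: chunk.count("needle"), sum)
print(count)
```

Counting is a case with essentially no task noise in the abstract's terminology. A task whose answer depends on text spanning a chunk boundary would not decompose this cleanly, which is the kind of condition the framework is meant to characterize.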

Title: REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing

Authors: Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Subjects: cs.CL, cs.AR, cs.DB
Abstract URL: https://arxiv.org/abs/2506.16444
Pdf URL: https://arxiv.org/pdf/2506.16444
Copy Paste: [[2506.16444]] REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing(https://arxiv.org/abs/2506.16444)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
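
To make the retrieval stage concrete, the sketch below maps a query embedding to its most similar stored embeddings and returns the documents linked to them. It is an exact brute-force search on the host, not an approximate (ANNS) algorithm and not in-storage processing, so it only illustrates what the stage computes. The embeddings, documents, and function names are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float],
             index: List[Tuple[Sequence[float], str]], k: int) -> List[str]:
    """Return the documents whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    return [document for _, document in ranked[:k]]


# Each entry links an embedding vector to its associated document.
index = [
    ([1.0, 0.0], "doc about storage"),
    ([0.0, 1.0], "doc about cooking"),
    ([0.9, 0.1], "doc about flash memory"),
]
top = retrieve([1.0, 0.05], index, k=2)
print(top)
```

The brute-force scan touches every embedding, which hints at why data movement between storage and host becomes a bottleneck for large databases.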

Title: StoryWriter: A Multi-Agent Framework for Long Story Generation

Authors: Haotian Xia, Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16445
Pdf URL: https://arxiv.org/pdf/2506.16445
Copy Paste: [[2506.16445]] StoryWriter: A Multi-Agent Framework for Long Story Generation(https://arxiv.org/abs/2506.16445)
Keywords: language model, llm, agent
Abstract: Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) an outline agent, which generates event-based outlines containing rich event plots, characters, and event-event relationships; (2) a planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story; and (3) a writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, which contains about 6,000 high-quality long stories, with an average length of 8,000 words. We train the models Llama3.1-8B and GLM4-9B using supervised fine-tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrate advanced performance in long story generation.

Title: Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection

Authors: Saad Almohaimeed, Saleh Almohaimeed, Damla Turgut, Ladislau Bölöni
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16476
Pdf URL: https://arxiv.org/pdf/2506.16476
Copy Paste: [[2506.16476]] Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection(https://arxiv.org/abs/2506.16476)
Keywords: gpt
Abstract: Implicit hate speech has recently emerged as a critical challenge for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and are often influenced by annotators' subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.

Title: Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples

Authors: Soumya Suvra Ghosal, Vaibhav Singh, Akash Ghosh, Soumyabrata Pal, Subhadip Baidya, Sriparna Saha, Dinesh Manocha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16502
Pdf URL: https://arxiv.org/pdf/2506.16502
Copy Paste: [[2506.16502]] Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples(https://arxiv.org/abs/2506.16502)
Keywords: language model, gpt, llm, prompt
Abstract: Reward models are essential for aligning large language models (LLMs) with human preferences. However, most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages, resulting in unreliable reward signals for low-resource Indic languages. Collecting large-scale, high-quality preference data for these languages is prohibitively expensive, making preference-based training approaches impractical. To address this challenge, we propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages. RELIC trains a retriever with a pairwise ranking objective to select in-context examples from auxiliary high-resource languages that most effectively highlight the distinction between preferred and less-preferred responses. Extensive experiments on three preference datasets (PKU-SafeRLHF, WebGPT, and HH-RLHF) using state-of-the-art open-source reward models demonstrate that RELIC significantly improves reward model accuracy for low-resource Indic languages, consistently outperforming existing example selection methods. For example, on Bodo, a low-resource Indic language, using a LLaMA-3.2-3B reward model, RELIC achieves a 12.81% and 10.13% improvement in accuracy over zero-shot prompting and the state-of-the-art example selection method, respectively.
摘要：奖励模型对于将大型语言模型（LLM）与人类偏好保持一致至关重要。但是，大多数开源的多语言奖励模型主要是在高资源语言的偏好数据集上培训的，从而导致低资源指示语言的奖励信号不可靠。收集这些语言的大规模高质量偏好数据非常昂贵，这使得基于偏好的培训方法不切实际。为了应对这一挑战，我们提出了一个新颖的文章中文化学习框架，用于低资源指示语言的奖励建模。 Relic训练一个具有成对排名目标的检索员，从辅助高资源语言中选择中文示例，最有效地强调了首选和偏爱响应之间的区别。在三个偏好数据集中进行了广泛的实验-PKU-SAFERLHF，WebGPT和HHH-RLHF使用的最先进的开源奖励模型表明，Relic显着提高了低资源指示语言的奖励模型准确性，从而始终超过现有的示例选择方法。例如，在Bodo-A低资源形式语言 - 使用Llama-3.2-3b奖励模型上，Relic分别在零摄像机提示和最新的示例选择方法的准确性上取得了12.81％和10.13％的提高。
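The RELIC abstract mentions a pairwise ranking objective for the retriever but does not give its exact form. As a rough illustration only, the sketch below uses the standard logistic pairwise loss, which is an assumption and not necessarily the loss used in the paper; the function name and the scalar scores are hypothetical.

```python
import math

def pairwise_ranking_loss(score_better: float, score_worse: float) -> float:
    """Logistic pairwise loss: -log(sigmoid(score_better - score_worse)).

    Assumed form for illustration; the paper's actual objective may differ.
    """
    margin = score_better - score_worse
    # softplus(-margin), written in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the better example is scored further above the worse one.
losses = [pairwise_ranking_loss(s, 0.0) for s in (-2.0, 0.0, 2.0)]
```

The loss is log(2) when both candidates receive the same score and decreases monotonically as the margin grows.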

Title: Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework

Authors: Nadav Kunievsky, James A. Evans
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16584
Pdf URL: https://arxiv.org/pdf/2506.16584
Copy Paste: [[2506.16584]] Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework(https://arxiv.org/abs/2506.16584)
Keywords: language model, llm, prompt
Abstract: Understanding whether large language models (LLMs) possess a world model (a structured understanding of the world that supports generalization beyond surface-level patterns) is central to assessing their reliability, especially in high-stakes applications. We propose a formal framework for evaluating whether an LLM exhibits a sufficiently robust world model, defined as producing consistent outputs across semantically equivalent prompts while distinguishing between prompts that express different intents. We introduce a new evaluation approach that decomposes model response variability into three components: variability due to user purpose, user articulation, and model instability. An LLM with a strong world model should attribute most of the variability in its responses to changes in foundational purpose rather than superficial changes in articulation. This approach allows us to quantify how much of a model's behavior is semantically grounded rather than driven by model instability or alternative wording. We apply this framework to evaluate LLMs across diverse domains. Our results show that larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. This improvement is not uniform, however: larger models do not consistently outperform smaller ones across all domains, and their advantage in robustness is often modest. These findings highlight the importance of moving beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model's internal understanding of the world.
摘要：了解大型语言模型（LLM）是否具有世界模型的对世界的结构性理解，该理解支持超出表面级别模式的概括，这是评估其可靠性的中心，尤其是在高风险应用中。我们提出了一个正式框架，用于评估LLM是否表现出足够健壮的世界模型，定义为在语义上等效的提示中产生一致的输出，同时区分表达不同意图的提示。我们引入了一种新的评估方法，以测量将模型响应变异分解为三个组成部分：由于用户目的，用户表达和模型不稳定性而导致的可变性。具有强大世界模型的LLM应归因于其对基本目的变化的响应的大部分变异性，而不是表达的表面变化。这种方法使我们能够量化模型的行为的语义基础，而不是由模型不稳定性或替代措辞驱动。我们应用此框架来评估跨不同领域的LLM。我们的结果表明，较大的模型如何将更大的输出可变性归因于用户目的的变化，这表明世界模型更强大。但是，这种改进并不统一：较大的模型在所有领域的表现并不一致地优于较小的模型，而且它们在鲁棒性方面的优势通常是适度的。这些发现强调了超越基于准确的基准测试的重要性，而基于准确性的基准介绍了语义诊断，从而更直接地评估了模型内部对世界的内部理解的结构和稳定性。
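The three-way split described in this abstract (purpose, articulation, instability) resembles a nested application of the law of total variance. The sketch below is a minimal illustration under strong assumptions that are not taken from the paper: each response has already been mapped to a scalar score, the design is balanced (same number of articulations per purpose and samples per articulation), and the function name is hypothetical.

```python
from statistics import fmean, pvariance

def decompose_variance(responses):
    """responses[purpose][articulation] is a list of scalar scores
    from repeated sampling. Assumes a balanced design."""
    cell_means = [[fmean(samples) for samples in arts] for arts in responses]
    purpose_means = [fmean(means) for means in cell_means]
    return {
        # variance of the per-purpose means
        "purpose": pvariance(purpose_means),
        # average variance across articulations within a purpose
        "articulation": fmean([pvariance(means) for means in cell_means]),
        # average variance across repeated samples of the same prompt
        "instability": fmean([pvariance(s) for arts in responses for s in arts]),
    }

data = [[[0.0, 2.0], [2.0, 4.0]], [[4.0, 6.0], [6.0, 8.0]]]
parts = decompose_variance(data)
purpose_share = parts["purpose"] / sum(parts.values())
```

With a balanced design the three components sum to the population variance of all scores, so `purpose_share` can be read as the fraction of variability attributable to purpose under these assumptions.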

Title: A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications

Authors: Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16594
Pdf URL: https://arxiv.org/pdf/2506.16594
Copy Paste: [[2506.16594]] A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications(https://arxiv.org/abs/2506.16594)
Keywords: language model, llm, prompt
Abstract: Synthetic data generation--mitigating data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances in large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting LLMs (72.9%), fine-tuning LLMs (22.0%), and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.
摘要：合成数据生成 - 减轻生物医学领域的数据稀缺，隐私问题和数据质量挑战 - 大型语言模型（LLMS）的快速进步促进了数据质量挑战。该范围审查遵循Prisma-SCR指南，并合成了59项研究，该研究于2020年至2025年之间发表，并从PubMed，ACM，Web of Science和Google Scholar收集。该评论系统地研究了合成数据生成的生物医学研究和应用趋势，强调临床应用，方法和评估。我们的分析确定了非结构化文本（78.0％），表格数据（13.6％）和多模式来源的数据模式（8.4％）；提示的生成方法（72.9％），微调（22.0％）LLM和专业模型（5.1％）；对内在指标（27.1％），人类在循环评估（55.9％）和基于LLM的评估（13.6％）的异质评估（27.1％）。该分析解决了当前的局限性，卫生专业人员如何利用生物医学领域的合成数据生成。我们的评论还强调了跨临床领域，资源和模型可访问性以及评估标准化的适应性挑战。

Title: Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System

Authors: Jianlin Shi, Brian T. Bucher
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16628
Pdf URL: https://arxiv.org/pdf/2506.16628
Copy Paste: [[2506.16628]] Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System(https://arxiv.org/abs/2506.16628)
Keywords: language model, llm
Abstract: Despite advances in machine learning (ML) and large language models (LLMs), rule-based natural language processing (NLP) systems remain active in clinical settings due to their interpretability and operational efficiency. However, their manual development and maintenance are labor-intensive, particularly in tasks with large linguistic variability. To overcome these limitations, we proposed a novel approach employing LLMs solely during the rule-based systems development phase. We conducted the initial experiments focusing on the first two steps of developing a rule-based NLP pipeline: find relevant snippets from the clinical note; extract informative keywords from the snippets for the rule-based named entity recognition (NER) component. Our experiments demonstrated exceptional recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and 1.0 in extracting key terms for NER. This study sheds light on a promising new direction for NLP development, enabling semi-automated or automated development of rule-based systems with significantly faster, more cost-effective, and transparent execution compared with deep learning model-based solutions.
摘要：尽管机器学习进展（ML）和大型语言模型（LLMS），但基于规则的自然语言处理（NLP）系统由于其可解释性和运营效率而在临床环境中保持活跃。但是，他们的手动开发和维护是劳动力密集的，尤其是在具有较大语言可变性的任务中。为了克服这些局限性，我们提出了一种新的方法，该方法仅在基于规则的系统开发阶段采用LLM。我们进行了最初的实验，重点是开发基于规则的NLP管道的前两个步骤：从临床注释中查找相关片段；从摘要中提取有益的关键字，以基于规则的命名实体识别（NER）组件。我们的实验证明了在识别临床相关文本片段（DeepSeek：0.98，QWEN：0.99）和1.0中提取NER的关键术语时表明了出色的回忆。这项研究阐明了NLP开发的有希望的新方向，与基于深度学习模型的解决方案相比，具有明显更快，更具成本效益和透明执行的基于规则的系统的半自动化或自动化开发。

Title: Arch-Router: Aligning LLM Routing with Human Preferences

Authors: Co Tran, Salman Paracha, Adil Hafeez, Shuguang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16655
Pdf URL: https://arxiv.org/pdf/2506.16655
Copy Paste: [[2506.16655]] Arch-Router: Aligning LLM Routing with Human Preferences(https://arxiv.org/abs/2506.16655)
Keywords: language model, llm
Abstract: With the rapid proliferation of large language models (LLMs) -- each optimized for different strengths, style, or latency/cost profile -- routing has become an essential technique to operationalize the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. In this work, we propose a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) -- offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Our approach also supports seamlessly adding new models for routing without requiring retraining or architectural modifications. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. Our approach captures subjective evaluation criteria and makes routing decisions more transparent and flexible. Our model is available at: this https URL.
摘要：随着大型语言模型（LLMS）的快速扩散（针对不同的优势，样式或潜伏期/成本概况进行了优化），路由已成为运营不同模型使用的必不可少的技术。但是，现有的LLM路由方法以两种关键方式受到限制：它们使用通常无法捕获主观评估标准驱动的人类偏好的基准评估性能，并且通常从有限的模型中进行选择。在这项工作中，我们提出了一个由偏好对准的路由框架，该框架通过将查询与用户定义的域（例如旅行）或操作类型（例如，图像编辑）匹配来指导模型选择 - 提供了一种实用机制来编码路由决策中的偏好。具体来说，我们介绍了\ textbf {Arch-router}，这是一种紧凑的1.5B模型，该模型学会映射查询为模型路由决策的域名偏好。我们的方法还支持无缝添加用于路由的新模型，而无需进行重新训练或架构修改。对话数据集的实验表明，我们的方法实现了最新的（SOTA）导致与人类偏好的匹配查询，并且表现优于顶级专有模型。我们的方法捕获了主观评估标准，并使路由决策更加透明和灵活。我们的模型可在：\ texttt {this HTTPS url}上获得。
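The Arch-Router abstract describes routing as two separable steps: map a query to a user-defined policy, then look up the model assigned to that policy. The sketch below illustrates only that separation. The keyword matcher is a hypothetical stand-in for the learned 1.5B router, and all policy labels and model names are invented for illustration.

```python
def keyword_policy(query: str) -> str:
    """Hypothetical stand-in for the learned router: returns a policy label."""
    text = query.lower()
    if "flight" in text or "hotel" in text:
        return "travel"
    if "image" in text or "photo" in text:
        return "image_editing"
    return "other"

def route(query, policy_to_model, classify=keyword_policy, default="general-model"):
    """Pick the model assigned to the query's policy, or a default."""
    return policy_to_model.get(classify(query), default)

# Policy-to-model table (names are placeholders, not real endpoints).
table = {"travel": "model-a"}
first = route("Find me a hotel in Lisbon", table)

# Assigning a model to another policy only edits the table, not the router.
table["image_editing"] = "model-b"
second = route("Crop this photo to a square", table)
```

Because the table is plain data, swapping or adding models leaves the classifier untouched, which is one way to read the abstract's claim about adding models without retraining.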

Title: Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations

Authors: Ananth Agarwal, Jasper Jian, Christopher D. Manning, Shikhar Murty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16678
Pdf URL: https://arxiv.org/pdf/2506.16678
Copy Paste: [[2506.16678]] Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations(https://arxiv.org/abs/2506.16678)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations; however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.
摘要：大型语言模型（LLMS）在处理和生成文本时表现出强大的语法掌握。尽管这表明对层次语法和依赖关系的内在理解，但它们代表句法结构的确切机制是解释性研究中的开放区域。探测提供了一种方法来确定在激活中线性编码的语法机制，但是，尚无综合研究确定模型的探测准确性是否可靠地预测其下游句法性能。采用“机制与结果”框架，我们评估了32个开放式变压器模型，发现通过探测提取的句法特征无法预测英语语言现象跨英语语言现象的目标语法评估的结果。我们的结果突出了通过探测和下游任务中可观察到的句法行为发现的潜在句法表示之间的实质性脱节。

Title: LegiGPT: Party Politics and Transport Policy with Large Language Model

Authors: Hyunsoo Yun, Eun Hak Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16692
Pdf URL: https://arxiv.org/pdf/2506.16692
Copy Paste: [[2506.16692]] LegiGPT: Party Politics and Transport Policy with Large Language Model(https://arxiv.org/abs/2506.16692)
Keywords: language model, gpt, llm, prompt
Abstract: Given the significant influence of lawmakers' political ideologies on legislative decision-making, understanding their impact on policymaking is critically important. We introduce a novel framework, LegiGPT, which integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. LegiGPT employs a multi-stage filtering and classification pipeline using zero-shot prompting with GPT-4. Using legislative data from South Korea's 21st National Assembly, we identify key factors - including sponsor characteristics, political affiliations, and geographic variables - that significantly influence transportation policymaking. The LLM was used to classify transportation-related bill proposals through a stepwise filtering process based on keywords, phrases, and contextual relevance. XAI techniques were then applied to examine relationships between party affiliation and associated attributes. The results reveal that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, are critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initiating or supporting proposals. This integrated approach provides a valuable tool for understanding legislative dynamics and guiding future policy development, with broader implications for infrastructure planning and governance.
摘要：鉴于立法者政治意识形态对立法决策的重大影响，了解他们对政策制定的影响至关重要。我们介绍了一个新颖的框架，即Legigpt，该框架将大型语言模型（LLM）与可解释的人工智能（XAI）集成在一起，以分析与运输相关的立法建议。 Legigpt使用GPT-4的零射击提示使用多阶段过滤和分类管道。使用韩国第21国民议会的立法数据，我们确定了关键因素，包括赞助商特征，政治隶属关系和地理变量，这会极大地影响运输决策。 LLM用于通过基于关键字，短语和上下文相关性的逐步过滤过程来对与运输相关的帐单提案进行分类。然后将XAI技术应用于检查政党隶属关系与相关属性之间的关系。结果表明，保守和进步的赞助商的数量和比例以及地区规模和选举人口是塑造立法结果的关键决定因素。这些发现表明，双方通过不同形式的参与（例如启动或支持提案）为两党立法做出了贡献。这种综合方法为理解立法动力和指导未来政策制定提供了宝贵的工具，对基础设施计划和治理具有更广泛的影响。

Title: ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models

Authors: Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16712
Pdf URL: https://arxiv.org/pdf/2506.16712
Copy Paste: [[2506.16712]] ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models(https://arxiv.org/abs/2506.16712)
Keywords: gpt, hallucination
Abstract: Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.
摘要：生成奖励模型（GRM）比标量奖励模型在捕获人类的偏好时提供了更大的灵活性，但是其有效性受到不良推理能力的限制。这通常会导致不完整或过度投机的推理路径，从而导致复杂任务中的幻觉或缺少关键信息。我们使用ReasongRM（一个三阶段的生成奖励建模框架）来应对这一挑战。在第一阶段，零-RL用于生成简洁的，以结果为导向的推理路径，以减少关键遗漏的可能性。在第二阶段，我们介绍了一个新颖的评估度量标准，即$ r^\ star $，该$ $ r^\ star $根据其一代的可能性来得分推理路径。这有利于通过最少的探索来达到正确答案的路径，从而有助于减少训练过程中容易发生的数据。在最后阶段，该模型通过在具有挑战性的例子上进行加强学习，以增强其偏好歧视能力。三个公共基准测试的实验表明，ReasongRM可以达到竞争性或最先进的性能，平均比以前的最佳GRM超过1.8 \％，超过了诸如GPT-4O之类的专有模型，高达5.6 \％。这些结果证明了推理感知训练的有效性，并强调了高质量理由选择对可靠的偏好模型的重要性。

Title: The Role of Model Confidence on Bias Effects in Measured Uncertainties

Authors: Xinyi Liu, Weiguang Wang, Hangfeng He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16724
Pdf URL: https://arxiv.org/pdf/2506.16724
Copy Paste: [[2506.16724]] The Role of Model Confidence on Bias Effects in Measured Uncertainties(https://arxiv.org/abs/2506.16724)
Keywords: language model, gpt, llm, prompt
Abstract: With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model's lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.
摘要：随着对开放式任务的大型语言模型（LLM）的越来越多，准确评估了反映模型缺乏知识的认知不确定性，这对于确保可靠的结果至关重要。然而，由于存在众所周知的不确定性，量化此类任务中的认知不确定性是具有挑战性的，这是由于多个有效的答案引起的。虽然偏见可以将噪声引入认知不确定性估计中，但它也可能会减少差异不确定性的噪声。为了调查这种权衡，我们进行了有关视觉问题回答（VQA）任务的实验，并发现缓解及时引入的偏见可改善GPT-4O中的不确定性定量。基于先前的工作，表明LLM倾向于在模型置信度较低时复制输入信息，我们进一步分析了这些迅速偏见如何影响GPT-4O和QWEN2-VL的各种无偏见置信度的认识和良性不确定性。我们发现，当无偏见的模型置信度较低时，所有考虑的偏见都会引起两种不确定性的更大变化。此外，由于偏见，较低的无偏见模型置信度导致认知不确定性（即过度启发）的低估，而对质量不确定性估计的变化方向没有显着影响。这些独特的影响加深了我们对减轻不确定性量化的偏见的理解，并有可能告知更先进的技术的发展。

Title: LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

Authors: Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.16738
Pdf URL: https://arxiv.org/pdf/2506.16738
Copy Paste: [[2506.16738]] LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization(https://arxiv.org/abs/2506.16738)
Keywords: language model
Abstract: With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
摘要：随着语音语言模型（SLM）的快速发展，离散的语音令牌已成为语音和文本之间的核心接口，从而实现了跨模态的统一建模。最新的语音令牌化方法旨在将语义信息从低级声学隔离开来，以更好地与语言模型保持一致。特别是，以前的方法使用SSL教师（例如Hubert）提取语义表示，然后将其蒸馏成语义量化器以抑制声学冗余，并捕获与内容相关的潜在结构。但是，它们仍然产生语音令牌序列的长度比其文本对应物更长，从而为有效的语音语言建模带来了挑战。降低帧速率是一种天然解决方案，但是标准技术（例如跨帧的刚性平均池池）可能会扭曲或稀释有效LM对齐所需的语义结构。为了解决这个问题，我们提出了LM-SPT，这是一种语音令牌化方法，它引入了一种新颖的语义蒸馏。我们没有通过合并直接匹配教师和学生的功能，而是仅从语义令牌中重建语音，并最大程度地减少了原始波形和重建波形的编码表示之间的差异，从冷冻的自动语音识别（ASR）编码器中获得。这种间接但数据驱动的监督使令牌机能够学习与语言模型更加与语言模型更加一致的离散单元。 LM-SPT进一步纳入了对语音令牌化编码器和解码器的建筑改进，并支持多个帧速率，包括25Hz，12.5Hz和6.25Hz。实验结果表明，与基准相比，LM-SPT实现了优越的重建保真度，并且接受LM-SPT令牌训练的SLMS在语音到文本上实现了竞争性能，并且在文本到语音任务上的表现始终超过了基线。
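For context on the frame rates mentioned in the LM-SPT abstract, the sketch below shows the rigid average-pooling baseline that the abstract argues against, not the LM-SPT method itself. It assumes a 50 Hz input feature sequence (an assumption made for illustration) and a hypothetical helper name; window sizes of 2, 4, and 8 then correspond to 25 Hz, 12.5 Hz, and 6.25 Hz.

```python
def average_pool(frames, window):
    """Average non-overlapping windows of frame vectors.

    A trailing partial window is dropped.
    """
    pooled = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        # mean of each feature dimension across the window
        pooled.append([sum(column) / window for column in zip(*chunk)])
    return pooled

# One second of toy 2-dimensional features at an assumed 50 Hz.
one_second = [[float(i), float(2 * i)] for i in range(50)]
frames_per_second = {w: len(average_pool(one_second, w)) for w in (2, 4, 8)}
```

Each pooled vector blends several adjacent frames with equal weight regardless of content, which is the kind of dilution of semantic structure the abstract refers to.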

Title: Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly

Authors: Lance Ying, Ryan Truong, Katherine M. Collins, Cedegao E. Zhang, Megan Wei, Tyler Brooke-Wilson, Tan Zhi-Xuan, Lionel Wong, Joshua B. Tenenbaum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16755
Pdf URL: https://arxiv.org/pdf/2506.16755
Copy Paste: [[2506.16755]] Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly(https://arxiv.org/abs/2506.16755)
Keywords: language model, agent
Abstract: Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
摘要：绘制现实世界的社会推论通常需要考虑多种方式的信息。语言是社交环境中特别有力的信息来源，尤其是在新的情况下，语言可以提供有关环境动力学的抽象信息，又可以提供有关代理的具体详细信息，这些剂量在视觉上很容易观察到。在本文中，我们建议使用语言信息的理性代理合成（LIRAS），这是绘制上下文特定社会推论的框架，该框架集成了语言和视觉输入。里拉斯（Liras）将多模式的社会推理框架作为构建结构化但特定情况的代理和环境表示的过程 - 利用多模式模型将语言和视觉输入解析为统一的符号表示，可以在其上运行贝叶斯反向计划引擎以产生粒状概率的概率判断。在从认知科学实验中得出的一系列现有和新的社会推理任务上，我们发现我们的模型（使用相对轻巧的VLM实例化）优于消融和最先进的模型，在捕获所有领域的人类判断方面。

Title: SocialSim: Towards Socialized Simulation of Emotional Support Conversation

Authors: Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou, Xiyao Xiao, Si Chen, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16756
Pdf URL: https://arxiv.org/pdf/2506.16756
Copy Paste: [[2506.16756]] SocialSim: Towards Socialized Simulation of Emotional Support Conversation(https://arxiv.org/abs/2506.16756)
Keywords: language model, chat
Abstract: Emotional support conversation (ESC) helps reduce people's psychological stress and provides emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.
摘要：情感支持对话（ESC）有助于减少人们的心理压力，并通过互动对话提供情感价值。由于众包大型ESC语料库的成本很高，最近尝试使用大型语言模型进行对话增强。但是，现有的方法在很大程度上忽略了ESC固有的社会动态，从而导致模拟较差。在本文中，我们介绍了SocialSim，这是一个新颖的框架，通过整合社会互动的关键方面来模拟ESC：社会披露和社会意识。在寻求者方面，我们通过构建一个捕捉多样化和真实的寻求帮助的情景来促进社会披露。在支持者方面，我们通过引起认知推理来产生逻辑和支持性反应来增强社会意识。在Socialsim的基础上，我们构建了SSCONV，这是一个大规模的合成ESC语料库，其质量甚至可以超过众包的ESC数据。我们进一步培训了SSCONV的聊天机器人，并在自动评估和人类评估中展示了其最先进的表现。我们认为Socialsim提供了一种可扩展的方法来综合ESC，从而使情感护理更容易获得和实用。

Title: Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Authors: Lei Jiang, Zixun Zhang, Zizhou Wang, Xiaobing Sun, Zhen Li, Liangli Zhen, Xiaohua Xu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.16760
Pdf URL: https://arxiv.org/pdf/2506.16760
Copy Paste: [[2506.16760]] Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models(https://arxiv.org/abs/2506.16760)
Keywords: language model, prompt
Abstract: Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross-model transferability. These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.
摘要：大型视觉模型（LVLMS）在多模式任务中表现出出色的表现，但仍然容易受到绕过内置安全机制的越狱攻击，以引起限制性的内容产生。现有的Black-Box越狱方法主要依赖于对抗文本提示或图像扰动，但是这些方法可以通过标准内容过滤系统高度检测，并且表现出低查询和计算效率。在这项工作中，我们提出了跨模式的对抗性多模式混淆（CAMO），这是一种新颖的黑盒越狱攻击框架，将恶意提示分解为语义上良性的视觉和文字片段。通过利用LVLM的跨模式推理能力，迷彩通过多步推理，逃避常规检测机制来秘密地重建有害指令。我们的方法支持可调的推理复杂性，并且比先前的攻击所需的查询要少得多，从而使隐形和效率既可以。对领先LVLMS进行的全面评估验证了迷彩的有效性，展示了稳健的性能和强大的跨模型可传递性。这些结果强调了当前内置安全机制中的重大漏洞，强调了迫切需要视觉系统中先进的，对齐感知的安全和安全解决方案。

Title: DistillNote: LLM-based clinical note summaries improve heart failure diagnosis

Authors: Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.16777
Pdf URL: https://arxiv.org/pdf/2506.16777
Copy Paste: [[2506.16777]] DistillNote: LLM-based clinical note summaries improve heart failure diagnosis(https://arxiv.org/abs/2506.16777)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present DistillNote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure, compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.
摘要：大型语言模型（LLMS）提供了前所未有的机会，可以简化患者信息的摘要，并减轻压倒医疗保健提供者的临床文件负担。我们提出了DistillNote，这是一个基于LLM的临床Note摘要的框架，并通过三种技术产生了超过64,000个录取注释摘要：（1）一步，直接汇总以及一种分裂和拼接方法（涉及（2）结构化的摘要，重点介绍了针对独立临床见解的结构性摘要，以及（3）蒸馏出结构性凝结的汇总。我们通过使用原始音符训练的模型来预测心力衰竭来测试摘要来预测心力衰竭。与在完整票据中训练的LLM相比，蒸馏摘要可实现AUPRC的79％的文本压缩和高达18.2％的提高。我们还通过与临床医生的盲法比较来评估LLM-AS法官评估中生成的摘要的质量。评估表明，根据相关性和临床可行性，一步摘要受到临床医生的青睐，而蒸馏的摘要则提供了最佳的效率（AVG。6.9倍压缩与性能比率），并显着降低了幻觉。我们发布有关Physionet的摘要，以鼓励未来的研究。

Title: MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Authors: Muyang Zheng, Yuanzhi Yao, Changting Lin, Rui Wang, Meng Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16792
Pdf URL: https://arxiv.org/pdf/2506.16792
Copy Paste: [[2506.16792]] MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning(https://arxiv.org/abs/2506.16792)
Keywords: language model, llm, prompt
Abstract: Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks--methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version--order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.
摘要：尽管努力将大型语言模型（LLM）与社会和道德价值保持一致，但这些模型仍然容易受到越狱攻击的影响 - 旨在引起有害反应的方法。由于令牌输入的离散性质，限制对目标LLM的访问以及有限的查询预算，因此被越狱的Black-Box LLM被认为是具有挑战性的。为了解决上面的问题，我们提出了一种有效的方法，用于通过迭代语义调整（名为MIST）越狱黑盒大型语言模型。雾使攻击者能够迭代地完善提示，以保留原始语义意图，同时引起有害内容。具体来说，为了平衡语义相似性与计算效率，雾气结合了两个关键策略：顺序同义词搜索及其高级版本 - 订单确定的优化。与其他最先进的白色盒子和黑盒越狱方法相比，在两个开源型号和四个封闭式型号上进行了广泛的实验表明，MIST可以实现竞争性攻击成功率和攻击转移性。此外，我们对计算效率进行实验，以验证雾的实际生存能力。

Title: From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts

Authors: Daniel Christoph, Max Ploner, Patrick Haller, Alan Akbik
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16912
Pdf URL: https://arxiv.org/pdf/2506.16912
Copy Paste: [[2506.16912]] From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts(https://arxiv.org/abs/2506.16912)
Keywords: language model
Abstract: Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.
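
The frequency analysis described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name, the order-of-magnitude bucketing, and the toy records are all assumptions made for the example.

```python
import math
from collections import defaultdict


def accuracy_by_frequency(records):
    """Group (frequency, correct) pairs into order-of-magnitude buckets.

    Bucket k holds facts seen between 10**k and 10**(k+1) - 1 times in the
    training corpus (an assumed bucketing scheme, chosen for illustration).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, num_total]
    for freq, correct in records:
        bucket = math.floor(math.log10(freq)) if freq > 0 else -1
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


# Hypothetical probe results: (fact frequency in corpus, model answered correctly)
records = [(3, False), (7, True), (40, True), (55, False), (900, True), (1200, True)]
result = accuracy_by_frequency(records)
print(result)
```

Comparing such per-bucket accuracies across models trained on the same data is one way to see where they diverge on rare facts.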

Title: Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond

Authors: Antonin Berthon, Mihaela van der Schaar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.16982
Pdf URL: https://arxiv.org/pdf/2506.16982
Copy Paste: [[2506.16982]] Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond(https://arxiv.org/abs/2506.16982)
Keywords: llm
Abstract: Accurately assessing student knowledge is critical for effective education, yet traditional Knowledge Tracing (KT) methods rely on opaque latent embeddings, limiting interpretability. Even LLM-based approaches generate direct predictions or summaries that may hallucinate without any accuracy guarantees. We recast KT as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. By constraining all predictive information to pass through a short natural-language bottleneck, LBMs ensure that the summary contains accurate information while remaining human-interpretable. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories. We demonstrate that training the encoder with group-relative policy optimization, using downstream decoding accuracy as a reward signal, effectively improves summary quality.
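
The reward signal mentioned above can be illustrated with one common formulation of group-relative advantages, in which each sampled summary is scored against the other summaries in its group. This is a sketch under assumptions: the function name, the use of the population standard deviation, and the example rewards are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input.

    Each reward is assumed to be the downstream decoding accuracy obtained
    when the frozen decoder sees only the corresponding summary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical decoding accuracies for four summaries of one student history.
rewards = [0.50, 0.75, 0.75, 1.00]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Summaries that help the decoder more than their group average receive a positive advantage and are reinforced; the rest are discouraged.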

Title: TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs

Authors: Sahil Kale, Vijaykant Nadadur
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.16990
Pdf URL: https://arxiv.org/pdf/2506.16990
Copy Paste: [[2506.16990]] TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs(https://arxiv.org/abs/2506.16990)
Keywords: language model, llm, prompt
Abstract: LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at this https URL.

Title: PersonalAI: Towards digital twins in the graph form

Authors: Mikhail Menschikov, Dmitry Evseev, Ruslan Kostoev, Ilya Perepechkin, Ilnaz Salimov, Victoria Dochkina, Petr Anokhin, Evgeny Burnaev, Nikita Semenov
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.17001
Pdf URL: https://arxiv.org/pdf/2506.17001
Copy Paste: [[2506.17001]] PersonalAI: Towards digital twins in the graph form(https://arxiv.org/abs/2506.17001)
Keywords: language model, llm, retrieval augmented generation
Abstract: The challenge of personalizing language models, specifically the ability to account for a user's history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon the ideas of the AriGraph architecture and for the first time introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust. Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture's ability to maintain and utilize temporal dependencies.

Title: LLM-Generated Feedback Supports Learning If Learners Choose to Use It

Authors: Danielle R. Thomas, Conrad Borchers, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Kenneth R. Koedinger
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.17006
Pdf URL: https://arxiv.org/pdf/2506.17006
Copy Paste: [[2506.17006]] LLM-Generated Feedback Supports Learning If Learners Choose to Use It(https://arxiv.org/abs/2506.17006)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly used to generate feedback, yet their impact on learning remains underexplored, especially compared to existing feedback methods. This study investigates how on-demand LLM-generated explanatory feedback influences learning in seven scenario-based tutor training lessons. Analyzing over 2,600 lesson completions from 885 tutor learners, we compare posttest performance among learners across three groups: learners who received feedback generated by gpt-3.5-turbo, those who declined it, and those without access. All groups received non-LLM corrective feedback. To address potential selection bias, where higher-performing learners may be more inclined to use LLM feedback, we applied propensity scoring. Learners with a higher predicted likelihood of engaging with LLM feedback scored significantly higher at posttest than those with lower propensity. After adjusting for this effect, two out of seven lessons showed statistically significant learning benefits from LLM feedback with standardized effect sizes of 0.28 and 0.33. These moderate effects suggest that the effectiveness of LLM feedback depends on the learners' tendency to seek support. Importantly, LLM feedback did not significantly increase completion time, and learners overwhelmingly rated it as helpful. These findings highlight LLM feedback's potential as a low-cost and scalable way to improve learning on open-ended tasks, particularly in existing systems already providing feedback without LLMs. This work contributes open datasets, LLM prompts, and rubrics to support reproducibility.

Title: Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning

Authors: Giuseppe Attanasio, Sonal Sannigrahi, Ben Peters, André F. T. Martins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17019
Pdf URL: https://arxiv.org/pdf/2506.17019
Copy Paste: [[2506.17019]] Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning(https://arxiv.org/abs/2506.17019)
Keywords: language model
Abstract: This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones (< 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.

Title: MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Authors: Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.17046
Pdf URL: https://arxiv.org/pdf/2506.17046
Copy Paste: [[2506.17046]] MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models(https://arxiv.org/abs/2506.17046)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

Title: Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025

Authors: Dominik Macháček, Peter Polák
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17077
Pdf URL: https://arxiv.org/pdf/2506.17077
Copy Paste: [[2506.17077]] Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025(https://arxiv.org/abs/2506.17077)
Keywords: llm, prompt
Abstract: This paper describes Charles University's submission to the Simultaneous Speech Translation Task of IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers' baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. Additionally, we propose a new enhanced measure of speech recognition latency.

Title: Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17080
Pdf URL: https://arxiv.org/pdf/2506.17080
Copy Paste: [[2506.17080]] Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs(https://arxiv.org/abs/2506.17080)
Keywords: gpt, llm
Abstract: Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

Title: Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation

Authors: Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17088
Pdf URL: https://arxiv.org/pdf/2506.17088
Copy Paste: [[2506.17088]] Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation(https://arxiv.org/abs/2506.17088)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: this https URL.

Title: Better Language Model Inversion by Compactly Representing Next-Token Distributions

Authors: Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17090
Pdf URL: https://arxiv.org/pdf/2506.17090
Copy Paste: [[2506.17090]] Better Language Model Inversion by Compactly Representing Next-Token Distributions(https://arxiv.org/abs/2506.17090)
Keywords: language model, prompt
Abstract: Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2 to 3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5 to 27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
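
The low-dimensional-subspace insight can be checked on a toy model. If logits are a linear function W h of a hidden state of size d, then the log-probability vector equals W h minus a scalar normalizer, so it lies in the span of the columns of W and the all-ones vector, which has dimension at most d + 1. The sketch below is a hypothetical illustration, not the PILS implementation: the matrix W, the hidden state h, and the choice of which d + 1 entries to keep are assumptions, and the paper's actual linear map may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Hypothetical toy model: vocabulary of 6 tokens, hidden size d = 2.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.2, 0.2], [-1.1, -0.4], [0.9, 1.7]]
h = [0.6, -0.2]
logprobs = log_softmax([w[0] * h[0] + w[1] * h[1] for w in W])

# Rows of [W | 1]: every log-probability vector is a combination of its columns.
basis = [w + [1.0] for w in W]

# Keep only d + 1 = 3 log-probabilities, then rebuild all 6 from them.
keep = [0, 1, 2]
coeffs = solve([basis[i] for i in keep], [logprobs[i] for i in keep])
reconstructed = [sum(b * c for b, c in zip(row, coeffs)) for row in basis]
max_error = max(abs(p - q) for p, q in zip(logprobs, reconstructed))
print(max_error)
```

In this toy setting three numbers per generation step carry the same information as the full six-entry distribution, which is why many steps of output can be fed to an inverter at modest cost.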

Title: Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

Authors: Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17121
Pdf URL: https://arxiv.org/pdf/2506.17121
Copy Paste: [[2506.17121]] Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?(https://arxiv.org/abs/2506.17121)
Keywords: language model, long context
Abstract: Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the *KV footprint* as a unified metric, which accounts for both the number of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods, *post-fill eviction*, has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to *recency eviction* methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving a 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
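
A metric that combines entry count and lifespan can be sketched as the total time KV entries spend in memory, divided by the total they would spend with no eviction at all. The abstract does not give the exact definition, so the normalization, the function name, and the step-based bookkeeping below are assumptions made for illustration.

```python
def kv_footprint(entries, total_steps):
    """Fraction of entry-steps kept in memory relative to a full cache.

    entries: list of (created_step, evicted_step) pairs, where evicted_step
    is None for entries that are never evicted.
    """
    kept = 0
    full = 0
    for created, evicted in entries:
        end = total_steps if evicted is None else min(evicted, total_steps)
        kept += max(0, end - created)   # steps this entry actually occupied
        full += total_steps - created   # steps it would occupy if never evicted
    return kept / full


# Hypothetical trace over 8 decoding steps: two entries are evicted early.
entries = [(0, None), (0, 4), (1, 3), (2, None)]
footprint = kv_footprint(entries, total_steps=8)
print(round(footprint, 4))
```

Under this definition, a method that evicts only after pre-filling still pays for every entry during pre-filling, which matches the abstract's point that such methods have a high footprint.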

Title: CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models

Authors: Naiming Liu, Richard Baraniuk, Shashank Sonkar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17180
Pdf URL: https://arxiv.org/pdf/2506.17180
Copy Paste: [[2506.17180]] CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models(https://arxiv.org/abs/2506.17180)
Keywords: language model
Abstract: We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. CLEAR-3K provides a crucial benchmark for developing and evaluating genuine causal reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
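
The Matthews Correlation Coefficient is a natural fit for this kind of yes/no judgment because it penalizes both over-skeptical and over-permissive behavior. The sketch below computes it from confusion-matrix counts using the standard formula; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical models judging 50 causal and 50 non-causal assertion-reason pairs.
perfect = mcc(tp=50, fp=0, fn=0, tn=50)
always_accepts = mcc(tp=50, fp=50, fn=0, tn=0)   # excessively permissive
permissive = mcc(tp=45, fp=30, fn=5, tn=20)       # 65% accuracy
print(perfect, always_accepts, round(permissive, 3))
```

A model that accepts every pair reaches 50% accuracy on this balanced toy set yet scores an MCC of 0, which is why the metric is informative about the skeptical-to-permissive shift described above.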

Title: Towards AI Search Paradigm

Authors: Yuchen Li, Hengyi Cai, Rui Kong, Xinran Chen, Jiamin Chen, Jun Yang, Haojie Zhang, Jiayi Li, Jiayi Wu, Yiqun Chen, Changle Qu, Keyi Kong, Wenwen Ye, Lixin Su, Xinyu Ma, Long Xia, Daiting Shi, Jiashu Zhao, Haoyi Xiong, Shuaiqiang Wang, Dawei Yin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.17188
Pdf URL: https://arxiv.org/pdf/2506.17188
Copy Paste: [[2506.17188]] Towards AI Search Paradigm(https://arxiv.org/abs/2506.17188)
Keywords: llm, retrieval-augmented generation, agent
Abstract: In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.

Title: Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency

Authors: Kathleen C. Fraser, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17209
Pdf URL: https://arxiv.org/pdf/2506.17209
Copy Paste: [[2506.17209]] Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency(https://arxiv.org/abs/2506.17209)
Keywords: language model, llm
Abstract: Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the "attack". Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental procedure, and the stochastic nature of LLMs. Our initial experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup. Our observations have serious implications for how researchers in this field should report results to enable meaningful comparisons in the future.