2025-08-19

Title: Deep Language Geometry: Constructing a Metric Space from LLM Weights

Authors: Maksym Shamrai, Vladyslav Hamolia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11676
Pdf URL: https://arxiv.org/pdf/2508.11676
Copy Paste: [[2508.11676]] Deep Language Geometry: Constructing a Metric Space from LLM Weights(https://arxiv.org/abs/2508.11676)
Keywords: language model, llm
Abstract: We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at this https URL.
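Summary: A framework that uses the internal weight activations of modern LLMs to build a metric space of languages. Instead of hand-crafted linguistic features, it derives high-dimensional language vectors from weight importance scores computed with an adapted pruning algorithm. Validated across datasets and multilingual LLMs covering 106 languages, the results agree with established language families and reveal unexpected connections that may indicate historical contact or language evolution. Source code, language vectors, and a visualization tool are publicly available.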
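As a rough illustration of what a metric space over language vectors allows, the sketch below finds a language's nearest neighbour and checks the triangle inequality. The abstract does not say which distance the authors use; Euclidean distance, the three-dimensional vectors, and the language codes here are all assumptions made for illustration.

```python
import math
from itertools import permutations

def distance(u, v):
    """Euclidean distance (an assumed choice; the paper's metric may differ)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical low-dimensional language vectors, not values from the paper.
vectors = {
    "uk": [0.90, 0.10, 0.30],
    "pl": [0.80, 0.20, 0.35],
    "es": [0.10, 0.90, 0.50],
}

def nearest(lang):
    """Return the language whose vector is closest to `lang`."""
    others = (name for name in vectors if name != lang)
    return min(others, key=lambda name: distance(vectors[lang], vectors[name]))

def satisfies_triangle_inequality():
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple."""
    return all(
        distance(vectors[a], vectors[c])
        <= distance(vectors[a], vectors[b]) + distance(vectors[b], vectors[c]) + 1e-12
        for a, b, c in permutations(vectors, 3)
    )

print(nearest("uk"), satisfies_triangle_inequality())
```

With real language vectors the same two functions would support nearest-neighbour queries and clustering over all 106 languages.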

Title: Can we Evaluate RAGs with Synthetic Data?

Authors: Jonas van Elburg, Peter van der Putten, Maarten Marx
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11758
Pdf URL: https://arxiv.org/pdf/2508.11758
Copy Paste: [[2508.11758]] Can we Evaluate RAGs with Synthetic Data?(https://arxiv.org/abs/2508.11758)
Keywords: language model, llm
Abstract: We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when such data is unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAGs that vary in retriever configuration, aligning well with human-labeled benchmark baselines. However, they fail to produce consistent RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks, and stylistic bias favoring certain generators.
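Summary: The paper asks whether LLM-generated synthetic QA data can stand in for human-labeled benchmarks when none are available. In two experiments across four datasets (two open-domain, two proprietary), synthetic benchmarks reliably rank RAG systems that differ in retriever configuration, in line with human-labeled baselines, but fail to rank generator architectures consistently. The breakdown possibly stems from task mismatch between synthetic and human benchmarks and from stylistic bias favoring certain generators.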
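One way to make "synthetic benchmarks reliably rank the RAGs" concrete is a rank correlation between the scores each benchmark assigns to the same systems. The sketch below uses Kendall's tau-a; the choice of statistic, the system names, and the scores are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by the same system names."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        # Positive product: both benchmarks order the pair the same way.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for four retriever configurations (illustration only).
human = {"rag_k2": 0.61, "rag_k5": 0.70, "rag_k10": 0.74, "rag_k20": 0.72}
synthetic = {"rag_k2": 0.55, "rag_k5": 0.66, "rag_k10": 0.71, "rag_k20": 0.69}

tau = kendall_tau(human, synthetic)
print(tau)  # 1.0: both benchmarks order the four systems identically
```

A tau near 1 means the synthetic benchmark would lead to the same system choice as the human-labeled one; a tau near 0 corresponds to the inconsistent rankings reported for generator comparisons.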

Title: Limitation Learning: Catching Adverse Dialog with GAIL

Authors: Noah Kasmanoff, Rahul Zalkikar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11767
Pdf URL: https://arxiv.org/pdf/2508.11767
Copy Paste: [[2508.11767]] Limitation Learning: Catching Adverse Dialog with GAIL(https://arxiv.org/abs/2508.11767)
Keywords: prompt
Abstract: Imitation learning is a proven method for creating a policy in the absence of rewards, by leveraging expert demonstrations. In this work, we apply imitation learning to conversation. In doing so, we recover a policy capable of talking to a user given a prompt (input state), and a discriminator capable of distinguishing between expert and synthetic conversation. While our policy is effective, we recover results from our discriminator that indicate the limitations of dialog models. We argue that this technique can be used to identify adverse behavior of arbitrary data models common in dialog-oriented tasks.
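Summary: Imitation learning builds a policy from expert demonstrations when no reward is available. Applied to conversation, it yields a policy that can talk to a user given a prompt (input state) and a discriminator that separates expert from synthetic conversation. The policy is effective, and the discriminator's results expose limitations of dialog models; the authors argue the technique can identify adverse behavior of arbitrary models used for dialog-oriented tasks.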

Title: Investigating Transcription Normalization in the Faetar ASR Benchmark

Authors: Leo Peckham, Michael Ong, Naomi Nagy, Ewan Dunbar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11771
Pdf URL: https://arxiv.org/pdf/2508.11771
Copy Paste: [[2508.11771]] Investigating Transcription Normalization in the Faetar ASR Benchmark(https://arxiv.org/abs/2508.11771)
Keywords: language model
Abstract: We examine the role of transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. With the help of a small, hand-constructed lexicon, we find that, while inconsistencies do exist in the transcriptions, they are not the main challenge in the task. We also demonstrate that bigram word-based language modelling is of no added benefit, but that constraining decoding to a finite lexicon can be beneficial. The task remains extremely difficult.
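Summary: The paper examines transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. Using a small hand-constructed lexicon, it finds that inconsistencies exist in the transcriptions but are not the main challenge. Bigram word-based language modelling adds no benefit, whereas constraining decoding to a finite lexicon can help. The task remains extremely difficult.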

Title: A Multi-Task Evaluation of LLMs' Processing of Academic Text Input

Authors: Tianyi Li, Yu Qin, Olivia R. Liu Sheng
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2508.11779
Pdf URL: https://arxiv.org/pdf/2508.11779
Copy Paste: [[2508.11779]] A Multi-Task Evaluation of LLMs' Processing of Academic Text Input(https://arxiv.org/abs/2508.11779)
Keywords: language model, llm, prompt
Abstract: How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is in heated debate. Between a literature digest and a human-comparable research assistant lies their practical application potential. We organize individual tasks that computer science studies employ in separate terms into a guided and robust workflow to evaluate LLMs' processing of academic text input. We employ four tasks in the assessment: content reproduction/comparison/scoring/reflection, each demanding a specific role of the LLM (oracle/judgmental arbiter/knowledgeable arbiter/collaborator) in assisting scholarly works, and altogether testing LLMs with questions that increasingly require intellectual capabilities towards a solid understanding of scientific texts to yield desirable solutions. We exemplify a rigorous performance evaluation with detailed instructions on the prompts. Adopting first-rate Information Systems articles at three top journals as the input texts and an abundant set of text metrics, we record a compromised performance of the leading LLM - Google's Gemini: its summary and paraphrase of academic text is acceptably reliable; using it to rank texts through pairwise text comparison is faintly scalable; asking it to grade academic texts is prone to poor discrimination; its qualitative reflection on the text is self-consistent yet hardly insightful to inspire meaningful research. This evidence against an endorsement of LLMs' text-processing capabilities is consistent across metric-based internal (linguistic assessment), external (comparing to the ground truth), and human evaluation, and is robust to the variations of the prompt. Overall, we do not recommend an unchecked use of LLMs in constructing peer reviews.
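Summary: How much LLMs can aid scientific discovery, notably academic peer review, is under heated debate. The authors organize four tasks (content reproduction, comparison, scoring, and reflection) into a guided workflow, each casting the LLM in a specific role (oracle, judgmental arbiter, knowledgeable arbiter, collaborator). Using Information Systems articles from three top journals and a broad set of text metrics, they find that Google's Gemini summarizes and paraphrases academic text with acceptable reliability, scales only faintly when ranking texts by pairwise comparison, discriminates poorly when grading, and reflects on texts in a self-consistent but hardly insightful way. The evidence is consistent across internal, external, and human evaluation and robust to prompt variations; the authors do not recommend unchecked use of LLMs in constructing peer reviews.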

Title: LLM-Guided Planning and Summary-Based Scientific Text Simplification: DS@GT at CLEF 2025 SimpleText

Authors: Krishna Chaitanya Marturi, Heba H. Elwazzan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11816
Pdf URL: https://arxiv.org/pdf/2508.11816
Copy Paste: [[2508.11816]] LLM-Guided Planning and Summary-Based Scientific Text Simplification: DS@GT at CLEF 2025 SimpleText(https://arxiv.org/abs/2508.11816)
Keywords: language model, llm
Abstract: In this paper, we present our approach for the CLEF 2025 SimpleText Task 1, which addresses both sentence-level and document-level scientific text simplification. For sentence-level simplification, our methodology employs large language models (LLMs) to first generate a structured plan, followed by plan-driven simplification of individual sentences. At the document level, we leverage LLMs to produce concise summaries and subsequently guide the simplification process using these summaries. This two-stage, LLM-based framework enables more coherent and contextually faithful simplifications of scientific text.
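Summary: The authors' approach to CLEF 2025 SimpleText Task 1 covers sentence-level and document-level scientific text simplification. At the sentence level, an LLM first generates a structured plan and then simplifies individual sentences according to it. At the document level, LLM-produced summaries guide the simplification. The two-stage framework yields more coherent and contextually faithful simplifications.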

Title: Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText

Authors: Krishna Chaitanya Marturi, Heba H. Elwazzan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11823
Pdf URL: https://arxiv.org/pdf/2508.11823
Copy Paste: [[2508.11823]] Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText(https://arxiv.org/abs/2508.11823)
Keywords: language model, llm, hallucination
Abstract: In this paper, we describe our methodology for the CLEF 2025 SimpleText Task 2, which focuses on detecting and evaluating creative generation and information distortion in scientific text simplification. Our solution integrates multiple strategies: we construct an ensemble framework that leverages a BERT-based classifier, a semantic similarity measure, a natural language inference model, and large language model (LLM) reasoning. These diverse signals are combined using meta-classifiers to enhance the robustness of spurious-content and distortion detection. Additionally, for grounded generation, we employ an LLM-based post-editing system that revises simplifications based on the original input texts.
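Summary: The authors' methodology for CLEF 2025 SimpleText Task 2 targets the detection and evaluation of creative generation and information distortion in scientific text simplification. An ensemble combines a BERT-based classifier, a semantic similarity measure, a natural language inference model, and LLM reasoning through meta-classifiers to make spurious-content and distortion detection more robust. For grounded generation, an LLM-based post-editing system revises simplifications against the original input texts.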

Title: Every 28 Days the AI Dreams of Soft Skin and Burning Stars: Scaffolding AI Agents with Hormones and Emotions

Authors: Leigh Levinson, Christopher J. Agostino
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2508.11829
Pdf URL: https://arxiv.org/pdf/2508.11829
Copy Paste: [[2508.11829]] Every 28 Days the AI Dreams of Soft Skin and Burning Stars: Scaffolding AI Agents with Hormones and Emotions(https://arxiv.org/abs/2508.11829)
Keywords: language model, prompt, agent
Abstract: Despite significant advances, AI systems struggle with the frame problem: determining what information is contextually relevant from an exponentially large possibility space. We hypothesize that biological rhythms, particularly hormonal cycles, serve as natural relevance filters that could address this fundamental challenge. We develop a framework that embeds simulated menstrual and circadian cycles into Large Language Models through system prompts generated from periodic functions modeling key hormones including estrogen, testosterone, and cortisol. Across multiple state-of-the-art models, linguistic analysis reveals emotional and stylistic variations that track biological phases: sadness peaks during menstruation while happiness dominates ovulation, and circadian patterns show morning optimism transitioning to nocturnal introspection. Benchmarking on SQuAD, MMLU, Hellaswag, and AI2-ARC demonstrates subtle but consistent performance variations aligning with biological expectations, including optimal function in moderate rather than extreme hormonal ranges. This methodology provides a novel approach to contextual AI while revealing how societal biases regarding gender and biology are embedded within language models.
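Summary: AI systems still struggle with the frame problem: determining which information is contextually relevant in an exponentially large possibility space. The authors hypothesize that biological rhythms, particularly hormonal cycles, act as natural relevance filters, and embed simulated menstrual and circadian cycles into LLMs through system prompts generated from periodic functions modeling estrogen, testosterone, and cortisol. Linguistic analysis shows emotional and stylistic variation that tracks biological phases, and benchmarks on SQuAD, MMLU, Hellaswag, and AI2-ARC show subtle but consistent performance variations, with optimal function in moderate rather than extreme hormonal ranges. The method also reveals how societal biases about gender and biology are embedded in language models.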

Title: When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection

Authors: Julia Sammartino, Libby Barak, Jing Peng, Anna Feldman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11831
Pdf URL: https://arxiv.org/pdf/2508.11831
Copy Paste: [[2508.11831]] When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection(https://arxiv.org/abs/2508.11831)
Keywords: language model
Abstract: Euphemisms are culturally variable and often ambiguous, posing challenges for language models, especially in low-resource settings. This paper investigates how cross-lingual transfer via sequential fine-tuning affects euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yoruba. We compare sequential fine-tuning with monolingual and simultaneous fine-tuning using XLM-R and mBERT, analyzing how performance is shaped by language pairings, typological features, and pretraining coverage. Results show that sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages like Yoruba and Turkish. XLM-R achieves larger gains but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT yields more stable, though lower, results. These findings highlight sequential fine-tuning as a simple yet effective strategy for improving euphemism detection in multilingual models, particularly when low-resource languages are involved.
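Summary: Euphemisms are culturally variable and often ambiguous, which challenges language models, especially in low-resource settings. The paper studies how cross-lingual transfer via sequential fine-tuning affects euphemism detection in English, Spanish, Chinese, Turkish, and Yoruba, comparing it with monolingual and simultaneous fine-tuning on XLM-R and mBERT. Sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages such as Yoruba and Turkish. XLM-R gains more but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT is more stable though lower.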

Title: SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance

Authors: Andrei-Valentin Tănase, Elena Pelican
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11857
Pdf URL: https://arxiv.org/pdf/2508.11857
Copy Paste: [[2508.11857]] SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance(https://arxiv.org/abs/2508.11857)
Keywords: language model, gpt
Abstract: Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
Summary: SupraTok is a tokenization architecture that extends Byte-Pair Encoding with cross-boundary pattern learning, entropy-driven data curation, and multi-phase curriculum learning, producing "superword" tokens for coherent multi-word expressions. It improves English tokenization efficiency by 31% over OpenAI's o200k tokenizer (5.91 versus 4.51 characters per token) and by 30% over Google's Gemma 3 tokenizer (256k vocabulary), while staying competitive across 38 languages. With a GPT-2 scale model (124M parameters) trained on 10 billion FineWeb-Edu tokens, it yields gains of 8.4% on HellaSWAG and 9.5% on MMLU without architectural changes. Validation at larger model scales is still needed.
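The efficiency figure in this abstract is a ratio of characters per token, and the quoted 31% follows directly from the two numbers given. The sketch below reproduces that arithmetic and shows how a multi-word token raises the ratio; the sample sentence and both segmentations are hypothetical and do not come from SupraTok or the o200k tokenizer.

```python
def chars_per_token(text, tokens):
    """Average number of characters covered by each token."""
    return len(text) / len(tokens)

def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline

# Figures quoted in the abstract: 5.91 vs 4.51 characters per token.
gain = relative_improvement(5.91, 4.51)
print(f"{gain:.1%}")  # 31.0%

# Hypothetical segmentations of one sentence (illustration only).
text = "machine learning is fun"
subword = ["machine", " learning", " is", " fun"]
superword = ["machine learning", " is", " fun"]  # one cross-boundary token

print(chars_per_token(text, subword), chars_per_token(text, superword))
```

Merging a frequent two-word expression into one token removes a token without removing any characters, which is why the characters-per-token ratio goes up.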

Title: In-Context Examples Matter: Improving Emotion Recognition in Conversation with Instruction Tuning

Authors: Hui Ma, Bo Zhang, Jinpeng Hu, Zenglin Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11889
Pdf URL: https://arxiv.org/pdf/2508.11889
Copy Paste: [[2508.11889]] In-Context Examples Matter: Improving Emotion Recognition in Conversation with Instruction Tuning(https://arxiv.org/abs/2508.11889)
Keywords: language model, llm, prompt
Abstract: Emotion recognition in conversation (ERC) aims to identify the emotion of each utterance in a conversation, playing a vital role in empathetic artificial intelligence. With the growth of large language models (LLMs), instruction tuning has emerged as a critical paradigm for ERC. Existing studies mainly focus on multi-stage instruction tuning, which first endows LLMs with speaker characteristics, and then conducts context-aware instruction tuning to comprehend emotional states. However, these methods inherently constrain the capacity to jointly capture the dynamic interaction between speaker characteristics and conversational context, resulting in weak alignment among speaker identity, contextual cues, and emotion states within a unified framework. In this paper, we propose InitERC, a simple yet effective one-stage in-context instruction tuning framework for ERC. InitERC adapts LLMs to learn speaker-context-emotion alignment from context examples via in-context instruction tuning. Specifically, InitERC comprises four components, i.e., demonstration pool construction, in-context example selection, prompt template design, and in-context instruction tuning. To explore the impact of in-context examples, we conduct a comprehensive study on three key factors: retrieval strategy, example ordering, and the number of examples. Extensive experiments on three widely used datasets demonstrate that our proposed InitERC achieves substantial improvements over the state-of-the-art baselines.
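Summary: Emotion recognition in conversation (ERC) identifies the emotion of each utterance in a conversation. Existing multi-stage instruction tuning first teaches LLMs speaker characteristics and then context-aware emotion understanding, which limits joint modeling of speaker and context. InitERC is a one-stage in-context instruction tuning framework that learns speaker-context-emotion alignment from in-context examples; it comprises demonstration pool construction, in-context example selection, prompt template design, and in-context instruction tuning. The study examines retrieval strategy, example ordering, and number of examples, and experiments on three widely used datasets show substantial improvements over state-of-the-art baselines.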

Title: CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Authors: Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.11915
Pdf URL: https://arxiv.org/pdf/2508.11915
Copy Paste: [[2508.11915]] CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures(https://arxiv.org/abs/2508.11915)
Keywords: language model, llm, agent
Abstract: Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens on dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf's and Heaps' Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heaps exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at this https URL.
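Summary: Game-theoretic interactions between LLM agents have revealed many emergent capabilities, but their linguistic diversity has not been sufficiently quantified. The Conversational Robustness Evaluation Score (CORE) combines cluster entropy, lexical repetition, and semantic similarity to measure the quality of language use in multi-agent systems. Applied to pairwise LLM dialogs in competitive, cooperative, and neutral settings, and grounded in Zipf's and Heaps' laws, it shows that cooperative settings have steeper Zipf distributions and higher Heaps exponents, while competitive settings have lower Zipf and Heaps exponents. The code is publicly available.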
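The Zipf and Heaps exponents mentioned in this abstract can be estimated from a token sequence with a log-log least-squares fit: Zipf's law models frequency as proportional to rank^(-s), and Heaps' law models vocabulary size as proportional to n^beta. The sketch below covers only these two exponents and not the CORE score itself, whose formula the abstract does not give; the fitting procedure and the sample dialog are assumptions, and the paper may estimate the exponents differently.

```python
import math
from collections import Counter

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def zipf_exponent(tokens):
    """Estimate s in frequency ~ rank^(-s); needs at least two word types."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    log_rank = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    log_freq = [math.log(freq) for freq in freqs]
    return -fit_slope(log_rank, log_freq)

def heaps_exponent(tokens):
    """Estimate beta in vocabulary ~ n^beta; needs at least two tokens."""
    seen = set()
    log_n, log_vocab = [], []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        log_n.append(math.log(n))
        log_vocab.append(math.log(len(seen)))
    return fit_slope(log_n, log_vocab)

# Hypothetical dialog turn, for illustration only.
dialog = "we agree to share the prize and we agree to share the work".split()
print(zipf_exponent(dialog), heaps_exponent(dialog))
```

A larger Zipf exponent means a few words dominate the dialog (more repetition), while a larger Heaps exponent means new words keep appearing as the dialog grows.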

Title: LLMs Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese

Authors: Jie Lu, Du Jin, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11927
Pdf URL: https://arxiv.org/pdf/2508.11927
Copy Paste: [[2508.11927]] LLMs Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese(https://arxiv.org/abs/2508.11927)
Keywords: llm
Abstract: Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at this https URL.
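Summary: English marks the perfect aspect across tenses with distinct forms (e.g., had, has, will have), whereas Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). The authors build a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Even advanced LLMs struggle with temporal inference, particularly with subtle tense and reference-time shifts. The dataset is publicly available.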

Title: CAMF: Collaborative Adversarial Multi-agent Framework for Machine Generated Text Detection

Authors: Yue Wang, Liesheng Wei, Yuxiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11933
Pdf URL: https://arxiv.org/pdf/2508.11933
Copy Paste: [[2508.11933]] CAMF: Collaborative Adversarial Multi-agent Framework for Machine Generated Text Detection(https://arxiv.org/abs/2508.11933)
Keywords: language model, llm, agent
Abstract: Detecting machine-generated text (MGT) from contemporary Large Language Models (LLMs) is increasingly crucial amid risks like disinformation and threats to academic integrity. Existing zero-shot detection paradigms, despite their practicality, often exhibit significant deficiencies. Key challenges include: (1) superficial analyses focused on limited textual attributes, and (2) a lack of investigation into consistency across linguistic dimensions such as style, semantics, and logic. To address these challenges, we introduce the Collaborative Adversarial Multi-agent Framework (CAMF), a novel architecture using multiple LLM-based agents. CAMF employs specialized agents in a synergistic three-phase process: Multi-dimensional Linguistic Feature Extraction, Adversarial Consistency Probing, and Synthesized Judgment Aggregation. This structured collaborative-adversarial process enables a deep analysis of subtle, cross-dimensional textual incongruities indicative of non-human origin. Empirical evaluations demonstrate CAMF's significant superiority over state-of-the-art zero-shot MGT detection techniques.
Summary: Detecting machine-generated text (MGT) from contemporary LLMs is increasingly important given risks such as disinformation and threats to academic integrity. Existing zero-shot detectors rely on superficial analysis of limited textual attributes and do not examine consistency across linguistic dimensions such as style, semantics, and logic. The Collaborative Adversarial Multi-agent Framework (CAMF) uses multiple LLM-based agents in three phases: Multi-dimensional Linguistic Feature Extraction, Adversarial Consistency Probing, and Synthesized Judgment Aggregation. Empirical evaluations show that CAMF significantly outperforms state-of-the-art zero-shot MGT detection techniques.

Title: Learning Wisdom from Errors: Promoting LLM's Continual Relation Learning through Exploiting Error Cases

Authors: Shaozhe Yin, Jinyu Guo, Kai Shuang, Xia Liu, Ruize Ou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12031
Pdf URL: https://arxiv.org/pdf/2508.12031
Copy Paste: [[2508.12031]] Learning Wisdom from Errors: Promoting LLM's Continual Relation Learning through Exploiting Error Cases(https://arxiv.org/abs/2508.12031)
Keywords: language model, llm
Abstract: Continual Relation Extraction (CRE) aims to continually learn new emerging relations while avoiding catastrophic forgetting. Existing CRE methods mainly use memory replay and contrastive learning to mitigate catastrophic forgetting. However, these methods do not attach importance to the error cases that can reveal the model's cognitive biases more effectively. To address this issue, we propose an instruction-based continual contrastive tuning approach for Large Language Models (LLMs) in CRE. Different from existing CRE methods that typically handle the training and memory data in a unified manner, this approach splits the training and memory data of each task into two parts respectively based on the correctness of the initial responses and treats them differently through dual-task fine-tuning. In addition, leveraging the advantages of LLM's instruction-following ability, we propose a novel instruction-based contrastive tuning strategy for LLM to continuously correct current cognitive biases with the guidance of previous data in an instruction-tuning manner, which mitigates the gap between old and new relations in a more suitable way for LLMs. We experimentally evaluate our model on TACRED and FewRel, and the results show that our model achieves new state-of-the-art CRE performance with significant improvements, demonstrating the importance of specializing in exploiting error cases.
Summary: Continual Relation Extraction (CRE) aims to keep learning newly emerging relations while avoiding catastrophic forgetting. Existing methods rely on memory replay and contrastive learning but pay little attention to error cases, which reveal a model's cognitive biases more effectively. The proposed instruction-based continual contrastive tuning approach for LLMs splits each task's training and memory data into two parts by the correctness of the initial responses and treats them differently through dual-task fine-tuning. Experiments on TACRED and FewRel show new state-of-the-art CRE performance.

Title: Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation

Authors: Jinyi Han, Tingyun Li, Shisong Chen, Jie Shi, Xinyi Wang, Guanglei Yue, Jiaqing Liang, Xin Lin, Liqian Wen, Zulong Chen, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12040
Pdf URL: https://arxiv.org/pdf/2508.12040
Copy Paste: [[2508.12040]] Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation(https://arxiv.org/abs/2508.12040)
Keywords: language model, llm
Abstract: While large language models (LLMs) have demonstrated remarkable performance across diverse tasks, they fundamentally lack self-awareness and frequently exhibit overconfidence, assigning high confidence scores to incorrect predictions. Accurate confidence estimation is therefore critical for enhancing the trustworthiness and reliability of LLM-generated outputs. However, existing approaches suffer from coarse-grained scoring mechanisms that fail to provide fine-grained, continuous confidence estimates throughout the generation process. To address these limitations, we introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Specifically, we first develop a comprehensive pipeline for constructing training data that effectively captures the underlying probabilistic distribution of LLM responses, and then train a model to predict confidence scores for arbitrary text sequences in a supervised manner. Furthermore, we propose a Backward Confidence Integration (BCI) strategy that leverages information from the subsequent text to enhance confidence estimation for the current sequence during inference. We also introduce three strategies for identifying optimal positions to perform confidence estimation within the generation process. Extensive experiments on multiple benchmark datasets demonstrate that FineCE consistently outperforms existing classical confidence estimation methods. Our code and all baselines used in the paper are available on GitHub.
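Summary: LLMs are often overconfident, assigning high confidence scores to incorrect predictions, and existing confidence estimation is too coarse-grained to give continuous estimates during generation. FineCE trains a model, in a supervised manner, to predict confidence scores for arbitrary text sequences, using a pipeline that constructs training data capturing the probabilistic distribution of LLM responses. A Backward Confidence Integration (BCI) strategy uses subsequent text to refine the estimate for the current sequence at inference, and three strategies identify where in the generation process to estimate confidence. Experiments on multiple benchmark datasets show that FineCE consistently outperforms existing classical methods; code and baselines are available on GitHub.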

Title: J6: Jacobian-Driven Role Attribution for Multi-Objective Prompt Optimization in LLMs

Authors: Yao Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12086
Pdf URL: https://arxiv.org/pdf/2508.12086
Copy Paste: [[2508.12086]] J6: Jacobian-Driven Role Attribution for Multi-Objective Prompt Optimization in LLMs(https://arxiv.org/abs/2508.12086)
Keywords: language model, llm, prompt
Abstract: In large language model (LLM) adaptation, balancing multiple optimization objectives such as improving factuality (heat) and increasing confidence (via low entropy) poses a fundamental challenge, especially when prompt parameters (e.g., hidden-layer insertions h and embedding modifications w) interact in non-trivial ways. Existing multi-objective optimization strategies often rely on scalar gradient aggregation, ignoring the deeper geometric structure between objectives and parameters. We propose J6, a structured Jacobian-based method that decomposes the gradient interaction matrix into six interpretable components. This decomposition enables both hard decision-making (e.g., choosing the dominant update direction via argmax) and soft strategies (e.g., attention-style weighting via softmax over J6), forming a dynamic update framework that adapts to local conflict and synergy. Moreover, the interpretable structure of J6 provides insight into parameter attribution, task interference, and geometry-aligned adaptation. Our work introduces a principled and extensible mechanism for conflict-aware prompt optimization, and opens a new avenue for incorporating structured Jacobian reasoning into multi-objective neural tuning.
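Summary: In LLM adaptation, balancing objectives such as improving factuality (heat) and increasing confidence (via low entropy) is difficult, especially when prompt parameters (e.g., hidden-layer insertions h and embedding modifications w) interact in non-trivial ways. Existing multi-objective strategies often rely on scalar gradient aggregation and ignore the geometric structure between objectives and parameters. J6 is a structured Jacobian-based method that decomposes the gradient interaction matrix into six interpretable components, supporting hard decisions (e.g., choosing the dominant update direction via argmax) and soft strategies (e.g., attention-style weighting via softmax over J6). The interpretable structure also gives insight into parameter attribution, task interference, and geometry-aligned adaptation.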
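The hard and soft strategies named in this abstract differ only in how six component scores are turned into a decision. The sketch below contrasts argmax selection with softmax weighting. The abstract does not define the six components or how they are computed, so the score values here are placeholders and the code is not the J6 method itself.

```python
import math

def hard_choice(scores):
    """Index of the dominant component (argmax)."""
    return max(range(len(scores)), key=scores.__getitem__)

def softmax(scores):
    """Numerically stable softmax: positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Six placeholder component scores (hypothetical, not from the paper).
j6_scores = [0.2, 1.5, -0.3, 0.9, 0.0, 0.4]

dominant = hard_choice(j6_scores)  # hard strategy: follow one component
weights = softmax(j6_scores)       # soft strategy: blend all six

print(dominant)  # 1
print([round(weight, 3) for weight in weights])
```

The hard strategy commits the whole update to a single component, while the soft strategy keeps every component in play in proportion to its score, which is smoother when two scores are close.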

Title: STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

Authors: Haiquan Hu, Jiazhi Jiang, Shiyou Xu, Ruhan Zeng, Tian Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12096
Pdf URL: https://arxiv.org/pdf/2508.12096
Copy Paste: [[2508.12096]] STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples(https://arxiv.org/abs/2508.12096)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the Structured Transition Evaluation Method (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies significant transition samples (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. The Qwen3 model family is used to construct the STS pool on six diverse and representative benchmarks. In experiments designed to assess generalizability, STEM reliably captures performance trends and aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.

Title: LLM-as-a-Judge for Privacy Evaluation? Exploring the Alignment of Human and LLM Perceptions of Privacy in Textual Data

Authors: Stephen Meisenbacher, Alexandra Klymenko, Florian Matthes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12158
Pdf URL: https://arxiv.org/pdf/2508.12158
Copy Paste: [[2508.12158]] LLM-as-a-Judge for Privacy Evaluation? Exploring the Alignment of Human and LLM Perceptions of Privacy in Textual Data(https://arxiv.org/abs/2508.12158)
Keywords: llm
Abstract: Despite advances in the field of privacy-preserving Natural Language Processing (NLP), a significant challenge remains the accurate evaluation of privacy. As a potential solution, using LLMs as privacy evaluators presents a promising approach, a strategy inspired by their success in other subfields of NLP. In particular, the so-called LLM-as-a-Judge paradigm has achieved impressive results on a variety of natural language evaluation tasks, demonstrating high agreement rates with human annotators. Recognizing that privacy is both subjective and difficult to define, we investigate whether LLM-as-a-Judge can also be leveraged to evaluate the privacy sensitivity of textual data. Furthermore, we measure how closely LLM evaluations align with human perceptions of privacy in text. In a study involving 10 datasets, 13 LLMs, and 677 human survey participants, we confirm that privacy is indeed a difficult concept to measure empirically, as shown by generally low inter-human agreement rates. Nevertheless, we find that LLMs can accurately model a global human privacy perspective, and through an analysis of human and LLM reasoning patterns, we discuss the merits and limitations of LLM-as-a-Judge for privacy evaluation in textual data. Our findings pave the way for exploring the feasibility of LLMs as privacy evaluators, addressing a core challenge in solving pressing privacy issues with innovative technical solutions.

Title: Structuring the Unstructured: A Systematic Review of Text-to-Structure Generation for Agentic AI with a Universal Evaluation Framework

Authors: Zheye Deng, Chunkit Chan, Tianshi Zheng, Wei Fan, Weiqi Wang, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12257
Pdf URL: https://arxiv.org/pdf/2508.12257
Copy Paste: [[2508.12257]] Structuring the Unstructured: A Systematic Review of Text-to-Structure Generation for Agentic AI with a Universal Evaluation Framework(https://arxiv.org/abs/2508.12257)
Keywords: agent
Abstract: The evolution of AI systems toward agentic operation and context-aware retrieval necessitates transforming unstructured text into structured formats like tables, knowledge graphs, and charts. While such conversions enable critical applications from summarization to data mining, current research lacks a comprehensive synthesis of methodologies, datasets, and metrics. This systematic review examines text-to-structure techniques and the encountered challenges, evaluates current datasets and assessment criteria, and outlines potential directions for future research. We also introduce a universal evaluation framework for structured outputs, establishing text-to-structure as foundational infrastructure for next-generation AI systems.

Title: Fast, Slow, and Tool-augmented Thinking for LLMs: A Review

Authors: Xinda Jia, Jinpeng Li, Zezhong Wang, Jingjing Li, Xingshan Zeng, Yasheng Wang, Weinan Zhang, Yong Yu, Weiwen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12265
Pdf URL: https://arxiv.org/pdf/2508.12265
Copy Paste: [[2508.12265]] Fast, Slow, and Tool-augmented Thinking for LLMs: A Review(https://arxiv.org/abs/2508.12265)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in reasoning across diverse domains. However, effective reasoning in real-world tasks requires adapting the reasoning strategy to the demands of the problem, ranging from fast, intuitive responses to deliberate, step-by-step reasoning and tool-augmented thinking. Drawing inspiration from cognitive psychology, we propose a novel taxonomy of LLM reasoning strategies along two knowledge boundaries: a fast/slow boundary separating intuitive from deliberative processes, and an internal/external boundary distinguishing reasoning grounded in the model's parameters from reasoning augmented by external tools. We systematically survey recent work on adaptive reasoning in LLMs and categorize methods based on key decision factors. We conclude by highlighting open challenges and future directions toward more adaptive, efficient, and reliable LLMs.

Title: The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution

Authors: Elon Ezra, Ariel Weizman, Amos Azaria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12277
Pdf URL: https://arxiv.org/pdf/2508.12277
Copy Paste: [[2508.12277]] The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution(https://arxiv.org/abs/2508.12277)
Keywords: language model, llm
Abstract: Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.

Title: Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain

Authors: Xin Dai, Buqiang Xu, Zhenghao Liu, Yukun Yan, Huiyuan Xie, Xiaoyuan Yi, Shuo Wang, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12281
Pdf URL: https://arxiv.org/pdf/2508.12281
Copy Paste: [[2508.12281]] Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain(https://arxiv.org/abs/2508.12281)
Keywords: language model, llm, chain-of-thought
Abstract: Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal$\Delta$, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal$\Delta$ employs a dual-mode input setup, comprising direct-answer and reasoning-augmented modes, and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. Legal$\Delta$ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal$\Delta$ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at this https URL.

Title: A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation

Authors: Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang, Baotian Hu, Dawei Yin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.12282
Pdf URL: https://arxiv.org/pdf/2508.12282
Copy Paste: [[2508.12282]] A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation(https://arxiv.org/abs/2508.12282)
Keywords: llm, retrieval-augmented generation
Abstract: We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.

Title: Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering

Authors: Eviatar Nachshoni, Arie Cattan, Shmuel Amar, Ori Shapira, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12355
Pdf URL: https://arxiv.org/pdf/2508.12355
Copy Paste: [[2508.12355]] Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering(https://arxiv.org/abs/2508.12355)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated strong performance in question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across pieces of evidence, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers, but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA, a new benchmark for realistic, conflict-aware MAQA, enriched with detailed conflict labels for all answer pairs. We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts and the flawed strategies they employ to resolve them.

Title: ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models

Authors: Yuanfeng Xu, Zehui Dai, Jian Liang, Jiapeng Guan, Guangrun Wang, Liang Lin, Xiaohui Lv
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12387
Pdf URL: https://arxiv.org/pdf/2508.12387
Copy Paste: [[2508.12387]] ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models(https://arxiv.org/abs/2508.12387)
Keywords: language model, llm, chain-of-thought
Abstract: Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs), but often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers during multi-step reasoning. Existing efforts have improved SLM performance, but typically at the cost of one or more of three key aspects: (1) reasoning capability, due to biased supervision that filters out negative reasoning paths and limits learning from errors; (2) autonomy, due to over-reliance on externally generated reasoning signals; and (3) generalization, which suffers when models overfit to teacher-specific patterns. In this paper, we introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains. To enhance reasoning capability, we propose Multi-Route Process Verification (MRPV), which contrasts both positive and negative reasoning paths to extract decisive patterns. To reduce reliance on external guidance and improve autonomy, we introduce Enabling Autonomy via Asymptotic Induction (EAAI), a training strategy that gradually fades external signals. To improve generalization, we apply guided chain-of-thought distillation to encode domain-specific rules and expert knowledge into SLM parameters, making them part of what the model has learned. Extensive experiments on both vertical and general reasoning tasks demonstrate that ReaLM significantly improves SLM performance across aspects (1)-(3) above.

Title: MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph

Authors: Duzhen Zhang, Zixiao Wang, Zhong-Zhi Li, Yahan Yu, Shuncheng Jia, Jiahua Dong, Haotian Xu, Xing Wu, Yingying Zhang, Tielin Zhang, Jie Yang, Xiuying Chen, Le Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12393
Pdf URL: https://arxiv.org/pdf/2508.12393
Copy Paste: [[2508.12393]] MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph(https://arxiv.org/abs/2508.12393)
Keywords: language model, llm, agent
Abstract: The rapid expansion of medical literature presents growing challenges for structuring and integrating domain knowledge at scale. Knowledge Graphs (KGs) offer a promising solution by enabling efficient retrieval, automated reasoning, and knowledge discovery. However, current KG construction methods often rely on supervised pipelines with limited generalizability or naively aggregate outputs from Large Language Models (LLMs), treating biomedical corpora as static and ignoring the temporal dynamics and contextual uncertainty of evolving knowledge. To address these limitations, we introduce MedKGent, an LLM agent framework for constructing temporally evolving medical KGs. Leveraging over 10 million PubMed abstracts published between 1975 and 2023, we simulate the emergence of biomedical knowledge via a fine-grained daily time series. MedKGent incrementally builds the KG in a day-by-day manner using two specialized agents powered by the Qwen2.5-32B-Instruct model. The Extractor Agent identifies knowledge triples and assigns confidence scores via sampling-based estimation, which are used to filter low-confidence extractions and inform downstream processing. The Constructor Agent incrementally integrates the retained triples into a temporally evolving graph, guided by confidence scores and timestamps to reinforce recurring knowledge and resolve conflicts. The resulting KG contains 156,275 entities and 2,971,384 relational triples. Quality assessments by two SOTA LLMs and three domain experts demonstrate an accuracy approaching 90%, with strong inter-rater agreement. To evaluate downstream utility, we conduct RAG across seven medical question answering benchmarks using five leading LLMs, consistently observing significant improvements over non-augmented baselines. Case studies further demonstrate the KG's value in literature-based drug repurposing via confidence-aware causal inference.
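As a rough illustration of sampling-based confidence estimation followed by filtering, the sketch below scores each triple by the fraction of sampled extraction runs that produced it. This is one plausible reading, not the paper's procedure: the function names, the 0.5 threshold, and the toy triples are all assumptions made for the example.

```python
from collections import Counter

def estimate_confidence(sampled_extractions):
    """Confidence of a triple = fraction of sampled runs that produced it."""
    n = len(sampled_extractions)
    # Count each triple at most once per run.
    counts = Counter(t for run in sampled_extractions for t in set(run))
    return {t: c / n for t, c in counts.items()}

def filter_triples(confidences, threshold=0.5):
    """Keep only triples whose confidence meets the (assumed) threshold."""
    return {t: c for t, c in confidences.items() if c >= threshold}

# Toy data: four hypothetical sampled extractions from the same abstract.
runs = [
    [("aspirin", "treats", "headache"), ("aspirin", "causes", "nausea")],
    [("aspirin", "treats", "headache")],
    [("aspirin", "treats", "headache"), ("aspirin", "treats", "fever")],
    [("aspirin", "treats", "headache"), ("aspirin", "causes", "nausea")],
]

conf = estimate_confidence(runs)
kept = filter_triples(conf, threshold=0.5)
```

In this toy setup, a triple seen in every run gets confidence 1.0, while one seen in a single run out of four falls below the threshold and is dropped.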

Title: ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads

Authors: Zhuorui Liu, Chen Zhang, Dawei Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12407
Pdf URL: https://arxiv.org/pdf/2508.12407
Copy Paste: [[2508.12407]] ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads(https://arxiv.org/abs/2508.12407)
Keywords: language model, llm, long context
Abstract: With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities in LLMs. Such long-context ability is accompanied by difficulties in deployment, especially due to the increased consumption of KV cache. Some prior work aims to optimize the memory footprint of the KV cache, inspired by the observation that attention heads can be categorized into retrieval heads, which are of great significance, and streaming heads, which are of less significance. Typically, identifying the streaming heads and waiving the KV cache in those heads largely reduces the overhead without hurting performance much. However, since employing both retrieval and streaming heads in one layer decomposes one large round of attention computation into two small ones, it may unexpectedly bring extra latency on accessing and indexing tensors. Based on this intuition, we make an important improvement to the identification process of retrieval and streaming heads, in which we design a criterion that forces each layer to contain exclusively retrieval heads or exclusively streaming heads. In this way, we further eliminate the extra latency and only incur negligible performance degradation. Our method, named ZigzagAttention, is competitive among the considered baselines owing to reduced latency and comparable performance.

Title: The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases

Authors: Emanuel Z. Fenech-Borg, Tilen P. Meznaric-Kos, Milica D. Lekovic-Bojovic, Arni J. Hentze-Djurhuus
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12411
Pdf URL: https://arxiv.org/pdf/2508.12411
Copy Paste: [[2508.12411]] The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases(https://arxiv.org/abs/2508.12411)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We propose the notion of a "cultural gene" -- a systematic value orientation that LLMs inherit from their training corpora -- and introduce a Cultural Probe Dataset (CPD) of 200 prompts targeting two classic cross-cultural dimensions: Individualism-Collectivism (IDV) and Power Distance (PDI). Using standardized zero-shot prompts, we compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions. GPT-4 exhibits individualistic and low-power-distance tendencies (IDV score approx 1.21; PDI score approx -1.05), while ERNIE Bot shows collectivistic and higher-power-distance tendencies (IDV approx -0.89; PDI approx 0.76); differences are statistically significant (p < 0.001). We further compute a Cultural Alignment Index (CAI) against Hofstede's national scores and find GPT-4 aligns more closely with the USA (e.g., IDV CAI approx 0.91; PDI CAI approx 0.88) whereas ERNIE Bot aligns more closely with China (IDV CAI approx 0.85; PDI CAI approx 0.81). Qualitative analyses of dilemma resolution and authority-related judgments illustrate how these orientations surface in reasoning. Our results support the view that LLMs function as statistical mirrors of their cultural corpora and motivate culturally aware evaluation and deployment to avoid algorithmic cultural hegemony.

Title: Uncovering Emergent Physics Representations Learned In-Context by Large Language Models

Authors: Yeongwoo Song, Jaeyong Bae, Dong-Kyum Kim, Hawoong Jeong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12448
Pdf URL: https://arxiv.org/pdf/2508.12448
Copy Paste: [[2508.12448]] Uncovering Emergent Physics Representations Learned In-Context by Large Language Models(https://arxiv.org/abs/2508.12448)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve a wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model's residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.

Title: M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

Authors: Ruirui Gao, Emily Johnson, Bowen Tan, Yanfei Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12458
Pdf URL: https://arxiv.org/pdf/2508.12458
Copy Paste: [[2508.12458]] M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following(https://arxiv.org/abs/2508.12458)
Keywords: language model
Abstract: Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods like RLHF and DPO frequently struggle to efficiently leverage the model's own generation space to identify highly informative "hard negative" samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates. This selection is driven by a sophisticated mechanism that integrates two crucial signals: a Multimodal Alignment Score (MAS) to assess external quality and the model's Self-Consistency / Confidence (log-probability) to gauge internal belief. These are combined into a novel M3P-Score, which specifically identifies preferred responses and challenging dispreferred responses that the model might confidently generate despite being incorrect. These high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning on base LVLMs like LLaVA-1.5 (7B/13B) using LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).
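For readers unfamiliar with the DPO objective that the selected preference pairs feed into, here is a minimal sketch of the standard per-pair DPO loss. It covers only the generic loss, not the M3P-Score selection step, whose formula the abstract does not give. The function name, the `beta` default, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage of the chosen response over the rejected one,
    measured relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Stable softplus(-x) = log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Illustrative sequence log-probabilities (not from the paper).
loss_good = dpo_loss(-1.0, -5.0, -2.0, -4.0)  # policy already prefers chosen
loss_bad = dpo_loss(-5.0, -1.0, -4.0, -2.0)   # policy prefers rejected
loss_neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)   # no preference: log(2)
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, which is why confidently generated but incorrect responses make informative rejected samples.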

Title: LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages

Authors: Alham Fikri Aji, Trevor Cohn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12459
Pdf URL: https://arxiv.org/pdf/2508.12459
Copy Paste: [[2508.12459]] LoraxBench: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages(https://arxiv.org/abs/2508.12459)
Keywords: llm
Abstract: As one of the world's most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We evaluate a diverse set of multilingual and region-focused LLMs and find that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to the general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as high-level politeness 'Krama' Javanese.

Title: Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models

Authors: Ziqian Bi, Keyu Chen, Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12461
Pdf URL: https://arxiv.org/pdf/2508.12461
Copy Paste: [[2508.12461]] Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models(https://arxiv.org/abs/2508.12461)
Keywords: language model, gpt
Abstract: In August 2025, OpenAI released GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture-of-experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemar's test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments.
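The abstract above mentions statistical validation with McNemar's test. As a rough illustration only (the paper's exact procedure, data, and test variant are not given here), an exact two-sided McNemar test on paired per-item correctness could be sketched as follows. The function name `mcnemar_exact` and the toy correctness vectors are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts items only model A got right
    and c counts items only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under H0 each discordant pair favours either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 15 benchmark items scored for two models.
model_a = [True] * 9 + [False] * 1 + [True] * 5
model_b = [False] * 9 + [True] * 1 + [True] * 5
b, c, p = mcnemar_exact(model_a, model_b)
print(b, c, p)
```

Only the discordant pairs enter the test, which is why it suits comparisons where two models answer the same benchmark items.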

Title: The Structural Sources of Verb Meaning Revisited: Large Language Models Display Syntactic Bootstrapping

Authors: Xiaomeng Zhu, R. Thomas McCoy, Robert Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12482
Pdf URL: https://arxiv.org/pdf/2508.12482
Copy Paste: [[2508.12482]] The Structural Sources of Verb Meaning Revisited: Large Language Models Display Syntactic Bootstrapping(https://arxiv.org/abs/2508.12482)
Keywords: language model, gpt
Abstract: Syntactic bootstrapping (Gleitman, 1990) is the hypothesis that children use the syntactic environments in which a verb occurs to learn its meaning. In this paper, we examine whether large language models exhibit a similar behavior. We do this by training RoBERTa and GPT-2 on perturbed datasets where syntactic information is ablated. Our results show that models' verb representation degrades more when syntactic cues are removed than when co-occurrence information is removed. Furthermore, the representation of mental verbs, for which syntactic bootstrapping has been shown to be particularly crucial in human verb learning, is more negatively impacted in such training regimes than that of physical verbs. In contrast, models' representation of nouns is affected more when co-occurrences are distorted than when syntax is distorted. In addition to reinforcing the important role of syntactic bootstrapping in verb learning, our results demonstrate the viability of testing developmental hypotheses on a larger scale through manipulating the learning environments of large language models.

Title: Mitigating Hallucinations in Large Language Models via Causal Reasoning

Authors: Yuangang Li, Yiqing Shen, Yi Nian, Jiechao Gao, Ziyi Wang, Chenxiao Yu, Shawn Li, Jie Wang, Xiyang Hu, Yue Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12495
Pdf URL: https://arxiv.org/pdf/2508.12495
Copy Paste: [[2508.12495]] Mitigating Hallucinations in Large Language Models via Causal Reasoning(https://arxiv.org/abs/2508.12495)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal reasoning capabilities and such hallucinations. However, existing reasoning approaches in LLMs, such as Chain-of-Thought (CoT) and its graph-based variants, operate at the linguistic token level rather than modeling the underlying causal relationships between variables, lacking the ability to represent conditional independencies or satisfy causal identification assumptions. To bridge this gap, we introduce causal-DAG construction and reasoning (CDCR-SFT), a supervised fine-tuning framework that trains LLMs to explicitly construct a variable-level directed acyclic graph (DAG) and then perform reasoning over it. Moreover, we present a dataset comprising 25,368 samples (CausalDR), where each sample includes an input question, explicit causal DAG, graph-based reasoning trace, and validated answer. Experiments on four LLMs across eight tasks show that CDCR-SFT improves the causal reasoning capability with the state-of-the-art 95.33% accuracy on CLADDER (surpassing human performance of 94.8% for the first time) and reduces the hallucination on HaluEval with 10% improvements. It demonstrates that explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies in LLM outputs. Code is available at this https URL.

Title: CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12535
Pdf URL: https://arxiv.org/pdf/2508.12535
Copy Paste: [[2508.12535]] CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection(https://arxiv.org/abs/2508.12535)
Keywords: language model, llm
Abstract: Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
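The CorrSteer abstract describes selecting SAE features by correlating sample correctness with feature activations, then taking steering coefficients from average activations. A minimal sketch of that idea, not the authors' implementation: it assumes Pearson correlation, a single selected feature, and a coefficient averaged over the correct samples only. The names `pearson` and `select_feature` and the toy activation matrix are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_feature(activations, correct):
    """Pick the feature whose activation correlates most with correctness.

    activations: one row per sample, one column per SAE feature.
    correct: 1 if the sample was answered correctly, else 0.
    Returns (feature_index, correlation, steering_coefficient).
    """
    n_features = len(activations[0])
    columns = [[row[j] for row in activations] for j in range(n_features)]
    scores = [pearson(col, correct) for col in columns]
    best = max(range(n_features), key=lambda j: scores[j])
    # Assumption: the coefficient is the mean activation over correct samples.
    positives = [a for a, ok in zip(columns[best], correct) if ok]
    coeff = sum(positives) / len(positives) if positives else 0.0
    return best, scores[best], coeff


# Toy data (hypothetical): 4 samples, 2 features.
acts = [[0.9, 0.1], [0.8, 0.4], [0.1, 0.3], [0.0, 0.2]]
labels = [1, 1, 0, 0]
idx, r, coeff = select_feature(acts, labels)
print(idx, round(r, 4), coeff)
```

In this toy case feature 0 tracks correctness closely while feature 1 is uncorrelated, so feature 0 is the one selected.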

Title: Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning

Authors: Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen
Subjects: cs.CL, cs.AI, cs.SD
Abstract URL: https://arxiv.org/abs/2508.12591
Pdf URL: https://arxiv.org/pdf/2508.12591
Copy Paste: [[2508.12591]] Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning(https://arxiv.org/abs/2508.12591)
Keywords: language model, llm
Abstract: Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information while audio-based methods miss semantic context. Multimodal Large Language Models (MLLM) offer unprecedented opportunities for comprehensive ASA by simultaneously processing audio and text within unified frameworks. This paper presents the first systematic study of MLLM for comprehensive ASA, demonstrating the superior performance of MLLM across the aspects of content and language use. However, assessment on the delivery aspect reveals unique challenges, which are deemed to require specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), leveraging a curriculum learning principle to establish more robust modeling foundations of speech before cross-modal synergetic fusion. A series of experiments on a benchmark dataset show MLLM-based systems can elevate the holistic assessment performance from a PCC value of 0.783 to 0.846. In particular, SFMT excels in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches, which also paves a new avenue for ASA.

Title: Semantic Anchoring in Agentic Memory: Leveraging Linguistic Structures for Persistent Conversational Context

Authors: Maitreyi Chatterjee, Devansh Agarwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12630
Pdf URL: https://arxiv.org/pdf/2508.12630
Copy Paste: [[2508.12630]] Semantic Anchoring in Agentic Memory: Leveraging Linguistic Structures for Persistent Conversational Context(https://arxiv.org/abs/2508.12630)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have demonstrated impressive fluency and task competence in conversational settings. However, their effectiveness in multi-session and long-term interactions is hindered by limited memory persistence. Typical retrieval-augmented generation (RAG) systems store dialogue history as dense vectors, which capture semantic similarity but neglect finer linguistic structures such as syntactic dependencies, discourse relations, and coreference links. We propose Semantic Anchoring, a hybrid agentic memory architecture that enriches vector-based storage with explicit linguistic cues to improve recall of nuanced, context-rich exchanges. Our approach combines dependency parsing, discourse relation tagging, and coreference resolution to create structured memory entries. Experiments on adapted long-term dialogue datasets show that semantic anchoring improves factual recall and discourse coherence by up to 18% over strong RAG baselines. We further conduct ablation studies, human evaluations, and error analysis to assess robustness and interpretability.

Title: Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Authors: Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12631
Pdf URL: https://arxiv.org/pdf/2508.12631
Copy Paste: [[2508.12631]] Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing(https://arxiv.org/abs/2508.12631)
Keywords: language model, gpt, llm
Abstract: Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models -- including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 -- Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at this https URL.
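The Avengers-Pro abstract outlines a routing loop: embed and cluster incoming queries, then send each query to the model with the best performance-efficiency score for its cluster. The sketch below illustrates that loop under stated assumptions; it is not the released code. The scoring formula (a weighted sum of accuracy and normalized cost savings with weight `alpha`), the nearest-centroid assignment, the model names, and all numbers are hypothetical.

```python
def nearest_cluster(query_vec, centroids):
    """Index of the centroid closest to the query embedding (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sqdist(query_vec, centroids[i]))


def route(query_vec, centroids, profiles, alpha):
    """Choose a model for one query.

    profiles[c] maps model name -> (accuracy, cost) measured on cluster c.
    alpha in [0, 1] trades accuracy (alpha = 1) against cost (alpha = 0).
    """
    stats = profiles[nearest_cluster(query_vec, centroids)]
    max_cost = max(cost for _, cost in stats.values())

    def score(model):
        accuracy, cost = stats[model]
        # Assumed scoring rule: weighted accuracy plus weighted cost savings.
        return alpha * accuracy + (1 - alpha) * (1 - cost / max_cost)

    return max(stats, key=score)


# Toy setup (hypothetical): two clusters, two models with per-cluster accuracy and cost.
centroids = [[0.0, 0.0], [1.0, 1.0]]
profiles = [
    {"small": (0.80, 1.0), "large": (0.85, 10.0)},  # cluster 0: easy queries
    {"small": (0.40, 1.0), "large": (0.90, 10.0)},  # cluster 1: hard queries
]
print(route([0.1, 0.0], centroids, profiles, alpha=0.5))
print(route([0.9, 1.0], centroids, profiles, alpha=0.9))
```

Sweeping `alpha` from 0 to 1 traces out different cost and accuracy operating points, which is the kind of trade-off parameter the abstract refers to.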

Title: Prompt-Induced Linguistic Fingerprints for LLM-Generated Fake News Detection

Authors: Chi Wang, Min Gao, Zongwei Wang, Junwei Yin, Kai Shu, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12632
Pdf URL: https://arxiv.org/pdf/2508.12632
Copy Paste: [[2508.12632]] Prompt-Induced Linguistic Fingerprints for LLM-Generated Fake News Detection(https://arxiv.org/abs/2508.12632)
Keywords: language model, llm, prompt
Abstract: With the rapid development of large language models, the generation of fake news has become increasingly effortless, posing a growing societal threat and underscoring the urgent need for reliable detection methods. Early efforts to identify LLM-generated fake news have predominantly focused on the textual content itself; however, because much of that content may appear coherent and factually consistent, the subtle traces of falsification are often difficult to uncover. Through distributional divergence analysis, we uncover prompt-induced linguistic fingerprints: statistically distinct probability shifts between LLM-generated real and fake news when maliciously prompted. Based on this insight, we propose a novel method named Linguistic Fingerprints Extraction (LIFE). By reconstructing word-level probability distributions, LIFE can find discriminative patterns that facilitate the detection of LLM-generated fake news. To further amplify these fingerprint patterns, we also leverage key-fragment techniques that accentuate subtle linguistic differences, thereby improving detection reliability. Our experiments show that LIFE achieves state-of-the-art performance in LLM-generated fake news and maintains high performance in human-written fake news. The code and data are available at this https URL.

Title: Breaking Language Barriers: Equitable Performance in Multilingual Language Models

Authors: Tanay Nagar, Grigorii Khvatskii, Anna Sokol, Nitesh V. Chawla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12662
Pdf URL: https://arxiv.org/pdf/2508.12662
Copy Paste: [[2508.12662]] Breaking Language Barriers: Equitable Performance in Multilingual Language Models(https://arxiv.org/abs/2508.12662)
Keywords: language model, llm, prompt
Abstract: Cutting-edge LLMs have emerged as powerful tools for multilingual communication and understanding. However, LLMs perform worse in Common Sense Reasoning (CSR) tasks when prompted in low-resource languages (LRLs) like Hindi or Swahili compared to high-resource languages (HRLs) like English. Equalizing this inconsistent access to quality LLM outputs is crucial to ensure fairness for speakers of LRLs and across diverse linguistic communities. In this paper, we propose an approach to bridge this gap in LLM performance. Our approach involves fine-tuning an LLM on synthetic code-switched text generated using controlled language-mixing methods. We empirically demonstrate that fine-tuning LLMs on synthetic code-switched datasets leads to substantial improvements in LRL model performance while preserving or enhancing performance in HRLs. Additionally, we present a new dataset of synthetic code-switched text derived from the CommonSenseQA dataset, featuring three distinct language ratio configurations.

Title: Leveraging Large Language Models for Predictive Analysis of Human Misery

Authors: Bishanka Seal, Rahul Seetharaman, Aman Bansal, Abhilash Nandy
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.12669
Pdf URL: https://arxiv.org/pdf/2508.12669
Copy Paste: [[2508.12669]] Leveraging Large Language Models for Predictive Analysis of Human Misery(https://arxiv.org/abs/2508.12669)
Keywords: language model, llm, prompt
Abstract: This study investigates the use of Large Language Models (LLMs) for predicting human-perceived misery scores from natural language descriptions of real-world scenarios. The task is framed as a regression problem, where the model assigns a scalar value from 0 to 100 to each input statement. We evaluate multiple prompting strategies, including zero-shot, fixed-context few-shot, and retrieval-based prompting using BERT sentence embeddings. Few-shot approaches consistently outperform zero-shot baselines, underscoring the value of contextual examples in affective prediction. To move beyond static evaluation, we introduce the "Misery Game Show", a novel gamified framework inspired by a television format. It tests LLMs through structured rounds involving ordinal comparison, binary classification, scalar estimation, and feedback-driven reasoning. This setup enables us to assess not only predictive accuracy but also the model's ability to adapt based on corrective feedback. The gamified evaluation highlights the broader potential of LLMs in dynamic emotional reasoning tasks beyond standard regression. Code and data link: this https URL

Title: ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

Authors: Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.12685
Pdf URL: https://arxiv.org/pdf/2508.12685
Copy Paste: [[2508.12685]] ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction(https://arxiv.org/abs/2508.12685)
Keywords: language model, llm, agent
Abstract: Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.

Title: DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Authors: Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Yuchi Xu, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12726
Pdf URL: https://arxiv.org/pdf/2508.12726
Copy Paste: [[2508.12726]] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning(https://arxiv.org/abs/2508.12726)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often either lack disciplinary breadth or the structural depth necessary to elicit robust reasoning behaviors. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (book corpus and web corpus) to generate multidisciplinary challenging questions. A core innovation of our approach is the introduction of a Design Logic concept, which mimics the question-creation process of human educators. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with disciplinary source materials, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Based on this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: Design-Logic-Reasoning-Book (DLR-Book), containing 3.04 million challenging questions synthesized from the book corpus, and Design-Logic-Reasoning-Web (DLR-Web), with 1.66 million challenging questions from the web corpus. Our data analysis demonstrates that the questions synthesized by our method exhibit substantially greater difficulty and diversity than those in the baseline datasets. We validate the effectiveness of these datasets by conducting SFT experiments on the Qwen3-8B-Base and Qwen3-4B-Base models. The results show that our dataset significantly outperforms existing multidisciplinary datasets of the same volume. Training with the full datasets further enables the models to surpass the multidisciplinary reasoning performance of the official Qwen3-8B and Qwen3-4B models.

Title: LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models

Authors: Zhiyuan Ning, Tianle Gu, Jiaxin Song, Shixin Hong, Lingyu Li, Huacan Liu, Jie Li, Yixu Wang, Meng Lingyu, Yan Teng, Yingchun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12733
Pdf URL: https://arxiv.org/pdf/2508.12733
Copy Paste: [[2508.12733]] LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models(https://arxiv.org/abs/2508.12733)
Keywords: language model, llm
Abstract: The widespread adoption and increasing prominence of large language models (LLMs) in global technologies necessitate a rigorous focus on ensuring their safety across a diverse range of linguistic and cultural contexts. The lack of a comprehensive evaluation and diverse data in existing multilingual safety evaluations for LLMs limits their effectiveness, hindering the development of robust multilingual safety alignment. To address this critical gap, we introduce LinguaSafe, a comprehensive multilingual safety benchmark crafted with meticulous attention to linguistic authenticity. The LinguaSafe dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Curated using a combination of translated, transcreated, and natively-sourced data, our dataset addresses the critical need for multilingual safety evaluations of LLMs, filling the void in the safety evaluation of LLMs across diverse under-represented languages from Hungarian to Malay. LinguaSafe presents a multidimensional and fine-grained evaluation framework, with direct and indirect safety assessments, including further evaluations for oversensitivity. The results of safety and helpfulness evaluations vary significantly across different domains and different languages, even in languages with similar resource levels. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation, underscoring the critical importance of thoroughly assessing multilingual safety in LLMs to achieve more balanced safety alignment. Our dataset and code are released to the public to facilitate further research in the field of multilingual LLM safety.

Title: CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Authors: Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Penge
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12769
Pdf URL: https://arxiv.org/pdf/2508.12769
Copy Paste: [[2508.12769]] CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description(https://arxiv.org/abs/2508.12769)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at this https URL
摘要：大型语言模型（LLM）的最新进展显着提高了文本到SQL系统的准确性。但是，一个关键的挑战仍然存在：自然语言问题（NLQ）及其相应的SQL查询之间的语义不匹配。这个问题在大规模数据库中加剧了，在SQL生成过程中，语义上相似的属性阻碍了链接和语义漂移，最终降低了模型的准确性。为了应对这些挑战，我们介绍了Cred-SQL，这是一个为大规模数据库而设计的框架，该数据库集成了集群检索和执行描述。 Cred-SQL首先执行基于群集的大规模架构检索，以查明与给定NLQ最相关的表和列，从而减轻了模式不匹配。然后，它引入了中间自然语言表示 - 执行语言（EDL） - 弥合NLQ和SQL之间的差距。该重新构造将任务分为两个阶段：文本到edl和edl to-sql，利用LLMS的强大一般推理能力，同时减少语义偏差。对两个大规模，跨域基准的旋转和鸟类示范的广泛实验表明，Cred-SQL实现了新的最先进的（SOTA）性能，从而验证了其有效性和可伸缩性。我们的代码可在此HTTPS URL上找到

Title: From SALAMANDRA to SALAMANDRATA: BSC Submission for WMT25 General Machine Translation Shared Task

Authors: Javier Garcia Gilabert, Xixian Liao, Severino Da Dalt, Ella Bohman, Audrey Mash, Francesca De Luca Fornaciari, Irene Baucells, Joan Llop, Miguel Claramunt Argote, Carlos Escolano, Maite Melero
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12774
Pdf URL: https://arxiv.org/pdf/2508.12774
Copy Paste: [[2508.12774]] From SALAMANDRA to SALAMANDRATA: BSC Submission for WMT25 General Machine Translation Shared Task(https://arxiv.org/abs/2508.12774)
Keywords: llm
Abstract: In this paper, we present the SALAMANDRATA family of models, an improved iteration of SALAMANDRA LLMs (Gonzalez-Agirre et al., 2025) specifically trained to achieve strong performance in translation-related tasks for 38 European languages. SALAMANDRATA comes in two scales: 2B and 7B parameters. For both versions, we applied the same training recipe with a first step of continual pre-training on parallel data, and a second step of supervised fine-tuning on high-quality instructions. The BSC submission to the WMT25 General Machine Translation shared task is based on the 7B variant of SALAMANDRATA. We first adapted the model vocabulary to support the additional non-European languages included in the task. This was followed by a second phase of continual pre-training and supervised fine-tuning, carefully designed to optimize performance across all translation directions for this year's shared task. For decoding, we employed two quality-aware strategies: Minimum Bayes Risk Decoding and Tuned Re-ranking, using COMET and COMET-KIWI respectively. We publicly release both the 2B and 7B versions of SALAMANDRATA, along with the newer SALAMANDRATA-V2 model, on Hugging Face.
摘要：在本文中，我们介绍了Salamandrata模型家族，这是Salamandra LLM的改进（Gonzalez-Agirre等，2025），专门培训，可在38种欧洲语言的翻译相关任务中实现强大的绩效。 Salamandrata有两个尺度：2B和7B参数。对于这两个版本，我们都采用了相同的培训配方，并在并行数据上连续进行预训练的第一步，以及对高质量指令进行微调的第二步。 WMT25通用机器翻译共享任务的BSC提交基于Salamandrata的7B变体。我们首先对模型词汇进行了调整，以支持任务中包含的其他非欧洲语言。接下来是连续培训和监督微调的第二阶段，精心设计，以优化所有翻译方向的性能，以完成今年的共同任务。为了解码，我们采用了两种质量感知的策略：最低贝叶斯风险使用彗星和彗星 - KIWI进行重新排列。我们公开发布了Salamandrata的2b和7b版本，以及较新的Salamandrata-V2型号，Hugging Face1
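
The Minimum Bayes Risk decoding step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a simple token-overlap F1 as the utility function, which is only a stand-in for a learned metric such as COMET; the names `overlap_f1` and `mbr_select` are hypothetical and not taken from the paper.

```python
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (assumed stand-in for a learned metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1) -> int:
    """Return the index of the candidate with the highest average utility
    when every other candidate is treated as a pseudo-reference."""
    if len(candidates) < 2:
        return 0

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    return max(range(len(candidates)), key=expected_utility)


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog barked loudly",
]
best = mbr_select(candidates)
print(candidates[best])
```

The selected hypothesis is the one that agrees most with the rest of the sample pool, which is the core idea of MBR; swapping in a different utility only requires passing another function as `utility`.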

Title: HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks

Authors: Zhe Chen, Yusheng Liao, Shuyang Jiang, Zhiyuan Zhu, Haolin Li, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12778
Pdf URL: https://arxiv.org/pdf/2508.12778
Copy Paste: [[2508.12778]] HeteroRAG: A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks(https://arxiv.org/abs/2508.12778)
Keywords: language model, retrieval-augmented generation
Abstract: Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.
摘要：医学大视力语言模型（MED-LVLM）在临床应用中表现出了希望，但遭受了事实上的不准确和不可靠的产出，在现实世界中的诊断中带来了风险。尽管检索增强的一代已经成为潜在的解决方案，但当前的医学多模式抹布系统无法在异质源进行有效检索。检索报告的无关紧要会影响分析的事实，而知识不足会影响临床决策的可信度。为了弥合差距，我们构建了Medatlas，其中包括广泛的多模式报告存储库和各种文本语料库。基于它，我们提出了Heterorag，这是一个新颖的框架，可通过异质知识来源增强MED-LVLM。该框架引入了特定于模式的剪辑，以进行有效的报告检索，并引入了用于动态构建各种语料库的查询的多公司查询生成器。然后，通过从此类多面源中的知识结合了Med-LVLM，然后通过异质知识偏好调整进行训练，以实现交叉模式和多源知识对齐。跨12个数据集和3种模式进行的广泛实验表明，拟议的Heterorag在大多数医学视觉语言基准中实现了最先进的性能，从而显着提高了MED-LVLMS的事实准确性和可靠性。

Title: Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward

Authors: Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, Yang Qin, Changhua Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12800
Pdf URL: https://arxiv.org/pdf/2508.12800
Copy Paste: [[2508.12800]] Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward(https://arxiv.org/abs/2508.12800)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
摘要：大型语言模型（LLMS）具有出色的解决问题的能力，但由于静态内部知识而在复杂的任务中挣扎。检索增强的生成（RAG）增强了对外部信息的访问，但由于严格的工作流程，多跳的推理和战略搜索仍然有限。代理深度研究的最新进展使LLMS能够自主的理由，搜索和综合信息。但是，目前依靠基于结果的增强学习（RL）的方法面临着关键问题，例如冲突梯度和奖励稀疏性，限制了绩效提高和训练效率。为了解决这些问题，我们首先提出了原子思想，这是一种新颖的LLM思维范式，将推理分解为细粒功能单元。这些单元是通过推理奖励模型（RRMS）监督的，这些模型（RRMS）提供了原子思想奖励（ATR）以进行细粒度的指导。在此基础上，我们提出了Atom-Searcher，这是一个新型的RL Agesic深入研究框架，将原子思想和ATR整合在一起。 Atom-Searcher使用课程启发的奖励时间表，优先考虑过程级ATR，并过渡到结果奖励，从而加速了在有效推理路径上的融合。七个基准测试的实验表现出比最先进的一致改进。关键优势包括：（1）在测试时间时进行计算计算。（2）原子思想为RRMS提供了监督锚，桥接了深入的研究任务和RRM。（3）Atom-Searcher表现出更容易解释的人类般的推理模式。

Title: When Alignment Hurts: Decoupling Representational Spaces in Multilingual Models

Authors: Ahmed Elshabrawy, Hour Kaing, Haiyue Song, Alham Fikri Aji, Hideki Tanaka, Masao Utiyama, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12803
Pdf URL: https://arxiv.org/pdf/2508.12803
Copy Paste: [[2508.12803]] When Alignment Hurts: Decoupling Representational Spaces in Multilingual Models(https://arxiv.org/abs/2508.12803)
Keywords: language model, llm
Abstract: Alignment with high-resource standard languages is often assumed to aid the modeling of related low-resource varieties. We challenge this assumption by demonstrating that excessive representational entanglement with a dominant variety, such as Modern Standard Arabic (MSA) in relation to Arabic dialects, can actively hinder generative modeling. We present the first comprehensive causal study of this phenomenon by analyzing and directly intervening in the internal representation geometry of large language models (LLMs). Our key contribution is an online variational probing framework that continuously estimates the subspace of the standard variety during fine-tuning, enabling projection-based decoupling from this space. While our study uses Arabic as a case due to its unusually rich parallel resources across 25 dialects, the broader motivation is methodological: dialectal MT serves as a controlled proxy for generative tasks where comparable multi-variety corpora are unavailable. Across 25 dialects, our intervention improves generation quality by up to +4.9 chrF++ and +2.0 on average compared to standard fine-tuning, despite a measured tradeoff in standard-language performance. These results provide causal evidence that subspace dominance by high-resource varieties can restrict generative capacity for related varieties. More generally, we unify geometric and information-theoretic probing with subspace-level causal interventions, offering practical tools for improving generative modeling in closely related language families and, more broadly, for controlling representational allocation in multilingual and multi-domain LLMs. Code will be released.
摘要：通常假定与高资源标准语言的一致性有助于建模相关的低资源品种。我们通过证明与阿拉伯语方言相关的现代标准阿拉伯语（MSA）的过度代表性纠缠来挑战这一假设，可以积极阻碍生成的建模。我们通过分析和直接介入大语言模型（LLMS）的内部表示几何形状，介绍了对该现象的首次全面因果研究。我们的关键贡献是一个在线变异探测框架，该框架在微调过程中不断估计标准品种的子空间，从而使基于投影的脱钩能够与该空间进行分离。尽管我们的研究将阿拉伯语作为一种情况，因为它在25种方言中具有异常丰富的平行资源，但更广泛的动机是方法论上：方言MT是无法使用可比的多变量语料库的生成任务的受控代理。在25种方言中，与标准微调相比，我们的干预措施平均提高了高达+4.9 CHRF ++和+2.0的发电质量，尽管标准语言性能的折衷进行了衡量。这些结果提供了因果证据，表明高资源品种的子空间优势可以限制相关品种的生成能力。更一般而言，我们通过子空间级别的因果干预措施将几何和信息理论探测统一，提供了实用的工具，可改善与密切相关的语言家族中的生成建模，更广泛地控制多语言和多域LLMS中的代表性分配。代码将发布。
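
As a rough illustration of projection-based decoupling, the sketch below removes the component of a hidden vector that lies in a given subspace. It assumes the subspace is supplied as a list of spanning vectors; how the paper estimates that subspace online is not covered here, and the helper names `orthonormalize` and `project_out` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors that are (numerically) dependent
            basis.append([wi / norm for wi in w])
    return basis


def project_out(h, basis):
    """Return h minus its projection onto span(basis); basis must be orthonormal."""
    out = list(h)
    for u in basis:
        c = dot(h, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out


# Assumed toy subspace: the x-y plane of a 3-dimensional space.
basis = orthonormalize([[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
decoupled = project_out([3.0, 4.0, 5.0], basis)
print(decoupled)
```

After the projection, the vector has no remaining component along any basis direction, so only the part outside the assumed subspace survives.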

Title: Word Meanings in Transformer Language Models

Authors: Jumbly Grindrod, Peter Grindrod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12863
Pdf URL: https://arxiv.org/pdf/2508.12863
Copy Paste: [[2508.12863]] Word Meanings in Transformer Language Models(https://arxiv.org/abs/2508.12863)
Keywords: language model, llm
Abstract: We investigate how word meanings are represented in the transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each word has an entry that contains semantic information. To do this, we extracted the token embedding space of RoBERTa-base and k-means clustered it into 200 clusters. In our first study, we then manually inspected the resultant clusters to consider whether they are sensitive to semantic information. In our second study, we tested whether the clusters are sensitive to five psycholinguistic measures: valence, concreteness, iconicity, taboo, and age of acquisition. Overall, our findings were very positive - there is a wide variety of semantic information encoded within the token embedding space. This serves to rule out certain "meaning eliminativist" hypotheses about how transformer LLMs process semantic information.
摘要：我们研究了变压器语言模型中的单词含义如何表示。具体而言，我们关注的是变压器模型是否采用类似于词汇商店的东西 - 每个单词都有一个包含语义信息的条目。为此，我们提取了Roberta-base的令牌嵌入空间，K-Means将其聚集到200个簇中。然后，在我们的第一个研究中，我们手动检查了最终的簇，以考虑它们是否对语义信息敏感。在我们的第二项研究中，我们测试了簇是否对五种心理语言措施敏感：价，具体性，标志性，禁忌和获取的年龄。总体而言，我们的发现非常积极 - 在令牌嵌入空间中编码了各种各样的语义信息。这有助于排除某些关于Transformer LLM如何处理语义信息的“消除主义”假设。
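
The clustering step described in the abstract above can be sketched with a small k-means (Lloyd's algorithm) routine. This is a toy version under stated assumptions: the points are made-up 2-D vectors rather than RoBERTa-base token embeddings, the initial centroids are simply the first k points, and the function name `kmeans` is hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with squared Euclidean distance.

    Initial centroids are the first k points (an assumption made here
    to keep the example deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for embedding vectors.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

With real token embeddings the same loop applies unchanged; only the dimensionality and the number of clusters (200 in the abstract) differ.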

Title: An LLM Agent-Based Complex Semantic Table Annotation Approach

Authors: Yilin Geng, Shujing Wang, Chuan Wang, Keqing He, Yanfei Lv, Ying Wang, Zaiwen Feng, Xiaoying Bai
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2508.12868
Pdf URL: https://arxiv.org/pdf/2508.12868
Copy Paste: [[2508.12868]] An LLM Agent-Based Complex Semantic Table Annotation Approach(https://arxiv.org/abs/2508.12868)
Keywords: llm, prompt, agent
Abstract: The Semantic Table Annotation (STA) task, which includes Column Type Annotation (CTA) and Cell Entity Annotation (CEA), maps table contents to ontology entities and plays important roles in various semantic applications. However, complex tables often pose challenges such as semantic loss of column names or cell values, strict ontological hierarchy requirements, homonyms, spelling errors, and abbreviations, which hinder annotation accuracy. To address these issues, this paper proposes an LLM-based agent approach for CTA and CEA. We design and implement five external tools with tailored prompts based on the ReAct framework, enabling the STA agent to dynamically select suitable annotation strategies depending on table characteristics. Experiments are conducted on the Tough Tables and BiodivTab datasets from the SemTab challenge, which contain the aforementioned challenges. Our method outperforms existing approaches across various metrics. Furthermore, by leveraging Levenshtein distance to reduce redundant annotations, we achieve a 70% reduction in time costs and a 60% reduction in LLM token usage, providing an efficient and cost-effective solution for STA.
摘要：语义表注释（STA）任务，其中包括列类型注释（CTA）和细胞实体注释（CEA），映射表内容到本体论实体，并在各种语义应用中扮演重要角色。但是，复杂的表通常会带来挑战，例如列名称或单元格值的语义丢失，严格的本体论层次结构要求，同音词，拼写错误和缩写，这阻碍了注释精度。为了解决这些问题，本文提出了针对CTA和CEA的基于LLM的代理商方法。我们根据React框架设计并实施了五个具有量身定制提示的外部工具，使STA代理可以根据表特征动态选择合适的注释策略。实验是在SEMTAB挑战的坚韧表和生物驱动器数据集上进行的，SEMTAB挑战包含上述挑战。我们的方法的表现优于各种指标的现有方法。此外，通过利用Levenshtein距离来减少冗余注释，我们的时间成本降低了70％，LLM令牌使用率降低了60％，为STA提供了有效且具有成本效益的解决方案。
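
To make the Levenshtein-based redundancy reduction concrete, here is a minimal sketch. The edit-distance function is the standard dynamic program; the grouping helper `group_near_duplicates` and its threshold are assumptions for illustration, since the abstract does not say exactly how the distance is applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def group_near_duplicates(values, max_dist=1):
    """Map each value to the first earlier value within max_dist edits.

    Hypothetical helper: only one annotation call would then be needed
    per representative instead of per cell value."""
    representatives = []
    mapping = {}
    for v in values:
        for rep in representatives:
            if levenshtein(v, rep) <= max_dist:
                mapping[v] = rep
                break
        else:
            representatives.append(v)
            mapping[v] = v
    return mapping


cells = ["Berlin", "Berlln", "Paris", "berlin"]
mapping = group_near_duplicates(cells)
print(mapping)
```

Under this assumed scheme, misspelled or differently cased cell values reuse the annotation of their representative, which is one plausible way such a distance could cut repeated LLM calls.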

Title: A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

Authors: Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun Li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.12903
Pdf URL: https://arxiv.org/pdf/2508.12903
Copy Paste: [[2508.12903]] A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models(https://arxiv.org/abs/2508.12903)
Keywords: language model, llm
Abstract: Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6 percent compared to standard generation, while also achieving an 8.2 percent improvement in accuracy. Our code and all baselines used in the paper are available on GitHub.
摘要：通过迭代改进，自我投资的最新进展表明，可以改善大语言模型（LLM）的产出的巨大潜力。但是，大多数现有的自我填充方法都依赖于具有固定数量的迭代的反应性过程，因此很难根据不断发展的生成环境来确定最佳的细化时间和内容。受到人类在执行过程中动态提炼思想的方式的启发，我们提出了主动的自我限制（PASR），这是一种新颖的方法，使LLM可以在生成过程中完善其输出。与重生整个响应的方法不同，PASR主动决定是否，何时以及如何根据模型的内部状态和不断发展的环境来完善。我们对10个任务集进行了广泛的实验，以评估PASR的有效性。实验结果表明，PASR显着提高了解决问题的性能。特别是，在QWEN3-8B上，PASR与标准生成相比，PASR将平均令牌消耗量减少了41.6％，同时还可以提高8.2％的准确性。我们的代码和论文中使用的所有基准都在GitHub中可用。

Title: Analyzing Information Sharing and Coordination in Multi-Agent Planning

Authors: Tianyue Ou, Saujas Vaduguru, Daniel Fried
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.12981
Pdf URL: https://arxiv.org/pdf/2508.12981
Copy Paste: [[2508.12981]] Analyzing Information Sharing and Coordination in Multi-Agent Planning(https://arxiv.org/abs/2508.12981)
Keywords: language model, llm, agent
Abstract: Multi-agent systems (MASs) have pushed the boundaries of large language model (LLM) agents in domains such as web research and software engineering. However, long-horizon, multi-constraint planning tasks involve conditioning on detailed information and satisfying complex interdependent constraints, which can pose a challenge for these systems. In this study, we construct an LLM-based MAS for a travel planning task which is representative of these challenges. We evaluate the impact of a notebook to facilitate information sharing, and evaluate an orchestrator agent to improve coordination in free form conversation between agents. We find that the notebook reduces errors due to hallucinated details by 18%, while an orchestrator directs the MAS to focus on and further reduce errors by up to 13.5% within focused sub-areas. Combining both mechanisms achieves a 25% final pass rate on the TravelPlanner benchmark, a 17.5% absolute improvement over the single-agent baseline's 7.5% pass rate. These results highlight the potential of structured information sharing and reflective orchestration as key components in MASs for long horizon planning with LLMs.
摘要：多机构系统（质量）已在Web研究和软件工程等领域中推动了大语言模型（LLM）代理的边界。但是，多匹马，多构造计划任务涉及根据详细信息调节并满足复杂的相互依存约束，这可能会对这些系统构成挑战。在这项研究中，我们为旅行计划任务构建了一个基于LLM的MAS，这代表了这些挑战。我们评估笔记本以促进信息共享的影响，并评估编排代理，以改善代理之间自由形式对话的协调。我们发现，由于幻觉的细节，笔记本将错误减少了18％，而编排者则指示MAS专注于重点的次级分会内的错误并将错误进一步减少13.5％。将这两种机制结合起来，在TravelPlanner基准测试基准上达到了25％的最终通过率，比单代理基线的7.5％通过率的绝对提高17.5％。这些结果突出了结构化信息共享和反思性编排的潜力，这是与LLMS长期计划的质量关键组成部分。

Title: WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Authors: Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, Christian Bizer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13024
Pdf URL: https://arxiv.org/pdf/2508.13024
Copy Paste: [[2508.13024]] WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents(https://arxiv.org/abs/2508.13024)
Keywords: language model, gpt, llm, agent
Abstract: LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the user's needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. These tasks include basic tasks such as finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT 4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.
摘要：基于LLM的Web代理商有可能自动执行长期运行的Web任务，例如在多家在线商店中为特定产品寻找优惠，然后订购满足用户需求的最便宜的产品。本文介绍了Webmall，这是一种多购物中心的在线购物基准，用于评估网络代理在比较购物中的有效性和效率。 Webmall由四家模拟的在线商店组成，这些商店填充了来自Common Crawl的真实产品，以及一套91个跨购物中心任务。这些任务包括基本任务，例如在多家商店中查找特定产品，进行价格比较，在购物车中添加物品以及完成结帐。高级任务涉及根据模糊的要求搜索产品，确定合适的替代品并找到兼容的产品。与现有的电子商务基准（例如网络商店或购物摊）相比，Webmall介绍了多家商店的比较购物任务。此外，产品提供的产品更加异质，因为它们来自数百家不同的现实商店。 Webmall中的任务需要比Webshop中的互动轨迹更长，同时仍代表了现实世界中的购物行为。我们评估了Webmall上的八种基线代理，观察方式，内存利用和潜在的大语言模型（GPT 4.1和Claude Sonnet 4）各不相同。在基本和高级任务集上，表现最佳的配置的完成率分别为75％和53％，F1分别为87％和63％。 Webmall公开发布，以促进有关网络代理的研究，并在电子商务方案中促进导航，推理和效率方面的进步。

Title: Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction

Authors: Xinhe Li, Jiajun Liu, Peng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13037
Pdf URL: https://arxiv.org/pdf/2508.13037
Copy Paste: [[2508.13037]] Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction(https://arxiv.org/abs/2508.13037)
Keywords: language model, llm, chain-of-thought
Abstract: Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, they are akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by these two distinct modes of thinking, we propose a novel method based on the multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train the Knowledge Generator (KG) and Deep Reasoner (DR), respectively. The former outputs only knowledge after receiving problems, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of IR and DR, we evaluate whether their outputs are consistent, and repeat the inference process if they are not. This step can enhance the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively.
摘要：最近的研究表明，大型语言模型（LLMS）具有强大的数学推理能力，但依赖数百亿个参数。为了应对小语言模型（SLM）中不良推理的挑战，现有方法通常利用LLMS来生成大量的数据以进行填充培训。在心理学中，它们类似于系统1的思维，它根据经验和直觉迅速解决了推理问题。但是，人类的学习还需要系统2思考，其中首次获得知识然后通过实践加强。受这两种不同的思维方式的启发，我们提出了一种基于数学推理蒸馏（LORID）多洛拉相互作用的新方法。首先，我们将每个样本的问题和推理输入LLM中以创建知识增强的数据集。随后，我们以直观的推理器（IR）为单位，在学生模型上训练一个洛拉（Lora）块，该障碍直接生成解决问题的经验链。然后，为模仿系统2思考，我们分别训练知识生成器（kg）和深度推理器（DR）。前者仅在收到问题后才输出知识，而后者则使用该知识来执行推理。最后，为了解决IR和DR生成的随机性，我们评估它们的输出是否一致，并且推理过程需要迭代。此步骤可以增强SLM通过相互反馈的数学推理能力。实验结果表明，洛里德（Lorid）在GSM8K数据集上实现了最先进的性能，在该数据集中，它的表现分别优于第二好的方法，分别在五个基本模型中分别超过2.3％，16.1％，2.4％，12.3％和1.8％的精度。

Title: Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Banu Diri, Savaş Yıldırım, Öner Aytaş
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13044
Pdf URL: https://arxiv.org/pdf/2508.13044
Copy Paste: [[2508.13044]] Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi, Zorluklar ve İyileştirme Fırsatları(https://arxiv.org/abs/2508.13044)
Keywords: language model, llm
Abstract: Language models have made significant advancements in understanding and generating human language, achieving remarkable success in various applications. However, evaluating these models remains a challenge, particularly for resource-limited languages like Turkish. To address this issue, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is based on a meticulously curated dataset comprising 6,200 multiple-choice questions across 62 sections within the Turkish education system. This benchmark provides a standard framework for Turkish NLP research, enabling detailed analyses of LLMs' capabilities in processing Turkish text. In this study, we evaluated state-of-the-art LLMs on TR-MMLU, highlighting areas for improvement in model design. TR-MMLU sets a new standard for advancing Turkish NLP research and inspiring future innovations.
摘要：语言模型在理解和生成人类语言方面取得了重大进步，在各种应用中取得了巨大的成功。但是，评估这些模型仍然是一个挑战，特别是对于像土耳其这样的资源有限语言。为了解决这个问题，我们介绍了土耳其MMLU（TR-MMLU）基准，这是一个全面的评估框架，旨在评估土耳其语中大语言模型（LLMS）的语言和概念能力。 TR-MMLU基于精心策划的数据集，其中包括土耳其教育系统中62个部分的6,200个多项选择问题。该基准为土耳其NLP研究提供了标准框架，从而详细分析了LLMS在处理土耳其文本方面的能力。在这项研究中，我们评估了TR-MMLU上的最新LLMS，突出了改进模型设计的领域。 TR-MMLU设定了一个新的标准，用于推进土耳其NLP研究并启发未来的创新。

Title: Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13058
Pdf URL: https://arxiv.org/pdf/2508.13058
Copy Paste: [[2508.13058]] Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi(https://arxiv.org/abs/2508.13058)
Keywords: language model, llm
Abstract: Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.
摘要：令牌化是自然语言处理（NLP）的基本预处理步骤，严重影响了大语言模型（LLMS）捕捉语言和语义细微差别的能力。这项研究介绍了一个新颖的评估框架，以应对特定于形态富含形态和低资源的语言（例如土耳其语）的挑战。利用土耳其MMLU（TR-MMLU）数据集，包括土耳其教育系统中的6,200个多项选择问题，我们根据词汇量，代币计数，处理时间，语言特异性令牌百分比（\％tr）和token Purity（\％pure）评估了代币器。这些新提出的指标衡量了如何有效地象征性的语言结构。我们的分析表明，特定于语言的令牌百分比与下游性能（例如MMLU得分）相比，比令牌纯度更强。此外，仅增加模型参数并不一定会提高语言性能，从而强调了定制的，特定于语言的令牌化方法的重要性。所提出的框架为形态上复杂的语言建立了强大而实用的令牌化标准。

Title: Reinforced Context Order Recovery for Adaptive Reasoning and Planning

Authors: Long Ma, Fangwei Zhong, Yizhou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13070
Pdf URL: https://arxiv.org/pdf/2508.13070
Copy Paste: [[2508.13070]] Reinforced Context Order Recovery for Adaptive Reasoning and Planning(https://arxiv.org/abs/2508.13070)
Keywords: language model
Abstract: Modern causal language models, followed by rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
摘要：现代因果语言模型，然后在离散扩散模型中快速发展，现在可以产生各种有趣且有用的内容。但是，这些模型家族主要是通过固定（从左到右）或随机顺序进行输出令牌的训练，这可能会偏离最初生成令牌的逻辑顺序。在本文中，我们观察到，当前的因果和扩散模型在需要自适应代币生成订单的问题中遇到困难，我们以$ \ MATHCAL {V} $ - 信息框架来表征。在此激励的情况下，我们提出了增强的上下文订单恢复（Recor），这是一种基于加强学习的框架，可从文本数据中提取自适应，数据依赖于数据的代币生成订单，而无需注释。由令牌预测统计数据对自我监督，repor估计了预测每个未填充令牌并自适应地选择训练和推理期间下一个令牌的硬度。关于挑战推理和规划数据集的实验表明，与基准相比，回收的表现出色，有时比通过基地订单监督的甲骨文模型优于甲骨文模型。

Title: DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Authors: Dayyán O'Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13079
Pdf URL: https://arxiv.org/pdf/2508.13079
Copy Paste: [[2508.13079]] DocHPLT: A Massively Multilingual Document-Level Translation Dataset(https://arxiv.org/abs/2508.13079)
Keywords: llm
Abstract: Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences, with further possibility to provide 2500 bonus pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
摘要：现有的文档级计算资源仅适用于少数语言，主要是高资源的语言。为了促进对文档级翻译的培训和评估，以及更广泛地为全球社区的长篇文章建模，我们创建了Dochplt，这是迄今为止最大的公开可用文档级翻译数据集。它包含50种语言和英语配对的1.24亿个对齐文档对，包括42.6亿个刑期，并有可能提供2500个奖金对不涉及英语。与以前基于重建的方法从句子级别的数据中汇总文档，我们修改了现有的Web提取管道，以将完整的文档完整性从源中保留，并保留所有内容，包括不一致的部分。在我们的初步实验确定了文档级翻译的最佳培训上下文策略之后，我们证明了在Dochplt上进行了微调的LLMS基本上胜过了以现成的指导型基线的基本线，并且对资源不足的语言进行了特别的改进。我们在宽松许可下开放数据集，为推进多语言文档级翻译的基础架构提供了必不可少的基础架构。

Title: All for law and law for all: Adaptive RAG Pipeline for Legal Research

Authors: Figarri Keisha, Prince Singh, Pallavi, Dion Fernandes, Aravindh Manivannan, Ilham Wicaksono, Faisal Ahmad
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.13107
Pdf URL: https://arxiv.org/pdf/2508.13107
Copy Paste: [[2508.13107]] All for law and law for all: Adaptive RAG Pipeline for Legal Research(https://arxiv.org/abs/2508.13107)
Keywords: language model, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations by grounding large language model outputs in cited sources, a capability that is especially critical in the legal domain. We present an end-to-end RAG pipeline that revisits and extends the LegalBenchRAG baseline with three targeted enhancements: (i) a context-aware query translator that disentangles document references from natural-language questions and adapts retrieval depth and response style based on expertise and specificity, (ii) open-source retrieval strategies using SBERT and GTE embeddings that achieve substantial performance gains (improving Recall@K by 30-95% and Precision@K by ~2.5× for K>4) while remaining cost-efficient, and (iii) a comprehensive evaluation and generation framework that combines RAGAS, BERTScore-F1, and ROUGE-Recall to assess semantic alignment and faithfulness across models and prompt designs. Our results show that carefully designed open-source pipelines can rival or outperform proprietary approaches in retrieval quality, while a custom legal-grounded prompt consistently produces more faithful and contextually relevant answers than baseline prompting. Taken together, these contributions demonstrate the potential of task-aware, component-level tuning to deliver legally grounded, reproducible, and cost-effective RAG systems for legal research assistance.
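Note: The retrieval gains in this abstract are reported as Recall@K and Precision@K. As a reminder of what those two numbers measure, here is a minimal sketch using the standard textbook definitions. The function names and the toy chunk IDs are illustrative assumptions and are not taken from the paper or from LegalBenchRAG.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen here: nothing to recall
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Toy example (hypothetical chunk IDs, ranked best-first).
ranked = ["c3", "c7", "c1", "c9", "c4"]
gold = {"c1", "c4", "c8"}

for k in (3, 5):
    print(k, precision_at_k(ranked, gold, k), recall_at_k(ranked, gold, k))
```

Raising K can only keep recall the same or increase it, while precision may fall, which is why the abstract qualifies the precision gain with a range of K.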

Title: AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation

Authors: Zefang Liu, Arman Anwar
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.13118
Pdf URL: https://arxiv.org/pdf/2508.13118
Copy Paste: [[2508.13118]] AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation(https://arxiv.org/abs/2508.13118)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors & Breaches (B&B) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG's ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.

Title: Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries

Authors: Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13124
Pdf URL: https://arxiv.org/pdf/2508.13124
Copy Paste: [[2508.13124]] Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries(https://arxiv.org/abs/2508.13124)
Keywords: language model, gpt, llm
Abstract: Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations - which we term Operational Bias - have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension in a transcript and in its summary. The bias is then quantified using two metrics: Fidelity Gap (the Jensen-Shannon (JS) divergence between the two distributions) and Coverage (the percentage of source labels omitted). Using BlindSpot, we conducted an empirical study with 2500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.
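Note: The two metrics named in this abstract can be illustrated with a small sketch. It assumes that each transcript and each summary has already been reduced to a list of category labels for one bias dimension. The helper names, the base-2 logarithm, and the toy labels are assumptions made for illustration, and the paper's exact computation may differ.

```python
import math
from collections import Counter
from typing import List, Sequence


def distribution(labels: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Normalized label frequencies over a fixed category order."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall in the given categories")
    return [counts[c] / total for c in categories]


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    # KL divergence in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def omitted_label_pct(source_labels: Sequence[str], summary_labels: Sequence[str]) -> float:
    """Percentage of distinct source labels that never appear in the summary."""
    source = set(source_labels)
    if not source:
        return 0.0
    return 100.0 * len(source - set(summary_labels)) / len(source)


# Toy labels for a single hypothetical bias dimension ("topic").
categories = ["billing", "technical", "greeting"]
transcript = ["billing", "billing", "technical", "greeting"]
summary = ["billing", "technical"]

gap = js_divergence(distribution(transcript, categories),
                    distribution(summary, categories))
print(round(gap, 4), round(omitted_label_pct(transcript, summary), 1))
```

A gap of 0 means the summary preserves the transcript's label proportions exactly. The second number follows the abstract's wording (share of source labels omitted), so lower is better under this reading.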

Title: Improving Detection of Watermarked Language Models

Authors: Dara Bahri, John Wieting
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.13131
Pdf URL: https://arxiv.org/pdf/2508.13131
Copy Paste: [[2508.13131]] Improving Detection of Watermarked Language Models(https://arxiv.org/abs/2508.13131)
Keywords: language model, llm, prompt
Abstract: Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.

Title: OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Authors: Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13141
Pdf URL: https://arxiv.org/pdf/2508.13141
Copy Paste: [[2508.13141]] OptimalThinkingBench: Evaluating Over and Underthinking in LLMs(https://arxiv.org/abs/2508.13141)
Keywords: llm
Abstract: Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks. Using novel thinking-adjusted accuracy metrics, we perform extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.

Title: Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

Authors: David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13144
Pdf URL: https://arxiv.org/pdf/2508.13144
Copy Paste: [[2508.13144]] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation(https://arxiv.org/abs/2508.13144)
Keywords: language model
Abstract: Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark's ability to separate better models from worse models, and noise, a benchmark's sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model's intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
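Note: The abstract describes signal and noise only in words. One plausible way to turn them into numbers is sketched below, under loudly stated assumptions: signal is taken as the spread of final scores across models relative to their mean, and noise as the relative standard deviation of one model's scores over its last few checkpoints. These formulas and all names are illustrative assumptions and are not confirmed by the abstract.

```python
from statistics import mean, pstdev
from typing import Sequence


def signal(final_scores: Sequence[float]) -> float:
    """ASSUMED definition: score spread across models, relative to the mean."""
    return (max(final_scores) - min(final_scores)) / mean(final_scores)


def noise(checkpoint_scores: Sequence[float]) -> float:
    """ASSUMED definition: relative std. dev. across late training checkpoints."""
    return pstdev(checkpoint_scores) / mean(checkpoint_scores)


def signal_to_noise(final_scores: Sequence[float],
                    checkpoint_scores: Sequence[float]) -> float:
    n = noise(checkpoint_scores)
    if n == 0:
        raise ValueError("noise is zero; the ratio is undefined")
    return signal(final_scores) / n


# Toy numbers (hypothetical): three models' final accuracies on one benchmark,
# and one model's accuracy over its last three checkpoints.
final_scores = [0.40, 0.50, 0.60]
late_checkpoints = [0.49, 0.50, 0.51]

print(signal(final_scores), noise(late_checkpoints),
      signal_to_noise(final_scores, late_checkpoints))
```

Under these assumed definitions, a benchmark whose model-to-model spread is large compared with its checkpoint-to-checkpoint jitter gets a high ratio, which matches the abstract's claim that such benchmarks support more reliable small-scale decisions.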

Title: RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns

Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Ziyang Luo, Di Wang, Min Yang, Lidia S. Chao, Derek F. Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13152
Pdf URL: https://arxiv.org/pdf/2508.13152
Copy Paste: [[2508.13152]] RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns(https://arxiv.org/abs/2508.13152)
Keywords: language model, llm
Abstract: Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representations of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing it with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with an average AUROC of 94.92% across both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: this https URL
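Note: The classification step described in this abstract (project a representation onto a feature direction, then compare with a threshold) can be sketched in a few lines. How the direction and threshold are actually obtained is not stated in the abstract, so this sketch assumes a simple difference of class means for the direction and a midpoint of the two mean scores for the threshold. Both choices, the function names, and the 2-dimensional toy vectors are illustrative assumptions, not RepreGuard's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit_direction(lgt_reps: Sequence[Vector], hwt_reps: Sequence[Vector]) -> List[float]:
    """ASSUMED feature direction: normalized difference of class means."""
    diff = [a - b for a, b in zip(mean_vector(lgt_reps), mean_vector(hwt_reps))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to project on")
    return [d / norm for d in diff]


def projection_score(rep: Vector, direction: Vector) -> float:
    """Dot product of a representation with the unit feature direction."""
    return sum(r * d for r, d in zip(rep, direction))


def midpoint_threshold(lgt_reps, hwt_reps, direction) -> float:
    """ASSUMED threshold: midpoint between the two classes' mean scores."""
    lgt = sum(projection_score(r, direction) for r in lgt_reps) / len(lgt_reps)
    hwt = sum(projection_score(r, direction) for r in hwt_reps) / len(hwt_reps)
    return (lgt + hwt) / 2


def is_llm_generated(rep: Vector, direction: Vector, threshold: float) -> bool:
    return projection_score(rep, direction) > threshold


# Toy 2-d "representations" standing in for surrogate-model activations.
lgt_train = [[2.0, 0.1], [1.8, -0.1]]
hwt_train = [[-2.0, 0.0], [-1.6, 0.2]]

direction = unit_direction(lgt_train, hwt_train)
threshold = midpoint_threshold(lgt_train, hwt_train, direction)
print(is_llm_generated([1.5, 0.0], direction, threshold),
      is_llm_generated([-1.0, 0.3], direction, threshold))
```

Because the direction and threshold are computed once from reference texts, scoring a new text under this sketch costs one forward pass through the surrogate model plus a dot product.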