2025-07-15

Title: SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Authors: Wenliang Shan, Michael Fu, Rui Yang, Chakkrit (Kla) Tantithamthavorn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08898
Pdf URL: https://arxiv.org/pdf/2507.08898
Copy Paste: [[2507.08898]] SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems(https://arxiv.org/abs/2507.08898)
Keywords: language model, llm, prompt
Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. SEALGuard advances the safety alignment of LLM systems by introducing an effective multilingual guardrail.
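The abstract reports DSR, precision, and F1-score but does not define them. The sketch below shows one plausible way to compute such guardrail metrics. It assumes that DSR is the fraction of unsafe prompts the guardrail flags (recall on the unsafe class); that definition, the function name `guardrail_metrics`, and the toy labels are illustrative assumptions, not taken from the paper.

```python
def guardrail_metrics(labels, predictions):
    """Compute toy guardrail metrics.

    labels[i] is True when prompt i is actually unsafe;
    predictions[i] is True when the guardrail flagged prompt i.
    ASSUMPTION: DSR is treated as recall on the unsafe class.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # unsafe and flagged
    fp = sum(1 for y, p in pairs if not y and p)    # safe but flagged
    fn = sum(1 for y, p in pairs if y and not p)    # unsafe but missed

    dsr = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * dsr / (precision + dsr)) if (precision + dsr) else 0.0
    return {"dsr": dsr, "precision": precision, "f1": f1}


if __name__ == "__main__":
    # Hypothetical data: four unsafe prompts followed by two safe ones.
    labels = [True, True, True, True, False, False]
    predictions = [True, True, True, False, True, False]
    print(guardrail_metrics(labels, predictions))
```

Under this assumed definition, a guardrail that flags everything reaches a DSR of 1.0 at the cost of precision, which is presumably why the abstract reports precision and F1 alongside DSR.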

Title: Evaluating LLMs in Medicine: A Call for Rigor, Transparency

Authors: Mahmoud Alwakeel, Aditya Nagori, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08916
Pdf URL: https://arxiv.org/pdf/2507.08916
Copy Paste: [[2507.08916]] Evaluating LLMs in Medicine: A Call for Rigor, Transparency(https://arxiv.org/abs/2507.08916)
Keywords: language model, llm
Abstract: Objectives: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of datasets used for their evaluation. Materials and Methods: Widely-used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Alternatives, such as challenge questions in medical journals, were also analyzed to identify their potential as unbiased evaluation tools. Results: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure to LLM training. These gaps highlight the need for secure, comprehensive, and representative datasets. Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure datasets and methodologies are rigorous, unbiased, and reflective of clinical complexities.

Title: From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Authors: Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08924
Pdf URL: https://arxiv.org/pdf/2507.08924
Copy Paste: [[2507.08924]] From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation(https://arxiv.org/abs/2507.08924)
Keywords: language model, llm
Abstract: The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We make our dataset publicly available.

Title: Self-Improving Model Steering

Authors: Rongyi Zhu, Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Ting Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08967
Pdf URL: https://arxiv.org/pdf/2507.08967
Copy Paste: [[2507.08967]] Self-Improving Model Steering(https://arxiv.org/abs/2507.08967)
Keywords: language model, llm, prompt
Abstract: Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.

Title: Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery

Authors: Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson
Subjects: cs.CL, q-bio.NC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.09011
Pdf URL: https://arxiv.org/pdf/2507.09011
Copy Paste: [[2507.09011]] Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery(https://arxiv.org/abs/2507.09011)
Keywords: language model, hallucination
Abstract: A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Recent proposals regarding the imagery spectrum, that is, differences in the visual system of individuals with absent imagery, typical imagery, and vivid imagery, suggest these differences should impact the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind's eye during Ganzflicker-induced hallucinations. Strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Embeddings from vision language models better captured these differences than text-only language models, and participants with stronger imagery used language with richer sensorimotor associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

Title: Lizard: An Efficient Linearization Framework for Large Language Models

Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09025
Pdf URL: https://arxiv.org/pdf/2507.09025
Copy Paste: [[2507.09025]] Lizard: An Efficient Linearization Framework for Large Language Models(https://arxiv.org/abs/2507.09025)
Keywords: language model, llm
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model's performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
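To illustrate why gated linear attention permits constant-memory inference, here is a minimal sketch of a generic gated linear-attention recurrence. It is not Lizard's actual mechanism: the scalar per-step gate, the missing feature map and normalisation, and the function name `gated_linear_attention` are all simplifying assumptions made for illustration.

```python
def gated_linear_attention(queries, keys, values, gates):
    """Generic gated linear attention, processed one token at a time.

    The running state is a d_k x d_v matrix updated as
        S_t = g_t * S_{t-1} + k_t v_t^T
    and the output for step t is o_t = q_t^T S_t.
    The state size does not depend on the sequence length, unlike a
    key-value cache that grows with every token.
    ASSUMPTION: g_t is a scalar in [0, 1]; real models typically use
    learned, per-dimension gates.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        # Decay the old memory, then write the new key-value outer product.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        # Read from memory with the query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs, state


if __name__ == "__main__":
    outs, final_state = gated_linear_attention(
        queries=[[1.0, 0.0], [0.0, 1.0]],
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[2.0], [3.0]],
        gates=[1.0, 0.5],
    )
    print(outs)         # [[2.0], [3.0]]
    print(final_state)  # [[1.0], [3.0]]
```

The gate controls how quickly older context fades from the fixed-size state, which is the kind of adaptive memory control the abstract attributes to gating.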

Title: ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making

Authors: Bharadwaj Ravichandran, David Joy, Paul Elliott, Brian Hu, Jadie Adams, Christopher Funk, Emily Veenhuis, Anthony Hoogs, Arslan Basharat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09037
Pdf URL: https://arxiv.org/pdf/2507.09037
Copy Paste: [[2507.09037]] ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making(https://arxiv.org/abs/2507.09037)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable, responsible, and personalized LLM-based decision-makers.

Title: OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique

Authors: Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09075
Pdf URL: https://arxiv.org/pdf/2507.09075
Copy Paste: [[2507.09075]] OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique(https://arxiv.org/abs/2507.09075)
Keywords: language model, llm
Abstract: Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.

Title: Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation

Authors: Jialong Mai, Xiaofen Xing, Yawei Li, Zhipeng Li, Jingyuan Xing, Xiangmin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09076
Pdf URL: https://arxiv.org/pdf/2507.09076
Copy Paste: [[2507.09076]] Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation(https://arxiv.org/abs/2507.09076)
Keywords: language model, llm
Abstract: Recent research has focused on applying speech large language model (SLLM) to improve speech emotion recognition (SER). However, the inherently high frame rate in speech modality severely limits the signal processing and understanding capabilities of SLLM. For example, a SLLM with a 4K context window can only process 80 seconds of audio at 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLM overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLM. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively "memorize" the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLM when processing long audio sequences, achieving state-of-the-art performance.
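The 80-second figure in the abstract follows from simple arithmetic, sketched below. The sketch assumes one context token per feature frame, reads "4K" as 4,000 tokens, and ignores any tokens spent on prompts or text; the function name `max_audio_seconds` is hypothetical.

```python
def max_audio_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Seconds of audio that fit in the context window.

    ASSUMPTION: each audio feature frame consumes exactly one token
    and the whole window is available for audio.
    """
    return context_tokens / frames_per_second


if __name__ == "__main__":
    # 4,000 tokens at a 50 Hz feature rate, as in the abstract's example.
    print(max_audio_seconds(4000, 50))  # 80.0
```

Because the budget scales linearly with the window size, doubling the context only doubles the audio duration, which is why the abstract turns to a parameter-based memory instead of a longer window.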

Title: CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

Authors: Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09104
Pdf URL: https://arxiv.org/pdf/2507.09104
Copy Paste: [[2507.09104]] CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards(https://arxiv.org/abs/2507.09104)
Keywords: language model, llm
Abstract: Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.

Title: OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering

Authors: Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09155
Pdf URL: https://arxiv.org/pdf/2507.09155
Copy Paste: [[2507.09155]] OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering(https://arxiv.org/abs/2507.09155)
Keywords: language model, gpt, llm, prompt
Abstract: This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.

Title: RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking

Authors: Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li, Edith C.H. Ngai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09174
Pdf URL: https://arxiv.org/pdf/2507.09174
Copy Paste: [[2507.09174]] RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking(https://arxiv.org/abs/2507.09174)
Keywords: language model, prompt, agent
Abstract: The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at this https URL.

Title: Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models

Authors: Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09185
Pdf URL: https://arxiv.org/pdf/2507.09185
Copy Paste: [[2507.09185]] Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models(https://arxiv.org/abs/2507.09185)
Keywords: language model, llm
Abstract: Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
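For readers unfamiliar with Integrated Gradients, the sketch below applies the standard definition to a toy two-input function rather than to neurons in an LLM. The function `f`, its analytic gradient `grad_f`, the zero baseline, and the midpoint Riemann sum are illustrative assumptions; the paper's neuron-level scoring and pruning procedure is not reproduced here.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    IG_i = (x_i - b_i) * integral over alpha in [0, 1] of
           d f / d x_i evaluated at b + alpha * (x - b).
    """
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(p):
    """Toy model: f(x0, x1) = x0^2 + 3 * x1 (hypothetical example)."""
    return p[0] ** 2 + 3 * p[1]


def grad_f(p):
    """Analytic gradient of the toy model."""
    return [2 * p[0], 3.0]


if __name__ == "__main__":
    x, baseline = [2.0, 1.0], [0.0, 0.0]
    attributions = integrated_gradients(grad_f, x, baseline)
    print(attributions)
    # Completeness: the attributions sum to f(x) - f(baseline).
    print(sum(attributions), f(x) - f(baseline))
```

The completeness property checked at the end is what makes the attributions usable as a ranking signal: each input's score is its share of the change in output relative to the baseline.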

Title: Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Authors: Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Sangjee Dondrub, Caizang Tai, Haixing Zhao, Huaque Cairang, Suonan Cairang, Rou Te, Lengben Zhaxi, Gazang Zhaxi, Zhonglin Ye, Yuhui Zheng, Chunyan Peng, Secha Jia, Pema Tashi, Cizhen Jiacuo, Pema Dorjee, Hongkai Liu, Pema Yanggon, Tsehang Dorjee, Jiaxin Han, Qiongying Hu, Jilin Man, Huanke You, Yuqi Ren, Duo La, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09205
Pdf URL: https://arxiv.org/pdf/2507.09205
Copy Paste: [[2507.09205]] Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training(https://arxiv.org/abs/2507.09205)
Keywords: language model
Abstract: Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

Title: Psychology-Driven Enhancement of Humour Translation

Authors: Yuchen Su, Yonghua Zhu, Yang Chen, Diana Benavides-Prado, Michael Witbrock
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09259
Pdf URL: https://arxiv.org/pdf/2507.09259
Copy Paste: [[2507.09259]] Psychology-Driven Enhancement of Humour Translation(https://arxiv.org/abs/2507.09259)
Keywords: language model, llm, chain-of-thought
Abstract: Humour translation plays a vital role as a bridge between different cultures, fostering understanding and communication. Although most existing Large Language Models (LLMs) are capable of general translation tasks, these models still struggle with humour translation, as reflected especially in linguistic interference and a lack of humour in the translated text. In this paper, we propose a psychology-inspired Humour Decomposition Mechanism (HDM) that utilises Chain-of-Thought (CoT) to imitate the ability of the human thought process, stimulating LLMs to optimise the readability of translated humorous texts. Moreover, we integrate humour theory in HDM to further enhance the humorous elements in the translated text. Our automatic evaluation experiments on open-source humour datasets demonstrate that our method significantly improves the quality of humour translation, yielding average gains of 7.75% in humour, 2.81% in fluency, and 6.13% in coherence of the generated text.

Title: DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Authors: Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09424
Pdf URL: https://arxiv.org/pdf/2507.09424
Copy Paste: [[2507.09424]] DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models(https://arxiv.org/abs/2507.09424)
Keywords: language model, llm
Abstract: Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, and data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks -- training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.

Title: Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models

Authors: Mingchuan Yang, Ziyuan Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09470
Pdf URL: https://arxiv.org/pdf/2507.09470
Copy Paste: [[2507.09470]] Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models(https://arxiv.org/abs/2507.09470)
Keywords: language model
Abstract: This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p < .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model's strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.
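As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the percentages quoted in the abstract. The function name `f1_from_pr` is illustrative, and the check assumes the reported figures are plain single-class values rather than macro- or weighted-averaged ones, which the abstract does not state.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages quoted in the abstract, converted to fractions.
baseline_f1 = f1_from_pr(0.680, 0.750)
optimized_f1 = f1_from_pr(0.841, 0.863)

print(f"baseline F1:  {baseline_f1:.3f}")
print(f"optimized F1: {optimized_f1:.3f}")
```

Under that assumption the optimized precision and recall reproduce the reported 85.2% F1, while the baseline pair gives roughly 71.3% rather than the reported 71.0%; the small gap may come from rounding or from a different averaging scheme.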

Title: Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Authors: Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09477
Pdf URL: https://arxiv.org/pdf/2507.09477
Copy Paste: [[2507.09477]] Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs(https://arxiv.org/abs/2507.09477)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how different types of retrieved knowledge supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at this https URL.

Title: ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning

Authors: Changli Wang, Rui Wu, Fang Yin
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.09482
Pdf URL: https://arxiv.org/pdf/2507.09482
Copy Paste: [[2507.09482]] ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning(https://arxiv.org/abs/2507.09482)
Keywords: language model
Abstract: Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at this https URL.

Title: Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis

Authors: Junjie Liu, Yuanhe Tian, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09485
Pdf URL: https://arxiv.org/pdf/2507.09485
Copy Paste: [[2507.09485]] Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2507.09485)
Keywords: language model, llm, prompt
Abstract: Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios to identify the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges in learning the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented data. In this paper, we propose an LLM-based ABSA approach with training data augmentation. Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training set with larger size and balanced label distributions to better train an ABSA model. Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation. Experimental results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.

Title: GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities

Authors: Siyi Wu, Zeyu Wang, Xinyuan Song, Zhengpeng Zhou, Lifan Sun, Tianyu Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09497
Pdf URL: https://arxiv.org/pdf/2507.09497
Copy Paste: [[2507.09497]] GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities(https://arxiv.org/abs/2507.09497)
Keywords: agent
Abstract: Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi-faceted tasks with high levels of autonomy and adaptability. However, traditional single-purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present GoalfyMax, a protocol-driven framework for end-to-end multi-agent collaboration. GoalfyMax introduces a standardized Agent-to-Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol-compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. Moreover, our system integrates advanced features including multi-turn contextual dialogue, long-short term memory modules, and dynamic safety validation, supporting robust, real-time strategy adaptation. Empirical results on complex task orchestration benchmarks and a case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future-ready foundation for multi-agent intelligent systems.

Title: Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models

Authors: Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, Arman Cohan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09506
Pdf URL: https://arxiv.org/pdf/2507.09506
Copy Paste: [[2507.09506]] Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models(https://arxiv.org/abs/2507.09506)
Keywords: language model, gpt
Abstract: Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing -- a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data -- remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found at https://github.com/wujunjie1998/Ref-Long.

Title: How Important is `Perfect' English for Machine Translation Prompts?

Authors: Patrícia Schmidtová, Niyati Bafna, Seth Aycock, Gianluca Vico, Wiktor Kamzela, Katharina Hämmerl, Vilém Zouhar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09509
Pdf URL: https://arxiv.org/pdf/2507.09509
Copy Paste: [[2507.09509]] How Important is `Perfect' English for Machine Translation Prompts?(https://arxiv.org/abs/2507.09509)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs' performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.

Title: An Exploration of Knowledge Editing for Arabic

Authors: Basel Mousi, Nadir Durrani, Fahim Dalvi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09629
Pdf URL: https://arxiv.org/pdf/2507.09629
Copy Paste: [[2507.09629]] An Exploration of Knowledge Editing for Arabic(https://arxiv.org/abs/2507.09629)
Keywords: chat
Abstract: While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training data for LTE to support future research.

Title: Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?

Authors: Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn, Attapol Thamrongrattanarit-Rutherford, Ekapol Chuangsuwanich, Sarana Nutanong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09638
Pdf URL: https://arxiv.org/pdf/2507.09638
Copy Paste: [[2507.09638]] Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?(https://arxiv.org/abs/2507.09638)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains over the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
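The two ingredients named in this abstract, a semantic-similarity reward and group-relative advantages, can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for real BGE-M3 embeddings, the function names are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy embeddings (assumed): one reference answer and three sampled responses.
reference = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

rewards = [cosine_similarity(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Because each advantage is measured relative to the other responses sampled for the same prompt, no separate learned value model is needed, and an embedding-based reward of this kind avoids calling a large judge model for every sample.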

Title: MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs

Authors: Shulin Huang, Linyi Yang, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.09701
Pdf URL: https://arxiv.org/pdf/2507.09701
Copy Paste: [[2507.09701]] MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs(https://arxiv.org/abs/2507.09701)
Keywords: language model, llm
Abstract: Large language models exhibit cultural biases and limited cross-cultural understanding capabilities, particularly when serving diverse global user populations. We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction and enables causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Our comprehensive evaluation spans 13 cultures and 13 languages, systematically assessing both cultural awareness and cultural bias across different linguistic scenarios. The framework provides 39,897 cultural awareness instances and 17,940 cultural bias instances. Experimental results reveal performance disparities across different linguistic scenarios, demonstrating that optimal cultural performance is linked not only to training data distribution but also to language-culture alignment. The evaluation results also expose the fairness issue, where approaches appearing successful in the English scenario create substantial disadvantages. MCEval represents the first comprehensive multilingual cultural evaluation framework that provides deeper insights into LLMs' cultural understanding.

Title: Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces

Authors: Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09709
Pdf URL: https://arxiv.org/pdf/2507.09709
Copy Paste: [[2507.09709]] Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces(https://arxiv.org/abs/2507.09709)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors, even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.

Title: Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding

Authors: Qi Feng, Yihong Liu, Hinrich Schütze
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09758
Pdf URL: https://arxiv.org/pdf/2507.09758
Copy Paste: [[2507.09758]] Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding(https://arxiv.org/abs/2507.09758)
Keywords: language model
Abstract: Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics -- such as text length -- which may not accurately reflect the model's own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
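The three orderings mentioned in this abstract can be sketched as follows, assuming difficulty scores have already been produced by a pre-trained model. The function and variable names are hypothetical, the scores are made up for illustration, and the "mixed" strategy is implemented here as alternating between the easy and hard ends of the ranking, which is only one possible reading; the abstract does not define it.

```python
def order_examples(examples, difficulty, strategy):
    """Order fine-tuning examples by a precomputed difficulty score.

    examples:   list of example identifiers
    difficulty: dict mapping each example to a score (higher = harder)
    strategy:   "easy_to_hard", "hard_to_easy", or "mixed"
    """
    ranked = sorted(examples, key=lambda ex: difficulty[ex])
    if strategy == "easy_to_hard":
        return ranked
    if strategy == "hard_to_easy":
        return ranked[::-1]
    if strategy == "mixed":
        # Assumption: alternate between the easiest and hardest remaining items.
        ordered = []
        lo, hi = 0, len(ranked) - 1
        while lo <= hi:
            ordered.append(ranked[lo])
            if lo != hi:
                ordered.append(ranked[hi])
            lo += 1
            hi -= 1
        return ordered
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative scores only; in the paper they come from the PLM itself.
examples = ["a", "b", "c", "d"]
difficulty = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}

for strategy in ("easy_to_hard", "hard_to_easy", "mixed"):
    print(strategy, order_examples(examples, difficulty, strategy))
```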

Title: Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Authors: Qinyuan Ye, Robin Jia, Xiang Ren
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.09875
Pdf URL: https://arxiv.org/pdf/2507.09875
Copy Paste: [[2507.09875]] Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition(https://arxiv.org/abs/2507.09875)
Keywords: language model
Abstract: Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model's generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
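The off-by-one addition task described in this abstract is easy to reproduce as an in-context prompt. The sketch below builds the few-shot demonstrations from the abstract's example (1+1=3, 2+2=5, 3+3=?); the helper names are hypothetical, and the exact prompt format used in the paper may differ.

```python
def off_by_one_add(a: int, b: int) -> int:
    """Standard addition followed by the unexpected +1 step."""
    return a + b + 1


def build_prompt(demos, query):
    """Render in-context demonstrations followed by an unanswered query."""
    lines = [f"{a}+{b}={off_by_one_add(a, b)}" for a, b in demos]
    qa, qb = query
    lines.append(f"{qa}+{qb}=?")
    return "\n".join(lines)


prompt = build_prompt(demos=[(1, 1), (2, 2)], query=(3, 3))
print(prompt)
print("expected completion:", off_by_one_add(3, 3))
```

A model that has induced the +1 function from the demonstrations should complete the query with 7 rather than 6.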

Title: Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking

Authors: Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09935
Pdf URL: https://arxiv.org/pdf/2507.09935
Copy Paste: [[2507.09935]] Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking(https://arxiv.org/abs/2507.09935)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
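Retrieval over both segment-level and cluster-level vectors can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-dimensional vectors stand in for real embeddings, a cluster vector is taken to be the mean of its segment vectors, and the two similarity scores are combined with a weighted sum controlled by a hypothetical `alpha` parameter. The abstract does not specify how the paper combines the two levels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve(query_vec, segments, clusters, alpha=0.5, top_k=2):
    """Rank segments by a blend of segment-level and cluster-level similarity.

    segments: dict of segment id -> vector
    clusters: dict of cluster id -> list of segment ids
    alpha:    weight on the segment-level score (assumed combination rule)
    """
    cluster_vecs = {
        cid: mean_vector([segments[sid] for sid in ids])
        for cid, ids in clusters.items()
    }
    segment_cluster = {sid: cid for cid, ids in clusters.items() for sid in ids}
    scored = []
    for sid, vec in segments.items():
        segment_score = cosine(query_vec, vec)
        cluster_score = cosine(query_vec, cluster_vecs[segment_cluster[sid]])
        scored.append((alpha * segment_score + (1 - alpha) * cluster_score, sid))
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:top_k]]


# Toy data (assumed): two segments on one topic, one on another.
segments = {"s1": [1.0, 0.0], "s2": [0.9, 0.1], "s3": [0.0, 1.0]}
clusters = {"c1": ["s1", "s2"], "c2": ["s3"]}

top_segments = retrieve([1.0, 0.0], segments, clusters)
print(top_segments)
```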

Title: Tiny Reward Models

Authors: Sarah Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.09973
Pdf URL: https://arxiv.org/pdf/2507.09973
Copy Paste: [[2507.09973]] Tiny Reward Models(https://arxiv.org/abs/2507.09973)
Keywords: language model, prompt
Abstract: Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.

Title: Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media

Authors: Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Hong Va Leong, Ling Feng, Nancy Xiaonan Yu, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10008
Pdf URL: https://arxiv.org/pdf/2507.10008
Copy Paste: [[2507.10008]] Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media(https://arxiv.org/abs/2507.10008)
Keywords: language model
Abstract: Suicide is a critical global health issue that requires urgent attention. Even though prior work has revealed valuable insights into detecting current suicide risk on social media, little attention has been paid to developing models that can predict subsequent suicide risk over time, limiting their ability to capture rapid fluctuations in individuals' mental state transitions. In addition, existing work ignores protective factors that play a crucial role in suicide risk prediction, focusing predominantly on risk factors alone. Protective factors such as social support and coping strategies can mitigate suicide risk by moderating the impact of risk factors. Therefore, this study proposes a novel framework for predicting subsequent suicide risk by jointly learning the dynamic influence of both risk factors and protective factors on users' suicide risk transitions. We propose a novel Protective Factor-Aware Dataset, which is built from 12 years of Reddit posts along with comprehensive annotations of suicide risk and both risk and protective factors. We also introduce a Dynamic Factors Influence Learning approach that captures the varying impact of risk and protective factors on suicide risk transitions, recognizing that suicide risk fluctuates over time according to established psychological theories. Our thorough experiments demonstrate that the proposed model significantly outperforms state-of-the-art models and large language models across three datasets. In addition, the proposed Dynamic Factors Influence Learning provides interpretable weights, helping clinicians better understand suicidal patterns and enabling more targeted intervention strategies.

Title: GeLaCo: An Evolutionary Approach to Layer Compression

Authors: David Ponce, Thierry Etchegoyhen, Javier Del Ser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10059
Pdf URL: https://arxiv.org/pdf/2507.10059
Copy Paste: [[2507.10059]] GeLaCo: An Evolutionary Approach to Layer Compression(https://arxiv.org/abs/2507.10059)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable performance across a large number of tasks, but face critical deployment and usage barriers due to substantial computational requirements. Model compression methods, which aim to reduce model size while preserving its capacity, are an important means to mitigate these issues. Promising approaches along these lines, such as structured pruning, typically require costly empirical search for optimal variants and may run the risk of ignoring better solutions. In this work we introduce GeLaCo, an evolutionary approach to LLM compression via layer collapse. Our approach supports an efficient exploration of the compression solution space via population-based search and a module-wise similarity fitness function capturing attention, feed-forward, and hidden state representations. GeLaCo also supports both single and multi-objective evolutionary compression search, establishing the first Pareto frontier along compression and quality axes. We evaluate GeLaCo solutions via both perplexity-based and generative evaluations over foundational and instruction-tuned models, outperforming state-of-the-art alternatives.

Title: Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Authors: Simon Münker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10073
Pdf URL: https://arxiv.org/pdf/2507.10073
Copy Paste: [[2507.10073]] Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires(https://arxiv.org/abs/2507.10073)
Keywords: language model, llm, prompt, agent
Abstract: Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs' origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn't consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.

Title: Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning

Authors: Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10085
Pdf URL: https://arxiv.org/pdf/2507.10085
Copy Paste: [[2507.10085]] Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning(https://arxiv.org/abs/2507.10085)
Keywords: chain-of-thought
Abstract: Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.

Title: Fusing Large Language Models with Temporal Transformers for Time Series Forecasting

Authors: Chen Su, Yuanhe Tian, Qinyu Liu, Jun Zhang, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10098
Pdf URL: https://arxiv.org/pdf/2507.10098
Copy Paste: [[2507.10098]] Fusing Large Language Models with Temporal Transformers for Time Series Forecasting(https://arxiv.org/abs/2507.10098)
Keywords: language model, llm, prompt
Abstract: Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.

Title: Task-Based Flexible Feature Distillation for LLMs

Authors: Khouloud Saadi, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10155
Pdf URL: https://arxiv.org/pdf/2507.10155
Copy Paste: [[2507.10155]] Task-Based Flexible Feature Distillation for LLMs(https://arxiv.org/abs/2507.10155)
Keywords: language model, llm
Abstract: Knowledge Distillation (KD) in general and feature distillation in particular are promising techniques for reducing the high computational demand of large language models (LLMs). However, traditional feature KD methods typically assume that the teacher and the student share the same hidden size, limiting the flexibility of the student's architecture. A common solution to this problem involves training a linear projector to align their feature spaces, but this introduces additional parameters that must be learned from scratch and often degrades performance on downstream tasks, especially in generative settings. To address this issue, in this work, we propose a novel task-based feature distillation method that enables knowledge transfer between teacher and student models with different hidden layer dimensions, without introducing any new parameters. Leveraging the insight that only a subset of LLM components contribute significantly to a specific downstream task, our approach identifies the most task-relevant hidden units in the teacher and directly distills their activations to the student. Our method is flexible and easily integrates with other distillation frameworks. Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline.

Title: Abusive text transformation using LLMs

Authors: Rohitash Chandra, Jiyong Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10177
Pdf URL: https://arxiv.org/pdf/2507.10177
Copy Paste: [[2507.10177]] Abusive text transformation using LLMs(https://arxiv.org/abs/2507.10177)
Keywords: language model, gpt, llm
Abstract: Although Large Language Models (LLMs) have demonstrated significant advancements in natural language processing tasks, their effectiveness in the classification and transformation of abusive text into non-abusive versions remains an area for exploration. In this study, we aim to use LLMs to transform abusive text (tweets and reviews) featuring hate speech and swear words into non-abusive text, while retaining the intent of the text. We evaluate the performance of state-of-the-art LLMs, such as Gemini, GPT-4o, DeepSeek and Groq, on their ability to identify abusive text. We then use them to transform the text and obtain a version that is free of abusive and inappropriate content but maintains a similar level of sentiment and semantics, i.e. the transformed text needs to maintain its message. Afterwards, we evaluate the raw and transformed datasets with sentiment analysis and semantic analysis. Our results show Groq provides vastly different results when compared with other LLMs. We have identified similarities between GPT-4o and DeepSeek-V3.

Title: Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects

Authors: Renad Al-Monef, Hassan Alhuzali, Nora Alturayeif, Ashwag Alasmari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10216
Pdf URL: https://arxiv.org/pdf/2507.10216
Copy Paste: [[2507.10216]] Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects(https://arxiv.org/abs/2507.10216)
Keywords: language model, llm
Abstract: As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces Absher, a comprehensive benchmark specifically designed to assess LLM performance across major Saudi dialects. Absher comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models. We also provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLM performance in real-world Arabic applications.

Title: Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation

Authors: Muzhaffar Hazman, Minh-Khoi Pham, Shweta Soundararajan, Goncalo Mordido, Leonardo Custode, David Lynch, Giorgio Cruciata, Yucheng Shi, Hongmeng Song, Wang Chao, Pan Yue, Aleksandar Milenovic, Alexandros Agapitos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10326
Pdf URL: https://arxiv.org/pdf/2507.10326
Copy Paste: [[2507.10326]] Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation(https://arxiv.org/abs/2507.10326)
Keywords: language model, llm, prompt
Abstract: Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.

Title: Using AI to replicate human experimental results: a motion study

Authors: Rosa Illan Castillo, Javier Valenzuela
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10342
Pdf URL: https://arxiv.org/pdf/2507.10342
Copy Paste: [[2507.10342]] Using AI to replicate human experimental results: a motion study(https://arxiv.org/abs/2507.10342)
Keywords: language model, gpt, llm
Abstract: This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research, focusing on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs. While LLMs like GPT-4 have shown promise across a range of tasks, their ability to replicate nuanced human judgements remains under scrutiny. We conducted four psycholinguistic studies (on emergent meanings, valence shifts, verb choice in emotional contexts, and sentence-emoji associations) first with human participants and then replicated the same tasks using an LLM. Results across all studies show a striking convergence between human and AI responses, with statistical analyses (e.g., Spearman's rho = .73-.96) indicating strong correlations in both rating patterns and categorical choices. While minor divergences were observed in some cases, these did not alter the overall interpretative outcomes. These findings offer compelling evidence that LLMs can augment traditional human-based experimentation, enabling broader-scale studies without compromising interpretative validity. This convergence not only strengthens the empirical foundation of prior human-based findings but also opens possibilities for hypothesis generation and data expansion through AI. Ultimately, our study supports the use of LLMs as credible and informative collaborators in linguistic inquiry.
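
Note: the agreement statistic quoted in the abstract above, Spearman's rho, is the Pearson correlation of the ranks of two rating lists. The following is a minimal sketch of that computation, not the authors' analysis code; the two rating lists are hypothetical values invented for illustration, and the function names are assumptions.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical mean ratings for five stimuli (not data from the paper).
human_ratings = [4.2, 3.1, 5.0, 2.4, 3.8]
llm_ratings = [4.0, 3.6, 4.7, 2.9, 3.5]
rho = spearman_rho(human_ratings, llm_ratings)
```

Because only ranks enter the statistic, rho stays high when the LLM orders the stimuli as humans do, even if its absolute ratings are shifted or compressed.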

Title: From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Authors: Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10435
Pdf URL: https://arxiv.org/pdf/2507.10435
Copy Paste: [[2507.10435]] From Sequence to Structure: Uncovering Substructure Reasoning in Transformers(https://arxiv.org/abs/2507.10435)
Keywords: language model, llm
Abstract: Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.

Title: Referential ambiguity and clarification requests: comparing human and LLM behaviour

Authors: Chris Madge, Matthew Purver, Massimo Poesio
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10445
Pdf URL: https://arxiv.org/pdf/2507.10445
Copy Paste: [[2507.10445]] Referential ambiguity and clarification requests: comparing human and LLM behaviour(https://arxiv.org/abs/2507.10445)
Keywords: llm
Abstract: In this work we examine LLMs' ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus -- one for reference and ambiguity in reference, and one for SDRT including clarifications -- into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs' ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.

Title: From BERT to Qwen: Hate Detection across architectures

Authors: Ariadna Mon, Saúl Fenollosa, Jon Lecumberri
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.10468
Pdf URL: https://arxiv.org/pdf/2507.10468
Copy Paste: [[2507.10468]] From BERT to Qwen: Hate Detection across architectures(https://arxiv.org/abs/2507.10468)
Keywords: llm
Abstract: Online platforms struggle to curb hate speech without over-censoring legitimate discourse. Early bidirectional transformer encoders made big strides, but the arrival of ultra-large autoregressive LLMs promises deeper context-awareness. Whether this extra scale actually improves practical hate-speech detection on real-world text remains unverified. Our study puts this question to the test by benchmarking both model families, classic encoders and next-generation LLMs, on curated corpora of online interactions for hate-speech detection (Hate or No Hate).

Title: MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking

Authors: Mohamed T. Younes, Omar Walid, Mai Hassan, Ali Hamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.10472
Pdf URL: https://arxiv.org/pdf/2507.10472
Copy Paste: [[2507.10472]] MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking(https://arxiv.org/abs/2507.10472)
Keywords: language model, llm
Abstract: This paper introduces an innovative Applicant Tracking System (ATS) enhanced by a novel Robotic Process Automation (RPA) framework, hereafter referred to as MLAR. Traditional recruitment processes often encounter bottlenecks in resume screening and candidate shortlisting due to time and resource constraints. MLAR addresses these challenges by employing Large Language Models (LLMs) in three distinct layers: extracting key characteristics from job postings in the first layer, parsing applicant resumes to identify education, experience, and skills in the second layer, and similarity matching in the third layer. These features are then matched through advanced semantic algorithms to identify the best candidates efficiently. Our approach integrates seamlessly into existing RPA pipelines, automating resume parsing, job matching, and candidate notifications. Extensive performance benchmarking shows that MLAR outperforms the leading RPA platforms, including UiPath and Automation Anywhere, in high-volume resume-processing tasks. When processing 2,400 resumes, MLAR achieved an average processing time of 5.4 seconds per resume, reducing processing time by approximately 16.9% compared to Automation Anywhere and 17.1% compared to UiPath. These results highlight the potential of MLAR to transform recruitment workflows by providing an efficient, accurate, and scalable solution tailored to modern hiring needs.
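
Note: the timing figures in the abstract above can be unpacked with a little arithmetic. The sketch below assumes that each quoted reduction is measured relative to the other platform's own per-resume time, which the abstract does not state explicitly; the baseline times it derives are therefore implied estimates, not numbers reported by the authors.

```python
# Figures quoted in the abstract.
MLAR_SECONDS_PER_RESUME = 5.4
RESUME_COUNT = 2400
REDUCTION_VS_PLATFORM = {"Automation Anywhere": 0.169, "UiPath": 0.171}


def implied_baseline_seconds(mlar_seconds, reduction):
    """Baseline time t such that mlar_seconds = t * (1 - reduction)."""
    return mlar_seconds / (1.0 - reduction)


# Total wall-clock time for the whole batch at MLAR's average rate.
total_mlar_hours = MLAR_SECONDS_PER_RESUME * RESUME_COUNT / 3600

# Per-resume times implied for each platform (assumption: relative reduction).
implied_baselines = {
    name: implied_baseline_seconds(MLAR_SECONDS_PER_RESUME, reduction)
    for name, reduction in REDUCTION_VS_PLATFORM.items()
}

for name, seconds in implied_baselines.items():
    print(f"{name}: about {seconds:.2f} s per resume")
print(f"MLAR batch time: about {total_mlar_hours:.1f} hours")
```

Under that assumption both platforms come out at roughly 6.5 seconds per resume, so the reported gain is about 1.1 seconds per resume.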

Title: Can You Detect the Difference?

Authors: İsmail Tarım, Aytuğ Onan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.10475
Pdf URL: https://arxiv.org/pdf/2507.10475
Copy Paste: [[2507.10475]] Can You Detect the Difference?(https://arxiv.org/abs/2507.10475)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2,000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.

Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Authors: Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.10524
Pdf URL: https://arxiv.org/pdf/2507.10524
Copy Paste: [[2507.10524]] Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation(https://arxiv.org/abs/2507.10524)
Keywords: language model
Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
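The core control flow, one shared block reused across recursion steps with each token leaving after its assigned depth, can be sketched in a few lines. This is a toy illustration under strong assumptions: token states are plain floats, the shared block is a stand-in that mixes each active state with the mean of the active states (a crude proxy for attention restricted to active tokens), and the per-token depths are passed in directly rather than produced by a learned router. All names are hypothetical.

```python
def shared_block(active_states):
    """Stand-in for the shared layer stack: mix each state with the active mean."""
    context = sum(active_states) / len(active_states)
    return [0.5 * s + 0.5 * context for s in active_states]


def mixture_of_recursions(states, depths, max_depth):
    """Apply the same block repeatedly; token i stays active for depths[i] steps.

    Returns the final states and the number of token updates performed,
    which is the quantity adaptive depth is meant to reduce.
    """
    states = list(states)
    updates = 0
    for step in range(1, max_depth + 1):
        # Only tokens whose assigned depth reaches this step are processed.
        active = [i for i, d in enumerate(depths) if d >= step]
        if not active:
            break
        new_states = shared_block([states[i] for i in active])
        for i, value in zip(active, new_states):
            states[i] = value
        updates += len(active)
    return states, updates


if __name__ == "__main__":
    final, used = mixture_of_recursions([1.0, 3.0, 5.0], depths=[1, 2, 3], max_depth=3)
    print(final, used)  # 6 token updates instead of 9 at full depth
```

Because the same `shared_block` is called at every step, parameters are shared across recursions, while the shrinking `active` list is what limits computation to tokens that still need it.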

Title: CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Authors: Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2507.10535
Pdf URL: https://arxiv.org/pdf/2507.10535
Copy Paste: [[2507.10535]] CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks(https://arxiv.org/abs/2507.10535)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have significantly advanced the state of the art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that pairwise comparison outperforms scalar pointwise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
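The order sensitivity described above can be measured by querying a pairwise judge twice per item, once in each presentation order, and counting how often the verdict names the same response. The sketch below is a minimal illustration and not the benchmark's evaluation code: the judge interface (a callable returning "first" or "second"), the toy judges, and all function names are assumptions. A real setup would wrap an LLM call behind the same interface.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (task, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, task: str, resp_a: str, resp_b: str) -> Tuple[str, str]:
    """Return the response ("A" or "B") preferred in each presentation order."""
    forward = judge(task, resp_a, resp_b)   # A is shown first
    backward = judge(task, resp_b, resp_a)  # B is shown first
    pick_forward = "A" if forward == "first" else "B"
    pick_backward = "B" if backward == "first" else "A"
    return pick_forward, pick_backward


def order_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of items where both orders select the same response."""
    if not items:
        return 0.0
    agree = 0
    for task, a, b in items:
        forward, backward = judge_both_orders(judge, task, a, b)
        agree += forward == backward
    return agree / len(items)


def always_first(task: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(task: str, first: str, second: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    data = [("sum a list", "return sum(xs)", "total = 0"),
            ("reverse a string", "s[::-1]", "''.join(reversed(s))")]
    print(order_consistency(always_first, data))
    print(order_consistency(prefers_longer, data))
```

A purely position-biased judge scores 0.0 on this measure and a position-independent judge scores 1.0, so the value gives a quick read on how much the verdicts depend on presentation order rather than content.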