2025-08-15

Title: A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain

Authors: Hugo Massaroli, Leonardo Iara, Emmanuel Iarussi, Viviana Siless
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.09993
Pdf URL: https://arxiv.org/pdf/2508.09993
Copy Paste: [[2508.09993]] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain(https://arxiv.org/abs/2508.09993)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
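The two fairness metrics named in this abstract can be illustrated with a minimal sketch. It assumes binary predictions and a binary protected attribute; the function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def _rate(values: Sequence[int]) -> float:
    """Fraction of 1s in a non-empty 0/1 sequence."""
    if not values:
        raise ValueError("group has no matching examples")
    return sum(values) / len(values)


def statistical_parity_difference(y_pred: Sequence[int], group: Sequence[int]) -> float:
    """P(y_hat = 1 | group = 1) - P(y_hat = 1 | group = 0)."""
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    return _rate(g1) - _rate(g0)


def equal_opportunity_difference(
    y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[int]
) -> float:
    """True-positive-rate gap between groups, using only examples with y_true == 1."""
    g1 = [p for t, p, g in zip(y_true, y_pred, group) if t == 1 and g == 1]
    g0 = [p for t, p, g in zip(y_true, y_pred, group) if t == 1 and g == 0]
    return _rate(g1) - _rate(g0)


# Toy data (hypothetical): 1 = predicted or actual positive outcome.
y_true = [1, 1, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"statistical parity difference: {spd:+.2f}")
print(f"equal opportunity difference:  {eod:+.2f}")
```

A value of zero on either metric means the two groups are treated alike under that criterion; the sign shows which group is favored.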

Title: Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling

Authors: Johannes Schneider, Béatrice S. Hasler, Michaela Varrone, Fabian Hoya, Thomas Schroffenegger, Dana-Kristin Mah, Karl Peböck
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.09997
Pdf URL: https://arxiv.org/pdf/2508.09997
Copy Paste: [[2508.09997]] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling(https://arxiv.org/abs/2508.09997)
Keywords: gpt, llm, prompt, chat
Abstract: We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects, employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization, done separately for each dimension, includes exemplary prompts and provides both a high-level overview and tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many of the well-established classical and emerging computational methods, i.e., topic modeling, for analysis of large amounts of text underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing to achieve hierarchical topic structures with better human alignment through explicit instructions than prior approaches. Our findings support fellow researchers, teachers and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.

Title: INTIMA: A Benchmark for Human-AI Companionship Behavior

Authors: Lucie-Aimée Kaffee, Giada Pistilli, Yacine Jernite
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.09998
Pdf URL: https://arxiv.org/pdf/2508.09998
Copy Paste: [[2508.09998]] INTIMA: A Benchmark for Human-AI Companionship Behavior(https://arxiv.org/abs/2508.09998)
Keywords: language model, prompt
Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce the Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.

Title: XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs

Authors: Yuzhuo Xiao, Zeyu Han, Yuhan Wang, Huaizu Jiang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.09999
Pdf URL: https://arxiv.org/pdf/2508.09999
Copy Paste: [[2508.09999]] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs(https://arxiv.org/abs/2508.09999)
Keywords: language model, llm
Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.

Title: AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification

Authors: Chenhao Xue, Yuanzhe Jin, Adrian Carrasco-Revilla, Joyraj Chakraborty, Min Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10000
Pdf URL: https://arxiv.org/pdf/2508.10000
Copy Paste: [[2508.10000]] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification(https://arxiv.org/abs/2508.10000)
Keywords: language model, llm
Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.

Title: Semantic Structure in Large Language Model Embeddings

Authors: Austin C. Kozlowski, Callin Dai, Andrei Boutyline
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10003
Pdf URL: https://arxiv.org/pdf/2508.10003
Copy Paste: [[2508.10003]] Semantic Structure in Large Language Model Embeddings(https://arxiv.org/abs/2508.10003)
Keywords: language model, llm
Abstract: Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and that a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
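The geometry described in this abstract (projection onto an antonym-pair direction, and off-target effects scaling with cosine similarity) can be sketched in a few lines. The 3-dimensional vectors below are made-up toy values, not real LLM embeddings, and the helper names are hypothetical. The sketch only shows the linear-algebra identity at the embedding level; it says nothing about downstream model behavior.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> Vector:
    """Scale a vector to length 1."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def semantic_direction(emb: Dict[str, Vector], pos: str, neg: str) -> Vector:
    """Unit vector pointing from the negative antonym to the positive one."""
    return unit([p - n for p, n in zip(emb[pos], emb[neg])])


def project(vec: Vector, direction: Vector) -> float:
    """Scalar projection of vec onto a unit direction."""
    return dot(vec, direction)


def steer(vec: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift vec by alpha along a unit direction."""
    return [x + alpha * d for x, d in zip(vec, direction)]


# Toy embeddings (hypothetical values chosen for illustration only).
EMB = {
    "kind": [1.0, 0.2, 0.0],
    "cruel": [-1.0, 0.1, 0.0],
    "good": [0.8, 0.6, 0.0],
    "bad": [-0.8, -0.4, 0.0],
    "friend": [0.7, 0.3, 0.1],
}

kindness = semantic_direction(EMB, "kind", "cruel")
goodness = semantic_direction(EMB, "good", "bad")
cosine = dot(kindness, goodness)  # both are unit vectors

alpha = 2.0
before = project(EMB["friend"], goodness)
after = project(steer(EMB["friend"], kindness, alpha), goodness)
off_target = after - before  # equals alpha * cosine up to rounding

print(f"cosine(kindness, goodness) = {cosine:.3f}")
print(f"off-target shift on goodness = {off_target:.3f}")
```

Because both directions are unit vectors, moving a token by `alpha` along one direction changes its projection on the other by exactly `alpha` times their cosine similarity, which is why steering one feature also moves any feature aligned with it.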

Title: From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation

Authors: Chengliang Zhou, Mei Wang, Ting Zhang, Qiannan Zhu, Jian Li, Hua Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10005
Pdf URL: https://arxiv.org/pdf/2508.10005
Copy Paste: [[2508.10005]] From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation(https://arxiv.org/abs/2508.10005)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.

Title: Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models

Authors: Y. Lyu, D. Combs, D. Neumann, Y. C. Leong
Subjects: cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2508.10007
Pdf URL: https://arxiv.org/pdf/2508.10007
Copy Paste: [[2508.10007]] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models(https://arxiv.org/abs/2508.10007)
Keywords: language model
Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.

Title: Multidimensional classification of posts for online course discussion forum curation

Authors: Antonio Leandro Martins Candido, Jose Everardo Bessa Maia
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10008
Pdf URL: https://arxiv.org/pdf/2508.10008
Copy Paste: [[2508.10008]] Multidimensional classification of posts for online course discussion forum curation(https://arxiv.org/abs/2508.10008)
Keywords: language model, llm
Abstract: The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrated that the proposed fusion improves the results compared to each classifier individually, and is competitive with the LLM fine-tuning approach.
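The abstract does not state which fusion rule the authors use. As one possible illustration, the sketch below applies the standard product rule, which assumes the two classifiers are conditionally independent given the class: the fused posterior is proportional to p_llm(c) * p_local(c) / prior(c). The function name and the example scores are hypothetical. For a multidimensional labeling scheme, the same rule would be applied to each dimension separately.

```python
from typing import List, Sequence


def bayesian_fusion(
    p_llm: Sequence[float], p_local: Sequence[float], prior: Sequence[float]
) -> List[float]:
    """Fuse two class-posterior vectors under a conditional-independence assumption.

    fused(c) is proportional to p_llm(c) * p_local(c) / prior(c),
    then normalized so the fused scores sum to 1.
    """
    if not (len(p_llm) == len(p_local) == len(prior)):
        raise ValueError("all score vectors must have the same length")
    if any(pr <= 0 for pr in prior):
        raise ValueError("prior probabilities must be positive")
    unnormalized = [a * b / pr for a, b, pr in zip(p_llm, p_local, prior)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("classifiers assign zero joint mass to every class")
    return [u / total for u in unnormalized]


# Hypothetical scores for one forum post over three classes in one dimension.
p_llm = [0.6, 0.3, 0.1]      # generic pre-trained LLM
p_local = [0.5, 0.4, 0.1]    # classifier trained on local course data
prior = [1 / 3, 1 / 3, 1 / 3]

fused = bayesian_fusion(p_llm, p_local, prior)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print([round(p, 4) for p in fused], predicted_class)
```

With a uniform prior the rule reduces to a normalized element-wise product, so a class that both classifiers favor ends up with more mass than either assigned on its own.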

Title: An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs

Authors: Ayana Hussain, Patrick Zhao, Nicholas Vincent
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10010
Pdf URL: https://arxiv.org/pdf/2508.10010
Copy Paste: [[2508.10010]] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs(https://arxiv.org/abs/2508.10010)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation -- inadvertently, or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.

Title: Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan

Authors: Yuta Nagamori, Mikoto Kosai, Yuji Kawai, Haruka Marumo, Misaki Shibuya, Tatsuya Negishi, Masaki Imanishi, Yasumasa Ikeda, Koichiro Tsuchiya, Asuka Sawai, Licht Miyamoto
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10011
Pdf URL: https://arxiv.org/pdf/2508.10011
Copy Paste: [[2508.10011]] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan(https://arxiv.org/abs/2508.10011)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, their performance in nutritional education, especially on the Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.
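The evaluation described here (accuracy against a 60% passing threshold, plus answer stability across repeated attempts) can be sketched as follows. The abstract does not define its consistency measure precisely, so "every attempt correct" is an assumption made for illustration. The question IDs, answers, and function names are hypothetical.

```python
from typing import Dict, List

PASS_THRESHOLD = 0.60  # passing threshold reported in the abstract


def accuracy(responses: Dict[str, List[str]], key: Dict[str, str]) -> float:
    """Fraction of all (question, attempt) pairs answered correctly."""
    total = sum(len(attempts) for attempts in responses.values())
    correct = sum(
        answer == key[qid] for qid, attempts in responses.items() for answer in attempts
    )
    return correct / total


def consistent_correct_rate(responses: Dict[str, List[str]], key: Dict[str, str]) -> float:
    """Fraction of questions answered correctly on every attempt (assumed definition)."""
    stable = sum(
        all(answer == key[qid] for answer in attempts)
        for qid, attempts in responses.items()
    )
    return stable / len(responses)


def passes(acc: float) -> bool:
    return acc >= PASS_THRESHOLD


# Hypothetical data: three questions, three independent sessions each.
KEY = {"q1": "a", "q2": "c", "q3": "b"}
RESPONSES = {
    "q1": ["a", "a", "a"],  # correct every time
    "q2": ["c", "b", "c"],  # correct, but unstable
    "q3": ["d", "d", "d"],  # stable, but wrong
}

acc = accuracy(RESPONSES, KEY)
stable = consistent_correct_rate(RESPONSES, KEY)
print(f"accuracy={acc:.3f} stable_correct={stable:.3f} passes={passes(acc)}")
```

Reporting the two numbers separately matters because a model can clear the threshold on average while still giving different answers to the same question across sessions.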

Title: Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs

Authors: Dehao Tao, Guangjie Liu, Weizheng, Yongfeng Huang, Minghu Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10012
Pdf URL: https://arxiv.org/pdf/2508.10012
Copy Paste: [[2508.10012]] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs(https://arxiv.org/abs/2508.10012)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge's structure while preserving broader semantic context, enabling precise and efficient exploration. Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.

Title: Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Authors: Linqing Chen, Hanmeng Zhong, Wentao Wu, Weilei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10013
Pdf URL: https://arxiv.org/pdf/2508.10013
Copy Paste: [[2508.10013]] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis(https://arxiv.org/abs/2508.10013)
Keywords: language model, llm
Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

Title: PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

Authors: Lingfeng Zhou, Jialing Zhang, Jin Gao, Mohan Jiang, Dequan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10014
Pdf URL: https://arxiv.org/pdf/2508.10014
Copy Paste: [[2508.10014]] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?(https://arxiv.org/abs/2508.10014)
Keywords: llm
Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at this https URL.

Title: RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

Authors: Enzhi Wang, Qicheng Li, Shiwan Zhao, Aobo Kong, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10015
Pdf URL: https://arxiv.org/pdf/2508.10015
Copy Paste: [[2508.10015]] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis(https://arxiv.org/abs/2508.10015)
Keywords: language model, llm, chat
Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4K dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for research on Chinese speech-based LLMs.
摘要：近年来，大型语言模型（LLMS）在多模式处理中取得了显着进步，包括基于端到端语音的语言模型，这些模型可以使自然互动并在以任务为导向的对话（TOD）系统中执行特定任务。但是，现有的TOD数据集主要基于文本，缺乏真正的语音信号，这些信号对于评估基于语音的LLM的稳健性至关重要。此外，现有的语音TOD数据集主要是英语，并且缺乏关键方面，例如语音疏远和说话者的变化。为了解决这些差距，我们介绍了Realtalk-CN，这是第一个中国多型，多域语音 - 文本双模式TOD数据集，其中包括5.4k对话（60k的话语，150小时，150小时），并配对语音文本注释。 RealTalk-CN通过带注释的自发言语探索捕获了各种对话方案，从而确保了语音对话中对现实世界复杂性的全面报道。此外，我们提出了一项新颖的跨模式聊天任务，该任务可真实地模拟现实世界的用户交互，从而可以在语音和文本模式之间进行动态切换。我们的评估涵盖了对语音疏远，对说话者特征的敏感性和跨域性能的鲁棒性。广泛的实验验证了Realtalk-CN的有效性，为中国基于语音的LLMS研究奠定了坚实的基础。

Title: Training-Free Multimodal Large Language Model Orchestration

Authors: Tianyu Xie, Yuhang Wu, Yongdong Luo, Jiayi Ji, Xiawu Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10016
Pdf URL: https://arxiv.org/pdf/2508.10016
Copy Paste: [[2508.10016]] Training-Free Multimodal Large Language Model Orchestration(https://arxiv.org/abs/2508.10016)
Keywords: language model, llm, agent
Abstract: Different Multimodal Large Language Models (MLLMs) cannot be directly integrated into a unified multimodal input-output system. In previous work, training has been considered an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency, and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, improves performance by up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduces latency by 10.3%, and significantly enhances interpretability through explicit orchestration processes.
摘要：不同的多模式大型语言模型（MLLM）不能直接集成到统一的多模式输入系统中。在以前的工作中，由于模态对准，文本到语音效率和其他整合问题的挑战，培训被认为是不可避免的组成部分。在本文中，我们介绍了多模式大语模型编排，这是一种创建交互式多模式AI系统的有效方法，而无需额外的培训。 MLLM编排利用大语言模型的固有推理能力通过明确的工作流来协调专业模型，实现自然的多模式相互作用，同时保持模块化，提高可解释性并显着提高计算效率。我们的编排框架建立在三个关键创新之上：（1）中央控制器LLM，通过精心设计的代理分析用户输入并动态地将任务路由到适当的专用模型；（2）平行的文本对语音架构，可实现与无缝中断处理和自然对话流的真实全双工互动；（3）通过智能信息综合和检索来维持跨模式的跨模式内存集成系统，在某些情况下选择性地避免了不必要的模态调用，以提高响应速度。广泛的评估表明，MLLM编排在没有额外培训的情况下实现了全面的多模式能力，比传统的共同训练的标准基准测试方法提高了高达7.8％的绩效，从而使潜伏期降低了10.3％，并通过明确的协调过程可显着增强解释性。

Title: A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models

Authors: Sridhar Mahadevan
Subjects: cs.CL, cs.AI, math.AT
Abstract URL: https://arxiv.org/abs/2508.10018
Pdf URL: https://arxiv.org/pdf/2508.10018
Copy Paste: [[2508.10018]] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models(https://arxiv.org/abs/2508.10018)
Keywords: language model, llm
Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote", is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of the application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half century.
摘要：自然语言充满了表面上不同的陈述，例如``查尔斯·达尔文（Charles Darwin）写的''和``查尔斯·达尔文（Charles Darwin）是“作者”，其含义相同。在这种情况下，大型语言模型（LLMS）应产生相同的下一步概率，但通常不会。已经探索了经验的解决方法，例如使用句子相似性的K-NN估计值来产生平滑的估计。在本文中，我们更加抽象地解决了这个问题，为LLMS引入了一个分类同质框架。我们介绍了LLM Markov类别，以代表LLM产生的语言中的概率分布，其中句子的概率（例如``Charles Darwin写道''来定义了Markov类别中的箭头的定义。但是，这种方法遇到困难。 LLM Markov类别中捕获``弱等价''的分类同义技术。我们介绍了从较高的代数K理论到模型类别的分类同型应用到LLM的详细概述，这是基于过去半个世纪发展的强大理论结果的基础。

Title: Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning

Authors: Li Wang, Changhao Zhang, Zengqi Xiu, Kai Lu, Xin Yu, Kui Zhang, Wenjun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10019
Pdf URL: https://arxiv.org/pdf/2508.10019
Copy Paste: [[2508.10019]] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning(https://arxiv.org/abs/2508.10019)
Keywords: language model, llm
Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating the decoupling of understanding from reasoning as an effective strategy for strengthening SLMs.
摘要：尽管大语言模型（LLM）的推理能力最近取得了进步，但提高了小语言模型的推理能力（SLM，例如$ \ \ leq $ 1.5B）仍然具有挑战性。一个关键的障碍在于自然语言的复杂性和可变性：本质上等效的问题通常以多种表面形式出现，通常被多余或分散注意力的细节所掩盖。这对SLM施加了双重负担：他们必须首先从复杂的语言输入中提取核心问题，然后根据该理解执行推理。由此产生的巨大而嘈杂的问题空间阻碍了优化，特别是对于容量有限的模型而言。为了解决这个问题，我们提出了一个新的框架，通过将自然语言问题映射到规范性问题空间中，将理解与推理的理解解除了 - 语义简化但表现力的领域。这使SLM能够专注于对标准化输入的推理，而没有语言可变性。在此框架内，我们介绍了Durit（通过迭代训练与推理的理解），这是一种迭代的三步算法：（1）通过强化学习绘制自然语言问题，（2）通过自我介绍来对齐推理轨迹，以及（3）在问题领域中的推理推理政策。在此过程中，映射器和推理器在交替的循环中进行了共同训练。实验表明，Durit基本上改善了SLM在内域和室外数学和逻辑推理任务上的性能。除了提高推理能力外，杜利特还提高了推理的鲁棒性，从而将理解与推理的理解验证为增强SLM的有效策略。

Title: FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models

Authors: Chuan Li, Qianyi Zhao, Fengran Mo, Cen Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10020
Pdf URL: https://arxiv.org/pdf/2508.10020
Copy Paste: [[2508.10020]] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models(https://arxiv.org/abs/2508.10020)
Keywords: language model, llm, chain-of-thought
Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions spanning clinical, operational, and patient-facing contexts demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches for LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models' innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning of LLMs remains substantial. We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.
摘要：有效地增强了联合学习环境中大语言模型（LLM）的推理能力（LLM）仍然具有挑战性，尤其是在平衡性能与严格的计算，沟通和隐私约束时。这一挑战在医疗保健中尤其严重，在此挑战中，决策跨越的临床，操作和面向患者的环境不仅要精确产出，而且可以解释的可追溯原理，以确保安全，问责制和法规合规性。 LLM上的常规联合调谐方法无法解决这一需求：它们主要优化以确保正确性，同时忽略了基本原理质量，而COT功能则取决于模型的先天预培训能力。此外，现有的改善理由的方法通常依赖于从集中模型中蒸馏出隐私的知识蒸馏。此外，在LLMS上传统的联邦通过微调的沟通开销仍然很大。我们通过提出FedCot来解决这一差距，FedCot是一个专门设计的新型框架，旨在增强联邦设置中的推理。 FedCot利用了轻巧的增强机制：本地模型会产生多个推理路径，而紧凑的歧视器动态选择最有希望的歧视者。这种方法提高了推理的准确性和鲁棒性，同时提供了有价值的解释性，这对于医疗应用尤其重要。为了有效地管理客户的异质性，我们在先进的Lora模块堆叠基础上采用了改进的聚合方法，并结合了客户分类器的意识，以实现跨不同客户的无噪声聚合。关于医疗推理任务的全面实验表明，FedCot在严格的资源预算下显着提高客户端的推理性能，同时完全保留数据隐私。

Title: LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Authors: Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Dima Korolev, Omar Zoloev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10021
Pdf URL: https://arxiv.org/pdf/2508.10021
Copy Paste: [[2508.10021]] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients(https://arxiv.org/abs/2508.10021)
Keywords: language model, llm, prompt
Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of complete sequences by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
摘要：从其历史沟通的序列中学习客户的嵌入至关重要。尽管大型语言模型（LLMS）提供了一般世界知识，但它们在长期事件序列上的直接使用在计算上昂贵且不切实际。在本文中，我们提出了拿铁，这是一种对比的学习框架，将原始事件的嵌入与冷冻LLM的语义嵌入一致。行为特征总结为简短的提示，由LLM嵌入，并通过对比度损失用作监督。与LLM对完整序列的常规处理相比，所提出的方法显着降低了推理成本和输入大小。我们通过实验表明，我们的方法优于现实世界财务数据集中学习事件序列表示的最先进技术，同时仍可以在潜伏期敏感的环境中部署。
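As a rough illustration of the alignment idea in the abstract above, the sketch below computes an InfoNCE-style contrastive loss between event-sequence embeddings and text embeddings of the corresponding behavioral summaries. The abstract does not specify the exact loss, so the InfoNCE form, the cosine similarity, the `temperature` value, and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(event_embs, text_embs, temperature=0.1):
    """Mean InfoNCE loss (assumed form, not taken from the paper).

    Row i of event_embs is the positive pair of row i of text_embs;
    every other text embedding in the batch acts as a negative.
    """
    losses = []
    for i, event in enumerate(event_embs):
        logits = [cosine(event, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch: matching pairs give a low loss, mismatched pairs a high one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In a setup like the one described, the text embeddings would come from a frozen LLM and only the event encoder would receive gradients; the toy vectors here stand in for both.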

Title: Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control

Authors: Yuanchang Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10022
Pdf URL: https://arxiv.org/pdf/2508.10022
Copy Paste: [[2508.10022]] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control(https://arxiv.org/abs/2508.10022)
Keywords: language model, llm, hallucination
Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
摘要：这项研究介绍了一个显着性测试增强的保形预测（CP）框架，以提高多项选择性问题回答（MCQA）中大语言模型（LLMS）的可信度。尽管LLM越来越多地部署在纪律质量质量检查方案中，但幻觉和非事实产生实质上却损害了响应可靠性。尽管CP为预测集提供了统计上严格的边际覆盖范围保证，并且显着性测试提供了确定的统计严格性，但其协同整合仍未得到探索。为了减轻幻觉和事实不准确，我们的框架将$ p $ - 价值计算与MCQA响应的自洽性重新采样通过一致性得分。这种方法计算了以经验衍生的$ p $ - 值的方式来解决LLMS的黑框自然解决LLMS的Black-Box性质，随后通过NULL假设测试（$ \ Mathcal {H} _0 $）来构建预测集。使用现成的LLMS对MMLU和MMLU-PRO基准测试的评估证明：（1）增强的CP达到了用户指定的经验误差率；（2）测试集的平均预测集大小（APS）随着风险水平的增加（$ \ alpha $）单调减少，从而将APS验证为有效的不确定性度量。这项工作为高风险QA应用程序中可信赖的LLM部署建立了一个原则性的统计框架。
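The procedure outlined in the abstract above (resample answers, turn option frequencies into conformity scores, then keep options whose p-value survives the test) can be sketched with the standard split-conformal p-value. This is a minimal sketch under stated assumptions: the nonconformity score (number of sampled answers that did not pick the option), the calibration scores, and all function names are hypothetical and not taken from the paper.

```python
from collections import Counter

def nonconformity(samples, option):
    """Assumed score: how many sampled answers did NOT choose this option."""
    return len(samples) - Counter(samples)[option]

def conformal_p_value(score, calibration_scores):
    """Standard conformal p-value: rank of the score among calibration scores."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= score)
    return (at_least_as_extreme + 1) / (n + 1)

def prediction_set(samples, options, calibration_scores, alpha):
    """Keep every option whose null hypothesis is not rejected at level alpha."""
    return [
        option
        for option in options
        if conformal_p_value(nonconformity(samples, option), calibration_scores) > alpha
    ]

# Hypothetical calibration scores: nonconformity of the TRUE answer on
# nine held-out questions, each sampled 10 times.
calibration = [1, 2, 2, 3, 4, 5, 6, 7, 9]
# Ten resampled answers for one test question.
samples = ["A"] * 7 + ["B"] * 2 + ["C"]
options = ["A", "B", "C", "D"]

print(prediction_set(samples, options, calibration, alpha=0.1))  # ['A', 'B', 'C']
print(prediction_set(samples, options, calibration, alpha=0.2))  # ['A']
```

The two calls show the behavior the abstract reports for average prediction set size: raising the risk level `alpha` can only shrink the set.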

Title: RTTC: Reward-Guided Collaborative Test-Time Compute

Authors: J. Pablo Muñoz, Jinjie Yuan
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.10024
Pdf URL: https://arxiv.org/pdf/2508.10024
Copy Paste: [[2508.10024]] RTTC: Reward-Guided Collaborative Test-Time Compute(https://arxiv.org/abs/2508.10024)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.
摘要：测试时间计算（TTC）已成为一种强大的范式，用于提高推理时大型语言模型（LLM）的性能，利用策略（例如测试时间培训）（TTT）和检索效果（RAG）。但是，最佳适应性策略在查询之间各不相同，而TTC策略的不加区别应用会造成大量的计算开销。在这项工作中，我们引入了奖励指导的测试时间计算（RTTC），这是一个新颖的框架，可以通过预处理的奖励模型适应每个查询的最有效的TTC策略，从而最大程度地提高了各种域和任务的下游精度。 RTTC在分布式服务器 - 客户端架构中运行，从远程知识库中检索相关样本，并仅在必要时在客户端设备上应用抹布或轻巧的微调。为了进一步减轻冗余计算，我们提出了查询状态缓存，该缓存可以有效地在检索和适应水平上对历史查询状态进行有效的重复使用。跨多个LLM和基准测试的广泛实验表明，与Vanilla Rag或TTT相比，RTTC始终达到了较高的精度，从而验证了适应性，奖励指导的TTC选择的必要性，并且RTTC的潜力以及可扩展的，高表现的语言模型适应性。
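One way to picture reward-guided strategy selection with caching is an escalation loop: try the cheapest strategy first, score its output with a reward model, and move to a costlier strategy only if the score falls short. The abstract above does not describe the selection rule at this level of detail, so the cheapest-first escalation, the `threshold`, the per-query cache, and every name below are illustrative assumptions rather than the RTTC implementation.

```python
def rttc_answer(query, strategies, reward, threshold, cache):
    """Return (strategy_name, output) for a query.

    strategies: list of (name, callable) ordered from cheapest to costliest.
    reward:     callable(query, output) -> float, a stand-in for a reward model.
    cache:      dict reused across calls so repeated queries skip all compute.
    """
    if query in cache:
        return cache[query]
    best_name, best_output, best_score = None, None, float("-inf")
    for name, run in strategies:
        output = run(query)
        score = reward(query, output)
        if score > best_score:
            best_name, best_output, best_score = name, output, score
        if score >= threshold:
            break  # good enough: do not pay for costlier strategies
    cache[query] = (best_name, best_output)
    return cache[query]

# Stub strategies and reward, purely for demonstration.
calls = []
def base(q):
    calls.append("base")
    return "unsure"
def rag(q):
    calls.append("rag")
    return "Paris"
def ttt(q):
    calls.append("ttt")
    return "Paris"

strategies = [("base", base), ("rag", rag), ("ttt", ttt)]
reward = lambda q, out: 1.0 if out == "Paris" else 0.0
cache = {}

first = rttc_answer("capital of France?", strategies, reward, 0.5, cache)
second = rttc_answer("capital of France?", strategies, reward, 0.5, cache)
print(first, calls)
```

The second call is served from the cache, so `calls` records only the `base` and `rag` runs from the first call; test-time training is never triggered for this query.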

Title: Detecting and explaining postpartum depression in real-time with generative artificial intelligence

Authors: Silvia García-Méndez, Francisco de Arriba-Pérez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10025
Pdf URL: https://arxiv.org/pdf/2508.10025
Copy Paste: [[2508.10025]] Detecting and explaining postpartum depression in real-time with generative artificial intelligence(https://arxiv.org/abs/2508.10025)
Keywords: language model, llm
Abstract: Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for in-time and proper assessment and intervention.
摘要：母亲在分娩后面临的许多挑战中，产后抑郁症（PPD）是一种严重的疾病，会对他们的身心健康产生重大影响。因此，PPD及其相关风险因素的快速检测对于通过专门的预防程序进行计时和干预至关重要。因此，这项工作旨在通过最新的技术进步来帮助从业者做出决定，以实现实时筛查和治疗建议。主要是，我们的工作有助于智能PPD筛选系统，该系统结合了自然语言处理，机器学习（ML）和大型语言模型（LLMS），以实现负担得起的，实时和无创的自由言论分析。此外，它解决了黑匣子问题，因为通过使用特征重要性和自然语言的LLM与可解释的ML模型（即，基于树的算法）的结合，将这些预测与最终用户相结合。对于所有评估指标，获得的PPD检测结果为90％，表现优于文献中的竞争解决方案。最终，我们的解决方案有助于快速检测PPD及其相关的风险因素，这对于在时间和适当的评估和干预中至关重要。

Title: SABER: Switchable and Balanced Training for Efficient LLM Reasoning

Authors: Kai Zhao, Yanjun Zhao, Jiaming Song, Shien He, Lusheng Zhang, Qiang Zhang, Tianjiao Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10026
Pdf URL: https://arxiv.org/pdf/2508.10026
Copy Paste: [[2508.10026]] SABER: Switchable and Balanced Training for Efficient LLM Reasoning(https://arxiv.org/abs/2508.10026)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example's base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes (NoThink, FastThink, CoreThink, and DeepThink), enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
摘要：由经过思考的推理授权的大型语言模型（LLM）在复杂的任务上取得了令人印象深刻的准确性，但是当统一应用于所有问题时，推理成本和潜伏期过高。我们提出了Saber（有效的LLM推理，可切换和平衡的培训），这是一个增强学习框架，它赋予LLMS具有用户控制的，具有代币的预算推理。 Saber首先介绍了每个培训示例的基本模型思维令牌的使用情况，并将其分配给预定义的预算层之一。在微调过程中，该模型以系统提示和长度感知奖励为指导，以尊重其指定的预算。同时，我们合并了无思想的示例，以确保即使关闭明确的推理，该模型也保持可靠。 Saber进一步支持四种离散推理模式 - NotHink，FastThink，Corethink和DeepThink，从而在延迟和推理深度之间实现了灵活的权衡。对数学推理（数学，GSM8K），代码生成（MBPP）和逻辑推理（LiveBench-Reomation）的广泛评估表明，Saber在预算紧张，优雅的降级以及有效的跨尺度和跨域通用下实现了很高的精度。特别是，与数学基准的基本模型相比，Saber-FastThink将推理长度缩短了65.4％，准确度增长了3.6％。

Title: LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data

Authors: Ali Zolnour, Hossein Azadmaleki, Yasaman Haghbin, Fatemeh Taherinezhad, Mohamad Javad Momeni Nezhad, Sina Rashidi, Masoud Khani, AmirSajjad Taleban, Samin Mahdizadeh Sani, Maryam Dadkhah, James M. Noble, Suzanne Bakken, Yadollah Yaghoobzadeh, Abdol-Hossein Vahabie, Masoud Rouhizadeh, Maryam Zolnoori
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10027
Pdf URL: https://arxiv.org/pdf/2508.10027
Copy Paste: [[2508.10027]] LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data(https://arxiv.org/abs/2508.10027)
Keywords: language model, gpt, llm
Abstract: Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. We develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank "cookie-theft" task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 from 47.3 to 78.5). Current multimodal models demonstrated lower performance (GPT-4o: F1 = 70.2; Qwen: F1 = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.
摘要：阿尔茨海默氏病和相关痴呆症（ADRD）影响了美国约五百万老年人，但一半以上仍未诊断。基于语音的自然语言处理（NLP）提供了一种有希望的，可扩展的方法，可以通过语言标记来检测早期认知能力下降。为了开发和评估（i）（i）使用手工制作的语言特征融合变压器嵌入的筛选管道，（ii）使用大语模型（LLMS）生成的合成语音测试数据增强，以及（iii）基准测试单峰和多模式的LLM LLM分类器进行ADRD检测。使用了痴呆症“ cookie-theft”任务（n = 237）的成绩单。在三种微调策略下评估了十种变压器模型。融合模型将来自表现最佳变压器的嵌入与110个词汇衍生的语言特征相结合。对五个LLM（Llama-8b/70b，Medalpaca-7b，Ministral-8B，GPT-4O）进行了微调，以生成标签调子条件的合成语音，用于增强培训数据。测试了三个多模型模型（GPT-4O，QWEN-OMNI，PHI-4），以零拍和微调设置进行语音文本分类。融合模型达到了F1 = 83.3（AUC = 89.5），表现优于语言或仅变形金属的基准。用2倍MEDALPACA-7B合成语音增加训练数据将F1提高到85.7。微调显着改善了单峰LLM分类器（例如Medalpaca：F1 = 47.3-> 78.5 F1）。当前的多峰模型表现出较低的性能（GPT-4O = 70.2 F1; QWEN = 66.0）。性能获得与合成和真实语音之间的分布相似性一致的。将变压器嵌入具有语言特征的嵌入增强了语音的ADRD检测。临床调整的LLM有效地支持分类和数据增强，而在多模型建模中需要进一步进步。

Title: PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs

Authors: Xiao Fu, Hossein A. Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, Aldo Lipani
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10028
Pdf URL: https://arxiv.org/pdf/2508.10028
Copy Paste: [[2508.10028]] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs(https://arxiv.org/abs/2508.10028)
Keywords: language model, llm
Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
摘要：个性化文本生成对于以用户为中心的信息系统至关重要，但大多数评估方法忽略了用户的个性。我们介绍\ textbf {pref}，a \ textbf {p} sersonalisiss \ textbf {r} eferece-referey \ textbf {e}估值\ textbf {f textbf {f} ramework共同测量一般的输出质量和用户特定的量化，而无需金色个性化的参数。 PREF在三步管道中运行：（1）覆盖阶段使用大型语言模型（LLM）生成涵盖普遍标准的全面，特定于查询的指南，例如事实，连贯性和完整性；（2）使用目标用户的配置文件，陈述或推断的偏好以及上下文，将偏好阶段重新排名并选择性地增加这些因素，从而产生个性化评估；（3）评分阶段适用LLM法官对候选人的答案对此规则进行评分，从而确保基线适当性，同时捕获主观优先级。覆盖范围与偏好的分离提高了鲁棒性，透明度和可重复性，并允许较小的模型近似较大的质量。在包括隐式偏好遵循任务的包括隐式偏好的基准上进行的实验表明，与强基本线相比，PREF可以实现更高的准确性，更好的校准和与人类判断更紧密的一致性。通过启用可扩展，可解释和用户一致的评估，PREF为更可靠的评估和开发个性化语言生成系统奠定了基础。

Title: Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Authors: Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.10029
Pdf URL: https://arxiv.org/pdf/2508.10029
Copy Paste: [[2508.10029]] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs(https://arxiv.org/abs/2508.10029)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
摘要：大型语言模型（LLMS）在各种语言任务中表现出令人印象深刻的能力，但容易受到越狱攻击的攻击，这些攻击规避了他们的安全对准。本文介绍了潜在的融合越狱（LFJ），这是一种基于代表的攻击，从有害和良性的查询对中介绍了隐藏的状态，以引起禁止的反应。 LFJ首先选择具有较高主题和句法相似性的查询对，然后在影响层和令牌上执行梯度引导的插值，然后优化以平衡攻击成功，输出流畅性和计算效率。对跨基准和恶意建筑等基准的Vicuna和Llama-2等模型的评估产生的平均攻击成功率（ASR）为94.01％，表现优于现有方法。为了减轻LFJ，我们提出了一种对抗性训练防御，以插值示例进行微调模型，从而使ASR减少了80％以上，而不会在良性输入上降低性能。消融研究证实了查询对选择，隐藏状态插值组件以及LFJ有效性中的优化策略的重要性。

Title: Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Authors: Saaduddin Mahmud, Mason Nakamura, Kyle H. Wray, Shlomo Zilberstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10030
Pdf URL: https://arxiv.org/pdf/2508.10030
Copy Paste: [[2508.10030]] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models(https://arxiv.org/abs/2508.10030)
Keywords: language model, llm, prompt
Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
摘要：及时的优化方法表明，在对齐黑盒大型语言模型（LLMS）方面具有显着效果。同时，推理缩放策略（例如最佳N采样和多数投票）也已被证明可以通过交易计算来提高一致性和性能。但是，现有的及时优化方法是推理策略不可知论。也就是说，他们不考虑部署过程中采用的推理策略来优化提示。这构成了显着的方法论差距，因为我们的经验和理论分析揭示了这两个范式之间的强烈相互依赖性。此外，我们发现用户对多个目标之间权衡的偏好，推理预算在很大程度上影响了及时和推理配置的选择。为了解决这一差距，我们引入了一个名为IAPO（推理意识提示优化）的统一的小说框架，该框架共同优化了提示和推理量表，同时意识到推理预算和不同的任务目标。然后，我们为IAPO开发了一种固定预算训练算法，我们称之为PSST（通过顺序修剪及时缩放），并分析有限预算保证错误概率。最后，我们评估了PSST对六个不同任务的有效性，包括多目标文本生成和推理，并在通过迅速优化对齐Black-Box LLM时展示了合并推理意识的关键作用。
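The interdependence described in the abstract above can be made concrete with a small example: the best prompt depends on which inference strategy and how many samples the deployment can afford. The sketch below implements Best-of-N and Majority Voting and then picks a (prompt, N) configuration under a budget. The score table, the cost model, and all names are hypothetical illustrations; this is not the PSST algorithm.

```python
from collections import Counter

def best_of_n(candidates, reward):
    """Best-of-N: return the sampled candidate with the highest reward."""
    return max(candidates, key=reward)

def majority_vote(candidates):
    """Majority Voting: return the most frequent sampled answer."""
    return Counter(candidates).most_common(1)[0][0]

def pick_configuration(estimated_score, cost_per_sample, budget):
    """Choose the (prompt, n_samples) pair with the best estimated score
    among configurations whose sampling cost fits the inference budget."""
    feasible = {
        config: score
        for config, score in estimated_score.items()
        if config[1] * cost_per_sample <= budget
    }
    return max(feasible, key=feasible.get)

# Made-up estimates: prompt "p1" is better with one sample,
# prompt "p2" benefits more from extra samples.
estimated_score = {
    ("p1", 1): 0.60, ("p1", 4): 0.70,
    ("p2", 1): 0.55, ("p2", 4): 0.80, ("p2", 8): 0.85,
}

print(pick_configuration(estimated_score, cost_per_sample=1, budget=1))  # ('p1', 1)
print(pick_configuration(estimated_score, cost_per_sample=1, budget=4))  # ('p2', 4)
print(majority_vote(["42", "41", "42", "7"]))                            # 42
```

With these toy numbers, a prompt optimizer that ignores the inference budget would commit to `p1`, which is the wrong choice once four or more samples are allowed.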

Title: The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Authors: Fan Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10032
Pdf URL: https://arxiv.org/pdf/2508.10032
Copy Paste: [[2508.10032]] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models(https://arxiv.org/abs/2508.10032)
Keywords: language model, llm, prompt
Abstract: Thinking mode has long been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the attack success rate against thinking mode is almost always higher than against non-thinking mode. Through a large number of sample studies, we find that successfully attacked samples are characterized by "for educational purposes" framing and excessively long thinking lengths, and that LLMs give harmful answers even when they mostly recognize that the questions are harmful. To alleviate these problems, this paper proposes a safe thinking intervention for LLMs, which explicitly guides the internal thinking processes of LLMs by adding the LLMs' "specific thinking tokens" to the prompt. The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate against LLMs with thinking mode.
摘要：思维模式一直被视为LLM中最有价值的模式之一。但是，我们发现了一种令人惊讶且以前被忽视的现象：越狱攻击更容易打破具有思考模式的LLM。我们在Advbench和Harmbench上评估了9个LLM，发现LLMS中攻击思维模式的成功率几乎高于非思想模式的成功率。通过大量的样本研究，发现出于教育目的和过长的思维长度是成功攻击数据的特征，而LLMS大多数人大多知道问题有害时也会给出有害的答案。为了减轻上述问题，本文提出了一种针对LLM的安全思维干预方法，该方法通过在提示中添加LLM的“特定思维代币”来明确指导LLM的内部思维过程。结果表明，安全思维干预可以大大降低LLM通过思考模式的攻击成功率。

Title: Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

Authors: Dong Zhao, Yadong Wang, Xiang Chen, Chenxi Wang, Hongliang Dai, Chuanxing Geng, Shengzhong Zhang, Shaoyuan Li, Sheng-Jun Huang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10036
Pdf URL: https://arxiv.org/pdf/2508.10036
Copy Paste: [[2508.10036]] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion(https://arxiv.org/abs/2508.10036)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
摘要：大型语言模型（LLMS）在几次信息提取（IE）方面表现出巨大的潜力，但它们的性能对选择示例的选择高度敏感。传统的选择策略通常无法提供内容丰富的指导，因为它们忽略了模型谬误的关键来源：混乱不仅源于语义内容，而且还源于IE任务所需的结构良好格式的产生。为了解决这个问题，我们引入了主动提示信息提取（APIE），这是一个以原则为指导的新型主动提示框架，我们术语内省混乱。我们的方法通过双重组件不确定性度量度量来评估其自身的混乱，该指标唯一量化了格式不确定性（难以产生正确的语法）和内容不确定性（在提取的语义中不一致）。通过将未标记的数据与这个综合分数进行排名，我们的框架积极选择最具挑战性和信息性的样本，以作为少量示例。对四个基准测试的广泛实验表明，我们的方法始终优于强基础，从而在提取准确性和鲁棒性方面都有显着提高。我们的工作强调了在建立有效且可靠的结构化生成系统时，对模型不确定性的细粒度双层视图的重要性至关重要。

Title: mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning

Authors: Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10137
Pdf URL: https://arxiv.org/pdf/2508.10137
Copy Paste: [[2508.10137]] mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning(https://arxiv.org/abs/2508.10137)
Keywords: language model, llm
Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
摘要：推理 - 强化大型语言模型（LLM）的最新进步在复杂的推理任务中表现出了显着的功能。但是，对不同人类推理技能利用的基础机制仍未得到调查，特别是对于涉及跨不同语言和文化的日常知识的多语言常识性推理。为了解决这一差距，我们为\ textbf {s}杀死基于\ textbf {co} mmonsense \ textbf {re} asoning（\ textbf {mscore}）的\ textbf {m} ulifuling and-benchmarks。我们的基准测试结合了三个关键组成部分，旨在系统地评估LLM的推理能力，包括：（1）推理技能的新型分类学，可以对模型的推理过程进行细化的分析，（2）强大的数据合成管道专门针对常识性推理评估而专门定制了用于调用式推理评估，以及（3）相同的范围允许扩展范围的范围划定的范围划分的任务均匀划分，该范围的范围均匀划分了。对八个不同大小和训练方法的八个最先进的LLM进行了广泛的实验，这表明\ TextBf {Mscore}对于当前模型，尤其是在较高的复杂性水平下，\ textbf {mscore}仍然具有巨大的挑战。我们的结果揭示了与细微的多语言通用和文化常识面对细微的推理增强模型的局限性。我们进一步就模型的推理过程提供了详细的分析，提出了提高多语言常识性推理能力的未来方向。

Title: Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

Authors: Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10142
Pdf URL: https://arxiv.org/pdf/2508.10142
Copy Paste: [[2508.10142]] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs(https://arxiv.org/abs/2508.10142)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
摘要：大型语言模型（LLM）擅长解决清晰和完整的陈述解决问题，但在大多数真实世界中常见的环境或交互式任务通常都在困难。这突出了开发LLM的关键需求，该LLM可以有效地进行逻辑上一致的多转向对话，通过不完整的数据寻求信息和原因。为此，我们介绍了一个新颖的基准测试，其中包括一套多转弯任务，每个任务旨在测试特定的推理，互动对话和寻求信息的能力。这些任务具有确定性的评分机制，从而消除了对人类干预的需求。在我们的基准上评估前沿模型显示出明显的净空。我们的分析表明，大多数错误来自不良的指导，推理失败和计划不良。该基准测试为当前LLM在处理复杂，互动场景方面的优势和缺点提供了宝贵的见解，并为未来的研究提供了一个强大的平台，旨在提高这些关键功能。

Title: LaajMeter: A Framework for LaaJ Evaluation

Authors: Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Avi Ziv
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10161
Pdf URL: https://arxiv.org/pdf/2508.10161
Copy Paste: [[2508.10161]] LaajMeter: A Framework for LaaJ Evaluation(https://arxiv.org/abs/2508.10161)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
摘要：大型语言模型（LLM）越来越多地用作自然语言处理任务中的评估者，这是一种称为LLM-AS-A-A-Gudge（LAAJ）的范式。尽管在一般领域有效，但LAAJS在特定于领域的环境中构成了重大挑战，在域中，注释数据很少，并且专家评估的成本很高。在这种情况下，通常使用未针对应用特定域进行验证的指标进行元评估。结果，很难确定哪些指标有效地识别LAAJ质量，进一步的阈值表明了足够的评估者性能。在这项工作中，我们介绍了Laajmeter，这是一个基于模拟的框架，用于控制LAAJS的元评估。 LAAJMeter使工程师能够生成代表虚拟模型和法官的合成数据，从而可以在现实条件下进行系统分析。这有助于实践者验证和完善对特定评估任务的LAAJ：他们可以测试他们的指标是否正确区分了更好或更差的（虚拟）LAAJ，并估算适当的评估者阈值。我们在涉及旧版编程语言的代码翻译任务中演示了LAAJMeter的实用性，显示了对评估者质量的敏感性不同的指标。我们的结果突出了共同指标的局限性以及原则指标选择的重要性。 LAAJMeter提供了一种可扩展且可扩展的解决方案，用于评估低资源环境中的LAAJ，这有助于更广泛的努力，以确保NLP中值得信赖和可重现的评估。

Title: Estimating Machine Translation Difficulty

Authors: Lorenzo Proietti, Stefano Perrella, Vilém Zouhar, Roberto Navigli, Tom Kocmi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10175
Pdf URL: https://arxiv.org/pdf/2508.10175
Copy Paste: [[2508.10175]] Estimating Machine Translation Difficulty(https://arxiv.org/abs/2508.10175)
Keywords: llm
Abstract: Machine translation systems have begun achieving near-perfect translations in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.
摘要：机器翻译质量已经开始在某些设置中实现接近完美的翻译。这些高质量的产出使得很难区分最先进的模型并确定未来改进的领域。自动识别机器翻译系统斗争的文本有望开发更多的歧视性评估并指导未来的研究。我们将翻译难度估算的任务正式化，根据其翻译的预期质量来定义文本的难度。我们引入了一个新的指标来评估难度估计器并使用它来评估基准和新颖方法。最后，我们通过使用更具挑战性的机器翻译基准来证明难度估计器的实际实用性。我们的结果表明，专用模型（称为Sentinel-SRC）的表现优于基于启发式的方法（例如单词稀有性或句法复杂性）和llm-as-as-a-a-a-gudge方法。我们发布了两个改进的模型，以实现难度估算，即Sentinel-SRC-24和Sentinel-SRC-25，它们可用于扫描大量文本集合，并选择最有可能挑战当代机器翻译系统的人。

Title: Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

Authors: Wenlong Deng, Jiaming Zhang, Qi Zeng, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10180
Pdf URL: https://arxiv.org/pdf/2508.10180
Copy Paste: [[2508.10180]] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs(https://arxiv.org/abs/2508.10180)
Keywords: language model, llm
Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
摘要：量化单个培训样本的影响对于增强大语言模型（LLM）和视觉语言模型（VLM）的透明度和问责制至关重要。但是，现有的数据估值方法通常依赖于Hessian信息或模型再培训，这使得它们对数十亿参数模型的计算效果均高。在这项工作中，我们介绍了for-falue，这是一个仅向前的数据估值框架，可实现LLM和VLM的可扩展影响估计。通过利用现代基础模型的丰富表示形式，For-Value使用仅基于单个正向通行证的简单闭合表达式计算得分，从而消除了对昂贵的梯度计算的需求。我们的理论分析表明，预价通过在训练和验证样本之间捕获隐藏表示形式和预测错误来准确估计每样本的影响。广泛的实验表明，在识别有影响力的微调示例并有效地检测出标记错误的数据方面，值得匹配或优于基于梯度的基线。

Title: PakBBQ: A Culturally Adapted Bias Benchmark for QA

Authors: Abdullah Hashmat, Muhammad Arham Mirza, Agha Ali Raza
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10186
Pdf URL: https://arxiv.org/pdf/2508.10186
Copy Paste: [[2508.10186]] PakBBQ: A Culturally Adapted Bias Benchmark for QA(https://arxiv.org/abs/2508.10186)
Keywords: language model, llm, prompt
Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions relevant in Pakistan: age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.
摘要：随着在各种应用程序中广泛采用大语言模型（LLM），经验是确保它们在所有用户社区中的公平性。但是，大多数LLM都接受了西方以中心数据的培训和评估，对低资源语言和区域环境的关注很少。为了解决这一差距，我们介绍了PAKBBQ，这是一种在文化和区域改编的原始偏差基准的扩展，以进行答案（BBQ）数据集。 PAKBBQ包括超过214个模板，17180个QA对英语和乌尔都语的8个类别，涵盖了八个偏见维度，包括年龄，残疾，外观，性别，性别，社会经济状况，宗教，区域隶属关系和语言形式，在巴基斯坦相关。我们在模棱两可和明确的歧义环境以及负面问题与非负面问题框架的情况下评估了多种多语言LLM。我们的实验表明（i）歧义的平均准确性增益为12 \％，（ii）乌尔都语中的反偏置行为始终更强，并且（iii）明显的框架效应，当问题带来负面提出时，可以减少刻板印象响应。这些发现突出了上下文化基准和简单的及时工程策略在低资源设置中减轻偏置的重要性。

Title: Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

Authors: Igor Halperin
Subjects: cs.CL, cs.AI, cs.LG, q-fin.CP
Abstract URL: https://arxiv.org/abs/2508.10192
Pdf URL: https://arxiv.org/pdf/2508.10192
Copy Paste: [[2508.10192]] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models(https://arxiv.org/abs/2508.10192)
Keywords: language model, llm, hallucination, prompt
Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations: events in which LLM responses deviate severely from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer $||$ Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
摘要：大型语言模型（LLM）的扩散受到幻觉的挑战，模型产生非事实，荒谬或不忠文本的关键故障模式。本文介绍了语义差异指标（SDM），这是一种用于检测忠实幻觉的新型轻量级框架 - LLMS响应与输入环境的严重偏差的事件。我们专注于这些LLM错误的特定实现，{confabulations，定义为与用户查询的任意且语义上未对准的响应。现有方法通过测量单个固定提示的答案的多样性，例如语义熵测试任意性。我们的SDM框架通过更及时了解来改善这一点：我们通过不仅在多个答案中，而且跨多个原始提示的多种语义等效释义来测量响应一致性来测试更深层次的任意性。从方法上讲，我们的方法使用句子嵌入的联合聚类来为提示和答案创建共享主题空间。提示和响应之间的主题共同存在的热图可以看作是用户机器对话的二维可视化。然后，我们计算一套信息理论指标，以衡量提示和响应之间的语义差异。我们的实用分数，$ \ Mathcal {s} _h $，结合了Jensen-Shannon Divergence和Wasserstein距离，以量化这种分歧，高分表明忠实的幻觉。此外，我们将KL Divergence KL（答案$ || $提示）确定为\ textbf {语义探索}的强大指标，这是区分不同生成行为的关键信号。这些指标进一步合并到语义框中，这是一种诊断框架，用于对LLM响应类型进行分类，包括危险，自信的封发。
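
The two divergences named in the abstract above can be illustrated on toy topic distributions. This is a minimal sketch of the standard definitions only; it does not reproduce the paper's $\mathcal{S}_H$ score, its Wasserstein term, or its clustering step. The function names and the example distributions are hypothetical, and both inputs are assumed to be probability vectors over the same shared topics.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical topic distributions over three shared topics.
prompt_topics = [0.7, 0.2, 0.1]
answer_topics = [0.3, 0.3, 0.4]

# Direction matters for KL: this is KL(Answer || Prompt), as in the abstract.
kl_answer_prompt = kl_divergence(answer_topics, prompt_topics)
js_prompt_answer = js_divergence(prompt_topics, answer_topics)
```

The Jensen-Shannon divergence stays finite even when one distribution assigns zero mass to a topic, whereas KL(Answer || Prompt) is undefined if the answer uses a topic the prompt never touches. The sketch does not handle that case.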

Title: Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia

Authors: Andrew X. Chen, Guillermo Horga, Sean Escola
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10226
Pdf URL: https://arxiv.org/pdf/2508.10226
Copy Paste: [[2508.10226]] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia(https://arxiv.org/abs/2508.10226)
Keywords: language model, llm
Abstract: Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.
摘要：患有精神分裂症的临床高风险（CHR）的患者需要密切监测其症状，以告知适当的治疗方法。简短的精神病学评分量表（BPRS）是一种经过验证的，常用的研究工具，用于测量精神分裂症患者和其他精神病患者的症状；但是，它在临床实践中并不常用，因为它需要冗长的结构化访谈。在这里，我们利用大型语言模型（LLMS）来预测409名CHR患者临床访谈成绩单中的BPRS评分，来自加速药物伙伴关系精神分裂症（AMP-SCZ）同类。尽管访谈不是专门为衡量BPR的构造的，但与真实评估（中位数一致性：0.84，ICC：0.73）相比，LLM预测的零射击性能接近了人际关系和内部可靠性。我们进一步证明，LLM具有重大潜力，可以通过评估外语的BPR（中位数一致性：0.88，ICC：0.70）来改善和标准化CHR患者的评估，并以一次性或更少的摄入学习方法整合纵向信息。

Title: Inductive Bias Extraction and Matching for LLM Prompts

Authors: Christian M. Angel, Francis Ferraro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10295
Pdf URL: https://arxiv.org/pdf/2508.10295
Copy Paste: [[2508.10295]] Inductive Bias Extraction and Matching for LLM Prompts(https://arxiv.org/abs/2508.10295)
Keywords: llm, prompt
Abstract: The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM's output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.
摘要：迅速工程的主动研究主题使LLM显然对及时措辞的小变化很敏感。其中一部分可以归因于LLM中存在的归纳偏差。通过将LLM的输出作为其提示的一部分，我们可以更轻松地为提示创建令人满意的措辞。这具有创建与模型中归纳偏差相匹配的提示。从经验上讲，我们表明，使用这种感应性偏置提取和匹配策略可改善用于分类的LLM Likert评分，最多可提高19％，而LLM Likert等级用于排名高达27％。

Title: Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race

Authors: Gustavo Bonil, Simone Hashiguti, Jhessica Silva, João Gondim, Helena Maia, Nádia Silva, Helio Pedrini, Sandra Avila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10304
Pdf URL: https://arxiv.org/pdf/2508.10304
Copy Paste: [[2508.10304]] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race(https://arxiv.org/abs/2508.10304)
Keywords: language model, llm, prompt
Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.
摘要：随着人工智能（AI）的发展，大语言模型（LLM）已获得突出，并在不同的情况下应用。随着它们发展为更复杂的版本，必须评估它们是否繁殖偏见，例如歧视和种族化，同时保持霸权话语。当前的偏见检测方法主要依赖于定量的自动化方法，这些方法通常忽略了以自然语言出现偏见的细微差别方式。这项研究提出了一个定性的，言语的框架，以补充这种方法。通过对LLM生成的短篇小说的手动分析，以黑白妇女为特色，我们研究了性别和种族偏见。我们认为，诸如此处提出的一种定性方法是基本的，可以帮助开发人员和用户确定偏见在LLM输出中表现出的确切方法，从而使更好的条件减轻它们。结果表明，黑人妇女被描绘成与祖先和抵抗相关的，而白人妇女则出现在自我发现的过程中。这些模式反映了语言模型如何复制结晶的话语表达，增强本质和社会不动感的感觉。当提示纠正偏见时，模型提供了表面修订，以保持有问题的含义，从而揭示了促进包容性叙事的局限性。我们的结果证明了算法的意识形态功能，并对AI的道德使用和发展具有重要意义。这项研究强化了对AI设计和部署的关键，跨学科方法的需求，从而解决了LLM生成的话语如何反映和永久性不平等。

Title: From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis

Authors: Xuan Li, Jialiang Dong, Raymond Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10311
Pdf URL: https://arxiv.org/pdf/2508.10311
Copy Paste: [[2508.10311]] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis(https://arxiv.org/abs/2508.10311)
Keywords: gpt
Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.
摘要：文件是信息和知识范围的核心载体，在金融，医疗保健和科学研究中广泛应用。表作为结构化数据的主要媒介，封装了关键信息，并且是最关键的文档组件之一。现有的研究主要集中在表面级别的任务上，例如布局分析，表检测和数据提取，缺乏对表格的语义解析及其上下文关联。这限制了高级任务，例如跨段数据解释和上下文一致的分析。为了解决这个问题，我们提出了Dotabler，这是一个以桌子为中心的语义文档解析框架，旨在揭示表及其上下文之间的深层语义联系。 Dotabler利用预先训练的模型的自定义数据集和特定于域特定的微调，集成了完整的解析管道，以识别上下文段与表与表绑定的上下文段。 Dotabler建立在这种语义理解的基础上，实现了两个核心功能：以桌子为中心的文档结构解析和特定于域的表检索，可提供全面的桌面锚定语义分析，并精确地提取语义相关表。 Dotaber在现实世界中有超过1,000张表的近4,000页，具有超过1,000张表的评估，与GPT-4O（例如GPT-4O）相比，在桌面语义分析和深层文档分析中，表现出了超过90％的精度和F1分数。

Title: Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation

Authors: Minhao Wang, Yunhang He, Cong Xu, Zhangchi Zhu, Wei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10312
Pdf URL: https://arxiv.org/pdf/2508.10312
Copy Paste: [[2508.10312]] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation(https://arxiv.org/abs/2508.10312)
Keywords: language model, llm
Abstract: Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users' interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline. Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
摘要：与大型语言模型（LLMS）共同推荐系统提出了有希望的途径，用于生成语义知名度的建议。但是，基于LLM的推荐人表现出过度强调用户交互历史中语义相关性的趋势。当以验证的协作ID嵌入为输入时，基于LLM的推荐人会逐步削弱固有的协作信号，因为嵌入方式通过LLM骨架逐层传播，而不是传统的基于变压器的顺序模型，而该模型通常保留了一个合作信号，甚至可以增强对现状的表现。为了解决这一限制，我们介绍了Frellm4Rec，这种方法旨在从光谱的角度平衡语义和协作信息。首先，使用全局图低通滤波器（G-LPF）纯化了同时结合语义和协作信息的项目嵌入，以初步消除无关的高频噪声。然后，时间频率调制（TFM）会逐图积极保留协作信号。请注意，通过在最佳但难以实现的本地图傅立叶过滤器与次优且计算上有效的频率域滤波器之间建立联系，从理论上可以保证TFM的协作保存能力。在四个基准数据集上进行的广泛实验表明，FRELLM4REC成功地减轻了协作信号衰减并实现竞争性能，而NDCG@10中的提高高达8.00 \％，而不是最佳基线。我们的发现提供了有关LLMS如何处理协作信息的见解，并提供了改善基于LLM的建议系统的原则方法。
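
The NDCG@10 figure quoted in the abstract above refers to a standard ranking metric. A minimal sketch follows, assuming the common sequential-recommendation setup in which each test case has exactly one relevant (held-out) item. That protocol is an assumption rather than something the abstract states, and `ndcg_at_k` is a hypothetical helper name.

```python
import math
from typing import Hashable, Sequence


def ndcg_at_k(ranked_items: Sequence[Hashable], target: Hashable, k: int = 10) -> float:
    """NDCG@k for a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) when the target is ranked within the top k, else 0.
    """
    for position, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0
```

A benchmark score is then the mean of this value over all test users, so a gain in NDCG@10 means held-out items land closer to the top of the ranked list on average.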

Title: Cross-Prompt Encoder for Low-Performing Languages

Authors: Beso Mikaberidze, Teimuraz Saghinadze, Simon Ostermann, Philipp Muller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10352
Pdf URL: https://arxiv.org/pdf/2508.10352
Copy Paste: [[2508.10352]] Cross-Prompt Encoder for Low-Performing Languages(https://arxiv.org/abs/2508.10352)
Keywords: language model, llm, prompt
Abstract: Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages, i.e., those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages, a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.
摘要：软提示已成为参数有效的微调（PEFT）中适配器的强大替代方案，使大型语言模型（LLMS）能够适应没有架构更改或参数更新的下游任务。虽然先前的工作集中于通过小型神经及时编码器中的参数相互作用稳定训练，但它们跨语言的传递潜力仍然没有探索。在本文中，我们证明了及时编码器可以在改善低表现语言的性能方面发挥核心作用，即使在全模型的微调下，也可以达到较差的精度。我们介绍了交叉宣传编码器（XPE），该编码器将轻巧的编码体系结构与多种多样的语言进行了多源培训 - 一种设计，该设计使该模型能够捕获跨语言的抽象和可转移的模式。为了补充XPE，我们提出了一种双重提示机制，该机制将基于编码器的提示与直接训练的标准软提示相结合。这种混合设计对目标语言特别有效，这些目标语言受益于广泛共享的结构和特定于语言的一致性。 SIB-200基准上的实验揭示了一个一致的权衡：XPE对于低表现的语言最有效，而混合动力变体则在多语言环境中提供了更广泛的适应性。

Title: Making Qwen3 Think in Korean with Reinforcement Learning

Authors: Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10355
Pdf URL: https://arxiv.org/pdf/2508.10355
Copy Paste: [[2508.10355]] Making Qwen3 Think in Korean with Reinforcement Learning(https://arxiv.org/abs/2508.10355)
Keywords: language model, chain-of-thought
Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
摘要：我们提出了一种两阶段的微调方法，以使大型语言模型qwen3 14b“思考”在韩文中。在第一阶段，在高质量的韩国推理数据集中受到监督的微调（SFT）为韩国逻辑推理建立了强大的基础，从而在韩国语言任务中得到了显着改进，甚至在一般推理能力方面取得了显着提高。在第二阶段，我们使用定制的小组相对政策优化（GRPO）算法采用强化学习来进一步增强韩国推理的一致性和整体解决问题的绩效。我们通过引入校准奖励信号的Oracle法官模型来解决GRPO培训中的关键稳定挑战 - 例如奖励黑客和政策崩溃。我们的方法实现了稳定的学习（避免在幼稚的GRPO中观察到的崩溃），并导致稳定的，增量的增长。最终的RL调整模型在保持知识和语言水平的同时，在先进的推理基准（尤其是数学和编码任务）上展示了大幅改进的结果，成功地在韩国人中成功地进行了内部思想链。

Title: Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10366
Pdf URL: https://arxiv.org/pdf/2508.10366
Copy Paste: [[2508.10366]] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models(https://arxiv.org/abs/2508.10366)
Keywords: language model, llm
Abstract: Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.
摘要：基于方面的情感分析（ABSA）取得了长足的进步，但由于主要关注英语，低资源语言仍然存在挑战。当前的跨语性ABSA研究通常以更简单的任务为中心，并严重依赖外部翻译工具。在本文中，我们为复合ABSA任务提供了一种新颖的顺序到序列方法，以消除对此类工具的需求。我们使用受限解码的方法将跨语义的ABSA性能提高了10 \％。该方法扩大了跨语性ABSA的范围，使其能够处理更复杂的任务，并提供了与翻译有关的技术的实用，有效的替代方案。此外，我们将我们的方法与大语言模型（LLMS）进行比较，并表明，虽然微调的多语言LLM可以取得可比的结果，但以英语为中心的LLM在这些任务上挣扎。

Title: Large Language Models for Summarizing Czech Historical Documents and Beyond

Authors: Václav Tran, Jakub Šmíd, Jiří Martínek, Ladislav Lenc, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10368
Pdf URL: https://arxiv.org/pdf/2508.10368
Copy Paste: [[2508.10368]] Large Language Models for Summarizing Czech Historical Documents and Beyond(https://arxiv.org/abs/2508.10368)
Keywords: language model
Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents, together with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
摘要：文本摘要是将更大的文本缩短为简洁版本的任务，同时保留其基本含义和关键信息。尽管用英语和其他高资源语言对摘要进行了大量探讨，但捷克文本摘要，尤其是对于历史文档，由于语言复杂性和稀缺的注释数据集而保持不足。诸如Mistral和MT5之类的大型语言模型在许多自然语言处理任务和语言上都表现出了很好的结果。因此，我们将这些模型用于捷克摘要，从而产生了两个关键的贡献：（1）使用这些高级模型在现代捷克摘要数据集SumeCzech上实现新的最新结果，以及（2）引入一个名为Poselodčerchova的新颖数据集，以通过基线结果汇总具有历史czech文档的Poselodčerchova。这些贡献共同提供了捷克文本摘要和开放捷克历史文本处理研究的新途径的巨大潜力。

Title: Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10369
Pdf URL: https://arxiv.org/pdf/2508.10369
Copy Paste: [[2508.10369]] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding(https://arxiv.org/abs/2508.10369)
Keywords: language model, llm
Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.
摘要：尽管基于方面的情感分析（ABSA）取得了长足的进步，但对于低资源语言而言，挑战仍然存在，这些语言常常被忽略了英语。当前的跨语性ABSA方法集中于有限的，较不复杂的任务，并且通常依靠外部翻译工具。本文使用序列到序列模型的约束解码引入了一种新颖的方法，从而消除了对不可靠的翻译工具的需求，并将跨语言性能提高了5 \％，平均而言是最复杂的任务。所提出的方法还支持多任务处理，该方法可以通过单个模型求解多个ABSA任务，并具有约束的解码增强结果超过10 \％。我们评估了七种语言和六项ABSA任务的方法，超过了最新方法，并为以前未探索的任务设定了新的基准测试。此外，我们在零射击，很少射击和微调方案中评估大语言模型（LLMS）。尽管LLM在零射击和少量设置中的表现较差，但与较小的多语言模型相比，微调取得了竞争成果，尽管其成本为更长的培训和推理时间。我们为现实世界应用提供了实用的建议，从而增强了对跨语性ABSA方法论的理解。这项研究为跨语言ABSA方法的优势和局限性提供了宝贵的见解，从而推进了这个充满挑战的研究领域的最先进。
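
The constrained decoding mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `constrained_greedy_step`, the toy vocabulary, and the rule that only polarity labels are allowed at a given step are all assumptions made for illustration. The general idea is that, at each generation step, the decoder may only pick from tokens that keep the output in a valid format.

```python
import math


def constrained_greedy_step(logits, allowed):
    """Pick the highest-scoring token among those allowed at this step.

    logits:  dict mapping token -> score (hypothetical toy vocabulary)
    allowed: set of tokens that keep the output well-formed
    """
    if not allowed:
        raise ValueError("no token is allowed at this step")
    best_token, best_score = None, -math.inf
    for token in allowed:
        # Tokens missing from the logits are treated as impossible.
        score = logits.get(token, -math.inf)
        if score > best_score:
            best_token, best_score = token, score
    if best_token is None:
        raise ValueError("none of the allowed tokens has a finite score")
    return best_token


if __name__ == "__main__":
    # Assumed scenario: the next slot must hold a sentiment polarity label.
    step_logits = {"great": 2.0, "food": 1.5, "positive": 0.3, "negative": 0.1}
    polarity_labels = {"positive", "negative", "neutral"}
    print(constrained_greedy_step(step_logits, polarity_labels))
```

Unconstrained greedy decoding would choose `great` here, which is not a valid label; restricting the candidates forces a well-formed output. In a real sequence-to-sequence model the same restriction is usually applied by masking logits before the softmax.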

Title: Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Authors: Chiyu Zhang, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2508.10390
Pdf URL: https://arxiv.org/pdf/2508.10390
Copy Paste: [[2508.10390]] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts(https://arxiv.org/abs/2508.10390)
Keywords: language model, llm, prompt
Abstract: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), whose accuracy is inconsistent across types of harmful content. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in a GitHub repository: this https URL.
摘要：当提示没有明显有害或未能引起有害产出时，评估越狱攻击是具有挑战性的。不幸的是，许多现有的红色团队数据集都包含此类不合适的提示。为了准确评估攻击，需要对这些数据集进行评估和清洁恶意。但是，现有的恶意内容检测方法依赖于手动注释，即劳动密集型或大型语言模型（LLMS），在有害类型中的准确性不一致。为了平衡准确性和效率，我们提出了一个名为MDH的混合评估框架（基于LLMS和人为援助的恶意检测），将基于LLM的注释与最小的人类监督相结合，并将其应用于数据集清洁和狱城响应的检测。此外，我们发现精心制作的开发人员的信息可以大大提高越狱的成功，这使我们提出了两种新策略：D-Attack（利用上下文模拟）和DH-COT，其中包含了被劫持的思想链。代码，数据集，判断和检测结果将在GitHub存储库中发布：此HTTPS URL。

Title: Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Authors: Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10404
Pdf URL: https://arxiv.org/pdf/2508.10404
Copy Paste: [[2508.10404]] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation(https://arxiv.org/abs/2508.10404)
Keywords: language model, llm, prompt
Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
摘要：随着自然语言处理（NLP）的快速扩散，尤其是大语言模型（LLM），为越狱LLMS产生对抗性例子仍然是了解模型脆弱性和改善鲁棒性的关键挑战。在这种情况下，我们提出了一种新的黑框攻击方法，该方法利用大型模型的解释性。我们介绍了稀疏功能扰动框架（SFPF），这是一种新颖的对抗文本生成的方法，它利用稀疏的自动编码器来识别和操纵文本中的关键特征。使用SAE模型重建隐藏层表示后，我们在成功攻击的文本上执行特征聚类，以识别具有更高激活的功能。然后将这些高度激活的功能扰动以生成新的对抗文本。这种选择性的扰动保留了恶意意图，同时放大安全信号，从而增加了逃避现有防御能力的潜力。我们的方法实现了一种新的红色团队策略，以平衡对抗性有效性与安全一致性。实验结果表明，SFPF产生的对抗文本可以绕过最新的防御机制，从而揭示了当前NLP中的持续性脆弱性，该方法的有效性在提示和层次上的有效性各不相同，并且其对其他体系结构的普遍性和更大的模型仍有验证。

Title: ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Authors: Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10419
Pdf URL: https://arxiv.org/pdf/2508.10419
Copy Paste: [[2508.10419]] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning(https://arxiv.org/abs/2508.10419)
Keywords: llm, long context
Abstract: Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at this https URL.
摘要：对长篇小说和小说的叙事理解是一个富有挑战性的领域，归因于其复杂的情节线，并纠缠不清，经常在人物和实体之间发展。考虑到LLM在扩展上下文和高计算成本上的推理减少，基于检索的方法在实践中仍然是关键作用。但是，由于其无状态的单步检索过程，传统的抹布方法可能会缺乏，这通常会忽略在远程上下文中捕获相互联系的动态性质。在这项工作中，我们提出了Comorag，认为叙事推理不是一个镜头的过程，而是一种动态的，不断发展的相互作用，在新的证据获取与过去的知识巩固之间，类似于与人类认知与大脑中与记忆相关的信号的推理。具体而言，在遇到推理僵局时，Comorag在与动态内存工作空间互动时会经历迭代推理周期。在每个周期中，它都会生成探测查询以设计新的探索路径，然后将检索到的新方面的证据集成到全球内存池中，从而支持查询分辨率相干上下文的出现。在四个具有挑战性的长篇小说叙事基准（200k+令牌）中，Comorag的表现优于强大的抹布基线，相对相对增长，与最强的基线相比，相对相对稳定的基线增长了11％。进一步的分析表明，Comorag对于需要全球理解的复杂查询尤其有利，为基于检索的长篇小说理解对国家推理提供了有原则的，认知动机的范式。我们的代码在此HTTPS URL上公开发布

Title: Evaluating LLMs on Chinese Idiom Translation

Authors: Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10421
Pdf URL: https://arxiv.org/pdf/2508.10421
Copy Paste: [[2508.10421]] Evaluating LLMs on Chinese Idiom Translation(https://arxiv.org/abs/2508.10421)
Keywords: language model, gpt, llm
Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F$_1$ scores of 0.68 for detecting idiom translation errors.
摘要：成语通常与他们的字面解释有所不同，在日常语言中很常见，尤其是在中文中，它们通常包含历史参考并遵循特定的结构模式。尽管使用大型语言模型的机器翻译最近取得了进展，但对中国成语翻译知之甚少。在这项工作中，我们介绍了Idiomeval，这是一个框架，具有全面的错误分类法对中国成语翻译。我们注释了来自九个现代系统的900个翻译对，包括GPT-4O和Google Translate，跨四个领域：Web，News，Wikipedia和社交媒体。我们发现这些系统在成语翻译中失败，产生不正确，文字，部分甚至缺失的翻译。表现最佳的系统GPT-4在28％的情况下犯了错误。我们还发现，现有的评估指标的质量质量很差，而皮尔逊相关性与人类评分低于0.48。因此，我们开发了改进的模型，以实现F $ _1 $分数为0.68，以检测成语翻译错误。

Title: Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Authors: Sandeep Reddy, Kabir Khan, Rohit Patil, Ananya Chakraborty, Faizan A. Khan, Swati Kulkarni, Arjun Verma, Neha Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10426
Pdf URL: https://arxiv.org/pdf/2508.10426
Copy Paste: [[2508.10426]] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints(https://arxiv.org/abs/2508.10426)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
摘要：大型语言模型（LLMS）受大量计算成本的限制。我们引入了一个“计算经济学”框架，该框架将LLM视为资源受限代理的内部经济（注意力头和神经元块），必须分配稀缺的计算以最大程度地提高任务效用。首先，我们从经验上表明，当计算稀缺时，标准LLMS在保持准确性的同时将注意力重新分配给高价值令牌。在这一观察结果的基础上，我们提出了一个激励驱动的培训范式，该范围通过可不同的计算成本术语来增加任务损失，从而鼓励稀疏和有效的激活。在胶水（MNLI，STS-B，Cola）和Wikitext-103上，该方法产生了一个模型家族，可追踪帕累托前沿，并始终如一地统治事后修剪。为了获得类似的精度，我们的拖鞋和较低的潜伏期大约降低了40％，以及更容易解释的注意力模式。这些结果表明，经济原则为在严格的资源限制下设计有效，适应性和更透明的LLM提供了原则上的途径。
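
The training objective described in the abstract above, a task loss augmented with a differentiable computation cost term, can be sketched as follows. The abstract does not give the exact form of the cost term, so everything here is an assumption: soft gate values in [0, 1] per attention head or neuron block, a per-unit cost, and a trade-off weight `lam`. The names `compute_cost` and `total_loss` are hypothetical.

```python
def compute_cost(gates, unit_costs):
    """Expected computation cost: sum of gate activation times unit cost.

    gates:      soft activation levels in [0, 1], one per head or block (assumed)
    unit_costs: cost of running each unit, e.g. in FLOPs (assumed)
    """
    if len(gates) != len(unit_costs):
        raise ValueError("gates and unit_costs must have the same length")
    return sum(g * c for g, c in zip(gates, unit_costs))


def total_loss(task_loss, gates, unit_costs, lam=0.1):
    """Task loss plus a weighted computation penalty (assumed linear form)."""
    return task_loss + lam * compute_cost(gates, unit_costs)


if __name__ == "__main__":
    # Three units: fully on, half on, switched off.
    print(total_loss(1.0, [1.0, 0.5, 0.0], [2.0, 2.0, 4.0], lam=0.1))
```

Because the penalty is linear in the gate values, it is differentiable with respect to them, so gradient descent can trade accuracy against computation. Sweeping `lam` would produce a family of models at different accuracy and cost points, which is one plausible way to trace a Pareto frontier like the one the abstract mentions.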

Title: DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales

Authors: Herun Wan, Jiaying Wu, Minnan Luo, Xiangzheng Kong, Zihan Ma, Zhi Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10444
Pdf URL: https://arxiv.org/pdf/2508.10444
Copy Paste: [[2508.10444]] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales(https://arxiv.org/abs/2508.10444)
Keywords: language model, hallucination, prompt, chain-of-thought
Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
摘要：从大型视觉语言模型（LVLM）中产生文本原理以支持可训练的多模式错误信息检测器已成为有希望的范式。但是，它的有效性在根本上受到三个核心挑战的限制：（i）由于幻觉而导致的事实不正确的多样性，（ii）引入噪声的事实不准确，（iii）无关或相互冲突的内容。我们介绍了Difar，这是一种检测器不足的框架，可产生多种，事实和相关的理由，以增强错误信息检测。 DIFAR采用了五个经过思考的提示来从LVLM中引起各种推理痕迹，并结合了一个轻量级的事后过滤模块，以根据句子级的事实和相关性得分选择理由句子。对四个流行基准测试的广泛实验表明，DIFAR的表现高出四个基线类别高达5.9％，并使现有探测器提高了多达8.7％。自动指标和人类评估都证实，DIFAR可显着提高所有三个维度的理由质量。
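
The post-hoc filtering step described in the abstract above can be sketched in a few lines. This is an illustration only: the abstract states that sentences are selected by sentence-level factuality and relevance scores, but not how the scores are computed or combined, so the threshold rule and the name `filter_rationale` are assumptions.

```python
def filter_rationale(sentences, factuality, relevance,
                     min_factuality=0.5, min_relevance=0.5):
    """Keep sentences whose factuality and relevance both pass a threshold.

    The two score lists are assumed to come from external scorers and to be
    aligned with `sentences`; the thresholds are hypothetical defaults.
    """
    if not (len(sentences) == len(factuality) == len(relevance)):
        raise ValueError("sentences and score lists must be aligned")
    return [
        sentence
        for sentence, fact, rel in zip(sentences, factuality, relevance)
        if fact >= min_factuality and rel >= min_relevance
    ]


if __name__ == "__main__":
    kept = filter_rationale(
        ["The photo shows a flood.", "The author is famous.", "Aliens did it."],
        factuality=[0.9, 0.8, 0.1],
        relevance=[0.8, 0.2, 0.9],
    )
    print(kept)
```

Requiring both scores to pass removes sentences that are factual but off-topic as well as sentences that are on-topic but unsupported, which matches the second and third challenges listed in the abstract.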

Title: When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models

Authors: Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10552
Pdf URL: https://arxiv.org/pdf/2508.10552
Copy Paste: [[2508.10552]] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models(https://arxiv.org/abs/2508.10552)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.
摘要：多模式的大语言模型（MLLM）在多种多模式任务中表现出了显着的功能。但是，这些模型遭受了称为文本优势的核心问题：它们在很大程度上取决于其推论，同时不足以实现其他方式。虽然先前的工作已经在视觉任务中承认了这种现象，但通常将其归因于数据偏见或模型架构。在本文中，我们对各种数据模式的文本优势进行了首次系统研究，包括图像，视频，音频，时间序列和图形。为了衡量这种不平衡，我们提出了两个评估指标：模式优势指数（MDI）和注意效率指数（AEI）。我们的全面分析表明，在所有测试方式中，文本优势既重要又普遍。我们的深入分析确定了三个根本原因：非文本模式中严重令牌冗余的注意力稀释，融合体系结构设计的影响以及暗中偏爱文本输入的任务配方。此外，我们提出了一种简单的令牌压缩方法，可以有效地重新平衡模型的关注。例如，将此方法应用于LLAVA-7B，将其MDI急剧从10.23降低到平衡值为0.86。我们的分析和方法学框架为开发更公平和全面的多模式模型提供了基础。

Title: eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM

Authors: Irma Heithoff, Marc Guggenberger, Sandra Kalogiannis, Susanne Mayer, Fabian Maag, Sigurd Schacht, Carsten Lanquillon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10553
Pdf URL: https://arxiv.org/pdf/2508.10553
Copy Paste: [[2508.10553]] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM(https://arxiv.org/abs/2508.10553)
Keywords: language model, gpt, llm
Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform's technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.
摘要：本文介绍了一项有关欧洲深度推理结构（EDIF）的可行性研究，这是一种符合NDIF兼容的基础设施，旨在支持大语模型上的机械性可解释性研究。欧洲LLM可解释性基础设施的广泛可及性的需求推动了这项主动性使研究界的先进模型分析能力民主化。该项目介绍了一个基于GPU的集群，该集群在Ansbach Applied Sciences大学托管，并与合作伙伴机构互连，并通过NNSight API启用远程模型检查。一项涉及来自欧洲的16位研究人员的结构化试验研究评估了该平台的技术性能，可用性和科学实用性。用户对包括GPT-2和DeepSeek-R1-70B在内的模型进行了干预措施，例如激活修补，因果追踪和表示分析。该研究表明，用户参与度逐渐增加，整个平台性能稳定，以及对远程实验能力的积极接受。它还标志着在平台周围建立用户社区的起点。确定的限制，例如激活数据的延长下载持续时间以及间歇性执行中断，以供未来开发的路线图中解决。该计划标志着欧洲LLM可解释性基础设施的广泛可及性迈出的重要一步，并为更广泛的部署，扩大工具扩大工具以及在机械解释性研究中持续持续的社区协作奠定了基础。

Title: Learning from Natural Language Feedback for Personalized Question Answering

Authors: Alireza Salemi, Hamed Zamani
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.10695
Pdf URL: https://arxiv.org/pdf/2508.10695
Copy Paste: [[2508.10695]] Learning from Natural Language Feedback for Personalized Question Answering(https://arxiv.org/abs/2508.10695)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
摘要：个性化对于增强语言技术的有效性和用户满意度至关重要，尤其是在寻求信息的任务中，例如问答。当前的个性化大语模型（LLM）的方法通常依赖于检索功能增强的一代（RAG），然后使用标量奖励信号进行加强学习，以教导模型如何使用检索到的个人环境。我们认为，这些标量奖励有时会提供弱，非教学反馈，从而限制学习效率和个性化质量。我们介绍了VAC，这是一个针对个性化响应生成的新型框架，它用自然语言反馈（NLF）代替了标量奖励，这些反馈（NLF）是根据用户资料和问题叙述而生成的。 NLF充当了丰富且可操作的监督信号，允许策略模型迭代地完善其输出并内化有效的个性化策略。培训在优化反馈模型和对改进的响应的策略模型之间进行交替，从而导致了不再需要推断反馈的策略模型。对LAMP-QA基准的评估由三个不同的领域组成，表明对最新结果的一致和显着改善。人类评估进一步证实了生成的反应的优越质量。这些结果表明，NLF提供了更有效的信号来优化个性化的问题答案。

Title: Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs

Authors: Xiangqi Jin, Yuxuan Wang, Yifeng Gao, Zichen Wen, Biqing Qi, Dongrui Liu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10736
Pdf URL: https://arxiv.org/pdf/2508.10736
Copy Paste: [[2508.10736]] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs(https://arxiv.org/abs/2508.10736)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE's effectiveness, achieving up to 17.29% accuracy improvement with 4.12$\times$ speedup on GSM8K, and up to 276.67$\times$ acceleration on MMLU while maintaining competitive performance.
摘要：尽管大型语言模型（LLM）取得了显着的成功，但它们的前缀仅提示范式，顺序生成过程为双向信息提供了有限的灵活性。扩散大语言模型（DLLM）通过双向注意机制和迭代精致过程提供了新的机会，从而实现了更灵活的就地促进策略。我们介绍了ICE（就在现场链链及早出口而提示），这是一个新颖的框架，可将仅前缀的提示转换为专门为DLLM设计的就地提示。 ICE在迭代精致过程中直接在掩盖的令牌位置内部集成了现场提示，并采用了信心感知到的早期退出机制，以显着减少计算开销。广泛的实验证明了ICE的有效性，在GSM8K上以4.12 $ \ times $速度的速度提高了高达17.29％的精度，并在保持竞争性能的同时，MMLU上的加速度高达276.67 $ \ times $加速。

Title: Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Authors: Osama Mohammed Afzal, Preslav Nakov, Tom Hope, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10795
Pdf URL: https://arxiv.org/pdf/2508.10795
Copy Paste: [[2508.10795]] Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback(https://arxiv.org/abs/2508.10795)
Keywords: llm
Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
摘要：新颖性评估是同行评审的中心研究，尤其是在诸如NLP之类的高批量领域，审阅者能力越来越紧张。我们提出了一种用于自动化新颖性评估的结构化方法，该方法通过三个阶段进行了专家审阅者的行为进行建模：从提交中提取内容，相关工作的检索和综合以及基于证据评估的结构化比较。我们的方法是通过对人类书面新颖性评论进行的大规模分析来告知的，并捕获了关键模式，例如独立索赔验证和上下文推理。对182 ICLR 2025的评估，该方法与人类注释的新颖性评估进行了评估，该方法与人类推理的一致性达到86.5％，并就新颖性结论达成了75.3％的一致性 - 实质上优于现有的基于LLM的基准。该方法产生了详细的文献意识分析，并提高了对临时审阅者判断的一致性。这些结果突出了结构化LLM辅助方法的潜力，以支持更严格和透明的同行评审，而不会取代人类专业知识。数据和代码可用。

Title: Reinforced Language Models for Sequential Decision Making

Authors: Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10839
Pdf URL: https://arxiv.org/pdf/2508.10839
Copy Paste: [[2508.10839]] Reinforced Language Models for Sequential Decision Making(https://arxiv.org/abs/2508.10839)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
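Summary: Large language models (LLMs) show potential as sequential decision-making agents, but their application is often limited by a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in the formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement the algorithm with a novel absolute-advantage-weighted episode sampling strategy, which we show improves training performance. We evaluate the approach by post-training a 3-billion-parameter model on Snake and Frozen Lake. Our experiments show that the method is effective at improving decision-making performance: the post-trained 3B-parameter model outperforms a 72B-parameter baseline by 50% on the Frozen Lake task. This work shows that targeted post-training is a practical and efficient alternative to relying on model scale when building sequential decision-making agents with LLMs.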
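
The credit-assignment rule and the sampling strategy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual GRPO-style normalization (group mean and standard deviation of episode returns), which the abstract does not spell out, and it assumes episodes are sampled with replacement in proportion to the absolute value of their advantage. The function names and the `eps` parameter are hypothetical.

```python
import random
from statistics import mean, pstdev


def group_relative_step_advantages(episode_returns, episode_lengths, eps=1e-8):
    """Give every step of an episode the same group-relative advantage.

    Assumption: the advantage is the episode's cumulative reward normalized
    by the group mean and standard deviation, as in standard GRPO.
    """
    mu = mean(episode_returns)
    sigma = pstdev(episode_returns)
    advantages = []
    for ret, length in zip(episode_returns, episode_lengths):
        adv = (ret - mu) / (sigma + eps)
        # The whole-episode signal is copied to each step of that episode.
        advantages.append([adv] * length)
    return advantages


def sample_episodes_by_abs_advantage(episode_advantages, k, rng=None):
    """Sample k episode indices with probability proportional to |advantage|.

    Assumption: sampling is with replacement; falls back to uniform weights
    when every advantage is zero.
    """
    rng = rng or random.Random(0)
    weights = [abs(a) for a in episode_advantages]
    if sum(weights) == 0:
        weights = [1.0] * len(weights)
    return rng.choices(range(len(weights)), weights=weights, k=k)
```

For example, `group_relative_step_advantages([1.0, 0.0], [3, 2])` yields a positive advantage repeated three times for the first episode and a negative one repeated twice for the second.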

Title: Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Authors: Chongyuan Dai, Jinpeng Hu, Hongchang Shi, Zhuo Li, Xun Yang, Meng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10848
Pdf URL: https://arxiv.org/pdf/2508.10848
Copy Paste: [[2508.10848]] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning(https://arxiv.org/abs/2508.10848)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
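Summary: Amid a shortage of qualified mental health professionals, integrating large language models (LLMs) into psychological applications offers a promising way to ease the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, whereas research in the psychological domain has mainly emphasized emotional support and empathetic dialogue, with limited attention to the reasoning mechanisms that help produce reliable responses. In this paper we therefore propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built on a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. We then adopt a hybrid training strategy: challenging samples are identified through a multi-LLM cross-selection strategy and used for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to strengthen empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where the 7B Psyche-R1 achieves results comparable to the 671B DeepSeek-R1.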

Title: SSRL: Self-Search Reinforcement Learning

Authors: Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10874
Pdf URL: https://arxiv.org/pdf/2508.10874
Copy Paste: [[2508.10874]] SSRL: Self-Search Reinforcement Learning(https://arxiv.org/abs/2508.10874)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
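Summary: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs through structured prompting and repeated sampling, which we term Self-Search. Our results show that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances the Self-Search capability of LLMs through format-based and rule-based rewards. SSRL lets models iteratively refine their knowledge utilization internally, without access to external tools. Empirical evaluations show that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.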
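
The pass@k metric mentioned in this abstract can be computed from repeated samples with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a question and c is the number of correct ones. The sketch below is illustrative only. The abstract does not say which estimator the authors use, so treating this one as theirs is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a question
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all questions, evaluated at increasing values of k to trace how performance scales with the inference budget.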

Title: A Survey on Diffusion Language Models

Authors: Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10875
Pdf URL: https://arxiv.org/pdf/2508.10875
Copy Paste: [[2508.10875]] A Survey on Diffusion Language Models(https://arxiv.org/abs/2508.10875)
Keywords: language model
Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at this https URL.
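Summary: Diffusion language models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs have inherent advantages in reducing inference latency and capturing bidirectional context, which enables fine-grained control over generation. While achieving a several-fold speed-up, recent advances have allowed DLMs to reach performance comparable to their autoregressive counterparts, making them a compelling choice for a range of natural language processing tasks. This survey gives a holistic overview of the current DLM landscape. We trace its evolution and its relationship to other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. A further contribution is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and describe their applications across practical scenarios. Finally, we discuss the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, and outline future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at this https URL.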