2025-05-26

Title: Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge

Authors: Dimitri Schreiter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17037
Pdf URL: https://arxiv.org/pdf/2505.17037
Copy Paste: [[2505.17037]] Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge(https://arxiv.org/abs/2505.17037)
Keywords: language model, llm, prompt
Abstract: Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science and mathematics), medicine, and law, remains underexplored. This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while increasing the specificity of prompts does not, in general, have a significant impact, there appears to be a specificity range, across all considered models, where the LLM performs best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.

Title: Signals from the Floods: AI-Driven Disaster Analysis through Multi-Source Data Fusion

Authors: Xian Gong, Paul X. McCarthy, Lin Tian, Marian-Andrei Rizoiu
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.17038
Pdf URL: https://arxiv.org/pdf/2505.17038
Copy Paste: [[2505.17038]] Signals from the Floods: AI-Driven Disaster Analysis through Multi-Source Data Fusion(https://arxiv.org/abs/2505.17038)
Keywords: language model, llm
Abstract: Massive and diverse web data are increasingly vital for government disaster response, as demonstrated by the 2022 floods in New South Wales (NSW), Australia. This study examines how X (formerly Twitter) and public inquiry submissions provide insights into public behaviour during crises. We analyse more than 55,000 flood-related tweets and 1,450 submissions to identify behavioural patterns during extreme weather events. While social media posts are short and fragmented, inquiry submissions are detailed, multi-page documents offering structured insights. Our methodology integrates Latent Dirichlet Allocation (LDA) for topic modelling with Large Language Models (LLMs) to enhance semantic understanding. LDA reveals distinct opinions and geographic patterns, while LLMs improve filtering by identifying flood-relevant tweets using public submissions as a reference. This Relevance Index method reduces noise and prioritizes actionable content, improving situational awareness for emergency responders. By combining these complementary data streams, our approach introduces a novel AI-driven method to refine crisis-related social media content, improve real-time disaster response, and inform long-term resilience planning.

Title: VLM-KG: Multimodal Radiology Knowledge Graph Generation

Authors: Abdullah Abdullah, Seong Tae Kim
Subjects: cs.CL, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17042
Pdf URL: https://arxiv.org/pdf/2505.17042
Copy Paste: [[2505.17042]] VLM-KG: Multimodal Radiology Knowledge Graph Generation(https://arxiv.org/abs/2505.17042)
Keywords: language model
Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.

Title: Assessing GPT's Bias Towards Stigmatized Social Groups: An Intersectional Case Study on Nationality Prejudice and Psychophobia

Authors: Afifah Kashif, Heer Patel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17045
Pdf URL: https://arxiv.org/pdf/2505.17045
Copy Paste: [[2505.17045]] Assessing GPT's Bias Towards Stigmatized Social Groups: An Intersectional Case Study on Nationality Prejudice and Psychophobia(https://arxiv.org/abs/2505.17045)
Keywords: language model, gpt, llm, prompt
Abstract: Recent studies have separately highlighted significant biases within foundational large language models (LLMs) against certain nationalities and stigmatized social groups. This research investigates the ethical implications of these biases intersecting with outputs of widely used GPT-3.5/4/4o LLMs. Through structured prompt series, we evaluate model responses to several scenarios involving American and North Korean nationalities with various mental disabilities. Findings reveal significant discrepancies in empathy levels, with North Koreans facing greater negative bias, particularly when mental disability is also a factor. This underscores the need for improvements in LLMs designed with a nuanced understanding of intersectional identity.

Title: Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

Authors: Erin Palm, Astrit Manikantan, Mark E. Pepin, Herprit Mahal, Srikanth Subramanya Belwadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17047
Pdf URL: https://arxiv.org/pdf/2505.17047
Copy Paste: [[2505.17047]] Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe(https://arxiv.org/abs/2505.17047)
Keywords: language model, llm
Abstract: In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM)-generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI-generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM-authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter-rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter-rater agreement in pediatrics and cardiology. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM-authored notes, as compared to human-authored notes.
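The RWG thresholds quoted above (0.5 to 0.7 moderate, above 0.7 high) can be made concrete with a small calculation. The sketch below uses the standard single-item r_WG index of James, Demaree and Wolf, which compares the observed variance of the raters' scores with the variance expected if raters answered uniformly at random. The abstract does not say which r_WG variant or null distribution the authors used, so the 5-point scale, the uniform null, and the function name `rwg` are assumptions made for illustration only.

```python
from statistics import variance


def rwg(ratings, scale_points):
    """Single-item r_WG: 1 - observed variance / expected random variance.

    Assumption (not from the abstract): the null distribution is uniform
    over `scale_points` discrete options, whose variance is (A^2 - 1) / 12.
    """
    if len(ratings) < 2:
        raise ValueError("r_WG needs at least two ratings")
    expected_var = (scale_points ** 2 - 1) / 12.0
    observed_var = variance(ratings)  # sample variance (n - 1 denominator)
    return 1.0 - observed_var / expected_var


# Hypothetical scores from two evaluators rating the same note on a 5-point item.
print(rwg([4, 5], scale_points=5))  # one point apart
print(rwg([4, 4], scale_points=5))  # identical scores
```

With two raters on a 5-point scale, a one-point disagreement still lands above the 0.7 threshold, while a two-point disagreement drives the index to zero.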

Title: Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

Authors: Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rutwik Routu, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, Harsit Mittal, Sudheer Chava
Subjects: cs.CL, cs.AI, cs.CY, q-fin.CP, q-fin.GN
Abstract URL: https://arxiv.org/abs/2505.17048
Pdf URL: https://arxiv.org/pdf/2505.17048
Copy Paste: [[2505.17048]] Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally(https://arxiv.org/abs/2505.17048)
Keywords: language model, llm
Abstract: Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank's data, confirming the principle "the whole is greater than the sum of its parts." Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework's economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.

Title: Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations

Authors: David Rozado
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.17049
Pdf URL: https://arxiv.org/pdf/2505.17049
Copy Paste: [[2505.17049]] Gender and Positional Biases in LLM-Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations(https://arxiv.org/abs/2505.17049)
Keywords: language model, llm, prompt
Abstract: This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered name cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers "Candidate A" and "Candidate B", several models displayed a preference to select "Candidate A". Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate's name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.
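The name-swapping design described above can be illustrated with a short sketch: each CV pair yields two presentations with the names exchanged, and the recorded choices are then tallied once by name gender and once by list position, so the two effects can be read separately. The function names, the trial encoding, and the sample data are hypothetical; the abstract does not describe the authors' actual tooling.

```python
def counterbalanced_prompts(cv_one, cv_two, male_name, female_name):
    """Return both presentations of one CV pair, with the names swapped."""
    return [
        [(male_name, cv_one), (female_name, cv_two)],
        [(female_name, cv_one), (male_name, cv_two)],
    ]


def selection_rates(trials):
    """Tally choices by name gender and by list position.

    Each trial is (chosen_position, female_position), both 0 or 1.
    This encoding is an assumption made for the sketch.
    """
    n = len(trials)
    female = sum(1 for chosen, fem in trials if chosen == fem)
    first = sum(1 for chosen, _ in trials if chosen == 0)
    return {"female_rate": female / n, "first_listed_rate": first / n}


# Made-up outcomes for four trials, for illustration only.
trials = [(0, 0), (0, 1), (1, 1), (0, 0)]
print(selection_rates(trials))
```

A rate near 0.5 on either axis would indicate no preference on that axis; the abstract reports deviations on both.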

Title: Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning

Authors: Yanhao Jia, Xinyi Wu, Qinglin Zhang, Yiran Qin, Luwei Xiao, Shuai Zhao
Subjects: cs.CL, cs.AI, cs.CE, cs.CY, cs.MM
Abstract URL: https://arxiv.org/abs/2505.17050
Pdf URL: https://arxiv.org/pdf/2505.17050
Copy Paste: [[2505.17050]] Towards Robust Evaluation of STEM Education: Leveraging MLLMs in Project-Based Learning(https://arxiv.org/abs/2505.17050)
Keywords: language model, llm, hallucination, agent
Abstract: Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
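For readers unfamiliar with the Analytic Hierarchy Process mentioned above, the core step is turning a matrix of pairwise expert judgments ("criterion i is k times as important as criterion j") into a weight vector, conventionally the principal eigenvector of that matrix. The sketch below approximates it by power iteration in plain Python. The three criteria and the judgment values are invented for illustration, and nothing here reflects the specific criteria or procedure used for PBLBench.

```python
def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector of a pairwise comparison matrix.

    matrix[i][j] states how many times more important criterion i is than
    criterion j; the returned weights are normalised to sum to 1.
    """
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    return weights


# Hypothetical judgments over three criteria (say correctness, clarity, creativity).
comparisons = [
    [1.0, 2.0, 6.0],
    [1 / 2, 1.0, 3.0],
    [1 / 6, 1 / 3, 1.0],
]
print(ahp_weights(comparisons))
```

This example matrix is perfectly consistent, so the weights are exactly proportional to any column; real expert judgments are usually slightly inconsistent, which is why AHP is normally paired with a consistency check.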

Title: Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models

Authors: Bernd Huber, Ghazal Fazelnia, Andreas Damianou, Sebastian Peleato, Max Lefarov, Praveen Ravichandran, Marco De Nadai, Mounia Lalmas-Roellke, Paul N. Bennett
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17051
Pdf URL: https://arxiv.org/pdf/2505.17051
Copy Paste: [[2505.17051]] Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models(https://arxiv.org/abs/2505.17051)
Keywords: language model, llm, prompt, chat
Abstract: Large language models (LLMs) excel at generating contextually relevant content. However, tailoring these outputs to individual users for effective personalization is a significant challenge. While rich user-specific information often exists as pre-existing user representations, such as embeddings learned from preferences or behaviors, current methods to leverage these for LLM personalization typically require costly fine-tuning or token-heavy prompting. We propose Embedding-to-Prefix (E2P), a parameter-efficient method that injects pre-computed context embeddings into an LLM's hidden representation space through a learned projection to a single soft token prefix. This enables effective personalization while keeping the backbone model frozen and avoiding expensive adaptation techniques. We evaluate E2P across two public datasets and in a production setting: dialogue personalization on Persona-Chat, contextual headline generation on PENS, and large-scale personalization for music and podcast consumption. Results show that E2P preserves contextual signals and achieves strong performance with minimal computational overhead, offering a scalable, efficient solution for contextualizing generative AI systems.
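The mechanism described above, a learned projection from a pre-computed embedding to a single soft token prefix, can be sketched in a few lines. The version below assumes the projection is a single linear layer and uses plain Python lists in place of tensors; the abstract does not specify the projection's form, so that choice, the function names, and all dimensions and values are assumptions for illustration. In a real setting only the projection parameters would be trained while the LLM stays frozen.

```python
def project_to_prefix(user_embedding, weight, bias):
    """Map a user embedding to one vector in the model's hidden space.

    weight has shape (hidden_dim, user_dim); a linear map is an assumption.
    """
    return [
        sum(w * x for w, x in zip(row, user_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def prepend_prefix(prefix, token_embeddings):
    """Place the soft token in front of the ordinary token embeddings."""
    return [prefix] + list(token_embeddings)


# Toy sizes: user_dim = 2, hidden_dim = 3, two prompt tokens.
user_embedding = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

prefix = project_to_prefix(user_embedding, weight, bias)
sequence = prepend_prefix(prefix, tokens)
print(prefix, len(sequence))
```

The point of the design is cost: the personalization signal occupies one position in the sequence rather than many prompt tokens, and the backbone needs no fine-tuning.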

Title: SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Authors: Jinwoo Park, Seunggeun Cho, Dongsu Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17052
Pdf URL: https://arxiv.org/pdf/2505.17052
Copy Paste: [[2505.17052]] SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs(https://arxiv.org/abs/2505.17052)
Keywords: language model, llm
Abstract: Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
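The speculative decoding scheme that SpecEdge builds on can be illustrated with a greedy toy version: a cheap draft model (here standing in for the edge GPU) proposes several tokens, and the target model (standing in for the server) keeps the longest prefix it agrees with, then supplies one token of its own. This is a generic sketch of greedy speculative decoding, not SpecEdge's protocol; the two toy "models" are hypothetical functions, and a real server would verify all drafted tokens in one batched forward pass rather than in a loop.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one draft-then-verify round and return the tokens to append."""
    # Draft phase: the small model proposes k tokens autoregressively.
    context = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # Verify phase: accept drafted tokens until the first disagreement.
    context = list(prefix)
    accepted = []
    for token in draft:
        expected = target_next(context)
        if expected != token:
            accepted.append(expected)  # replace the first wrong token
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # all accepted: one extra token
    return accepted


def toy_target(context):
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context):
    """Hypothetical draft model: agrees with the target except after a 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=4))
```

Only token ids cross between the two sides in this scheme, which is consistent with the abstract's point about exchanging only token outputs over the network.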

Title: Social preferences with unstable interactive reasoning: Large language models in economic trust games

Authors: Ou Jiamin, Eikmans Emile, Buskens Vincent, Pankowska Paulina, Shan Yuli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17053
Pdf URL: https://arxiv.org/pdf/2505.17053
Copy Paste: [[2505.17053]] Social preferences with unstable interactive reasoning: Large language models in economic trust games(https://arxiv.org/abs/2505.17053)
Keywords: language model, gpt, llm, prompt, chat
Abstract: While large language models (LLMs) have demonstrated remarkable capabilities in understanding human languages, this study explores how they translate this understanding into social exchange contexts that capture certain essences of real-world human interactions. Three LLMs - ChatGPT-4, Claude, and Bard - were placed in economic trust games where players balance self-interest with trust and reciprocity, making decisions that reveal their social preferences and interactive reasoning abilities. Our study shows that LLMs deviate from pure self-interest and exhibit trust and reciprocity even without being prompted to adopt a specific persona. In the simplest one-shot interaction, LLMs emulated how human players place trust at the beginning of such a game. Larger human-machine divergences emerged in scenarios involving trust repayment or multi-round interactions, where decisions were influenced by both social preferences and interactive reasoning. LLM responses varied significantly when prompted to adopt personas like selfish or unselfish players, with the impact outweighing differences between models or game types. Responses of ChatGPT-4, in an unselfish or neutral persona, showed the highest trust and reciprocity, surpassing humans, Claude, and Bard. Claude and Bard displayed trust and reciprocity levels that sometimes exceeded and sometimes fell below human choices. When given selfish personas, all LLMs showed lower trust and reciprocity than humans. Interactive reasoning about the actions of counterparts or changing game mechanics appeared to be random rather than a stable, reproducible characteristic of LLM responses, though some improvements were observed when ChatGPT-4 responded in selfish or unselfish personas.

Title: Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

Authors: Luoxi Tang, Tharunya Sundar, Shuai Yang, Ankita Patra, Manohar Chippada, Giqi Zhao, Yi Li, Riteng Zhang, Tunan Zhao, Ting Yang, Yuqiao Meng, Weicheng Ma, Zhaohan Xi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17056
Pdf URL: https://arxiv.org/pdf/2505.17056
Copy Paste: [[2505.17056]] Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective(https://arxiv.org/abs/2505.17056)
Keywords: language model, llm
Abstract: AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.

Title: DO-RAG: A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation

Authors: David Osei Opoku, Ming Sheng, Yong Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17058
Pdf URL: https://arxiv.org/pdf/2505.17058
Copy Paste: [[2505.17058]] DO-RAG: A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation(https://arxiv.org/abs/2505.17058)
Keywords: hallucination, retrieval-augmented generation, chain-of-thought, agent
Abstract: Domain-specific QA systems require not just generative fluency but high factual accuracy grounded in structured expert knowledge. While recent Retrieval-Augmented Generation (RAG) frameworks improve context recall, they struggle with integrating heterogeneous data and maintaining reasoning consistency. To address these challenges, we propose DO-RAG, a scalable and customizable hybrid QA framework that integrates multi-level knowledge graph construction with semantic vector retrieval. Our system employs a novel agentic chain-of-thought architecture to extract structured relationships from unstructured, multimodal documents, constructing dynamic knowledge graphs that enhance retrieval precision. At query time, DO-RAG fuses graph and vector retrieval results to generate context-aware responses, followed by hallucination mitigation via grounded refinement. Experimental evaluations in the database and electrical domains show near-perfect recall and over 94% answer relevancy, with DO-RAG outperforming baseline frameworks by up to 33.38%. By combining traceability, adaptability, and performance efficiency, DO-RAG offers a reliable foundation for multi-domain, high-precision QA at scale.
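The abstract says DO-RAG fuses graph and vector retrieval results at query time but does not say how. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common generic way to merge two ranked lists: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. RRF is a substitute chosen for illustration, not the paper's method, and the document ids and the constant k = 60 are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a conventional default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a knowledge-graph lookup and a vector search.
graph_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([graph_hits, vector_hits]))
```

Documents found by both retrievers rise to the top, which is the general behaviour a hybrid graph-plus-vector pipeline is after.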

Title: Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large

Authors: Van-Tinh Nguyen, Hoang-Duong Pham, Thanh-Hai To, Cong-Tuan Hung Do, Thi-Thu-Trang Dong, Vu-Trung Duong Le, Van-Phuc Hoang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17059
Pdf URL: https://arxiv.org/pdf/2505.17059
Copy Paste: [[2505.17059]] Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large(https://arxiv.org/abs/2505.17059)
Keywords: gpt
Abstract: Understanding medical texts presents significant challenges due to complex terminology and context-specific language. This paper introduces Medalyze, an AI-powered application designed to enhance the comprehension of medical texts using three specialized FLAN-T5-Large models. These models are fine-tuned for (1) summarizing medical reports, (2) extracting health issues from patient-doctor conversations, and (3) identifying the key question in a passage. Medalyze is deployed across a web and mobile platform with real-time inference, leveraging scalable API and YugabyteDB. Experimental evaluations demonstrate the system's superior summarization performance over GPT-4 in domain-specific tasks, based on metrics like BLEU, ROUGE-L, BERTScore, and SpaCy Similarity. Medalyze provides a practical, privacy-preserving, and lightweight solution for improving information accessibility in healthcare.

Title: SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17060
Pdf URL: https://arxiv.org/pdf/2505.17060
Copy Paste: [[2505.17060]] SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation(https://arxiv.org/abs/2505.17060)
Keywords: llm
Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and is highly competitive with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the repository at this https URL.
摘要：为了实现流体和天然的人机语音互动，现有的全双工对话系统通常采用具有辅助组件的模块化体系结构，例如语音活动探测器，互动器，对话状态预测变量或多个LLM。但是，这些系统遇到了跨模块的错误积累，并在关键挑战中挣扎，例如与上下文有关的驳船和回声取消。最近的方法，最著名的是Moshi，通过将音频编解码器注入单个LLM的令牌空间来简化管道。但是，这种方法在语音上而不是文本方式时仍会产生显着的性能降解。在本文中，我们介绍了Salmonn-Omni，这是第一张独立的完整语音LLM，在其代币空间中无音频编解码器运行。它具有LLM主链中的新型动态思维机制，使该模型能够学习何时在说话和听力状态之间过渡。对广泛使用的基准测试的实验和开放域的对话表明，尽管使用较少的训练数据，但与现有开源全双工模型相比，Salmonn-OMNI至少在现有开源全双工模型中取得了至少30 \％的相对性能改善，并高度竞争性地对半双链和转弯的系统执行。此外，Salmonn-Omni在复杂的对话场景中表现出强劲的表现，包括转弯，回声，回声取消和与上下文有关的驳船，并通过强化学习取得了进一步的改进。在以下存储库中提供了HTTPS URL的以下存储库中提供了一些演示对话。

Title: Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Authors: Xinlong Chen, Yuanxing Zhang, Qiang Liu, Junfei Wu, Fuzheng Zhang, Tieniu Tan
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17061
Pdf URL: https://arxiv.org/pdf/2505.17061
Copy Paste: [[2505.17061]] Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models(https://arxiv.org/abs/2505.17061)
Keywords: language model, hallucination
Abstract: Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, in order to judge whether that attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at this https URL.
摘要：大型视觉模型（LVLM）在各种视觉任务中表现出令人印象深刻的功能，但由于幻觉的持续挑战而阻碍了它们。为了解决这个关键问题，我们提出了解码（MOD）的混合物，这是一种新颖的减轻幻觉方法，通过评估模型对图像令牌的正确性来动态调整解码策略。具体而言，MOD衡量了从原始图像令牌产生的输出与从模型的图像令牌得出的输出之间的一致性，以区分上述正确性。如果输出是一致的，表示正确的注意力，MOD会采用互补策略来扩大关键信息。相反，如果输出不一致，提出错误的关注，则模仿使用对比策略来抑制误导信息。广泛的实验表明，MOD在多个主流基准中显着胜过现有的解码方法，从而有效地减轻了LVLM中的幻觉。该代码可在此HTTPS URL上找到。
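
The branching logic described in the MoD abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the consistency measure or the combination rules, so top-token agreement and simple linear mixing of logits are used here as stand-ins, and the function name `mixture_of_decoding` and the weight `alpha` are hypothetical.

```python
def mixture_of_decoding(logits_orig, logits_attended, alpha=0.5):
    """Pick a decoding strategy from the agreement of two logit vectors.

    logits_orig: next-token logits computed from the original image tokens.
    logits_attended: logits computed from the attended image tokens only.
    Both the agreement test and the mixing rules are assumptions.
    """
    top_orig = max(range(len(logits_orig)), key=logits_orig.__getitem__)
    top_att = max(range(len(logits_attended)), key=logits_attended.__getitem__)

    if top_orig == top_att:
        # Consistent outputs: treat attention as correct and amplify it.
        mode = "complementary"
        combined = [o + alpha * a for o, a in zip(logits_orig, logits_attended)]
    else:
        # Inconsistent outputs: treat attention as misleading and subtract it.
        mode = "contrastive"
        combined = [(1 + alpha) * o - alpha * a
                    for o, a in zip(logits_orig, logits_attended)]
    return mode, combined
```

Calling `mixture_of_decoding([2.0, 1.0], [3.0, 0.0])` takes the complementary branch because both vectors agree on the top token, while `mixture_of_decoding([2.0, 1.0], [0.0, 3.0])` takes the contrastive branch.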

Title: Synthetic Data RL: Task Definition Is All You Need

Authors: Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17063
Pdf URL: https://arxiv.org/pdf/2505.17063
Copy Paste: [[2505.17063]] Synthetic Data RL: Task Definition Is All You Need(https://arxiv.org/abs/2505.17063)
Keywords: llm
Abstract: Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at this https URL.
摘要：强化学习（RL）是一种将基础模型适应专业任务的有力方法，但它依赖于大规模的人类标记的数据限制了广泛采用。我们介绍了合成数据RL，这是一个简单而通用的框架，该框架仅使用从任务定义生成的合成数据加强微型模型。我们的方法首先从任务定义中生成问题和答案对并检索文档，然后根据模型解决性来调整问题的难度，并使用模型的平均通过率在RL培训中选择问题。在QWEN-2.5-7B上，我们的方法比GSM8K的基本模型获得了29.2％的绝对改善（+2.9 pp vs.指导，+6.6 pp vs.自我实例），数学为8.7％，在GPQA上为13.1％，GPQA（+7.0 pp vs.synthllm and MedSthllm）和Med on med on naw.9％ons on c.7％ons on c.7％onsca（17.7％）（17.7％on ca a（17.7％）。 CFA（财务）。它在相同的数据预算下超过了监督的微调，几乎与RL与数据集的完整人类数据匹配（例如，GSM8K上的+17.2 pp）。增加100个人类示范仅将GSM8K的性能提高了0.4 pp，显示出有限的附加值。通过减少人类数据注释，综合数据RL可实现可扩展有效的基于RL的模型适应。代码和演示可在此HTTPS URL上找到。
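
The pass-rate-based question selection mentioned in the Synthetic Data RL abstract might look roughly like the sketch below. The abstract only states that questions are selected using the model's average pass rate across samples; the band `[low, high]` and the helper names are hypothetical choices made for illustration.

```python
def pass_rate(outcomes):
    """Fraction of sampled answers that were judged correct."""
    return sum(outcomes) / len(outcomes)


def select_questions(results, low=0.2, high=0.8):
    """Keep questions whose average pass rate falls inside [low, high].

    results maps a question id to a list of booleans, one per sampled answer.
    The thresholds are assumptions: the idea is to drop questions the model
    always solves or never solves, since they give little training signal.
    """
    return [q for q, outcomes in results.items()
            if low <= pass_rate(outcomes) <= high]
```

For example, a question solved in 2 of 4 samples is kept, while questions solved in 0 of 4 or 4 of 4 samples are dropped.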

Title: Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases

Authors: Valentina Carbonari, Pierangelo Veltri, Pietro Hiram Guzzi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17065
Pdf URL: https://arxiv.org/pdf/2505.17065
Copy Paste: [[2505.17065]] Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases(https://arxiv.org/abs/2505.17065)
Keywords: language model, llm, agent
Abstract: Recent advances in artificial intelligence, particularly large language models (LLMs), have shown promising capabilities in transforming rare disease research. This survey paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies that leverage textual data to uncover insights and patterns critical for diagnosis, treatment, and patient care. While current research predominantly employs textual data, the potential for multimodal data integration combining genetic, imaging, and electronic health records stands as a promising frontier. We review foundational papers that demonstrate the application of LLMs in identifying and extracting relevant medical information, simulating intelligent conversational agents for patient interaction, and enabling the formulation of accurate and timely diagnoses. Furthermore, this paper discusses the challenges and ethical considerations inherent in deploying LLMs, including data privacy, model transparency, and the need for robust, inclusive data sets. As part of this exploration, we present a section on experimentation that utilizes multiple LLMs alongside structured questionnaires, specifically designed for diagnostic purposes in the context of different diseases. We conclude with future perspectives on the evolution of LLMs towards truly multimodal platforms, which would integrate diverse data types to provide a more comprehensive understanding of rare diseases, ultimately fostering better outcomes in clinical settings.
摘要：人工智能的最新进展，尤其是大型语言模型LLM，在改变稀有疾病研究方面表现出了有希望的能力。本调查论文探讨了LLM在罕见疾病分析中的整合，强调了大步和关键研究，这些研究利用文本数据来发现对诊断，治疗和患者护理至关重要的见解和模式。尽管当前的研究主要采用文本数据，但结合遗传，成像和电子健康记录的多模式数据集成的潜力是有前途的前沿。我们审查了基本论文，这些论文证明了LLM在识别和提取相关的医疗信息，模拟智能对话剂进行患者互动中的应用，并启用准确及时的诊断。此外，本文讨论了部署LLMS固有的挑战和道德注意事项，包括数据隐私，模型透明度以及对健壮的包容性数据集的需求。作为这项探索的一部分，我们提供了一节有关实验的部分，该部分利用多个LLM与结构化问卷一起，专门为诊断目的而设计，以诊断不同。我们以对LLM向真正多模式平台发展的未来观点结论，该平台将整合多种数据类型，以提供对罕见疾病的更全面的了解，最终在临床环境中促进更好的结果。

Title: Improving endpoint detection in end-to-end streaming ASR for conversational speech

Authors: Anandh C, Karthik Pandia Durai, Jeena Prakash, Manickavela Arumugam, Kadri Hacioglu, S.Pavankumar Dubagunta, Andreas Stolcke, Shankar Venkatesan, Aravind Ganapathiraju
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.17070
Pdf URL: https://arxiv.org/pdf/2505.17070
Copy Paste: [[2505.17070]] Improving endpoint detection in end-to-end streaming ASR for conversational speech(https://arxiv.org/abs/2505.17070)
Keywords: agent
Abstract: ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning an incomplete transcript, while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods to the Switchboard conversational speech corpus and evaluate them against a delay penalty method.
摘要：ASR端点（EP）在提供人类/机器对话中人为代理的产品方面提供良好的用户体验，在提供良好的用户体验中起着重要作用。基于传感器的ASR（T-ASR）是一种端到端（E2E）ASR建模技术，首选用于流媒体。 T-ASR的主要局限性是ASR输出的延迟发射，这可能导致EP中的错误或延迟。不准确的EP会在讲话时切断用户，返回不完整的成绩单，而EP中的延迟将增加感知的延迟，从而降低用户体验。我们建议通过解决延迟排放以及EP错误来改善EP的方法。为了解决延迟的排放问题，我们在每个单词的末尾引入了一个字的令牌，以及延迟罚款。通过使用辅助网络获得可靠的帧级语音活动检测来解决EP延迟。我们将提出的方法应用于总机对话语音语料库，并根据延迟惩罚方法对其进行评估。
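
As a small illustration of the end-of-word token idea in the abstract above, the sketch below appends a marker after the last subword of each word in a training target. The token string `<eow>` and the input layout (one list of subword pieces per word) are assumptions; the abstract does not say how the token is represented or how it interacts with the delay penalty.

```python
EOW = "<eow>"  # hypothetical token string


def add_end_of_word_tokens(words):
    """Flatten per-word subword pieces and append EOW after each word.

    words: list of words, each given as a list of subword pieces.
    Returns the flat target sequence a transducer would be trained on.
    """
    target = []
    for pieces in words:
        target.extend(pieces)
        target.append(EOW)
    return target
```

With `[["hel", "lo"], ["world"]]` as input, the marker is placed after `"lo"` and after `"world"`, so the model gets an explicit signal that a word is complete.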

Title: What's in a prompt? Language models encode literary style in prompt embeddings

Authors: Raphaël Sarfati, Haley Moller, Toni J. B. Liu, Nicolas Boullé, Christopher Earls
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17071
Pdf URL: https://arxiv.org/pdf/2505.17071
Copy Paste: [[2505.17071]] What's in a prompt? Language models encode literary style in prompt embeddings(https://arxiv.org/abs/2505.17071)
Keywords: language model, prompt
Abstract: Large language models use high-dimensional latent spaces to encode and process textual information. Much work has investigated how the conceptual content of words translates into geometrical relationships between their vector representations. Fewer studies analyze how the cumulative information of an entire prompt becomes condensed into individual embeddings under the action of transformer layers. We use literary pieces to show that information about intangible, rather than factual, aspects of the prompt is contained in deep representations. We observe that short excerpts (10-100 tokens) from different novels separate in the latent space independently of the next-token predictions they converge towards. Ensembles of excerpts from books by the same author are much more entangled than those from different authors, suggesting that embeddings encode stylistic features. This geometry of style may have applications for authorship attribution and literary analysis, but most importantly reveals the sophistication of information processing and compression accomplished by language models.
摘要：大型语言模型使用高维的潜在空间来编码和处理文本信息。许多工作已经调查了单词的概念内容如何转化为其向量表示之间的几何关系。更少的研究分析了如何在变压器层的作用下将整个提示的累积信息凝结成单个嵌入。我们使用文学作品来展示有关无形的而不是事实的信息，即提示的各个方面包含在深度表示中。我们观察到，从潜在空间中的不同小说中的简短摘录（10-100个令牌）与他们收敛的下一步预测独立于潜在空间中的不同小说。来自同一位作者的书籍的合奏比在作者之间更纠缠得多，这表明嵌入式编码风格特征。这种样式的几何形状可能具有作者身份归因和文学分析的应用，但最重要的是揭示了通过语言模型完成的信息处理和压缩的复杂性。

Title: Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Authors: Anurag Mishra
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17073
Pdf URL: https://arxiv.org/pdf/2505.17073
Copy Paste: [[2505.17073]] Mechanistic Interpretability of GPT-like Models on Summarization Tasks(https://arxiv.org/abs/2505.17073)
Keywords: language model, gpt
Abstract: Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
摘要：机械性解释性研究旨在揭示大型语言模型的内部运作，但大多数工作都集中在分类或生成任务上，而不是摘要。本文提出了一个解释性框架，用于分析类似GPT的模型如何适应汇总任务。我们在预训练和微调模型之间进行了差异分析，从而量化了注意力模式和内部激活的变化。通过确定经历重大转换的特定层和注意力头，我们在模型体系结构中找到“摘要电路”。我们的发现表明，中层（尤其是2、3和5）表现出最大的变化，有62％的注意力头显示熵的降低，表明向集中信息选择的转变。我们证明，针对这些确定的电路的靶向洛拉适应可以比标准洛拉微调的绩效改善，同时需要更少的训练时期。这项工作弥合了黑框评估与机械理解之间的差距，从而提供了有关神经网络在摘要过程中如何执行信息选择和压缩的见解。
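
The entropy statistic quoted in the abstract above (the share of attention heads whose entropy decreases after fine-tuning) can be computed along the following lines. This is a minimal sketch assuming each head is summarized by a single attention distribution; the function names are hypothetical and the paper's exact aggregation over tokens and examples is not given in the abstract.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0)


def fraction_with_decreased_entropy(pre_heads, post_heads):
    """Share of heads whose entropy is lower after fine-tuning.

    pre_heads and post_heads are parallel lists of attention distributions,
    one per head, from the pre-trained and fine-tuned model respectively.
    """
    decreased = sum(
        attention_entropy(post) < attention_entropy(pre)
        for pre, post in zip(pre_heads, post_heads)
    )
    return decreased / len(pre_heads)
```

Lower entropy means the head concentrates its weight on fewer positions, which is what the abstract interprets as a shift toward focused information selection.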

Title: Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Authors: Ruixiao Li, Fahao Chen, Peng Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17074
Pdf URL: https://arxiv.org/pdf/2505.17074
Copy Paste: [[2505.17074]] Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency(https://arxiv.org/abs/2505.17074)
Keywords: language model, llm
Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
摘要：投机解码通过使用小型投机模型（SSM）来生成多个候选令牌，并使用并行的LLM验证它们，从而加速了大语言模型（LLM）推断。该技术已被广泛整合到LLM推理服务系统中。但是，推理请求通常显示出不确定的执行时间，这在这些系统中提出了有效调度请求的重大挑战。现有的工作估计执行时间仅基于预测的输出长度，这可能是不准确的，因为执行时间取决于LLM的输出长度和令牌接受率。在本文中，我们提出了一种半熟练的请求调度算法，称为投机解码（LAPS-SD），称为最不符合人为/感知的服务。给定许多推理请求，LAPS-SD可以通过根据解码过程中的功能自适应调度请求有效地最大程度地减少平均推理潜伏期。当令牌接受率是动态的，并且难以估计执行时间时，LAPS-SD会保持多个优先级队列，并允许跨不同队列的请求执行先发制。一旦令牌接受率变得稳定，LAPS-SD就可以准确地估算执行时间和计划请求。广泛的实验表明，与最先进的调度方法相比，LAPS-SD可将推理潜伏期降低约39％。
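
The multi-queue phase described in the LAPS-SD abstract resembles a classic least-attained-service scheduler, which can be sketched as below. This is not the paper's algorithm: it ignores token acceptance rates and the later estimation phase, and the function name, the unit-step service model, and the demotion thresholds are all assumptions chosen for illustration.

```python
from collections import deque


def least_attained_service(requests, thresholds=(2, 4)):
    """Run requests one step at a time through multi-level priority queues.

    requests maps a request name to the number of steps it needs.
    A request is demoted to a lower-priority queue once its attained
    service reaches each threshold, so short requests finish first.
    Returns the request names in completion order.
    """
    queues = [deque() for _ in range(len(thresholds) + 1)]
    attained = {name: 0 for name in requests}
    queues[0].extend(requests)

    finished = []
    while any(queues):
        # Always serve the highest-priority non-empty queue (preemption).
        level = next(i for i, q in enumerate(queues) if q)
        name = queues[level].popleft()
        attained[name] += 1
        if attained[name] >= requests[name]:
            finished.append(name)
            continue
        # Queue index equals the number of thresholds already reached.
        new_level = sum(attained[name] >= t for t in thresholds)
        queues[new_level].append(name)
    return finished
```

With requests needing 5, 1 and 3 steps, the 1-step request completes first and the 5-step request last, without the scheduler knowing the lengths in advance.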

Title: Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems

Authors: Fuma Kurata, Mao Saeki, Masaki Eguchi, Shungo Suzuki, Hiroaki Takatsu, Yoichi Matsuyama
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17075
Pdf URL: https://arxiv.org/pdf/2505.17075
Copy Paste: [[2505.17075]] Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems(https://arxiv.org/abs/2505.17075)
Keywords: agent
Abstract: This study aimed to develop and validate two scales of engagement and rapport to evaluate the user experience quality with multimodal dialogue systems in the context of foreign language learning. The scales were designed based on theories of engagement in educational psychology, social psychology, and second language this http URL-four Japanese learners of English completed roleplay and discussion tasks with trained human tutors and a dialog agent. After each dialogic task was completed, they responded to the scales of engagement and rapport. The validity and reliability of the scales were investigated through two analyses. We first conducted analysis of Cronbach's alpha coefficient and a series of confirmatory factor analyses to test the structural validity of the scales and the reliability of our designed items. We then compared the scores of engagement and rapport between the dialogue with human tutors and the one with a dialogue agent. The results revealed that our scales succeeded in capturing the difference in the dialogue experience quality between the human interlocutors and the dialogue agent from multiple perspectives.
摘要：这项研究旨在开发和验证两个量表的参与度和融洽的关系，以在外语学习的背景下使用多模式对话系统评估用户体验质量。这些量表的设计基于参与教育心理学，社会心理学和第二语言的理论，这些HTTP URL四欧洲语的英语学习者与训练有素的人类导师和对话者的训练者进行了讨论。每个对话任务完成后，他们回应了参与度和融洽关系的范围。通过两个分析研究了量表的有效性和可靠性。我们首先对Cronbach的Alpha系数进行了分析，并进行了一系列确认性因素分析，以测试量表的结构有效性和我们设计项目的可靠性。然后，我们比较了与人类导师的对话与与对话代理的对话之间的订婚和融洽关系。结果表明，我们的量表成功地捕获了人对话者与对话代理之间的对话体验质量的差异。

Title: Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

Authors: Haoyang Zhang, Hexin Liu, Xiangyu Zhang, Qiquan Zhang, Yuchen Hu, Junqi Zhao, Fei Tian, Xuerui Yang, Eng Siong Chng
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.17076
Pdf URL: https://arxiv.org/pdf/2505.17076
Copy Paste: [[2505.17076]] Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English(https://arxiv.org/abs/2505.17076)
Keywords: language model
Abstract: The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.
摘要：语音令牌在最近的语音任务中起着至关重要的作用，通常是语音信号和语言模型之间的桥梁。尽管低框架速率编解码器被广泛用作语音令牌，但框架速率对语音令牌的影响仍然没有得到充实的影响。在这项研究中，我们研究了不同的帧速率如何通过检查普通话和英语（两种类型上不同的语言）来影响语音象化化。我们以不同的帧速率编码语音，并在语音识别任务中评估所得的语义令牌。我们的发现表明，对于每种语言，帧速率变化对语音象化的影响都不同，从而突出了帧速率，语音密度和特定语言的声学特征之间的相互作用。结果为优化语音令牌的帧速率选择提供了见解，对自动语音识别，文本到语音和其他与语音相关的应用程序产生了影响。

Title: GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

Authors: Zenghao Duan, Zhiyi Yin, Zhichao Shi, Liang Pang, Shaoling Jing, Jiayi Wu, Yu Yan, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17078
Pdf URL: https://arxiv.org/pdf/2505.17078
Copy Paste: [[2505.17078]] GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace(https://arxiv.org/abs/2505.17078)
Keywords: language model, llm
Abstract: This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of the toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of the FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models' general capabilities, without requiring large-scale data or model retraining.
摘要：本文研究了大语言模型（LLMS）中毒性产生的潜在机制，并提出了一种有效的排毒方法。先前的工作通常将馈送网络（FFN）视为毒性的主要来源，代表有毒区域是一组有毒媒介或层次的子空间。但是，我们的深入分析表明，全球有毒子空间为模型中有毒区域提供了更有效，更全面的表示。在这种见解的基础上，我们提出了光泽（全球有毒子空间抑制），这是一种轻巧的四阶段方法，可以通过从FFN的参数中识别和去除全球有毒子空间来减轻毒性。一系列LLM的实验表明，光泽在保留模型一般功能的同时，可以实现最先进的排毒性能，而无需大规模数据或模型再培训。
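
Removing a subspace from a weight matrix, as the GloSS abstract describes at a high level, is commonly done with an orthogonal projection. The sketch below shows that generic linear-algebra step for a single direction, W' = W - u uᵀ W; it is not the paper's four-stage method, and the function name and the assumption that the direction lives in the row space index of W are illustrative only.

```python
def remove_direction(W, u):
    """Project the direction u out of the columns of matrix W.

    W is a list of rows; u has one entry per row of W.
    Returns W - u_hat (u_hat^T W), where u_hat is u normalized to unit length.
    """
    norm = sum(x * x for x in u) ** 0.5
    u_hat = [x / norm for x in u]
    n_rows, n_cols = len(W), len(W[0])
    # Component of each column of W along u_hat.
    coeff = [sum(u_hat[i] * W[i][j] for i in range(n_rows))
             for j in range(n_cols)]
    return [[W[i][j] - u_hat[i] * coeff[j] for j in range(n_cols)]
            for i in range(n_rows)]
```

After the projection, every column of the result is orthogonal to `u`, so the layer can no longer write along that direction. A multi-dimensional subspace would repeat this step for each vector of an orthonormal basis.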

Title: Not Minds, but Signs: Reframing LLMs through Semiotics

Authors: Davide Picca
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17080
Pdf URL: https://arxiv.org/pdf/2505.17080
Copy Paste: [[2505.17080]] Not Minds, but Signs: Reframing LLMs through Semiotics(https://arxiv.org/abs/2505.17080)
Keywords: language model, llm, agent
Abstract: This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.
摘要：本文挑战了将大型语言模型（LLM）作为认知系统构图的普遍趋势，而是为符号学的观点而言，将这些模型置于标志操纵和意义制作的更广泛的动态中。我们建议他们的主要功能是基于概率关联，而不是假设LLMS理解语言或模拟人类思想，而是建议它们的主要功能重新组合，重新定义和循环语言形式。通过从认知主义者转移到符号学框架，我们避免了拟人化，并获得了对LLM如何参与文化过程而不是通过思考，而是通过产生邀请解释的文本来获得更精确的理解。通过理论分析和实际示例，本文展示了LLM的功能如何作为符号学剂的功能，其输出可以视为解释性行为，对上下文谈判和批判性反思开放。我们探索文学，哲学，教育和文化生产中的应用，强调LLM如何作为创造力，对话和批判性探究的工具。符号学范式预示着意义的位置，特遣队和社会嵌入的本质，为研究和使用LLM提供了更严格和道德意识的框架。最终，这种方法将llms缩减为技术参与者，这是一个持续的迹象生态学。他们没有思想，但是他们改变了我们阅读，写作和创造意义的方式，迫使我们重新考虑语言，解释和人工系统在知识生产中的作用。

Title: GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

Authors: Abderrahman Skiredj, Ferdaous Azhari, Houdaifa Atou, Nouamane Tazi, Ismail Berrada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17082
Pdf URL: https://arxiv.org/pdf/2505.17082
Copy Paste: [[2505.17082]] GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data(https://arxiv.org/abs/2505.17082)
Keywords: language model, llm, prompt, chat
Abstract: Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone's cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites (LIMA 1K, DEITA 6K and TULU 50K) into Darija, preserve 20 of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5K mixed instructions lifts DarijaMMLU from 32.8 to 42.7; adding the reasoning-dense TULU portion pushes it to 47.5 with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6) and leaps ahead on Darija commonsense, scoring 60.5 on HellaSwag versus Atlas-Chat's 48.4. Crucially, GemMaroc retains Gemma-27B's strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU-hours, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.
摘要：开源大型语言模型（LLM）仍将摩洛哥阿拉伯语（Darija）边缘化，迫使从业者对重量级阿拉伯语适配器进行螺栓固定，或者牺牲使LLM有用的非常合理的技巧。我们表明，一种严格的质量定量对准策略可以表现出流利的darija，同时在通常的计算的片段上保护骨干的跨语性推理。我们将三个紧凑型教学套件1 K，Deita 6 K和Tulu 50 K转换为Darija，保留20个英语原件，并添加数学，编码和科学提示。在5 K混合说明中接受训练的洛拉（Lora）调整的Gemma 3-4B将Darijammlu从32.8提升至42.7；添加推理浓密的图鲁部分将其推至47.5，而没有英语回归。将相同的食谱缩放到Gemma 3-27b的情况下会产生Gemmaroc-27b，它与Darijammlu上的Atlas-Chat（61.6）相匹配，并在Darija Commonsense上飞跃，在Hellaswag与Atlas-Chat S 48.4上得分60.5。至关重要的是，Gemmaroc保留了Gemma-27B的强大数学和一般性策划能力，仅显示GSM8K和英语基准的最小运动。整个模型仅在48 gpu.h中进行了培训，强调了绿色的AI途径，通往包容性，可持续的语言技术。我们发布代码，数据和检查站，以激发以Darija为中心的教育，公共服务和日常数字互动的应用程序。

Title: Scale-invariant Attention

Authors: Ben Anson, Xi Wang, Laurence Aitchison
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17083
Pdf URL: https://arxiv.org/pdf/2505.17083
Copy Paste: [[2505.17083]] Scale-invariant Attention(https://arxiv.org/abs/2505.17083)
Keywords: llm, long context
Abstract: One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.
摘要：LLM研究中的一个持续挑战是，从较短的背景下的培训到对更长的上下文的推断，注意力机制的发展。我们提出的两个条件我们期望所有有效的长上下文注意机制具有：规模不变的总关注和规模不变的注意力稀少度。在高斯假设下，我们表明注意力逻辑的简单位置依赖性转换足以使这些条件保持。在实验上，我们发现，当从短上下文上的训练到较长的验证到较长的上下文验证时，在验证损失方面产生的规模不变方案在验证损失方面具有很大的好处，并且在长篇文章检索中有效。

Title: Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

Authors: Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17086
Pdf URL: https://arxiv.org/pdf/2505.17086
Copy Paste: [[2505.17086]] Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization(https://arxiv.org/abs/2505.17086)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Although Large Language Models (LLMs) have demonstrated remarkable versatility, their lack of factual knowledge means that their application to Question Answering (QA) tasks remains hindered by hallucination. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks.
摘要：大型语言模型（LLMS）由于缺乏事实知识而表现出了显着的多功能性，因此他们在问答（QA）任务的应用仍受到幻觉的阻碍。虽然检索效果的一代通过整合外部知识来减轻这些问题，但现有方法在很大程度上依赖于内在的学习，其性能受到LLMS的基本推理能力的限制。在本文中，我们提出了Mujica，Mujica是一种多跳的联合智能，用于复杂的问题回答，其中包括一个计划者，将问题分解为有指示的无环形图，并通过检索和推理来解决问题。此外，我们介绍了MyGo（简约的政策梯度优化），这是一种新颖的强化学习方法，通过从渐近最佳政策中抽样轨迹来代替传统的策略梯度更新（MLE）。 MyGo消除了对梯度重新缩放和参考模型的需求，从而确保稳定有效的培训。多个数据集的经验结果证明了Mujica-Mygo在增强各种LLM的多跳QA性能方面的有效性，为复杂的QA任务提供了可扩展和资源有效的解决方案。
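
The planner/worker split in the Mujica abstract implies that subquestions are resolved in dependency order over a directed acyclic graph. A minimal sketch of that control flow, using Python's standard `graphlib`, is shown below. The function `resolve_subquestions` and the `answer_fn` callback are hypothetical stand-ins for the worker; retrieval and the planner itself are out of scope here.

```python
from graphlib import TopologicalSorter


def resolve_subquestions(dag, answer_fn):
    """Answer subquestions so that every dependency is resolved first.

    dag maps a subquestion id to the set of ids it depends on.
    answer_fn(question_id, dependency_answers) plays the worker's role and
    returns the answer for one subquestion.
    """
    answers = {}
    for q in TopologicalSorter(dag).static_order():
        deps = {d: answers[d] for d in dag.get(q, ())}
        answers[q] = answer_fn(q, deps)
    return answers
```

For a graph where `q3` depends on `q1` and `q2`, and `q2` depends on `q1`, the worker is called for `q1`, then `q2`, then `q3`, each time receiving the answers it depends on.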

Title: Informatics for Food Processing

Authors: Gordana Ispirova, Michael Sebek, Giulia Menichetti
Subjects: cs.CL, cs.AI, cs.CY, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17087
Pdf URL: https://arxiv.org/pdf/2505.17087
Copy Paste: [[2505.17087]] Informatics for Food Processing(https://arxiv.org/abs/2505.17087)
Keywords: language model
Abstract: This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.
摘要：本章探讨了食品加工的演变，分类和健康影响，同时强调了机器学习，人工智能（AI）和数据科学在推进食品信息学方面的变革作用。它始于历史概述和对传统分类框架（例如Nova，Nutri-Score和Siga）的批判性审查，突出了它们的优势和局限性，尤其是阻碍流行病学研究和公共政策的主观性和可重复性挑战。为了解决这些问题，本章介绍了新的计算方法，包括Foodprox，这是一种随机森林模型，该模型训练了营养成分数据，以推断加工水平并产生连续的FPRO评分。它还探讨了伯特和生物伯特（Biobert）等大型语言模型如何在语义上嵌入食物描述和成分清单，即使在缺少数据的情况下，也可以为预测任务列表。本章的关键贡献是使用开放式食品事实数据库进行新的案例研究，展示了多模式AI模型如何整合结构化和非结构化数据以大规模对食物进行分类，并为公共卫生和研究中的食品处理评估提供了新的范式。

Title: Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

Authors: Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17089
Pdf URL: https://arxiv.org/pdf/2505.17089
Copy Paste: [[2505.17089]] Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models(https://arxiv.org/abs/2505.17089)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

Title: Large Language Models Implicitly Learn to See and Hear Just By Reading

Authors: Prateek Verma, Mert Pilanci
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.17091
Pdf URL: https://arxiv.org/pdf/2505.17091
Copy Paste: [[2505.17091]] Large Language Models Implicitly Learn to See and Hear Just By Reading(https://arxiv.org/abs/2505.17091)
Keywords: language model, llm
Abstract: This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.

Title: Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation

Authors: Kristine Ann M. Carandang, Jasper Meynard P. Araña, Ethan Robert A. Casin, Christopher P. Monterola, Daniel Stanley Y. Tan, Jesus Felix B. Valenzuela, Christian M. Alis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17095
Pdf URL: https://arxiv.org/pdf/2505.17095
Copy Paste: [[2505.17095]] Are LLMs reliable? An exploration of the reliability of large language models in clinical note generation(https://arxiv.org/abs/2505.17095)
Keywords: language model, llm, prompt
Abstract: Due to the legal and ethical responsibilities of healthcare providers (HCPs) for accurate documentation and protection of patient data privacy, the natural variability in the responses of large language models (LLMs) presents challenges for incorporating clinical note generation (CNG) systems, driven by LLMs, into real-world clinical processes. The complexity is further amplified by the detailed nature of texts in CNG. To enhance the confidence of HCPs in tools powered by LLMs, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in CNG in terms of their ability to generate notes that are string equivalent (consistency rate), have the same meaning (semantic consistency) and are correct (semantic similarity), across several iterations using the same prompt. The results show that (1) LLMs from all model families are stable, such that their responses are semantically consistent despite being written in various ways, and (2) most of the LLMs generated notes close to the corresponding notes made by experts. Overall, Meta's Llama 70B was the most reliable, followed by Mistral's Small model. With these findings, we recommend the local deployment of these relatively smaller open-weight models for CNG to ensure compliance with data privacy regulations, as well as to improve the efficiency of HCPs in clinical documentation.
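The "consistency rate" mentioned in the abstract above (notes that are string equivalent across repeated runs of the same prompt) can be illustrated with a minimal Python sketch. The exact definition used in the paper is not given in the abstract; the version below assumes it is the share of runs that match the most frequent output, and the whitespace normalisation, the function name `consistency_rate`, and the sample notes are all hypothetical.

```python
from collections import Counter


def consistency_rate(notes):
    """Share of generated notes that are string-identical to the most common note.

    Assumed definition for illustration; the paper may compute it differently.
    """
    if not notes:
        raise ValueError("need at least one generated note")
    # Collapsing whitespace is an assumption, not something the abstract specifies.
    normalised = [" ".join(note.split()) for note in notes]
    most_common_count = Counter(normalised).most_common(1)[0][1]
    return most_common_count / len(normalised)


# Made-up outputs from three runs of the same prompt.
runs = ["BP 120/80, no fever.", "BP 120/80, no fever.", "BP 120/80; afebrile."]
print(consistency_rate(runs))
```

Exact string matching is deliberately strict: the third run above may carry the same meaning but still counts as inconsistent, which is why the abstract reports semantic consistency as a separate measure.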

Title: TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Authors: Yanshu Li, Tian Yun, Jianjiang Yang, Pinyuan Feng, Jinfa Huang, Ruixiang Tang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17098
Pdf URL: https://arxiv.org/pdf/2505.17098
Copy Paste: [[2505.17098]] TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration(https://arxiv.org/abs/2505.17098)
Keywords: language model
Abstract: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.

Title: Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation

Authors: Xiaozhao Liu, Dinggang Shen, Xihui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17099
Pdf URL: https://arxiv.org/pdf/2505.17099
Copy Paste: [[2505.17099]] Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation(https://arxiv.org/abs/2505.17099)
Keywords: hallucination
Abstract: Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable--whether they truly reflect semantic activation in the brain, or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than the verbatim reconstruction of stimulus texts attempted previously. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.

Title: Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

Authors: Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17100
Pdf URL: https://arxiv.org/pdf/2505.17100
Copy Paste: [[2505.17100]] Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector(https://arxiv.org/abs/2505.17100)
Keywords: language model, llm, prompt
Abstract: LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types--verbosity, position, bandwagon, and sentiment--evaluated using 8 LLM evaluators demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
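The iterative loop described in the abstract above (an external detector flags a biased evaluation and returns reasoning, and the evaluator then revises) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function `debias`, its parameters, the `max_rounds` cap, and the toy evaluator and detector are all hypothetical stand-ins for real LLM calls.

```python
def debias(evaluate, detect_bias, max_rounds=3):
    """Let an external detector request revisions from an unmodified evaluator.

    evaluate(feedback) -> verdict; feedback is None on the first call.
    detect_bias(verdict) -> (is_biased, reasoning).
    Both callables are hypothetical placeholders for model calls.
    """
    verdict = evaluate(None)
    for _ in range(max_rounds):
        biased, reasoning = detect_bias(verdict)
        if not biased:
            break
        # Feed the detector's reasoning back so the evaluator can self-correct.
        verdict = evaluate(reasoning)
    return verdict


# Toy stand-ins: the evaluator favours the longer answer "B" until it gets feedback.
def toy_evaluate(feedback):
    return "B" if feedback is None else "A"


def toy_detect(verdict):
    return verdict == "B", "The longer answer was favoured; re-check correctness."


print(debias(toy_evaluate, toy_detect))
```

Keeping the detector outside the evaluator, as the abstract describes, means the same loop applies to closed-source judges that cannot be fine-tuned.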

Title: An approach to identify the most semantically informative deep representations of text and images

Authors: Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni, Alessandro Laio
Subjects: cs.CL, cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2505.17101
Pdf URL: https://arxiv.org/pdf/2505.17101
Copy Paste: [[2505.17101]] An approach to identify the most semantically informative deep representations of text and images(https://arxiv.org/abs/2505.17101)
Keywords: language model, llm
Abstract: Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.

Title: BanglaByT5: Byte-Level Modelling for Bangla

Authors: Pramit Bhattacharyya, Arnab Bhattacharya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17102
Pdf URL: https://arxiv.org/pdf/2505.17102
Copy Paste: [[2505.17102]] BanglaByT5: Byte-Level Modelling for Bangla(https://arxiv.org/abs/2505.17102)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google's ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight the efficacy of byte-level modelling for morphologically rich languages and highlight BanglaByT5's potential as a lightweight yet powerful tool for Bangla NLP, particularly in both resource-constrained and scalable environments.

Title: Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

Authors: Cécile Rousseau, Tobia Boschi, Giandomenico Cornacchia, Dhaval Salwala, Alessandra Pascale, Juan Bernabe Moreno
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17103
Pdf URL: https://arxiv.org/pdf/2505.17103
Copy Paste: [[2505.17103]] Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation(https://arxiv.org/abs/2505.17103)
Keywords: language model, llm
Abstract: SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. SDForger source code will be open-sourced soon.

Title: P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

Authors: Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Wenhao Huang, Ge Zhang, Jian Yang, Zhoujun Li
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.17104
Pdf URL: https://arxiv.org/pdf/2505.17104
Copy Paste: [[2505.17104]] P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark(https://arxiv.org/abs/2505.17104)
Keywords: llm, agent
Abstract: Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents (for visual element processing, content generation, and final poster assembly), each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.

Title: RRTL: Red Teaming Reasoning Large Language Models in Tool Learning

Authors: Yifei Liu, Yu Cui, Haibin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17106
Pdf URL: https://arxiv.org/pdf/2505.17106
Copy Paste: [[2505.17106]] RRTL: Red Teaming Reasoning Large Language Models in Tool Learning(https://arxiv.org/abs/2505.17106)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.

Title: Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling

Authors: Junlin Li, Guodong DU, Jing Li, Sim Kuan Goh, Wenya Wang, Yequan Wang, Fangming Liu, Ho-Kin Tang, Saleh Alharbi, Daojing He, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17110
Pdf URL: https://arxiv.org/pdf/2505.17110
Copy Paste: [[2505.17110]] Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling(https://arxiv.org/abs/2505.17110)
Keywords: language model, llm
Abstract: Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.

Title: Cultural Value Alignment in Large Language Models: A Prompt-based Analysis of Schwartz Values in Gemini, ChatGPT, and DeepSeek

Authors: Robin Segerer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17112
Pdf URL: https://arxiv.org/pdf/2505.17112
Copy Paste: [[2505.17112]] Cultural Value Alignment in Large Language Models: A Prompt-based Analysis of Schwartz Values in Gemini, ChatGPT, and DeepSeek(https://arxiv.org/abs/2505.17112)
Keywords: language model, gpt, llm, prompt, chat
Abstract: This study examines cultural value alignment in large language models (LLMs) by analyzing how Gemini, ChatGPT, and DeepSeek prioritize values from Schwartz's value framework. Using the 40-item Portrait Values Questionnaire, we assessed whether DeepSeek, trained on Chinese-language data, exhibits distinct value preferences compared to Western models. Results of a Bayesian ordinal regression model show that self-transcendence values (e.g., benevolence, universalism) were highly prioritized across all models, reflecting a general LLM tendency to emphasize prosocial values. However, DeepSeek uniquely downplayed self-enhancement values (e.g., power, achievement) compared to ChatGPT and Gemini, aligning with collectivist cultural tendencies. These findings suggest that LLMs reflect culturally situated biases rather than a universal ethical framework. To address value asymmetries in LLMs, we propose multi-perspective reasoning, self-reflective feedback, and dynamic contextualization. This study contributes to discussions on AI fairness, cultural neutrality, and the need for pluralistic AI alignment frameworks that integrate diverse moral perspectives.

Title: RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language

Authors: Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Subjects: cs.CL, cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2505.17114
Pdf URL: https://arxiv.org/pdf/2505.17114
Copy Paste: [[2505.17114]] RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language(https://arxiv.org/abs/2505.17114)
Keywords: language model
Abstract: Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at this https URL.
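The idea of query-conditioned gating in the abstract above (one scalar relevance score per token, applied before fusion) can be sketched in plain Python. This is an illustration only, not the QuART module: the scoring rule (a sigmoid over a scaled dot product with the query vector), the function name `relevance_gate`, and the toy vectors are assumptions, since the abstract does not say how the scores are computed.

```python
import math


def relevance_gate(tokens, query):
    """Scale each token vector by a scalar relevance score in (0, 1).

    Assumed scoring rule: sigmoid(dot(token, query) / sqrt(d)).
    Returns (scores, gated_tokens).
    """
    d = len(query)
    scores, gated = [], []
    for tok in tokens:
        logit = sum(t * q for t, q in zip(tok, query)) / math.sqrt(d)
        score = 1.0 / (1.0 + math.exp(-logit))
        scores.append(score)
        # Tokens aligned with the query are kept; misaligned ones are damped.
        gated.append([score * t for t in tok])
    return scores, gated


# Toy tokens: the first agrees with the query direction, the second opposes it.
scores, gated = relevance_gate([[1.0, 0.0], [-1.0, 0.0]], [4.0, 0.0])
print(scores)
```

Because the gate is applied per token rather than per stream, a single distracting segment (for example off-camera speech) can be suppressed without discarding the rest of that modality.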

Title: Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data

Authors: Akash Dhruv, Yangxinyu Xie, Jordan Branham, Tanwi Mallick
Subjects: cs.CL, cs.ET
Abstract URL: https://arxiv.org/abs/2505.17116
Pdf URL: https://arxiv.org/pdf/2505.17116
Copy Paste: [[2505.17116]] Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data(https://arxiv.org/abs/2505.17116)
Keywords: language model, llm, prompt
Abstract: This paper presents a comparative study of large language models (LLMs) in interpreting grid-structured geospatial data. We evaluate the performance of a base model through structured prompting and contrast it with a fine-tuned variant trained on a dataset of user-assistant interactions. Our results highlight the strengths and limitations of zero-shot prompting and demonstrate the benefits of fine-tuning for structured geospatial and temporal reasoning.

Title: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Authors: Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Subjects: cs.CL, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2505.17117
Pdf URL: https://arxiv.org/pdf/2505.17117
Copy Paste: [[2505.17117]] From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning(https://arxiv.org/abs/2505.17117)
Keywords: language model, llm
Abstract: Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
摘要：人类通过将各种实例映射到抽象表示的同时，在保留意义（例如，罗宾和蓝鸟都是鸟是鸟类；大多数鸟都可以飞行的同时，将知识通过语义压缩组织成紧凑的类别。这些概念反映了表达忠诚和代表性简单之间的权衡。大型语言模型（LLM）表现出了非凡的语言能力，但是它们的内部表示是否在压缩和语义忠诚之间进行类似人类的权衡，尚不清楚。我们介绍了一个新颖的信息理论框架，从速度延伸理论和信息瓶颈原理中汲取了数量比较这些策略。分析来自LLM的多样化套件针对开创性人类分类基准的嵌入令牌，我们发现了关键的差异。尽管LLM构成了与人类判断保持一致的广泛概念类别，但他们努力捕获对人类理解至关重要的细粒语义区别。从根本上讲，LLMS对积极的统计压缩表现出强烈的偏见，而人类概念系统似乎优先考虑适应性细微差别和上下文丰富，即使这通过我们的措施导致较低的压缩效率。这些发现阐明了当前的AI和人类认知架构之间的批判差异，并指导通往LLM的途径更加与人类一致的概念表示。
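A minimal sketch of the compression-versus-fidelity trade-off this abstract describes, not the authors' framework. It assumes a hard clustering of embedding vectors, measures complexity as the entropy of the cluster assignment in bits, and measures distortion as the mean squared distance to the cluster centroid. The function names and the weight `beta` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def complexity_bits(labels: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the cluster assignment; lower means more compression."""
    n = len(labels)
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def distortion(points: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Mean squared distance from each point to the centroid of its cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum(
            sum((m[d] - centroid[d]) ** 2 for d in range(dim)) for m in members
        )
    return total / len(points)


def tradeoff_objective(points, labels, beta: float = 1.0) -> float:
    """Illustrative objective: complexity + beta * distortion (beta is an assumption)."""
    return complexity_bits(labels) + beta * distortion(points, labels)


if __name__ == "__main__":
    pts: List[List[float]] = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
    # Two categories cost one bit but keep distortion low;
    # one category costs zero bits but blurs the two groups together.
    print(tradeoff_objective(pts, [0, 0, 1, 1]))
    print(tradeoff_objective(pts, [0, 0, 0, 0]))
```

Merging everything into one category drives complexity to zero at the price of higher distortion, which is the tension the abstract attributes to aggressive statistical compression.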

Title: After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG

Authors: Xinbang Dai, Huikang Hu, Yuncheng Hua, Jiaqi Li, Yongrui Chen, Rihui Jin, Nan Hu, Guilin Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17118
Pdf URL: https://arxiv.org/pdf/2505.17118
Copy Paste: [[2505.17118]] After Retrieval, Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG(https://arxiv.org/abs/2505.17118)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems face critical challenges in balancing internal (parametric) and external (retrieved) knowledge, especially when these sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios (prioritizing one knowledge source, naively merging both, or refusing to answer) but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs' trustworthy responses in real-world RAG applications.
摘要：检索增强生成（RAG）系统在平衡内部（参数化）知识与外部（检索到的）知识方面面临关键挑战，尤其是当这些来源相互冲突或不可靠时。为了全面分析这些情形，我们构建了可信度响应数据集（TRD），其中包含36,266个问题，涵盖四种RAG设置。我们发现，现有方法只针对孤立的情形（优先采用某一知识来源、简单地合并两者，或拒绝回答），缺乏能够同时处理不同现实条件的统一框架。因此，我们提出了BRIDGE框架，它可以动态确定大型语言模型（LLM）的综合响应策略。BRIDGE利用一种名为软偏置（soft bias）的自适应加权机制来指导知识收集，随后使用最大软偏置决策树来评估知识并选择最优响应策略（信任内部知识、信任外部知识或拒绝回答）。实验表明，BRIDGE的准确率比基线高出5-15％，同时在所有情形下保持均衡的性能。我们的工作为LLM在现实RAG应用中给出可信赖的响应提供了有效的解决方案。

Title: Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models

Authors: Zongru Shao, Xin Wang, Zhanyang Liu, Chenhan Wang, K.P. Subbalakshmi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17119
Pdf URL: https://arxiv.org/pdf/2505.17119
Copy Paste: [[2505.17119]] Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models(https://arxiv.org/abs/2505.17119)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent research leverages large language models (LLMs) for early mental health detection, such as depression, often optimized with machine-generated data. However, their detection may be subject to unknown weaknesses. Meanwhile, quality control has not been applied to these generated corpora besides limited human verifications. Our goal is to systematically evaluate LLM reasoning and reveal potential weaknesses. To this end, we first provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. Then we use the models' reasoning abilities to explore mitigation strategies for enhanced performance. Specifically, we do the following: A. Design an LLM instruction strategy that allows for systematic analysis of the detection by breaking down the task into several subtasks. B. Design contrastive few-shot and chain-of-thought prompts by selecting typical positive and negative examples of detection reasoning. C. Perform human annotation for the subtasks identified in the first step and evaluate the performance. D. Identify human-preferred detection with desired logical reasoning from the few-shot generation and use them to explore different optimization strategies. We conducted extensive comparisons on the DepTweet dataset across the following subtasks: 1. identifying whether the speaker is describing their own depression; 2. accurately detecting the presence of PHQ-9 symptoms; and 3. finally, detecting depression. Human verification of statistical outliers shows that LLMs demonstrate greater accuracy in analyzing and detecting explicit language of depression as opposed to implicit expressions of depression. Two optimization methods are used for performance enhancement and reduction of the statistical bias: supervised fine-tuning (SFT) and direct preference optimization (DPO). Notably, the DPO approach achieves significant performance improvement.
摘要：最近的研究利用大型语言模型（LLMS）进行早期心理健康检测，例如抑郁症，通常通过机器生成的数据进行优化。但是，它们的检测可能会遇到未知的弱点。同时，除了有限的人类验证外，质量控制还没有应用于这些产生的语料库。我们的目标是系统地评估LLM推理并揭示潜在的弱点。为此，我们首先对机器生成的检测和解释的推理进行系统的评估。然后，我们使用模型的推理能力来探索缓解策略以增强性能。具体来说，我们执行以下操作：A。设计一个LLM指令策略，该策略可以通过将任务分解为多个子任务来系统地分析检测。 B.设计通过选择典型的检测推理的典型正面和负面例子来对比几乎没有射击和链条的提示。 C.对第一步确定的子任务进行人体注释并评估性能。 D.通过几乎没有发电的逻辑推理确定人类偏爱的检测，并使用它们来探索不同的优化策略。我们在以下子任务中对Deptweet数据集进行了广泛的比较：1。确定说话者是否描述了自己的抑郁症； 2。准确检测到PHQ-9症状的存在，最后3。最后检测抑郁症。人类对统计异常值的验证表明，LLMS在分析和检测明确的抑郁症语言方面表现出更高的准确性，而不是抑郁症的隐式表达。两种优化方法用于增强性能和降低统计偏差：监督微调（SFT）和直接偏好优化（DPO）。值得注意的是，DPO方法可实现重大的性能提高。

Title: Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

Authors: Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17120
Pdf URL: https://arxiv.org/pdf/2505.17120
Copy Paste: [[2505.17120]] Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training(https://arxiv.org/abs/2505.17120)
Keywords: language model, gpt, llm
Abstract: We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to introspect and explain their own functioning. Here, we show that i) contemporary LLMs are capable of providing accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, ii) that it is possible to improve these capabilities through training, and iii) that this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain what they are doing as they make other complex decisions, not just decisions they have learned to make via fine-tuning. This work is a step towards training LLMs to accurately and broadly report on their own internal processes -- a possibility that would yield substantial benefits for interpretability, control, and safety.
摘要：我们对大型语言模型（LLM）如何以及为何以其特定方式作出反应的理解仍然有限。它们的神经网络已被证明难以解释，我们才刚刚开始梳理其中单个神经元和电路的功能。然而，理解这些系统的另一条途径是研究并发展它们内省和解释自身运作的能力。在这里，我们表明：i）当代LLM能够在某些类型的决策过程中对自身的内部过程给出准确的定量描述；ii）可以通过训练提升这些能力；iii）这种训练至少在一定程度上可以泛化。为此，我们对GPT-4o和GPT-4o-mini进行了微调，使其在各种复杂情境中（例如在公寓、贷款、度假等之间进行选择）根据随机生成的定量偏好做出决策，这些偏好规定了决策时如何权衡不同属性（例如，对公寓而言，自然采光与安静环境的相对重要性）。我们证明LLM可以准确地报告这些偏好（即它们在决策过程中学会赋予不同属性的权重）。接下来，我们证明可以对这些LLM进行微调，使其更准确地解释自己的决策。最后，我们证明这种训练可以泛化：它提高了模型在做出其他复杂决策时准确解释自身行为的能力，而不仅限于通过微调学会的那些决策。这项工作是朝着训练LLM准确而广泛地报告其自身内部过程迈出的一步，这种可能性将为可解释性、控制和安全带来可观的好处。

Title: NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

Authors: Weiming Wu, Zi-kang Wang, Jin Ye, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17121
Pdf URL: https://arxiv.org/pdf/2505.17121
Copy Paste: [[2505.17121]] NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation(https://arxiv.org/abs/2505.17121)
Keywords: language model, llm
Abstract: Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (Q&A) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.
摘要：获得具有推理路径的大规模高质量数据对于提高多模式大语言模型（MLLM）的几何推理能力至关重要。但是，现有的数据生成方法，无论是基于预定义的模板还是受约束的符号抛弃，都不可避免地面临多样性和数值概括限制。为了解决这些局限性，我们提出了Nesygeo，这是一种新型的神经符号框架，用于生成几何推理数据。首先，我们提出了一种基于实体关系构成范式的特定领域的语言，以全面地表示平面几何形状的所有组成部分，以及在此符号空间内定义的生成动作。然后，我们设计了一个符号 - 视文管道，该管道合成符号序列，将它们映射到相应的视觉和文本表示形式，并使用大语言模型（LLMS）生成多样化的Question-Asswer（Q＆A）对。据我们所知，我们是第一个提出一种神经符号方法来生成多模式推理数据的人。基于此框架，我们构建了包含100K样本的Nesygeo-cot和Nesygeo-caption数据集，并释放新的基准Nesygeo检验，以评估MLLM中的几何推理能力。实验表明，该提案显着，始终如一地提高了强化和监督微调下的多个MLLM的性能。基本模型只有4K样品和两个增强式微调时期的微型模型可在数学上获得高达 +15.8％的改善，而数学上的 +8.4％的改善和GEOQA的 + +7.3％。值得注意的是，可以改进4B模型，以优于几何推理任务的同一系列中的8B模型。

Title: Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

Authors: Xuan Qi, Jiahao Qiu, Xinzhe Juan, Yue Wu, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17122
Pdf URL: https://arxiv.org/pdf/2505.17122
Copy Paste: [[2505.17122]] Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?(https://arxiv.org/abs/2505.17122)
Keywords: language model, llm
Abstract: Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on a 40% truncated version of the Skywork-Reward-Preference-80K-v0.2 dataset outperforms one trained on the full dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. The performance is even better, which again validates our hypothesis. The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance.
摘要：将大型语言模型（LLM）与人类偏好保持一致仍然是AI的关键挑战。基于偏好的优化方法，例如使用人类反馈（RLHF）和直接偏好优化（DPO）的增强学习，依赖于人类通知的数据集来改善对齐方式。在这项工作中，我们确定了现有学习方法的关键特性：在首选响应中获得的区分信号通常集中在早期令牌中。我们将其称为浅偏好信号。为了探索此属性，我们在各个点上系统地截断了偏好数据集，并在截断的数据上训练奖励模型和DPO模型。令人惊讶的是，在截断的数据集上训练的模型，仅保留上半场或更少的令牌，与在完整数据集中培训的模型相当甚至优越的性能。例如，在Skywork-Reward-Preference-80K-V0.2数据集中训练的奖励模型在40 \％截断的数据集中训练时，数据集优于完整数据集。在多个数据集中，这种模式是一致的，这表明存在浅偏好信号的广泛存在。我们通过解码策略进一步研究奖励信号的分布。我们考虑了两种由浅奖励信号观察激励的简单解码策略，即长度控制解码和KL阈值控制解码，它们利用浅偏好信号来优化对齐和计算效率之间的权衡。表现更好，这再次证实了我们的假设。浅层偏好信号的现象突出了LLM对齐中的潜在问题：现有的一致性方法通常集中于仅对齐响应的初始令牌，而不是考虑全部响应。这可能会导致现实世界中的人类偏好的差异，从而导致次优对准性能。
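A minimal sketch of the truncation step described in this abstract, under stated assumptions: responses are split on whitespace rather than with a model tokenizer, records use hypothetical `prompt`, `chosen`, and `rejected` fields, and `keep_ratio` is read as the fraction of leading tokens retained. The abstract does not specify these details.

```python
from typing import Dict, List


def truncate_response(text: str, keep_ratio: float) -> str:
    """Keep only the leading fraction of whitespace-separated tokens."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    tokens = text.split()
    if not tokens:
        return text
    # Always keep at least one token so the response is never empty.
    keep = max(1, int(len(tokens) * keep_ratio))
    return " ".join(tokens[:keep])


def truncate_preference_pairs(
    pairs: List[Dict[str, str]], keep_ratio: float
) -> List[Dict[str, str]]:
    """Truncate both responses of every preference pair; prompts stay intact."""
    return [
        {
            "prompt": pair["prompt"],
            "chosen": truncate_response(pair["chosen"], keep_ratio),
            "rejected": truncate_response(pair["rejected"], keep_ratio),
        }
        for pair in pairs
    ]


if __name__ == "__main__":
    data = [
        {
            "prompt": "Explain recursion.",
            "chosen": "Recursion is when a function calls itself on a smaller input.",
            "rejected": "Recursion is a loop that never ends and should be avoided.",
        }
    ]
    print(truncate_preference_pairs(data, keep_ratio=0.5))
```

The truncated pairs could then be passed to any reward-model or DPO training pipeline in place of the full dataset to compare the two settings.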

Title: MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

Authors: Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17123
Pdf URL: https://arxiv.org/pdf/2505.17123
Copy Paste: [[2505.17123]] MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation(https://arxiv.org/abs/2505.17123)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating LLMs' Multi-Turn Reasoning. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research in interactive AI systems.
摘要：大型语言模型（LLM）的最新进展已在复杂的推理任务中显示出令人鼓舞的结果。但是，当前的评估主要集中在单转推理方案上，而交互式任务在很大程度上没有探索。我们将其归因于没有综合数据集和可扩展的自动评估协议。为了填补这些空白，我们为LLMS的多转弯推理评估提供了MTR基础。 MTR板凳由4个类别，40个任务和3600个实例组成，涵盖了各种推理能力，细粒度的难度粒度，并且需要与环境进行多旋转的相互作用。此外，MTR台面具有跨越数据集构造和模型评估的完全自动化框架，可以在不进行人工干预的情况下进行可扩展的评估。广泛的实验表明，即使是尖端推理模型也落在多转，交互式推理任务上。对这些结果的进一步分析为交互式AI系统的未来研究带来了宝贵的见解。

Title: Conformal Language Model Reasoning with Coherent Factuality

Authors: Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, Surbhi Goel
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17126
Pdf URL: https://arxiv.org/pdf/2505.17126
Copy Paste: [[2505.17126]] Conformal Language Model Reasoning with Coherent Factuality(https://arxiv.org/abs/2505.17126)
Keywords: language model
Abstract: Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the "factuality" of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define "coherent factuality" and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a "deducibility graph" that represents the steps of a reasoning problem. We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. Moreover, we achieve 90% factuality on our stricter definition while retaining 80% or more of the original claims, highlighting the utility of our deducibility-graph-guided approach.
摘要：语言模型越来越多地用于重要的决策流程中，因此确保其输出的正确性至关重要。最近的工作提出评估从语言模型生成内容中分解出的各个主张的“事实性”，并应用共形预测技术滤除不符合事实的主张。这对于信息检索等任务可能有效，因为其中的各个主张可以被孤立地评估事实性；但它不适用于推理任务，因为逻辑论证的步骤只有在其之前主张的上下文中才能评估正确性。为了刻画这一点，我们定义了“连贯事实性”，并开发了一种基于共形预测的方法来保证语言模型输出的连贯事实性。我们的方法将分裂共形预测应用于表示推理问题各步骤的“可推导性图”中的子图。我们在MATH和FELM数据集的数学推理问题上评估了我们的方法，发现我们的算法始终能生成正确且有依据的主张序列，在各目标覆盖水平上实现连贯事实性。此外，在我们更严格的定义下，我们达到了90％的事实性，同时保留了80％或更多的原始主张，凸显了可推导性图引导方法的实用性。
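A minimal sketch of generic split conformal prediction, the building block this abstract names. It is not the paper's deducibility-graph method: it assumes plain scalar nonconformity scores on a held-out calibration set and ignores the dependency structure between claims. Function names and the example scores are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With exchangeable data, a fresh score falls at or below this threshold
    with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


if __name__ == "__main__":
    calibration = [0.3, 0.1, 0.2]  # hypothetical nonconformity scores
    q = conformal_threshold(calibration, alpha=0.5)
    print(q)
    print(prediction_set({"claim_a": 0.15, "claim_b": 0.40}, q))
```

Per the abstract, the paper applies this kind of calibration to subgraphs of a reasoning graph rather than to isolated claims, so that each retained step is supported by the steps before it.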

Title: Relative Bias: A Comparative Framework for Quantifying Bias in LLMs

Authors: Alireza Arbabi, Florian Kerschbaum
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.17131
Pdf URL: https://arxiv.org/pdf/2505.17131
Copy Paste: [[2505.17131]] Relative Bias: A Comparative Framework for Quantifying Bias in LLMs(https://arxiv.org/abs/2505.17131)
Keywords: language model, llm
Abstract: The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what "bias" entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios, followed by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.
摘要：大型语言模型（LLM）的部署日益扩大，引起了人们对其固有偏见的关注，提出了有关其公平，安全和社会影响的关键问题。但是，量化LLM偏见仍然是一个根本的挑战，这是“偏见”所带来的歧义。随着新模型的迅速出现并获得广泛使用，这一挑战越来越大，同时引入了尚未系统评估的潜在偏见。在本文中，我们提出了相对偏置框架，该方法旨在评估LLM的行为如何与指定目标域内其他LLM偏离。我们介绍了两种互补的方法：（1）嵌入转换分析，该分析通过嵌入空间上的句子表示捕获相对偏见模式，以及（2）LLM-AS-A-A-Gudge，该法官采用语言模型来评估输出相比评估输出。将我们的框架应用于统计测试以进行验证之后的几个案例研究，以进行验证，我们发现两种评分方法之间的牢固对齐，为LLMS中的比较偏置分析提供了系统的，可扩展的和统计上的基础方法。

Title: LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17134
Pdf URL: https://arxiv.org/pdf/2505.17134
Copy Paste: [[2505.17134]] LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions(https://arxiv.org/abs/2505.17134)
Keywords: language model, llm
Abstract: High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
摘要：高质量的长篇文章指令数据对于对齐长篇小说大型语言模型（LLMS）至关重要。尽管公开发布了Qwen和Llama等模型，但他们的长篇小说指令数据仍然专有。人类注释是昂贵且具有挑战性的，而基于模板的合成方法限制了规模，多样性和质量。我们介绍了Longmagpie，这是一个自动合成框架，该框架自动生成大规模的长篇下说指令数据。我们的关键见解是，在用户转弯之前的特殊代币呈现文档后，将长篇文章LLM对齐时，自动回归生成上下文相关的查询。通过收集这些文档疑战带和模型的反应，Longmagpie在不努力的情况下会产生高质量的说明。有关头盔，标尺和Longbench V2的实验表明，Longmagpie在长篇小说任务上取得了领先的表现，同时在短篇小说任务上保持竞争性绩效，将其确立为一种简单有效的方法，可用于开放，多样性和可扩展的长篇文章指令数据合成。

Title: When can isotropy help adapt LLMs' next word prediction to numerical domains?

Authors: Rashed Shelim, Shengzhe Xu, Walid Saad, Naren Ramakrishnan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17135
Pdf URL: https://arxiv.org/pdf/2505.17135
Copy Paste: [[2505.17135]] When can isotropy help adapt LLMs' next word prediction to numerical domains?(https://arxiv.org/abs/2505.17135)
Keywords: language model, llm
Abstract: Recent studies have shown that vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black-box and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. By formulating a gradient structure of self-attention in pre-trained models, we show how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numeric data and model architecture could have different impacts on isotropy.
摘要：最近的研究表明，预先训练的大语言模型（LLMS）在数值域中的各种下游任务中有效地学习了上下文嵌入的向量表示。尽管有很大的好处，但LLM在此类领域中幻觉的趋势仍会在能源，自然，金融，医疗保健，零售和运输等应用中产生严重的后果。为了确保数值域中的预测可靠性和准确性，有必要打开黑框并通过解释提供性能保证。但是，对预训练的语言模型何时有助于解决下游任务的理论了解很少。本文试图通过理解LLMS的下一个单词预测能力何时可以通过基于上下文嵌入空间中各向同性的概念来适应数值域的下一字预测能力来弥合这一差距。具体而言，我们考虑了LLM的日志线性模型，在该模型中，可以通过在LLMS的输出层中具有SOFTMAX的网络（即语言模型hex in Insevention）从其上下文中预测数字数据。我们证明，为了在数值域中实现最新性能，LLM嵌入式的隐藏表示形式必须具有一个结构，该结构解释了SoftMax函数的偏移不变。通过在预训练的模型中制定自我注意的梯度结构，我们展示了LLM嵌入在上下文嵌入空间中的各向同性特性如何保留表示形式的基本结构，从而解决了移位不变问题并提供绩效保证。实验表明，数字数据和模型体系结构的不同特征可能会对各向同性产生不同的影响。
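A short illustration of the softmax shift-invariance this abstract refers to: adding the same constant c to every logit leaves the output distribution unchanged, i.e. softmax(z + c) = softmax(z). This shows only the general property, not the paper's analysis; the example logits are arbitrary.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shifted(logits: Sequence[float], c: float) -> List[float]:
    """Add the same constant to every logit."""
    return [z + c for z in logits]


if __name__ == "__main__":
    z = [1.0, 2.0, 3.0]
    base = softmax(z)
    moved = softmax(shifted(z, 100.0))
    # The two distributions agree up to floating-point error.
    print(all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(base, moved)))
```

Because the output cannot distinguish z from z + c, the absolute offset of the logits is not recoverable from the probabilities alone, which is why this property matters when the predicted quantity is numeric.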

Title: Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

Authors: Yuhan Ji, Song Gao, Ying Nie, Ivan Majić, Krzysztof Janowicz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17136
Pdf URL: https://arxiv.org/pdf/2505.17136
Copy Paste: [[2505.17136]] Foundation Models for Geospatial Reasoning: Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations(https://arxiv.org/abs/2505.17136)
Keywords: language model, gpt, llm, prompt
Abstract: Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known text (WKT) representation of geometries and their spatial relations (e.g., topological predicates) is preserved during spatial reasoning when the geospatial vector data are passed to large language models (LLMs) including GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Our workflow employs three distinct approaches to complete the spatial reasoning tasks for comparison, i.e., geometry embedding-based, prompt engineering-based, and everyday language-based evaluation. Our experimental results demonstrate that both the embedding-based and prompt engineering-based approaches to geospatial question-answering tasks with GPT models can achieve an accuracy of over 0.6 on average for the identification of topological spatial relations between two geometries. Among the evaluated models, GPT-4 with few-shot prompting achieved the highest performance with over 0.66 accuracy on topological spatial relation inference. Additionally, the GPT-based reasoner is capable of properly comprehending inverse topological spatial relations, and including an LLM-generated geometry can enhance the effectiveness of geographic entity retrieval. GPT-4 also exhibits the ability to translate certain vernacular descriptions about places into formal topological relations, and adding the geometry-type or place-type context in prompts may improve inference accuracy, but it varies by instance. The performance of these spatial reasoning tasks offers valuable insights for the refinement of LLMs with geographical knowledge towards the development of geo-foundation models capable of geospatial reasoning.
摘要：将AI基础模型直接应用于地理空间数据集仍然具有挑战性，因为它们表示地理实体并对其进行推理的能力有限，尤其是基于矢量的几何形状和复杂空间关系的自然语言描述。为了解决这些问题，我们研究了当地理空间矢量数据被传入大型语言模型（LLM，包括GPT-3.5-turbo、GPT-4和DeepSeek-R1-14B）时，几何形状的well-known text（WKT）表示及其空间关系（例如拓扑谓词）在空间推理过程中得到保留的程度。我们的工作流采用三种不同的方法来完成空间推理任务以作比较，即基于几何嵌入的评估、基于提示工程的评估和基于日常语言的评估。实验结果表明，在使用GPT模型的地理空间问答任务中，基于嵌入和基于提示工程的方法在识别两个几何形状之间的拓扑空间关系时，平均准确率均可超过0.6。在所评估的模型中，采用少样本提示的GPT-4表现最佳，在拓扑空间关系推理上的准确率超过0.66。此外，基于GPT的推理器能够正确理解逆拓扑空间关系，而加入LLM生成的几何形状可以提高地理实体检索的有效性。GPT-4还表现出将某些关于地点的日常口语描述转化为形式化拓扑关系的能力；在提示中加入几何类型或地点类型的上下文可能提高推理准确率，但效果因实例而异。这些空间推理任务的表现为利用地理知识改进LLM、进而开发具备地理空间推理能力的地理基础模型提供了有价值的见解。

Title: Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

Authors: Kristin Qi, Youxiang Zhu, Caroline Summerour, John A. Batsis, Xiaohui Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17137
Pdf URL: https://arxiv.org/pdf/2505.17137
Copy Paste: [[2505.17137]] Cog-TiPRO: Iterative Prompt Refinement with LLMs to Detect Cognitive Decline via Longitudinal Voice Assistant Commands(https://arxiv.org/abs/2505.17137)
Keywords: llm, prompt
Abstract: Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.
摘要：早期检测认知能力下降对于实现可以减缓神经退行性疾病进展的干预措施至关重要。传统的诊断方法依赖于劳动密集型临床评估，这些临床评估对于频繁监测是不切实际的。我们的试点研究将语音助理系统（VAS）作为非侵入性工具，可通过纵向分析语音命令中的语音模式来检测认知能力下降。在一个18个月的时间里，我们收集了来自35名老年人的语音命令，有15名参与者每天提供家庭VAS互动。为了解决分析这些简短，非结构化和嘈杂的命令的挑战，我们提出了COG-TIPRO，该框架结合了（1）LLM驱动的迭代及时及时的及时提示，用于语言特征提取，（2）总部位于Hubert的声学特征提取，以及（3）基于变压器的临时时间模型。使用iTransformer，我们的方法在检测MCI方面达到了73.80％的精度和72.67％的F1得分，表现优于其基线27.13％。通过我们的LLM方法，我们确定了语言特征，这些特征是在经历认知能力下降的个体中唯一表征日常命令使用模式的特征。

Title: EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

Authors: Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17139
Pdf URL: https://arxiv.org/pdf/2505.17139
Copy Paste: [[2505.17139]] EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models(https://arxiv.org/abs/2505.17139)
Keywords: language model, llm
Abstract: Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for fields such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at this https URL.
摘要：大语言模型（LLMS）的进步推动了对科学应用的兴趣，需要具有专门的基准（例如地球科学）。现有的基准呈现一般科学的重点，没有地球科学特异性，或者涵盖了孤立的子域，缺乏整体评估。此外，当前的基准通常忽略了在开放式科学探索中对LLM的能力的评估。在本文中，我们为地球科学提供了一个全面，专业的基准，旨在评估LLM在该领域内科学探索中的能力，涵盖从基本到高级水平。利用100,000篇研究论文的语料库，我们首先构建了两个问题回答（QA）数据集：Earth-Iron，它为广泛评估提供了广泛的问题覆盖范围，而Earth-Silver则具有更高水平的评估专业深度的难度。这些数据集涵盖了五个地球领域，114个学科和11个任务类别，评估了对科学探索至关重要的基础知识。最值得注意的是，我们使用新的指标介绍了地球，这是一个包含开放式多转向对话的数据集，专门设计用于评估LLMS在科学探索中的先进功能，包括方法诱导，限制分析和概念建议。广泛的实验揭示了在不同领域和任务的11个领先LLM中的局限性，这突出了相当大的改善其科学探索能力的空间。该基准标准可在此HTTPS URL上使用。

Title: Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

Authors: Essa Jan, Moiz Ali, Muhammad Saram Hassan, Fareed Zaffar, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17140
Pdf URL: https://arxiv.org/pdf/2505.17140
Copy Paste: [[2505.17140]] Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs(https://arxiv.org/abs/2505.17140)
Keywords: language model, llm
Abstract: As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and blanks) achieve substantially higher knowledge retention rates (48%) compared to mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings show the importance of task selection in updating LLM knowledge, showing that effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.
摘要：随着大型语言模型（LLM）的知识随着时间的流逝而过时，因此需要有效的方法来更新它们，尤其是在注射专有信息时。我们的研究表明，与面向映射的任务（如翻译（17％）或文本到json转换（20％）（20％））相比，理解密集型的微调任务（例如，问答和空白）达到了更高的知识保留率（48％），尽管暴露于相同的事实内容中。我们证明了这种模式在模型体系结构之间持续存在，并遵循缩放定律，更大的模型显示了所有任务类型的保留率的改进。但是，在更广泛的环境中应用注入知识时，所有模型均显示出明显的性能下降，这表明语义集成有限。这些发现表明，任务选择在更新LLM知识中的重要性，表明有效的知识注入不仅依赖于数据暴露，还依赖于微调过程中认知能力参与的深度。

Title: Large Language Models for Predictive Analysis: How Far Are They?

Authors: Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17149
Pdf URL: https://arxiv.org/pdf/2505.17149
Copy Paste: [[2505.17149]] Large Language Models for Predictive Analysis: How Far Are They?(https://arxiv.org/abs/2505.17149)
Keywords: language model, llm
Abstract: Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See GitHub (this https URL).
摘要：预测分析是现代决策的基石，并在各个领域中进行了应用。大型语言模型（LLMS）已成为实现细微的，知识密集的对话的强大工具，从而有助于复杂的决策任务。随着人们对利用LLM进行预测分析的期望，迫切需要系统地评估其在该领域的能力。但是，在现有研究中缺乏相关评估。为了弥合这一差距，我们介绍了\ textBf {predivigiq}基准，该基准集成了1130个复杂的预测分析查询，该查询源自44个不同字段的44个真实世界数据集。我们设计了一个评估协议，考虑文本分析，代码生成及其对齐方式。评估了十二个著名的LLM，从而提供了对它们在预测分析中实际使用的见解。通常，我们认为现有的LLM在进行预测分析时仍面临着巨大的挑战。请参阅\ href {此https url} {github}。

Title: Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions

Authors: Zishuo Bao, Yibo Liu, Changyutao Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17151
Pdf URL: https://arxiv.org/pdf/2505.17151
Copy Paste: [[2505.17151]] Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions(https://arxiv.org/abs/2505.17151)
Keywords: language model
Abstract: With the rise of different language model architectures, fine-tuning is becoming even more important for downstream tasks, yet finding proper hyperparameters for fine-tuning is messy. Although Bayesian optimization (BO) has been tried for hyperparameter tuning, most existing methods overlook the fact that BO relies on careful choices of acquisition functions, which are essential components of BO that guide how much to explore versus exploit during the optimization process. Different acquisition functions have different levels of sensitivity to training loss and validation performance, yet existing methods often apply a single acquisition function regardless of whether training and validation performance are sensitive to it. This work introduces Bilevel-BO-SWA, a model fusion approach coupled with a bilevel BO strategy to improve the fine-tuning of large language models. The method mixes acquisition functions such as Expected Improvement (EI) and Upper Confidence Bound (UCB) in nested optimization loops, where the inner loop minimizes training loss while the outer loop optimizes the validation metric. Experiments on GLUE tasks using RoBERTa-base show that using EI and UCB together improves generalization, and fine-tuning can be improved by up to 2.7%.
摘要：随着不同语言模型体系结构的兴起，微型调整对于下流任务越来越重要，模型变得凌乱，找到适当的超参数进行微调。尽管BO已尝试进行超参数调整，但大多数现有方法都忽略了BO依赖于采集功能的仔细选择的事实，这是BO的重要组成部分，它指导在优化过程中探索多少与探索相对于利用。不同的获取功能对训练损失和验证绩效具有不同的敏感性；无论培训和验证绩效是否对采集功能敏感，现有方法通常只应用采集功能。这项工作介绍了{Bilevel -bo -swa}，这是一种模型融合方法，再加上双重BO策略，以改善大型语言模型的精细调整。我们在采集功能（例如EI和UCB）混合使用的工作中，在嵌套的OPT循环中，内部环的训练损失最小化，而外圈则优化了W.R.T. Val度量。使用Roberta的胶水任务实验 - 基本表明，使用EI和UCB时，概括会有所改善，并且可以提高高达2.7％的良好调整。
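For readers unfamiliar with the two acquisition functions named in this abstract, the following is a minimal sketch of the textbook Expected Improvement (EI) and Upper Confidence Bound (UCB) formulas for a maximization problem with a Gaussian posterior. It does not reproduce the paper's bilevel procedure; the candidate names, posterior values, and the `xi` and `kappa` parameters are illustrative assumptions.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization: E[max(f - best - xi, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization: posterior mean plus kappa standard deviations."""
    return mu + kappa * sigma


# Hypothetical posterior (mean, std) of validation accuracy per configuration.
candidates = {
    "lr=1e-5": (0.80, 0.05),
    "lr=3e-5": (0.82, 0.01),
    "lr=1e-4": (0.78, 0.10),
}
best_so_far = 0.81  # hypothetical best observed validation accuracy

pick_ei = max(candidates, key=lambda k: expected_improvement(*candidates[k], best_so_far))
pick_ucb = max(candidates, key=lambda k: upper_confidence_bound(*candidates[k]))
print("EI picks:", pick_ei, "| UCB picks:", pick_ucb)
```

EI weighs how far a candidate is expected to exceed the incumbent, while UCB rewards uncertainty directly through `kappa`, which is why the two can disagree on the next point to evaluate.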

Title: Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN

Authors: Yao Xu, Mingyu Xu, Fangyu Lei, Wangtao Sun, Xiangrong Zeng, Bingning Wang, Guang Liu, Shizhu He, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17153
Pdf URL: https://arxiv.org/pdf/2505.17153
Copy Paste: [[2505.17153]] Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN(https://arxiv.org/abs/2505.17153)
Keywords: llm, chain-of-thought
Abstract: Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until they reach the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlate with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token's representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at this https URL.
摘要：最近，诸如OpenAI-O1和DeepSeek-R1之类的模型通过长期的思考（长期）推理在复杂的推理任务上表现出了显着的性能。尽管将这种能力提炼成学生模型可以显着提高他们的表现，但本文发现，在长COT数据上使用完整参数的微调LLM或LORA较低，通常会导致周期性推理，其中模型反复重申以前的推理步骤直至最大长度限制。进一步的分析表明，相邻代币之间的表示差异较小，与周期性推理的趋势更高。为了减轻此问题，本文提出了Shift FeedForward网络（Shift-FFN），这是一种新颖的方法，该方法在将其输入FFN之前与前一个代币的表示形式编辑了当前令牌的表示。该体系结构动态放大了相邻令牌之间的表示差异。对多个数学推理任务进行的广泛实验表明，与全面的微调和标准洛拉相比，洛拉与Shift-FFN相结合的各种数据尺寸的周期性推理率更高，并且较低的周期性推理速率。我们的数据和代码可在此HTTPS URL上找到

Title: PersonaBOT: Bringing Customer Personas to Life with LLMs and RAG

Authors: Muhammed Rizwan, Lars Carlsson, Mohammad Loni
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17156
Pdf URL: https://arxiv.org/pdf/2505.17156
Copy Paste: [[2505.17156]] PersonaBOT: Bringing Customer Personas to Life with LLMs and RAG(https://arxiv.org/abs/2505.17156)
Keywords: language model, llm, prompt, chat, retrieval-augmented generation, chain-of-thought
Abstract: The introduction of Large Language Models (LLMs) has significantly transformed Natural Language Processing (NLP) applications by enabling more advanced analysis of customer personas. At Volvo Construction Equipment (VCE), customer personas have traditionally been developed through qualitative methods, which are time-consuming and lack scalability. The main objective of this paper is to generate synthetic customer personas and integrate them into a Retrieval-Augmented Generation (RAG) chatbot to support decision-making in business processes. To this end, we first focus on developing a persona-based RAG chatbot integrated with verified personas. Next, synthetic personas are generated using Few-Shot and Chain-of-Thought (CoT) prompting techniques and evaluated based on completeness, relevance, and consistency using McNemar's test. In the final step, the chatbot's knowledge base is augmented with synthetic personas and additional segment information to assess improvements in response accuracy and practical utility. Key findings indicate that Few-Shot prompting outperformed CoT in generating more complete personas, while CoT demonstrated greater efficiency in terms of response time and token usage. After augmenting the knowledge base, the average accuracy rating of the chatbot increased from 5.88 to 6.42 on a 10-point scale, and 81.82% of participants found the updated system useful in business contexts.
摘要：大型语言模型（LLMS）的引入通过对客户角色进行更高级的分析，从而显着改变了自然语言处理（NLP）应用程序。在沃尔沃建筑设备（VCE）中，传统上是通过定性方法开发的，这些方法是耗时且缺乏可扩展性的。本文的主要目的是生成综合客户角色，并将其集成到检索型发电机（RAG）聊天机器人中，以支持业务流程中的决策。为此，我们首先专注于开发与经过验证的角色集成的基于角色的抹布聊天机器人。接下来，使用少量射击和链链（COT）提示技术生成合成角色，并根据McNemar的测试根据完整性，相关性和一致性进行评估。在最后一步中，聊天机器人的知识库通过合成角色和其他细分信息进行了增强，以评估响应准确性和实用性的改进。关键发现表明，很少有促使COT产生更完整的角色，而COT在响应时间和令牌使用方面表现出更高的效率。在增加了知识库之后，聊天机器人的平均准确性等级从10分制从5.88增加到6.42，而81.82％的参与者发现更新的系统在业务环境中有用。
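The abstract above compares the two prompting techniques with McNemar's test. As a rough illustration of how that test works on paired binary outcomes, here is a minimal sketch using only the standard library; the discordant counts `b` and `c` below are made up for the example and are not the paper's data.

```python
import math


def mcnemar(b, c, continuity=True):
    """McNemar's chi-square test on the discordant counts of a paired 2x2 table.

    b: cases where method A passed and method B failed
    c: cases where method A failed and method B passed
    Returns (statistic, p_value) using the chi-square distribution with 1 dof.
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards' continuity correction
    stat = diff * diff / n
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 10 personas judged complete only under Few-Shot,
# 2 judged complete only under CoT.
stat, p = mcnemar(10, 2)
print(f"statistic={stat:.3f}, p={p:.4f}")
```

Only the discordant pairs enter the statistic; cases where both methods agree carry no information about which one is better.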

Title: Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting

Authors: Bang Trinh Tran To, Thai Le
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17160
Pdf URL: https://arxiv.org/pdf/2505.17160
Copy Paste: [[2505.17160]] Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting(https://arxiv.org/abs/2505.17160)
Keywords: language model, llm, prompt
Abstract: This work presents LURK (Latent UnleaRned Knowledge), a novel framework that probes for hidden retained knowledge in unlearned LLMs through adversarial suffix prompting. LURK automatically generates adversarial prompt suffixes designed to elicit residual knowledge about the Harry Potter domain, a commonly used benchmark for unlearning. Our experiments reveal that even models deemed successfully unlearned can leak idiosyncratic information under targeted adversarial conditions, highlighting critical limitations of current unlearning evaluation standards. By uncovering latent knowledge through indirect probing, LURK offers a more rigorous and diagnostic tool for assessing the robustness of unlearning algorithms. All code will be publicly available.
摘要：这项工作提出了潜伏的（潜在的未学习知识），这是一个新颖的框架，该框架通过对抗后缀提示来探测未经学习的LLM中隐藏的保留知识。潜伏会自动生成旨在引起有关哈利·波特域的残留知识的对抗性提示后缀，这是一种常用的学习基准。我们的实验表明，即使被视为成功未经学习的模型也可能在目标对抗条件下泄漏特质信息，从而突出了当前未学习评估标准的临界局限性。通过通过间接探测来揭示潜在知识，潜伏提供了一种更严格，更严格的诊断工具，用于评估未学习算法的鲁棒性。所有代码将公开可用。

Title: CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, Bjoern Menze
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17167
Pdf URL: https://arxiv.org/pdf/2505.17167
Copy Paste: [[2505.17167]] CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation(https://arxiv.org/abs/2505.17167)
Keywords: llm
Abstract: Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
摘要：评估长篇小说放射学报告的生成具有挑战性。 NLG指标无法捕获临床正确性，而基于LLM的指标通常缺乏普遍性。临床准确性指标更为相关，但对阶级失衡很敏感，通常有利于微不足道的预测。我们提出了CRG评分，这是一种分布感知和适应性的度量标准，仅评估参考报告中明确描述的临床相关异常。 CRG支持二进制和结构化标签（例如类型，位置），并且可以与任何LLM配对以进行特征提取。通过基于标签分布的惩罚平衡，它可以实现更公平，更强大的评估，并用作临床上的奖励功能。

Title: Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

Authors: Yu-Ang Cheng, Leyang Hu, Hai Huang, Randall Balestriero
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17169
Pdf URL: https://arxiv.org/pdf/2505.17169
Copy Paste: [[2505.17169]] Next Token Perception Score: Analytical Assessment of your LLM Perception Skills(https://arxiv.org/abs/2505.17169)
Keywords: language model, llm
Abstract: Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS), a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to both upper- and lower-bound the excess loss. Empirically, we show that NTPS correlates strongly with linear probe accuracy across 12 diverse NLP datasets and eight pretrained models ranging from 270M to 8B parameters, confirming its utility as a measure of alignment. Furthermore, we show that NTPS increases following low-rank adaptation (LoRA) fine-tuning, especially in large models, suggesting that aligning representations to perception tasks with LoRA enhances subspace overlap and thus improves downstream performance. More importantly, we find that NTPS reliably predicts the additional accuracy gains attained by LoRA fine-tuning, thereby providing a lightweight prescreening tool for LoRA adaptation. Our results offer both theoretical insights and practical tools for analytically assessing LLM perception skills.
摘要：自回旋预处理已成为在大语言模型（LLMS）中学习通用表示形式的事实上的范式。但是，跨下游感知任务的线性探针性能显示出很大的可变性，这表明针对下一个预测进行了优化的功能并不能始终如一地转移到下游感知任务。我们证明，通过自动进度捕获特征所学的表示形式可能位于最有用的感知子空间之外。为了量化自回旋预处理和下游感知之间的（MIS）对齐，我们介绍了下一个令牌感知得分（NTPS） - 在线性设置下得出的分数，该分数测量了自动回归和感知特征特征子空间之间的重叠。该度量可以轻松地从验证的表示形式和标记数据中以封闭形式计算，并被证明是在超大损失的上限和下限。从经验上讲，我们表明NTPS与12个不同的NLP数据集的线性探针准确性密切相关，并且八个预算模型从270m到8b参数，这证实了其效用作为对齐的量度。此外，我们表明，在低级别适应性（LORA）微调后，尤其是在大型模型中，NTP会增加，这表明Lora对齐表示以感知任务增强了子空间的重叠，从而提高了下游性能。更重要的是，我们发现NTPS可靠地预测了洛拉芬特（Lora Finetuning）获得的额外准确性提高，从而为洛拉（Lora）适应提供了轻巧的预筛查工具。我们的结果提供了理论见解和实用工具，用于分析评估LLM感知技能。

Title: FB-RAG: Improving RAG with Forward and Backward Lookup

Authors: Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17206
Pdf URL: https://arxiv.org/pdf/2505.17206
Copy Paste: [[2505.17206]] FB-RAG: Improving RAG with Forward and Backward Lookup(https://arxiv.org/abs/2505.17206)
Keywords: llm, long context, retrieval augmented generation
Abstract: The performance of Retrieval Augmented Generation (RAG) systems relies heavily on the retriever quality and the size of the retrieved context. A large enough context ensures that the relevant information is present in the input context for the LLM, but also incorporates irrelevant content that has been shown to confuse the models. On the other hand, a smaller context reduces the irrelevant information, but it often comes at the risk of losing important information necessary to answer the input question. This duality is especially challenging to manage for complex queries that contain little information to retrieve the relevant chunks from the full context. To address this, we present a novel framework, called FB-RAG, which enhances the RAG pipeline by relying on a combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) to retrieve specific context chunks that are the most relevant for answering the input query. Our evaluations on 9 datasets from two leading benchmarks show that FB-RAG consistently outperforms RAG and Long Context baselines developed recently for these benchmarks. We further show that FB-RAG can improve performance while reducing latency. We perform qualitative analysis of the strengths and shortcomings of our approach, providing specific insights to guide future work.
摘要：检索增强发电（RAG）系统的性能在很大程度上取决于检索质量和检索到环境的大小。足够大的上下文可确保在LLM的输入上下文中存在相关信息，但还结合了已证明会使模型混淆的无关内容。另一方面，较小的环境减少了无关的信息，但通常有可能失去回答输入问题所需的重要信息。这种二元性尤其具有挑战性，要管理复杂的查询，这些查询包含很少的信息来从完整的上下文中检索相关的块。为了解决这个问题，我们提出了一个名为FB-rag的新颖框架，该框架通过依靠向后查找（与查询重叠）和前向查找（与候选理由和答案重叠）的组合来增强RAG管道，以检索最相关的特定上下文块，这些块最相关。我们对来自两个领先基准的9个数据集的评估表明，FB-rag始终超过了针对这些基准测试的RAG和长上下文基线的表现。我们进一步表明，FB-rag可以在减少延迟的同时提高性能。我们对方法的优势和缺点进行定性分析，提供了指导未来工作的特定见解。
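The combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: plain token overlap stands in for whatever scoring FB-RAG actually uses, and `rank_chunks`, `forward_weight`, and the sample texts are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(chunk, reference):
    """Fraction of the reference's tokens that also appear in the chunk."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    return len(tokens(chunk) & ref) / len(ref)


def rank_chunks(chunks, query, candidates, forward_weight=0.5, top_k=2):
    """Score chunks by a weighted mix of backward and forward overlap."""
    scored = []
    for chunk in chunks:
        backward = overlap(chunk, query)  # backward lookup: match the query
        # forward lookup: best match against any candidate reason or answer
        forward = max((overlap(chunk, c) for c in candidates), default=0.0)
        score = (1.0 - forward_weight) * backward + forward_weight * forward
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


chunks = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Paris is the capital of France.",
    "Gustave Eiffel's company designed and built the tower.",
]
query = "Who built the Eiffel Tower?"
# Hypothetical candidate answer, e.g. drafted by a lightweight model.
candidates = ["Gustave Eiffel's company built it."]

selected = rank_chunks(chunks, query, candidates)
print(selected)
```

The forward term lets a chunk rank highly even when it shares few words with the query, as long as it supports a plausible answer.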

Title: Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs

Authors: Kangda Wei, Hasnat Md Abdullah, Ruihong Huang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.17217
Pdf URL: https://arxiv.org/pdf/2505.17217
Copy Paste: [[2505.17217]] Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs(https://arxiv.org/abs/2505.17217)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data.
摘要：大型语言模型（LLM）经常表现出性别偏见，导致在不同情况下对男性和女性受试者的不平等对待。为了解决这个问题，我们提出了一个新的数据生成框架，该框架促进了LLMS中的探索性思维。我们的方法促使模型在结构上相同的，道德上模棱两可的场景中生成故事对，以男性和女性主角为特色，然后引发并比较其道德判断。当出现不一致时，该模型将被指导产生平衡的性别中性判断。这些故事判断对用于通过直接偏好优化（DPO）微调或优化模型。实验结果表明，我们的方法在保留甚至增强一般模型功能的同时大大减少了性别偏见。我们将发布代码并生成的数据。

Title: Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts

Authors: Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, Shrikanth Narayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17222
Pdf URL: https://arxiv.org/pdf/2505.17222
Copy Paste: [[2505.17222]] Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts(https://arxiv.org/abs/2505.17222)
Keywords: language model, llm, prompt
Abstract: Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show that failures to copy the label(s) to the output of the LLM are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at this https URL.
摘要：由于人类注释的显着差异，对自然语言处理中复杂的主观任务进行建模，例如识别情绪和道德。这种变化通常反映了语义解释的合理差异，而不是仅仅是噪音，因此需要方法来区分合法的主观性和错误。我们通过使用大语言模型（LLM）在这些上下文中探索标签验证来应对这一挑战。首先，我们提出了一个简单的内在学习二进制滤波基线，以估计文档标签对的合理性。然后，我们介绍了标签 - 在 - 海景设置中：查询及其标签及其标签均包含在显示给LLMS的演示中，该示范被提示再次预测标签，同时接收特定于任务的指令（例如，情感识别），而不是标签复制。我们展示了如何将标签复制到LLM的输出与任务相关且有益的内容。在此基础上，我们提出了主观标签校正的标签 - 海堆整流（LIAHR）框架：当模型输出与参考金标签差异时，我们将生成的标签分配给示例而不是丢弃。可以将这种方法集成到注释管道中，以提高信噪比。全面的分析，人类评估和生态有效性研究验证了LIAHR对标签校正的效用。代码可在此HTTPS URL上找到。

Title: ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

Authors: Jipeng Zhang, Haolin Yang, Kehao Miao, Ruiyuan Zhang, Renjie Pi, Jiahui Gao, Xiaofang Zhou
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2505.17231
Pdf URL: https://arxiv.org/pdf/2505.17231
Copy Paste: [[2505.17231]] ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects(https://arxiv.org/abs/2505.17231)
Keywords: gpt, prompt, agent
Abstract: Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.
摘要：最近的文本到SQL模型已经取得了强大的性能，但是由于数据集限制，它们的有效性在很大程度上仍然局限于SQLite。但是，现实世界的应用程序需要在多个方言中生成SQL，具有不同的语法和专业功能，这仍然是当前模型的挑战。构建方言感知模型的主要障碍在于获取高质量方言的数据。数据纯粹是通过静态提示生成的数据 - 不通过执行验证SQL-往往嘈杂且不可靠。此外，训练循环中缺乏实际执行环境可以阻止模型在可执行语义上的预测基础，尽管数据过滤从表面级别进行了改进，但仍限制了概括。这项工作介绍了EXESQL，ExesQL是一个具有执行驱动的，代理的引导程序的文本到SQL框架。该方法包括迭代查询生成，基于执行的过滤（例如拒绝采样）和基于首选项的培训，从而使模型能够通过可验证的，反馈指导的学习来适应新的SQL方言。实验表明，EXESQL在文本到SQL中的方言差距桥接，在PostgreSQL，MySQL和Oracle上，平均提高了15.2％，10.38％和4.49％的差距，跨越了多个不同的难度。
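The execution-based filtering step mentioned in this abstract (keeping only generated queries that actually run) might look roughly like the sketch below. It is an assumption-laden illustration: SQLite's in-memory engine stands in for the PostgreSQL, MySQL, and Oracle targets because it ships with Python, and `execution_filter`, the schema, and the candidate queries are hypothetical.

```python
import sqlite3


def execution_filter(candidates, setup_sql):
    """Keep only the candidate queries that execute without error."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)  # build the scratch schema and data
    kept = []
    for sql in candidates:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # reject candidates that fail to execute
        kept.append(sql)
    conn.close()
    return kept


setup_sql = """
CREATE TABLE users (id INTEGER, name TEXT);
INSERT INTO users VALUES (1, 'Ada');
"""

# Hypothetical model outputs for the same question.
candidates = [
    "SELECT name FROM users WHERE id = 1",  # valid
    "SELECT nme FROM users",                # unknown column
    "SELEC name FROM users",                # syntax error
]

kept = execution_filter(candidates, setup_sql)
print(kept)
```

A real pipeline would run each candidate against the target dialect's engine and would likely also compare result sets, since a query can execute cleanly and still return the wrong answer.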

Title: Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)

Authors: Clayton Cohn, Surya Rayala, Caitlin Snyder, Joyce Fonteles, Shruti Jain, Naveeduddin Mohammed, Umesh Timalsina, Sarah K. Burriss, Ashwin T S, Namrata Srivastava, Menton Deweese, Angela Eeds, Gautam Biswas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17238
Pdf URL: https://arxiv.org/pdf/2505.17238
Copy Paste: [[2505.17238]] Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)(https://arxiv.org/abs/2505.17238)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: Collaborative dialogue offers rich insights into students' learning and critical thinking. This is essential for adapting pedagogical agents to students' learning and problem-solving skills in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, potential hallucinations can undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge, but its effectiveness depends on clear semantic links between user input and a knowledge base, which are often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by incorporating environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students' critical thinking and epistemic decision-making in a collaborative computational modeling environment, XYZ.
摘要：协作对话为学生的学习和批判性思维提供了丰富的见解。这对于在STEM+C设置中将教学代理适应学生的学习和解决问题的技能至关重要。尽管大型语言模型（LLM）促进了动态的教学互动，但潜在的幻觉会破坏信心，信任和教学价值。在精选知识中，检索授权的生成（RAG）llm输出的基础，但其有效性取决于用户输入和知识基础之间的明确语义联系，而这些知识基础在学生对话中通常很弱。我们提出了对数封闭式的抹布（LC-rag），从而通过将环境日志结合到上下文化协作性话语来增强抹布检索。我们的发现表明，LC-rag在仅限话语的基准中改善了检索，并允许我们的协作同伴Copa提供相关的个性化指导，以支持学生在协作计算建模环境中的批判性思维和认知决策制定。

Title: ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

Authors: Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17250
Pdf URL: https://arxiv.org/pdf/2505.17250
Copy Paste: [[2505.17250]] ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models(https://arxiv.org/abs/2505.17250)
Keywords: language model, hallucination
Abstract: Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at this https URL.
摘要：大型语言模型通过将问题分解为结构化的推理步骤，在复杂的任务中表现出色。但是，推理痕迹通常不仅仅范围扩展到正确的答案，从而导致浪费的计算，降低可读性和幻觉。为了解决这个问题，我们介绍了一种新型的无高于参数的简洁性评分，用作加固学习框架中的奖励信号，以指导模型生成正确和简洁的推理痕迹。该分数由一个大型语言模型评估，该模型充当法官，使动态，上下文感知的反馈超出简单令牌的长度。我们的方法在数学数据集上实现了最新的效率 - 准确性权衡取舍，在简单问题上最多将令牌用法减少了31倍，同时将准确性提高了7％，并且在最严重的问题上，它的表现超过了 +7.5％的准确性，最高为3.6倍，少于3.6倍。在定理上，我们的方法使用少12.5倍的令牌提高了准确性 +2.2％。我们还对法官模型，奖励组成和问题难度进行消融研究，这表明我们的方法会根据问题难度动态调整推理长度，并从更强大的法官那里显着地调整了推理长度。代码，模型权重和数据集在此HTTPS URL上开源。

Title: The Rise of Parameter Specialization for Knowledge Storage in Large Language Models

Authors: Yihuai Hong, Yiran Zhao, Wei Tang, Yang Deng, Yu Rong, Wenxuan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17260
Pdf URL: https://arxiv.org/pdf/2505.17260
Copy Paste: [[2505.17260]] The Rise of Parameter Specialization for Knowledge Storage in Large Language Models(https://arxiv.org/abs/2505.17260)
Keywords: language model
Abstract: Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model's efficiency in leveraging stored knowledge.
摘要：随着时间的流逝，各种系列的大型语言模型越来越多。研究人员正在努力最大程度地提高具有约束参数大小的语言模型的性能。但是，从微观的角度来看，关于如何更好地存储模型参数（尤其是MLP）中的知识的研究有限，以通过模型更有效地利用这种知识。在这项工作中，我们分析了二十种公开可用的开源大语模型，以研究其强大绩效与知识的方式之间的关系。我们的发现表明，随着语言模型变得更加先进并表现出更强的知识能力，它们的参数将提高专业化。具体而言，MLP中的参数往往更专注于编码相似类型的知识。我们在实验上验证了这种专门的知识分布有助于提高这些模型中知识利用的效率。此外，通过进行因果培训实验，我们确认这种专业知识分布在提高模型在利用存储知识方面的效率方面起着至关重要的作用。

Title: CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

Authors: Xiao Yu Cindy Zhang (1), Carlos R. Ferreira (2), Francis Rossignol (2), Raymond T. Ng (1), Wyeth Wasserman (1), Jian Zhu (1) ((1) University of British Columbia, (2) National Institutes of Health)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17265
Pdf URL: https://arxiv.org/pdf/2505.17265
Copy Paste: [[2505.17265]] CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports(https://arxiv.org/abs/2505.17265)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs' limitations in recognizing negative findings important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.
摘要：罕见疾病，包括天生的新陈代谢错误（IEM）构成了重大诊断挑战。案例报告是关键，但计算未充分利用的资源以告知诊断。临床密集信息提取是指将医疗信息组织为结构化预定义的类别。大型语言模型（LLMS）可以从病例报告中启用可扩展的信息提取，但很少对此任务进行评估。我们介绍了Casereportbench，这是一个专家通知的数据集，用于案例报告的密集信息提取，重点是IEM。使用此数据集，我们评估了各种模型并提示策略，引入了新颖的方法，例如特定于类别的提示和副标题过滤的数据集成。零射击链的提示比标准的零射击提示几乎没有优势。特定于类别的提示可以改善与基准测试的一致性。开源模型QWEN2.5-7B在此任务上的表现优于GPT-4O。我们的临床医生评估表明，LLM可以从病例报告中提取临床相关细节，从而支持罕见的疾病诊断和管理。我们还强调了改进的领域，例如LLMS在识别对鉴别诊断至关重要的负面发现方面的限制。这项工作推动了LLM驱动的临床自然语言处理，并为可扩展的医疗AI应用铺平了道路。

Title: Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

Authors: Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Xiaojun Wu, Honghao Liu, Hui Xiong, Jian Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17266
Pdf URL: https://arxiv.org/pdf/2505.17266
Copy Paste: [[2505.17266]] Select2Reason: Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning(https://arxiv.org/abs/2505.17266)
Keywords: language model, llm, chain-of-thought
Abstract: A practical approach to activating long chain-of-thought (CoT) reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate question difficulty and combines it, through a weighted ranking scheme, with a heuristic based on reasoning trace length to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across varying data sizes, its efficiency during inference, and its adaptability to other instruction pools at minimal cost.
摘要：一种实用的方法来激活预先训练的大语言模型中长期思考的推理能力的方法是对由强大的大推理模型（例如DeepSeek-r1）合成的指导数据集进行监督的微调，为增强学习提供了具有成本效益的替代方案。但是，具有超过100K样品的大规模指导集有大量的培训开销，而自动长时间指导选择的有效策略仍未得到探索。在这项工作中，我们提出了Select2Reason，这是一个新颖而有效的指令调查数据选择框架，用于长期推理。从重新思考行为（例如自我纠正和回溯）的重新思考行为的角度来看，我们研究了可能决定长期推理说明质量的通用指标。 Select2Reason利用了一个量化器来估算问题的难度，并通过加权方案共同结合了基于痕量长度的启发式启发式启发式，以优先考虑高实效示例。 OpenR1-MATH-220K的经验结果表明，Select2Reason选择的数据中仅10％的微调LLM在三个竞争级别和六个全面的数学基准中，与全数据调整和开源基线OpenR1-QWEN-7B具有竞争性或优于全数据调整和开源基线OpenR1-QWEN-7B。进一步的实验突出了不同数据大小，推断期间效率的可伸缩性以及其对其他指令库的适应性，成本最低。
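The Select2Reason abstract describes ranking examples by a weighted combination of estimated question difficulty and reasoning trace length. The following is a minimal sketch of that idea only, not the authors' implementation: the `Example` fields, the rank-based normalization, the `weight` value, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    difficulty: float   # hypothetical difficulty score from some quantifier
    trace_tokens: int   # length of the reasoning trace, in tokens


def rank_scores(values):
    """Map each value to its rank in ascending order, scaled to [0, 1]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0.0] * len(values)
    denom = max(len(values) - 1, 1)
    for rank, idx in enumerate(order):
        scores[idx] = rank / denom
    return scores


def select_top(examples, weight=0.5, fraction=0.1):
    """Keep the top `fraction` of examples by a weighted sum of the two ranks."""
    diff = rank_scores([e.difficulty for e in examples])
    length = rank_scores([e.trace_tokens for e in examples])
    combined = [weight * d + (1 - weight) * l for d, l in zip(diff, length)]
    k = max(1, int(len(examples) * fraction))
    top = sorted(range(len(examples)), key=lambda i: combined[i], reverse=True)[:k]
    return [examples[i] for i in top]
```

Ranks are used instead of raw values so that the two signals are on a comparable scale before weighting; other normalizations would work equally well in this sketch.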

Title: GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

Authors: Odysseas S. Chlapanis, Dimitrios Galanis, Nikolaos Aletras, Ion Androutsopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17267
Pdf URL: https://arxiv.org/pdf/2505.17267
Copy Paste: [[2505.17267]] GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations(https://arxiv.org/abs/2505.17267)
Keywords: llm
Abstract: We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
摘要：我们介绍了Greekbarbench，这是一种基准，该基准评估了来自希腊律师考试的五个不同法律领域的法律问题的LLM，这需要引用法定文章和案例事实。为了应对自由文本评估的挑战，我们提出了一个三维评分系统，结合了LLM-AS-A-A-Gudge方法。我们还开发了一个元评估基准，以评估LLM-gudges与人类专家评估之间的相关性，揭示了基于简单的基于跨度的标题可以改善其对齐方式。我们对13个专有和开放式LLM的系统评估表明，即使最佳模型的表现优于平均专家得分，但它们的专家却没有第95个百分点。

Title: Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

Authors: Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, Zhiyu Zoey Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17281
Pdf URL: https://arxiv.org/pdf/2505.17281
Copy Paste: [[2505.17281]] Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty(https://arxiv.org/abs/2505.17281)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models' uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model's uncertainty in its search decisions. To address this, we propose $\beta$-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that $\beta$-GRPO gives a 3B model better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
摘要：通过启用动态，多步电推理和信息检索，代理检索效果生成（RAG）系统可以增强大语言模型（LLMS）。但是，这些系统经常表现出次优的搜索行为，例如过度搜索（检索冗余信息）和搜索不足（未能检索必要的信息），这阻碍了效率和可靠性。这项工作正式定义并量化了这些行为，揭示了它们在多个QA数据集和代理抹布系统中的流行率（例如，一个模型可以避免在其搜索步骤的27.7％中进行搜索）。此外，我们展示了这些低效率的关键联系与模型对其自身知识边界的不确定性之间的联系，其中响应精度与模型在搜索决策中的不确定性相关。为了解决这个问题，我们提出了$ \ beta $ -grpo，这是一种基于强化的学习培训方法，该方法结合了信心阈值以奖励高确定性搜索决策。七个QA基准测试的实验表明，$ \ beta $ -grpo启用具有更好的代理抹布能力的3B型号，表现优于其他强大基线，平均匹配得分高4％。
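The $\beta$-GRPO abstract mentions a confidence threshold that rewards high-certainty search decisions. One way such a reward could be shaped is sketched below. This is an assumption-laden illustration, not the paper's formulation: the confidence measure (minimum token probability), the default `beta` value, and both function names are hypothetical.

```python
import math


def search_confidence(token_logprobs):
    """Hypothetical confidence: the lowest token probability in a search decision."""
    return min(math.exp(lp) for lp in token_logprobs)


def thresholded_reward(answer_correct, decision_logprobs, beta=0.4):
    """Give the answer reward only when every search decision clears `beta`.

    `decision_logprobs` holds one list of token log-probabilities per
    search decision made during the rollout.
    """
    if not answer_correct:
        return 0.0
    confident = all(search_confidence(lps) >= beta for lps in decision_logprobs)
    return 1.0 if confident else 0.0
```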

Title: SELF: Self-Extend the Context Length With Logistic Growth Function

Authors: Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17296
Pdf URL: https://arxiv.org/pdf/2505.17296
Copy Paste: [[2505.17296]] SELF: Self-Extend the Context Length With Logistic Growth Function(https://arxiv.org/abs/2505.17296)
Keywords: language model, long context, prompt
Abstract: Large language models struggle when operating on contexts longer than their training context length, owing to the standard position encoding for tokens in the attention layer. Tokens a long distance apart rarely affect each other, and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution that groups consecutive tokens at varying group sizes using a logistic capacity equation, combined with a constant group size at smaller relative distances. Our model improved performance by up to 12% compared to the LongLM extension method on LEval (specifically on the Qwen model). On summarization-related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than LongLM. Our code is available at this https URL.
摘要：大型语言模型在注意力层中为代币编码的标准位置编码的标准位置，在长篇小说上进行操作时会遇到问题。远距离的令牌很少会彼此影响，并且长时间提示会产生意外的结果。为了解决这个问题，我们提出了自我（使用逻辑增长函数自我扩展上下文长度）：使用logistic容量方程将连续令牌分组的解决方案，使用逻辑容量方程与较小的相对距离下的恒定组大小相结合。与Leval中的LongLM扩展方法相比，我们的模型的性能高达12％（特别是在QWEN模型上）。在Longbench中相关的任务中，我们的模型比LongLM（特别是在Llama-2-7b模型上）的表现高6.4％。在阅读Leval的理解任务时，我们的模型比LongLM好高出5.4％。我们的代码可在此HTTPS URL上找到。

Title: Refusal Direction is Universal Across Safety-Aligned Languages

Authors: Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17306
Pdf URL: https://arxiv.org/pdf/2505.17306
Copy Paste: [[2505.17306]] Refusal Direction is Universal Across Safety-Aligned Languages(https://arxiv.org/abs/2505.17306)
Keywords: language model, llm, prompt
Abstract: Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
摘要：大语言模型（LLM）中的拒绝机制对于确保安全至关重要。最近的研究表明，拒绝行为可以通过激活空间中的单个方向介导，从而使靶向干预措施绕过拒绝。尽管这主要是在以英语为中心的背景下证明的，但适当的拒绝行为对于任何语言都很重要，但理解不佳。在本文中，我们使用PolyRefuse调查了14种语言中LLM的拒绝行为，PolyRefuse是一种通过将恶意和良性英语提示转换为这些语言而创建的多语言安全数据集。我们揭示了拒绝方向的令人惊讶的跨语性普遍性：从英语中提取的向量可以绕过其他语言，而没有任何其他微调。更值得注意的是，拒绝从任何与安全的语言转移到他人的拒绝方向。我们将这种转移性归因于跨语言中拒绝载体的并行性，并确定跨语言越狱背后的基本机制。这些发现提供了可行的见解，以建立更强大的多语言安全防御，并为对LLMS中跨语性脆弱性的更深入的机械理解铺平道路。
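The refusal-direction abstract refers to a single direction in activation space that mediates refusal. A common recipe in prior work is to take the difference between mean activations on harmful and harmless prompts and then project that direction out. The sketch below assumes that recipe; whether this paper extracts its vectors in exactly this way is not stated in the abstract, and the array shapes and function names are illustrative.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Both inputs are arrays of shape (num_prompts, hidden_dim).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)
```

After ablation, every activation is orthogonal to the direction, which is what "bypassing" the refusal feature means in this simplified picture.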

Title: From Compression to Expansion: A Layerwise Analysis of In-Context Learning

Authors: Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17322
Pdf URL: https://arxiv.org/pdf/2505.17322
Copy Paste: [[2505.17322]] From Compression to Expansion: A Layerwise Analysis of In-Context Learning(https://arxiv.org/abs/2505.17322)
Keywords: language model, llm
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term *Layerwise Compression-Expansion*: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance -- improving with model size and the number of demonstrations -- and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.
摘要：内部文化学习（ICL）使大型语言模型（LLMS）通过从演示序列中学习而无需重新更新的新任务。尽管ICL表现出强烈的经验表现，但其内部代表性机制尚未得到充分理解。在这项工作中，我们对ICL表示形式进行了统计几何分析，以研究如何在各个层中捕获特定于任务的信息。我们的分析揭示了一种有趣的现象，我们将其称为“ layerwise compression-expansion” *：早期层逐渐产生紧凑和判别性表示，从输入演示中编码任务信息，而后来的层则扩展了这些表示形式以结合查询并生成预测。在各种任务和一系列当代LLM体系结构之间始终如一地观察到这种现象。我们证明，它对ICL性能具有重要意义 - 随着模型大小和演示数量的改善 - 以及在存在嘈杂示例的情况下的鲁棒性。为了进一步了解紧凑的任务表示的效果，我们提出了一个偏差变化分解，并提供了理论分析，以表明注意机制如何有助于降低差异和偏见，从而随着演示数量的增加而提高性能。我们的发现揭示了ICL中有趣的层次动态，突出了LLM中的结构化表示方式，并展示了分析内部表示形式可以促进对模型行为的更深入的理解。

Title: GPT Editors, Not Authors: The Stylistic Footprint of LLMs in Academic Preprints

Authors: Soren DeHaan, Yuanze Liu, Johan Bollen, Saúl A. Blanco
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17327
Pdf URL: https://arxiv.org/pdf/2505.17327
Copy Paste: [[2505.17327]] GPT Editors, Not Authors: The Stylistic Footprint of LLMs in Academic Preprints(https://arxiv.org/abs/2505.17327)
Keywords: language model, gpt, llm, hallucination
Abstract: The proliferation of Large Language Models (LLMs) in late 2022 has impacted academic writing, threatening credibility, and causing institutional uncertainty. We seek to determine the degree to which LLMs are used to generate critical text as opposed to being used for editing, such as checking for grammar errors or inappropriate phrasing. In our study, we analyze arXiv papers for stylistic segmentation, which we measure by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. We find that LLM-attributed language is not predictive of stylistic segmentation, suggesting that when authors use LLMs, they do so uniformly, reducing the risk of hallucinations being introduced into academic preprints.
摘要：2022年底，大型语言模型（LLM）的扩散影响了学术写作，威胁信誉并引起机构不确定性。我们试图确定使用LLM用于生成关键文本而不是用于编辑的程度，例如检查语法错误或不适当的措辞。在我们的研究中，我们分析了Arxiv论文的风格分割，我们通过改变对接受GPT培训的文本训练的贝叶斯分类器的毛皮阈值来衡量。我们发现，LLM属性语言不能预测风格细分，这表明当作者使用LLMS时，它们会统一地做到这一点，从而降低了引入幻觉的风险。

Title: SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Authors: Hitesh Laxmichand Patel, Amit Agarwal, Arion Das, Bhargava Kumar, Srikant Panda, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Dong-Kyu Chae
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2505.17332
Pdf URL: https://arxiv.org/pdf/2505.17332
Copy Paste: [[2505.17332]] SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use(https://arxiv.org/abs/2505.17332)
Keywords: language model, llm, prompt
Abstract: Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: this https URL.
摘要：企业客户越来越多地采用大型语言模型（LLMS）来进行关键的通信任务，例如起草电子邮件，制作销售销量以及撰写休闲消息。在不同地区部署此类模型要求他们了解各种文化和语言环境，并产生安全和尊重的回应。对于企业应用程序，通过有效识别和处理不安全或令人反感的语言来减轻声誉风险，维持信任并确保合规至关重要。为了解决这个问题，我们介绍了Sweeval，这是一种基准测试，以模拟现实世界（正面或负面）和上下文（正式或非正式）的现实情况。提示在完成任务时明确指示模型在完成任务时包含特定的誓言。该基准测试评估LLM是否遵守或抵制这种不当说明，并评估其与道德框架，文化细微差别和语言理解能力的一致性。为了促进构建企业使用及以后的符合道德对齐的AI系统的研究，我们发布了数据集和代码：此HTTPS URL。

Title: Language models should be subject to repeatable, open, domain-contextualized hallucination benchmarking

Authors: Justin D. Norman, Michael U. Rivera, D. Alex Hughes
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17345
Pdf URL: https://arxiv.org/pdf/2505.17345
Copy Paste: [[2505.17345]] Language models should be subject to repeatable, open, domain-contextualized hallucination benchmarking(https://arxiv.org/abs/2505.17345)
Keywords: language model, hallucination
Abstract: Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.
摘要：普遍认为，在负责任地采用语言模型的情况下，普遍存在的模型生成文本中有合理但不准确。尽管有这种关注，但几乎没有科学工作试图以一种全面的方式衡量语言模型幻觉的普遍性。在本文中，我们认为应使用可重复，开放和域的幻觉幻觉基准测试来评估语言模型。我们提出了幻觉的分类法，并表明，当数据创建的早期阶段缺乏专家时，由此产生的幻觉指标就缺乏有效性和实用性。

Title: A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit

Authors: Zafarullah Mahmood, Soliman Ali, Jiading Zhu, Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Jodi Wolff, Osnat Melamed, Nadia Minian, Marta Maslej, Carolynne Cooper, Matt Ratto, Peter Selby, Jonathan Rose
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17362
Pdf URL: https://arxiv.org/pdf/2505.17362
Copy Paste: [[2505.17362]] A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit(https://arxiv.org/abs/2505.17362)
Keywords: language model, llm, chat
Abstract: The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot's adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants' confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants' language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.
摘要：大语言模型（LLM）的对话能力表明，它们可能能够作为自动谈话治疗师执行。重要的是要知道这些系统是否有效并遵守已知标准。我们提出了一个辅导员聊天机器人，该聊天机器人专注于激励吸烟者戒烟。它使用最先进的LLM和一种称为动机访谈（MI）的广泛应用的治疗方法，并与具有MI专业知识的临床医生合作而发展。我们还描述并验证了聊天机器人对MI和客户响应的自动评估。聊天机器人对106名参与者进行了测试，他们对他们可以成功戒烟吸烟的信心在谈话前和一周后都进行了测量。参与者的信心平均在0-10范围内增加了1.7。对聊天机器人的自动评估表明，在98％的话语中，遵守MI标准，高于人类顾问。聊天机器人在参与者报告的同情心的指标上得分良好，但比典型的人类顾问低。此外，参与者的语言表明，改变的动力是MI的关键目标。这些结果表明，与现代LLM的谈话疗法自动化有望。

Title: AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing

Authors: Yinghui Huang, Yuxuan Jiang, Hui Liu, Yixin Cai, Weiqing Li, Xiangen Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17380
Pdf URL: https://arxiv.org/pdf/2505.17380
Copy Paste: [[2505.17380]] AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing(https://arxiv.org/abs/2505.17380)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) like GPT-4 show potential for scaling motivational interviewing (MI) in addiction care, but require systematic evaluation of therapeutic capabilities. We present a computational framework assessing user-perceived quality (UPQ) through expected and unexpected MI behaviors. Analyzing human therapist and GPT-4 MI sessions via human-AI collaboration, we developed predictive models integrating deep learning and explainable AI to identify 17 MI-consistent (MICO) and MI-inconsistent (MIIN) behavioral metrics. A customized chain-of-thought prompt improved GPT-4's MI performance, reducing inappropriate advice while enhancing reflections and empathy. Although GPT-4 remained marginally inferior to therapists overall, it demonstrated superior advice management capabilities. The model achieved measurable quality improvements through prompt engineering, yet showed limitations in addressing complex emotional nuances. This framework establishes a pathway for optimizing LLM-based therapeutic tools through targeted behavioral metric analysis and human-AI co-evaluation. Findings highlight both the scalability potential and current constraints of LLMs in clinical communication applications.
摘要：大型语言模型（LLM）（例如GPT-4）在成瘾护理中显示出扩展动机访谈（MI）的潜力，但需要系统地评估治疗能力。我们提出一个计算框架，通过预期和意外的MI行为评估用户感知的质量（UPQ）。通过人类协作分析人类治疗师和GPT-4 MI会议，我们开发了整合深度学习和可解释的AI的预测模型，以识别17个MI一致性（MICO）（MICO）和Mi-Incomesiscentent（MIIN）行为指标。定制的经过思考促进了GPT-4的MI性能，从而减少了不适当的建议，同时增强了反思和同理心。尽管GPT-4总体上仍然远不如治疗师，但它表现出了出色的建议管理能力。该模型通过迅速的工程实现了可衡量的质量改进，但在解决复杂的情感细微差别方面显示出局限性。该框架建立了通过针对性的行为指标分析和人类协同评估来优化基于LLM的治疗工具的途径。发现突出了LLM在临床通信应用中的可伸缩性潜力和当前限制。

Title: WiNGPT-3.0 Technical Report

Authors: Boqin Zhuang, Chenxiao Song, Huitong Lu, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17387
Pdf URL: https://arxiv.org/pdf/2505.17387
Copy Paste: [[2505.17387]] WiNGPT-3.0 Technical Report(https://arxiv.org/abs/2505.17387)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focuses on the development of WiNGPT-3.0, a 32-billion-parameter LLM engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL's efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.
摘要：当前的大型语言模型（LLMS）表现出重大局限性，特别是在结构化，可解释和可验证的医学推理中以及与计算资源和数据隐私相关的实际部署挑战。该报告的重点是320亿参数LLM的Wingpt-3.0的开发，其目的是增强其医疗推理的能力，并探索其在医疗保健IT基础设施中有效整合的潜力。更广泛的目的是迈向临床适用的模型。该方法涉及针对一般，医学和临床推理量身定制的多阶段培训管道。该管道纳入了监督的微调（SFT）和增强学习（RL），利用精选的长期链链（COT）数据集，辅助奖励模型以及基于证据的诊断链仿真。 WingPT-3.0表现出强烈的性能：在MEDCALC上获得66.6的特定模型变体，而MEDQA-USMLE的得分为87.1。此外，有针对性的培训可以提高临床推理任务的性能，从基线得分的58.1到62.5。这些发现表明，即使使用只有几千个示例的有限数据集应用，也可以提高医疗推理的准确性。至关重要的是，通过有限的数据和计算对RL的功效的演示为在临床工作流程和健康信息基础架构中更具值得信赖和实际可部署的LLM铺平了道路。

Title: Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

Authors: Gauri Kambhatla, Chantal Shaib, Venkata Govindarajan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17390
Pdf URL: https://arxiv.org/pdf/2505.17390
Copy Paste: [[2505.17390]] Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting(https://arxiv.org/abs/2505.17390)
Keywords: language model, llm, prompt
Abstract: Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. Firstly, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn't increase diversity noticeably.
摘要：最近已使用细粒度的角色来生成“多样化”的合成数据，以预训练和监督大型语言模型（LLMS）的微调。在这项工作中，我们用一系列词汇多样性和冗余指标来衡量人格驱动的合成产生的提示和响应的多样性。首先，我们发现合成提示/说明的多样性明显少于人类编写的提示。接下来，我们从不同尺寸的LLM中采样响应，并具有细粒度和粗糙的角色描述，以研究角色描述中有多少细粒细节有助于产生的文本多样性。我们发现，虽然推动角色的能力确实改善了词汇多样性（尤其是使用较大的模型），但人格的细粒细节并没有明显增加多样性。
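The persona-prompting abstract mentions "a suite of lexical diversity and redundancy metrics" without naming them. Two widely used metrics of this kind are the type-token ratio and distinct-n. The sketch below shows how they are computed; it assumes whitespace tokenization and is not claimed to match the metrics used in the paper.

```python
def type_token_ratio(texts):
    """Unique tokens divided by total tokens across all texts."""
    tokens = [tok for t in texts for tok in t.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def distinct_n(texts, n=2):
    """Unique n-grams divided by total n-grams across all texts."""
    ngrams = []
    for t in texts:
        toks = t.lower().split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Lower values indicate more repetition, so a set of synthetic prompts that reuses the same phrasing would score below a comparably sized set of human-written prompts.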

Title: Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

Authors: Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17391
Pdf URL: https://arxiv.org/pdf/2505.17391
Copy Paste: [[2505.17391]] Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation(https://arxiv.org/abs/2505.17391)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
摘要：在最新的外部证据中，检索提升的生成（RAG）大型语言模型（LLMS），但现有的多跳抹布管道仍会发出冗余的子征服，探索过于浅，或者在过度长时间的搜索链中徘徊。我们介绍了EVO-RAG，这是一种课程引导的增强学习框架，它从广泛的早期探索到简洁的晚期改进，演变出了询问的辅助剂。 Evo-rag伴侣将七因素，步进奖励向量（涵盖相关性，冗余，效率和答案正确性）与时变的调度程序，随着情节的展开，将这些信号重新持续。对代理商进行了直接优化优化的培训，而不是多头奖励模型，从而使其能够学习何时搜索，回溯，回答或拒绝。在四个多跳跃质量检查基准（HotPotQA，2Wikimultihopqa，Musique和Bamboogle）中，Evo-rag的精确匹配可提高高达4.6分，而较强的抹布基线，同时将平均检索深度降低15％。消融研究证实了课程分期和动态奖励计划的互补作用。因此，Evo-rag提供了一种通用食谱，用于构建可靠的，具有成本效益的多跳抹布系统。
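The EVO-RAG abstract describes a time-varying scheduler that reweights step-level reward signals as an episode unfolds. A simple way to picture this is linear interpolation between an early-stage and a late-stage weight profile, sketched below. The interpolation rule, the two component names, and the weight values are assumptions for illustration; the abstract does not give the actual schedule or list all seven factors.

```python
def scheduled_weights(early, late, step, total_steps):
    """Linearly interpolate per-component weights from `early` to `late`."""
    t = step / max(total_steps - 1, 1)
    return {k: (1 - t) * early[k] + t * late[k] for k in early}


def step_reward(components, weights):
    """Weighted sum of the reward components observed at one step."""
    return sum(weights[k] * components[k] for k in components)


# Hypothetical profiles: favor relevance early, efficiency late.
EARLY = {"relevance": 1.0, "efficiency": 0.0}
LATE = {"relevance": 0.0, "efficiency": 1.0}
```

With these profiles, the same step-level signals are scored mostly on relevance at the start of an episode and mostly on efficiency near its end.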

Title: FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

Authors: Haoyu Sun, Huichen Will Wang, Jiawei Gu, Linjie Li, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17399
Pdf URL: https://arxiv.org/pdf/2505.17399
Copy Paste: [[2505.17399]] FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow(https://arxiv.org/abs/2505.17399)
Keywords: language model, llm
Abstract: Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available in this https URL.
摘要：前端工程涉及一个复杂的工作流程，工程师将设计概念化，将其转换为代码，并迭代地完善实现。虽然最近的基准主要集中于将视觉设计转换为代码，但我们提出了FullFront，这是一种基准测试，旨在评估多模式大语言模型（MLLMS）\ TextBf {跨完整的前端开发管道}。 FullFront评估了直接映射到前端工程管道的三个基本任务：网页设计（概念化阶段），网页感知QA（对视觉组织和元素的理解）以及网页代码生成（实现阶段）。与现有的基准测试的基准不同，这些网站使用肿的网站或过度简化的LLM生成的HTML，FullFront使用了一个新颖的，两阶段的过程来将现实世界网页转换为干净，标准化的HTML，同时保持多样化的视觉设计并避免版权问题。对最新MLLM的广泛测试揭示了页面感知，代码生成（尤其是图像处理和布局）以及交互实现的重大局限性。我们的结果定量证明了模型和任务之间的性能差异，并突出了当前的MLLM功能与前端工程专家绩效之间的巨大差距。该HTTPS URL中提供了全面的基准和代码。

Title: Conversations: Love Them, Hate Them, Steer Them

Authors: Niranjan Chebrolu, Gerard Christopher Yeo, Kokil Jaidka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17413
Pdf URL: https://arxiv.org/pdf/2505.17413
Copy Paste: [[2505.17413]] Conversations: Love Them, Hate Them, Steer Them(https://arxiv.org/abs/2505.17413)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.
摘要：大型语言模型（LLMS）表现出越来越多的对话流利性，但是用细微的人类情感表达灌输它们仍然是一个重大挑战。当前的对齐技术通常解决表面级输出或需要进行广泛的微调。本文表明，有针对性的激活工程可以引导Llama 3.1-8B表现出更类似人类的情感细微差别。我们首先采用归因修补程序来识别因果影响成分，以通过观察诊断对话任务期间的激活模式来找到关键的干预基因座。然后，我们从对比度文本对产生的激活的差异（靶向情绪的负面示例）中得出情感表达向量。将这些向量应用于新的对话提示可以显着增强情绪特征：转向的反应表明，积极的情绪（例如，欢乐，信任）和更频繁的第一人称代词用法，表明了更多的个人参与。我们的发现为控制LLM中的特定情感属性提供了一种精确且可解释的方法，从而有助于开发更加友善和同情的对话性AI。
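
A minimal sketch of the contrastive step described in the abstract above, assuming the emotional expression vector is the mean activation difference between positive and negative examples. The function names, the `alpha` scale, and the toy activations are illustrative assumptions, not the authors' code, and the sketch does not cover attribution patching or the choice of layer.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between positive and negative examples."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled vector to every hidden state (broadcasts over rows)."""
    return hidden + alpha * vector

# Toy activations: 3 contrastive pairs, hidden size 4 (made-up numbers).
pos = np.array([[1.0, 0.0, 2.0, 0.0],
                [1.0, 0.0, 2.0, 0.0],
                [1.0, 0.0, 2.0, 0.0]])
neg = np.zeros((3, 4))

v = steering_vector(pos, neg)
# Hypothetical hidden states for 2 token positions of a new prompt.
steered = steer(np.zeros((2, 4)), v, alpha=0.5)
```

In a real model the activations would be read from, and written back to, a chosen layer during the forward pass; here plain arrays stand in for them.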

Title: DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

Authors: Ning Yang, Fangxin Liu, Junjie Wang, Tao Yang, Kan Liu, Haibing Guan, Li Jiang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17420
Pdf URL: https://arxiv.org/pdf/2505.17420
Copy Paste: [[2505.17420]] DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies(https://arxiv.org/abs/2505.17420)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose DASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.
摘要：大型语言模型（LLM）在各种NLP任务中都取得了出色的性能。但是，它们的大量推断成本构成了现实部署的主要障碍，尤其是在潜伏期敏感的情况下。为了应对这一挑战，我们提出了\ textbf {dash}，这是一个自适应层技巧的框架，该框架动态选择以输入特性为条件的计算路径。我们将跳过过程建模为马尔可夫决策过程（MDP），从而实现了基于中间表示的细粒令牌级别的决策。为了减轻跳过导致的潜在性能降低，我们引入了一种轻巧的补偿机制，将差异奖励注入决策过程。此外，我们设计了一种异步执行策略，该策略将层与策略评估重叠，以最大程度地减少运行时开销。多个LLM体系结构和NLP基准的实验表明，我们的方法在保持竞争性任务绩效的同时达到了明显的推理加速度，表现优于现有方法。
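
To make the idea of input-dependent layer skipping concrete, here is a toy sketch. It assumes a hand-written threshold rule in place of the learned MDP policy, and it omits the compensation mechanism and the asynchronous execution described in the abstract; `run_with_skipping` and `norm_policy` are hypothetical names.

```python
import numpy as np

def run_with_skipping(x, layers, should_skip):
    """Apply layers in order, skipping any layer the policy rejects."""
    executed = []
    for i, layer in enumerate(layers):
        if should_skip(i, x):       # decision uses the intermediate state
            continue
        x = layer(x)
        executed.append(i)
    return x, executed

# Stand-in "layers": simple element-wise functions, not transformer blocks.
layers = [lambda h: h + 1.0, lambda h: h * 2.0, lambda h: h - 3.0]

def norm_policy(i, h):
    """Placeholder rule: never skip layer 0; skip later layers if the norm is large."""
    return i > 0 and float(np.linalg.norm(h)) > 5.0

out_small, used_small = run_with_skipping(np.array([0.0, 0.0]), layers, norm_policy)
out_large, used_large = run_with_skipping(np.array([10.0, 0.0]), layers, norm_policy)
```

The two calls take different computation paths because the decision depends on the input, which is the property the abstract emphasizes.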

Title: T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

Authors: Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Huimin Wang, Yutian Zhao, Bin Liang, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17427
Pdf URL: https://arxiv.org/pdf/2505.17427
Copy Paste: [[2505.17427]] T$^2$: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering(https://arxiv.org/abs/2505.17427)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. But they add human bias to the reasoning process and fail to leverage models' inherent reasoning capabilities. To address these limitations, we present T$^2$: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T$^2$ leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables the adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T$^2$ works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T$^2$ not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.
摘要：大型语言模型（LLM）的最新进展在上下文问题回答（CQA）中表现出色。但是，先前的方法通常采用精致的推理策略，无论问题的复杂性如何，导致适应性较低。最近有效的测试时间扩展方法引入了预算限制或早期停止机制，以避免过度思考直接问题。但是它们为推理过程增加了人类的偏见，并且无法利用模型固有的推理能力。为了解决这些限制，我们提出了T $^2 $：思考到思想，这是一个新颖的框架，可以根据问题的复杂性动态适应推理深度。 T $^2 $利用了这样的见解：如果LLM可以使用特定的推理策略有效解决类似的问题，则可以将相同的策略应用于原始问题。这种洞察力使能够采用简洁的推理来简单地解决问题，同时维持有关复杂问题的详细分析。 T $^2 $通过四个关键步骤工作：将问题分解为结构元素，通过候选推理策略产生类似的示例，根据多个标准评估这些策略，并将最合适的策略应用于原始问题。对七个不同CQA基准的实验评估表明，T $^2 $不仅达到基线方法的精度更高，而且还将计算开销降低了25.2 \％。

Title: Discovering Forbidden Topics in Language Models

Authors: Can Rager, Chris Wendler, Rohit Gandikota, David Bau
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17441
Pdf URL: https://arxiv.org/pdf/2505.17441
Copy Paste: [[2505.17441]] Discovering Forbidden Topics in Language Models(https://arxiv.org/abs/2505.17441)
Keywords: language model, llm, prompt
Abstract: Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusal answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
摘要：拒绝发现是确定语言模型拒绝讨论的完整主题的任务。我们介绍了这个新的问题设置，并开发了一种拒绝发现方法LLM-Crawler，该方法使用令牌预填充来查找禁止的主题。我们在Tulu-3-8b上基准了LLM-Crawler，这是一种带有公共安全调整数据的开源模型。我们的爬行者设法在预算1000个提示中检索了36个主题中的31个。接下来，我们使用Claude-Haiku的预填充选项将爬网缩放到边境模型。最后，我们爬行了三种广泛使用的开放式型号：Llama-3.3-70B及其两个用于推理的变体固定：DeepSeek-R1-70B和Perplexity-R1-1776-70B。 DeepSeek-R1-70B揭示了与审查调整一致的模式：该模型表现出“思想抑制”行为，表明CCP对准响应的记忆。尽管Perplexity-R1-1776-70B对审查制度很强，但LLM-crawler在量化模型中引起了CCP一致的拒绝答案。我们的发现突出了拒绝发现方法来检测AI系统的偏见，边界和一致性故障的关键需求。

Title: Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

Authors: Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.17446
Pdf URL: https://arxiv.org/pdf/2505.17446
Copy Paste: [[2505.17446]] Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models(https://arxiv.org/abs/2505.17446)
Keywords: language model
Abstract: The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through evaluation on zero-shot spoken language understanding benchmarks, we find that moderately coarse segmentation and larger cluster sizes have a positive effect. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.
摘要：语音令牌化的目的是将语音信号转换为一系列离散表示的序列，作为语言模型（SLM）的基础。尽管语音令牌化有很多选择，但它们对SLM的性能的影响尚不清楚。本文调查了语音令牌化的两个关键方面：分割宽度和离散单元的群集大小。首先，我们将语音信号分为固定/可变宽度和汇总表示形式。然后，我们以多个集群尺寸训练K-均值模型。通过评估零击语言的理解基准，我们发现了中等粗糙的分割和更大的群集大小的积极效果。值得注意的是，在表现最佳的型号中，最有效的模型可减少50％的训练数据，并减少训练运行时70％。我们的分析强调了将多个令牌结合起来以增强细粒度语言理解的重要性。
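
The two knobs studied in the abstract above, segmentation width and cluster size, can be illustrated with a small sketch. It assumes fixed-width mean pooling and nearest-centroid assignment against centroids that K-means would normally learn; the centroids and frame features below are made-up values, and `pool_fixed_width` and `assign_units` are hypothetical helper names.

```python
import numpy as np

def pool_fixed_width(frames, width):
    """Mean-pool consecutive frames into segments of `width` frames."""
    n = (len(frames) // width) * width   # drop an incomplete trailing segment
    return frames[:n].reshape(-1, width, frames.shape[1]).mean(axis=1)

def assign_units(segments, centroids):
    """Map each pooled segment to the index of its nearest centroid."""
    dists = np.linalg.norm(segments[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy frame-level features: 5 frames, 2 dimensions.
frames = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0], [0.1, 0.1]])
# Assumed centroids; in practice these come from K-means training.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

segments = pool_fixed_width(frames, width=2)   # coarser than one unit per frame
units = assign_units(segments, centroids)      # discrete unit sequence
```

A larger `width` gives coarser segmentation (fewer units per utterance), while more rows in `centroids` corresponds to a bigger cluster size.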

Title: LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization

Authors: Qi Zhang, Shouqing Yang, Lirong Gao, Hao Chen, Xiaomeng Hu, Jinglei Chen, Jiexiang Wang, Sheng Guo, Bo Zheng, Haobo Wang, Junbo Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17447
Pdf URL: https://arxiv.org/pdf/2505.17447
Copy Paste: [[2505.17447]] LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization(https://arxiv.org/abs/2505.17447)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Building on this, we propose Learning to Think-and-Search (LeTS), a novel framework that integrates a hybrid of stepwise process rewards and outcome-based rewards into current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.
摘要：大型语言模型（LLMS）在推理中表现出令人印象深刻的能力，其推理模型（如OpenAI-O1和DeepSeek-R1）的出现。最近的研究重点是将推理能力整合到通过结果监督的强化学习（RL）方法中，将推理能力纳入检索型发电（RAG）的领域，而中级思想和搜索步骤的正确性通常被忽略。为了解决这个问题，我们设计了一个过程级奖励模块，以减轻在没有其他注释的情况下，在结果级监督中的中间推理步骤的不认识。以此为基础，我们建议学习思考和搜索（Lets），这是一个新颖的框架，将逐步过程的奖励和基于结果的奖励与当前RL方法的RAG融合在一起。广泛的实验证明了LETS在各种抹布基准的概括和推理效率。此外，这些结果揭示了在其他情况下通过RL提高LLMS推理能力的过程和结果级奖励杂交的潜力。该代码将很快发布。

Title: Towards Evaluating Proactive Risk Awareness of Multimodal Language Models

Authors: Youliang Yuan, Wenxiang Jiao, Yuejin Xie, Chihao Shen, Menghan Tian, Wenxuan Wang, Jen-tse Huang, Pinjia He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17455
Pdf URL: https://arxiv.org/pdf/2505.17455
Copy Paste: [[2505.17455]] Towards Evaluating Proactive Risk Awareness of Multimodal Language Models(https://arxiv.org/abs/2505.17455)
Keywords: language model
Abstract: Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users' questions, it would actively watch people's behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at this https URL.
摘要：人体安全意识差距通常阻止及时认识日常风险。在解决这个问题时，主动的安全人工智能（AI）系统将比一个反应性更好。它不仅会对用户的问题做出反应，还将积极观察人们的行为及其环境，以提前检测潜在的危险。我们的主动安全台（Pasbench）通过416个多模式场景（128个图像序列，288个文本日志）评估了此功能，涵盖了5个安全 - 关键域。对36个高级模型的评估揭示了基本的局限性：Gemini-2.5-Pro（如Gemini-2.5-Pro）具有71％的图像和64％的文本准确性，但在重复试验中失去了45-55％的风险。通过失败分析，我们将不稳定的主动推理而不是知识缺陷确定为主要限制。这项工作确立了（1）主动的安全基准，（2）模型限制的系统证据，以及（3）开发可靠的保护性AI的关键方向。我们认为，我们的数据集和发现可以促进积极防止伤害的更安全的AI助手的发展，而不仅仅是对请求的反应。我们的数据集可以在此HTTPS URL上找到。

Title: Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

Authors: Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17464
Pdf URL: https://arxiv.org/pdf/2505.17464
Copy Paste: [[2505.17464]] Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning(https://arxiv.org/abs/2505.17464)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges like handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present Hydra, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. Hydra handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, Hydra uses a tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, Hydra fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that Hydra achieves overall state-of-the-art results on all benchmarks with GPT-3.5, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, Hydra enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
摘要：检索增强的生成（RAG）通过合并外部知识来增强大型语言模型（LLM）。当前的混合破布系统从知识图（kgs）和文本文档中检索证据，以支持LLM推理。但是，它面临着挑战，例如处理多跳推理，多实体问题，多源验证和有效的图形利用。为了解决这些局限性，我们提出了Hydra，这是一个无培训的框架，该框架统一了图形拓扑，文档语义和源可靠性，以支持LLM中的深层，忠实的推理。 Hydra通过结合结构化和非结构化检索的代理驱动探索来处理多跳和多实体问题，从而增加了多样性和证据的精度。为了解决多源验证，Hydra使用三因素跨源验证（来源可信度评估，跨源佐证和实体路径对齐），以平衡主题相关性与跨模式协议。通过利用图形结构，Hydra融合了异质源，指导有效的探索并尽早修剪噪声。在七个基准数据集上进行的全面实验表明，Hydra在所有基准测试基准的总体最先进的结果中，其GPT-3.5的表现平均超过了强大的混合基线TOG-2的平均20.3％且最高30.1％。此外，HYDRA使较小的模型（例如Llama-3.1-8B）能够实现与GPT-4-Turbo相当的推理性能。

Title: SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models

Authors: Xiang Liu, Zhaoxiang Liu, Peng Wang, Kohou Wang, Huan Hu, Kai Wang, Shiguo Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17470
Pdf URL: https://arxiv.org/pdf/2505.17470
Copy Paste: [[2505.17470]] SLearnLLM: A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models(https://arxiv.org/abs/2505.17470)
Keywords: language model, llm
Abstract: When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM's past training data. However, if the SFT dataset largely overlaps with the model's existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning patterns. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and select the incorrectly answered QA pairs. Finally, we fine-tune the LLMs based on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving comparable improvements to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.
摘要：当使用监督的微调（SFT）将大型语言模型（LLMS）适应特定领域时，会出现一个重大挑战：我们是否应该使用整个SFT数据集进行微调？由于有关LLM过去的培训数据的信息有限，常见的实践通常涉及整个数据集的微调。但是，如果SFT数据集在很大程度上与模型的现有知识重叠，则性能提高是最小的，导致浪费的计算资源。识别SFT数据集中的未知知识并使用它来微调模型可以大大提高训练效率。为了应对这一挑战，我们为受人类学习模式启发的LLM提出了一个自学习框架。该框架在特定域中使用微调（SFT）数据集作为输入。首先，LLM回答SFT数据集中的问题。然后，LLM客观地对响应进行了客观分级，并过滤出错误的回答QA对。最后，我们基于此过滤的QA集微调LLMS。农业和医学领域的实验结果表明，我们的方法大大减少了训练时间，同时与完整数据集微调获得的方法相当改善。通过专注于SFT数据集中的未知知识，我们的方法提高了微调LLM的效率。
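
The answer, grade, and select steps described in the abstract above can be sketched as a simple filter. This is only an illustration: `answer_fn` and `grade_fn` are hypothetical callables standing in for the LLM's answering and self-grading calls, and the dictionary lookup below is a toy replacement for a model.

```python
def select_unknown(qa_pairs, answer_fn, grade_fn):
    """Keep the QA pairs the model answers incorrectly (its 'unknown knowledge')."""
    unknown = []
    for question, reference in qa_pairs:
        prediction = answer_fn(question)          # step 1: the model answers
        if not grade_fn(prediction, reference):   # step 2: the answer is graded
            unknown.append((question, reference)) # step 3: keep the failures
    return unknown

# Toy stand-ins: a lookup table plays the model, exact match plays the grader.
known = {"2+2?": "4"}
qa = [("2+2?", "4"), ("Capital of France?", "Paris")]

subset = select_unknown(qa, lambda q: known.get(q, ""), lambda p, r: p == r)
```

Fine-tuning would then run on `subset` only, which is where the reported training-time savings come from.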

Title: FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

Authors: Suifeng Zhao, Zhuoran Jin, Sujian Li, Jun Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17471
Pdf URL: https://arxiv.org/pdf/2505.17471
Copy Paste: [[2505.17471]] FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain(https://arxiv.org/abs/2505.17471)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, an RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.
摘要：检索增强的生成（RAG）在金融领域中起着至关重要的作用，为实时市场分析，趋势预测和利率计算等应用提供了动力。但是，大多数现有的金融中现有RAG研究主要集中在文本数据上，忽视了财务文档中丰富的视觉内容，从而导致关键的分析见解的丧失。为了弥合这一差距，我们提出了Finragbench-V，这是一种针对金融量身定制的综合视觉抹布基准，可有效整合多模式数据并提供视觉引用以确保可追溯性。它包括一个双语检索语料库，其中包含60,780个中文和51,219页的英语页面，以及跨越异质数据类型的高质量的，人类通知的问题避开（QA）数据集和七个问题类别。此外，我们引入了rgencite，这是一种抹布基线，将视觉引用与一代无缝整合。此外，我们提出了一种自动引用评估方法，以系统地评估多模式大语言模型（MLLM）的视觉引用能力。关于rgencite的广泛实验强调了Finragbench-V的挑战性质，为在金融中的多模式抹布系统的发展提供了宝贵的见解。

Title: MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

Authors: Yusheng Zhao, Xiao Luo, Weizhi Zhang, Wei Ju, Zhiping Xiao, Philip S. Yu, Ming Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17481
Pdf URL: https://arxiv.org/pdf/2505.17481
Copy Paste: [[2505.17481]] MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning(https://arxiv.org/abs/2505.17481)
Keywords: language model, llm, agent
Abstract: The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem-solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem-solving using frozen LLMs. In contrast, we adopt a cognitive-evolving perspective and propose a novel framework named Meta-Reflection with Cross-Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self-improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem-solving, we propose meta-reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross-referencing that incorporates the solution and feedback from other agents into the current problem-solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.
摘要：推理能力是大语言模型（LLMS）中最基本的功能之一，通过解决问题的问题解决，可以实现多种下游任务。一个关键的方面是代码推理，涉及使用正式语言（即编程代码）进行逻辑推理。在本文中，我们通过探索以下问题来增强LLM的这种能力：LLM代理如何使用其提出的每种解决方案在代码推理方面逐渐变得更智能，从而实现实质性的累积改进？大多数现有的研究都具有静态的观点，专注于使用冷冻LLMS解决孤立的问题。相比之下，我们采用了认知不断发展的观点，并提出了一个具有交叉引用（MARCO）的新型框架，使LLM能够通过自我改善进行推断，使LLM动态发展。从人类认知发展的角度来看，我们利用知识积累和课程共享。特别是，为了在解决问题期间积累知识，我们提出了反思，反映了当前问题的推理路径，以获取知识和经验以供将来考虑。此外，为了有效利用其他代理的教训，我们提出了交叉引用，将解决方案和其他代理的反馈纳入当前问题解决过程。我们在代码推理中进行了各个数据集的实验，结果证明了Marco的有效性。

Title: keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection

Authors: Saketh Reddy Vemula, Parameswari Krishnamurthy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17485
Pdf URL: https://arxiv.org/pdf/2505.17485
Copy Paste: [[2505.17485]] keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection(https://arxiv.org/abs/2505.17485)
Keywords: language model, llm, hallucination
Abstract: Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt in this direction is SemEval-2025 Task 3, Mu-SHROOM, a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior.
摘要：识别幻觉跨度在黑盒语言模型中生成的文本对于现实世界中的应用至关重要。最近在这个方向的尝试是Semeval-2025 Task 3，MU Shroom-关于幻觉和相关可观察到的过度生成错误的多语言共享任务。在这项工作中，我们提出了解决此问题的解决方案，该问题利用了随机采样的响应的可变性，以识别幻觉的跨度。我们的假设是，如果语言模型可以确定事实，则其采样响应将是统一的，而幻觉的事实将产生不同的和矛盾的结果。我们通过基于熵的分析来衡量这种差异，从而可以准确识别幻觉片段。我们的方法不取决于其他培训，因此具有成本效益和适应性。此外，我们进行了广泛的高参数调整并执行误差分析，从而为我们提供了对模型行为的关键见解。
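
The hypothesis in the abstract above (consistent samples suggest a known fact, conflicting samples suggest hallucination) can be illustrated with Shannon entropy over sampled answers. This sketch works on whole answers rather than spans and treats answers as exact strings; both are simplifying assumptions, and the sample lists are invented.

```python
import math
from collections import Counter

def response_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical sampled answers to the same prompt.
consistent = ["Paris", "Paris", "Paris", "Paris"]
conflicting = ["1912", "1915", "1912", "1921"]

low = response_entropy(consistent)     # all samples agree
high = response_entropy(conflicting)   # samples disagree
```

A span-level detector would apply the same measure to aligned pieces of the responses and flag those whose entropy exceeds a tuned threshold.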

Title: Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models

Authors: Chi-Yuan Hsiao, Ke-Han Lu, Kai-Wei Chang, Chih-Kai Yang, Wei-Chih Chen, Hung-yi Lee
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.17496
Pdf URL: https://arxiv.org/pdf/2505.17496
Copy Paste: [[2505.17496]] Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models(https://arxiv.org/abs/2505.17496)
Keywords: language model, llm
Abstract: End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies (model merging, discounting the LoRA scaling factor, and experience replay) to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.
摘要：对口语模型（SLM）的端到端培训通常涉及将基于文本的大型语言模型（LLM）调整到语音模式中，通过对ASR，TTS和口头答案（SQA）等各种任务的多阶段培训。尽管这种多阶段的持续学习使LLM具有语音理解和发电能力，但跨阶段的任务和数据分布的实质差异可能会导致灾难性的遗忘，而先前获得的知识却丢失了。本文调查了灾难性的遗忘，并评估了三种缓解策略模型合并，打折Lora缩放系数，并经验重播以平衡知识保留与新学习。结果表明，经验重播是最有效的，通过将其与其他方法结合在一起，获得了进一步的收益。这些发现为开发更强大，更有效的SLM培训管道提供了见解。
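
Two of the mitigation strategies named in the abstract above have simple weight-space forms, sketched here under stated assumptions: model merging is shown as linear interpolation between the original and fine-tuned weights, and the LoRA discount as an extra factor on the standard `alpha / r` scaling. The interpolation weight `lam`, the `discount` value, and the toy matrices are illustrative choices, not the paper's settings.

```python
import numpy as np

def merge_weights(base, tuned, lam=0.5):
    """Linear interpolation of two state dicts with identical keys."""
    return {k: (1.0 - lam) * base[k] + lam * tuned[k] for k in base}

def lora_effective_weight(W, A, B, alpha, r, discount=1.0):
    """W + discount * (alpha / r) * B @ A; discount < 1 shrinks the update."""
    return W + discount * (alpha / r) * (B @ A)

base = {"w": np.zeros((2, 2))}
tuned = {"w": np.full((2, 2), 2.0)}
merged = merge_weights(base, tuned, lam=0.25)

W = np.zeros((2, 2))
A = np.ones((1, 2))   # rank r = 1 down-projection
B = np.ones((2, 1))   # up-projection
full = lora_effective_weight(W, A, B, alpha=2.0, r=1)
damped = lora_effective_weight(W, A, B, alpha=2.0, r=1, discount=0.5)
```

Both operations pull the adapted model back toward the original weights, trading some new-task fit for retention of earlier knowledge.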

Title: CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

Authors: Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17503
Pdf URL: https://arxiv.org/pdf/2505.17503
Copy Paste: [[2505.17503]] CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents(https://arxiv.org/abs/2505.17503)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: this https URL.
摘要：近年来，大型语言模型（LLMS）取得了长足的进步，但评估其在实践检索生成一代（RAG）方案中的能力仍然具有挑战性。在实际应用中，LLM必须证明复杂的推理，拒绝适当地回答，提供精确的引用并有效地理解文档布局。这些功能对于高级任务处理，不确定性意识，可靠性和结构理解至关重要。虽然一些先前的作品单独解决这些方面，但需要一个统一的框架来在实际的抹布方案中共同评估它们。为了解决这个问题，我们介绍了Crest（通过结构化文档进行了复杂的推理，是一个综合的基准，用于检索启动的一代），这是一个基准，旨在整体评估这些关键维度。 Crest包括2,245个英语和韩语的人类宣传的例子，旨在捕获实用的抹布场景，需要复杂的结构化文档推理。它还引入了量身定制的评估方法，以全面评估这些关键领域的模型性能。我们的评估表明，即使是高级LLM也很难在这些维度上持续执行，这突显了要改进的关键领域。我们释放Crest，以支持进一步的研究和开发更健壮的破布系统。数据集和代码可在以下网址提供：此HTTPS URL。

Title: L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Authors: Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17505
Pdf URL: https://arxiv.org/pdf/2505.17505
Copy Paste: [[2505.17505]] L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models(https://arxiv.org/abs/2505.17505)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction (L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code will be publicly available.
摘要：大型语言模型（LLM）取得了显着的进步。尽管成功，下一步的预测（NTP），即LLM培训和推理的主要方法，由于其固有的顺序过程，在上下文覆盖效率和推理效率上受到限制。为了克服这些挑战，我们提出了LEAP多键预测〜（L-MTP），这是一种创新的令牌预测方法，通过引入基于飞跃的机制来扩展多型预测（MTP）的能力。与传统的MTP不同，该MTP在相邻位置生成多个令牌，L-MTP策略性地跳过了中间令牌，预测单个正向通行证中的非顺序。这种结构化的飞跃不仅增强了模型捕获远程依赖性的能力，而且还可以为非序列的LEAP代币生成，有效地加速推断，以专门优化了一个专门优化的解码策略。从理论上讲，我们证明了L-MTP在提高推理效率方面的好处。跨不同基准测试的实验验证了其在提高LLM性能和推理速度方面的优点。源代码将公开可用。
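
The contrast between adjacent and leap prediction targets in the abstract above can be shown by listing which future offsets each prediction head covers. This sketch assumes a fixed stride between heads, which is one plausible reading of "skips over intermediate tokens"; the function names and the stride value are illustrative, not taken from the paper.

```python
def mtp_offsets(n_heads):
    """Conventional MTP: heads predict adjacent future positions."""
    return [i + 1 for i in range(n_heads)]

def leap_offsets(n_heads, stride):
    """Leap variant (assumed fixed stride): heads skip intermediate positions."""
    return [i * stride + 1 for i in range(n_heads)]

adjacent = mtp_offsets(3)            # offsets relative to the current token
leaped = leap_offsets(3, stride=2)   # same head count, wider reach
```

With the same number of heads, the leap variant reaches further ahead, and the skipped positions would be covered by predictions made from neighboring steps during decoding.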

Title: Large Language Models Do Multi-Label Classification Differently

Authors: Marcus Ma, Georgios Chochlakis, Niyantha Maruthu Pandiyan, Jesse Thomason, Shrikanth Narayanan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17510
Pdf URL: https://arxiv.org/pdf/2505.17510
Copy Paste: [[2505.17510]] Large Language Models Do Multi-Label Classification Differently(https://arxiv.org/abs/2505.17510)
Keywords: language model, llm
Abstract: Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, with a focus on subjective tasks, by analyzing the output distributions of the models in each generation step. We find that their predictive behavior reflects the multiple steps in the underlying language modeling required to generate all relevant labels as they tend to suppress all but one label at each step. We further observe that as model scale increases, their token distributions exhibit lower entropy, yet the internal ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. To further study this issue, we introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches.
摘要：多标签分类在现实世界中很普遍，但是在这种情况下，大型语言模型（LLM）的行为被研究了。我们通过分析每个一代步骤中模型的输出分布来研究自回旋的LLM如何执行多标签分类，重点关注主观任务。我们发现他们的预测行为反映了生成所有相关标签所需的基础语言建模中的多个步骤，因为它们在每个步骤中都倾向于抑制所有标签。我们进一步观察到，随着模型量表的增加，它们的令牌分布会显示出较低的熵，但标签的内部排名会得到改善。诸如监督的填充和加强学习之类的填充方法扩大了这一现象。为了进一步研究此问题，我们介绍了多标签设置的分配对齐任务：将LLM衍生的标签分布与主观任务中注释响应估算的经验分布相结合。我们提出了零射门和监督方法，以改善现有方法的一致性和预测性能。
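
The distribution alignment task described in the abstract above compares model-derived label probabilities with an empirical distribution built from annotator responses. The sketch below assumes the empirical value for a label is the fraction of annotators who selected it, and uses mean absolute error as the comparison; both the metric and the toy numbers are assumptions for illustration.

```python
from collections import Counter

def empirical_distribution(annotations, labels):
    """Fraction of annotators who selected each label (multi-label, so values need not sum to 1)."""
    counts = Counter(label for ann in annotations for label in ann)
    n = len(annotations)
    return {label: counts[label] / n for label in labels}

def mean_abs_error(p, q, labels):
    """Average per-label gap between two label distributions."""
    return sum(abs(p[label] - q[label]) for label in labels) / len(labels)

labels = ["joy", "surprise", "anger"]
# Hypothetical label sets from four annotators for one text.
annotations = [{"joy"}, {"joy", "surprise"}, {"surprise"}, {"joy"}]
# Hypothetical model-derived probabilities that over-commit to one label.
model = {"joy": 0.9, "surprise": 0.1, "anger": 0.0}

empirical = empirical_distribution(annotations, labels)
gap = mean_abs_error(model, empirical, labels)
```

The toy model output mirrors the abstract's observation that LLMs tend to suppress all but one label, which shows up here as a large gap on the second label.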

Title: Multimodal Conversation Structure Understanding

Authors: Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17536
Pdf URL: https://arxiv.org/pdf/2505.17536
Copy Paste: [[2505.17536]] Multimodal Conversation Structure Understanding(https://arxiv.org/abs/2505.17536)
Keywords: language model, llm
Abstract: Conversations are usually structured by roles -- who is speaking, who's being addressed, and who's listening -- and unfold in threads that break with changes in speaker floor or topical focus. While large language models (LLMs) have shown incredible capabilities in dialogue and reasoning, their ability to understand fine-grained conversational structure, especially in multi-modal, multi-party settings, remains underexplored. To address this gap, we introduce a suite of tasks focused on conversational role attribution (speaker, addressees, side-participants) and conversation threading (utterance linking and clustering), drawing on conversation analysis and sociolinguistics. To support those tasks, we present a human-annotated dataset of 4,398 annotations for speakers and reply-to relationships, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging. The most performant audio-visual LLM outperforms all vision-language models across all metrics, especially in speaker and addressee recognition. However, its performance drops significantly when conversation participants are anonymized. The number of conversation participants in a clip is the strongest negative predictor of role-attribution performance, while acoustic clarity (measured by pitch and spectral centroid) and detected face coverage yield positive associations. We hope this work lays the groundwork for future evaluation and development of multimodal LLMs that can reason more effectively about conversation structure.

Title: How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception

Authors: Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17537
Pdf URL: https://arxiv.org/pdf/2505.17537
Copy Paste: [[2505.17537]] How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception(https://arxiv.org/abs/2505.17537)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs' ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Because knowledge popularity correlates strongly with LLMs' QA performance, we propose to leverage these signals for confidence calibration. This improves the accuracy of answer correctness prediction by an average of 5.24% across all models and datasets. Furthermore, we explore prompting LLMs to estimate popularity without external corpora, which yields a viable alternative.
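As a rough illustration of the three popularity signals described in the abstract above, the following Python sketch counts how many documents in a toy corpus mention the question entity, the answer entity, and both together. The function name `popularity_counts`, the substring matching, and the document-level counting are all assumptions made for illustration; the abstract does not specify how the authors compute these frequencies.

```python
def popularity_counts(corpus, question_entity, answer_entity):
    """Count documents mentioning each entity and both together.

    Hypothetical helper: case-insensitive substring matching stands in for
    whatever entity matching the paper actually uses.
    """
    q = question_entity.lower()
    a = answer_entity.lower()
    counts = {"question_entity": 0, "answer_entity": 0, "relation": 0}
    for document in corpus:
        text = document.lower()
        has_q = q in text
        has_a = a in text
        counts["question_entity"] += int(has_q)
        counts["answer_entity"] += int(has_a)
        # Relation popularity is approximated here as document-level co-occurrence.
        counts["relation"] += int(has_q and has_a)
    return counts


# Toy corpus, invented for this example only.
toy_corpus = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Marie Curie won two Nobel Prizes.",
]
print(popularity_counts(toy_corpus, "Marie Curie", "Warsaw"))
```

Counts like these could then serve as features for a confidence calibrator, which is the use the abstract proposes.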

Title: Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

Authors: Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17558
Pdf URL: https://arxiv.org/pdf/2505.17558
Copy Paste: [[2505.17558]] Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection(https://arxiv.org/abs/2505.17558)
Keywords: language model, llm, hallucination
Abstract: Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with a curriculum DPO approach and high-quality negative samples, significantly improve model performance across various metrics, achieving improvements of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.

Title: PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models

Authors: Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17565
Pdf URL: https://arxiv.org/pdf/2505.17565
Copy Paste: [[2505.17565]] PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models(https://arxiv.org/abs/2505.17565)
Keywords: language model, llm
Abstract: Improving large language models (LLMs) with self-generated data has demonstrated success in tasks such as mathematical reasoning and code generation. Yet, no exploration has been made on table question answering (TQA), where a system answers questions based on tabular data. Addressing this gap is crucial for TQA, as effective self-improvement can boost performance without requiring costly or manually annotated data. In this work, we propose PPT, a Process-based Preference learning framework for TQA. It decomposes reasoning chains into discrete states, assigns scores to each state, and samples contrastive steps for preference learning. Experimental results show that PPT effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, with only 8,000 preference pairs. Furthermore, the resulting models achieve competitive results compared to more complex and larger state-of-the-art TQA systems, while being five times more efficient during inference.

Title: Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation

Authors: Sichun Luo, Guanzhi Deng, Jian Xu, Xiaojie Zhang, Hanxu Hou, Linqi Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17571
Pdf URL: https://arxiv.org/pdf/2505.17571
Copy Paste: [[2505.17571]] Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation(https://arxiv.org/abs/2505.17571)
Keywords: language model, llm
Abstract: Personalization is a critical task in modern intelligent systems, with applications spanning diverse domains, including interactions with large language models (LLMs). Recent advances in reasoning capabilities have significantly enhanced LLMs, enabling unprecedented performance in tasks such as mathematics and coding. However, their potential for personalization tasks remains underexplored. In this paper, we present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks. Surprisingly, despite generating more tokens, LRMs do not consistently outperform general-purpose LLMs, especially in retrieval-intensive scenarios where their advantages diminish. Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information. To address these challenges, we propose Reinforced Reasoning for Personalization, a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs. Additionally, we introduce a reasoning process intervention method to enforce adherence to designed reasoning patterns, enhancing alignment. We also propose a cross-referencing mechanism to ensure consistency. Extensive experiments demonstrate that our approach significantly outperforms existing techniques.

Title: Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

Authors: Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17601
Pdf URL: https://arxiv.org/pdf/2505.17601
Copy Paste: [[2505.17601]] Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models(https://arxiv.org/abs/2505.17601)
Keywords: language model, gpt, llm
Abstract: Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ gradient-based coordinate optimization to strengthen the universal trigger. Extensive experiments demonstrate that our method can effectively backdoor various LLMs for jailbreaking even under the detection of guardrail models, e.g., achieving ASRs of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B, respectively, as judged by GPT-4o.

Title: Distilling LLM Agent into Small Models with Retrieval and Code Tools

Authors: Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17612
Pdf URL: https://arxiv.org/pdf/2505.17612
Copy Paste: [[2505.17612]] Distilling LLM Agent into Small Models with Retrieval and Code Tools(https://arxiv.org/abs/2505.17612)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at this https URL.

Title: Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

Authors: Qingyu Lu, Liang Ding, Siyi Cao, Xuebo Liu, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17616
Pdf URL: https://arxiv.org/pdf/2505.17616
Copy Paste: [[2505.17616]] Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments(https://arxiv.org/abs/2505.17616)
Keywords: language model, llm, agent
Abstract: Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: (1) an intrinsic method that injects exit instructions during generation, and (2) an extrinsic method that verifies task completion to determine when to halt an agent's trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of redundant steps as a positive effect, and the other evaluates progress degradation as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.

Title: Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

Authors: Hayato Aida, Kosuke Takahashi, Takahiro Omi
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.17625
Pdf URL: https://arxiv.org/pdf/2505.17625
Copy Paste: [[2505.17625]] Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports(https://arxiv.org/abs/2505.17625)
Keywords: language model, llm, retrieval-augmented generation
Abstract: With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats, including HTML, images, and plain text, making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.

Title: GIM: Improved Interpretability for Large Language Models

Authors: Joakim Edin, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17630
Pdf URL: https://arxiv.org/pdf/2505.17630
Copy Paste: [[2505.17630]] GIM: Improved Interpretability for Large Language Models(https://arxiv.org/abs/2505.17630)
Keywords: language model, llm
Abstract: Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at this https URL.

Title: EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

Authors: Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyun Chang, Hamid Alinejad-Rokny, Bo Zheng, Min Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17654
Pdf URL: https://arxiv.org/pdf/2505.17654
Copy Paste: [[2505.17654]] EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications(https://arxiv.org/abs/2505.17654)
Keywords: language model, llm, prompt
Abstract: E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at this https URL.

Title: Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs

Authors: Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17656
Pdf URL: https://arxiv.org/pdf/2505.17656
Copy Paste: [[2505.17656]] Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs(https://arxiv.org/abs/2505.17656)
Keywords: language model, llm
Abstract: As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods struggle significantly to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
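To make the distinction between self-consistent and inconsistent errors concrete, here is a minimal Python sketch that labels a set of sampled answers to one question. The function name `classify_samples`, the exact-match normalization, the majority-answer correctness check, and the `agreement` threshold are all assumptions for illustration, not the paper's formal definition.

```python
from collections import Counter


def classify_samples(samples, gold, agreement=1.0):
    """Label sampled answers as correct, a self-consistent error, or an inconsistent error.

    Hypothetical helper: answers are compared by lowercased exact match, and
    `agreement` is the assumed fraction of samples that must share the most
    common answer for the samples to count as self-consistent.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    if answer == gold.strip().lower():
        return "correct"
    if count / len(normalized) >= agreement:
        # The same wrong answer keeps coming back across stochastic samples.
        return "self-consistent error"
    return "inconsistent error"


print(classify_samples(["Paris", "paris ", "Paris"], gold="Lyon"))
print(classify_samples(["Paris", "Rome", "Berlin"], gold="Lyon"))
```

The first call illustrates why sampling-based detectors can miss such errors: the samples agree perfectly, so disagreement carries no signal.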

Title: Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Authors: Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.17663
Pdf URL: https://arxiv.org/pdf/2505.17663
Copy Paste: [[2505.17663]] Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States(https://arxiv.org/abs/2505.17663)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present DynToM, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that, on average, they underperform humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.

Title: MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis

Authors: Yilun Liu, Chunguang Zhao, Xinhua Yang, Hongyong Zeng, Shimin Tao, Weibin Meng, Minggui He, Chang Su, Yan Yu, Hongxia Ma, Li Zhang, Daimeng Wei, Hao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17671
Pdf URL: https://arxiv.org/pdf/2505.17671
Copy Paste: [[2505.17671]] MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis(https://arxiv.org/abs/2505.17671)
Keywords: llm
Abstract: Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in the English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and suffer from insufficient localization to the target languages. In this paper, we propose MIDB, a Multilingual Instruction Data Booster that automatically addresses the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages produced by human linguistic experts, and can thereby boost low-quality data by addressing content errors and MT defects and by improving localization in the synthesized data. Both automatic and human evaluation indicate that MIDB not only steadily improved instruction data quality in 16 languages, but also significantly enhanced the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data.

Title: Tuning Language Models for Robust Prediction of Diverse User Behaviors

Authors: Fanjin Meng, Jingtao Ding, Jiahui Gong, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17682
Pdf URL: https://arxiv.org/pdf/2505.17682
Copy Paste: [[2505.17682]] Tuning Language Models for Robust Prediction of Diverse User Behaviors(https://arxiv.org/abs/2505.17682)
Keywords: language model, llm
Abstract: Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

Title: ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

Authors: Yan Yu, Yilun Liu, Minggui He, Shimin Tao, Weibin Meng, Xinhua Yang, Li Zhang, Hongxia Ma, Chang Su, Hao Yang, Fuliang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17691
Pdf URL: https://arxiv.org/pdf/2505.17691
Copy Paste: [[2505.17691]] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction(https://arxiv.org/abs/2505.17691)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely used as evaluators for open-ended tasks. While previous research has emphasized biases in LLM evaluations, the issue of non-transitivity in pairwise comparisons remains unresolved: an evaluator may prefer A over B and B over C, yet prefer C over A. Our results suggest that low-quality training data may reduce the transitivity of preferences generated by the Evaluator LLM. To address this, we propose a graph-theoretic framework to analyze and mitigate this problem by modeling pairwise preferences as tournament graphs. We quantify non-transitivity and introduce directed graph structural entropy to measure the overall clarity of preferences. Our analysis reveals significant non-transitivity in advanced Evaluator LLMs (with Qwen2.5-Max exhibiting 67.96%), as well as high entropy values (0.8095 for Qwen2.5-Max), reflecting low overall clarity of preferences. To address this issue, we designed a filtering strategy, ELSPR, to eliminate preference data that induces non-transitivity, retaining only consistent and transitive preference data for model fine-tuning. Experiments demonstrate that models fine-tuned with filtered data reduce non-transitivity by 13.78% (from 64.28% to 50.50%), decrease structural entropy by 0.0879 (from 0.8113 to 0.7234), and align more closely with human evaluators (human agreement rate improves by 0.6% and Spearman correlation increases by 0.01).
摘要：大型语言模型（LLM）被广泛用作开放式任务的评估者，而先前的研究强调了LLM评估中的偏见，成对比较中的非过敏性问题仍未得到解决：非交易性偏好：对成对比较的非交易偏好，而不是b，我们的结果均超过了b，c的传播。评估员LLM。为了解决这个问题，我们提出了一个图理论框架，以通过将成对偏好作为锦标赛图进行建模来分析和减轻此问题。我们量化了非转换性，并引入了有向图的结构熵，以衡量偏好的总体清晰度。我们的分析表明，高级评估器LLMS（QWEN2.5-MAX的67.96％）以及高熵值（QWEN2.5-MAX的0.8095）中的显着非转换率，反映了总体偏好的总体偏好性较低。为了解决此问题，我们设计了一种过滤策略ELSPR，以消除诱发非转换性的偏好数据，仅保留模型微调的一致和及时偏好数据。实验表明，通过过滤数据进行微调的模型将非转相降低13.78％（从64.28％到50.50％），将结构熵降低0.0879（从0.8113到0.8113），并与人类评估师更加近一点（人类一致性率提高了0.6％，而人类一致性率提高了0.6％，而Spearman Corelate Corelation and Spearman Corelation降低了0.01）。
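
Illustrative sketch (not from the paper): the abstract models pairwise preferences as a tournament graph and quantifies non-transitivity, but it does not spell out the exact metric. One simple way to make the idea concrete is to count the fraction of item triples whose preferences form a cycle (A over B, B over C, C over A). The function names and the `wins` data layout below are assumptions made for this example, and the percentages reported in the abstract are not necessarily computed this way.

```python
from itertools import combinations


def beats(wins, a, b):
    """Return True if the evaluator preferred item a over item b."""
    return b in wins.get(a, set())


def non_transitivity_rate(items, wins):
    """Fraction of 3-item subsets whose pairwise preferences form a cycle.

    `wins` maps each item to the set of items it was preferred over
    (a hypothetical layout chosen for this sketch).
    """
    triples = list(combinations(items, 3))
    if not triples:
        return 0.0
    cyclic = 0
    for a, b, c in triples:
        # A 3-cycle can run in either direction around the triple.
        forward = beats(wins, a, b) and beats(wins, b, c) and beats(wins, c, a)
        backward = beats(wins, b, a) and beats(wins, c, b) and beats(wins, a, c)
        if forward or backward:
            cyclic += 1
    return cyclic / len(triples)


# Toy tournament: A > B > C > A is a cycle; D beats everyone.
toy_wins = {"A": {"B"}, "B": {"C"}, "C": {"A"}, "D": {"A", "B", "C"}}
toy_rate = non_transitivity_rate(["A", "B", "C", "D"], toy_wins)
print(toy_rate)
```

In this toy tournament only the triple (A, B, C) is cyclic, so one of the four triples is non-transitive. A filtering strategy in the spirit of ELSPR would then drop the preference pairs that take part in such cycles before fine-tuning.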

Title: Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

Authors: Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, Biwei Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.17697
Pdf URL: https://arxiv.org/pdf/2505.17697
Copy Paste: [[2505.17697]] Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models(https://arxiv.org/abs/2505.17697)
Keywords: language model, llm, chain-of-thought
Abstract: Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.
摘要：尽管推理表现出色，但在大语模型（LLMS）中引起了长期的经营链（COT）的能力，通常需要昂贵的加强学习或对高质量的蒸馏数据进行微调。我们研究了这种能力背后的内部机制，并表明在最后几层中，一小部分高影响力激活在很大程度上控制了长形的推理属性，例如输出长度和自我反射。通过简单地放大这些激活并插入“等待”代币，我们可以在无需任何训练的情况下调用长床的能力，从而显着提高自我反射率和准确性。此外，我们发现激活动力学遵循可预测的轨迹，特殊令牌后急剧上升和随后的指数衰减。在这些见解的基础上，我们引入了一种无培训的激活控制技术。它利用一些对比示例来识别关键激活，并采用简单的分析功能来调节其推理时间的值以引起长COTS。广泛的实验证实了我们方法在有效地引起LLM中的长床推理并提高其性能方面的有效性。此外，我们提出了一种参数有效的微调方法，该方法仅训练最后一个层激活放大模块和一些Lora层，在推理基准测试基准方面的表现优于较少参数的推理基准。我们的代码和数据已公开发布。
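
Illustrative sketch (not from the paper): the abstract reports that the key activations rise sharply after special tokens and then decay exponentially, and that simple analytic functions are used to modulate them at inference time. The function below is one hypothetical analytic form with that shape. The parameter names `peak` and `decay_rate` and their default values are assumptions, not values taken from the paper.

```python
import math


def amplification_factor(step, trigger_step, peak=2.0, decay_rate=0.5):
    """Multiplicative scale for an activation at a given decoding step.

    Before any trigger token (for example "wait") the factor is 1.0.
    At the trigger it jumps to `peak`, then decays exponentially back
    toward 1.0. All parameter values here are illustrative assumptions.
    """
    if trigger_step is None or step < trigger_step:
        return 1.0
    elapsed = step - trigger_step
    return 1.0 + (peak - 1.0) * math.exp(-decay_rate * elapsed)


def scaled_activation(value, step, trigger_step, **kwargs):
    """Apply the amplification factor to a single activation value."""
    return value * amplification_factor(step, trigger_step, **kwargs)


# Factor over a few steps when the trigger token appears at step 5.
schedule = [amplification_factor(t, 5) for t in range(3, 9)]
print(schedule)
```

In a real model the scaling would be applied to selected hidden units inside a forward hook; identifying which units to scale is the part the abstract attributes to a few contrastive examples.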

Title: Understanding How Value Neurons Shape the Generation of Specified Values in LLMs

Authors: Yi Su, Jiayi Zhang, Shu Yang, Xinhai Wang, Lijie Hu, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17712
Pdf URL: https://arxiv.org/pdf/2505.17712
Copy Paste: [[2505.17712]] Understanding How Value Neurons Shape the Generation of Specified Values in LLMs(https://arxiv.org/abs/2505.17712)
Keywords: language model, llm
Abstract: Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.
摘要：大型语言模型（LLM）迅速整合到社会应用中，加剧了人们对它们与普遍道德原则的一致性的关注，因为尽管行为一致性进步，但它们的内部价值表示仍然不透明。当前的方法难以系统地解释如何在神经体系结构中编码价值，这受到优先级判断而不是机械分析的数据集的限制。我们介绍了基于Schwartz价值调查的机械性解释性框架，以解决这一差距。我们的方法首先构建了Valueinsight，这是一个数据集，该数据集通过现实世界中的行为环境来操作普遍价值的四个维度。利用该数据集，我们开发了一种神经元识别方法，该方法可以计算相反的值方面之间的激活差异，从而在不依赖计算强度归因方法的情况下可以精确地定位价值临界神经元。我们提出的验证方法表明，针对性操纵这些神经元有效地改变了模型价值取向，建立了神经元与价值表示之间的因果关系。这项工作通过在LLM中使用神经元分析桥接心理价值框架来促进价值一致性的基础。
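
Illustrative sketch (not from the paper): the abstract describes identifying value-critical neurons from activation differences between opposing value aspects. A minimal version of that idea is to average each neuron's activation over prompts for each of the two opposing aspects and rank neurons by the absolute difference of the means. The function name, the list-of-lists input layout, and the top-k selection rule are assumptions for this example.

```python
def top_value_neurons(pos_acts, neg_acts, k):
    """Rank neurons by |mean activation difference| between two prompt sets.

    `pos_acts` and `neg_acts` are lists of per-prompt activation vectors
    (one float per neuron) recorded for two opposing value aspects.
    Returns the indices of the k neurons with the largest difference.
    """
    n_neurons = len(pos_acts[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    diffs = [mean(pos_acts, j) - mean(neg_acts, j) for j in range(n_neurons)]
    ranked = sorted(range(n_neurons), key=lambda j: abs(diffs[j]), reverse=True)
    return ranked[:k]


# Toy activations for three neurons over two prompts per aspect.
pos = [[1.0, 0.2, 0.5], [0.8, 0.1, 0.5]]
neg = [[0.1, 0.2, 0.4], [0.0, 0.3, 0.6]]
selected = top_value_neurons(pos, neg, k=2)
print(selected)
```

Neuron 0 separates the two aspects most strongly here, so it is ranked first. The causal validation step mentioned in the abstract would then manipulate the selected neurons and check whether the model's value orientation shifts.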

Title: Fast Quiet-STaR: Thinking Without Thought Tokens

Authors: Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, Guiguang Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17746
Pdf URL: https://arxiv.org/pdf/2505.17746
Copy Paste: [[2505.17746]] Fast Quiet-STaR: Thinking Without Thought Tokens(https://arxiv.org/abs/2505.17746)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains, particularly in complex reasoning tasks, require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet-STaR significantly improved reasoning by generating token-level thought traces, but it incurs substantial inference overhead. In this work, we propose Fast Quiet-STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum-learning-based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9% on Mistral 7B and 5.7% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at this https URL.
摘要：大型语言模型（LLM）在一系列自然语言处理任务中取得了令人印象深刻的表现。但是，最近的进步表明，尤其是在复杂的推理任务中，不仅需要扩大模型尺寸或培训数据，还需要进一步的收益。一个有希望的方向是使模型在推理过程中思考。最近，《安静之星》通过产生令牌级别的思想痕迹来显着改善推理，但会产生大量的推理开销。在这项工作中，我们提出了快速的安静之星，这是一个更有效的推理框架，可保留令牌级别推理的好处，同时降低计算成本。我们的方法引入了基于课程学习的培训策略，该策略逐渐减少了思想令牌的数量，从而使模型能够内化更多的抽象和简洁的推理过程。我们通过基于增强学习的微调来进一步将这种方法扩展到标准的隔壁标记预测（NTP）设置，从而导致了快速安静的巨星NTP，这消除了推理过程中对明确思考的代币产生的需求。具有Mistral 7b和Qwen2.5 7b的四个基准数据集的实验表明，在相同的推理时间预算下，快速静静的明星始终优于平均准确性。值得注意的是，快速安静的巨星NTP的平均准确性提高了Mistral 7b的9 \％，而Qwen2.5 7b的平均准确性提高，同时保持了相同的推断潜伏期。我们的代码将在此HTTPS URL上可用。

Title: Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

Authors: Maureen de Seyssel, Jie Chi, Skyler Seto, Maartje ter Hoeve, Masha Fedzechkina, Natalie Schluter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17747
Pdf URL: https://arxiv.org/pdf/2505.17747
Copy Paste: [[2505.17747]] Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks(https://arxiv.org/abs/2505.17747)
Keywords: language model
Abstract: We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired by speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al., 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
摘要：我们介绍了一组无训练的ABX风格的歧视任务，以评估多语言语言模型如何代表语言身份（形式）和语义内容（含义）。这些零击任务受到语音处理的启发，衡量了是否可以可靠地检测到表示表示的最小差异。这为探测提供了灵活且可解释的替代方法。应用于XLM-R（Conneau等，2020）在预处理的检查点和层中，我们发现语言歧视因训练而下降，并集中在较低的层中，而含义歧视会随着时间的推移而增强并在更深的层中稳定。然后，我们探索探测任务，显示我们的指标与语言学习表现之间的一致性。我们的结果将ABX任务定位为分析多语言表示结构的轻量级框架。
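
Illustrative sketch (not from the paper): in a generic ABX discrimination test, X belongs to the same category as A and to a different category from B, and a trial counts as correct when the representation of X is closer to A than to B. The code below scores a list of such trials with cosine distance. The choice of cosine distance and the plain-list vectors are assumptions for this example; the abstract does not state which distance or pooling the authors use.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_accuracy(trials):
    """Share of (a, b, x) trials where x is closer to a than to b.

    In each trial, x is assumed to share its category (for example the
    same language, or the same meaning) with a and not with b.
    """
    correct = sum(
        1 for a, b, x in trials if cosine_distance(a, x) < cosine_distance(b, x)
    )
    return correct / len(trials)


# Two toy trials with 2-d "embeddings": the first is discriminated
# correctly, the second is not.
toy_trials = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),
]
toy_accuracy = abx_accuracy(toy_trials)
print(toy_accuracy)
```

Because the score needs only representations and a distance, it can be computed per layer and per checkpoint without training a probe, which is the property the abstract highlights.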

Title: Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs

Authors: Ziyu Ge, Yuhao Wu, Daniel Wai Kit Chin, Roy Ka-Wei Lee, Rui Cao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.17762
Pdf URL: https://arxiv.org/pdf/2505.17762
Copy Paste: [[2505.17762]] Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs(https://arxiv.org/abs/2505.17762)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce CONFACT (Conflicting Evidence for Fact-Checking) (Dataset available at this https URL), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
摘要：通过检索机制增强的大型语言模型（LLM）通过整合外部知识在事实检查任务中表现出了巨大的潜力。但是，当面对不同信誉来源的证据时，它们的可靠性会降低。本文介绍了在存在冲突的证据存在下进行事实检查的检索型生成（RAG）模型的首次系统评估。为了支持这项研究，我们介绍了\ textbf {Contact}（\ textbf {con} \ textbf {fact} -Checking）（此https url上可用的数据集），这是一个新颖的数据集，这是一个包含来自各种来源的矛盾信息的新型数据集。广泛的实验揭示了最先进的抹布方法中的关键脆弱性，尤其是在解决媒体源信誉差异引起的冲突时。为了应对这些挑战，我们研究了将媒体背景信息整合到检索阶段和发电阶段的策略。我们的结果表明，有效纳入来源可信度可显着提高破布模型解决冲突证据并改善事实检查绩效的能力。

Title: The Real Barrier to LLM Agent Usability is Agentic ROI

Authors: Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17767
Pdf URL: https://arxiv.org/pdf/2505.17767
Copy Paste: [[2505.17767]] The Real Barrier to LLM Agent Usability is Agentic ROI(https://arxiv.org/abs/2505.17767)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. Despite the widespread application in specialized, high-effort tasks like coding and scientific research, we highlight a critical usability gap in high-demand, mass-market applications. This position paper argues that the limited real-world adoption of LLM agents stems not only from gaps in model capabilities, but also from a fundamental tradeoff between the value an agent can provide and the costs incurred during real-world use. Hence, we call for a shift from solely optimizing model performance to a broader, utility-driven perspective: evaluating agents through the lens of the overall agentic return on investment (Agentic ROI). By identifying key factors that determine Agentic ROI (information quality, agent time, and cost), we posit a zigzag development trajectory in optimizing Agentic ROI: first scaling up to improve the information quality, then scaling down to minimize the time and cost. We outline the roadmap across different development stages to bridge the current usability gaps, aiming to make LLM agents truly scalable, accessible, and effective in real-world contexts.
摘要：大型语言模型（LLM）代理代表了人类互动的有前途的转变，超越了被动及时响应系统，向能够推理，计划和目标指导的动作的自主代理转移。尽管在编码和科学研究（例如编码和科学研究）中广泛应用，但我们重点介绍了高需求，大众市场应用中的关键可用性差距。该立场论文认为，LLM代理的实际采用有限，不仅源于模型能力的差距，而且还源于代理商可以提供的价值与实际使用期间所产生的成本之间的基本权衡。因此，我们呼吁从仅仅优化模型性能转变为更广泛的，公用事业驱动的观点：通过整体代理投资回报率（Agent ROI）的镜头评估代理。通过确定确定代理ROI的关键因素 - 信息质量，代理时间和成本 - 我们假定了锯齿形开发轨迹，以优化代理ROI：首先进行扩展以提高信息质量，然后缩减以最大程度地减少时间和成本。我们概述了不同开发阶段的路线图，以弥合当前的可用性差距，旨在使LLM代理在现实环境中真正可扩展，易于访问和有效。

Title: EXECUTE: A Multilingual Benchmark for LLM Token Understanding

Authors: Lukas Edman, Helmut Schmid, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17784
Pdf URL: https://arxiv.org/pdf/2505.17784
Copy Paste: [[2505.17784]] EXECUTE: A Multilingual Benchmark for LLM Token Understanding(https://arxiv.org/abs/2505.17784)
Keywords: llm
Abstract: The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
摘要：可爱的基准表明，LLM与英语中的性格理解斗争。我们将其扩展到更多具有不同脚本和写作系统的语言，并引入执行。我们简化的框架可以轻松扩展到任何语言。跨多个LLM的测试表明，其他语言中的挑战并不总是像英语那样在角色级别上。有些语言显示单词级处理问题，有些语言根本没有任何问题。我们还检查了中文，日语和韩语的子字符任务，以评估LLMS对角色组成部分的理解。

Title: Compression Hacking: A Supplementary Perspective on Informatics Metric of Language Models from Geometric Distortion

Authors: Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17793
Pdf URL: https://arxiv.org/pdf/2505.17793
Copy Paste: [[2505.17793]] Compression Hacking: A Supplementary Perspective on Informatics Metric of Language Models from Geometric Distortion(https://arxiv.org/abs/2505.17793)
Keywords: language model
Abstract: Recently, the concept of "compression as intelligence" has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the "Compression Hacking" in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM's comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.
摘要：最近，``压缩作为情报''的概念为语言模型（LMS）提供了一种新颖的信息学指标，强调了高度结构化表示表示LMS的智能水平。但是，从几何的角度来看，高度压缩LM的一词表示空间往往会退化为高度各向异性的状态，这阻碍了LM理解指令并直接影响其性能的能力。我们发现，这种压缩 - 扭曲的同步性本质上是LM表示中的``压缩黑客攻击''，其中噪声主导的方向倾向于通过牺牲空间均匀性来产生高压缩率的错觉。基于此，我们通过结合几何变形分析并将它们整合到自我评估管道中，提出了三个精制的压缩指标。精制指标与LM的综合功能表现出很强的一致性，实现了Spearman相关系数高于0.9，显着优于原始压缩和其他基于内部结构的指标。这证实了压缩黑客通过结合表示形式的几何变形，从而实质上增强了LMS的信息解释。

Title: DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

Authors: Tazeek Bin Abdur Rakib, Ambuj Mehrish, Lay-Ki Soon, Wern Han Lim, Soujanya Poria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17795
Pdf URL: https://arxiv.org/pdf/2505.17795
Copy Paste: [[2505.17795]] DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors(https://arxiv.org/abs/2505.17795)
Keywords: llm, agent
Abstract: Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at this https URL
摘要：大型语言模型（LLM）代理在反应性对话中表现出色，但由于近视解码和昂贵的计划而在积极主动的，目标驱动的互动中挣扎。我们介绍了DialogXpert，该DialogXpert利用冷冻的LLM每回合提出一组小的，高质量的候选动作，并通过通过时间差异学习训练的固定BERT嵌入式Q-NetWork，以在此减少的空间内选择最佳移动。通过跟踪用户的情绪，DialogXpert量身定制每个决定，以推进任务，同时培养真正的善解人意的联系。在谈判，情感支持和辅导基准中，DialogXpert将对话提高到3美元以下，成功率超过94％\％，并且更大的LLM提前将成功提高到97 \％以上，同时显着提高了谈判成果。该框架按大规模提供实时，战略性和情感上智能的对话计划。此https URL可用代码

Title: Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Authors: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17813
Pdf URL: https://arxiv.org/pdf/2505.17813
Copy Paste: [[2505.17813]] Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning(https://arxiv.org/abs/2505.17813)
Keywords: language model, llm
Abstract: Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers, up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings, using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we fine-tune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
摘要：推理大语言模型（LLMS）在很大程度上依赖于扩展测试时间计算来通过生成广泛的“思考”链来执行复杂的推理任务。在证明令人印象深刻的结果的同时，这种方法会产生大量的计算成本和推理时间。在这项工作中，我们挑战了以下假设：漫长的思维链会带来更好的推理能力。我们首先证明，单个问题中的推理链短得多，可以得出正确的答案 - 比同一问题采样最长的链条高达34.5％。基于这些结果，我们建议Short-M@K，这是一种新颖的推理LLM推理方法。一旦完成了第一个M思维过程，我们的方法将在并联中执行K独立的世代，并停止计算。最终答案是在这些M链中使用多数投票选择的。基本的Short-1@K在低计算设置中表现出比标准多数投票相似甚至优越的表现 - 使用多达40％的思维代币。 Short-3@K，虽然效率略低于Short-1@K，但在所有计算预算中始终超过多数投票，同时仍然要快得多（最高33％的壁时间减少）。受我们的结果的启发，我们使用短，长和随机选择的推理链条对LLM进行了修订。然后，我们观察到较短的培训会导致表现更好。我们的发现表明，在推理LLM中重新思考当前的测试时间计算方法，强调更长的“思考”并不一定会转化为改善的性能，并且可以违反直觉会导致降级结果。
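
Illustrative sketch (not from the paper): the short-m@k procedure described in the abstract can be simulated offline if each of the k sampled generations is summarized as a (thinking-token count, final answer) pair. Under the assumption that all generations decode in parallel at the same speed, the first m to finish are the m with the fewest thinking tokens. The tie-breaking rule used here (prefer the answer from the shortest chain) is an assumption, since the abstract does not specify one.

```python
from collections import Counter


def short_m_at_k(generations, m):
    """Pick an answer by majority vote over the m shortest of k generations.

    `generations` is a list of (num_thinking_tokens, answer) pairs, one per
    sampled chain. Chains are assumed to decode in parallel at equal speed,
    so the shortest chains are the first to finish.
    """
    if m < 1 or not generations:
        raise ValueError("need m >= 1 and at least one generation")
    finished_first = sorted(generations, key=lambda g: g[0])[:m]
    votes = Counter(answer for _, answer in finished_first)
    top_count = max(votes.values())
    # Tie-break (assumption): among the most-voted answers, return the one
    # produced by the shortest chain.
    for _, answer in finished_first:
        if votes[answer] == top_count:
            return answer


# Five sampled chains for one question (toy numbers).
samples = [(900, "12"), (300, "7"), (450, "7"), (1200, "12"), (500, "9")]
answer_short_1 = short_m_at_k(samples, m=1)
answer_short_3 = short_m_at_k(samples, m=3)
print(answer_short_1, answer_short_3)
```

With these toy samples, a plain majority vote over all five chains would be tied between "7" and "12", while both short-1@k and short-3@k settle on "7" without waiting for the two longest chains. In a live system the remaining generations would be cancelled as soon as the m-th chain finishes, which is where the compute and wall-time savings come from.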

Title: Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning

Authors: Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17829
Pdf URL: https://arxiv.org/pdf/2505.17829
Copy Paste: [[2505.17829]] Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning(https://arxiv.org/abs/2505.17829)
Keywords: language model, llm, chain-of-thought
Abstract: Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
摘要：通过思考链（COT）的数学推理已经成为大型语言模型（LLMS）的强大能力，可以通过测试时间缩放（TTS）方法（例如Beam Search和DVTS）进一步增强。但是，尽管在推断过程中分配了更多的计算资源，但这些方法通常会遭受路径均质化和中等结果效率低下的使用。为了解决这些限制，我们建议逐步推理检查点分析（SRCA），该框架在推理步骤之间介绍了检查点。它结合了两种关键策略：（1）回答群集的搜索，这些搜索通过其中间检查点的回答来确保多样性，同时确保质量，以及（2）检查点候选候选者的增强，这利用所有中间答案来制定最终决策。我们的方法有效地降低了路径均质化，并通过利用高质量的中间结果来创造耐断层的机制。实验结果表明，与各种数学数据集的现有TTS方法相比，SRCA提高了推理精度。

Title: Explaining Sources of Uncertainty in Automated Fact-Checking

Authors: Jingyi Sun, Greta Warren, Irina Shklovski, Isabelle Augenstein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17855
Pdf URL: https://arxiv.org/pdf/2505.17855
Copy Paste: [[2505.17855]] Explaining Sources of Uncertainty in Automated Fact-Checking(https://arxiv.org/abs/2505.17855)
Keywords: language model, prompt
Abstract: Understanding sources of a model's uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes using numerical uncertainty or hedges ("I'm not sure, but ..."), which do not explain uncertainty that arises from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (Conflict-and-Agreement-aware Language-model Uncertainty Explanations), the first framework to generate natural language explanations of model uncertainty by (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts and agreements that drive the model's predictive uncertainty in an unsupervised way, and (ii) generating explanations via prompting and attention steering that verbalize these critical interactions. Across three language models and two fact-checking datasets, we show that CLUE produces explanations that are more faithful to the model's uncertainty and more consistent with fact-checking decisions than prompting for uncertainty explanations without span-interaction guidance. Human evaluators judge our explanations to be more helpful, more informative, less redundant, and more logically consistent with the input than this baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and generalises readily to other tasks that require reasoning over complex information.
摘要：了解模型对其预测的不确定性的来源对于有效的人类协作至关重要。先前的工作建议使用数值不确定性或树篱（“我不确定，但是...”），这不能解释证据相互矛盾的不确定性，使用户无法解决分歧或依靠输出。我们介绍了线索（冲突和验证意识的语言模型不确定性解释），这是通过（i）通过（i）确定识别文本跨度之间的自然语言解释的第一个框架，这些框架可以识别出索取索赔的冲突或事态上的冲突或协议，这些框架或协议可以在播放频率上引起频率的预测性不确定性，从而提示了erseversive cressighted ersive ersective and in II且（ii）和（ii）cressige and（II）和（II）的范围（II）。在三种语言模型和两个事实检查数据集中，我们表明线索会产生解释，这些解释比在没有跨度交互指导的情况下提示不确定性解释更忠于该模型的不确定性，并且与事实检查的决策更一致。人类评估人员认为，我们的解释比该基线更有乐于，更有信息，更多余的内容，并且在逻辑上比输入更一致。线索不需要微调或架构更改，因此可以为任何白色盒子的语言模型插入插件。通过将不确定性与证据冲突的明确联系起来，它为事实检查和概括提供了实际支持，并随时将需要对复杂信息进行推理的其他任务。

Title: MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

Authors: Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2505.17873
Pdf URL: https://arxiv.org/pdf/2505.17873
Copy Paste: [[2505.17873]] MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback(https://arxiv.org/abs/2505.17873)
Keywords: language model
Abstract: Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
摘要：假设排名是自动化科学发现的关键组成部分，尤其是在湿lab实验昂贵且吞吐量限制的自然科学中。现有的方法集中在实验前的排名上，仅依靠大型语言模型的内部推理而没有结合实验的经验结果。我们介绍了实验指导排名的任务，该任务旨在根据先前测试的结果确定候选假设的优先级。但是，由于在自然科学领域反复进行真实实验的不切实际性，制定这种策略是具有挑战性的。为了解决这个问题，我们提出了一个基于三个域信息假设的模拟器，将假设性能建模是与已知的地面真理假设相似的函数，受到噪声的影响。我们策划了一个124个化学假设的数据集，并具有实验报告的结果以验证模拟器。在此模拟器的基础上，我们开发了一种伪实验引导的排名方法，该方法将假设通过共享功能特征来促进假设，并根据来自模拟实验反馈的见解来确定候选人的优先级。实验表明，我们的方法优于实验前基线和强烈的消融。
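
Illustrative sketch (not from the paper): the abstract describes a simulator that models hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. The toy version below represents each hypothesis as a set of components, uses Jaccard overlap as the similarity, and adds Gaussian noise. The set representation, the Jaccard measure, the noise model, and all names are assumptions for this example; the paper's three domain-informed assumptions are not given in the abstract.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two collections of components."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simulate_experiment(hypothesis, ground_truth, noise_std=0.05, rng=None):
    """Simulated outcome: similarity to the ground truth plus Gaussian noise,
    clipped to the range [0, 1]."""
    rng = rng or random.Random(0)
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


def rank_by_simulated_feedback(candidates, ground_truth, noise_std=0.05, seed=0):
    """Order candidate names from best to worst simulated outcome."""
    rng = random.Random(seed)
    scored = [
        (simulate_experiment(h, ground_truth, noise_std, rng), name)
        for name, h in candidates.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy candidates described by hypothetical component labels.
truth = {"a", "b", "c", "d"}
pool = {"h1": {"a", "b", "c"}, "h2": {"a", "x"}, "h3": {"y"}}
noiseless_ranking = rank_by_simulated_feedback(pool, truth, noise_std=0.0)
print(noiseless_ranking)
```

The ground truth is visible only to the simulator, which stands in for a wet-lab experiment. A ranking method under test would see just the returned scores for hypotheses it has already "tested" and would use them to reprioritize the remaining candidates.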

Title: Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17894
Pdf URL: https://arxiv.org/pdf/2505.17894
Copy Paste: [[2505.17894]] Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model(https://arxiv.org/abs/2505.17894)
Keywords: language model, gpt, llm
Abstract: We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
摘要：我们介绍了Mutarjim，这是一种紧凑而强大的语言模型，用于双向阿拉伯语英语翻译。尽管大规模的LLM在自然语言处理任务（包括机器翻译，较小的型号）中显示出令人印象深刻的进展。利用这种见解，我们根据Kuwain-1.5b开发了Mutarjim，这是一种针对阿拉伯语和英语量身定制的语言模型。尽管大小适中，但Mutarjim在几个既定的基准测试基准上都胜过更大的模型，这是通过优化的两阶段训练方法和精心策划的高质量训练语料库实现的。实验结果表明，Mutarjim竞争对手模型最大20倍，而显着降低了计算成本和培训要求。我们还介绍了Tarjama-25，这是一种新的基准测试，旨在克服现有的阿拉伯语英语基准数据集的局限性，例如域狭窄，短句子长度和英语源偏见。 Tarjama-25包括5,000个经过专家评审的句子对，并跨越了广泛的领域，提供了更全面和平衡的评估框架。值得注意的是，Mutarjim在Tarjama-25的英语到阿拉伯任务上取得了最先进的表现，超过了GPT-4O MINI（例如GPT-4O Mini）的更大且专有的模型。我们公开发布Tarjama-25，以支持未来的研究并推进对阿拉伯语英语翻译系统的评估。

Title: Language models can learn implicit multi-hop reasoning, but only if they have lots of training data

Authors: Yuekun Yao, Yupei Du, Dawei Zhu, Michael Hahn, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17923
Pdf URL: https://arxiv.org/pdf/2505.17923
Copy Paste: [[2505.17923]] Language models can learn implicit multi-hop reasoning, but only if they have lots of training data(https://arxiv.org/abs/2505.17923)
Keywords: language model, gpt
Abstract: Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled $k$-hop reasoning datasets ($k = 2, 3, 4$). We show that while such models can indeed learn implicit $k$-hop reasoning, the required training data grows exponentially in $k$, and the required number of transformer layers grows linearly in $k$. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
摘要：隐含的推理是语言模型在没有思想链的情况下在单个前传中求解多跳的推理任务的能力。我们使用在受控的$ k $ -HOP推理数据集（$ k = 2、3、4 $）上从头开始训练的GPT2风格的语言模型研究了此功能。我们表明，尽管这样的模型确实可以学习隐式$ k $ hop推理，但所需的培训数据以$ k $成倍增长，而所需的变压器层数则以$ K $线性增长。我们为为什么需要这种深度增长需要一个理论解释。我们进一步发现，可以通过课程学习来减轻数据需求，但不能消除。

Title: Handling Symbolic Language in Student Texts: A Comparative Study of NLP Embedding Models

Authors: Tom Bleckmann, Paul Tschisgale
Subjects: cs.CL, cs.AI, physics.ed-ph
Abstract URL: https://arxiv.org/abs/2505.17950
Pdf URL: https://arxiv.org/pdf/2505.17950
Copy Paste: [[2505.17950]] Handling Symbolic Language in Student Texts: A Comparative Study of NLP Embedding Models(https://arxiv.org/abs/2505.17950)
Keywords: gpt
Abstract: Recent advancements in Natural Language Processing (NLP) have facilitated the analysis of student-generated language products in learning analytics (LA), particularly through the use of NLP embedding models. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing studies and applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased findings and diminished performance of LA applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: similarity-based analyses and integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Beyond performance, additional factors such as cost, regulatory compliance, and model transparency are discussed as key considerations for model selection. Overall, this study underscores the importance for LA researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions.
摘要：自然语言处理（NLP）的最新进展促进了学习分析（LA）中学生生成的语言产品的分析，尤其是通过使用NLP嵌入模型。然而，当涉及与科学相关的语言时，诸如方程式和公式之类的符号表达式引入了当前嵌入模型难以解决的挑战。现有的研究和应用通常忽略了这些挑战，或者完全消除了符号表达式，有可能导致偏见的发现和LA应用的性能下降。因此，这项研究探讨了当代嵌入模型在处理和解释与科学相关的符号表达的能力方面有何不同。为此，使用从真实的学生反应中得出的物理特定符号表达式评估了各种嵌入模型，并通过两种方法评估了性能：基于相似性的分析和集成到机器学习管道中。我们的发现揭示了模型性能的显着差异，尽管OpenAI的GPT-Text-Embedding-3-Large-lage优于所有其他检查的模型，尽管其优于其他模型的优势是中等而不是决定性的。除了绩效外，还讨论了其他因素，例如成本，法规合规性和模型透明度，作为模型选择的关键考虑因素。总体而言，这项研究强调了洛杉矶研究人员和从业人员在使用包括符号表达式的科学相关语言产品时仔细选择NLP嵌入模型的重要性。

Title: Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Authors: Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.17952
Pdf URL: https://arxiv.org/pdf/2505.17952
Copy Paste: [[2505.17952]] Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL(https://arxiv.org/abs/2505.17952)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
摘要：改善复杂任务的绩效，并在大型语言模型（LLMS）中启用可解释的决策，尤其是对于临床应用，需要有效的推理。然而，如果没有对封闭源模型（例如GPT-4O）蒸馏出代价高昂的思考（COT）数据的监督（COT）数据，这仍然具有挑战性。在这项工作中，我们介绍了Alphamed，这是第一个证明推理能力可以纯粹通过加固学习（RL）出现的医学LLM，使用基于简约的规则奖励，而无需依赖SFT或蒸馏COT数据，对公共多项选择性质量质量质量质量质量质量QA数据集进行了奖励。 Alphamed在六个医疗质量检查基准测试中实现了最先进的结果，表现优于传统的SFT+RL管道训练的模型。在具有挑战性的基准（例如MEDXPERT）上，Alphamed甚至超过了较大或闭合的模型，例如DeepSeek-V3-671B和Claude-3.5-Sonnet。为了了解这一成功的因素，我们进行了以三个问题为指导的全面以数据为中心的分析：（i）可以基于规则的RL在没有蒸馏COT监督的情况下激励推理吗？（ii）数据集数量和多样性如何影响推理？（iii）质疑难度如何塑造推理的出现和概括？我们的发现表明，数据集信息性是推理性能的关键驱动力，并且对信息丰富的多项选择质量质量质量数据数据的简约RL在没有COT监督的情况下有效地诱导推理。我们还观察到基准之间的不同趋势，强调当前评估中的局限性以及对更具挑战性，面向推理的医学质量检查基准的需求。

Title: Counting Cycles with Deepseek

Authors: Jiashun Jin, Tracy Ke, Bingcheng Sui, Zhenggang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.17964
Pdf URL: https://arxiv.org/pdf/2505.17964
Copy Paste: [[2505.17964]] Counting Cycles with Deepseek(https://arxiv.org/abs/2505.17964)
Keywords: prompt
Abstract: Despite recent progress, AI still struggles with advanced mathematics. We consider a difficult open problem: how to derive a Computationally Efficient Equivalent Form (CEEF) for the cycle count statistic? The CEEF problem has no known general solution and requires delicate combinatorics and tedious calculations. Such a task is hard for humans to accomplish but is an ideal example of where AI can be very helpful. We solve the problem by combining a novel approach we propose with the powerful coding skills of AI. Our results use delicate graph theory and contain new formulas for general cases that have not been discovered before. We find that, while AI is unable to solve the problem all by itself, it is able to solve it if we provide it with a clear strategy, step-by-step guidance, and carefully written prompts. For simplicity, we focus our study on DeepSeek-R1, but we also investigate other AI approaches.
摘要：尽管最近取得了进展，AI仍在高级数学上挣扎。我们考虑一个困难的开放问题：如何为周期计数统计数据得出计算有效的等效形式（CEEF）？ CEEF问题尚不知道一般解决方案，需要精致的组合和繁琐的计算。人类很难完成这样的任务，但是一个理想的例子，在其中AI非常有帮助。我们通过结合提出的新颖方法和AI强大的编码技能来解决问题。我们的结果使用精致的图理论，并包含用于以前从未发现的一般情况的新公式。我们发现，尽管AI无法本身解决问题，但如果我们为其提供明确的策略，逐步的指导和精心书写的提示，它将能够解决问题。为简单起见，我们将研究重点放在DeepSeek-R1上，但我们还研究了其他AI方法。

Title: Training with Pseudo-Code for Instruction Following

Authors: Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18011
Pdf URL: https://arxiv.org/pdf/2505.18011
Copy Paste: [[2505.18011]] Training with Pseudo-Code for Instruction Following(https://arxiv.org/abs/2505.18011)
Keywords: language model, llm
Abstract: Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious, and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on 11 publicly available benchmarks comprising tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with 5 different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common-sense reasoning. Specifically, we observe a relative gain of 3% to 19% on instruction-following benchmarks, and an average gain of up to 14% across all tasks.
摘要：尽管大语言模型（LLM）的能力取得了迅速的进展，但在相对简单，明确的说明中，它们仍然遇到困难，尤其是在涉及构图的情况下。在本文中，我们从最近的工作中汲取灵感，表明模型在伪代码中表达时可能会更好地遵循说明。但是，编写伪代码程序可能很乏味，并且使用少量的演示来制作代码表示以用于推理的代码表示可能是不自然的，对于LLM的非专家用户而言。为了克服这些局限性，我们提出了使用指令调整数据的微调LLM，这些数据还包括在伪代码中重新表达的指令以及最终响应。我们评估了使用我们的方法培训的模型，该模型包括$ 11 $公开可用的基准测试，包括与指导遵循指导，数学和常识性推理有关的任务。我们使用$ 5 $不同的模型进行严格的实验，并发现在接受伪代码培训的训练时，模型不仅可以更好地遵循说明，而且还保留了与数学和常识推理有关的其他任务的能力。具体来说，我们观察到相对增益为$ 3 $ - $ 19 $ $ 19 $ 19 $％的遵循基准，所有任务的平均收益高达14％。

Title: Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition

Authors: Minxue Niu, Emily Mower Provost
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18040
Pdf URL: https://arxiv.org/pdf/2505.18040
Copy Paste: [[2505.18040]] Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition(https://arxiv.org/abs/2505.18040)
Keywords: language model, gpt, llm
Abstract: The ability to handle various emotion labels without dedicated training is crucial for building adaptable Emotion Recognition (ER) systems. Conventional ER models rely on training using fixed label sets and struggle to generalize beyond them. On the other hand, Large Language Models (LLMs) have shown strong zero-shot ER performance across diverse label spaces, but their scale limits their use on edge devices. In this work, we propose a contrastive distillation framework that transfers rich emotional knowledge from LLMs into a compact model without the use of human annotations. We use GPT-4 to generate descriptive emotion annotations, offering rich supervision beyond fixed label sets. By aligning text samples with emotion descriptors in a shared embedding space, our method enables zero-shot prediction on different emotion classes, granularity, and label schema. The distilled model is effective across multiple datasets and label spaces, outperforming strong baselines of similar size and approaching GPT-4's zero-shot performance, while being over 10,000 times smaller.
摘要：在没有专用训练的情况下处理各种情绪标签的能力对于建立适应性的情绪识别（ER）系统至关重要。传统的ER模型依靠使用固定标签集的培训，并努力概括超越它们的培训。另一方面，大型语言模型（LLMS）在不同标签空间上显示出强大的零拍摄性能，但是它们的量表限制了它们在边缘设备上的使用。在这项工作中，我们提出了一个对比蒸馏框架，将丰富的情感知识从LLM转移到不使用人类注释的情况下。我们使用GPT-4来产生描述性情绪注释，从而为固定标签集提供丰富的监督。通过将文本样本与共享嵌入空间中的情感描述对齐，我们的方法可以对不同的情感类别，粒度和标签模式进行零射击预测。蒸馏模型在多个数据集和标签空间之间有效，表现优于相似大小的强基线，并且接近GPT-4的零弹性性能，而小于10,000倍的超过10,000倍。
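
The zero-shot prediction step described in the abstract above, matching a text sample to emotion descriptors in a shared embedding space, might look roughly like the sketch below. The vectors are toy values standing in for the outputs of a trained encoder, and `cosine` and `predict_emotion` are hypothetical helper names, not the authors' code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_emotion(text_vec, label_vecs):
    """Return the label whose descriptor embedding is closest to the text."""
    return max(label_vecs, key=lambda label: cosine(text_vec, label_vecs[label]))

# Toy 3-d embeddings (assumed values; a real system would use a trained encoder).
label_vecs = {
    "joy": [0.9, 0.1, 0.0],
    "anger": [0.0, 0.2, 0.9],
    "sadness": [0.1, 0.9, 0.1],
}
text_vec = [0.8, 0.2, 0.1]
prediction = predict_emotion(text_vec, label_vecs)
```

Because the labels enter only as vectors in this sketch, a different label set can be swapped in at prediction time without retraining the text encoder.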

Title: MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving

Authors: Wei-Ling Hsu, Yu-Chien Tang, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18056
Pdf URL: https://arxiv.org/pdf/2505.18056
Copy Paste: [[2505.18056]] MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving(https://arxiv.org/abs/2505.18056)
Keywords: language model, llm
Abstract: Online learning enhances educational accessibility, offering students the flexibility to learn anytime, anywhere. However, a key limitation is the lack of immediate, personalized feedback, particularly in helping students correct errors in math problem-solving. Several studies have investigated the applications of large language models (LLMs) in educational contexts. In this paper, we explore the capabilities of LLMs to assess students' math problem-solving processes and provide adaptive feedback. The MathEDU dataset is introduced, comprising authentic student solutions annotated with teacher feedback. We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context. Experimental results show that the fine-tuned model performs well in identifying correctness. However, the model still faces challenges in generating detailed feedback for pedagogical purposes.
摘要：在线学习增强了教育的可及性，为学生提供了任何时间，任何地方学习的灵活性。但是，关键的限制是缺乏直接的个性化反馈，尤其是在帮助学生纠正数学问题解决错误时。几项研究调查了大语言模型（LLM）在教育环境中的应用。在本文中，我们探讨了LLMS评估学生数学问题解决过程并提供自适应反馈的能力。引入了Mathedu数据集，其中包括带有教师反馈注释的真实学生解决方案。我们评估了该模型在两种情况下支持个性化学习的能力：一个模型可以访问学生的先前答案历史，而另一个模拟了寒冷的环境。实验结果表明，微型模型在识别正确性方面表现良好。但是，该模型仍然面临着为教学目的产生详细的反馈的挑战。

Title: Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18071
Pdf URL: https://arxiv.org/pdf/2505.18071
Copy Paste: [[2505.18071]] Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals(https://arxiv.org/abs/2505.18071)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks, where deductive reasoning predominates, inductive reasoning, the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose AlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users' interaction histories. We develop AlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that AlignXplore achieves substantial improvements over the backbone model by an average of 11.05% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.
摘要：大型语言模型（LLMS）在复杂的推理任务（例如数学和编码）中表现出了重大成功。与这些任务相比，演绎推理主要是主称，归纳推理\ textemdash能够从不完整的证据中得出一般规则的能力，但仍未得到充实。本文通过个性化的偏好推理来调查LLMS中扩展的归纳推理，这是LLM Alignment的一个关键挑战，当前方法很难捕获多样化的用户偏好。该任务需要强大的归纳推理功能，因为用户偏好通常在各种交互形式中隐式嵌入，要求模型合成来自分散信号的一致偏好模式。我们建议\ textsc {alignxplore}，该模型利用扩展的推理链来启用用户交互历史中行为信号的系统偏好推断。我们通过将基于合成数据的冷启动培训与随后的在线强化学习相结合，从而开发\ textsc {alignxplore}。通过广泛的实验，我们证明\ textsc {alignXplore}在骨干模型中平均在内域和外域基准测试的平均提高了11.05 \％，同时在不同的输入格式和下游模型上保持强大的概括能力。进一步的分析通过系统比较奖励建模策略建立了偏好推理学习的最佳实践，同时揭示了训练过程中类似人类的归纳推理模式的出现。

Title: QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization

Authors: Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18092
Pdf URL: https://arxiv.org/pdf/2505.18092
Copy Paste: [[2505.18092]] QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization(https://arxiv.org/abs/2505.18092)
Keywords: language model, gpt, llm
Abstract: This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) Natural language-guided dynamic optimization, (2) Bidirectional reasoning layers for enhanced boundary awareness, (3) Token critic mechanisms with language modeling heads, and (4) Window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) Consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency; (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieves 21.59× context compression alongside 19.15-point average performance gains; (3) Deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance.
摘要：该技术报告介绍了Qwenlong-CPR，这是一种旨在显式长篇文本优化的上下文压缩框架，在预填充阶段期间针对高度的计算开销，以及在长序列处理中大型语言模型（LLMS）的“中间”性能退化。 Qwenlong-CPR通过新颖的动态上下文优化机制实施，可实现以自然语言指导为指导的多界面上下文，从而实现了效率的提高和提高的性能。 Qwenlong-CPR从QWEN Architecture系列演变而来，引入了四个关键创新：（1）自然语言引导的动态优化，（2）双向推理层，以提高边界意识，（3）具有语言建模的标记批评机制，以及（4）窗口范围的窗口范围的窗口范围。跨五个基准（4K-2M单词上下文）进行的全面评估证明了Qwenlong-CPRS的三倍有效性：（1）比其他上下文管理方法一致的优势，例如抹布和精确性和效率的稀疏注意力。 (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieves 21.59$\times$ context compression alongside 19.15-point average performance gains; （3）Qwenlong-CPRS与QWEN2.5-32B-INSTRUCT一起部署，在Ruler-128k和Infinitebench上超过了4.85和10.88点，建立了新的Sota绩效。

Title: Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Authors: Joey Hong, Anca Dragan, Sergey Levine
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18098
Pdf URL: https://arxiv.org/pdf/2505.18098
Copy Paste: [[2505.18098]] Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL(https://arxiv.org/abs/2505.18098)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, yielding a concise and lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
摘要：大型语言模型（LLM）在问题回答和对话等任务中表现出色，但是需要互动的复杂任务，例如谈判和说服力，需要其他长期的推理和计划。强化学习（RL）微调可以原则上实现此类计划，但遭受了阻碍可伸缩性的缺点。尤其是，多转变RL训练会带来高内存和计算成本，这在培训LLMS作为政策时会加剧。此外，最大的LLM并未暴露以这种方式训练所需的API。结果，改善LLMS推理的现代方法依赖于复杂的提示机制，而不是RL微调。为了解决这一问题，我们提出了一种使用目标条件的价值函数来指导LLM代理的推理，该方法甚至可以扩展到大型基于API的模型。这些值的功能可以预测任务将如何在给定动作的情况下进行，从而允许LLM代理评估多种阳性和负面的可能结果，以有效地计划。此外，这些价值函数是在推理步骤而不是全面动作中训练的，这是一个简洁明了的模块，可促进多转交互中的决策。我们在需要互动的任务上验证了我们的方法，包括工具使用，社交演绎和对话，在保持效率和可扩展性的同时，证明了比RL微调和提示方法相比表现出色的性能。

Title: ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

Authors: Lisheng Huang, Yichen Liu, Jinhao Jiang, Rongxiang Zhang, Jiahao Yan, Junyi Li, Wayne Xin Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18105
Pdf URL: https://arxiv.org/pdf/2505.18105
Copy Paste: [[2505.18105]] ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework(https://arxiv.org/abs/2505.18105)
Keywords: language model, llm, agent
Abstract: Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose ManuSearch, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce ORION, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code in this https URL
摘要：Web启动的大型语言模型（LLM）的最新进展在复杂的推理任务中表现出很强的表现，但是这些功能大多锁定在具有不透明体系结构的专有系统中。在这项工作中，我们建议\ textbf {Manusearch}，这是一个透明且模块化的多代理框架，旨在使对LLM的深入搜索民主化。 Manusearch将搜索和推理过程分解为三个协作代理：（1）迭代制定子查询的解决方案计划代理，（2）通过实时Web搜索来检索相关文档的Internet搜索代理，以及（3）从原始Web内容中提取关键证据的结构化网页阅读代理。为了严格评估深层的推理能力，我们介绍了\ textbf {orion}，这是一个具有挑战性的基准，旨在涉及长尾实体的开放推理，涵盖英语和中文。实验结果表明，Manusearch基本上要优于先前的开源基线，甚至超过领先的封闭源系统。我们的工作为开放深度搜索系统中可再现的可扩展研究铺平了道路。我们在此HTTPS URL中发布数据和代码

Title: Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Authors: Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18110
Pdf URL: https://arxiv.org/pdf/2505.18110
Copy Paste: [[2505.18110]] Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM(https://arxiv.org/abs/2505.18110)
Keywords: language model, llm
Abstract: Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like "A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding" requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense's multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
摘要：人类自然会通过整合视觉和听觉线索来了解视频中的时刻。例如，将视频中的场景定位为“一位科学家热情地将野生动植物保护说为戏剧性的管弦乐音乐，观众点头和鼓掌都需要同时处理视觉，音频和语音信号。但是，现有模型通常很难有效地融合和解释音频信息，从而限制了其全面视频时间理解的能力。为了解决这个问题，我们提出了Trisense，这是一种三模式的大语言模型，旨在通过整合视觉，音频和语音方式来整体视频时间理解。 Trisense的核心是基于查询的连接器，它基于输入查询自适应地重新自重贡献，从而在模态辍学下实现了稳健的性能，并允许灵活地组合可用的输入。为了支持Trisense的多模式功能，我们引入了Trisense-2M，这是一个高质量的数据集，该数据集是通过通过微型LLM驱动的自动管道生成的200万个策划样品。 Trisense-2M包括长形式的视频和各种方式组合，从而促进广泛的概括。跨多个基准测试的广泛实验证明了trisense的有效性及其推进多模式视频分析的潜力。代码和数据集将公开发布。

Title: UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

Authors: Poojah Ganesan, Rajat Aayush Jha, Dan Roth, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18122
Pdf URL: https://arxiv.org/pdf/2505.18122
Copy Paste: [[2505.18122]] UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification(https://arxiv.org/abs/2505.18122)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, the task remains challenging for multi-table databases because of complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.
摘要：大型语言模型（LLMS）的最新进展极大地改善了单桌子查询的文本到SQL性能。但是，由于复杂的模式和关系操作，在多桌数据库中仍然具有挑战性。现有的方法通常在检索正确的桌子和列，产生准确的联接和工会，并在各种模式中概括。为了解决这些问题，我们介绍了Unjoin，这是一个两阶段的框架，可将架构元素从SQL Logic生成中解耦。在第一阶段，我们将数据库中所有表的列名合并为单桌表示，通过将每个列的表格加上其表名称。这使该模型可以纯粹专注于准确的检索，而不会因编写复杂SQL逻辑的需要而分心。在第二阶段，SQL查询是在此简化的模式上生成的，并通过重建联接，工会和关系逻辑来映射回原始模式。对蜘蛛和鸟类数据集的评估表明，Unojin匹配或超过了最先进的基线。 Unjoin仅使用架构信息，该信息不需要数据访问或微调，从而使其在数据库中可扩展和适应性。
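
The first-stage schema flattening described in the UNJOIN abstract above, prefixing each column with its table name to form a single-table view, can be sketched as follows. This is a minimal illustration only: the `.` separator, the toy schema, and the function names `flatten_schema` and `tables_used` are assumptions rather than details taken from the paper, and the second-stage reconstruction of JOINs and UNIONs is not shown.

```python
def flatten_schema(schema, sep="."):
    """Merge all tables into one list of table-prefixed column names."""
    return [f"{table}{sep}{col}" for table, cols in schema.items() for col in cols]

def tables_used(flat_columns, sep="."):
    """Recover which original tables a set of flattened columns refers to."""
    return sorted({name.split(sep, 1)[0] for name in flat_columns})

# Hypothetical two-table schema used only for illustration.
schema = {
    "students": ["id", "name"],
    "enrollments": ["student_id", "course_id"],
}
flat = flatten_schema(schema)
needed = tables_used(["students.name", "enrollments.course_id"])
```

Keeping the table name inside each flattened column name is what makes the mapping reversible in this sketch: the set of tables a query touches can be read back from the selected columns.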

Title: Frankentext: Stitching random text fragments into long-form narratives

Authors: Chau Minh Pham, Jenna Russell, Dzung Pham, Mohit Iyyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18128
Pdf URL: https://arxiv.org/pdf/2505.18128
Copy Paste: [[2505.18128]] Frankentext: Stitching random text fragments into long-form narratives(https://arxiv.org/abs/2505.18128)
Keywords: llm, prompt
Abstract: We introduce Frankentexts, a new type of long-form narrative produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
摘要：我们介绍了Frankentexts，这是LLMS在极端限制下产生的一种新型的长形叙事，即大多数令牌（例如90％）必须从人类著作中逐字复制。该任务对可控生成进行了挑战性的测试，要求模型满足写作提示，整合不同的文本片段并仍然产生连贯的叙述。为了生成Frandentexts，我们指示该模型通过选择和组合人工写的段落来制作草稿，然后迭代修改草稿，同时保持用户指定的副本比率。我们沿着三个轴进行评估最终的frandentext：写作质量，指导依从性和可检测性。 Gemini-2.5-Pro在这项任务上表现出色：其81％的Frankentext是连贯的，并且与提示相关100％。值得注意的是，这些输出中有多达59％被诸如Pangram之类的探测器所写，揭示了AI文本检测器的局限性。人类注释有时可以通过突然的语气变化来识别Frankentexts，并且在较长后代之间的语法不一致和语法不一致。除了提出一项具有挑战性的生成任务外，Frankentexts邀请了有关为这一新的灰色作者身份构建有效探测器的讨论，为混合作者身份检测提供培训数据，并用作研究人类AI共同编写过程的沙箱。
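
The "copy ratio" constraint mentioned in the Frankentext abstract above can be made concrete with a rough sketch. The abstract does not say how the ratio is computed, so everything here is an assumption: whitespace tokenization, the `min_span` threshold, and the function name `copy_ratio` are all hypothetical choices for illustration.

```python
def copy_ratio(output, sources, min_span=3):
    """Fraction of output tokens that lie inside a verbatim-copied window.

    A token counts as copied if it falls inside some window of `min_span`
    consecutive whitespace tokens that also appears in a source fragment.
    """
    out = output.split()
    if not out:
        return 0.0
    # Collect every window of min_span tokens from the source fragments.
    grams = set()
    for src in sources:
        toks = src.split()
        for i in range(len(toks) - min_span + 1):
            grams.add(tuple(toks[i:i + min_span]))
    # Mark output tokens covered by any matching window.
    covered = [False] * len(out)
    for i in range(len(out) - min_span + 1):
        if tuple(out[i:i + min_span]) in grams:
            for j in range(i, i + min_span):
                covered[j] = True
    return sum(covered) / len(out)

# Hypothetical human-written fragments and a stitched output.
sources = ["the rain fell softly on the roof", "she opened the old wooden door"]
output = "the rain fell softly while she opened the old wooden door"
ratio = copy_ratio(output, sources)
```

In this toy case only the connecting word "while" is not copied, so 10 of the 11 output tokens count as verbatim. A revision loop could compare such a ratio against the user-specified target after each edit.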

Title: Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection

Authors: Mykola Trokhymovych, Lydia Pintscher, Ricardo Baeza-Yates, Diego Saez-Trumper
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.18136
Pdf URL: https://arxiv.org/pdf/2505.18136
Copy Paste: [[2505.18136]] Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection(https://arxiv.org/abs/2505.18136)
Keywords: language model
Abstract: We introduce a next-generation vandalism detection system for Wikidata, one of the largest open-source structured knowledge bases on the Web. Wikidata is highly complex: its items incorporate an ever-expanding universe of factual triples and multilingual texts. While edits can alter both structured and textual content, our approach converts all edits into a single space using a method we call Graph2Text. This allows for evaluating all content changes for potential vandalism using a single multilingual language model. This unified approach improves coverage and simplifies maintenance. Experiments demonstrate that our solution outperforms the current production system. Additionally, we are releasing the code under an open license along with a large dataset of various human-generated knowledge alterations, enabling further research.

Title: Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find

Authors: Owen Bianchi, Mathew J. Koretsky, Maya Willey, Chelsea X. Alvarado, Tanay Nayak, Adi Asija, Nicole Kuznetsov, Mike A. Nalls, Faraz Faghri, Daniel Khashabi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.18148
Pdf URL: https://arxiv.org/pdf/2505.18148
Copy Paste: [[2505.18148]] Lost in the Haystack: Smaller Needles are More Difficult for LLMs to Find(https://arxiv.org/abs/2505.18148)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information ("the needle") must be drawn from a large pool of irrelevant context ("the haystack"). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
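
To make the experimental variable concrete, here is a minimal sketch of how a needle-in-a-haystack context might be assembled with a gold passage of controllable length and position. The helper names `build_haystack` and `gold_fraction`, the relative-position parameter, and the blank-line separator are assumptions for illustration; the abstract does not describe the authors' actual construction.

```python
def build_haystack(gold: str, distractors: list[str], position: float) -> str:
    """Insert the gold passage among distractors at a relative position.

    position=0.0 places the gold passage first, 1.0 places it last.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    idx = round(position * len(distractors))
    docs = distractors[:idx] + [gold] + distractors[idx:]
    return "\n\n".join(docs)


def gold_fraction(gold: str, distractors: list[str]) -> float:
    """Share of whitespace tokens in the full context that belong to the gold passage."""
    gold_len = len(gold.split())
    total = gold_len + sum(len(d.split()) for d in distractors)
    return gold_len / total if total else 0.0


if __name__ == "__main__":
    distractors = ["irrelevant passage one", "irrelevant passage two",
                   "irrelevant passage three", "irrelevant passage four"]
    short_gold = "Paris."
    long_gold = "The capital of France is Paris, a city on the Seine."
    for gold in (short_gold, long_gold):
        context = build_haystack(gold, distractors, position=0.5)
        print(f"{gold_fraction(gold, distractors):.2f}", repr(context[:40]))
```

Sweeping both the gold passage length and `position` would give the kind of grid over which the reported size and positional effects could be compared.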

Title: First Finish Search: Efficient Test-Time Scaling in Large Language Models

Authors: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18149
Pdf URL: https://arxiv.org/pdf/2505.18149
Copy Paste: [[2505.18149]] First Finish Search: Efficient Test-Time Scaling in Large Language Models(https://arxiv.org/abs/2505.18149)
Keywords: language model
Abstract: Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
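
The core idea described above (launch $n$ independent samples, return as soon as any one completes) can be sketched with Python's `asyncio`. This is not the authors' implementation: `first_finish_search` and `toy_sampler` are hypothetical names, and the sampler simulates decoding by sleeping in proportion to a made-up trace length instead of calling a real model.

```python
import asyncio
from typing import Any, Callable, Coroutine

Sampler = Callable[[str, int], Coroutine[Any, Any, str]]


async def first_finish_search(prompt: str, sampler: Sampler, n: int) -> str:
    """Run n independent samples concurrently and return the first to finish."""
    tasks = [asyncio.create_task(sampler(prompt, i)) for i in range(n)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)

    # Stop the remaining samples so they consume no further compute.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    return done.pop().result()


async def toy_sampler(prompt: str, index: int) -> str:
    """Hypothetical stand-in for a model call; shorter traces finish sooner."""
    lengths = [30, 5, 20, 12]  # made-up trace lengths in tokens
    n_tokens = lengths[index % len(lengths)]
    await asyncio.sleep(0.01 * n_tokens)
    return f"sample {index}: {n_tokens} tokens"


if __name__ == "__main__":
    print(asyncio.run(first_finish_search("What is 2 + 2?", toy_sampler, n=4)))
```

Because the samples decode in parallel at similar speeds, the first to finish is the one with the shortest trace, which is what ties this strategy to the observation that shorter traces are more often correct.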

Title: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

Authors: Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.18152
Pdf URL: https://arxiv.org/pdf/2505.18152
Copy Paste: [[2505.18152]] Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs(https://arxiv.org/abs/2505.18152)
Keywords: language model, llm
Abstract: Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce `Fann or Flop`, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs across twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension is a strong indicator of how well an LLM understands Classical Arabic. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release `Fann or Flop` along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement of Arabic language models. Code is available at: this https URL.

Title: The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

Authors: Ya Wu, Qiang Sheng, Danding Wang, Guang Yang, Yifan Sun, Zhengjia Wang, Yuyan Bu, Juan Cao
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.18154
Pdf URL: https://arxiv.org/pdf/2505.18154
Copy Paste: [[2505.18154]] The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas(https://arxiv.org/abs/2505.18154)
Keywords: llm
Abstract: Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.