2025-08-18

Title: A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation

Authors: Jie Lei, Ruofan Jia, J. Andrew Zhang, Hao Zhang
Subjects: cs.CL, cs.AR, cs.PL
Abstract URL: https://arxiv.org/abs/2508.10904
Pdf URL: https://arxiv.org/pdf/2508.10904
Copy Paste: [[2508.10904]] A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation(https://arxiv.org/abs/2508.10904)
Keywords: language model, llm, hallucination, agent
Abstract: In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog, in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.

Title: PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins

Authors: Sihan Chen, John P. Lalor, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.10906
Pdf URL: https://arxiv.org/pdf/2508.10906
Copy Paste: [[2508.10906]] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins(https://arxiv.org/abs/2508.10906)
Keywords: language model, gpt, llm, prompt
Abstract: While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.

Title: gpt-oss-120b & gpt-oss-20b Model Card

Authors: OpenAI: Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, Vlad Fomenko, Timur Garipov, Kristian Georgiev, Mia Glaese, Tarun Gogineni, Adam Goucher, Lukas Gross, Katia Gil Guzman, John Hallman, Jackie Hehir, Johannes Heidecke, Alec Helyar, Haitang Hu, Romain Huet, Jacob Huh, Saachi Jain, Zach Johnson, Chris Koch, Irina Kofman, Dominik Kundel, Jason Kwon, Volodymyr Kyrylov, Elaine Ya Le, Guillaume Leclerc, James Park Lennon, Scott Lessans, Mario Lezcano-Casado, Yuanzhi Li, Zhuohan Li, Ji Lin, Jordan Liss, Lily (Xiaoxuan)Liu, Jiancheng Liu, Kevin Lu, Chris Lu, Zoran Martinovic, Lindsay McCallum, Josh McGrath, Scott McKinney, Aidan McLaughlin, Song Mei, Steve Mostovoy, Tong Mu, Gideon Myles, Alexander Neitz, Alex Nichol, Jakub Pachocki, Alex Paino, Dana Palmie, Ashley Pantuliano, Giambattista Parascandolo, Jongsoo Park, Leher Pathak, Carolina Paz, Ludovic Peran, Dmitry Pimenov, Michelle Pokrass, Elizabeth Proehl, Huida Qiu, Gaby Raila, Filippo Raso, Hongyu Ren, Kimmy Richardson, David Robinson, Bob Rotsted, Hadi Salman, Suvansh Sanjeev, Max Schwarzer, D. Sculley, Harshit Sikchi, Kendal Simon, Karan Singhal, Yang Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10925
Pdf URL: https://arxiv.org/pdf/2508.10925
Copy Paste: [[2508.10925]] gpt-oss-120b & gpt-oss-20b Model Card(https://arxiv.org/abs/2508.10925)
Keywords: gpt, chat, agent
Abstract: We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

Title: Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News

Authors: Jiaxin Pei, Soumya Vadlamannati, Liang-Kang Huang, Daniel Preotiuc-Pietro, Xinyu Hua
Subjects: cs.CL, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10927
Pdf URL: https://arxiv.org/pdf/2508.10927
Copy Paste: [[2508.10927]] Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News(https://arxiv.org/abs/2508.10927)
Keywords: language model, llm, prompt
Abstract: Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competition. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting of state-of-the-art LLMs (e.g. LLaMA-2) achieves only moderate to low performance in identifying risk factors, whereas fine-tuned pre-trained language models perform better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries.

Title: Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

Authors: Nasim Shirvani-Mahdavi, Chengkai Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.10971
Pdf URL: https://arxiv.org/pdf/2508.10971
Copy Paste: [[2508.10971]] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules(https://arxiv.org/abs/2508.10971)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models' performance, we conduct a human evaluation of generated explanations on correctness and clarity. To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at this https URL.

Title: Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling

Authors: Tejomay Kishor Padole, Suyash P Awate, Pushpak Bhattacharyya
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.10995
Pdf URL: https://arxiv.org/pdf/2508.10995
Copy Paste: [[2508.10995]] Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling(https://arxiv.org/abs/2508.10995)
Keywords: language model
Abstract: Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to their scalability and ease of training compared to other diffusion model paradigms for discrete data, which has established them as the state-of-the-art non-autoregressive generators for discrete data. Diffusion models, in general, have shown excellent ability to improve generation quality by leveraging inference-time scaling, either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.
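The general idea of verifier-guided inference-time scaling can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the `propose` denoiser, the lexicon-counting `verifier`, and all names and parameters are hypothetical stand-ins (the abstract describes a soft-value-based verifier built on pre-trained embedding models, which is not reproduced here). At each denoising step, several candidate fills are sampled and the one the verifier scores highest is kept.

```python
import random

MASK = "_"  # hypothetical mask symbol for this toy example


def propose(tokens, vocab, rng):
    """Toy denoiser step (assumption): fill the first masked slot at random."""
    i = tokens.index(MASK)
    return tokens[:i] + [rng.choice(vocab)] + tokens[i + 1:]


def verifier(tokens, target_words):
    """Toy verifier (assumption): count tokens found in a target-style lexicon."""
    return sum(tok in target_words for tok in tokens)


def guided_denoise(tokens, vocab, target_words, num_candidates=8, seed=0):
    """Unmask one slot per step, keeping the candidate the verifier prefers."""
    rng = random.Random(seed)
    tokens = list(tokens)
    while MASK in tokens:
        candidates = [propose(tokens, vocab, rng) for _ in range(num_candidates)]
        tokens = max(candidates, key=lambda c: verifier(c, target_words))
    return tokens


vocab = ["great", "awful"]
result = guided_denoise(["the", "food", "was", MASK], vocab, {"great"})
print(result)
```

Raising `num_candidates` spends more compute per step in exchange for a better chance that at least one candidate satisfies the verifier, which is the trade-off that inference-time scaling refers to.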

Title: SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

Authors: Wenpeng Xing, Lanyi Wei, Haixiao Hu, Rongchang Li, Mohan Li, Changting Lin, Meng Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11009
Pdf URL: https://arxiv.org/pdf/2508.11009
Copy Paste: [[2508.11009]] SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth(https://arxiv.org/abs/2508.11009)
Keywords: language model, llm, prompt
Abstract: The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0-6), middle childhood (7-12), and adolescence (13-18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.

Title: Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

Authors: Carter Blum, Katja Filipova, Ann Yuan, Asma Ghandeharioun, Julian Zimmert, Fred Zhang, Jessica Hoffmann, Tal Linzen, Martin Wattenberg, Lucas Dixon, Mor Geva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11017
Pdf URL: https://arxiv.org/pdf/2508.11017
Copy Paste: [[2508.11017]] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics(https://arxiv.org/abs/2508.11017)
Keywords: language model, llm
Abstract: Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.

Title: Hell or High Water: Evaluating Agentic Recovery from External Failures

Authors: Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11027
Pdf URL: https://arxiv.org/pdf/2508.11027
Copy Paste: [[2508.11027]] Hell or High Water: Evaluating Agentic Recovery from External Failures(https://arxiv.org/abs/2508.11027)
Keywords: language model, agent
Abstract: As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.
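The recovery behavior the benchmark probes, switching to an alternative function when one becomes unavailable, can be sketched as follows. This is a minimal illustration, not the benchmark's code: `FunctionUnavailable`, `call_with_fallback`, and the two converter functions are hypothetical names invented for this example.

```python
class FunctionUnavailable(Exception):
    """Hypothetical error raised when a tool can no longer be called."""


def call_with_fallback(candidates, *args):
    """Try (name, function) pairs in order; return the first success.

    Error messages from failed calls are collected so they can be
    reported if every candidate fails.
    """
    errors = []
    for name, fn in candidates:
        try:
            return name, fn(*args)
        except FunctionUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all candidates failed: " + "; ".join(errors))


def convert_v1(celsius):
    # Simulates an external failure: the function has suddenly disappeared.
    raise FunctionUnavailable("service retired")


def convert_v2(celsius):
    # An alternative route that still solves the task.
    return celsius * 9 / 5 + 32


used, value = call_with_fallback(
    [("convert_v1", convert_v1), ("convert_v2", convert_v2)], 100
)
print(used, value)
```

In the benchmark's setting the agent has to discover the alternative itself among thousands of functions; here the candidate list is given up front, so the sketch shows only the "read the error, try another route" step.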

Title: BIPOLAR: Polarization-based granular framework for LLM bias evaluation

Authors: Martin Pavlíček, Tomáš Filip, Petr Sosík
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11061
Pdf URL: https://arxiv.org/pdf/2508.11061
Copy Paste: [[2508.11061]] BIPOLAR: Polarization-based granular framework for LLM bias evaluation(https://arxiv.org/abs/2508.11061)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in LLMs (both open-source and closed-source). Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend for more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarisation-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies.

Title: Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs

Authors: Nicolas Goulet, Alexandre Blondin Massé, Moussa Abdendi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11068
Pdf URL: https://arxiv.org/pdf/2508.11068
Copy Paste: [[2508.11068]] Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs(https://arxiv.org/abs/2508.11068)
Keywords: language model
Abstract: Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e. with transformations that preserve their circuit space. Finally, the properties of these reduced digraphs are analyzed and discussed in relation to the symbol grounding problem.

Title: Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning

Authors: Lorenzo Jaime Yu Flores, Junyi Shen, Xiaoyuan Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11120
Pdf URL: https://arxiv.org/pdf/2508.11120
Copy Paste: [[2508.11120]] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning(https://arxiv.org/abs/2508.11120)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.

Title: MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Authors: Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, Reut Tsarfaty
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.11133
Pdf URL: https://arxiv.org/pdf/2508.11133
Copy Paste: [[2508.11133]] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents(https://arxiv.org/abs/2508.11133)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve, far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions, with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts, and model predictions are publicly available at: this https URL

Title: MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering

Authors: Hikaru Asano, Hiroki Ouchi, Akira Kasuga, Ryo Yonetani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11163
Pdf URL: https://arxiv.org/pdf/2508.11163
Copy Paste: [[2508.11163]] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering(https://arxiv.org/abs/2508.11163)
Keywords: language model, llm
Abstract: This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unclear how well they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at this https URL.

Title: Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

Authors: Tao Wu, Jingyuan Chen, Wang Lin, Jian Zhan, Mengze Li, Kun Kuang, Fei Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11184
Pdf URL: https://arxiv.org/pdf/2508.11184
Copy Paste: [[2508.11184]] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction(https://arxiv.org/abs/2508.11184)
Keywords: language model, llm
Abstract: Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student's underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student's reasoning trajectories from past incorrect answers. In the second stage, this prototype guides the simulation of the student's reasoning on new questions, enabling the generation of personalized distractors that align with the student's recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.

Title: Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering

Authors: Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, Ning Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11247
Pdf URL: https://arxiv.org/pdf/2508.11247
Copy Paste: [[2508.11247]] Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering(https://arxiv.org/abs/2508.11247)
Keywords: llm, retrieval-augmented generation
Abstract: Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6$\times$ speedup in retrieval efficiency.
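The entity-hypergraph idea in this abstract (entities as nodes, passages as hyperedges, association through shared entities) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names `build_hypergraph` and `diffuse`, the `decay` weighting, and the example passages are hypothetical, and the propagation rule is a simple stand-in rather than the paper's hypergraph diffusion.

```python
from collections import defaultdict


def build_hypergraph(passage_entities):
    """Map each entity (node) to the passages (hyperedges) that contain it."""
    entity_to_passages = defaultdict(set)
    for pid, entities in passage_entities.items():
        for entity in entities:
            entity_to_passages[entity].add(pid)
    return entity_to_passages


def diffuse(passage_entities, seed_entities, steps=2, decay=0.5):
    """Spread relevance from seed entities to passages through shared entities.

    Hypothetical scoring rule: at each step, every active entity adds its
    score (scaled by decay**step) to the passages containing it, then hands
    its score on to the other entities in those passages.
    """
    entity_to_passages = build_hypergraph(passage_entities)
    entity_score = {entity: 1.0 for entity in seed_entities}
    passage_score = defaultdict(float)
    for step in range(steps):
        weight = decay ** step
        next_entity_score = defaultdict(float)
        for entity, score in entity_score.items():
            for pid in entity_to_passages.get(entity, ()):
                passage_score[pid] += weight * score
                for other in passage_entities[pid]:
                    if other != entity:
                        next_entity_score[other] += score
        entity_score = next_entity_score
    return dict(passage_score)


# Invented example: the query mentions only "Marie Curie", yet p2 is reached
# through the entity "Warsaw" that it shares with p1 (a second hop).
passages = {
    "p1": {"Marie Curie", "Warsaw"},
    "p2": {"Warsaw", "Poland"},
    "p3": {"Paris", "France"},
}
scores = diffuse(passages, {"Marie Curie"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # p1 ranks above p2; p3 is never reached
```

The point of the sketch is only that a passage with no direct match to the query can still be scored when it shares an entity with a matching passage, which is the structural association multi-hop questions rely on.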

Title: UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

Authors: Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron, Antara Raaghavi Bhattacharya, Dang Khoa Dang Dinh, Ikhlasul Akmal Hanif, Daria Kotova, Ekaterina Kochmar, Monojit Choudhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11260
Pdf URL: https://arxiv.org/pdf/2508.11260
Copy Paste: [[2508.11260]] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?(https://arxiv.org/abs/2508.11260)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs' linguistic reasoning abilities across low-resource languages. This work analyses LLMs' performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.

Title: LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Authors: Ruiyan Qi, Congding Wen, Weibo Zhou, Shangsong Liang, Lingbo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11280
Pdf URL: https://arxiv.org/pdf/2508.11280
Copy Paste: [[2508.11280]] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought(https://arxiv.org/abs/2508.11280)
Keywords: language model, llm, hallucination, tree-of-thought
Abstract: Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLM on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.

Title: ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Authors: Axel Delaval, Shujian Yang, Haicheng Wang, Han Qiu, Jialiang Lu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11281
Pdf URL: https://arxiv.org/pdf/2508.11281
Copy Paste: [[2508.11281]] ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection(https://arxiv.org/abs/2508.11281)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.

Title: AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries

Authors: Arya VarastehNezhad, Reza Tavasoli, Soroush Elyasi, MohammadHossein LotfiNia, Hamed Farbeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11285
Pdf URL: https://arxiv.org/pdf/2508.11285
Copy Paste: [[2508.11285]] AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries(https://arxiv.org/abs/2508.11285)
Keywords: language model, gpt, llm, prompt
Abstract: Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.

Title: SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

Authors: Utsav Maskey, Sumit Yadav, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11290
Pdf URL: https://arxiv.org/pdf/2508.11290
Copy Paste: [[2508.11290]] SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory(https://arxiv.org/abs/2508.11290)
Keywords: llm, prompt
Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g., sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse responses to harmful instructions when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveals that LLMs follow distinct "constellation" patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior, our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.

Title: SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Authors: Beichen Guo, Zhiyuan Wen, Yu Yang, Peng Gao, Ruosong Yang, Jiaxing Shen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.11310
Pdf URL: https://arxiv.org/pdf/2508.11310
Copy Paste: [[2508.11310]] SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems(https://arxiv.org/abs/2508.11310)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.

Title: LLM Compression: How Far Can We Go in Balancing Size and Performance?

Authors: Sahil Sk, Debasish Dhal, Sonal Khosla, Sk Shahid, Sambit Shekhar, Akash Dhaka, Shantipriya Parida, Dilip K. Prasad, Ondřej Bojar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11318
Pdf URL: https://arxiv.org/pdf/2508.11318
Copy Paste: [[2508.11318]] LLM Compression: How Far Can We Go in Balancing Size and Performance?(https://arxiv.org/abs/2508.11318)
Keywords: language model, gpt, llm
Abstract: Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
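As a rough illustration of what 4-bit, group-wise weight quantization involves, here is a minimal sketch of generic symmetric group quantization. It is not the GSQ or GPTQ procedure evaluated in the paper (GPTQ, for instance, additionally uses second-order information to compensate quantization error); the function names, the `group_size` default, and the example weights are assumptions chosen for illustration.

```python
def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize a flat list of floats to signed integers, one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes in [-8, 7]
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group gets scale 1.0 to avoid division by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            code = round(w / scale)
            codes.append(max(-qmax - 1, min(qmax, code)))  # clamp to range
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=4):
    """Reconstruct approximate floats from integer codes and group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


# Invented example: two groups with very different magnitudes.
weights = [0.7, -0.35, 0.0, 0.1, 7.0, -3.5, 1.0, 0.0]
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Giving each group its own scale keeps small-magnitude weights from being flattened by a single large outlier elsewhere in the tensor, at the cost of storing one extra scale per group. That storage-versus-accuracy tension is one instance of the size/performance trade-off the abstract refers to.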

Title: SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

Authors: Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, Yujun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11343
Pdf URL: https://arxiv.org/pdf/2508.11343
Copy Paste: [[2508.11343]] SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis(https://arxiv.org/abs/2508.11343)
Keywords: language model, llm
Abstract: The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal's spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
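The central feature described here, the total energy of the discrete Fourier transform of a token log-probability sequence, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the mean-centring and the division by sequence length are normalization choices made for this sketch, not taken from the paper, and the two log-probability sequences are invented rather than produced by a real model.

```python
import cmath
import math


def dft_total_energy(log_probs):
    """Sum of squared DFT magnitudes of a mean-centred sequence, divided by n.

    Mean-centring (which removes the zero-frequency term) and the 1/n factor
    are assumptions of this sketch.
    """
    n = len(log_probs)
    if n == 0:
        return 0.0
    mean = sum(log_probs) / n
    centred = [x - mean for x in log_probs]
    energy = 0.0
    for k in range(n):
        # k-th DFT coefficient, computed directly from the definition.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        energy += abs(coeff) ** 2
    return energy / n


# Invented sequences: one with large swings, one nearly flat.
fluctuating = [-0.5, -6.0, -1.0, -7.5, -0.2, -5.0]
smooth = [-1.0, -1.2, -0.9, -1.1, -1.0, -1.1]
print(dft_total_energy(fluctuating) > dft_total_energy(smooth))
```

By Parseval's theorem, this normalized energy equals the sum of squared deviations from the mean, so larger-amplitude fluctuations in the log-probability sequence translate directly into a higher score. A detector in this style would compare the score against a threshold calibrated on reference data; no threshold value is given in the abstract, so none is assumed here.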

Title: Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

Authors: Sylvio Rüdian, Yassin Elsir, Marvin Kretschmer, Sabine Cayrou, Niels Pinkwart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11364
Pdf URL: https://arxiv.org/pdf/2508.11364
Copy Paste: [[2508.11364]] Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning(https://arxiv.org/abs/2508.11364)
Keywords: language model, llm
Abstract: Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.

Title: When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

Authors: Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11383
Pdf URL: https://arxiv.org/pdf/2508.11383
Copy Paste: [[2508.11383]] When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs(https://arxiv.org/abs/2508.11383)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: this https URL.

Title: Retrieval-augmented reasoning with lean language models

Authors: Ryan Sze-Yin Chan, Federico Nanni, Tomas Lazauskas, Rosie Wood, Penelope Yong, Lionel Tarassenko, Mark Girolami, James Geddes, Andrew Duncan
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11386
Pdf URL: https://arxiv.org/pdf/2508.11386
Copy Paste: [[2508.11386]] Retrieval-augmented reasoning with lean language models(https://arxiv.org/abs/2508.11386)
Keywords: language model, retrieval augmented generation, agent
Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

Title: Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Authors: Shangrui Nie, Florian Mai, David Kaczér, Charles Welch, Zhixue Zhao, Lucie Flek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11414
Pdf URL: https://arxiv.org/pdf/2508.11414
Copy Paste: [[2508.11414]] Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions(https://arxiv.org/abs/2508.11414)
Keywords: language model, llm
Abstract: Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of fine-tuning on the model's behavior in two ways. First, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach not only changes the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.

Title: HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor

Authors: Shivam Dubey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11429
Pdf URL: https://arxiv.org/pdf/2508.11429
Copy Paste: [[2508.11429]] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor(https://arxiv.org/abs/2508.11429)
Keywords: language model, llm, chain-of-thought
Abstract: Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener's cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.

Title: Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse

Authors: Aditi Dutta, Susan Banducci
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11434
Pdf URL: https://arxiv.org/pdf/2508.11434
Copy Paste: [[2508.11434]] Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse(https://arxiv.org/abs/2508.11434)
Keywords: language model, llm
Abstract: Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis shows that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.

Title: Reference Points in LLM Sentiment Analysis: The Role of Structured Context

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11454
Pdf URL: https://arxiv.org/pdf/2508.11454
Copy Paste: [[2508.11454]] Reference Points in LLM Sentiment Analysis: The Role of Structured Context(https://arxiv.org/abs/2508.11454)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation-disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
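To make the NL versus JSON comparison concrete, the sketch below renders the same review and reference points in both formats. The field names (`business_average_rating`, `user_average_rating`), the prompt wording, and the 1-to-5 rating task are hypothetical; the abstract does not specify the actual prompt templates.

```python
import json


def build_nl_prompt(review, context):
    """Render the supplementary context as a plain sentence (hypothetical wording)."""
    facts = ", ".join(
        f"{key.replace('_', ' ')} is {value}" for key, value in context.items()
    )
    return f"Review: {review}\nContext: {facts}.\nRate the sentiment from 1 to 5."


def build_json_prompt(review, context):
    """Render the same information as a structured JSON payload."""
    payload = {
        "review": review,
        "context": context,
        "task": "Rate the sentiment from 1 to 5.",
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)


review = "The pasta was fine, but we waited forty minutes."
# Hypothetical reference points; the paper's actual fields may differ.
context = {"business_average_rating": 4.5, "user_average_rating": 3.0}

nl_prompt = build_nl_prompt(review, context)
json_prompt = build_json_prompt(review, context)
print(nl_prompt)
print(json_prompt)
```

Both prompts carry identical information, so any difference in model output can be attributed to format rather than content.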

Title: Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models

Authors: Monika Jotautaitė, Lucius Caviola, David A. Brewster, Thilo Hagendorff
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.11534
Pdf URL: https://arxiv.org/pdf/2508.11534
Copy Paste: [[2508.11534]] Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models(https://arxiv.org/abs/2508.11534)
Keywords: language model, llm
Abstract: As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias (discrimination based on species membership) and how they value non-human animals. We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.

Title: Language models align with brain regions that represent concepts across modalities

Authors: Maria Ryskina, Greta Tuckute, Alexander Fung, Ashley Malkin, Evelina Fedorenko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11536
Pdf URL: https://arxiv.org/pdf/2508.11536
Copy Paste: [[2508.11536]] Language models align with brain regions that represent concepts across modalities(https://arxiv.org/abs/2508.11536)
Keywords: language model
Abstract: Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today's language models (LMs), we investigate the relationship between LM-brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.

Title: AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

Authors: Jinpeng Hu, Ao Wang, Qianqian Xie, Hui Ma, Zhuo Li, Dan Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11567
Pdf URL: https://arxiv.org/pdf/2508.11567
Copy Paste: [[2508.11567]] AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment(https://arxiv.org/abs/2508.11567)
Keywords: agent
Abstract: Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
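The tree-structured memory described in the abstract (a root holding basic user information, with topic and statement children grouped by symptom category and interaction turn) might be laid out roughly as follows. All class, field, and method names here are hypothetical; this is a sketch of the shape of the data structure, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StatementNode:
    """Leaf node: one user statement, tagged with its interaction turn."""
    turn: int
    text: str


@dataclass
class TopicNode:
    """Child node: groups statements under one symptom category."""
    symptom: str
    statements: List[StatementNode] = field(default_factory=list)


@dataclass
class MemoryTree:
    """Root node: holds the user's basic information plus topic children."""
    basic_info: Dict[str, str]
    topics: Dict[str, TopicNode] = field(default_factory=dict)

    def add_statement(self, symptom: str, turn: int, text: str) -> None:
        # Create the topic node on first use, then append the new statement.
        topic = self.topics.setdefault(symptom, TopicNode(symptom))
        topic.statements.append(StatementNode(turn, text))

    def covered(self, symptom: str) -> bool:
        # A questioning agent could check this to avoid redundant questions.
        return symptom in self.topics and bool(self.topics[symptom].statements)


memory = MemoryTree(basic_info={"age": "29"})
memory.add_statement("sleep", turn=1, text="I wake up several times a night.")
print(memory.covered("sleep"), memory.covered("appetite"))
```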

Title: Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models

Authors: Qiguang Chen, Dengyun Peng, Jinhao Liu, HuiKang Su, Jiannan Guan, Libo Qin, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.11582
Pdf URL: https://arxiv.org/pdf/2508.11582
Copy Paste: [[2508.11582]] Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models(https://arxiv.org/abs/2508.11582)
Keywords: language model, llm, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve the efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.

Title: Dataset Creation for Visual Entailment using Generative AI

Authors: Rob Reijtenbach, Suzan Verberne, Gijs Wijnholds
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11605
Pdf URL: https://arxiv.org/pdf/2508.11605
Copy Paste: [[2508.11605]] Dataset Creation for Visual Entailment using Generative AI(https://arxiv.org/abs/2508.11605)
Keywords: prompt
Abstract: In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score of 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
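To put the "slight drop" in relative terms, the F-scores quoted in the abstract can be compared directly. The helper name `relative_drop` is illustrative; only the four F-scores come from the abstract.

```python
def relative_drop(real, synthetic):
    """Fractional decrease of the synthetic-data score relative to the real-data score."""
    return (real - synthetic) / real


# F-scores as reported in the abstract (real data vs. synthetic data).
snli_ve = relative_drop(real=0.703, synthetic=0.686)
sick_vte = relative_drop(real=0.400, synthetic=0.384)

print(f"SNLI-VE relative drop:  {snli_ve * 100:.1f}%")
print(f"SICK-VTE relative drop: {sick_vte * 100:.1f}%")
```

Under this calculation the relative decrease is roughly 2.4% on SNLI-VE and 4.0% on SICK-VTE.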

Title: TinyTim: A Family of Language Models for Divergent Generation

Authors: Christopher J. Agostino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.11607
Pdf URL: https://arxiv.org/pdf/2508.11607
Copy Paste: [[2508.11607]] TinyTim: A Family of Language Models for Divergent Generation(https://arxiv.org/abs/2508.11607)
Keywords: language model
Abstract: This work introduces TinyTim, a family of large language models fine-tuned on James Joyce's 'Finnegans Wake'. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.