2025-09-04

Title: Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models

Authors: Gustavo Bonil, João Gondim, Marina dos Santos, Simone Hashiguti, Helena Maia, Nadia Silva, Helio Pedrini, Sandra Avila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02834
Pdf URL: https://arxiv.org/pdf/2509.02834
Copy Paste: [[2509.02834]] Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models(https://arxiv.org/abs/2509.02834)
Keywords: language model
Abstract: This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colonially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach that combines machine learning techniques with qualitative, manual discourse analysis.

Title: IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

Authors: Hyunji Nam, Lucia Langlois, James Malamut, Mei Tan, Dorottya Demszky
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.02855
Pdf URL: https://arxiv.org/pdf/2509.02855
Copy Paste: [[2509.02855]] IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations(https://arxiv.org/abs/2509.02855)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists. In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlign, an intuitive benchmarking paradigm for capturing expert similarity ratings via a "pick-the-odd-one-out" triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlign, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlign as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
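The "pick-the-odd-one-out" triplet task in the abstract above can be illustrated with a small sketch. This is not the authors' code: the Jaccard word-overlap similarity is only a stand-in for whichever metric is being evaluated, and the function names `jaccard`, `odd_one_out`, and `agreement` are hypothetical. The idea is that a similarity metric implicitly picks the odd item as the one left outside the most similar pair, and that pick can then be compared with the expert's choice.

```python
from itertools import combinations

def jaccard(a, b):
    """Stand-in similarity metric (assumption): word overlap between two annotations."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def odd_one_out(triplet, sim=jaccard):
    """Index of the item left out of the most similar pair in a 3-item triplet."""
    best_pair = max(combinations(range(3), 2),
                    key=lambda p: sim(triplet[p[0]], triplet[p[1]]))
    return ({0, 1, 2} - set(best_pair)).pop()

def agreement(triplets, expert_choices, sim=jaccard):
    """Fraction of triplets where the metric's odd item matches the expert's."""
    hits = sum(odd_one_out(t, sim) == e for t, e in zip(triplets, expert_choices))
    return hits / len(triplets)

triplet = ["the student adds fractions",
           "student adds fractions correctly",
           "weather is sunny today"]
print(odd_one_out(triplet))
print(agreement([triplet], [2]))
```

Swapping `sim` for an embedding-based or LLM-based scorer would let the same agreement figure be computed for each candidate metric.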

Title: A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation

Authors: Kesen Wang, Daulet Toibazar, Pedro J. Moreno
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.02864
Pdf URL: https://arxiv.org/pdf/2509.02864
Copy Paste: [[2509.02864]] A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation(https://arxiv.org/abs/2509.02864)
Keywords: language model, agent
Abstract: We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.

Title: English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

Authors: Taekyung Ahn, Hosung Nam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02915
Pdf URL: https://arxiv.org/pdf/2509.02915
Copy Paste: [[2509.02915]] English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM(https://arxiv.org/abs/2509.02915)
Keywords: language model, llm
Abstract: This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
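For readers unfamiliar with the metrics quoted in the abstract above (PCC, WER, PER), here is a minimal sketch of how such figures are commonly computed. It is a generic illustration, not the paper's evaluation script; the helper names `pearson` and `error_rate` are hypothetical. WER and PER use the same edit-distance formula, applied to word tokens and phoneme tokens respectively.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

def error_rate(reference, hypothesis):
    """Levenshtein distance between token lists, divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,                            # deletion
                            curr[j - 1] + 1,                        # insertion
                            prev[j - 1] + (ref_tok != hyp_tok)))    # substitution
        prev = curr
    return prev[-1] / len(reference)

# Word-level (WER) and phoneme-level (PER) use the same function.
print(error_rate("the cat sat".split(), "the cat sat down".split()))
print(error_rate(["K", "AE", "T"], ["K", "AH", "T"]))
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```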

Title: ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2509.02949
Pdf URL: https://arxiv.org/pdf/2509.02949
Copy Paste: [[2509.02949]] ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly(https://arxiv.org/abs/2509.02949)
Keywords: llm
Abstract: Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.

Title: DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling

Authors: Yougen Zhou, Ningning Zhou, Qin Chen, Jie Zhou, Aimin Zhou, Liang He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.02999
Pdf URL: https://arxiv.org/pdf/2509.02999
Copy Paste: [[2509.02999]] DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling(https://arxiv.org/abs/2509.02999)
Keywords: language model, llm, agent
Abstract: Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs' ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.

Title: Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

Authors: Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2509.03020
Pdf URL: https://arxiv.org/pdf/2509.03020
Copy Paste: [[2509.03020]] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction(https://arxiv.org/abs/2509.03020)
Keywords: language model, llm
Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
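The abstract above refers to using the embedding of the final token (for example [EOS]) as the text embedding. A minimal sketch of that pooling step follows, using plain Python lists so it runs without any model. It assumes right-padded sequences and a 0/1 attention mask; the names `last_token_embedding` and `cosine` are hypothetical, and the sketch does not cover the paper's EBQ2D/EBD2Q training stage.

```python
from math import sqrt

def last_token_embedding(hidden_states, attention_mask):
    """Hidden state of the last non-padding token (right padding assumed)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("one mask entry is needed per token")
    # Index of the last position whose mask is 1, typically the [EOS] token.
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]

def cosine(u, v):
    """Cosine similarity, the usual score for comparing query and document embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy per-token hidden states: three positions, the last one is padding.
query_states = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
query_mask = [1, 1, 0]
doc_states = [[0.5, 0.1], [0.6, 0.8]]
doc_mask = [1, 1]

q = last_token_embedding(query_states, query_mask)
d = last_token_embedding(doc_states, doc_mask)
print(q, d, cosine(q, d))
```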

Title: Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

Authors: Ming Gong, Yingnan Deng, Nia Qi, Yujun Zou, Zhihao Xue, Yun Zi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03057
Pdf URL: https://arxiv.org/pdf/2509.03057
Copy Paste: [[2509.03057]] Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models(https://arxiv.org/abs/2509.03057)
Keywords: language model
Abstract: This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.

Title: Measuring Scalar Constructs in Social Science with LLMs

Authors: Hauke Licht, Rupak Sarkar, Patrick Y. Wu, Pranav Goel, Niklas Stoehr, Elliott Ash, Alexander Miserlis Hoyle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03116
Pdf URL: https://arxiv.org/pdf/2509.03116
Copy Paste: [[2509.03116]] Measuring Scalar Constructs in Social Science with LLMs(https://arxiv.org/abs/2509.03116)
Keywords: language model, llm, prompt
Abstract: Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just "simple" or "complex," but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study yields actionable findings for applied researchers. First, LLMs prompted to generate pointwise scores directly from texts produce discontinuous distributions with bunching at arbitrary numbers. The quality of the measurements improves with pairwise comparisons made by LLMs, but it improves even more by taking pointwise scores and weighting them by token probability. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
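Token-probability-weighted pointwise scoring, as named in the abstract above, can be sketched as an expected value over the candidate score tokens. This is a hedged illustration rather than the authors' implementation: the input dictionary of token probabilities is assumed to come from whatever next-token probabilities the model exposes, and the function name `weighted_score` is hypothetical.

```python
def weighted_score(token_probs):
    """Expected score over numeric score tokens, renormalised to sum to 1.

    token_probs maps candidate next tokens (e.g. "1".."9") to probabilities.
    Non-numeric tokens are ignored (assumption of this sketch).
    """
    mass = {}
    for token, prob in token_probs.items():
        text = token.strip()
        if text.isdigit():
            # Merge variants such as "7" and " 7" into one score value.
            mass[int(text)] = mass.get(int(text), 0.0) + prob
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no probability mass on numeric score tokens")
    return sum(score * prob for score, prob in mass.items()) / total

# Direct scoring would return only the most likely token ("7"); the weighted
# score also uses the probability placed on the neighbouring values.
probs = {"7": 0.5, "6": 0.3, "8": 0.1, " 7": 0.05, "The": 0.05}
print(weighted_score(probs))
```

Because the result is a weighted average rather than a single sampled integer, it can take values between the integers, which is one way to avoid the bunching at particular numbers that the abstract reports for direct pointwise scoring.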

Title: From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models

Authors: Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Xiaoling Wang, Linlin Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.03122
Pdf URL: https://arxiv.org/pdf/2509.03122
Copy Paste: [[2509.03122]] From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models(https://arxiv.org/abs/2509.03122)
Keywords: language model, llm
Abstract: The intellectual property (IP) protection of Large Language Models (LLMs) is increasingly critical. Injecting specialized fingerprints into LLMs through instruction tuning is a common IP protection technique. However, this may significantly degrade model performance, requires substantial computational resources, and exhibits poor persistence under model modifications. We argue that knowledge editing offers a lightweight alternative that is more suitable for fingerprint injection. Accordingly, we apply knowledge editing to fingerprint injection for the first time and demonstrate its strong capability. Despite using scrambled text as fingerprints to prevent them from being overwritten during fine-tuning, degradation still occurs under large-scale fine-tuning. To address this, we propose Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining the update of the fingerprint subspace. The performance of FSFT exceeds fine-tuning by 10% even in the worst-case scenario. Additionally, we observe that the fingerprint-injected models struggle to distinguish between fingerprints and similar texts due to the high similarity of their features. This finding underscores the urgent need for more robust and fine-grained fingerprinting injection methods for LLMs.

Title: Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader

Authors: Jannis Vamvas, Ignacio Pérez Prat, Not Battesta Soliva, Sandra Baltermia-Guetg, Andrina Beeli, Simona Beeli, Madlaina Capeder, Laura Decurtins, Gian Peder Gregori, Flavia Hobi, Gabriela Holderegger, Arina Lazzarini, Viviana Lazzarini, Walter Rosselli, Bettina Vital, Anna Rutkiewicz, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03148
Pdf URL: https://arxiv.org/pdf/2509.03148
Copy Paste: [[2509.03148]] Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader(https://arxiv.org/abs/2509.03148)
Keywords: llm
Abstract: The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.

Title: Domain Adaptation of LLMs for Process Data

Authors: Rafael Seidi Oyamada, Jari Peeperkorn, Jochen De Weerdt, Johannes De Smedt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.03161
Pdf URL: https://arxiv.org/pdf/2509.03161
Copy Paste: [[2509.03161]] Domain Adaptation of LLMs for Process Data(https://arxiv.org/abs/2509.03161)
Keywords: language model, llm, prompt
Abstract: In recent years, Large Language Models (LLMs) have emerged as a prominent area of interest across various research domains, including Process Mining (PM). Current applications in PM have predominantly centered on prompt engineering strategies or the transformation of event logs into narrative-style datasets, thereby exploiting the semantic capabilities of LLMs to address diverse tasks. In contrast, this study investigates the direct adaptation of pretrained LLMs to process data without natural language reformulation, motivated by the fact that these models excel in generating sequences of tokens, similar to the objective in PM. More specifically, we focus on parameter-efficient fine-tuning techniques to mitigate the computational overhead typically associated with such models. Our experimental setup focuses on Predictive Process Monitoring (PPM), and considers both single- and multi-task predictions. The results demonstrate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and recent narrative-style-based solutions, particularly in the multi-task setting. Additionally, our fine-tuned models exhibit faster convergence and require significantly less hyperparameter optimization.

Title: SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

Authors: Ashmari Pramodya, Nirasha Nelki, Heshan Shalinda, Chamila Liyanage, Yusuke Sakai, Randil Pushpananda, Ruvan Weerasinghe, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03162
Pdf URL: https://arxiv.org/pdf/2509.03162
Copy Paste: [[2509.03162]] SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala(https://arxiv.org/abs/2509.03162)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

Title: AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Authors: Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2509.03312
Pdf URL: https://arxiv.org/pdf/2509.03312
Copy Paste: [[2509.03312]] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?(https://arxiv.org/abs/2509.03312)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.

Title: LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations

Authors: Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03405
Pdf URL: https://arxiv.org/pdf/2509.03405
Copy Paste: [[2509.03405]] LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations(https://arxiv.org/abs/2509.03405)
Keywords: language model
Abstract: Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.

Title: Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

Authors: Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03419
Pdf URL: https://arxiv.org/pdf/2509.03419
Copy Paste: [[2509.03419]] Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges(https://arxiv.org/abs/2509.03419)
Keywords: language model, llm
Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

Title: Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games

Authors: Haonan Wang, Mingjia Zhao, Junfeng Sun, Wei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.03479
Pdf URL: https://arxiv.org/pdf/2509.03479
Copy Paste: [[2509.03479]] Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games(https://arxiv.org/abs/2509.03479)
Keywords: agent
Abstract: As AI technology advances, research in playing text-based games with agents has become increasingly popular. In this paper, a novel approach to agent design and agent learning is presented in the context of reinforcement learning. A deep learning model is first applied to process game text and build a world model. Next, the agent is trained through a policy gradient-based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text-based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical ground for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.