2025-07-09

Title: TokenShapley: Token Level Context Attribution with Shapley Value

Authors: Yingtai Xiao, Yuqing Zhu, Sirat Samyoun, Wanrong Zhang, Jiachen T. Wang, Jian Du
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05261
Pdf URL: https://arxiv.org/pdf/2507.05261
Copy Paste: [[2507.05261]] TokenShapley: Token Level Context Attribution with Shapley Value(https://arxiv.org/abs/2507.05261)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.
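
The Shapley-value idea behind token-level attribution can be illustrated with a minimal sketch. It computes exact Shapley values by enumerating subsets, which is only feasible for a handful of tokens; it does not reproduce the paper's KNN datastore or its approximation scheme. The function names, the toy utility, and the example tokens are all hypothetical, and tokens are assumed to be unique.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for utility(frozenset) -> float."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical utility: the response keyword is supported only when both
# "Apollo" and "1969" are present in the context; "the" adds nothing.
def toy_utility(subset):
    return 1.0 if {"Apollo", "1969"} <= subset else 0.0


context_tokens = ["the", "Apollo", "1969"]
scores = shapley_values(context_tokens, toy_utility)
print(scores)
```

In this toy setting the two informative tokens split the credit equally and the filler token receives none, which is the behavior one would want from a fine-grained attribution score.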

Title: User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs

Authors: Sougata Saha, Monojit Choudhury
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05266
Pdf URL: https://arxiv.org/pdf/2507.05266
Copy Paste: [[2507.05266]] User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs(https://arxiv.org/abs/2507.05266)
Keywords: language model, gpt, llm
Abstract: Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework's predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.

Title: Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, prompt engineering, and cross-modal fusion

Authors: Miloud Mihoubi, Meriem Zerkouk, Belkacem Chikhaoui
Subjects: cs.CL, cs.AI, cs.CY, cs.IR
Abstract URL: https://arxiv.org/abs/2507.05285
Pdf URL: https://arxiv.org/pdf/2507.05285
Copy Paste: [[2507.05285]] Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, prompt engineering, and cross-modal fusion(https://arxiv.org/abs/2507.05285)
Keywords: prompt, retrieval-augmented generation
Abstract: Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., "isolation," "workload anxiety"). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.

Title: LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review

Authors: Cheng Yuan, Xinkai Rui, Yongqi Fan, Yawei Fan, Boyang Zhong, Jiacheng Wang, Weiyan Zhang, Tong Ruan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05319
Pdf URL: https://arxiv.org/pdf/2507.05319
Copy Paste: [[2507.05319]] LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review(https://arxiv.org/abs/2507.05319)
Keywords: language model, llm, hallucination
Abstract: Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository this https URL.
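
A source mapping table built from textual similarity, as the abstract describes, might look roughly like the sketch below. It is an illustration only: the abstract does not say which similarity measure LCDS uses, so the standard-library `difflib.SequenceMatcher` ratio is swapped in, and the function name, threshold, and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def build_source_map(summary_sentences, emr_segments, threshold=0.5):
    """Map each summary sentence to its most similar EMR segment id.

    Sentences whose best similarity falls below the threshold map to None,
    flagging content that has no clear source in the record.
    """
    mapping = {}
    for sentence in summary_sentences:
        best_id, best_score = None, 0.0
        for seg_id, seg_text in emr_segments.items():
            score = SequenceMatcher(None, sentence.lower(), seg_text.lower()).ratio()
            if score > best_score:
                best_id, best_score = seg_id, score
        mapping[sentence] = best_id if best_score >= threshold else None
    return mapping


# Hypothetical EMR segments and a one-sentence summary.
emr = {
    "admission_note": "Patient admitted with chest pain.",
    "lab_report": "Troponin elevated at 0.8 ng/mL.",
}
summary = ["Patient admitted with chest pain."]
source_map = build_source_map(summary, emr)
print(source_map)
```

A table of this kind is what lets a reviewer click from a generated sentence back to the record segment it was drawn from.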

Title: MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

Authors: Ming Gong, Xucheng Huang, Chenghan Yang, Xianhan Peng, Haoxin Wang, Yang Liu, Ling Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05330
Pdf URL: https://arxiv.org/pdf/2507.05330
Copy Paste: [[2507.05330]] MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents(https://arxiv.org/abs/2507.05330)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular "MLLM-as-Tool" strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.

Title: LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks

Authors: William Fleshman, Benjamin Van Durme
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05346
Pdf URL: https://arxiv.org/pdf/2507.05346
Copy Paste: [[2507.05346]] LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks(https://arxiv.org/abs/2507.05346)
Keywords: language model, retrieval-augmented generation
Abstract: The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG's compatibility with alternative solutions such as retrieval-augmented generation (RAG).

Title: On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Authors: Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05362
Pdf URL: https://arxiv.org/pdf/2507.05362
Copy Paste: [[2507.05362]] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study(https://arxiv.org/abs/2507.05362)
Keywords: language model, llm
Abstract: Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
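
For readers unfamiliar with the task, the "optimal bottom-up dynamic programming" solution on a layered graph can be sketched as follows. This shows the underlying algorithm only, not the paper's trace format or tokenizer; the graph encoding, function names, and example weights are hypothetical.

```python
def shortest_path_layered(layers, edges):
    """Bottom-up DP on a layered graph.

    layers: list of lists of node ids, first layer to last.
    edges:  dict mapping (u, v) -> cost, with u in layer i and v in layer i + 1.
    Returns (cost_to_goal, next_hop) dictionaries.
    """
    inf = float("inf")
    cost = {v: 0.0 for v in layers[-1]}  # nodes in the last layer are goals
    next_hop = {}
    # Sweep from the second-to-last layer back to the first.
    for i in range(len(layers) - 2, -1, -1):
        for u in layers[i]:
            best, best_v = inf, None
            for v in layers[i + 1]:
                w = edges.get((u, v))
                if w is not None and w + cost[v] < best:
                    best, best_v = w + cost[v], v
            cost[u] = best
            next_hop[u] = best_v
    return cost, next_hop


def trace_path(start, next_hop):
    """Follow next_hop pointers from start until a goal node is reached."""
    path = [start]
    while next_hop.get(path[-1]) is not None:
        path.append(next_hop[path[-1]])
    return path


layers = [["s"], ["a", "b"], ["t"]]
edges = {("s", "a"): 1, ("s", "b"): 4, ("a", "t"): 5, ("b", "t"): 1}
cost, next_hop = shortest_path_layered(layers, edges)
print(cost["s"], trace_path("s", next_hop))
```

Each node's cost is computed exactly once, which is why this trace is the efficient baseline that the longer backtracking traces are compared against.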

Title: The Generalization Ridge: Information Flow in Natural Language Generation

Authors: Ruidi Chang, Chunyuan Deng, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05387
Pdf URL: https://arxiv.org/pdf/2507.05387
Copy Paste: [[2507.05387]] The Generalization Ridge: Information Flow in Natural Language Generation(https://arxiv.org/abs/2507.05387)
Keywords: language model
Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients, trainable scalar parameters applied to each residual block, which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
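
The residual scaling coefficients can be pictured with a small sketch. It assumes the common form h ← h + α·f(h) with one scalar α per block; the abstract does not give the exact parameterization, so this form, the function names, and the toy block are assumptions. The coefficients are fixed here rather than trained.

```python
def scaled_residual_forward(x, blocks, alphas):
    """Apply residual blocks with one scalar coefficient per block.

    Assumed update rule (not confirmed by the abstract): h <- h + alpha * f(h).
    """
    h = list(x)
    for block, alpha in zip(blocks, alphas):
        update = block(h)
        h = [h_i + alpha * u_i for h_i, u_i in zip(h, update)]
    return h


def double(h):
    """Hypothetical stand-in for a transformer block."""
    return [2.0 * v for v in h]


muted = scaled_residual_forward([1.0, 2.0], [double], [0.0])  # block switched off
full = scaled_residual_forward([1.0, 2.0], [double], [1.0])   # ordinary residual
print(muted, full)
```

Under this form, a coefficient near zero means the model has learned to bypass that block, which is how such scalars can act as probes of layer importance.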

Title: Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

Authors: Guillem Ramírez, Alexandra Birch, Ivan Titov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05391
Pdf URL: https://arxiv.org/pdf/2507.05391
Copy Paste: [[2507.05391]] Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences(https://arxiv.org/abs/2507.05391)
Keywords: language model, llm
Abstract: Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.

Title: Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Authors: Jaedong Hwang, Kumar Tanmay, Seok-Jin Lee, Ayush Agrawal, Hamid Palangi, Kumar Ayush, Ila Fiete, Paul Pu Liang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05418
Pdf URL: https://arxiv.org/pdf/2507.05418
Copy Paste: [[2507.05418]] Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning(https://arxiv.org/abs/2507.05418)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization.

Title: "Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models

Authors: Yufei Tao, Adam Hiatt, Rahul Seetharaman, Ameeta Agrawal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05424
Pdf URL: https://arxiv.org/pdf/2507.05424
Copy Paste: [[2507.05424]] "Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models(https://arxiv.org/abs/2507.05424)
Keywords: language model, llm, hallucination, prompt, chain-of-thought
Abstract: Large language models are capable of leveraging both contextual and parametric knowledge but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.
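
A positional bias of the lost-in-the-later kind could be checked with something like the sketch below, which compares how many context facts from the earlier and later parts of the context reappear in a response. This is not the CoPE metric, whose definition the abstract does not give; the substring matching, binning scheme, function name, and sample data are all hypothetical simplifications.

```python
def positional_recall(context_facts, response, n_bins=2):
    """Fraction of context facts mentioned in the response, per position bin.

    Facts are assigned to bins by their order in the context; a fact counts
    as recalled if it appears verbatim (case-insensitive) in the response.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    text = response.lower()
    for i, fact in enumerate(context_facts):
        b = min(i * n_bins // len(context_facts), n_bins - 1)
        totals[b] += 1
        if fact.lower() in text:
            hits[b] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]


facts = [
    "born in 1912",
    "studied at Cambridge",
    "worked at Bletchley Park",
    "died in 1954",
]
answer = "He was born in 1912 and studied at Cambridge."
recall_by_bin = positional_recall(facts, answer)
print(recall_by_bin)
```

A consistently lower value in the last bin across many examples would be the signature of the effect the abstract describes.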

Title: PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

Authors: Sana Kang, Myeongseok Gwon, Su Young Kwon, Jaewook Lee, Andrew Lan, Bhiksha Raj, Rita Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05444
Pdf URL: https://arxiv.org/pdf/2507.05444
Copy Paste: [[2507.05444]] PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs(https://arxiv.org/abs/2507.05444)
Keywords: language model, llm
Abstract: Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner's first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves an L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.

Title: On the Semantics of Large Language Models

Authors: Martin Schuele
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05448
Pdf URL: https://arxiv.org/pdf/2507.05448
Copy Paste: [[2507.05448]] On the Semantics of Large Language Models(https://arxiv.org/abs/2507.05448)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.

Title: ModelCitizens: Representing Community Voices in Online Safety

Authors: Ashima Suvarna, Christina Chance, Hamid Palangi, Sophie Hao, Thomas Hartvigsen, Saadia Gabriel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05455
Pdf URL: https://arxiv.org/pdf/2507.05455
Copy Paste: [[2507.05455]] ModelCitizens: Representing Community Voices in Online Safety(https://arxiv.org/abs/2507.05455)
Keywords: gpt, llm
Abstract: Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation.

Title: Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Authors: Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Kevin Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05517
Pdf URL: https://arxiv.org/pdf/2507.05517
Copy Paste: [[2507.05517]] Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications(https://arxiv.org/abs/2507.05517)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.

Title: Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

Authors: Alex ZH Dou, Zhongwei Wan, Dongfei Cui, Xin Wang, Jing Xiong, Haokun Lin, Chaofan Tao, Shen Yan, Mi Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05557
Pdf URL: https://arxiv.org/pdf/2507.05557
Copy Paste: [[2507.05557]] Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS(https://arxiv.org/abs/2507.05557)
Keywords: language model, llm, chain-of-thought
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.

Title: Self-Review Framework for Enhancing Instruction Following Capability of LLM

Authors: Sihyun Park
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05598
Pdf URL: https://arxiv.org/pdf/2507.05598
Copy Paste: [[2507.05598]] Self-Review Framework for Enhancing Instruction Following Capability of LLM(https://arxiv.org/abs/2507.05598)
Keywords: language model, gpt, llm
Abstract: Various techniques have been proposed to improve the adherence of large language models (LLMs) to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data, while maintaining response quality with a 64.24% win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.

Title: Flipping Knowledge Distillation: Leveraging Small Models' Expertise to Enhance LLMs in Text Matching

Authors: Mingzhe Li, Jing Xiang, Qishen Zhang, Kaiyang Wan, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05617
Pdf URL: https://arxiv.org/pdf/2507.05617
Copy Paste: [[2507.05617]] Flipping Knowledge Distillation: Leveraging Small Models' Expertise to Enhance LLMs in Text Matching(https://arxiv.org/abs/2507.05617)
Keywords: language model, llm
Abstract: Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.

Title: SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

Authors: Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.05633
Pdf URL: https://arxiv.org/pdf/2507.05633
Copy Paste: [[2507.05633]] SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression(https://arxiv.org/abs/2507.05633)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

Authors: Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05639
Pdf URL: https://arxiv.org/pdf/2507.05639
Copy Paste: [[2507.05639]] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?(https://arxiv.org/abs/2507.05639)
Keywords: gpt, llm, agent
Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.

Title: Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

Authors: SeungWon Ji, Jungyup Lee, Jemin Kim, Sang Park, SeungJae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05686
Pdf URL: https://arxiv.org/pdf/2507.05686
Copy Paste: [[2507.05686]] Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs(https://arxiv.org/abs/2507.05686)
Keywords: language model, llm, prompt
Abstract: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt's language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.
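
The token-level adjustment described in the Smoothie-Qwen abstract can be pictured with a small sketch. This is only an illustration under assumptions: the abstract does not say how tokens are selected or how much their probabilities are changed, so the function name `smooth_probs`, the `undesired_ids` set, and the `scale` factor are all hypothetical. The sketch multiplies the probabilities of flagged tokens by a factor below 1 and renormalizes.

```python
def smooth_probs(probs, undesired_ids, scale=0.1):
    """Down-weight flagged token probabilities, then renormalize to sum to 1.

    probs         -- list of next-token probabilities
    undesired_ids -- hypothetical set of token ids to suppress
    scale         -- hypothetical multiplier in [0, 1] for flagged tokens
    """
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must be in [0, 1]")
    adjusted = [p * scale if i in undesired_ids else p
                for i, p in enumerate(probs)]
    total = sum(adjusted)
    if total == 0.0:
        raise ValueError("all probability mass was removed")
    return [p / total for p in adjusted]


# Toy three-token vocabulary; token 0 stands in for an undesired-language token.
probs = [0.5, 0.3, 0.2]
smoothed = smooth_probs(probs, undesired_ids={0}, scale=0.1)
```

Because only the flagged entries are rescaled, the relative ordering of the remaining tokens is unchanged, which is one plausible reason a post-hoc adjustment of this kind could leave task accuracy largely intact.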

Title: Agentic-R1: Distilled Dual-Strategy Reasoning

Authors: Weihua Du, Pranjal Aggarwal, Sean Welleck, Yiming Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05707
Pdf URL: https://arxiv.org/pdf/2507.05707
Copy Paste: [[2507.05707]] Agentic-R1: Distilled Dual-Strategy Reasoning(https://arxiv.org/abs/2507.05707)
Keywords: chain-of-thought, agent
Abstract: Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at this https URL

Title: DRAGON: Dynamic RAG Benchmark On News

Authors: Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05713
Pdf URL: https://arxiv.org/pdf/2507.05713
Copy Paste: [[2507.05713]] DRAGON: Dynamic RAG Benchmark On News(https://arxiv.org/abs/2507.05713)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically using a knowledge graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.

Title: HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

Authors: YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Jian Wang, Peng Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05714
Pdf URL: https://arxiv.org/pdf/2507.05714
Copy Paste: [[2507.05714]] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation(https://arxiv.org/abs/2507.05714)
Keywords: language model, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a "think before answering" strategy. This method enhances the model's open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model's performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

Title: Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Authors: Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.05724
Pdf URL: https://arxiv.org/pdf/2507.05724
Copy Paste: [[2507.05724]] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition(https://arxiv.org/abs/2507.05724)
Keywords: language model
Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
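
The contrast between per-layer routing and a shared router can be sketched in a few lines. This is a toy illustration, not the Omni-router implementation: the random routing matrices, top-1 selection by dot product, and the small random "drift" standing in for each layer's update are all assumptions made for the example, as are the names `make_router`, `top1_expert`, and `route_all_layers`.

```python
import random


def make_router(num_experts, dim, rng):
    """Random routing matrix with one weight row per expert (illustrative only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]


def top1_expert(router, hidden):
    """Pick the expert whose routing score (dot product) is largest."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    return max(range(len(scores)), key=scores.__getitem__)


def route_all_layers(routers, hidden, rng, drift=0.1):
    """Route one token through every layer, recording the chosen expert per layer."""
    choices = []
    for router in routers:
        choices.append(top1_expert(router, hidden))
        # Stand-in for the layer's actual computation: a small random update.
        hidden = [h + drift * rng.gauss(0.0, 1.0) for h in hidden]
    return choices


rng = random.Random(0)
num_layers, num_experts, dim = 6, 4, 16
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Independent routing: every layer owns its own routing matrix.
independent = [make_router(num_experts, dim, rng) for _ in range(num_layers)]
# Shared routing: the same routing matrix object is reused by every layer.
shared_router = make_router(num_experts, dim, rng)
shared = [shared_router] * num_layers

independent_choices = route_all_layers(independent, hidden, random.Random(1))
shared_choices = route_all_layers(shared, hidden, random.Random(1))
```

With a shared matrix, the expert chosen at each layer can differ only because the hidden state changes between layers, whereas independent matrices give no reason for choices to line up across layers.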

Title: GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

Authors: Yujia Hu, Tuan-Phong Nguyen, Shrestha Ghosh, Moritz Müller, Simon Razniewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05740
Pdf URL: https://arxiv.org/pdf/2507.05740
Copy Paste: [[2507.05740]] GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge(https://arxiv.org/abs/2507.05740)
Keywords: language model, gpt, llm
Abstract: Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at this https URL.

Title: DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

Authors: Jing Yang Lee, Hamed Bonab, Nasser Zalmout, Ming Zeng, Sanket Lokegaonkar, Colin Lockard, Binxuan Huang, Ritesh Sarkhel, Haodong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05750
Pdf URL: https://arxiv.org/pdf/2507.05750
Copy Paste: [[2507.05750]] DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities(https://arxiv.org/abs/2507.05750)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at this https URL.

Title: Flippi: End To End GenAI Assistant for E-Commerce

Authors: Anand A. Rajasekar, Praveen Tangarajan, Anjali Nainani, Amogh Batwal, Vinay Rao Dandin, Anusua Trivedi, Ozan Ersoy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05788
Pdf URL: https://arxiv.org/pdf/2507.05788
Copy Paste: [[2507.05788]] Flippi: End To End GenAI Assistant for E-Commerce(https://arxiv.org/abs/2507.05788)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi, a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi's unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. Additionally, the paper discusses Flippi's comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system's robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.

Title: Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports

Authors: Amane Watahiki, Tomoki Doi, Taiga Shinozaki, Satoshi Nishida, Takuya Niikawa, Katsunori Miyahara, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05799
Pdf URL: https://arxiv.org/pdf/2507.05799
Copy Paste: [[2507.05799]] Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports(https://arxiv.org/abs/2507.05799)
Keywords: language model, prompt
Abstract: One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.

Title: Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Authors: Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05890
Pdf URL: https://arxiv.org/pdf/2507.05890
Copy Paste: [[2507.05890]] Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators(https://arxiv.org/abs/2507.05890)
Keywords: language model, llm
Abstract: As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.

Title: Few-shot text-based emotion detection

Authors: Teodor-George Marchitan, Claudiu Creanga, Liviu P. Dinu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05918
Pdf URL: https://arxiv.org/pdf/2507.05918
Copy Paste: [[2507.05918]] Few-shot text-based emotion detection(https://arxiv.org/abs/2507.05918)
Keywords: language model, prompt
Abstract: This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of 0.7546 (26/96 teams) for the English subset, 0.1727 (35/36 teams) for the Portuguese (Mozambican) subset and 0.325 (1/31 teams) for the Emakhuwa subset.

Title: Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems

Authors: Sandeep Mishra, Anubhab Mandal, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05940
Pdf URL: https://arxiv.org/pdf/2507.05940
Copy Paste: [[2507.05940]] Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems(https://arxiv.org/abs/2507.05940)
Keywords: gpt, chat
Abstract: Ghosting, the ability to predict a user's intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available.
摘要：Ghoting，可以预测用户预期的文本输入用于Inline查询自动完成的能力，是现代搜索引擎和聊天接口的宝贵功能，可以极大地增强用户体验。通过建议完成不完整的查询（或前缀），Ghothing AIDS用速度缓慢，残疾或有限的语言水平的用户有助于用户。鬼魂是一个具有挑战性的问题，随着基于聊天的系统（例如Chatgpt，Copilot等）的无处不在。尽管基于聊天的系统的突出性越来越重要，但利用鬼魂的系统越来越突出，但这个具有挑战性的聊天问题却很少受到NLP/ML研究社区的关注。缺乏标准化的基准和深度学习和非深度学习方法的相对性能分析。我们通过使用四个公开可用的对话数据集对此问题进行开放而彻底的研究：两个人类（DailyDialog和DSTC7-Ubuntu）和两个人类机器人（开放助理和ShareGPT）。我们使用或不使用对话框上下文，尝试各种现有的查询自动完成方法（使用尝试），N-Gram方法和深度学习方法。我们还提出了一种基于熵的新型动态早期停止策略。我们的分析发现，统计N-Gram模型和尝试以模型性能和可见前缀的推理效率优于基于深度学习的模型。对于看不见的查询，T5和PHI-2等神经模型会带来更好的结果。添加会话上下文可以显着改善，尤其是对于开放式和股份而言。我们公开提供代码和数据

Title: OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

Authors: Lucas Fonseca Lage, Simon Ostermann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05965
Pdf URL: https://arxiv.org/pdf/2507.05965
Copy Paste: [[2507.05965]] OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation(https://arxiv.org/abs/2507.05965)
Keywords: language model, gpt, llm, chat
Abstract: We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: this https URL.
摘要：我们介绍了OpenFactScore，这是评估大语言模型（LLMS）生成的文本的事实框架的开源实现。 Factscore通过使用原子事实生成（AFG）来评估长形式文本的事实准确性（AFG）来提取个人事实主张和原子事实验证（AFV），以对可信赖的知识来源验证每个主张。虽然原始的Factscore依赖于封闭式和商业模型（例如Consendgpt和Chatgpt），但OpenFactScore允许使用任何拥抱的面部兼容模型对AFG和AFV使用。我们提供了有关实施的详细技术概述，突出了为支持开放模型的设计选择和修改。我们使用原始的FactScore基准评估了AFG和AFV上的多个开源LLM，并报告了AFG的Bertscore-F1，并且相对于AFV的人类注释，AFG和错误率。我们的结果表明，开放型模型可以近似封闭源系统的性能，而Gemma可以实现最佳的整体性能，而我们的最终设置与原始FactScore实验获得了0.99 Pearson的相关性。 OpenFactScore促进了透明度，可重复性和具有成本效益的评估，可在以下方面获得：此HTTPS URL。
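
Illustrative sketch (not taken from the OpenFActScore code base): the abstract describes a two-step scheme, extracting atomic facts (AFG) and validating each one against a knowledge source (AFV). Assuming the final score is the fraction of atomic facts judged supported, the aggregation step might look like the following. The function names, the toy knowledge set, and the exact-match validator are hypothetical stand-ins for the model-based components.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts that the validator marks as supported."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical knowledge source: a set of trusted statements (exact match only).
KNOWN = {
    "Paris is the capital of France",
    "The Seine flows through Paris",
}

# Hypothetical output of the atomic fact generation step.
facts = [
    "Paris is the capital of France",
    "The Seine flows through Paris",
    "Paris has 40 million residents",
]

score = factscore(facts, lambda fact: fact in KNOWN)
print(f"fraction of supported facts: {score:.3f}")
```

In a real pipeline the lambda would be replaced by a retrieval step plus a model call; only the averaging logic is shown here.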

Title: RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Authors: Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05980
Pdf URL: https://arxiv.org/pdf/2507.05980
Copy Paste: [[2507.05980]] RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages(https://arxiv.org/abs/2507.05980)
Keywords: language model, llm
Abstract: Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
摘要：大型语言模型（LLMS）及其安全分类器通常由于培训数据和评估基准有限，因此在低资源语言上的表现较差。本文介绍了Rabakbench，这是一种本地化为新加坡独特语言背景的新的多语言安全基准，涵盖了Singlish，Chinese，Malay和Tamil。 Rabakbench是通过可扩展的三阶段管道来构建的：（i）通过使用LLM驱动的红色团队增强真正的Singlish Web内容，生成 - 对抗性示例生成；（ii）标签 - 使用与人类判断对齐的多数级别LLM标签的半自动多标签安全注释；（iii）翻译 - 高保真翻译保留语言上的语言细微差别和毒性。最终的数据集包括四种语言和六个具有严重性水平的细粒度安全类别的5,000多个安全标记的示例。对11个受欢迎的开源和闭合护栏分类器的评估显示出明显的性能下降。 Rabakbench不仅可以在东南亚多语言环境中实现强大的安全评估，而且还提供了一个可重现的框架，用于在低资源环境中构建局部安全数据集。公开可用的基准数据集，包括人验证的翻译和评估代码。

Title: Evolution without Large Models: Training Language Model with Task Principles

Authors: Minghang Zhu, Shen Gao, Zhengliang Shi, Jiabao Fang, Pengjie Ren, Zhaochun Ren, Zhumin Chen, Shuo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05991
Pdf URL: https://arxiv.org/pdf/2507.05991
Copy Paste: [[2507.05991]] Evolution without Large Models: Training Language Model with Task Principles(https://arxiv.org/abs/2507.05991)
Keywords: language model, llm
Abstract: A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training. This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.
摘要：语言模型的一种常见培训方法涉及使用大规模语言模型扩展人类提供的数据集，随后将其用于模型该HTTP URL方法可以通过消除广泛的人类数据注释的需求来大大降低培训成本。但是，当我们使用封闭源LLM时，它仍然面临着诸如数据增强过程中高碳排放和数据泄漏的风险之类的挑战。为了解决这些问题，我们为语言模型提出了一种自我进化方法。首先，我们介绍了多级原则生成，该生成使大规模模型可以根据少量任务数据来汇总任务完成原则。然后，我们提出了基于原理的实例生成，其中较小规模的语言模型使用这些任务原理来生成大量数据。然后将这些数据用于模型培训。实验结果表明，与直接使用较小的语言模型生成数据相比，我们提出的方法显着改善了模型性能。此外，由于我们仅使用大型语言模型来生成任务完成原则，因此与训练相关的碳排放大大降低了。

Title: DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Authors: Nicholas Popovič, Ashish Kangen, Tim Schopf, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05997
Pdf URL: https://arxiv.org/pdf/2507.05997
Copy Paste: [[2507.05997]] DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations(https://arxiv.org/abs/2507.05997)
Keywords: language model, llm
Abstract: Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach, we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.
摘要：在零射门或少数拍摄设置中，在文档级实体和关系提取方面，大型高质量的注释语料库仍然很少。在本文中，我们提出了一条基于自动的，基于LLM的管道，用于合成数据生成和文档级实体和关系提取的秘密学习。与依赖手动注释的演示或直接零弹性推理的现有方法相反，我们的方法将合成数据生成与基于检索的基于基于检索的内在学习，并使用推理优化的语言模型相结合。这使我们能够在没有手动注释的情况下构建高质量的演示数据库，并在推理时间动态检索相关示例。基于我们的方法，我们生产的合成数据集超过$ 5K $ Wikipedia摘要，约为59k $ $ $ $ $ $ 59K $ $。最后，我们在Docie共享的任务上评估了在零镜头设置中从长文档中提取实体和关系的文章学习绩效。我们发现，即使对于最先进的大语言模型，文档级的关节内实体和关系提取仍然是一项具有挑战性的任务。

Title: Conditional Multi-Stage Failure Recovery for Embodied Agents

Authors: Youmna Farag, Svetlana Stoyanchev, Mohan Li, Simon Keizer, Rama Doddipatla
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06016
Pdf URL: https://arxiv.org/pdf/2507.06016
Copy Paste: [[2507.06016]] Conditional Multi-Stage Failure Recovery for Embodied Agents(https://arxiv.org/abs/2507.06016)
Keywords: llm, prompt, agent
Abstract: Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.
摘要：执行复杂任务的体现的代理容易受到执行故障的影响，激发了有效的故障恢复机制的需求。在这项工作中，我们介绍了一个有条件的多阶段故障恢复框架，该框架采用了零拍链提示。该框架构成了四个错误处理阶段，其中三个在任务执行期间运行，一个在执行后反射阶段中起作用。我们的方法利用LLM的推理能力来分析其环境环境中的执行挑战并设计战略解决方案。我们在教学数据集的TFD基准上评估了我们的方法，并实现最先进的性能，优于基线而没有错误恢复的基线，并超过了最强的现有模型19％。

Title: Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Authors: Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06056
Pdf URL: https://arxiv.org/pdf/2507.06056
Copy Paste: [[2507.06056]] Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs(https://arxiv.org/abs/2507.06056)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
摘要：众所周知，大型语言模型（LLMS）可以记住其部分培训数据，有时在适当提示时逐字复制内容。在这项工作中，我们研究了记忆领域中一个基本但探索的问题：如何表征LLMS中训练数据的记忆难度？通过关于Olmo的经验实验，Olmo是一个开放模型的家族，我们介绍了熵误解定律。这表明数据熵与记忆评分线性相关。此外，在记住高度随机字符串或“ gibberish”的案例研究中，我们观察到，尽管存在明显的随机性，但与更广泛的训练语料库相比，这种序列显然表现出意外的经验熵。采用相同的策略来发现熵迁移法，我们得出了一种简单而有效的方法来区分培训和测试数据，从而使数据集推理（DI）。
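
Illustrative sketch (an assumption, not the paper's estimator): the abstract relates "data entropy" to a memorization score without stating how entropy is estimated. One common choice is the Shannon entropy of the empirical token distribution within a sequence, shown below. The whitespace tokenization and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def empirical_entropy(tokens: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical token distribution of one sequence."""
    total = len(tokens)
    if total == 0:
        raise ValueError("empty token sequence")
    counts = Counter(tokens)
    # Sum p * log2(1 / p) over the observed tokens.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# A highly repetitive sequence versus one in which every token is distinct.
repetitive = "the the the the".split()
varied = "alpha beta gamma delta".split()

print(empirical_entropy(repetitive))
print(empirical_entropy(varied))
```

The paper may instead estimate entropy at the corpus level or over model tokenizer units, so this block only shows the basic quantity being discussed.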

Title: A Survey on Prompt Tuning

Authors: Zongqian Li, Yixuan Su, Nigel Collier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06085
Pdf URL: https://arxiv.org/pdf/2507.06085
Copy Paste: [[2507.06085]] A Survey on Prompt Tuning(https://arxiv.org/abs/2507.06085)
Keywords: language model, prompt
Abstract: This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.
摘要：这项调查综述了及时调查，这是一种通过准备可训练的连续向量的同时保持模型冻结的可训练的连续向量，一种适应语言模型的参数效率方法。我们将现有方法分为两类：直接及时学习和转移学习。直接及时的学习方法包括：一般优化方法，基于编码器的方法，分解策略和Experts框架的混合物。转移学习方法包括：一般转移方法，基于编码器的方法和分解策略。对于每种方法，我们分析方法设计，创新，见解，优势和缺点，并通过比较不同框架的说明性可视化。我们确定计算效率和训练稳定性方面的挑战，并讨论改善培训鲁棒性和扩大应用程序范围的未来方向。

Title: NeoBabel: A Multilingual Open Tower for Visual Generation

Authors: Mohammad Mahdi Derakhshani, Dheeraj Varghese, Marzieh Fadaee, Cees G. M. Snoek
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.06137
Pdf URL: https://arxiv.org/pdf/2507.06137
Copy Paste: [[2507.06137]] NeoBabel: A Multilingual Open Tower for Visual Generation(https://arxiv.org/abs/2507.06137)
Keywords: llm, prompt
Abstract: Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.
摘要：文本到图像的生成进步主要以英语为中心，为非英语说话者创造了障碍，并使数字不平等永存。在现有系统依靠翻译管道的同时，这些系统引入了语义漂移，计算开销和文化错位。我们介绍了Neobabel，这是一个新型的多语言图像生成框架，为性能，效率和包容性提供了新的帕累托前沿，支持六种语言：英语，中文，荷兰语，法语，印度语和波斯语。该模型是使用大规模多语言预处理和高分辨率指令调整的组合来训练的。为了评估其功能，我们将两个仅英文基准扩展到多语言等效物：M-Geneval和M-DPG。 Neobabel在保持强大的英语能力，在M-Geneval上得分为0.75，在M-DPG上得分为0.68。值得注意的是，它在英语任务上以领先的模型表现出色，而在多语言基准测试中的表现+0.11和+0.09的表现，即使这些模型是基于多语言基础LLM的。这证明了我们有针对性的对准训练在保存和扩展跨语言概括方面的有效性。我们进一步介绍了两个新的指标，以严格评估与代码混合提示的多语言对齐和稳健性。值得注意的是，Neobabel匹配或超过仅英语模型，而较小的2-4倍。我们发布一个开放工具包，包括所有代码，模型检查点，一个由12400万个多语言文本图像对的策划数据集以及标准化的多语言评估协议，以推进包容性的AI研究。我们的工作表明，多语言能力不是权衡的，而是提高生成AI中鲁棒性，效率和文化忠诚度的催化剂。

Title: Coding Triangle: How Does Large Language Model Understand Code?

Authors: Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06138
Pdf URL: https://arxiv.org/pdf/2507.06138
Copy Paste: [[2507.06138]] Coding Triangle: How Does Large Language Model Understand Code?(https://arxiv.org/abs/2507.06138)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.
摘要：大型语言模型（LLMS）在代码生成方面取得了显着进展，但其真正的编程能力仍然没有得到充实。我们介绍了代码三角框架，该框架系统地评估了三个基本维度的LLM：编辑分析，代码实现和测试案例生成。通过对竞争性编程基准测试的广泛实验，我们透露，尽管LLM可以在这些维度上形成一个自一致的系统，但它们的解决方案通常缺乏人类程序员的多样性和鲁棒性。我们确定了模型认知和人类专业知识之间的重大分布转移，由于训练数据偏见和有限的推理转移，模型错误趋向于聚类。我们的研究表明，将人类生成的社论，解决方案和不同的测试用例以及利用模型混合物纳入可以大大提高LLM的性能和鲁棒性。此外，我们揭示了可能有助于自我反思和自我完善的LLM认知认知的一致性和不一致，从而为开发更强大的编码模型提供了潜在的方向。

Title: Skywork-R1V3 Technical Report

Authors: Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Yahui Zhou
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2507.06167
Pdf URL: https://arxiv.org/pdf/2507.06167
Copy Paste: [[2507.06167]] Skywork-R1V3 Technical Report(https://arxiv.org/abs/2507.06167)
Keywords: language model, llm
Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
摘要：我们介绍了Skywork-R1V3，这是一种先进的开源视觉语言模型（VLM），它是一种新的视觉推理方法。它的关键创新在于有效地将推理技能从仅文本大型语言模型（LLM）转移到视觉任务。 SkyWork-R1V3的强劲性能主要源于我们精心培训的训练后RL框架，该框架有效地激活和增强了模型的推理能力，而无需额外的继续前训练。通过这个框架，我们进一步揭示了连接器模块在实现多模式推理模型的稳健跨模式比对中的基本作用。此外，我们介绍了推理能力的独特指标，即关键推理令牌的熵，这在RL培训期间已证明对检查点的选择非常有效。 Skywork-R1V3在MMMU上实现了最新的结果，从64.3％的增长到76.0％。该性能与入门级人类能力相匹配。值得注意的是，我们的RL驱动后训练方法甚至可以使38B参数模型媲美顶部封闭源VLM。该实施成功将数学推理转移到其他与主题相关的推理任务。我们还包括对课程学习和加强框架策略的分析，以及关于多模式推理的更广泛讨论。 SkyWork-R1V3代表了多模式推理的重大飞跃，将RL作为推动开源VLM功能的强大引擎。

Title: CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Authors: Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06181
Pdf URL: https://arxiv.org/pdf/2507.06181
Copy Paste: [[2507.06181]] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization(https://arxiv.org/abs/2507.06181)
Keywords: gpt
Abstract: Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase: the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.
摘要：将自然语言数学语句转换为正式的可执行代码是自动定理证明的基本挑战。虽然先前的工作集中在产生和汇编成功上，但对评论家阶段的关注很少 - 评估生成的形式化是否真正捕捉了原始问题的语义意图。在本文中，我们介绍了评论家，这是一个新颖的评论家指导的加强学习框架，将评论家的作用从被动验证者提升到主动学习组成部分。具体来说，首先，我们提出了通过监督的微调和强化学习培训的评论家，以严格评估LEAN 4形式化的语义忠诚。然后，我们介绍了评论家，这是一种旨在衡量模型将语义上正确与错误形式化的能力的基准，并证明我们训练有素的评论家模型可以显着胜过强大的开放式和封闭式基线。在Criticlean框架的基础上，我们构建了FineleanCorpus，这是一个数据集，其中包括超过285K的问题，这些问题表现出了丰富的领域多样性，广泛的困难覆盖范围和基于人类评估的高正确性。总体而言，我们的发现强调，优化评论家阶段对于产生可靠的形式化至关重要，我们希望我们的批评者将为未来的正式数学推理提供宝贵的见解。

Title: DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

Authors: Maximilian Heil, Dionne Bang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06189
Pdf URL: https://arxiv.org/pdf/2507.06189
Copy Paste: [[2507.06189]] DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation(https://arxiv.org/abs/2507.06189)
Keywords: gpt
Abstract: This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders and transfer-learning of fine-tuned transformers on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specified encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us 16th of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at this https URL.
摘要：本文介绍了我们提交的任务1（主观性检测）检查！在2025年CLEF的实验室。我们研究了转移学习和风格数据增强的有效性，以改善英语新闻文本中主观和客观句子的分类。我们的方法将预训练的编码器的微调与相关任务的微调变压器进行了微调。我们还使用GPT-4O引入了受控的增强管道，以生成预定义的主观性样式的释义。为了确保标签和样式一致性，我们采用相同的模型来纠正和完善生成的样品。结果表明，指定编码器的转移学习优于微调通用的传递，并且精心策划的增强显着增强了模型的鲁棒性，尤其是在检测主观内容时。我们的正式提交给24名参与者的$ 16^{th} $。总体而言，我们的发现强调了将编码器专业化与标签一致的增强相结合以改善主观性检测的价值。我们的代码可在此HTTPS URL上找到。

Title: UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Authors: Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06196
Pdf URL: https://arxiv.org/pdf/2507.06196
Copy Paste: [[2507.06196]] UQLM: A Python Package for Uncertainty Quantification in Large Language Models(https://arxiv.org/abs/2507.06196)
Keywords: language model, llm, hallucination
Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
摘要：幻觉定义为大型语言模型（LLMS）产生虚假或误导性内容的实例，构成了一项重大挑战，会影响下游应用程序的安全性和信任。我们介绍了UQLM，这是一种使用最新的不确定性定量（UQ）技术的LLM幻觉检测的Python软件包。该工具包提供了一套基于UQ的得分手，这些得分人计算响应级别的置信度范围从0到1。该库为基于UQ的幻觉检测提供了一个现成的解决方案，可以轻松地集成以增强LLM输出的可靠性。

Title: A Survey on Latent Reasoning

Authors: Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06203
Pdf URL: https://arxiv.org/pdf/2507.06203
Copy Paste: [[2507.06203]] A Survey on Latent Reasoning(https://arxiv.org/abs/2507.06203)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: this https URL.
摘要：大型语言模型（LLMS）表现出了令人印象深刻的推理能力，尤其是在以明确的思维链（COT）为指导下，推理了口头上间步骤的推理。尽管COT提高了可解释性和准确性，但其对自然语言推理的依赖性限制了模型的表现性带宽。潜在推理通过完全在模型的隐藏状态下执行多步推断来解决这种瓶颈，从而消除了令牌级的监督。为了推进潜在的推理研究，这项调查提供了对潜在推理的新兴领域的全面概述。首先，我们研究神经网络层作为推理的计算底物的基础作用，突出了分层表示如何支持复杂的转换。接下来，我们探讨了多种潜在推理方法，包括基于激活的复发，隐藏状态传播以及压缩或内部化显式推理痕迹的微调策略。最后，我们讨论了高级范式，例如通过掩盖扩散模型的无限深度潜在推理，从而实现全球一致和可逆的推理过程。通过统一这些观点，我们旨在阐明潜在推理的概念格局，并在LLM认知前沿绘制未来的研究方向。相关的GitHub存储库收集了最新的论文和存储库，网址为：此HTTPS URL。

Title: DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media

Authors: Ayush Parikh, Hoang Thanh Thanh Truong, Jeanette Schofield, Maximilian Heil
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06205
Pdf URL: https://arxiv.org/pdf/2507.06205
Copy Paste: [[2507.06205]] DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media(https://arxiv.org/abs/2507.06205)
Keywords: llm, prompt
Abstract: In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on Github at this https URL.
摘要：在本文中，我们作为Clef 2025 Checkthat的DS@GT团队！任务4A科学网络话语检测，介绍了我们为此任务探索的方法。对于这项多类分类任务，我们确定了一条推文是否包含科学主张，对科学研究或出版物的参考以及/或提及科学实体，例如大学或科学家。我们为此任务提供了3种建模方法：变形金刚登录，LLM的很少射击提示以及一个组合的集合模型，其设计已通过较早的实验告知。我们的团队在比赛中排名第七，获得了0.8611的宏观平均得分，比Debertav3基线的提高为0.8375。我们的代码可在此HTTPS URL上的GitHub上找到。
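
Illustrative sketch (assumptions labeled): the abstract reports a macro-averaged F1 over three properties of a tweet (scientific claim, reference to a study or publication, mention of a scientific entity). Assuming each property is scored as its own binary label and the per-label F1 values are averaged with equal weight, the computation might look like this. The toy labels are invented for illustration and are not from the task data.

```python
from typing import List, Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows: List[List[int]], pred_rows: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 scores; each row holds one value per label."""
    n_labels = len(true_rows[0])
    scores = [
        binary_f1([row[j] for row in true_rows], [row[j] for row in pred_rows])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels


# Invented toy data: columns are (claim, reference, entity) for four tweets.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

macro = macro_f1(y_true, y_pred)
print(f"macro-averaged F1: {macro:.4f}")
```

Macro averaging gives each label equal weight regardless of how often it occurs, which is why it is often preferred when label frequencies are imbalanced.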

Title: Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Authors: Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao, Yi Fang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06223
Pdf URL: https://arxiv.org/pdf/2507.06223
Copy Paste: [[2507.06223]] Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers(https://arxiv.org/abs/2507.06223)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (e.g., parallel or not, batch size, etc.) and often fail to account for model size, making them difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E²R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Alongside the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
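Abstract (translated from Chinese): Large language models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers with proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and run-time choices (e.g., parallel or not, batch size, etc.) and often fail to account for model size, which makes them hard to interpret and obscures the evaluation of the efficiency-effectiveness trade-off. To address this, we propose E²R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute, and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Alongside the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments evaluating a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.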
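As a rough illustration of the two metrics named in the abstract above, the sketch below computes RPP and QPP from a per-query FLOPs figure. It is one plausible reading of the abstract, not the paper's definitions: the function names are hypothetical, and the FLOPs estimate uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass, which stands in for (and is not) the paper's estimator.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def forward_flops(num_params: float, num_tokens: int) -> float:
    """Approximate forward-pass cost (assumption: ~2 FLOPs per parameter per token)."""
    return 2.0 * num_params * num_tokens


def rpp(ranking_metric: float, flops_per_query: float) -> float:
    """Ranking metric (e.g. nDCG@10) divided by the PetaFLOPs spent per query."""
    return ranking_metric / (flops_per_query / PETA)


def qpp(flops_per_query: float) -> float:
    """Number of queries that fit into one PetaFLOP of compute."""
    return PETA / flops_per_query


if __name__ == "__main__":
    # Hypothetical setup: a 7B-parameter pointwise reranker scoring
    # 100 candidates of 512 tokens each for a single query.
    cost = forward_flops(7e9, 100 * 512)
    print(f"FLOPs per query: {cost:.3e}")
    print(f"QPP: {qpp(cost):.3f}")
    print(f"RPP at nDCG@10 = 0.70: {rpp(0.70, cost):.3f}")
```

Because both quantities are normalized by compute rather than wall-clock time, the same numbers come out regardless of batch size or hardware, which is the property the abstract emphasizes.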

Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Authors: Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06229
Pdf URL: https://arxiv.org/pdf/2507.06229
Copy Paste: [[2507.06229]] Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving(https://arxiv.org/abs/2507.06229)
Keywords: gpt, agent
Abstract: As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.
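Abstract (translated from Chinese): As language agents tackle increasingly complex tasks, they struggle with effective error correction and with reusing experience across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving through a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure that lets agents learn from past experiences and generalize successful strategies to new tasks.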