2025-07-28

Title: Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Title: Evaluating Code-Mixing in LLMs Across 18 Languages

Title: PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

Title: MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Title: REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

Title: SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Title: Large language models provide unsafe answers to patient-posed medical questions

Title: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Title: Mining Contextualized Visual Associations from Images for Creativity Understanding

Title: Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Title: A Similarity Measure for Comparing Conversational Dynamics

Title: A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Title: Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Title: An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Title: Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Title: How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Title: Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Title: Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump's Presidential Campaigns

Title: AutoPCR: Automated Phenotype Concept Recognition by Prompting

Title: Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Title: SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

Title: Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Title: TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Title: Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models