2026-01-29

Title: From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

Title: Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

Title: Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

Title: PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

Title: PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Title: Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

Title: Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Title: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

Title: Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

Title: OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

Title: Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

Title: The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Title: Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

Title: Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

Title: SDUs DAISY: A Benchmark for Danish Culture

Title: CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity

Title: "Newspaper Eat" Means "Not Tasty": A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

Title: Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

Title: Quantifying non deterministic drift in large language models

Title: Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Title: On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Title: LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

Title: Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

Title: VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

Title: Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

Title: FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language

Title: Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models

Title: Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Title: Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

Title: Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

Title: Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Title: SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

Title: RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

Title: Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

Title: SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Title: Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

Title: CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Title: PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

Title: MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Title: Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

Title: Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Title: SpeechMapper: Speech-to-text Embedding Projector for LLMs

Title: PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Title: BMAM: Brain-inspired Multi-Agent Memory Framework

Title: Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

Title: Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models

Title: A Computational Approach to Language Contact -- A Case Study of Persian

Title: AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Title: P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

Title: A Dialectic Pipeline for Improving LLM Robustness

Title: Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

Title: Efficient Multimodal Planning Agent for Visual Question-Answering

Title: ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code

Title: AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Title: QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

Title: Like a Therapist, But Not: Reddit Narratives of AI in Mental Health Contexts

Title: Persona Prompting as a Lens on LLM Social Reasoning

Title: SERA: Soft-Verified Efficient Repository Agents

Title: Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

Title: Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

Title: Linear representations in language models can change dramatically over a conversation

Title: When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation