2025-11-10

Title: Evaluating LLMs' Reasoning Over Ordered Procedural Steps

Title: Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Title: Reasoning Up the Instruction Ladder for Controllable Language Models

Title: EncouRAGe: Evaluating RAG Local, Fast, and Reliable

Title: multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder

Title: Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

Title: Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Title: POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Title: GEMMA-SQL: A Novel Text-to-SQL Model Based on Large Language Models

Title: First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Title: Learning to reason about rare diseases through retrieval-augmented agents

Title: Surprisal reveals diversity gaps in image captioning and different scorers change the story

Title: Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

Title: Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Title: Minimal and Mechanistic Conditions for Behavioral Self-Awareness in LLMs

Title: SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Title: BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models

Title: AgentExpt: Automating AI Experiment Design with LLM-based Resource Retrieval Agent

Title: LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Title: Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

Title: Acquiring Common Chinese Emotional Events Using Large Language Model

Title: Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies

Title: UA-Code-Bench: A Competitive Programming Benchmark for Evaluating LLM Code Generation in Ukrainian

Title: Order-Level Attention Similarity Across Language Models: A Latent Commonality

Title: On Text Simplification Metrics and General-Purpose LLMs for Accessible Health Information, and A Potential Architectural Advantage of The Instruction-Tuned LLM class

Title: Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Title: A Toolbox for Improving Evolutionary Prompt Search

Title: Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

Title: Effectiveness of Chain-of-Thought in Distilling Reasoning Capability from Large Language Models

Title: Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese

Title: Reflective Personalization Optimization: A Post-hoc Rewriting Framework for Black-Box Large Language Models

Title: Listening Between the Lines: Decoding Podcast Narratives with Language Modeling

Title: What Are the Facts? Automated Extraction of Court-Established Facts from Criminal-Court Opinions

Title: Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Title: Large Language Models for Explainable Threat Intelligence

Title: Steering Language Models with Weight Arithmetic