Computation and Language 88
☆ Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Parallel thinking has emerged as a novel approach for enhancing the reasoning
capabilities of large language models (LLMs) by exploring multiple reasoning
paths concurrently. However, activating such capabilities through training
remains challenging, as existing methods predominantly rely on supervised
fine-tuning (SFT) over synthetic data, which encourages teacher-forced
imitation rather than exploration and generalization. In contrast, we
propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework
that enables parallel thinking behaviors for complex real-world reasoning
tasks. Our framework employs a progressive curriculum that explicitly addresses
the cold-start problem in training parallel thinking with RL. We first use SFT
on prompt-generated trajectories from easier tasks to instill the parallel
thinking ability, then transition to RL to explore and generalize this skill on
harder problems. Experiments on various math benchmarks, including MATH, AMC23,
and AIME, show that Parallel-R1 successfully instills parallel thinking,
leading to an 8.4% accuracy improvement over the sequential-thinking model
trained directly on challenging tasks with RL. Further analysis reveals a clear
shift in the model's thinking behavior: at an early stage, it uses parallel
thinking as an exploration strategy, while in a later stage, it uses the same
capability for multi-perspective verification. Most significantly, we validate
parallel thinking as a \textbf{mid-training exploration scaffold}, where this
temporary exploratory phase unlocks a higher performance ceiling after RL,
yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and
code will be open-sourced at https://github.com/zhengkid/Parallel-R1.
comment: Project website: https://zhengkid.github.io/Parallel_R1.github.io/
☆ Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Recent advances in large multimodal models have leveraged image-based tools
with reinforcement learning to tackle visual problems. However, existing
open-source approaches often exhibit monotonous reasoning patterns and allow
only a limited number of interaction turns, making them inadequate for
difficult tasks that require trial-and-error exploration. In this work, we
address this limitation by scaling up tool-based interactions and introducing
Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of
steps -- and achieves state-of-the-art performance on challenging visual search
tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key
components. First, we construct the Visual Probe Dataset, a collection of
thousands of challenging visual search problems designed for exploratory
reasoning. Second, we develop an iterative data collection pipeline to obtain
cold-start trajectories that exhibit diverse reasoning patterns, including
depth-first search, trial-and-error, and goal maintenance. Third, we propose an
over-turn masking strategy that prevents penalization of over-turn responses
(those that hit the maximum number of turns) during reinforcement learning,
thereby balancing training-time efficiency with test-time scalability. Despite
training with an upper bound of only six interaction turns, our model generates
trajectories that naturally scale to tens of turns at inference time, with
accuracy improving as the number of turns increases. Extensive experiments
demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking
paths, effectively solving challenging visual search problems.
comment: Code, datasets, models are available at
https://github.com/Mini-o3/Mini-o3. Project Page: https://mini-o3.github.io/
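
To make the over-turn masking idea above concrete, here is a minimal sketch, not the authors' implementation: rollouts that hit the turn cap are excluded from the policy loss rather than penalized. The REINFORCE-style surrogate and tensor shapes are illustrative assumptions.

import torch

def masked_policy_loss(logprobs, advantages, hit_turn_cap):
    # logprobs:     (B,) summed log-probs of each sampled trajectory
    # advantages:   (B,) advantage estimates (e.g., reward minus group mean)
    # hit_turn_cap: (B,) bool, True if the rollout ended by exhausting turns
    keep = (~hit_turn_cap).float()               # drop over-turn rollouts from the loss
    per_traj = -(advantages * logprobs) * keep   # REINFORCE-style surrogate
    return per_traj.sum() / keep.sum().clamp(min=1.0)

# Two completed rollouts contribute to the loss; the capped one is ignored.
loss = masked_policy_loss(torch.tensor([-3.1, -2.7, -4.0]),
                          torch.tensor([0.8, -0.5, -1.0]),
                          torch.tensor([False, False, True]))
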
☆ SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large
Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It
addresses critical limitations in OpenAI's benchmark, including noisy and
incorrect labels, topical biases, and question redundancy. SimpleQA Verified
was created through a rigorous multi-stage filtering process involving
de-duplication, topic balancing, and source reconciliation to produce a more
reliable and challenging evaluation set, alongside improvements in the
autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a
state-of-the-art F1-score of 55.6, outperforming other frontier models,
including GPT-5. This work provides the research community with a
higher-fidelity tool to track genuine progress in parametric model factuality
and to mitigate hallucinations. The benchmark dataset, evaluation code, and
leaderboard are available at:
https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
☆ Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Visual reasoning over structured data such as tables is a critical capability
for modern vision-language models (VLMs), yet current benchmarks remain limited
in scale, diversity, or reasoning depth, especially when it comes to rendered
table images. Addressing this gap, we introduce Visual-TableQA, a large-scale,
open-domain multimodal dataset specifically designed to evaluate and enhance
visual reasoning over complex tabular data. Our generation pipeline is modular,
scalable, and fully autonomous, involving multiple reasoning LLMs collaborating
across distinct roles: generation, validation, and inspiration. Visual-TableQA
comprises 2.5k richly structured LaTeX-rendered tables and 6k
reasoning-intensive QA pairs, all produced at a cost of under USD 100. To
promote diversity and creativity, our pipeline performs multi-model
collaborative data generation via cross-model prompting ('inspiration') and
LLM-jury filtering. Stronger models seed layouts and topics that weaker models
elaborate, collectively distilling diverse reasoning patterns and visual
structures into the dataset. Empirical results show that models fine-tuned on
Visual-TableQA generalize robustly to external benchmarks, outperforming
several proprietary models despite the dataset's synthetic nature. The full
pipeline and resources are publicly available at
https://github.com/AI-4-Everyone/Visual-TableQA.
comment: Work in Progress
☆ GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models EMNLP 2025
Uncertainty estimation is essential for enhancing the reliability of Large
Language Models (LLMs), particularly in high-stakes applications. Existing
methods often overlook semantic dependencies, relying on token-level
probability measures that fail to capture structural relationships within the
generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty
Estimation for Large Language Models, a structure-aware framework that
leverages dependency parse trees and hierarchical graph pooling to refine
uncertainty quantification. By incorporating supervised learning, GENUINE
effectively models semantic and structural relationships, improving confidence
assessments. Extensive experiments across NLP tasks show that GENUINE achieves
up to 29% higher AUROC than semantic entropy-based approaches and reduces
calibration errors by over 15%, demonstrating the effectiveness of graph-based
uncertainty modeling. The code is available at
https://github.com/ODYSSEYWT/GUQ.
comment: Accepted by EMNLP 2025
☆ Uncovering Scaling Laws for Large Language Models via Inverse Problems EMNLP
Arun Verma, Zhaoxuan Wu, Zijian Zhou, Xiaoqiang Lin, Zhiliang Chen, Rachael Hwee Ling Sim, Rui Qiao, Jingtan Wang, Nhung Bui, Xinyuan Niu, Wenyang Hu, Gregory Kang Ruey Lau, Zi-Yu Khoo, Zitong Zhao, Xinyi Xu, Apivich Hemachandra, See-Kiong Ng, Bryan Kian Hsiang Low
Large Language Models (LLMs) are large-scale pretrained models that have
achieved remarkable success across diverse domains. These successes have been
driven by unprecedented complexity and scale in both data and computations.
However, due to the high costs of training such models, brute-force
trial-and-error approaches to improve LLMs are not feasible. Inspired by the
success of inverse problems in uncovering fundamental scientific laws, this
position paper advocates that inverse problems can also efficiently uncover
scaling laws that guide the building of LLMs to achieve the desired
performance with significantly better cost-effectiveness.
comment: Accepted at EMNLP Findings 2025
☆ Biased Tales: Cultural and Topic Bias in Generating Children's Stories
Stories play a pivotal role in human communication, shaping beliefs and
morals, particularly in children. As parents increasingly rely on large
language models (LLMs) to craft bedtime stories, the presence of cultural and
gender stereotypes in these narratives raises significant concerns. To address
this issue, we present Biased Tales, a comprehensive dataset designed to
analyze how biases influence protagonists' attributes and story elements in
LLM-generated stories. Our analysis uncovers striking disparities. When the
protagonist is described as a girl (as compared to a boy), appearance-related
attributes increase by 55.26%. Stories featuring non-Western children
disproportionately emphasize cultural heritage, tradition, and family themes
far more than those for Western children. Our findings highlight the need to
address sociocultural bias in order to make creative AI use more equitable and diverse.
☆ From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing NLPCC 2025
This paper presents our team's solution to Shared Task 7 of NLPCC-2025, which
focuses on sentence-level gender bias detection and mitigation in Chinese. The
task aims to promote fairness and controllability in natural language
generation by automatically detecting, classifying, and mitigating gender bias.
To address this challenge, we adopt a fine-tuning approach based on large
language models (LLMs), efficiently adapting them to the bias detection task via
Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more
balanced training set to alleviate class imbalance and introduce heterogeneous
samples from multiple sources to enhance model generalization. For the
detection and classification sub-tasks, we employ a majority voting strategy
that integrates outputs from multiple expert models to boost performance.
Additionally, to improve bias generation detection and mitigation, we design a
multi-temperature sampling mechanism to capture potential variations in bias
expression styles. Experimental results demonstrate the effectiveness of our
approach in bias detection, classification, and mitigation. Our method
ultimately achieves an average score of 47.90%, ranking fourth in the shared
task.
comment: NLPCC 2025
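
As a minimal sketch of the majority-voting step described above (not the team's code), the label predicted most often across the expert models becomes the final prediction; the tie-breaking rule here is an assumption.

from collections import Counter

def majority_vote(labels):
    # labels: predicted labels for one sentence from several expert models
    counts = Counter(labels)
    return counts.most_common(1)[0][0]   # ties resolved by insertion order

print(majority_vote(["biased", "biased", "non-biased"]))   # -> "biased"
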
☆ Are Humans as Brittle as Large Language Models?
The output of large language models (LLMs) is unstable, due both to the
non-determinism of the decoding process and to prompt brittleness. While
the intrinsic non-determinism of LLM generation may mimic existing uncertainty
in human annotations through distributional shifts in outputs, it is largely
assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs.
This raises the question: do human annotators show similar sensitivity to
instruction changes? If so, should prompt brittleness in LLMs be considered
problematic? One may alternatively hypothesize that prompt brittleness
correctly reflects human annotation variances. To fill this research gap, we
systematically compare the effects of prompt modifications on LLMs and
identical instruction modifications for human annotators, focusing on the
question of whether humans are similarly sensitive to prompt perturbations. To
study this, we prompt both humans and LLMs for a set of text classification
tasks conditioned on prompt variations. Our findings indicate that both humans
and LLMs exhibit increased brittleness in response to specific types of prompt
modifications, particularly those involving the substitution of alternative
label sets or label formats. However, the distribution of human judgments is
less affected by typographical errors and reversed label order than that of
LLMs.
☆ Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
Literary translation has recently gained attention as a distinct and complex
task in machine translation research. However, translation with small open
models remains an open problem. We contribute to this ongoing research by
introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for
dataset creation, fine tuning, and evaluation in English-Romanian literary
translations, centred on the creation and open release of both a compact, fine
tuned language model (TF2-12B) and large scale synthetic parallel datasets
(DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the
largest collection of synthetic English fables to date, we address the need for
rich, high quality literary datasets in low resource languages such as
Romanian. Our pipeline first generates 15k high quality Romanian references
from the TF1 pool using a high performing LLM. We then apply a two stage fine
tuning process to a 12B parameter open weight model: (i) instruction tuning to
capture genre specific narrative style, and (ii) adapter compression for
efficient deployment. Evaluation combines corpus level BLEU and a five
dimension LLM based rubric (accuracy, fluency, coherence, style, cultural
adaptation) to provide a nuanced assessment of translation quality. Results
show that our fine tuned model achieves fluency and adequacy competitive with
top performing large proprietary models, while being open, accessible, and
significantly more cost effective. Alongside the fine tuned model and both
datasets, we publicly release all scripts and evaluation prompts. TF2 thus
provides an end-to-end, reproducible pipeline for research on cost efficient
translation, cross lingual narrative generation, and the broad adoption of open
models for culturally significant literary content in low resource settings.
comment: 25 pages, 8 figures, includes datasets and models released on Hugging
Face
☆ Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems
Textual response generation is pivotal for multimodal task-oriented
dialog systems, where the goal is to generate proper textual responses based on
the multimodal context. While existing efforts have demonstrated remarkable
progress, two limitations remain: 1) \textit{neglect of
unstructured review knowledge} and 2) \textit{underutilization of large
language models (LLMs)}. Inspired by this, we aim to fully utilize dual
knowledge (\textit{i.e., } structured attribute and unstructured review
knowledge) with LLMs to promote textual response generation in multimodal
task-oriented dialog systems. However, this task is non-trivial due to two key
challenges: 1) \textit{dynamic knowledge type selection} and 2)
\textit{intention-response decoupling}. To address these challenges, we propose
a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for
multimodal dialog systems (named DK2R). To be specific, DK2R first extracts
both structured attribute and unstructured review knowledge from an external
knowledge base given the dialog context. Thereafter, DK2R uses an LLM to
evaluate each knowledge type's utility by analyzing LLM-generated provisional
probe responses. Moreover, DK2R separately summarizes the intention-oriented
key clues via dedicated reasoning, which are further used as auxiliary signals
to enhance LLM-based textual response generation. Extensive experiments
conducted on a public dataset verify the superiority of DK2R. We have released
the codes and parameters.
☆ SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP EMNLP 2025
Structured information extraction from scientific literature is crucial for
capturing core concepts and emerging trends in specialized fields. While
existing datasets aid model development, most focus on specific publication
sections due to domain complexity and the high cost of annotating scientific
texts. To address this limitation, we introduce SciNLP - a specialized
benchmark for full-text entity and relation extraction in the Natural Language
Processing (NLP) domain. The dataset comprises 60 manually annotated full-text
NLP publications, covering 7,072 entities and 1,826 relations. Compared to
existing research, SciNLP is the first dataset providing full-text annotations
of entities and their relationships in the NLP domain. To validate the
effectiveness of SciNLP, we conducted comparative experiments with similar
datasets and evaluated the performance of state-of-the-art supervised models on
this dataset. Results reveal varying extraction capabilities of existing models
across academic texts of different lengths. Cross-comparisons with existing
datasets show that SciNLP achieves significant performance improvements on
certain baseline models. Using models trained on SciNLP, we implemented
automatic construction of a fine-grained knowledge graph for the NLP domain.
Our KG has an average node degree of 3.2 per entity, indicating rich semantic
topological information that enhances downstream applications. The dataset is
publicly available at https://github.com/AKADDC/SciNLP.
comment: EMNLP 2025 Main
☆ Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning
Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo
The spread of fake news, polarizing, politically biased, and harmful content
on online platforms has been a serious concern. Although large language models
have become a promising approach, no study has properly benchmarked their
performance across different models, usage methods, and languages. This study
presents a comprehensive overview of different Large Language Models adaptation
paradigms for the detection of hyperpartisan and fake news, harmful tweets, and
political bias. Our experiments spanned 10 datasets and 5 different languages
(English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and
multiclass classification scenarios. We tested different strategies ranging
from parameter efficient Fine-Tuning of language models to a variety of
different In-Context Learning strategies and prompts. These included zero-shot
prompts, codebooks, few-shot (with both randomly-selected and
diversely-selected examples using Determinantal Point Processes), and
Chain-of-Thought. We discovered that In-Context Learning often underperforms
when compared to Fine-Tuning a model. This main finding highlights the
importance of Fine-Tuning even smaller models on task-specific settings even
when compared to the largest models evaluated in an In-Context Learning setup -
in our case LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and
Qwen2.5-7B-Instruct.
☆ Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts EMNLP 2025
As large language models (LLMs) are adapted to sensitive domains such as
medicine, their fluency raises safety risks, particularly regarding provenance
and accountability. Watermarking embeds detectable patterns to mitigate these
risks, yet its reliability in medical contexts remains untested. Existing
benchmarks focus on detection-quality tradeoffs, overlooking factual risks
under low-entropy settings often exploited by watermarking's reweighting
strategy. We propose a medical-focused evaluation workflow that jointly
assesses factual accuracy and coherence. Using GPT-Judger and further human
validation, we introduce the Factuality-Weighted Score (FWS), a composite
metric prioritizing factual accuracy beyond coherence to guide watermarking
deployment in medical domains. Our evaluation shows current watermarking
methods substantially compromise medical factuality, with entropy shifts
degrading medical entity representation. These findings underscore the need for
domain-aware watermarking approaches that preserve the integrity of medical
content.
comment: Accepted at EMNLP 2025 Findings
☆ M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models EMNLP2025
For Relation Extraction (RE), the manual annotation of training data may be
prohibitively expensive, since the sentences that contain the target relations
in texts can be very scarce and difficult to find. It is therefore beneficial
to develop an efficient method that can automatically extract training
instances from unlabeled texts for training RE models. Recently, large language
models (LLMs) have been adopted in various natural language processing tasks,
with RE also benefiting from their advances. However, when leveraging LLMs for
RE with predefined relation categories, two key challenges arise. First, in a
multi-class classification setting, LLMs often struggle to comprehensively
capture the semantics of every relation, leading to suboptimal results. Second,
although employing binary classification for each relation individually can
mitigate this issue, it introduces significant computational overhead,
resulting in impractical time complexity for real-world applications.
Therefore, this paper proposes a framework called M-BRe to extract training
instances from unlabeled texts for RE. It utilizes three modules to combine the
advantages of both of the above classification approaches: Relation Grouping,
Relation Extraction, and Label Decision. Extensive experiments confirm its
superior capability in discovering high-quality training samples from unlabeled
texts for RE.
comment: Accepted by EMNLP2025 Main Conference
☆ MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs
Efficient communication between patients and clinicians plays an important
role in shared decision-making. However, clinical reports are often lengthy and
filled with clinical jargon, making it difficult for domain experts to identify
important aspects in the document efficiently. This paper presents the
methodology we applied in the MultiClinSUM shared task for summarising clinical
case documents. We used an Iterative Self-Prompting technique on large language
models (LLMs) by asking LLMs to generate task-specific prompts and refine them
via example-based few-shot learning. Furthermore, we used lexical and embedding
space metrics, ROUGE and BERT-score, to guide model fine-tuning across
epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved
ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P,
R, F1) from the official evaluation on 3,396 clinical case reports from various
specialties extracted from open journals. The high BERTscore indicates that the
model produced semantically equivalent output summaries compared to the
references, even though the overlap at the exact lexicon level is lower, as
reflected in the lower ROUGE scores. This work sheds some light on how
perspective-aware ISP (PA-ISP) can be deployed for clinical report
summarisation and support better communication between patients and clinicians.
comment: system paper at CLEF 2025
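
A minimal sketch of an Iterative Self-Prompting loop in the spirit of the description above (not the authors' system): the model drafts a task-specific prompt, refines it against a few worked examples, and then applies it to a new report. The llm callable is a hypothetical stand-in for any chat-completion function.

def iterative_self_prompting(llm, examples, report, rounds=2):
    # Draft an initial task-specific prompt.
    prompt = llm("Write a prompt for summarising clinical case reports "
                 "from the patient's and the clinician's perspectives.")
    # Refine the prompt with example-based few-shot feedback.
    for _ in range(rounds):
        prompt = llm(f"Prompt:\n{prompt}\n\nExamples:\n{examples}\n\n"
                     "Rewrite the prompt so it would produce summaries like these.")
    # Apply the refined prompt to a new clinical case report.
    return llm(f"{prompt}\n\nReport:\n{report}\n\nSummary:")
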
☆ BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment SIGIR
In recent years, there has been substantial progress in using pretrained
Language Models (LMs) on a range of tasks aimed at improving the understanding
of biomedical texts. Nonetheless, existing biomedical LMs show limited
comprehension of complex, domain-specific concept structures and the factual
information encoded in biomedical Knowledge Graphs (KGs). In this work, we
propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel
joint LM and KG pre-training method that augments an LM with external knowledge
by the simultaneous learning of a dedicated KG encoder and aligning the
representations of both the LM and the graph. For a given textual sequence, we
link biomedical concept mentions to the Unified Medical Language System (UMLS)
KG and utilize local KG subgraphs as cross-modal positive samples for these
mentions. Our empirical findings indicate that implementing our method on
several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves
their performance on a range of language understanding tasks and the quality of
entity representations, even with minimal pre-training on a small alignment
dataset sourced from PubMed scientific abstracts.
comment: 9 pages, 1 figure, published in "The 48th International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2025)"
☆ Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition EMNLP
In a rapidly evolving world where information updates swiftly, knowledge in
large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a
cost-effective option, making knowledge editing (KE) without modifying
parameters particularly necessary. We find that although existing
retrieval-augmented generation (RAG)-based KE methods excel at editing simple
knowledge, they struggle with KE in multi-hop question answering due to the
issue of "edit skipping", which refers to skipping the relevant edited fact in
inference. In addition to the diversity of natural language expressions of
knowledge, edit skipping also arises from the mismatch between the granularity
of LLMs in problem-solving and the facts in the edited memory. To address this
issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing
method with guided decomposition (IRAKE) through the guidance from single
edited facts and entire edited cases. Experimental results demonstrate that
IRAKE mitigates the failure of editing caused by edit skipping and outperforms
state-of-the-art methods for KE in multi-hop question answering.
comment: Accepted in EMNLP Findings 2025
☆ VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
With the rapid progress of multimodal large language models, operating system
(OS) agents become increasingly capable of automating tasks through on-device
graphical user interfaces (GUIs). However, most existing OS agents are designed
for idealized settings, whereas real-world environments often present
untrustworthy conditions. To mitigate risks of over-execution in such
scenarios, we propose a query-driven human-agent-GUI interaction framework that
enables OS agents to decide when to query humans for more reliable task
completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy
OS agent trained with a two-stage learning paradigm that facilitates the
decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent
autonomously executes actions in normal conditions while proactively querying
humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves
the average step-wise success rate by 20.64\% in untrustworthy scenarios over
the state-of-the-art, without compromising normal performance. Analysis
highlights VeriOS-Agent's rationality, generalizability, and scalability. The
codes, datasets and models are available at
https://github.com/Wuzheng02/VeriOS.
☆ Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data
Large language models (LLMs) have transformed NLP, yet their integration with
audio remains underexplored -- despite audio's centrality to human
communication. We introduce Falcon3-Audio, a family of Audio-Language Models
(ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably
small amount of public audio data -- less than 30K hours (5K unique) --
Falcon3-Audio-7B matches the best reported performance among open-weight models
on the MMAU benchmark, with a score of 64.14, on par with R1-AQA, while
distinguishing itself through superior data and parameter efficiency,
single-stage training, and transparency. Notably, our smallest 1B model remains
competitive with larger open models ranging from 2B to 13B parameters. Through
extensive ablations, we find that common complexities -- such as curriculum
learning, multiple audio encoders, and intricate cross-attention connectors --
are not required for strong performance, even compared to models trained on
over 500K hours of data.
comment: Accepted at ASRU 2025
☆ ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval
Many contemporary data-driven research efforts in the natural sciences, such
as chemistry and materials science, require large-scale, high-performance
entity recognition from scientific datasets. Large language models (LLMs) have
increasingly been adopted to solve the entity recognition task, with the same
trend being observed across the full spectrum of NLP tasks. Prevailing entity
recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs
significant cost. To achieve the best performance-cost trade-off, we
propose ALLabel, a three-stage framework designed to select the most
informative and representative samples in preparing the demonstrations for LLM
modeling. The annotated examples are used to construct a ground-truth retrieval
corpus for LLM in-context learning. By sequentially employing three distinct
active learning strategies, ALLabel consistently outperforms all baselines
under the same annotation budget across three specialized domain datasets.
Experimental results also demonstrate that selectively annotating only 5\%-10\%
of the dataset with ALLabel can achieve performance comparable to the method
annotating the entire dataset. Further analyses and ablation studies verify the
effectiveness and generalizability of our proposal.
☆ Astra: A Multi-Agent System for GPU Kernel Performance Optimization
Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken
GPU kernel optimization has long been a central challenge at the intersection
of high-performance computing and machine learning. Efficient kernels are
crucial for accelerating large language model (LLM) training and serving, yet
attaining high performance typically requires extensive manual tuning.
Compiler-based systems reduce some of this burden, but still demand substantial
manual design and engineering effort. Recently, researchers have explored using
LLMs for GPU kernel generation, though prior work has largely focused on
translating high-level PyTorch modules into CUDA code. In this work, we
introduce Astra, the first LLM-based multi-agent system for GPU kernel
optimization. Unlike previous approaches, Astra starts from existing CUDA
implementations extracted from SGLang, a widely deployed framework for serving
LLMs, rather than treating PyTorch modules as the specification. Within Astra,
specialized LLM agents collaborate through iterative code generation, testing,
profiling, and planning to produce kernels that are both correct and
high-performance. On kernels from SGLang, Astra achieves an average speedup of
1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study
further demonstrates that LLMs can autonomously apply loop transformations,
optimize memory access patterns, exploit CUDA intrinsics, and leverage fast
math operations to yield substantial performance gains. Our work highlights
multi-agent LLM systems as a promising new paradigm for GPU kernel
optimization.
☆ HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention
Detecting content that contradicts or is unsupported by a given source text
is a critical challenge for the safe deployment of generative language models.
We introduce HALT-RAG, a post-hoc verification system designed to identify
hallucinations in the outputs of Retrieval-Augmented Generation (RAG)
pipelines. Our flexible and task-adaptable framework uses a universal feature
set derived from an ensemble of two frozen, off-the-shelf Natural Language
Inference (NLI) models and lightweight lexical signals. These features are used
to train a simple, calibrated, and task-adapted meta-classifier. Using a
rigorous 5-fold out-of-fold (OOF) training protocol to prevent data leakage and
produce unbiased estimates, we evaluate our system on the HaluEval benchmark.
By pairing our universal feature set with a lightweight, task-adapted
classifier and a precision-constrained decision policy, HALT-RAG achieves
strong OOF F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA,
and dialogue tasks, respectively. The system's well-calibrated probabilities
enable a practical abstention mechanism, providing a reliable tool for
balancing model performance with safety requirements.
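
A minimal sketch of the HALT-RAG recipe described above (not the authors' pipeline): probabilities from two frozen NLI models plus a lexical-overlap signal feed a calibrated meta-classifier with an abstention band. The feature set, thresholds, and placeholder data are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

def lexical_overlap(source, claim):
    s, c = set(source.lower().split()), set(claim.lower().split())
    return len(s & c) / max(len(c), 1)

def build_features(nli_probs_a, nli_probs_b, source, claim):
    # nli_probs_*: [entailment, neutral, contradiction] from each frozen NLI model
    return np.concatenate([nli_probs_a, nli_probs_b, [lexical_overlap(source, claim)]])

# X: feature rows for (source, generated answer) pairs; y: 1 = hallucinated.
X, y = np.random.rand(200, 7), np.random.randint(0, 2, 200)   # placeholder data
meta = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)

def verdict(features, abstain_low=0.35, abstain_high=0.65):
    p = meta.predict_proba([features])[0, 1]
    if abstain_low < p < abstain_high:
        return "abstain"          # calibrated probability too uncertain to act on
    return "hallucinated" if p >= abstain_high else "supported"

probs_a, probs_b = [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]           # hypothetical NLI outputs
print(verdict(build_features(probs_a, probs_b, "Paris is in France.", "Paris is in Germany.")))
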
☆ From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation
Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Suleiman, Faith Hunja, Busayo Awobade, Fatimo Adebanjo, Comfort Akanni, Chinonyelum Igwe, Peace Ododo, Promise Omoigui, Steven Kolawole, Abraham Owodunni
The linguistic diversity across the African continent presents different
challenges and opportunities for machine translation. This study explores the
effects of data augmentation techniques in improving translation systems in
low-resource African languages. We focus on two data augmentation techniques:
sentence concatenation with back translation and switch-out, applying them
across six African languages. Our experiments show significant improvements in
machine translation performance, with a minimum increase of 25\% in BLEU score
across all six languages. We provide a comprehensive analysis and highlight the
potential of these techniques to improve machine translation systems for
low-resource languages, contributing to the development of more robust
translation systems for under-resourced languages.
comment: 8 pages, 3 tables. Exploratory work on Data Augmentation for African
Machine Translation
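
A minimal sketch of the two augmentations named above, under simplifying assumptions (not the paper's implementation): switch-out here replaces each token independently with a fixed probability, and a back-translated pair is appended to an original pair to form a longer training example.

import random

def switch_out(tokens, vocab, tau=0.1):
    # Simplified switch-out: each token is swapped for a random vocabulary word
    # with probability tau (the original formulation samples the number of
    # replacements from a temperature-controlled distribution).
    return [random.choice(vocab) if random.random() < tau else t for t in tokens]

def concat_with_back_translation(src, tgt, bt_src, bt_tgt):
    # Join an original sentence pair with a back-translated pair.
    return src + " " + bt_src, tgt + " " + bt_tgt

toy_vocab = ["water", "house", "farm", "food", "money"]   # placeholder vocabulary
print(switch_out("the market opens early today".split(), toy_vocab))
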
☆ Understanding Stigmatizing Language Lexicons: A Comparative Analysis in Clinical Contexts
Stigmatizing language results in healthcare inequities, yet there is no
universally accepted or standardized lexicon defining which words, terms, or
phrases constitute stigmatizing language in healthcare. We conducted a
systematic search of the literature to identify existing stigmatizing language
lexicons and then analyzed them comparatively to examine: 1) similarities and
discrepancies between these lexicons, and 2) the distribution of positive,
negative, or neutral terms based on an established sentiment dataset. Our
search identified four lexicons. The analysis results revealed moderate
semantic similarity among them, and that most stigmatizing terms are related to
judgmental expressions by clinicians to describe perceived negative behaviors.
Sentiment analysis showed a predominant proportion of negatively classified
terms, though variations exist across lexicons. Our findings underscore the
need for a standardized lexicon and highlight challenges in defining
stigmatizing language in clinical texts.
☆ AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Positive, supportive online communication in social media (candy speech) has
the potential to foster civility, yet automated detection of such language
remains underexplored, limiting systematic analysis of its impact. We
investigate how candy speech can be reliably detected in a 46k-comment German
YouTube corpus by monolingual and multilingual language models, including
GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual
XLM-RoBERTa-Large model trained to detect candy speech at the span level
outperforms other approaches, ranking first in both the binary (positive F1: 0.8906)
and categorized span-based detection (strict F1: 0.6307) subtasks at the
GermEval 2025 Shared Task on Candy Speech Detection. We speculate that
span-based training, multilingual capabilities, and emoji-aware tokenizers
improved detection performance. Our results demonstrate the effectiveness of
multilingual models in identifying positive, supportive language.
comment: 6 pages, 1 figure, 2 tables
☆ GLEAM: Learning to Match and Explain in Cross-View Geo-Localization
Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Xue Yang, Hongsheng Li
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences
between images captured from distinct perspectives of the same geographical
location. However, existing CVGL approaches are typically restricted to a
single view or modality, and their direct visual matching strategy lacks
interpretability: they merely predict whether two images correspond, without
explaining the rationale behind the match. In this paper, we present GLEAM-C, a
foundational CVGL model that unifies multiple views and modalities, including
UAV imagery, street maps, panoramic views, and ground photographs, by aligning
them exclusively with satellite imagery. Our framework enhances training
efficiency through optimized implementation while achieving accuracy comparable
to prior modality-specific CVGL models through a two-phase training strategy.
Moreover, to address the lack of interpretability in traditional CVGL methods,
we leverage the reasoning capabilities of multimodal large language models
(MLLMs) to propose a new task, GLEAM-X, which combines cross-view
correspondence prediction with explainable reasoning. To support this task, we
construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro
to generate training and testing data. The test set is further refined through
detailed human revision, enabling systematic evaluation of explainable
cross-view reasoning and advancing transparency and scalability in
geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL
pipeline that integrates multi-modal, multi-view alignment with interpretable
correspondence analysis, unifying accurate cross-view matching with explainable
reasoning and advancing Geo-Localization by enabling models to better Explain
And Match. Code and datasets used in this work will be made publicly accessible
at https://github.com/Lucky-Lance/GLEAM.
comment: 18 pages
☆ Language Self-Play For Data-Free Training
Large language models (LLMs) have advanced rapidly in recent years, driven by
scale, abundant high-quality training data, and reinforcement learning. Yet
this progress faces a fundamental bottleneck: the need for ever more data from
which models can continue to learn. In this work, we propose a reinforcement
learning approach that removes this dependency by enabling models to improve
without additional data. Our method leverages a game-theoretic framework of
self-play, where a model's capabilities are cast as performance in a
competitive game and stronger policies emerge by having the model play against
itself - a process we call Language Self-Play (LSP). Experiments with
Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained
models can not only enhance their performance on challenging tasks through
self-play alone, but can also do so more effectively than data-driven
baselines.
☆ LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
Large language models (LLMs) have made significant progress in Emotional
Intelligence (EI) and long-context understanding. However, existing benchmarks
tend to overlook certain aspects of EI in long-context scenarios, especially
under realistic, practical settings where interactions are lengthy, diverse,
and often noisy. To move towards such realistic settings, we present
LongEmotion, a benchmark specifically designed for long-context EI tasks. It
covers a diverse set of tasks, including Emotion Classification, Emotion
Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion
Expression. On average, the input length for these tasks reaches 8,777 tokens,
with long-form generation required for Emotion Expression. To enhance
performance under realistic constraints, we incorporate Retrieval-Augmented
Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them
with standard prompt-based methods. Unlike conventional approaches, our RAG
method leverages both the conversation context and the large language model
itself as retrieval sources, avoiding reliance on external knowledge bases. The
CoEM method further improves performance by decomposing the task into five
stages, integrating both retrieval augmentation and limited knowledge
injection. Experimental results show that both RAG and CoEM consistently
enhance EI-related performance across most long-context tasks, advancing LLMs
toward more practical and real-world EI applications. Furthermore, we conducted
a comparative case study experiment on the GPT series to demonstrate the
differences among various models in terms of EI. Code is available on GitHub at
https://github.com/LongEmotion/LongEmotion, and the project page can be found
at https://longemotion.github.io/.
comment: Technical Report
☆ The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering ACL 2025
Integrating knowledge graphs (KGs) into the reasoning processes of large
language models (LLMs) has emerged as a promising approach to mitigate
hallucination. However, existing work in this area often relies on proprietary
or extremely large models, limiting accessibility and scalability. In this
study, we investigate the capabilities of existing integration methods for
small language models (SLMs) in KG-based question answering and observe that
their performance is often constrained by their limited ability to traverse and
reason over knowledge graphs. To address this limitation, we propose leveraging
simple and efficient exploration modules to handle knowledge graph traversal in
place of the language model itself. Experiment results demonstrate that these
lightweight modules effectively improve the performance of small language
models on knowledge graph question answering tasks. Source code:
https://github.com/yijie-cheng/SLM-ToG/.
comment: Extended from ACL 2025 SRW
☆ Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents
Existing evaluation studies on linguistic competence of large language models
(LLM agents) have focused primarily on vocabulary learning, morphological rule
induction, syntactic generalization, pragmatic inference, and cross-linguistic
transfer. However, none assess whether LLM agents can acquire a language
through pattern recognition and interactive feedback, a central feature of
human language acquisition. We propose a novel experimental framework in which
an LLM agent is evaluated on its ability to acquire and use a newly constructed
language (Tinkatongue) in conversation with a bot that understands only
Tinkatongue. Our findings show that LLM agents fail to establish a conversation
within 100 responses, yet they adopt distinct strategies that mirror human
approaches to language learning. The results suggest a new direction for
evaluation benchmarks and open pathways to model designs that learn more
effectively from interactive feedback.
comment: Under review
☆ PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
Recent advancements in Large Language Models (LLMs) demonstrate remarkable
capabilities across various fields. These developments have led to more direct
communication between humans and LLMs in various situations, such as social
companionship and psychological support. However, LLMs often exhibit
limitations in emotional perception and social competence during real-world
conversations. These limitations partly originate from their inability to adapt
their communication style and emotional expression to different social and task
contexts. In this work, we introduce PersonaFuse, a novel LLM post-training
framework that enables LLMs to adapt and express different personalities for
varying situations. Inspired by Trait Activation Theory and the Big Five
personality model, PersonaFuse employs a Mixture-of-Expert architecture that
combines persona adapters with a dynamic routing network, enabling contextual
trait expression. Experimental results show that PersonaFuse substantially
outperforms baseline models across multiple dimensions of social-emotional
intelligence. Importantly, these gains are achieved without sacrificing general
reasoning ability or model safety, which remain common limitations of direct
prompting and supervised fine-tuning approaches. PersonaFuse also delivers
consistent improvements in downstream human-centered applications, such as
mental health counseling and review-based customer service. Finally, human
preference evaluations against leading LLMs, including GPT-4o and DeepSeek,
demonstrate that PersonaFuse achieves competitive response quality despite its
comparatively smaller model size. These findings demonstrate that
PersonaFuse offers a theoretically grounded and practical approach for
developing social-emotional enhanced LLMs, marking a significant advancement
toward more human-centric AI systems.
☆ Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation EMNLP 2025
The Transformer self-attention mechanism serves as the core of modern
language models, yet it often suffers from localization, where attention
collapses onto a limited subset of tokens and fails to capture long-range
dependencies. To address this issue, we propose Self-Attention One-step Belief
Propagation (SAOBP), a refinement framework that injects multi-hop
relationships through a belief propagation process. To interpret and quantify
these interactions, we introduce Global Token Dependency (GTD) that captures
the relative contribution of multi-hop connections within the attention graph.
Empirical results indicate that SAOBP helps prevent entropy collapse in deeper
layers and adaptively maintains GTD at task-appropriate levels, thereby
supporting improvements in model performance. Importantly, we observe
competitive gains in small-scale models, highlighting its potential for
improving inference quality in resource-constrained scenarios.
comment: Accepted at EMNLP 2025
☆ Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations
Recent advances in large language models (LLMs) have been driven by
pretraining, supervised fine tuning (SFT), and alignment tuning. Among these,
SFT plays a crucial role in transforming a model's general knowledge into
structured responses tailored to specific tasks. However, there is no clearly
established methodology for effective training data selection. Simply
increasing the volume of data does not guarantee performance improvements,
while preprocessing, sampling, and validation require substantial time and
cost.
To address this issue, a variety of data selection methods have been
proposed. Among them, knowledge based selection approaches identify suitable
training data by analyzing the model's responses. Nevertheless, these methods
typically rely on prompt engineering, making them sensitive to variations and
incurring additional costs for prompt design.
In this study, we propose Knowledge Analysis via Model Internal
Representations (KAMIR), a novel approach that overcomes these limitations by
analyzing data based on the model's internal representations. KAMIR computes
similarities between the hidden states of each layer (block) and the final
hidden states for a given input to assess the data. Unlike prior methods that
were largely limited to multiple choice tasks, KAMIR can be applied to a wide
range of tasks such as machine reading comprehension and summarization.
Moreover, it selects data useful for training based on the model's familiarity
with the input, even with a small dataset and a simple classifier architecture.
Experiments across diverse task datasets demonstrate that training with less
familiar data leads to better generalization performance.
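
A minimal sketch of the layer-wise similarity signal described above (not the authors' code), assuming a mean-pooled hidden state per layer and cosine similarity against the final layer; the encoder choice is illustrative.

import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                      # any encoder exposing hidden states
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

def layerwise_similarity(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Mean-pool tokens in each layer, then compare every layer to the last one.
    pooled = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
    final = pooled[-1]
    return [torch.cosine_similarity(h, final, dim=0).item() for h in pooled[:-1]]

print(layerwise_similarity("The capital of France is Paris."))
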
☆ Instance-level Performance Prediction for Long-form Generation Tasks
We motivate and share a new benchmark for instance-level performance
prediction of long-form generation tasks with multi-faceted, fine-grained
quality metrics. Our task-, model- and metric-agnostic formulation predicts
continuous evaluation metric scores given only black-box model inputs and
outputs. Beyond predicting point estimates of metric scores, the benchmark also
requires inferring prediction intervals to quantify uncertainty around point
estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs,
baselines, and metrics per task. We show that scores can be effectively
predicted across long-form generation tasks using as few as 16 training
examples. Overall, we introduce a novel and useful task, a valuable benchmark
to drive progress, and baselines ready for practical adoption today.
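
As a minimal sketch of what such a predictor could look like (not one of the benchmark's baselines), one regressor gives the point estimate of the metric and two quantile regressors give a prediction interval; the features and the 16-example training set are placeholders.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X = np.random.rand(16, 5)           # e.g., embeddings of black-box inputs/outputs
y = np.random.rand(16)              # observed metric scores (e.g., ROUGE)

point = GradientBoostingRegressor(loss="squared_error").fit(X, y)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.1).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, y)

x_new = np.random.rand(1, 5)
print(point.predict(x_new), (lower.predict(x_new), upper.predict(x_new)))
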
☆ Basis Vector Metric: A Method for Robust Open-Ended State Change Detection
We test a new method, abbreviated BVM (Basis Vectors Method), on its ability
to judge state changes in images using language embeddings. All of our data
comes from the MIT-States dataset, which contains about 53,000 images covering
225 nouns and 115 adjectives; each noun appears with roughly 9 different
adjectives, yielding approximately 1,000 noun-adjective pairs. In our first
experiment, we test the method's ability to determine the state of each noun
class separately and compare it against other metrics: cosine similarity, dot
product, product quantization, a binary index, Naive Bayes, and a custom neural
network. Among these metrics, the proposed BVM performs best at classifying the
states for each noun. In a second experiment, we test whether BVM can
differentiate each adjective from the others, comparing it against the method
suggested in the MIT-States paper: a logistic regression model. We did not find
conclusive evidence that BVM outperforms the logistic regression model at
discerning adjectives. However, we did identify evidence for possible
improvements to our method, suggesting that changes to our methodology could
increase its accuracy.
comment: 24 pages
☆ Causal Attention with Lookahead Keys
In standard causal attention, each token's query, key, and value (QKV) are
static and encode only preceding context. We introduce CAuSal aTtention with
Lookahead kEys (CASTLE), an attention mechanism that continually updates each
token's keys as the context unfolds. We term these updated keys lookahead keys
because they belong to earlier positions yet integrate information from tokens
that appear later relative to those positions, while strictly preserving the
autoregressive property. Although the mechanism appears sequential, we derive a
mathematical equivalence that avoids explicitly materializing lookahead keys at
each position and enables efficient parallel training. On language modeling
benchmarks, CASTLE consistently outperforms standard causal attention across
model scales, reducing validation perplexity and improving performance on a
range of downstream tasks.
♻ ☆ Llama-Nemotron: Efficient Reasoning Models
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi Mahabadi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
We introduce the Llama-Nemotron series of models, an open family of
heterogeneous reasoning models that deliver exceptional reasoning capabilities,
inference efficiency, and an open license for enterprise use. The family comes
in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs
competitively with state-of-the-art reasoning models such as DeepSeek-R1 while
offering superior inference throughput and memory efficiency. In this report,
we discuss the training procedure for these models, which entails using neural
architecture search from Llama 3 models for accelerated inference, knowledge
distillation, and continued pretraining, followed by a reasoning-focused
post-training stage consisting of two main parts: supervised fine-tuning and
large scale reinforcement learning. Llama-Nemotron models are the first
open-source models to support a dynamic reasoning toggle, allowing users to
switch between standard chat and reasoning modes during inference. To further
support open research and facilitate model development, we provide the
following resources: 1. We release the Llama-Nemotron reasoning models --
LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA
Open Model License Agreement. 2. We release the complete post-training dataset:
Llama-Nemotron-Post-Training-Dataset. 3. We also release our training
codebases: NeMo, NeMo-Aligner, and Megatron-LM.
♻ ☆ Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts ACL
Evaluating natural language generation systems is challenging due to the
diversity of valid outputs. While human evaluation is the gold standard, it
suffers from inconsistencies, lack of standardisation, and demographic biases,
limiting reproducibility. LLM-based evaluators offer a scalable alternative but
are highly sensitive to prompt design, where small variations can lead to
significant discrepancies. In this work, we propose an inversion learning
method that learns effective reverse mappings from model outputs back to their
input instructions, enabling the automatic generation of highly effective,
model-specific evaluation prompts. Our method requires only a single evaluation
sample and eliminates the need for time-consuming manual prompt engineering,
thereby improving both efficiency and robustness. Our work contributes toward a
new direction for more robust and efficient LLM-based evaluation.
comment: 12 pages, accepted by Transactions of the Association for
Computational Linguistics (TACL)
♻ ☆ A Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
This systematic review of the research literature on retrieval-augmented
generation (RAG) provides a focused analysis of the most highly cited studies
published between 2020 and May 2025. A total of 128 articles met our inclusion
criteria. The records were retrieved from ACM Digital Library, IEEE Xplore,
Scopus, ScienceDirect, and the Digital Bibliography and Library Project (DBLP).
RAG couples a neural retriever with a generative language model, grounding
output in up-to-date, non-parametric memory while retaining the semantic
generalisation stored in model weights. Guided by the PRISMA 2020 framework, we
(i) specify explicit inclusion and exclusion criteria based on citation count
and research questions, (ii) catalogue datasets, architectures, and evaluation
practices, and (iii) synthesise empirical evidence on the effectiveness and
limitations of RAG. To mitigate citation-lag bias, we applied a lower
citation-count threshold to papers published in 2025 so that emerging
breakthroughs with naturally fewer citations were still captured. This review
clarifies the current research landscape, highlights methodological gaps, and
charts priority directions for future research.
comment: 58 pages
♻ ☆ Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
Antonym vs. synonym distinction across multiple languages presents unique
computational challenges due to the paradoxical nature of antonymous
relationships: words that share semantic domains while expressing opposite
meanings. This work introduces Bhav-Net, a novel dual-space architecture that
enables effective knowledge transfer from complex multilingual models to
simpler, language-specific architectures while maintaining robust cross-lingual
antonym--synonym distinction capabilities. Our approach combines
language-specific BERT encoders with graph transformer networks, creating
distinct semantic projections where synonymous pairs cluster in one space while
antonymous pairs exhibit high similarity in a complementary space. Through
comprehensive evaluation across eight languages (English, German, French,
Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic
relationship modeling transfers effectively across languages. The dual-encoder
design achieves competitive performance against state-of-the-art baselines
while providing interpretable semantic representations and effective
cross-lingual generalization.
comment: Found some issues and need to correct them
♻ ☆ Cardiverse: Harnessing LLMs for Novel Card Game Prototyping EMNLP 2025
The prototyping of computer games, particularly card games, requires
extensive human effort in creative ideation and gameplay evaluation. Recent
advances in Large Language Models (LLMs) offer opportunities to automate and
streamline these processes. However, it remains challenging for LLMs to design
novel game mechanics beyond existing databases, generate consistent gameplay
environments, and develop scalable gameplay AI for large-scale evaluations.
This paper addresses these challenges by introducing a comprehensive automated
card game prototyping framework. The approach highlights a graph-based indexing
method for generating novel game variations, an LLM-driven system for
consistent game code generation validated by gameplay records, and a gameplay
AI constructing method that uses an ensemble of LLM-generated heuristic
functions optimized through self-play. These contributions aim to accelerate
card game prototyping, reduce human labor, and lower barriers to entry for game
developers. For the code repository, visit
https://github.com/danruili/Cardiverse
comment: 37 pages, 13 figures, 8 tables. Accepted by EMNLP 2025
♻ ☆ JoPA: Explaining Large Language Model's Generation via Joint Prompt Attribution ACL 2025
Large Language Models (LLMs) have demonstrated impressive performance in
complex text generation tasks. However, the contribution of the input prompt to
the generated content still remains obscure to humans, underscoring the
necessity of understanding the causality between input and output pairs.
Existing works on prompt-specific explanation often confine the model output to
classification or next-word prediction. The few initial attempts to explain
entire language generation often treat input prompt texts independently,
ignoring their combinatorial effects on the follow-up generation. In this
study, we introduce a counterfactual explanation framework based on Joint
Prompt Attribution, JoPA, which aims to explain how a few prompt texts
collaboratively influence the LLM's complete generation. Particularly,
we formulate the task of prompt attribution for generation interpretation as a
combinatorial optimization problem, and introduce a probabilistic algorithm to
search for the causal input combination in the discrete space. We define and
utilize multiple metrics to evaluate the produced explanations, demonstrating
both the faithfulness and efficiency of our framework.
comment: Accepted to ACL 2025 (Main)
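The JoPA abstract above frames prompt attribution as a combinatorial search over input components. Below is a minimal, hypothetical sketch of one such counterfactual search: it greedily removes the prompt segments whose joint deletion most reduces a generation score supplied by the caller (for example, the log-probability a model assigns to the original output given the ablated prompt). The greedy loop and the toy scorer are illustrative assumptions, not the paper's probabilistic algorithm.

# Hedged sketch of counterfactual joint prompt attribution (not the authors' exact
# method): greedily find the k prompt segments whose joint removal most degrades a
# generation score supplied by `score_fn`.
def greedy_joint_attribution(segments, score_fn, k=2):
    removed, remaining = [], list(segments)
    base = score_fn(remaining)
    for _ in range(k):
        best_drop, best_idx = 0.0, None
        for i in range(len(remaining)):
            candidate = remaining[:i] + remaining[i + 1:]
            drop = base - score_fn(candidate)
            if drop > best_drop:
                best_drop, best_idx = drop, i
        if best_idx is None:          # no remaining segment lowers the score further
            break
        removed.append(remaining.pop(best_idx))
        base = score_fn(remaining)
    return removed

# Toy scorer standing in for an LLM: pretends the generation depends on "budget" and "Q3".
def toy_score(segments):
    text = " ".join(segments)
    return float("budget" in text) + float("Q3" in text)

prompt = ["Summarize the report.", "Focus on the Q3 budget.", "Keep it short."]
print(greedy_joint_attribution(prompt, toy_score, k=2))   # -> ['Focus on the Q3 budget.']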
♻ ☆ Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection EMNLP 2025
Hateful memes have become a significant concern on the Internet,
necessitating robust automated detection systems. While Large Multimodal Models
(LMMs) have shown promise in hateful meme detection, they face notable
challenges like sub-optimal performance and limited out-of-domain
generalization capabilities. Recent studies further reveal the limitations of
both supervised fine-tuning (SFT) and in-context learning when applied to LMMs
in this setting. To address these issues, we propose a robust adaptation
framework for hateful meme detection that enhances in-domain accuracy and
cross-domain generalization while preserving the general vision-language
capabilities of LMMs. Analysis reveals that our approach achieves improved
robustness under adversarial attacks compared to SFT models. Experiments on six
meme classification datasets show that our approach achieves state-of-the-art
performance, outperforming larger agentic systems. Moreover, our method
generates higher-quality rationales for explaining hateful content compared to
standard SFT, enhancing model interpretability. Code available at
https://github.com/JingbiaoMei/RGCL
comment: EMNLP 2025 Main
♻ ☆ Hunyuan-MT Technical Report
In this report, we introduce Hunyuan-MT-7B, our first open-source
multilingual translation model, which supports bidirectional translation across
33 major languages and places a special emphasis on translation between
Mandarin and several ethnic minority languages as well as dialects.
Furthermore, to serve and address diverse translation scenarios and enhance
model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a
translation model inspired by the slow thinking mode. This model integrates
multiple outputs generated by the Hunyuan-MT-7B model under varying parameter
settings, thereby achieving performance superior to that of conventional
slow-thinking models based on Chain-of-Thought (CoT). The development of our
models follows a holistic training process specifically engineered for
multilingual translation, which begins with general and MT-oriented
pre-training to build foundational capabilities, proceeds to Supervised
Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced
alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through
comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and
Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models
of comparable parameter size and most of the SOTA large models, particularly on
the task of translation between Mandarin and minority languages as well as
dialects. In the WMT2025 shared task (General Machine Translation), our models
demonstrate state-of-the-art performance, ranking first in 30 out of 31
language pairs. This result highlights the robustness of our models across a
diverse linguistic spectrum, encompassing high-resource languages such as
Chinese, English, and Japanese, as well as low-resource languages including
Czech, Marathi, Estonian, and Icelandic.
♻ ☆ TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review EMNLP2025
Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong
While Large Language Models (LLMs) have shown significant potential in
assisting peer review, current methods often struggle to generate thorough and
insightful reviews while maintaining efficiency. In this paper, we propose
TreeReview, a novel framework that models paper review as a hierarchical and
bidirectional question-answering process. TreeReview first constructs a tree of
review questions by recursively decomposing high-level questions into
fine-grained sub-questions and then resolves the question tree by iteratively
aggregating answers from leaf to root to get the final review. Crucially, we
incorporate a dynamic question expansion mechanism to enable deeper probing by
generating follow-up questions when needed. We construct a benchmark derived
from ICLR and NeurIPS venues to evaluate our method on full review generation
and actionable feedback comment generation tasks. Experimental results from both
LLM-based and human evaluation show that TreeReview outperforms strong
baselines in providing comprehensive, in-depth, and expert-aligned review
feedback, while reducing LLM token usage by up to 80% compared to
computationally intensive approaches. Our code and benchmark dataset are
available at https://github.com/YuanChang98/tree-review.
comment: Accepted to EMNLP2025 Main
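The recursive question tree described in the TreeReview abstract above can be illustrated with a small control-flow sketch. The decomposer and answerer below are placeholder callables standing in for LLM calls, and the aggregation simply passes child answers to the parent question; the paper's dynamic follow-up question expansion is not modeled.

# Minimal sketch of a tree-of-questions review loop, assuming two pluggable callables:
#   decompose(question) -> list of sub-questions ([] for a leaf)
#   answer(question, child_answers) -> answer text
def tree_review(question, decompose, answer, depth=0, max_depth=2):
    subs = decompose(question) if depth < max_depth else []
    child_answers = [tree_review(q, decompose, answer, depth + 1, max_depth) for q in subs]
    return answer(question, child_answers)   # aggregate answers from leaves toward the root

# Toy stand-ins for LLM calls.
def toy_decompose(q):
    return ["Is the method sound?", "Are the experiments convincing?"] if q.startswith("Review") else []

def toy_answer(q, child_answers):
    if child_answers:
        return f"{q} -> synthesized from {len(child_answers)} sub-answers."
    return f"Answer to '{q}'."

print(tree_review("Review this paper.", toy_decompose, toy_answer))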
♻ ☆ UPLex: Fine-Grained Personality Control in Large Language Models via Unsupervised Lexical Modulation EMNLP 2025
Tianlong Li, Wenhao Liu, Muling Wu, Shihan Dou, Zhenghua Wang, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
Personality is a crucial factor that shapes human communication patterns, so
regulating the personalities of large language models (LLMs) holds significant
potential for enhancing their user experience. Previous approaches
either relied on fine-tuning LLMs on specific corpora or required manually
crafted prompts to evoke specific personalities from LLMs. However, the former
is inefficient and costly, while the latter cannot precisely manipulate
personality traits at a fine-grained level. To address these challenges, we
propose UPLex, a method that uses an Unsupervisedly-Built Personalized Lexicon
(UPL) during the decoding phase to manipulate an LLM's personality traits. The
UPL can be constructed from a newly built situational judgment test dataset in
an unsupervised fashion and used to modulate the personality expression of
LLMs by dynamically altering the predicted probabilities of upcoming words in a
pluggable fashion. Extensive experimentation demonstrates the remarkable
effectiveness and pluggability of our method for fine-grained manipulation of
LLMs' personalities.
comment: EMNLP 2025 Findings
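Since the UPLex abstract above describes a pluggable modulation of next-word probabilities at decoding time, a minimal illustration is a logit bias applied to lexicon words before sampling. The vocabulary, lexicon, and bias strength below are invented for the example; the authors' unsupervised lexicon construction is not reproduced here.

import numpy as np

# Hedged sketch: boost (or suppress) next-token logits for words in a personality lexicon,
# then renormalize. Only the decoding-time modulation idea is shown.
def modulate_logits(logits, vocab, lexicon, bias=2.0):
    logits = np.array(logits, dtype=float)
    for i, word in enumerate(vocab):
        if word in lexicon:
            logits[i] += bias              # shift probability mass toward lexicon words
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

vocab = ["quiet", "thoughtful", "party", "loud", "books"]
extrovert_lexicon = {"party", "loud"}      # hypothetical personalized lexicon
base_logits = [0.2, 0.1, 0.0, -0.1, 0.3]
print(modulate_logits(base_logits, vocab, extrovert_lexicon))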
♻ ☆ Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
With the continuous advancement in the performance of large language models
(LLMs), their demand for computational resources and memory has significantly
increased, which poses major challenges for efficient inference on
consumer-grade devices and legacy servers. These devices typically feature
relatively weaker GPUs and stronger CPUs. Although techniques such as parameter
offloading and partial offloading can alleviate GPU memory pressure to some
extent, their effectiveness is limited due to communication latency and
suboptimal hardware resource utilization. To address this issue, we propose
Dovetail, a lossless inference acceleration method that leverages the
complementary characteristics of heterogeneous devices and the advantages of
speculative decoding. Dovetail deploys a draft model on the GPU to perform
preliminary predictions, while a target model running on the CPU validates
these outputs. By reducing the granularity of data transfer, Dovetail
significantly minimizes communication overhead. To further improve efficiency,
we optimize the draft model specifically for heterogeneous hardware
environments by reducing the number of draft tokens to lower parallel
verification latency, increasing model depth to enhance predictive
capabilities, and introducing a Dynamic Gating Fusion (DGF) mechanism to
improve the integration of feature and embedding information. We conduct
comprehensive evaluations of Dovetail across various consumer-grade GPUs,
covering multiple tasks and mainstream models. Experimental results on 13B
models demonstrate that Dovetail achieves inference speedups ranging from 1.79x
to 10.1x across different devices, while maintaining consistency and stability
in the distribution of generated texts.
comment: 14 pages, 6 figures
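The draft-then-verify split in the Dovetail abstract follows the general speculative decoding recipe. The sketch below shows only that control flow with a simple greedy-match acceptance rule: a cheap draft callable proposes a few tokens and an expensive target callable keeps the longest prefix it agrees with. Device placement, the DGF mechanism, and the paper's lossless acceptance rule are not modeled.

# Simplified speculative decoding step (greedy-match acceptance) illustrating the
# draft/verify division of labor described above.
def speculative_step(context, draft_next, target_next, num_draft=4):
    drafted, ctx = [], list(context)
    for _ in range(num_draft):             # draft model proposes a short continuation
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in drafted:                     # target model verifies the drafted prefix
        t = target_next(ctx)
        if t == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:                               # disagreement: take the target's token and stop
            accepted.append(t)
            break
    return accepted

# Toy "models": the draft guesses the alphabet, the target wants 'a b c x'.
draft = lambda ctx: "abcdef"[len(ctx) % 6]
target = lambda ctx: "abcx"[len(ctx)] if len(ctx) < 4 else "."
print(speculative_step([], draft, target))   # -> ['a', 'b', 'c', 'x']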
♻ ☆ MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Multi-entity question answering (MEQA) presents significant challenges for
large language models (LLMs) and retrieval-augmented generation (RAG) systems,
which frequently struggle to consolidate scattered information across diverse
documents. While existing methods excel at single-document comprehension, they
often struggle with cross-document aggregation, particularly when resolving
entity-dense questions like "What is the distribution of ACM Fellows among
various fields of study?", which require integrating entity-centric insights
from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we
introduce MEBench, a novel multi-document, multi-entity benchmark designed to
systematically evaluate LLMs' capacity to retrieve, consolidate, and reason
over fragmented information. Our benchmark comprises 4,780 questions which are
systematically categorized into three primary categories, further divided into
eight distinct types, ensuring broad coverage of real-world multi-entity
reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4,
Llama-3) and RAG pipelines reveal critical limitations: even advanced models
achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance
of completeness and factual precision of information extraction in MEQA tasks,
using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of
entity-level correctness and attribution validity. MEBench not only highlights
systemic weaknesses in current LLM frameworks but also provides a foundation
for advancing robust, entity-aware QA architectures.
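The Entity-Attributed F1 metric mentioned above can be read as a set-level F1 over entity-attribute predictions that also requires a valid attribution. The sketch below computes F1 under that reading; the exact matching and attribution-validity rules used by MEBench may differ.

# Hedged sketch of an entity-attributed F1: compare predicted and gold sets of
# (entity, attribute, source) triples, so credit requires both the right value
# and a correct attribution.
def entity_attributed_f1(predicted, gold):
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Alice", "field=AI", "wiki/Alice"), ("Bob", "field=DB", "wiki/Bob")]
pred = [("Alice", "field=AI", "wiki/Alice"), ("Bob", "field=HCI", "wiki/Bob")]
print(round(entity_attributed_f1(pred, gold), 3))   # 0.5: one triple fully correct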
♻ ☆ Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
Advances in hardware and language model architecture have spurred a
revolution in natural language generation. However, autoregressive models
compute probability distributions over next-token choices, and sampling from
these distributions, known as decoding, has received significantly less
attention than other design choices. Existing decoding strategies are largely
based on heuristics, resulting in methods that are difficult to apply or
improve in a principled manner. We develop the theory of decoding strategies
for language models by expressing popular decoding algorithms as equilibrium
states in the language of ergodic theory and stating the objective functions
they optimize. Using this, we analyze the effect of the local normalization
step required to make probabilities sum to one in top-k, nucleus, and
temperature sampling. We argue that local normalization distortion is a
fundamental defect of decoding strategies and quantify the size of this
distortion and its effect on mathematical proxies for the quality and diversity
of generated text. This yields conclusions for the design of decoding
algorithms and the detection of machine-generated text.
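To make the local-normalization point above concrete, the sketch below shows how top-k truncation plus renormalization rescales every surviving probability by a context-dependent factor (the inverse of the retained mass). Comparing that factor across contexts is one simple way to see the distortion the paper analyzes; the numbers are illustrative only.

import numpy as np

# Illustration of local normalization in top-k sampling: after truncation, each surviving
# probability is divided by the retained mass z, a context-dependent factor.
def top_k_renormalize(probs, k):
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]          # indices of the k most probable tokens
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    z = truncated.sum()                    # retained mass; renormalization factor is 1/z
    return truncated / z, z

p = [0.5, 0.3, 0.1, 0.05, 0.05]
renorm, mass = top_k_renormalize(p, k=2)
print(renorm)        # [0.625 0.375 0.    0.    0.   ]
print(1.0 / mass)    # each kept probability was inflated by this factor (1.25 here)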
♻ ☆ AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Recently, extensive research on hallucination in large language models (LLMs)
has mainly focused on the English language. Despite the growing
number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination
in the Arabic context remains relatively underexplored. The knowledge gap is
particularly pressing given Arabic's widespread use across many regions and its
importance in global communication and media. This paper presents the first
comprehensive hallucination evaluation of Arabic and multilingual LLMs on two
critical Arabic natural language generation tasks: generative question
answering (GQA) and summarization. This study evaluates a total of 12 LLMs,
including 4 Arabic pre-trained models, 4 multilingual models, and 4
reasoning-based models. To assess the factual consistency and faithfulness of
LLMs' outputs, we developed a fine-grained hallucination evaluation framework
consisting of 12 fine-grained hallucination indicators that represent the
varying characteristics of each task. The results reveal that factual
hallucinations are more prevalent than faithfulness errors across all models
and tasks. Notably, the Arabic pre-trained model Allam consistently
demonstrates lower hallucination rates than multilingual models and performance
comparable to reasoning-based models. The code is available at:
https://github.com/aishaalansari57/AraHalluEval
♻ ☆ A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
We present a Japanese domain-specific language model for the pharmaceutical
field, developed through continual pretraining on 2 billion Japanese
pharmaceutical tokens and 8 billion English biomedical tokens. To enable
rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on
national pharmacist licensing exams; NayoseQA, which tests cross-lingual
synonym and terminology normalization; and SogoCheck, a novel task designed to
assess consistency reasoning between paired statements. We evaluate our model
against both open-source medical LLMs and commercial models, including GPT-4o.
Results show that our domain-specific model outperforms existing open models
and achieves competitive performance with commercial ones, particularly on
terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o
performs poorly on SogoCheck, suggesting that cross-sentence consistency
reasoning remains an open challenge. Our benchmark suite offers a broader
diagnostic lens for pharmaceutical NLP, covering factual recall, lexical
variation, and logical consistency. This work demonstrates the feasibility of
building practical, secure, and cost-effective language models for Japanese
domain-specific applications, and provides reusable evaluation resources for
future research in pharmaceutical and healthcare NLP. Our model, codes, and
datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
comment: 15 pages, 9 tables, 5 figures
♻ ☆ FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain
Retrieval-Augmented Generation (RAG) plays a vital role in the financial
domain, powering applications such as real-time market analysis, trend
forecasting, and interest rate computation. However, most existing RAG research
in finance focuses predominantly on textual data, overlooking the rich visual
content in financial documents, resulting in the loss of key analytical
insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual
RAG benchmark tailored for finance which effectively integrates multimodal data
and provides visual citation to ensure traceability. It includes a bilingual
retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a
high-quality, human-annotated question-answering (QA) dataset spanning
heterogeneous data types and seven question categories. Moreover, we introduce
RGenCite, a RAG baseline that seamlessly integrates visual citation with
generation. Furthermore, we propose an automatic citation evaluation method to
systematically assess the visual citation capabilities of Multimodal Large
Language Models (MLLMs). Extensive experiments on RGenCite underscore the
challenging nature of FinRAGBench-V, providing valuable insights for the
development of multimodal RAG systems in finance.
♻ ☆ Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation
Pretraining data curation is a cornerstone in Large Language Model (LLM)
development, leading to growing research on quality filtering of large web
corpora. From statistical quality flags to LLM-based labelling systems,
datasets are divided into categories, frequently reducing to a binary: those
passing the filters are deemed valuable examples, while others are discarded as
useless or detrimental. However, a more detailed understanding of the
contribution of different kinds of texts to model performance is still largely
lacking. In this article, we present the first study utilising registers or
genres - a widely used standard in corpus linguistics to model linguistic
variation - to curate pretraining datasets and investigate the effect of
register on the performance of LLMs. We train small generative models with
register classified data and evaluate them using standard benchmarks, and show
that the register of pretraining data substantially affects model performance.
We uncover surprising relationships between the pretraining material and the
resulting models: using the News register results in subpar performance,
whereas including the Opinion class, covering texts such as reviews and
opinion blogs, is highly beneficial. While a model trained on the entire
unfiltered dataset outperforms those trained on datasets limited to a single
register, combining well-performing registers like How-to-Instructions,
Informational Description, and Opinion leads to major improvements.
Furthermore, analysis of individual benchmark results reveals key differences
in the strengths and drawbacks of specific register classes as pretraining
data. These findings show that register is an important explainer of model
variation and can facilitate more deliberate future data selection practices.
♻ ☆ TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection EMNLP2025
Rapid advances in Large Language Models (LLMs) have spurred demand for
processing extended context sequences in contemporary applications. However,
this progress faces two challenges: performance degradation due to
out-of-distribution sequence lengths, and excessively long inference times caused by the
quadratic computational complexity of attention. These issues limit LLMs in
long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache
Selection (TokenSelect), a training-free method for efficient and accurate
long-context inference. TokenSelect builds upon the observation of
non-contiguous attention sparsity, using QK dot products to measure per-head KV
cache criticality at the token level. Through a per-head soft voting mechanism,
TokenSelect selectively involves only a few critical KV cache tokens in the
attention calculation without sacrificing accuracy. To further accelerate
TokenSelect, we design the Selection Cache based on the observed similarity of
consecutive queries and implement an efficient Paged Dot Product Kernel,
significantly reducing the selection overhead. A comprehensive evaluation of
TokenSelect demonstrates up to $23.84\times$ speedup in attention computation
and up to $2.28\times$ acceleration in end-to-end latency, while providing
superior performance compared to state-of-the-art long-context inference
methods.
comment: Accepted by EMNLP2025
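A rough sketch of the token scoring described in the TokenSelect abstract: per-head Q-K dot products over cached keys, a soft vote across heads, and selection of the top-scoring cache positions. The shapes and the softmax-then-average voting rule are simplified assumptions; the paper's Selection Cache and paged kernel are not shown.

import numpy as np

# Hedged sketch of per-head KV criticality scoring with a soft vote across heads.
# q: (heads, dim) current query; keys: (heads, cache_len, dim) cached keys.
def select_kv_tokens(q, keys, budget):
    scores = np.einsum("hd,htd->ht", q, keys)          # per-head QK dot products
    votes = np.exp(scores - scores.max(axis=1, keepdims=True))
    votes = votes / votes.sum(axis=1, keepdims=True)   # per-head soft vote (softmax)
    pooled = votes.mean(axis=0)                        # aggregate votes across heads
    return np.argsort(pooled)[-budget:]                # indices of the most critical tokens

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))          # 4 heads, head dim 8
keys = rng.normal(size=(4, 16, 8))   # 16 cached tokens
print(sorted(select_kv_tokens(q, keys, budget=4)))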
♻ ☆ Trust but Verify! A Survey on Verification Design for Test-time Scaling
Test-time scaling (TTS) has emerged as a new frontier for scaling the
performance of Large Language Models. In test-time scaling, by using more
computational resources during inference, LLMs can improve their reasoning
process and task performance. Several approaches have emerged for TTS such as
distilling reasoning traces from another model or exploring the vast decoding
search space by employing a verifier. The verifiers serve as reward models that
help score the candidate outputs from the decoding process to diligently
explore the vast solution space and select the best outcome. This paradigm has
emerged as a superior approach owing to parameter-free scaling at inference
time and high performance gains. The verifiers could be
prompt-based, fine-tuned as a discriminative or generative model to verify
process paths, outcomes or both. Despite their widespread adoption, there is no
detailed collection, clear categorization and discussion of diverse
verification approaches and their training mechanisms. In this survey, we cover
the diverse approaches in the literature and present a unified view of verifier
training, types and their utility in test-time scaling. Our repository can be
found at
https://github.com/elixir-research-group/Verifierstesttimescaling.github.io.
comment: 18 pages
♻ ☆ LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning
Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng
Text-to-SQL is a critical task in natural language processing that aims to
transform natural language questions into accurate and executable SQL queries.
In real-world scenarios, these reasoning tasks are often accompanied by complex
mathematical computations, domain knowledge, and hypothetical reasoning
scenarios. However, existing large-scale Text-to-SQL datasets typically focus
on business logic and task logic, neglecting critical factors such as vertical
domain knowledge, complex mathematical reasoning, and hypothetical reasoning,
which are essential for realistically reflecting the reasoning demands in
practical applications and completing data querying and analysis. To bridge
this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset
specifically designed for complex reasoning and chain-of-thought parsing,
encompassing physics, arithmetic, commonsense, and hypothetical reasoning
scenarios. LogicCat comprises 4,038 English questions paired with 12,114
detailed chain-of-thought reasoning steps, spanning 45 databases across diverse
domains, significantly surpassing existing datasets in complexity. Experimental
results demonstrate that LogicCat substantially increases task difficulty:
current state-of-the-art models reach at most 33.20% execution accuracy,
indicating that this task remains exceptionally challenging. The advancement of
LogicCat represents a crucial step toward developing systems suitable for
real-world enterprise data analysis and autonomous query generation. We have
released our dataset code at https://github.com/Ffunkytao/LogicCat.
comment: 9 pages, 5 figures
♻ ☆ CTourLLM: Enhancing LLMs with Chinese Tourism Knowledge
Recently, large language models (LLMs) have demonstrated their effectiveness
in various natural language processing (NLP) tasks. However, the lack of
tourism knowledge limits the performance of LLMs in tourist attraction
presentations and travel planning. To address this challenge, we constructed a
supervised fine-tuning dataset for the Chinese culture and tourism domain,
named Cultour. This dataset consists of three parts: tourism knowledge base
data, travelogues data, and tourism QA data. Additionally, we propose CTourLLM,
a Qwen-based model supervised fine-tuned with Cultour, to improve the quality
of information about attractions and travel planning. To evaluate the
performance of CTourLLM, we propose a human evaluation criterion named RRA
(Relevance, Readability, Availability), and employ both automatic and human
evaluation. The experimental results demonstrate that CTourLLM outperforms
ChatGPT, achieving an improvement of 1.21 in BLEU-1 and 1.54 in Rouge-L,
thereby validating the effectiveness of the response outcomes. Our proposed
Cultour is accessible at https://github.com/mrweiqk/Cultour.
♻ ☆ Interleaving Reasoning for Better Text-to-Image Generation
Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Unified multimodal understanding and generation models have recently achieved
significant improvements in image generation capability, yet a large gap remains
in instruction following and detail preservation compared to systems that
tightly couple comprehension with generation, such as GPT-4o. Motivated by
recent advances in interleaving reasoning, we explore whether such reasoning
can further improve Text-to-Image (T2I) generation. We introduce Interleaving
Reasoning Generation (IRG), a framework that alternates between text-based
thinking and image synthesis: the model first produces text-based thinking to
guide an initial image, then reflects on the result to refine fine-grained
details, visual quality, and aesthetics while preserving semantics. To train
IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL),
which targets two sub-goals: (1) strengthening the initial think-and-generate
stage to establish core content and base quality, and (2) enabling high-quality
textual reflection and faithful implementation of those refinements in a
subsequent image. We curate IRGL-300K, a dataset organized into six decomposed
learning modes that jointly cover learning text-based thinking and full
thinking-image trajectories. Starting from a unified foundation model that
natively emits interleaved text-image outputs, our two-stage training first
builds robust thinking and reflection, then efficiently tunes the IRG pipeline
on the full thinking-image trajectory data. Extensive experiments show SoTA
performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF,
GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality
and fine-grained fidelity. The code, model weights and datasets will be
released in: https://github.com/Osilly/Interleaving-Reasoning-Generation .
♻ ☆ SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics
With the shift from traditional to digital media, the online landscape now
hosts not only reliable news articles but also a significant amount of
unreliable content. Digital media reaches audiences faster, significantly
influencing public opinion and advancing political agendas. While newspaper
readers may be familiar with their preferred outlets' political leanings or
credibility, identifying unreliable news articles is much more challenging. The
credibility of many online sources is often opaque, with AI-generated content
being easily disseminated at minimal cost. Unreliable news articles,
particularly those that followed the Russian invasion of Ukraine in 2022,
closely mimic the topics and writing styles of credible sources, making them
difficult to distinguish. To address this, we introduce SemCAFE, a system
designed to detect news reliability by incorporating entity relatedness into
its assessment. SemCAFE employs standard Natural Language Processing
techniques, such as boilerplate removal and tokenization, alongside entity
level semantic analysis using the YAGO knowledge base. By creating a semantic
fingerprint for each news article, SemCAFE assessed the credibility of
46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of
Ukraine. Our approach improved the macro F1 score by 12% over state-of-the-art
methods. The sample data and code are available on GitHub.
♻ ☆ Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
While Multimodal Large Language Models (MLLMs) excel at general
vision-language tasks, visuospatial cognition - reasoning about spatial
layouts, relations, and dynamics - remains a significant challenge. Existing
models often lack the necessary architectural components and specialized
training data for fine-grained spatial understanding. We introduce ViCA2
(Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial
reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP
for semantics and Hiera for spatial structure, coupled with a token ratio
control mechanism for efficiency. We also developed ViCA-322K, a new
large-scale dataset with over 322,000 spatially grounded question-answer pairs
for targeted instruction tuning. On the challenging VSI-Bench benchmark, our
ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly
surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and
leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the
effectiveness of our approach in achieving strong visuospatial intelligence
with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset
to facilitate further research.
comment: 26 pages, 19 figures, 4 tables
♻ ☆ Visuospatial Cognitive Assistant
Video-based spatial cognition is vital for robotics and embodied AI but
challenges current Vision-Language Models (VLMs). This paper makes two key
contributions. First, we introduce ViCA (Visuospatial Cognitive
Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor
videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D
metadata-grounded queries and video-based complex reasoning. Second, we develop
ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all
eight VSI-Bench tasks, outperforming existing models, including larger ones
(e.g., +26.1 on Absolute Distance). For interpretability, we present
ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune
ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial
reasoning. Our work highlights the importance of targeted data and suggests
paths for improved temporal-spatial modeling. We release all resources to
foster research in robust visuospatial intelligence.
comment: 31 pages, 10 figures, 6 tables
♻ ☆ Are Economists Always More Introverted? Analyzing Consistency in Persona-Assigned LLMs EMNLP 2025
Personalized Large Language Models (LLMs) are increasingly used in diverse
applications, where they are assigned a specific persona - such as a happy high
school teacher - to guide their responses. While prior research has examined
how well LLMs adhere to predefined personas in writing style, a comprehensive
analysis of consistency across different personas and task types is lacking. In
this paper, we introduce a new standardized framework to analyze consistency in
persona-assigned LLMs. We define consistency as the extent to which a model
maintains coherent responses when assigned the same persona across different
tasks and runs. Our framework evaluates personas across four different
categories (happiness, occupation, personality, and political stance) spanning
multiple task dimensions (survey writing, essay generation, social media post
generation, single turn, and multi-turn conversations). Our findings reveal
that consistency is influenced by multiple factors, including the assigned
persona, stereotypes, and model design choices. Consistency also varies across
tasks, increasing with more structured tasks and additional context. All code
is available on GitHub.
comment: Accepted to EMNLP 2025 findings
♻ ☆ When Large Language Models Meet Speech: A Survey on Integration Approaches ACL 2025
Recent advancements in large language models (LLMs) have spurred interest in
expanding their application beyond text-based tasks. A large number of studies
have explored integrating other modalities with LLMs, notably speech modality,
which is naturally related to text. This paper surveys the integration of
speech with LLMs, categorizing the methodologies into three primary approaches:
text-based, latent-representation-based, and audio-token-based integration. We
also demonstrate how these methods are applied across various speech-related
applications and highlight the challenges in this field to offer inspiration
for future research.
comment: Accepted at Findings of ACL 2025 (Long Paper)
♻ ☆ Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models EMNLP 2025
Kaiyan Chang, Yonghao Shi, Chenglong Wang, Hang Zhou, Chi Hu, Xiaoqian Liu, Yingfeng Luo, Yuan Ge, Tong Xiao, Jingbo Zhu
Test-Time Scaling (TTS) is a promising approach to progressively elicit the
model's intelligence during inference. Recently, training-based TTS methods,
such as continued reinforcement learning (RL), have further surged in
popularity, while training-free TTS methods are gradually fading from
prominence. However, the additional computation overhead of training amplifies
the burden on test-time scaling. In this paper, we focus on training-free TTS
methods for reasoning. We first design Conditional Step-level Self-refinement,
a fine-grained sequential scaling method guided by process verification. On top
of its effectiveness, we further combine it with other classical parallel
scaling methods at the step level, to introduce a novel inference paradigm
called Hybrid Test-Time Scaling. Extensive experiments on five
instruction-tuned LLMs across different scales (3B-14B) and families
demonstrate that a hybrid strategy incorporating various training-free TTS
methods at a fine granularity has considerable potential for expanding the
reasoning performance boundaries of LLMs.
comment: Accepted by EMNLP 2025. Code: https://github.com/Lucky-259/Hybrid_TTS
♻ ☆ MedGellan: LLM-Generated Medical Guidance to Support Physicians
Medical decision-making is a critical task, where errors can result in
serious, potentially life-threatening consequences. While full automation
remains challenging, hybrid frameworks that combine machine intelligence with
human oversight offer a practical alternative. In this paper, we present
MedGellan, a lightweight, annotation-free framework that uses a Large Language
Model (LLM) to generate clinical guidance from raw medical records, which is
then used by a physician to predict diagnoses. MedGellan uses a
Bayesian-inspired prompting strategy that respects the temporal order of
clinical data. Preliminary experiments show that the guidance generated by the
LLM with MedGellan improves diagnostic performance, particularly in recall and
$F_1$ score.
♻ ☆ Training LLMs to be Better Text Embedders through Bidirectional Reconstruction EMNLP 2025
Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
Large language models (LLMs) have increasingly been explored as powerful text
embedders. Existing LLM-based text embedding approaches often leverage the
embedding of the final token, typically a reserved special token such as [EOS].
However, these tokens have not been intentionally trained to capture the
semantics of the whole context, limiting their capacity as text embeddings,
especially for retrieval and re-ranking tasks. We propose to add a new training
stage before contrastive learning to enrich the semantics of the final token
embedding. This stage employs bidirectional generative reconstruction tasks,
namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based
Document-to-Query), which interleave to anchor the [EOS] embedding and
reconstruct either side of Query-Document pairs. Experimental results
demonstrate that our additional training stage significantly improves LLM
performance on the Massive Text Embedding Benchmark (MTEB), achieving new
state-of-the-art results across different LLM base models and scales.
comment: accepted by EMNLP 2025 Main Conference
♻ ☆ Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation EMNLP 2025
We introduce Debate Speech Evaluation as a novel and challenging benchmark
for assessing LLM judges. Evaluating debate speeches requires a deep
understanding of the speech at multiple levels, including argument strength and
relevance, the coherence and organization of the speech, the appropriateness of
its style and tone, and so on. This task involves a unique set of cognitive
abilities that previously received limited attention in systematic LLM
benchmarking. To explore such skills, we leverage a dataset of over 600
meticulously annotated debate speeches and present the first in-depth analysis
of how state-of-the-art LLMs compare to human judges on this task. Our findings
reveal a nuanced picture: while larger models can approximate individual human
judgments in some respects, they differ substantially in their overall judgment
behavior. We also investigate the ability of frontier LLMs to generate
persuasive, opinionated speeches, showing that models may perform at a human
level on this task.
comment: EMNLP 2025. Code:
https://github.com/noy-sternlicht/Debatable-Intelligence
♻ ☆ Modelling Intertextuality with N-gram Embeddings
Intertextuality is a central tenet in literary studies. It refers to the
intricate links between literary texts that are created by various types of
references. This paper proposes a new quantitative model of intertextuality to
enable scalable analysis and network-based insights: compare the embeddings of
n-grams from two texts pairwise and average the similarities as the overall
intertextuality score. Validation on four texts with known
degrees of intertextuality, alongside a scalability test on 267 diverse texts,
demonstrates the method's effectiveness and efficiency. Network analysis
further reveals centrality and community structures, affirming the approach's
success in capturing and quantifying intertextual relationships.
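The measure described above, average pairwise similarity between the n-gram embeddings of two texts, can be sketched as follows. Real n-gram embeddings would come from a trained model; here a deterministic hash-based embedder stands in so the example runs on its own.

import hashlib
import numpy as np

# Sketch of the n-gram intertextuality score: embed every n-gram of each text, compute
# all pairwise cosine similarities, and average them.
def ngrams(text, n=3):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def embed(ngram, dim=64):
    # Deterministic stand-in for a real embedding model.
    seed = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % (2**32)
    vec = np.random.default_rng(seed).normal(size=dim)
    return vec / np.linalg.norm(vec)

def intertextuality(text_a, text_b, n=3):
    ea = np.array([embed(g) for g in ngrams(text_a, n)])
    eb = np.array([embed(g) for g in ngrams(text_b, n)])
    if len(ea) == 0 or len(eb) == 0:
        return 0.0
    return float((ea @ eb.T).mean())       # average pairwise cosine similarity

a = "in the beginning was the word"
b = "in the beginning god created the heaven"
print(round(intertextuality(a, b), 3))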
♻ ☆ Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation
Hongyan Xie, Yitong Yao, Yikun Ban, Zixuan Huang, Deqing Wang, Zhenhe Wu, Haoxiang Su, Chao Wang, Shuangyong Song
Large language models (LLMs) excel at reasoning tasks but are expensive to
deploy. Thus, small language models (SLMs) are fine-tuned on chain-of-thought
(CoT) data generated by LLMs to imitate LLMs' abilities. However, these CoT
data may include noisy
rationales that either fail to substantiate the answers or contribute no
additional information to support answer prediction, which leads SLMs to
capture spurious correlations between questions and answers and compromise the
quality of reasoning. In this work, we propose Chain-of-Thought Correctness
Perception Distillation (CoPeD), which aims to improve the reasoning quality of
the student model from the perspectives of task setting and data utilization.
Firstly, we introduce a correctness-aware task setting that encourages the
student model to predict answers based on correct rationales and revise them
when they are incorrect. This setting improves the faithfulness of reasoning
and allows the model to learn from its mistakes. Then, we propose a
Correctness-Aware Weighted loss, which dynamically adjusts the contribution of
each training instance based on the combined loss of the rationale and the
answer. This strategy encourages the model to focus more on samples where the
rationale offers stronger support for the correct answer. Experiments have
shown that CoPeD is effective on both in-distribution (IND) and
out-of-distribution (OOD) benchmark reasoning datasets.
comment: PrePrint
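The Correctness-Aware Weighted loss in the CoPeD abstract is described only at a high level, so the sketch below shows one plausible instantiation: each instance's weight shrinks as the combined rationale-plus-answer loss grows, focusing training on samples whose rationales support the answer. The exponential weighting and temperature are assumptions, not the paper's exact formula.

import numpy as np

# Hypothetical instantiation of a correctness-aware weighted loss: per-sample weights
# decay with the combined rationale + answer loss, so cleaner samples contribute more.
def correctness_aware_loss(rationale_losses, answer_losses, temperature=1.0):
    combined = np.asarray(rationale_losses) + np.asarray(answer_losses)
    weights = np.exp(-combined / temperature)
    weights = weights / weights.sum()              # normalize weights over the batch
    return float((weights * combined).sum()), weights

rationale = [0.2, 1.5, 0.4]   # e.g., token-level NLL of each rationale
answer = [0.1, 2.0, 0.3]      # NLL of each final answer
loss, w = correctness_aware_loss(rationale, answer)
print(round(loss, 3), np.round(w, 3))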
♻ ☆ Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA
Chinese information extraction (IE) involves multiple tasks across diverse
temporal domains, including Classical and Modern documents. Fine-tuning a
single model on heterogeneous tasks and across different eras may lead to
interference and reduced performance. Therefore, in this paper, we propose
Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with
a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in
different IE tasks and eras, while a task-era-aware router mechanism
dynamically allocates expert contributions. Experiments show that Tea-MOELoRA
outperforms both single-task and joint LoRA baselines, demonstrating its
ability to leverage task and temporal knowledge effectively.
comment: 9 pages, 3 figures
♻ ☆ Linearly Controlled Language Generation with Performative Guarantees
The increasing prevalence of Large Language Models (LMs) in critical
applications highlights the need for controlled language generation strategies
that are not only computationally efficient but that also enjoy performance
guarantees. To achieve this, we use a common model of concept semantics as
linearly represented in an LM's latent space. In particular, we take the view
that natural language generation traces a trajectory in this continuous
semantic space, realized by the language model's hidden activations. This view
permits a control-theoretic treatment of text generation in latent space, in
which we propose a lightweight, gradient-free intervention that dynamically
steers trajectories away from regions corresponding to undesired meanings. In
particular, we propose to directly intervene on the activations of the token
that is being generated in embedding space in an online fashion. Crucially, we do
not simply steer activations towards a desirable region. Instead, our method
relies on classical techniques from control theory to precisely control
activations in a context-dependent way, and guarantees that they are brought
into a specific pre-defined region of embedding space that corresponds to
allowed semantics. Our intervention is computed in closed-form according to an
optimal controller formulation, minimally impacting generation time. This
control of the activations in embedding space allows for fine-grained steering
of attributes of the generated sequence. We demonstrate the effectiveness of
our approach on different objectives -- toxicity avoidance and sentiment
control -- while maintaining text quality.
comment: Under review
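The closed-form, gradient-free intervention described above can be illustrated in its simplest case: a concept represented by a linear direction w, an allowed region {h : w.h <= b}, and a minimal-norm correction applied only when a hidden state would leave that region. The single-constraint projection below illustrates the control idea; it is not the paper's full optimal-controller formulation.

import numpy as np

# Minimal-norm projection of a hidden activation into the allowed half-space
# {h : w @ h <= b}. States already inside the region are untouched; otherwise a
# closed-form correction moves the state onto the boundary.
def steer_activation(h, w, b):
    h, w = np.asarray(h, float), np.asarray(w, float)
    violation = w @ h - b
    if violation <= 0:
        return h                          # already inside the allowed region
    return h - (violation / (w @ w)) * w  # closed-form minimal correction

w = np.array([1.0, -1.0, 0.5])   # hypothetical "toxicity" direction in latent space
b = 0.5                          # allowed threshold
h = np.array([2.0, 0.0, 1.0])    # w @ h = 2.5 > b, so the state gets steered
h_new = steer_activation(h, w, b)
print(np.round(h_new, 3), round(float(w @ h_new), 3))   # constraint now active at b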
♻ ☆ Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects EMNLP 2025
While text-based emotion recognition methods have achieved notable success,
real-world dialogue systems often demand a more nuanced emotional understanding
than any single modality can offer. Multimodal Emotion Recognition in
Conversations (MERC) has thus emerged as a crucial direction for enhancing the
naturalness and emotional understanding of human-computer interaction. Its goal
is to accurately recognize emotions by integrating information from various
modalities such as text, speech, and visual signals.
This survey offers a systematic overview of MERC, including its motivations,
core tasks, representative methods, and evaluation strategies. We further
examine recent trends, highlight key challenges, and outline future directions.
As interest in emotionally intelligent systems grows, this survey provides
timely guidance for advancing MERC research.
comment: EMNLP 2025 Findings
♻ ☆ M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis EMNLP 2025
Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Yun Xue, Barbara Plank
Aspect-based sentiment analysis (ABSA) is a crucial task in information
extraction and sentiment analysis, aiming to identify aspects with associated
sentiment elements in text. However, existing ABSA datasets are predominantly
English-centric, limiting the scope for multilingual evaluation and research.
To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
domains and 21 languages, making it the most extensive multilingual parallel
dataset for ABSA to date. Our primary focus is on triplet extraction, which
involves identifying aspect terms, aspect categories, and sentiment polarities.
The dataset is constructed through an automatic translation process with human
review to ensure quality. We perform extensive experiments using various
baselines to assess performance and compatibility on M-ABSA. Our empirical
findings highlight that the dataset enables diverse evaluation tasks, such as
multilingual and multi-domain transfer learning, and large language model
evaluation, underscoring its inclusivity and its potential to drive
advancements in multilingual ABSA research.
comment: EMNLP 2025
♻ ☆ Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis EMNLP
Large Language Models (LLMs), despite their remarkable capabilities, are
hampered by hallucinations. A particularly challenging variant, knowledge
overshadowing, occurs when one piece of activated knowledge inadvertently masks
another relevant piece, leading to erroneous outputs even with high-quality
training data. Current understanding of overshadowing is largely confined to
inference-time observations, lacking deep insights into its origins and
internal mechanisms during model training. Therefore, we introduce
PhantomCircuit, a novel framework designed to comprehensively analyze and
detect knowledge overshadowing. By innovatively employing knowledge circuit
analysis, PhantomCircuit dissects the function of key components in the circuit
and how the attention pattern dynamics contribute to the overshadowing
phenomenon and its evolution throughout the training process. Extensive
experiments demonstrate PhantomCircuit's effectiveness in identifying such
instances, offering novel insights into this elusive hallucination and
providing the research community with a new methodological lens for its
potential mitigation.
comment: Accepted by 2025 EMNLP Main
♻ ☆ Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Large Language Models (LLMs) often exhibit significant behavioral shifts when
they perceive a change from a real-world deployment context to a controlled
evaluation setting, a phenomenon known as "evaluation awareness." This
discrepancy poses a critical challenge for AI alignment, as benchmark
performance may not accurately reflect a model's true safety and honesty. In
this work, we systematically quantify these behavioral changes by manipulating
the perceived context of prompts. We introduce a methodology that uses a linear
probe to score prompts on a continuous scale from "test-like" to "deploy-like"
and leverage an LLM rewriting strategy to shift these prompts towards a more
natural, deployment-style context while preserving the original task. Using
this method, we achieved a 30% increase in the average probe score across a
strategic role-playing dataset after rewriting. Evaluating a suite of
state-of-the-art models on these original and rewritten prompts, we find that
rewritten "deploy-like" prompts induce a significant and consistent shift in
behavior. Across all models, we observed an average increase in honest
responses of 5.26% and a corresponding average decrease in deceptive responses
of 12.40%. Furthermore, refusal rates increased by an average of 6.38%,
indicating heightened safety compliance. Our findings demonstrate that
evaluation awareness is a quantifiable and manipulable factor that directly
influences LLM behavior, revealing that models are more prone to unsafe or
deceptive outputs in perceived test environments. This underscores the urgent
need for more realistic evaluation frameworks to accurately gauge true model
alignment before deployment.
♻ ☆ Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD EMNLP 2025
Large Language Models (LLMs) can struggle to balance gullibility to
misinformation and resistance to valid corrections in persuasive dialogues, a
critical challenge for reliable deployment. We introduce DuET-PD (Dual
Evaluation for Trust in Persuasive Dialogues), a framework evaluating
multi-turn stance-change dynamics across dual dimensions: persuasion type
(corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via
SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves
only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions.
Moreover, results reveal a concerning trend of increasing sycophancy in newer
open-source models. To address this, we introduce Holistic DPO, a training
approach balancing positive and negative persuasion examples. Unlike prompting
or resist-only training, Holistic DPO enhances both robustness to
misinformation and receptiveness to corrections, improving
Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts
from 4.21% to 76.54%. These contributions offer a pathway to developing more
reliable and adaptable LLMs for multi-turn dialogue. Code is available at
https://github.com/Social-AI-Studio/DuET-PD.
comment: To appear at EMNLP 2025
♻ ☆ PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Lin, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen
Recent advancements in large multimodal models (LMMs) have leveraged
extensive multimodal datasets to enhance capabilities in complex
knowledge-driven tasks. However, persistent challenges in perceptual and
reasoning errors limit their efficacy, particularly in interpreting intricate
visual data and deducing multimodal relationships. To address these issues, we
introduce PIN (Paired and INterleaved multimodal documents), a novel data
format designed to foster a deeper integration of visual and textual knowledge.
The PIN format uniquely combines semantically rich Markdown files, which
preserve fine-grained textual structures, with holistic overall images that
capture the complete document layout. Following this format, we construct and
release two large-scale, open-source datasets: PIN-200M (~200 million
documents) and PIN-14M (~14 million), compiled from diverse web and scientific
sources in both English and Chinese. To maximize usability, we provide detailed
statistical analyses and equip the datasets with quality signals, enabling
researchers to easily filter and select data for specific tasks. Our work
provides the community with a versatile data format and substantial resources,
offering a foundation for new research in pre-training strategies and the
development of more powerful knowledge-intensive LMMs.
comment: Technical report v1.0
♻ ☆ OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models EMNLP 25
Large language models (LLMs) trained over extensive corpora risk memorizing
sensitive, copyrighted, or toxic content. To address this, we propose
\textbf{OBLIVIATE}, a robust unlearning framework that removes targeted data
while preserving model utility. The framework follows a structured process:
extracting target tokens, building retain sets, and fine-tuning with a tailored
loss function comprising three components -- masking, distillation, and world
fact. Using low-rank adapters (LoRA) ensures efficiency without compromising
unlearning quality. We conduct experiments on multiple datasets, including
Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics:
\emph{forget quality} (via a new document-level memorization score),
\emph{model utility}, and \emph{fluency}. Results demonstrate its effectiveness
in resisting membership inference attacks, minimizing the impact on retained
data, and maintaining robustness across diverse scenarios.
comment: To appear at EMNLP 25 main conference
♻ ☆ MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs ACL2025
As large language models continue to scale up, knowledge editing techniques
that modify models' internal knowledge without full retraining have gained
significant attention. MEMIT, a prominent batch editing algorithm, stands out
for its capability to perform mass knowledge modifications. However, we uncover
that MEMIT's editing efficacy significantly deteriorates when processing
batches containing multiple edits sharing the same subject. Our analysis
reveals this stems from MEMIT's key-value modeling framework: identical keys
(derived from the shared subject) are forced to represent different values
(corresponding to different knowledge), resulting in update conflicts during
editing. To address this issue, we propose MEMIT-Merge, an enhanced approach
that merges value computation processes for facts sharing the same subject,
effectively resolving the performance degradation in same-subject batch editing
scenarios. Experimental results demonstrate that when MEMIT's edit success rate
drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate
exceeding 90%, showcasing remarkable robustness to subject entity collisions.
The code is available at https://github.com/NUSTM/MEMIT-Merge.
comment: Accepted by ACL2025 findings
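To make the key-value conflict concrete, a toy sketch of the merging idea is shown below: edits that share a subject are collapsed into a single value target rather than competing per-edit updates. The averaging is only a stand-in; MEMIT-Merge's actual merge operates inside MEMIT's value-solving step.

```python
# Toy illustration of merging value targets for same-subject edits.
# The mean is a placeholder for MEMIT-Merge's merged value computation.
from collections import defaultdict
import numpy as np

def merge_same_subject_edits(edits):
    """edits: list of (subject, value_vector) pairs; returns one merged
    value vector per subject instead of conflicting per-edit targets."""
    grouped = defaultdict(list)
    for subject, value_vec in edits:
        grouped[subject].append(np.asarray(value_vec, dtype=float))
    return {subject: np.mean(vecs, axis=0) for subject, vecs in grouped.items()}

edits = [("Paris", [1.0, 0.0]), ("Paris", [0.0, 1.0]), ("Tokyo", [0.5, 0.5])]
print(merge_same_subject_edits(edits))  # one value target per subject
```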
♻ ☆ CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models EMNLP2025
Large Language Models (LLMs) have achieved remarkable success across various
domains. However, a fundamental question remains: Can LLMs effectively utilize
causal knowledge for prediction and generation? Through empirical studies, we
find that LLMs trained directly on large-scale data often capture spurious
correlations rather than true causal relationships, leading to suboptimal
performance, especially in out-of-distribution (OOD) scenarios. To address this
challenge, we propose Causal Attention Tuning (CAT), a novel approach that
injects fine-grained causal knowledge into the attention mechanism. We design
an automated pipeline that leverages human priors to generate token-level
causal signals, and we introduce the Re-Attention mechanism to guide
training, helping the model focus on causal structures while mitigating noise
and biases in attention scores. Experimental results on our proposed Spurious
Token Game (STG) benchmark and multiple downstream tasks demonstrate that our
approach effectively leverages causal knowledge for prediction and remains
robust in OOD scenarios. CAT achieves an average improvement of 5.76% on
the STG dataset and 1.56% on downstream tasks. Notably, the OOD performance of
the Llama-3.1-8B model on STG_M increased from 64.5% to 90.5%, and Qwen's OOD
performance on the STG_H dataset improved from 25.4% to 55.9%. Implementation
details can be found at https://github.com/Kairong-Han/CAT.
comment: Accepted to EMNLP2025 Main conference
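A hedged sketch of a re-attention-style adjustment, in which attention logits are boosted on tokens carrying a token-level causal signal before normalization; the additive bias form and its strength are assumptions rather than the paper's exact mechanism, which is defined during training:

```python
# Hedged sketch: re-weight attention toward tokens flagged by a token-level
# causal signal, then renormalize. The bias form is an assumption.
import torch

def reattention(scores, causal_token_mask, bias=2.0):
    """scores: (batch, heads, q_len, k_len) raw attention logits;
    causal_token_mask: (batch, k_len), 1.0 on tokens carrying causal signal."""
    boost = bias * causal_token_mask[:, None, None, :]   # broadcast over heads and queries
    return torch.softmax(scores + boost, dim=-1)

scores = torch.randn(1, 2, 4, 4)
mask = torch.tensor([[0., 1., 0., 1.]])
probs = reattention(scores, mask)
print(probs.sum(-1))  # rows still sum to 1 after re-weighting
```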
♻ ☆ LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding EMNLP 2025
Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li
Recent progress in Large Language Models (LLMs) has opened new avenues for
solving complex optimization problems, including Neural Architecture Search
(NAS). However, existing LLM-driven NAS approaches rely heavily on prompt
engineering and domain-specific tuning, limiting their practicality and
scalability across diverse tasks. In this work, we propose LM-Searcher, a novel
framework that leverages LLMs for cross-domain neural architecture optimization
without the need for extensive domain-specific adaptation. Central to our
approach is NCode, a universal numerical string representation for neural
architectures, which enables cross-domain architecture encoding and search. We
also reformulate the NAS problem as a ranking task, training LLMs to select
high-performing architectures from candidate pools using instruction-tuning
samples derived from a novel pruning-based subspace sampling strategy. Our
curated dataset, encompassing a wide range of architecture-performance pairs,
encourages robust and transferable learning. Comprehensive experiments
demonstrate that LM-Searcher achieves competitive performance in both in-domain
(e.g., CNNs for image classification) and out-of-domain (e.g., LoRA
configurations for segmentation and generation) tasks, establishing a new
paradigm for flexible and generalizable LLM-based architecture search. The
datasets and models will be released at https://github.com/Ashone3/LM-Searcher.
comment: EMNLP 2025 Main
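A toy sketch of the general idea behind a unified numerical encoding combined with a ranking-style query; the field layout and prompt wording below are invented for illustration and do not reproduce NCode:

```python
# Illustrative sketch: flatten an architecture's hyperparameters into a
# numerical string and pose candidate selection as a ranking query.
def encode_architecture(arch):
    """arch: dict of hyperparameters -> a flat numerical string."""
    keys = sorted(arch)                      # fixed field order for comparability
    return " ".join(str(arch[k]) for k in keys)

candidates = [
    {"depth": 12, "width": 256, "kernel": 3},
    {"depth": 18, "width": 128, "kernel": 5},
]
encoded = [encode_architecture(a) for a in candidates]
prompt = ("Rank the following architectures from best to worst:\n"
          + "\n".join(f"[{i}] {e}" for i, e in enumerate(encoded)))
print(prompt)
```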
♻ ☆ Benchmarking for Domain-Specific LLMs: A Case Study on Academia and Beyond EMNLP2025
The increasing demand for domain-specific evaluation of large language models
(LLMs) has led to the development of numerous benchmarks. These efforts often
adhere to the principle of data scaling, relying on large corpora or extensive
question-answer (QA) sets to ensure broad coverage. However, the impact of
corpus and QA set design on the precision and recall of domain-specific LLM
performance remains poorly understood. In this paper, we argue that data
scaling is not always the optimal principle for domain-specific benchmark
construction. Instead, we introduce Comp-Comp, an iterative benchmarking
framework grounded in the principles of comprehensiveness and compactness.
Comprehensiveness ensures semantic recall by covering the full breadth of the
domain, while compactness improves precision by reducing redundancy and noise.
To demonstrate the effectiveness of our approach, we present a case study
conducted at a renowned university, resulting in the creation of
PolyBench, a large-scale, high-quality academic benchmark. Although this study
focuses on academia, the Comp-Comp framework is domain-agnostic and readily
adaptable to a wide range of specialized fields. The source code and datasets
can be accessed at https://github.com/Anya-RB-Chen/COMP-COMP.
comment: Accepted by EMNLP2025 Findings
♻ ☆ Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism
Zhiwei Wang, Yunji Wang, Zhongwang Zhang, Zhangchen Zhou, Hui Jin, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Yaoyu Zhang, Zhi-Qin John Xu
Large language models have consistently struggled with complex reasoning
tasks, such as mathematical problem-solving. Investigating the internal
reasoning mechanisms of these models can help us design better model
architectures and training strategies, ultimately enhancing their reasoning
capability. In this study, we constructed a symbolic multi-step reasoning task
to investigate the information propagation mechanisms in Transformer models
when solving the task through direct answering and Chain-of-Thought (CoT)
reasoning. We introduced the concept of buffer mechanism: the model stores
various information in distinct buffers and selectively extracts it through the
query-key matrix. We proposed a random matrix-based algorithm to enhance the
model's reasoning ability. This algorithm introduces only 132 trainable
parameters, yet leads to significant performance improvements on 7 multi-step
reasoning datasets, including PrOntoQA, LogicAsker, and LogicInference. These
findings provide new insights into the internal mechanisms of large language models.
♻ ☆ SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
Equipping large language models (LLMs) with complex, interleaved reasoning
and tool-use capabilities has become a key focus in agentic AI research,
especially with recent advances in reasoning-oriented (``thinking'') models.
Such capabilities are key to unlocking a number of important applications. One
such application is Deep Research (DR), which requires extensive search and
reasoning over many sources. Our work in this paper focuses on the development
of native Autonomous Single-Agent models for DR featuring minimal web crawling
and Python tool integration. Unlike multi-agent systems, where agents take up
pre-defined roles and are told what to do at each step in a static workflow, an
autonomous single agent determines its next action dynamically based on
context, without manual direction. While prior work has proposed training
recipes for base or instruction-tuned LLMs, we focus on continual reinforcement
learning (RL) of reasoning-optimized models to further enhance agentic skills
while preserving reasoning ability. Towards this end, we propose a simple RL
recipe with entirely synthetic data, which we apply to various open-source
LLMs. Our best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam
benchmark. In addition, we conduct key analysis experiments to provide more
insights into our methodologies.
comment: Technical Report
♻ ☆ Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
We introduce $\texttt{StorySim}$, a programmable framework for synthetically
generating stories to evaluate the theory of mind (ToM) and world modeling (WM)
capabilities of large language models (LLMs). Unlike prior benchmarks that may
suffer from contamination in pretraining data, $\texttt{StorySim}$ produces
novel, compositional story prompts anchored by a highly controllable
$\texttt{Storyboard}$, enabling precise manipulation of character perspectives
and events. We use this framework to design first- and second-order ToM tasks
alongside WM tasks that control for the ability to track and model mental
states. Our experiments across a suite of state-of-the-art LLMs reveal that
most models perform better on WM tasks than on ToM tasks, and that models
reason more accurately about humans than about inanimate objects.
Additionally, our framework enabled us to find evidence of heuristic behavior
such as recency bias and an over-reliance on earlier events in the story. All
code for generating data and evaluations is freely available.
comment: 12 pages, 11 figures
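A toy sketch of a storyboard-driven setup in the spirit of the abstract: events carry per-character visibility, so a first-order ToM question can be answered from what a character actually witnessed. The field names and question template below are assumptions, not the StorySim API:

```python
# Toy storyboard: each event records an object location and which characters
# observe it, so the "believed" location differs from the true one when a
# character misses an event. Names and structure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Event:
    description: str
    object_location: str
    witnesses: set  # characters who observe this event

storyboard = [
    Event("Anna puts the key in the drawer.", "drawer", {"Anna", "Ben"}),
    Event("Ben moves the key to the shelf.", "shelf", {"Ben"}),  # Anna absent
]

def believed_location(character, events):
    """Last location the character actually witnessed."""
    loc = None
    for ev in events:
        if character in ev.witnesses:
            loc = ev.object_location
    return loc

story = " ".join(ev.description for ev in storyboard)
question = "Where does Anna think the key is?"
print(story, question, "->", believed_location("Anna", storyboard))  # drawer
```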
♻ ☆ RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Remote sensing is critical for disaster monitoring, yet existing datasets
lack temporal image pairs and detailed textual annotations. While
single-snapshot imagery dominates current resources, it fails to capture
dynamic disaster impacts over time. To address this gap, we introduce the
Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark
comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods,
wildfires, and more) paired with rich, human-like change captions. By bridging
the temporal and semantic divide in remote sensing data, RSCC enables robust
training and evaluation of vision-language models for disaster-aware
bi-temporal understanding. Our results highlight RSCC's ability to facilitate
detailed disaster-related analysis, paving the way for more accurate,
interpretable, and scalable vision-language applications in remote sensing.
Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
comment: under review
♻ ☆ The Model Hears You: Audio Language Model Deployments Should Consider the Principle of Least Privilege
The latest Audio Language Models (Audio LMs) process speech directly instead
of relying on a separate transcription step. This shift preserves detailed
information, such as intonation or the presence of multiple speakers, that
would otherwise be lost in transcription. However, it also introduces new
safety risks, including the potential misuse of speaker identity cues and other
sensitive vocal attributes, which could have legal implications. In this paper,
we urge a closer examination of how these models are built and deployed. Our
experiments show that end-to-end modeling, compared with cascaded pipelines,
creates socio-technical safety risks such as identity inference, biased
decision-making, and emotion detection. This raises concerns about whether
Audio LMs store voiceprints and function in ways that create uncertainty under
existing legal regimes. We then argue that the Principle of Least Privilege
should be considered to guide the development and deployment of these models.
Specifically, evaluations should assess (1) the privacy and safety risks
associated with end-to-end modeling; and (2) the appropriate scope of
information access. Finally, we highlight related gaps in current audio LM
benchmarks and identify key open research questions, both technical and
policy-related, that must be addressed to enable the responsible deployment of
end-to-end Audio LMs.
comment: Published at AIES 2025