According to Michael Halliday, language is not just a system of rules but a tool for meaning-making within sociocultural contexts, whereby language choices shape the functions of a text. We employ Juliane House’s Translation Quality Assessment model, inspired by Halliday’s Systemic Functional Linguistics, to assess Machine Translation (MT) at the document level, introducing a novel approach titled FALCON (Functional Assessment of Language and COntextuality in Narratives). FALCON is a skill-specific evaluation framework that offers a holistic view of document-level translation phenomena with fine-grained context-knowledge annotation. Rather than concentrating on textual quality, our approach explores the discourse quality of translation by defining a set of core criteria at the sentence level. To the best of our knowledge, this study represents the first attempt to extend MT evaluation into pragmatics. We revisit the WMT 2024 English-to-X test set, encompassing German, Spanish, and Icelandic, and assess 29 distinct systems across four domains. We present novel yet compelling findings on document-level phenomena, which yield conclusions that differ from those established in existing research. Our findings demonstrate a robust correlation with human assessments, including the ESA gold scores, underscoring the pivotal role of discourse analysis in current MT evaluation.