publications | Ahrii Kim

2025

ACL

RUBRIC-MQM : Span-Level LLM-as-judge in Machine Translation For High-End Models

Ahrii Kim

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), Jul 2025


Referred to as LLM-as-judge, a generative large language model (LLM) has demonstrated considerable efficacy as an evaluator in various tasks, including Machine Translation (LAJ-MT), by predicting scores or identifying error types for individual sentences. However, its dependability in practical application has yet to be demonstrated, as there is only an approximated match due to the task’s open-ended nature. To address this problem, we introduce a straightforward and novel meta-evaluation strategy, PromptCUE, and evaluate cutting-edge LAJ-MT models such as GEMBA-MQM. We identify their fundamental deficits, including certain label biases and the inability to assess near-perfect translations. To improve reliability, we investigate more trustworthy and less biased models using multidimensional prompt engineering. Our findings indicate that the combination of span-level error quantification and a rubric-style prompt tailored to the characteristics of LLMs has efficiently addressed the majority of the challenges current LAJ-MT models face. Furthermore, it demonstrates a considerably enhanced alignment with human values. Accordingly, we present Rubric-MQM, the LAJ-MT for high-end models and an updated version of GEMBA-MQM.
preprint

Multi-agentMT: Deploying AI Agent in the WMT25 Shared Task (Accepted at WMT 2025)

Ahrii Kim

TechRxiv, Aug 2025


We introduce our model, referred to as Multi-agentMT, for participation in the WMT 25 General Machine Translation Shared Task. This model operationalizes the notion of an AI Agent by employing a multi-agent workflow known as Prompt Chaining (Briva-Iglesias, 2025) alongside the automatic MQM (Multidimensional Quality Metrics) error annotation framework designated as RUBRIC-MQM (Kim, 2025). Our primary submission is developed through the Translate-Postedit-Proofread paradigm, whereby the positions of the errors are clearly marked and enhanced throughout the process. Our study suggests that a semi-autonomous agent scheme in Machine Translation is viable with an older and smaller model in some language pairs, resulting in comparable results with 2.3x faster speed and only 2% of the budget.
preprint

Context is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation (Accepted at WMT 2025)

Ahrii Kim

TechRxiv, Aug 2025


As sentence-level performance in modern Machine Translation (MT) models reaches a plateau where differences are minimal, there is a growing need for robust document-level evaluation methods. We present a reproducible human evaluation protocol that is structured upon the FALCON framework (Kim, 2025) encompassing pragmatic features. With professional translators as annotators, we investigate the sources of low inter-annotator agreement and identify the primary contributing factors. To address these challenges and align with human values, we propose a comprehensive annotation-rating methodology referred to as H-FALCON. Our experiment shows that, while perfect annotator consensus remains elusive, the proposed scoring scheme achieves equal or higher correlations with traditional sentence-level metrics. Linear regression analysis further reveals that contextual information is inherent in all sentences, contrary to the belief that only a subset requires it, and that previous estimates such as "n% of sentences require context" stem from flawed calculations. Context contributes approximately 10% to the variance of the holistic score in our evaluation, highlighting its universal yet limited influence on MT evaluation. Code will be released.
preprint

FALCON: Holistic Framework for Document-Level Machine Translation Evaluation

Ahrii Kim

TechRxiv, May 2025


As per Michael Halliday, language is not just a system of rules, but a tool for meaning-making within sociocultural contexts, whereby language choices shape the functions of a text. We employ Julian House’s Translation Quality Assessment model inspired by Halliday’s Systemic Functional Linguistics to assess Machine Translation (MT) at the document level, establishing a novel approach titled FALCON (Functional Assessment of Language and COntextuality in Narratives). It is a skill-specific evaluation framework offering a holistic view of document-level translation phenomena with fine-grained context knowledge annotation. Rather than concentrating on the textual quality, our approach explores the discourse quality of translation by defining a set of core criteria on a sentence basis. To the best of our knowledge, this study represents the inaugural attempt to extend MT evaluation into pragmatics. We revisit WMT 2024 with the English-to-X test set encompassing German, Spanish, and Icelandic, assessing 29 distinct systems in four domains. We present groundbreaking but compelling findings concerning document-level phenomena, which yield conclusions that differ from those established in existing research. Emphasizing the pivotal role of discourse analysis in current MT evaluation, our findings demonstrate a robust correlation with human values, inclusive of the ESA gold scores.
preprint

IR_Multi-AgentMT at WMT25 Translation Task: A Summary (Accepted at WMT 2025)

Ahrii Kim

TechRxiv, Jul 2025


We introduce our model, referred to as MULTI-AGENTMT, for participation in the WMT 25 Translation Task. This model operationalizes the notion of an AI Agent by employing a multi-agent workflow known as Prompt Chaining (Briva-Iglesias, 2025) alongside the automatic MQM error annotation framework designated as RUBRIC-MQM (Kim, 2025). Our primary submission is developed through the Translate-Postedit-Proofread paradigm, whereby each stage incrementally enhances the translation output. Our experimental findings indicate the feasibility of implementing a semi-autonomous improvement process in Machine Translation within this framework, yielding superior outcomes with a smaller model at reduced cost.

2023

preprint

The Suboptimal WMT Test Sets and Its Impact on Human Parity

Ahrii Kim, Yunju Bak, Jimin Sun, and 2 more authors

Preprints, Feb 2023


With the advent of Neural Machine Translation, the more often human-machine parity is claimed at WMT, the more we must ask whether its evaluation environment can be trusted. In this paper, we argue that the low quality of the source test set of the news track at WMT may lead to an overrated human parity claim. First, we report nine types of so-called technical contaminants in the data set, originating from an absence of meticulous inspection after web-crawling. Our empirical findings show that when they are corrected, about 5% of the segments that previously achieved a human parity claim turn out to be statistically invalid. This tendency becomes more evident when only the contaminated sentences are considered. To the best of our knowledge, this is the first attempt to question the “source” side of the test set as a potential cause of the overclaim of human parity. As evidence for this phenomenon, we show that, according to sentence-level TER scores, those seemingly trivial errors change a good part of system translations. We conclude that overlooking this issue would be a mistake, especially in NMT evaluation.

2022

ACL

Vacillating Human Correlation of SacreBLEU in Unprotected Languages

Ahrii Kim and Jinhyeon Kim

In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), May 2022


SacreBLEU, by incorporating a text normalizing step in the pipeline, has become a rising automatic evaluation metric in recent MT studies. With agglutinative languages such as Korean, however, the lexical-level metric cannot provide a conceivable result without a customized pre-tokenization. This paper endeavors to examine the influence of diversified tokenization schemes, namely word, morpheme, subword, character, and consonants & vowels (CV), on the metric after its protective layer is peeled off. By performing meta-evaluation with manually constructed into-Korean resources, our empirical study demonstrates that the human correlation of the surface-based metric and other homogeneous ones (as an extension) vacillates greatly by the token type. Moreover, the human correlation of the metric often deteriorates due to some tokenization, with CV one of its culprits. Guiding through the proper usage of tokenizers for the given metric, we discover i) the feasibility of the character tokens and ii) the deficit of CV in the Korean MT evaluation.
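
To make the token types in this abstract concrete, the following is a minimal sketch, not the paper's pipeline, of three of the five schemes (word, character, and CV) using only the Python standard library. The function names and the sample sentence are illustrative assumptions. Morpheme and subword tokenization are omitted because they require an external analyzer or a trained model. The CV split assumes that Unicode NFD normalization, which decomposes a precomposed Hangul syllable into its conjoining consonant and vowel jamo, is an acceptable stand-in for the paper's CV scheme.

```python
import unicodedata


def word_tokens(text: str) -> list[str]:
    """Whitespace-delimited tokens."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """One token per character (a Hangul syllable block), whitespace dropped."""
    composed = unicodedata.normalize("NFC", text)
    return [ch for ch in composed if not ch.isspace()]


def cv_tokens(text: str) -> list[str]:
    """One token per consonant or vowel (jamo), whitespace dropped.

    Assumption: NFD decomposition approximates a CV tokenizer.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return [ch for ch in decomposed if not ch.isspace()]


# Hypothetical sample sentence ("I go to school"), not taken from the paper.
sentence = "나는 학교에 간다"

for name, tokenize in [("word", word_tokens),
                       ("char", char_tokens),
                       ("cv", cv_tokens)]:
    tokens = tokenize(sentence)
    print(name, len(tokens), " ".join(tokens))
```

For this sample sentence the three schemes yield 3, 7, and 17 tokens respectively, which illustrates how strongly the unit count feeding an n-gram metric depends on the tokenizer chosen.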

2020

Journal

Human Evaluation of NMT & Annual Progress Report: A Case Study on Spanish to Korean

Ahrii Kim and Carme Colominas

Revista Tradumàtica. Tecnologies de la Traducció, Dec 2020


This article presents the first evaluation of neural machine translation for the Spanish-Korean language pair. Four human evaluation methods were used: direct assessment, ranking comparison, and analysis of the time and of the effort involved in post-editing the machine-translated text (MTPE), along with one semi-automatic evaluation method. The neural machine translation engine used was Google Translate, specifically in the news domain. After evaluation by six professional translators, the engine is found to increase performance by 78% and productivity by 37%. In addition, 40.249% of the engine's output changed over a 15-month interval, indicating an improvement rate of 11%.