A Set of Recommendations for Assessing Human–Machine Parity in Language Translation | Journal of Artificial Intelligence Research

Published: Mar 23, 2020

DOI: https://doi.org/10.1613/jair.1.11371

Keywords:

machine translation, natural language

Samuel Läubli

University of Zurich

https://orcid.org/0000-0001-5362-4106

Sheila Castilho

Dublin City University

Graham Neubig

Carnegie Mellon University

Rico Sennrich

University of Edinburgh

Qinlan Shen

Carnegie Mellon University

Antonio Toral

University of Groningen

Abstract

The quality of machine translation has increased remarkably over the past years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese to English news translation, showing that the finding of human–machine parity was owed to weaknesses in the evaluation design—which is currently considered best practice in the field. We show that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations. Our results call for revisiting current best practices to assess strong machine translation systems in general and human–machine parity in particular, for which we offer a set of recommendations based on our empirical findings.

Issue

Vol. 67 (2020)

Section

Articles
