Lifeng Han
  • Kilburn Building, Oxford Road, University of Manchester, England, UK
From the points of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations that meet customer specifications, under harsh constraints on required quality level, tight time-frames and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard (Han et al., 2021b). Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity, or is statistics at play? How can we avoid having the entire text checked and be more efficient with TQE from cost and efficiency perspe...
This paper describes our machine translation evaluation systems used for participation in the WMT13 shared Metrics Task. In the Metrics task, we submitted two automatic MT evaluation systems, nLEPOR_baseline and LEPOR_v3.1. nLEPOR_baseline is an n-gram based, language-independent MT evaluation metric employing the factors of modified sentence length penalty, position difference penalty, n-gram precision and n-gram recall. nLEPOR_baseline measures the similarity of the system output translations and the reference translations only on word sequences. LEPOR_v3.1 is a new version of the LEPOR metric that uses the mathematical harmonic mean to group the factors and employs some linguistic features, such as part-of-speech information. The evaluation results of WMT13 show LEPOR_v3.1 yields the highest average score, 0.86, with human judgments at system level using the Pearson correlation criterion on English-to-other (FR, DE, ES, CS, RU) language pairs.
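The sketch below is a minimal, toy illustration of how LEPOR-style factors (length penalty, position-difference penalty, and a weighted harmonic mean of precision and recall) can be composed into a single score. It is not the exact nLEPOR_baseline or LEPOR_v3.1 formulation; the factor definitions and weights are simplified for readability.

```python
# Illustrative, simplified LEPOR-style scorer (word level, unigram only).
# The real nLEPOR/LEPOR metrics use n-grams, tunable weights and a
# position-difference penalty over aligned word positions.
import math
from collections import Counter

def length_penalty(hyp_len: int, ref_len: int) -> float:
    # Penalise translations that are longer or shorter than the reference.
    if hyp_len == ref_len:
        return 1.0
    if hyp_len < ref_len:
        return math.exp(1 - ref_len / hyp_len)
    return math.exp(1 - hyp_len / ref_len)

def position_difference_penalty(hyp, ref) -> float:
    # Mean normalised position difference of words occurring in both sentences.
    diffs = []
    for i, w in enumerate(hyp):
        if w in ref:
            j = ref.index(w)
            diffs.append(abs(i / len(hyp) - j / len(ref)))
    npd = sum(diffs) / len(hyp) if hyp else 1.0
    return math.exp(-npd)

def harmonic_mean(precision: float, recall: float, alpha=9.0, beta=1.0) -> float:
    # Weighted harmonic mean of precision and recall (recall-weighted here).
    if precision == 0 or recall == 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)

def simple_lepor(hyp: str, ref: str) -> float:
    h, r = hyp.split(), ref.split()
    overlap = sum((Counter(h) & Counter(r)).values())
    precision = overlap / len(h) if h else 0.0
    recall = overlap / len(r) if r else 0.0
    return (length_penalty(len(h), len(r))
            * position_difference_penalty(h, r)
            * harmonic_mean(precision, recall))

print(simple_lepor("the cat sat on a mat", "the cat sat on the mat"))
```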
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium-quality translations. In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecti...
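As an illustration of scoring with a geometric progression of error penalty points, the sketch below maps post-editing error annotations to a segment score. The severity labels, penalty values and maximum score are invented for the example and are not the actual HOPE taxonomy or weights.

```python
# Illustrative EPP scoring: each severity level doubles the penalty
# (a geometric progression 1, 2, 4, ...), and a segment score is derived
# from the total penalty accumulated over post-editing annotations.
SEVERITY_EPP = {"minor": 1, "major": 2, "critical": 4}

def segment_score(error_annotations, max_score=10):
    # error_annotations: list of severity labels produced by a post-editor.
    penalty = sum(SEVERITY_EPP[sev] for sev in error_annotations)
    return max(max_score - penalty, 0)

print(segment_score(["minor", "minor", "major"]))  # -> 6
```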
Human evaluation has always been expensive, while researchers struggle to trust the automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited available human-labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported), which enables automatic tuning of the weighting parameters in the hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR), which uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) for the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards professional human evaluation data based on the MQM and pSQM frameworks on English-German and Chinese-English language pairs. The experimental investigations show cushLEPOR boosts hLEPOR performances towards better agreem...
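A sketch of the optimisation loop described above, under stated assumptions: hlepor_score is a stand-in for the ported hLEPOR scorer (its real parameter names may differ), and trusted_scores is a list of LaBSE similarities or MQM/pSQM human scores to agree with. Optuna's create_study/suggest API is used as documented; the search ranges are illustrative.

```python
# Sketch: tune hLEPOR-style weighting parameters with Optuna so that the
# metric's segment scores correlate better with trusted scores.
import optuna
from scipy.stats import pearsonr

def hlepor_score(hyp, ref, alpha, beta, n, weight_elp, weight_pos, weight_pr):
    ...  # placeholder for the ported hLEPOR scorer

def make_objective(hyps, refs, trusted_scores):
    def objective(trial):
        params = {
            "alpha": trial.suggest_float("alpha", 0.1, 5.0),
            "beta": trial.suggest_float("beta", 0.1, 5.0),
            "n": trial.suggest_int("n", 1, 4),
            "weight_elp": trial.suggest_float("weight_elp", 0.1, 5.0),
            "weight_pos": trial.suggest_float("weight_pos", 0.1, 5.0),
            "weight_pr": trial.suggest_float("weight_pr", 0.1, 5.0),
        }
        scores = [hlepor_score(h, r, **params) for h, r in zip(hyps, refs)]
        corr, _ = pearsonr(scores, trusted_scores)
        return corr  # maximise agreement with the trusted scores
    return objective

study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(hyps, refs, labse_scores), n_trials=200)
```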
Since the 1950s, machine translation (MT) has been one of the important tasks in artificial intelligence research and development, progressing through several periods and stages, including rule-based methods, statistical methods, and the recently proposed neural network-based learning methods. Accompanying these leaps has been the research and development of MT evaluation, whose methods have played an especially important role in statistical and neural MT research. The task of MT evaluation is not only to assess translation quality, but also to give MT researchers timely feedback on the problems of the MT systems themselves and on how to improve and optimise them. In some practical application scenarios, for example when no reference translation is available, MT quality estimation serves as an important indicator of how trustworthy the automatically translated target text is. This report covers a brief history of MT evaluation, a classification of research methods, and frontier advances, including human evaluation, automatic evaluation, and the evaluation of evaluation methods (meta-evaluation). Human and automatic evaluation each include reference-based and reference-free approaches; automatic methods range from traditional string matching to models applying syntax and semantics, and deep learning models; meta-evaluation covers estimating the reliability of human evaluation, of automatic evaluation, and of test sets. Frontier developments include task-based evaluation, models based on large-scale pre-training, and lightweight optimised models built with knowledge distillation.
In neural machine translation (NMT), researchers face the challenge of unseen (or out-of-vocabulary, OOV) word translation. To solve this, some researchers propose splitting western languages such as English and German into sub-words or compounds. In this paper, we try to address this OOV issue and improve NMT adequacy for Chinese, a harder language whose characters are even more sophisticated in composition. We integrate the Chinese radicals into the NMT model with different settings to address the unseen-word challenge in Chinese-to-English translation. This can also be considered the semantic part of the MT system, since Chinese radicals usually carry the essential meaning of the words in which they appear. Meaningful radicals and new characters can be integrated into the NMT systems with our models. We use an attention-based NMT system as a strong baseline system. The experiments on standard Chinese-to-English NIST translation shared task dat...
We introduce the Machine Translation (MT) evaluation survey that contains both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, etc. We classify the automatic evaluation methods into two categories: lexical similarity and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. The deep learning models for evaluation are very newly ...
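As a toy example from the lexical-similarity family mentioned above, the following computes word-level edit distance, the basis of WER/TER-style metrics. It is a generic illustration, not a specific published metric from the survey.

```python
# Toy edit-distance (Levenshtein) between word sequences.
def word_edit_distance(hyp: str, ref: str) -> int:
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)]

print(word_edit_distance("the cat sat on the mat", "a cat sat on a mat"))  # -> 2
```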
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post editing and annotation plus second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing...
Conventional machine translation evaluation metrics tend to perform well on certain language pairs but poorly on others. Furthermore, some evaluation metrics only work on certain language pairs and are not language-independent. Finally, ignoring linguistic information usually leaves the metrics with low correlation with human judgments, while too many linguistic features or external resources make the metrics complicated and difficult to replicate. To address these problems, a novel language-independent evaluation metric is proposed in this work with enhanced factors and optional, but limited, linguistic information (part-of-speech, n-grams). To make the metric perform well on different language pairs, extensive factors are designed to reflect the translation quality, and the assigned parameter weights are tunable according to the special characteristics of the focused language pairs. Experiments show that this novel evaluation metric yields better per...
In neural machine translation (NMT), researchers face the challenge of unseen (or out-of-vocabulary, OOV) word translation. To solve this, some researchers propose splitting western languages such as English and German into sub-words or compounds. In this paper, we try to address this OOV issue and improve NMT adequacy for Chinese, a harder language whose characters are even more sophisticated in composition. We integrate the Chinese radicals into the NMT model with different settings to address the unseen-word challenge in Chinese-to-English translation. This can also be considered the semantic part of the MT system, since Chinese radicals usually carry the essential meaning of the words in which they appear. Meaningful radicals and new characters can be integrated into the NMT systems with our models. We use an attention-based NMT system as a strong baseline system. The experiments on standard Chinese-to-English NIST translation shared task dat...
To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find corresponding closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to ...
Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, including topics such as MWE detection, MWE decomposition, and research investigating the exploitation of MWEs in other NLP fields such as Machine Translation. However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performances on MWE terms in qualitative analysis and better general eva...
Human evaluation has always been expensive, while researchers struggle to trust the automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited available human-labelled scores. We first re-introduce the hLEPOR metric factors, followed by the portable Python version we developed, which enables automatic tuning of the weighting parameters in the hLEPOR metric. Then we present the customised hLEPOR (cushLEPOR), which uses the LaBSE distilled knowledge model to improve the metric's agreement with human judgements by automatically optimising factor weights for the exact MT language pairs that cushLEPOR is deployed to. We also optimise cushLEPOR towards human evaluation data based on the MQM and pSQM frameworks on English-German and Chinese-English language pairs. The experimental investigations show cushLEPOR boosts hLEPOR performances towards better agreements to PLMs like LaBSE with much lower co...
Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals and strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate if the combination of decomposed Multiword Expressions (MWEs) can enhance the model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.
In conventional machine translation evaluation metrics, considering too little information about the translations usually makes the results unreasonable and poorly correlated with human judgments. On the other hand, using many external linguistic resources and tools (e.g. part-of-speech tagging, morphemes, stemming, and synonyms) makes the metrics complicated, time-consuming and not universal, because different languages have different linguistic features. This paper proposes a novel evaluation metric employing rich and augmented factors without relying on any additional resource or tool. Experiments show that this novel metric yields state-of-the-art correlation with human judgments compared with the classic metrics BLEU, TER, Meteor-1.3 and two recent metrics (AMBER and MP4IBM1), which shows it to be robust, employing a feature-rich and model-independent approach.
This paper models patient arrivals at a hospital outpatient clinic as a Poisson stream with rate λ, and the service duration of each server (hospital bed), i.e. the time from admission to discharge, as independent and identically distributed exponential variables with rate μ, building an M/M/r Markov queue and deriving the average number L of patients waiting in the M/M/r queue. A model is then built around key parameters and evaluation indicators, including the bed occupancy rate v (ratio of actual inpatients to total beds), the patient rejection rate h (ratio of patients waiting to patients in the system), the clarity of admission time δ (whether the doctor can tell an outpatient the exact admission date), and the average patient sojourn time Ls (waiting time plus hospitalisation time). Combining these with mathematical statistics and using statistical software for accurate tabulation and computation, supplemented by clear graphical presentation, the questions posed in the problem are solved.
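For reference, the standard M/M/r (Erlang-C) quantities such a model relies on can be written as follows, with arrival rate λ, per-bed service rate μ, r beds, offered load a = λ/μ and utilisation ρ = a/r < 1. These are textbook queueing-theory results, not formulas reproduced from the paper.

```latex
% Probability of an empty system, mean queue length, and mean sojourn time.
\[
P_0 = \left[ \sum_{k=0}^{r-1} \frac{a^{k}}{k!} + \frac{a^{r}}{r!\,(1-\rho)} \right]^{-1},
\qquad
L_q = \frac{a^{r}\rho}{r!\,(1-\rho)^{2}}\, P_0 ,
\qquad
W = \frac{L_q}{\lambda} + \frac{1}{\mu}.
\]
```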
As a pillar industry of the national economy, real estate is closely tied to national economic development and social stability. This paper uses descriptive statistical analysis to examine the development trend of the real estate industry and the supply and demand of commercial housing, applies grey system theory to forecast housing demand and identify the factors with the greatest influence on demand, uses an input-output model to analyse the relationship between the real estate industry and other industries and to simulate the positive impact of state regulation of the financial sector, and, taking Tianjin as an example, uses multiple regression analysis to determine a formula for the average housing price in Tianjin. The highlights of this paper are: 1. multi-dimensional analysis and description in the trend analysis; 2. empirical demonstration, within the input-output model, of the positive effect of national policy by adjusting the direct consumption coefficients.
On industry development trends, the results show: 1. with the rapid development of the national economy, the scale of real estate investment in China has gradually increased; 2. real estate prices and transaction volumes have risen continuously for ten years, with a serious housing price bubble; 3. since 2003 the annual growth of land prices has exceeded that of housing prices by several percentage points, and rising land prices have in turn pushed up housing costs; 4. in the economically developed eastern regions housing prices have risen too fast and speculation is serious, whereas in the less developed central and western regions real estate development remains relatively healthy; 5. the excessively high price-to-income ratio has seriously affected people's quality of life.
On housing supply and demand, the results show: 1. strong demand remains the basic characteristic of China's real estate market at the current stage; 2. from 1999 to 2004 the supply of commercial housing exceeded demand, while after 2005 demand exceeded supply.
On housing demand, the results show: 1. over the next three years, housing demand will be 1,617.231 million, 1,930.606 million and 2,304.703 million square metres respectively; 2. the two factors with the greatest influence on housing demand are the current-year sales price of commercial housing and the per-capita floor area of urban residents.
On the relationship between real estate and other industries, the results show: 1. an increase in the final demand of the real estate industry alone affects the total output of other industries in the national economy, with finance and leasing/business services affected most; 2. the state's regulatory policy towards the financial sector is conducive to the sustainable development of the real estate industry.
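For illustration, a minimal GM(1,1) grey-prediction sketch of the kind used for the housing-demand forecast. The data series and horizon are invented, and the implementation follows the standard textbook GM(1,1) procedure rather than the paper's exact setup.

```python
# Minimal GM(1,1) grey prediction: fit on a short series and extrapolate.
import numpy as np

def gm11_forecast(x0: np.ndarray, steps: int) -> np.ndarray:
    x1 = np.cumsum(x0)                            # accumulated series (1-AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])                 # mean sequence of x1
    B = np.column_stack([-z1, np.ones_like(z1)])
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]   # development / grey coefficients
    k = np.arange(len(x0) + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.diff(x1_hat, prepend=0.0)         # inverse AGO
    x0_hat[0] = x0[0]
    return x0_hat[len(x0):]                       # forecasts beyond observed data

demand = np.array([100.0, 112.0, 126.0, 141.0, 158.0])  # toy demand series
print(gm11_forecast(demand, steps=3))
```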
With the rapid development of internet technologies, social networks, and other related areas, user authentication becomes more and more important to protect the data of users. Password authentication is one of the widely used methods to achieve authentication for legal users and defense against intruders. Many password cracking methods have been developed during the past years, and people have been designing countermeasures against password cracking all the time. However, we find that little survey work has been done on password cracking research. This paper mainly gives a brief review of the password cracking methods, important technologies of password cracking, and the countermeasures against password cracking, which are usually designed at two stages, including the password design stage (e.g. user education, dynamic passwords, use of tokens, computer generation) and after the design (e.g. reactive password checking, proactive password checking, passwo...
This paper introduces research on Chinese named entity recognition (CNER), covering person, organization and location names. Unlike conventional approaches, which usually focus on the algorithms used and discuss the CNER problem itself less, this paper first studies the characteristics of Chinese and discusses different feature sets; a promising comparison result is then shown with optimized features and a concise model. Furthermore, the performance of various features and algorithms employed by other researchers is analyzed. To facilitate further research, this paper provides some formal definitions of the issues in CNER together with potential solutions. Following the SIGHAN bakeoffs, the experiments are performed on the closed track, but the problems of the open-track tasks are also discussed.
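A minimal sketch of a character-level CRF tagger for Chinese NER using sklearn-crfsuite. The window features, toy sentence and BIO labels are illustrative only and are not the optimized feature set studied in the paper.

```python
# Character-based CRF tagger sketch: each character gets a small window of
# surface features; labels follow a BIO scheme (B-PER, I-PER, B-LOC, ...).
import sklearn_crfsuite

def char_features(sent, i):
    feats = {"char": sent[i], "is_digit": sent[i].isdigit()}
    if i > 0:
        feats["prev_char"] = sent[i - 1]
        feats["prev_bigram"] = sent[i - 1] + sent[i]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1]
        feats["next_bigram"] = sent[i] + sent[i + 1]
    return feats

train_sents = ["韩寒在上海"]                               # toy sentence
train_labels = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]

X = [[char_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)
print(crf.predict(X))
```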
With the rapid development of machine translation (MT), MT evaluation has become very important for telling us in a timely manner whether an MT system makes any progress. Conventional MT evaluation methods tend to calculate the similarity between hypothesis translations offered by automatic translation systems and reference translations offered by professional translators. There are several weaknesses in existing evaluation metrics. Firstly, incomprehensive factor design results in a language-bias problem, meaning they perform well on some special language pairs but poorly on other language pairs. Secondly, they tend to use either no linguistic features or too many: using none draws a lot of criticism from linguists, while using too many makes the model weak in repeatability. Thirdly, the employed reference translations are very expensive and sometimes not available in practice. In this paper, the authors propose an unsupervise...
Verbal multiword expression (VMWE) identification can be addressed successfully as a sequence labelling problem via conditional random fields (CRFs) by returning the one label sequence with maximal probability. This work describes a system that reranks the top 10 most likely CRF candidate VMWE sequences using a decision tree regression model. The reranker aims to operationalise the intuition that a non-compositional MWE can have a different distributional behaviour from that of its constituent words. This is why it uses semantic features based on comparing the context vector of a candidate expression against those of its constituent words. However, not all VMWEs are non-compositional, and analysis shows that non-semantic features also play an important role in the behaviour of the reranker. In fact, the analysis shows that the combination of the sequential approach of the CRF component with the context-based approach of the reranker is the main factor of improvement: our reranker achieves a 12% macro-average F1-score improvement over the basic CRF method, as measured using data from the PARSEME shared task on VMWE identification.
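A sketch of the reranking step under stated assumptions: the candidate features below (CRF probability, VMWE token count, a context-vector cosine) are placeholders standing in for the paper's semantic and non-semantic features, and the decision-tree regressor is trained on candidates with known quality scores.

```python
# Rerank the top-k CRF label sequences with a decision-tree regressor that
# predicts a sequence-quality score (e.g. F1 against gold), keeping the best.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def candidate_features(candidate):
    # Placeholder features for a candidate label sequence.
    return [candidate["crf_prob"], candidate["n_vmwe_tokens"], candidate["ctx_cosine"]]

def rerank(candidates, reranker):
    feats = np.array([candidate_features(c) for c in candidates])
    scores = reranker.predict(feats)
    return candidates[int(np.argmax(scores))]

# Train the reranker on candidates with known quality scores (toy data).
train_candidates = [
    {"crf_prob": 0.9, "n_vmwe_tokens": 2, "ctx_cosine": 0.2, "f1": 1.0},
    {"crf_prob": 0.8, "n_vmwe_tokens": 0, "ctx_cosine": 0.7, "f1": 0.4},
    {"crf_prob": 0.5, "n_vmwe_tokens": 3, "ctx_cosine": 0.1, "f1": 0.8},
]
X = np.array([candidate_features(c) for c in train_candidates])
y = np.array([c["f1"] for c in train_candidates])
reranker = DecisionTreeRegressor(max_depth=3).fit(X, y)

print(rerank(train_candidates, reranker))
```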
In neural machine translation (NMT), researchers face the challenge of unseen (or out-of-vocabulary, OOV) word translation. To solve this, some researchers propose splitting western languages such as English and German into sub-words or compounds. In this paper, we try to address this OOV issue and improve NMT adequacy for Chinese, a harder language whose characters are even more sophisticated in composition. We integrate the Chinese radicals into the NMT model with different settings to address the unseen-word challenge in Chinese-to-English translation. This can also be considered the semantic part of the MT system, since Chinese radicals usually carry the essential meaning of the words in which they appear. Meaningful radicals and new characters can be integrated into the NMT systems with our models. We use an attention-based NMT system as a strong baseline system. The experiments on the standard Chinese-to-English NIST translation shared task data from 2006 and 2008 show that our designed models outperform the baseline model on a wide range of state-of-the-art evaluation metrics, including LEPOR, BEER, and CharacTER, in addition to the traditional BLEU and NIST scores, especially on adequacy-level translation. We also have some interesting findings from the results of our various experimental settings about the performance of words and characters in Chinese NMT, which differs from other languages. For instance, fully character-level NMT may perform very well, or even at the state of the art, in some other languages, as researchers have recently demonstrated; however, in the Chinese NMT model, word-boundary knowledge is important for model learning.
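A sketch of the kind of preprocessing such a model relies on: augmenting Chinese tokens with their characters' radicals before training. The tiny radical table and the token|radicals output format are illustrative stand-ins, not the paper's exact integration settings.

```python
# Augment each Chinese token with its characters' radicals before feeding
# the sequence to an attention-based NMT system.
RADICALS = {"湖": "氵", "泊": "氵", "树": "木", "林": "木", "妈": "女"}  # toy table

def add_radicals(tokens):
    augmented = []
    for token in tokens:
        radicals = "".join(RADICALS.get(ch, ch) for ch in token)
        augmented.append(f"{token}|{radicals}")   # e.g. 湖泊|氵氵
    return augmented

print(add_radicals(["湖泊", "树林"]))  # ['湖泊|氵氵', '树林|木木']
```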
Chinese character decomposition has been used as a feature to enhance Machine Translation (MT) models, combining radicals into character and word level models. Recent work has investigated ideograph or stroke level embedding. However, questions remain about which decomposition levels of Chinese character representations, radicals and strokes, are best suited for MT. To investigate the impact of Chinese decomposition embedding in detail, i.e., radical, stroke, and intermediate levels, and how well these decompositions represent the meaning of the original character sequences, we carry out analysis with both automated and human evaluation of MT. Furthermore, we investigate if the combination of decomposed Multiword Expressions (MWEs) can enhance the model learning. MWE integration into MT has seen more than a decade of exploration. However, decomposed MWEs have not previously been explored.
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post editing and annotation plus second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparisons namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE.
Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, including topics such as MWE detection, MWE decomposition, and research investigating the exploitation of MWEs in other NLP fields such as Machine Translation. However, the availability of bilingual or multilingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multilingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performances on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use this free corpus for their own models or use them in a knowledge base as model features.
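A sketch of one plausible filtering stage when building such collections: keeping candidate bilingual MWE pairs that co-occur often enough and are short enough. The thresholds and toy pairs are invented; this is not the exact MultiMWE extraction pipeline.

```python
# Filter candidate bilingual MWE pairs extracted from a parallel corpus by
# co-occurrence count and phrase length.
from collections import Counter

def filter_mwe_pairs(aligned_phrase_pairs, min_count=3, max_len=5):
    counts = Counter(aligned_phrase_pairs)   # (src_mwe, tgt_mwe) tuples
    kept = {}
    for (src, tgt), c in counts.items():
        if c >= min_count and len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept[(src, tgt)] = c
    return kept

pairs = [("kick the bucket", "den Löffel abgeben")] * 4 + [("by the way", "nebenbei gesagt")] * 2
print(filter_mwe_pairs(pairs))
```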
A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.
Named entity recognition (NER) plays an important role in the NLP literature. Traditional methods tend to employ a large annotated corpus to achieve high performance. Unlike many semi-supervised learning models for the NER task, in this paper we employ the graph-based semi-supervised learning (GBSSL) method to utilize the freely available unlabeled data. The experiment shows that the unlabeled corpus can enhance the state-of-the-art conditional random field (CRF) learning model and has the potential to improve tagging accuracy, even though the margin is small and not yet satisfying in the current experiments.
To facilitate effective translation modeling and translation studies, one of the crucial questions to address is how to assess translation quality. From the perspectives of accuracy, reliability, repeatability and cost, translation quality assessment (TQA) itself is a rich and challenging task. In this work, we present a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which we classify into further detailed sub-categories. We hope that this work will be an asset for both translation model researchers and quality assessment researchers. In addition, we hope that it will enable practitioners to quickly develop a better understanding of the conventional TQA field, and to find corresponding closely relevant evaluation solutions for their own needs. This work may also serve to inspire further development of quality assessment and evaluation methodologies for other natural language processing (NLP) tasks in addition to machine translation (MT), such as automatic text summarization (ATS), natural language understanding (NLU) and natural language generation (NLG).
With the rapid development of internet technologies, social networks, and other related areas, user authentication becomes more and more important to protect the data of users. Password authentication is one of the widely used methods to achieve authentication for legal users and defense against intruders. Many password-cracking methods have been developed during the past years, and people have been designing countermeasures against password cracking all the time. However, we find that little survey work has been done on password cracking research. This paper mainly gives a brief review of the password cracking methods, important technologies of password cracking, and the countermeasures against password cracking, which are usually designed at two stages, including the password design stage (e.g. user education, dynamic passwords, use of tokens, computer generation) and after the design (e.g. reactive password checking, proactive password checking, password encryption, access control). The main objective of this work is to offer novice IT security professionals and general audiences some knowledge about computer security and password cracking, and to promote the development of this area. Keywords: Computer security; User authentication; Password cracking; Cryptanalysis; Countermeasures
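As an illustration of the proactive password checking countermeasure mentioned above, here is a small sketch; the rules, thresholds and cracked-password list are invented examples, not a recommendation from the paper.

```python
# Proactive password checker sketch: reject candidate passwords that are too
# short, lack character variety, or appear in a known-cracked list.
import string

CRACKED = {"123456", "password", "qwerty", "letmein"}  # toy dictionary

def check_password(pw: str) -> list[str]:
    problems = []
    if len(pw) < 10:
        problems.append("too short")
    classes = [string.ascii_lowercase, string.ascii_uppercase, string.digits, string.punctuation]
    if sum(any(ch in cls for ch in pw) for cls in classes) < 3:
        problems.append("needs at least three character classes")
    if pw.lower() in CRACKED:
        problems.append("appears in a cracked-password list")
    return problems

print(check_password("password"))        # several problems
print(check_password("T!m3-to-Change"))  # []
```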
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs in the parsing models and even improve the parsing accuracy.
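A sketch of what such a mapping looks like in code: a lookup from (treebank, phrase tag) to a universal category. The pairs shown and the fallback label are illustrative assumptions, not the published 9-category table.

```python
# Map treebank-specific phrase tags to a small universal phrase tagset.
UNIVERSAL_PHRASE_MAP = {
    ("PennTreebank", "NP"): "NP",
    ("PennTreebank", "WHNP"): "NP",
    ("PennTreebank", "VP"): "VP",
    ("FrenchTreebank", "PP"): "PP",
    ("FrenchTreebank", "AdP"): "ADVP",
}

def to_universal(treebank: str, tag: str) -> str:
    # Fall back to a generic "X" label for tags not covered by the toy table.
    return UNIVERSAL_PHRASE_MAP.get((treebank, tag), "X")

print(to_universal("PennTreebank", "WHNP"))  # NP
```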
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post editing and annotation plus second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparisons namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
In neural machine translation (NMT), researchers face the challenge of unseen (or out-of-vocabulary, OOV) word translation. To solve this, some researchers propose splitting western languages such as English and German into sub-words or compounds. In this paper, we try to address this OOV issue and improve NMT adequacy for Chinese, a harder language whose characters are even more sophisticated in composition. We integrate the Chinese radicals into the NMT model with different settings to address the unseen-word challenge in Chinese-to-English translation. This can also be considered the semantic part of the MT system, since Chinese radicals usually carry the essential meaning of the words in which they appear. Meaningful radicals and new characters can be integrated into the NMT systems with our models. We use an attention-based NMT system as a strong baseline system. The experiments on the standard Chinese-to-English NIST translation shared task data from 2006 and 2008 show that our designed models outperform the baseline model on a wide range of state-of-the-art evaluation metrics, including LEPOR, BEER, and CharacTER, in addition to the traditional BLEU and NIST scores, especially on adequacy-level translation. We also have some interesting findings from the results of our various experimental settings about the performance of words and characters in Chinese NMT, which differs from other languages. For instance, fully character-level NMT may perform very well, or even at the state of the art, in some other languages, as researchers have recently demonstrated; however, in the Chinese NMT model, word-boundary knowledge is important for model learning. https://www.rug.nl/research/portal/files/74648346/Proceedings_of_the_ESSLLI_2018_Student_Session.pdf#page=55
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs in the parsing models and even improve the parsing accuracy.
Many treebanks have been developed in recent years for different languages, but these treebanks usually employ different syntactic tag sets. This forms an obstacle for other researchers to take full advantage of them, especially when they undertake multilingual research. To address this problem and to facilitate future research in unsupervised induction of syntactic structures, some researchers have developed a universal POS tag set. However, the mismatch between phrase tag sets remains unsolved. Trying to bridge the phrase-level tag sets of multilingual treebanks, this paper designs a phrase mapping between the French Treebank and the English Penn Treebank. Furthermore, one of the potential applications of this mapping work is explored in the machine translation evaluation task. This novel evaluation model, developed without using reference translations, yields promising results compared to state-of-the-art evaluation metrics.
One problem of automatic translation is the evaluation of the result. The result should be as close to a human reference translation as possible, but varying word order or synonyms have to be taken into account when evaluating the similarity of the two. In conventional methods, researchers tend to employ many resources such as synonym vocabularies, paraphrasing, and textual entailment data. To make the evaluation model both accurate and concise, this paper explores evaluation using only the part-of-speech (POS) information of the words, which means the method is based only on the agreement of the POS strings of the hypothesis translation and the reference. In this method, POS tags also play a role similar to synonyms, in addition to capturing the syntactic or morphological behaviour of the lexical item in question. Measures of the similarity between machine translation and human reference are dependent on the language pair, since, for instance, the word order or the number of synonyms may vary. The new measure solves this problem to a certain extent by introducing weights for different sources of information. The experimental results on English, German and French correlate on average better with human judgments than some existing measures, such as BLEU, AMBER and MP4IBM1.
The main topic of this presentation will be the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether MT systems make any progress. Traditional human judgments are very time-consuming and expensive. On the other hand, there are some weaknesses in the existing automatic MT evaluation metrics:
– perform well on certain language pairs but weakly on others, which we call the language-bias problem;
– consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making replicability difficult), which we call the extremism problem;
– design incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
The experiments on ACL-WMT corpora show the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. In fact, evaluation work is closely related to similarity measurement, so these works can be further extended to other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series.
This project contains mappings from language and Treebank specific phrase tagsets (phrase categories) to a set of refined universal phrase tags, as described in the published paper: “Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation” by Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu. 2013. Proceedings of the 25th international conference of the German Society for Computational Linguistics and Language Technology, September 23rd-27th, Darmstadt, Germany. LNCS Vol. 8105, pp. 119–131. Springer-Verlag Berlin Heidelberg. [http://link.springer.com/chapter/10.1007/978-3-642-40722-2_13#!] Currently, the phrase tagset mapping work and the open source code cover 16 treebanks and the following 15 languages: Chinese, English, French, German, Czech, Spanish, Arabic, Korean, Danish, Estonian, Hungarian, Icelandic, Italian, Japanese, Swedish, … Project Home: https://github.com/aaronlifenghan/A-Universal-Ph...
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and tell whether the translation system makes an improvement or not. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes have low agreement. On the other hand, the popular automatic MT evaluation methods have some weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but poorly when English is used as the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metric difficult to replicate and apply to other language pairs. Thirdly, some popular metrics utilize incomprehensive factors, which results in low performance on some practical tasks.
In this thesis, to address the existing problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
Multi-word Expressions (MWEs) present challenges in natural language processing and computational linguistics due to their widespread use, richness in variety, idiomaticity, and non-decompositionality in the text in which they occur. This is especially true in the machine translation (MT) field, where we require algorithms to translate from one human language to another automatically while producing high-quality output in terms of adequacy and fluency, and either preserving the style of the source or making creative and correct style decisions in the output.

In this thesis, we carry out an extensive investigation into MWEs in Neural MT.
Firstly, we carry out a review of relevant literature, which includes experimental work re-examining state-of-the-art models that combine knowledge of MWEs into MT systems, but with new language pair settings, to see what gaps might exist in the published literature.
Secondly, we propose our new models for addressing MWE translation. This includes a design in which we treat MWEs as a low-frequency word and phrase translation issue, integrating language-specific features such as stroke and radical representations of Chinese characters into the learning model, expecting that this will improve accuracy.
Thirdly, to properly examine different MT models' performances in the context of MWEs, we need to carry out a new evaluation methodology, and in light of this, we create a multilingual parallel corpus with MWE annotations (AlphaMWE). During the creation of this corpus, we classify the MT issues on MWE-related content into several categories with the expectation that this will help future MT researchers to focus on one or some of these in order to achieve a new state of the art in MT performance, ultimately moving towards human parity.
Finally, we propose a new methodology for human-in-the-loop MT evaluation with MWE considerations (HiLMeMe).

PhD thesis, Dublin City University. https://doras.dcu.ie/26559/
Building a civilised city, based on the case of Shijiazhuang
Topic Modelling (TM) is a research branch of natural language understanding (NLU) and natural language processing (NLP) that facilitates insightful analysis of large documents and datasets, such as summarising the main topics and how they change. This kind of discovery is becoming more popular in real-life applications due to its impact on big data analytics. In this study, from the social media and healthcare domains, we apply popular Latent Dirichlet Allocation (LDA) methods to model the topic changes in Swedish newspaper articles about Coronavirus. We describe the corpus we created, including 6,515 articles, the methods applied, and statistics on topic changes over a period of approximately one year and two months, from 17th January 2020 to 13th March 2021. We hope this work can be an asset for grounding applications of topic modelling and can be inspiring for similar case studies in an era of pandemics, to support socioeconomic impact research as well as clinical and healthcare analytics. The resources are open source.
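A minimal LDA sketch in the spirit of the study, using scikit-learn's LatentDirichletAllocation; the handful of English toy documents stands in for the 6,515 Swedish articles, and the topic count is arbitrary.

```python
# Vectorise documents, fit an LDA model, and inspect the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "vaccine rollout hospital cases rise",
    "restrictions schools remote teaching",
    "vaccine doses delivered regions hospital",
    "economy restrictions businesses support",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```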
Pre-trained language models (PLMs) often take advantage of monolingual and multilingual datasets that are freely available online to acquire general or mixed-domain knowledge before deployment to specific tasks. Extra-large PLMs (xLPLMs) have been proposed very recently, claiming superior performance over smaller-sized PLMs in tasks such as machine translation (MT). These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine whether xLPLMs are absolutely superior to smaller-sized PLMs when fine-tuned towards domain-specific MT. We use two different in-domain datasets of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn2022 challenge at WMT2022. We choose the popular Marian Helsinki as the smaller-sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs. Our experimental investigation shows that 1) on the smaller-sized in-domain commercial automotive data, the xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using SACREBLEU and hLEPOR metrics than the smaller-sized Marian, even though its score increase rate is lower than Marian's after fine-tuning; 2) on fine-tuning with the relatively larger, well-prepared clinical data, the xLPLM NLLB tends to lose its advantage over the smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using the ClinSpEn-offered metrics METEOR, COMET, and ROUGE-L, and loses entirely to Marian on Task-1 (clinical cases) on all official metrics including SACREBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs.
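A sketch of the automatic-metric step used in such comparisons, scoring system outputs against references with SacreBLEU (BLEU and chrF). The sentences are toy examples; hLEPOR, METEOR, COMET and ROUGE-L would each be computed with their own packages in the same fashion.

```python
# Corpus-level BLEU and chrF with SacreBLEU over toy hypothesis/reference lists.
import sacrebleu

hypotheses = ["the patient was discharged after three days"]
references = [["the patient was released after three days"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```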