Unknown Words Guesser

João Rebelo¹, Nuno Mamede¹,², and Jorge Baptista²,³

¹ Universidade de Lisboa - Instituto Superior Técnico
² L2F – Spoken Language Systems Laboratory – INESC-ID Lisboa
³ Universidade do Algarve - Faculdade de Ciências Humanas e Sociais
joao.humberto@tecnico.ulisboa.pt, nuno.mamede@inesc-id.pt, jbaptis@ualg.pt

Abstract. STRING is a Natural Language Processing (NLP) chain developed at L2F / INESC-ID Lisboa, in which LexMan is the module responsible for tokenization and morphological analysis. This paper addresses the challenge of improving the LexMan submodule responsible for proposing correct pairs (lemma + tag) for unknown words.

Keywords: Natural Language Processing · Morphological Analysis · Unknown Word Guesser · Markov Models.

1 Introduction

Natural Language Processing (NLP) is a subfield of Linguistics and Artificial Intelligence that studies problems related to the generation and automatic understanding of natural languages. This processing comprises several successive stages: tokenization, morphological analysis, syntactic analysis, and semantic analysis. In the first stage, the input text is divided into tokens; each token can be formed by one or more words, numbers, etc. The Morphological Analyzer then receives and analyzes these tokens to assign morphosyntactic tags. Based on these tags, a Syntactic Analyzer builds a structured representation of the input text. Lastly, a Semantic Analyzer returns a formal representation of the possible meanings of the text.

1.1 STRING

STRING (Statistical and Rule-based Natural Language processing chain) [7] has a pipeline structure and is composed of four modules: LexMan, RuDriCo, MARv, and XIP. LexMan [3], the first module of the chain, is responsible for the tokenization and morphological analysis of the input text. LexMan processing is divided into two stages.
The dictionary generation takes place in the first stage. This process comprises three submodules: a Words Generator, a Suffix Generator, and a Transducer Generator. The first combines a lexicon of lemmas with a set of inflection paradigms to generate all the inflected forms associated with these lemmas, assigning a morphosyntactic tag to each word. The Suffix Generator [1] takes the simple and compound words generated and produces the derived forms by suffixation associated with each type of word. Lastly, the Transducer Generator creates transducers for the words generated before and combines them with prefixes, clitic information, and handmade transducers. The final transducer corresponds to the lexicon used by LexMan during morphological analysis.

During the second processing stage of LexMan, the tokenization and morphological analysis of the input text take place. The input text transducer is composed with the LexMan lexicon transducer. The LexMan analyzer then goes through the tokens and assigns them the tags obtained from that composition. If some words were not identified in the lexicon, LexMan invokes the Guesser submodule to handle these unknown words.

RuDriCo [4] is a morphosyntactic disambiguator, responsible for resolving ambiguities not solved by LexMan. To do so, this module uses a set of segmentation and disambiguation rules, which allow joining or splitting two or more segments, for example, Coreia, do, and Sul into Coreia do Sul, or no into em o.

MARv [10] is a statistical disambiguator responsible for resolving ambiguities that may still remain after RuDriCo. This module is based on second-order Markov Models and uses the Viterbi algorithm [14].

XIP (Xerox Incremental Parser) [5] is responsible for the syntactic and semantic analysis of the text. This module builds syntactic constituents (chunks) and establishes dependency relations between them.
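The Words Generator step described above, which combines a lemma lexicon with inflection paradigms, can be illustrated with a short Python sketch. The paradigm format, the tag names, and the two example lemmas are invented for illustration and do not reflect LexMan's actual data:

```python
# Minimal sketch of the Words Generator idea: combine a lemma lexicon
# with inflection paradigms to produce (form, lemma, tag) triples.
# Paradigm rules are (suffix to strip, suffix to add, tag) -- illustrative only.
PARADIGMS = {
    "ar-verb": [("ar", "o", "V.pres.1s"), ("ar", "as", "V.pres.2s"), ("ar", "a", "V.pres.3s")],
    "o-noun": [("o", "o", "N.masc.sg"), ("o", "os", "N.masc.pl")],
}

# Each lemma is paired with the name of its inflection paradigm.
LEXICON = [("falar", "ar-verb"), ("gato", "o-noun")]

def generate_forms(lexicon, paradigms):
    """Yield (inflected form, lemma, tag) for every lemma/paradigm rule."""
    for lemma, pname in lexicon:
        for strip, add, tag in paradigms[pname]:
            if lemma.endswith(strip):
                yield lemma[: -len(strip)] + add, lemma, tag

for form, lemma, tag in generate_forms(LEXICON, PARADIGMS):
    print(f"{form}\t{lemma}\t{tag}")
```

In the real module the output of this generation step is compiled into a transducer; here the triples are simply printed.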
1.2 Context

Regarding the problem of identifying unknown words in Portuguese, some tools already exist:

– Palavroso [8] is a morphological analyzer that can be used as a morphological or lexical component of natural language processing systems for Portuguese. One such system is Correcto [8], an orthographic corrector that uses Palavroso as its lexical component. This module proposes a set of alternatives for unknown words, providing processes to identify linguistic and typographic errors. The alternatives are then verified against the Palavroso lexicon.

– Palavras [2] is a text processing system for Portuguese, in both its European and Brazilian variants. Palmorf [2] is the module of this system responsible for morphological analysis. Palmorf can propose alternatives for unknown words, based on the differences between European and Brazilian Portuguese and on errors associated with incorrectly placed or missing diacritics.

– Sumit Sharma and Swadha Gupta proposed an intelligent automatic spelling corrector based on Markov Models to identify spelling mistakes in text [12]. Conventional spell checkers only fix non-word errors; real-word errors, which yield valid words not intended by the user, go undetected. To correct this kind of problem, they propose an approach based on trigrams and Bayesian methods.

1.3 Goal

The goal of this work is to improve the Guesser, the LexMan submodule responsible for proposing pairs (lemma + tag) for words that do not exist in the LexMan lexicon. The original version of this submodule deals with problems related to upper case, common errors, compound words, and incorrectly placed or missing diacritics. The goal of the new solution is to generate more alternatives for unknown words by using different strategies. The first task is the improvement of the original solution.
Once that is completed, new modules can be added to handle new strategies. These new modules aim to solve some problems reported in the master's theses of Alexandre Vicente [13] and Hugo Almeida [1].

2 Guesser Original Architecture

The Guesser [1] is a LexMan submodule responsible for proposing pairs (lemma + tag) for words that are not in the LexMan lexicon. This module is essentially divided into two stages: the first is executed during the tagging process, and the second after all the input text has been processed.

2.1 1st Processing Stage

In the first stage, the Guesser generates the alternatives for unknown words when invoked by LexMan. This execution is divided into five submodules, each of which can propose alternatives for the words based on a different strategy. The original Guesser solution follows a pipeline architecture for alternative generation, with the following modules:

– the Upper Case Converter, the first Guesser submodule, is responsible for generating alternatives based on incorrect capitalization;

– the Direct Substitution module deals with the most common typographical errors. This module stores these common errors and their corresponding tags in a file, and when the unknown word is one of the stored errors, the module performs the respective substitution. This is the most efficient Guesser module, because it does not need to validate an alternative word;

– the Compounds module deals with compound words that are not automatically generated. This module stores a set of compound-word terminations: -alvo, -mestre, -mãe, -padrão, and -chave. If the unknown word is formed with one of these terminations, like público-alvo, the generated alternative corresponds to the first element of the compound word, público;

– the Diacritics module deals with incorrectly placed diacritics, and it is only invoked if the unknown word has at least one diacritic.
These kinds of errors are very common, such as cafe instead of café or ràpido instead of rápido;

– the Termination module is invoked when the other modules either did not produce alternatives or produced alternatives that were not in the LexMan lexicon. This module takes into consideration the endings of words. For example, words ending in -o are considered masculine, while those ending in -a are considered feminine. This module always assigns a tag to the unknown word, even if it is not correct.

When LexMan finishes processing the input text, the alternatives file is composed with the LexMan dictionary. The transducer obtained from that composition is used in the next processing stage.

2.2 2nd Processing Stage

This second stage, which occurs after LexMan processes all the input text, aims to validate the alternatives generated in the first stage. For each unknown word, the Guesser goes through the transducer obtained from the composition between the alternatives file and the LexMan dictionary to verify whether there is a valid alternative for the unknown word; if there is, it assigns its tag, otherwise it invokes the Termination module to propose one.

3 Guesser Solution

The new solution solves some problems identified in the original architecture. Changes were made to the original architecture and modules, and five new modules were added to the Guesser.

3.1 Original Architecture Modifications

Before this work, the Upper Case Converter was the first module to be executed, and the other modules could not generate alternatives for words starting with upper case. The new solution places this module in the third position of execution, allowing the Guesser to process words like PrimeiroMinistro or ESTado. Now, only words in which just the first letter is uppercase, like Blimunda, are not processed.
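The revised Upper Case Converter behaviour can be sketched as follows. This is a simplified Python illustration, not the actual LexMan code, and the particular set of case variants generated is an assumption:

```python
def uppercase_alternatives(word: str) -> list[str]:
    """Sketch of the new Upper Case Converter behaviour: propose
    case-normalized alternatives only when the word actually contains
    upper-case letters, and skip words like 'Blimunda' in which only
    the first letter is capitalised."""
    if not any(ch.isupper() for ch in word):
        return []  # no upper-case letters: nothing to propose
    if word[:1].isupper() and word[1:].islower():
        return []  # only the initial letter is upper-case: leave as-is
    alternatives = {word.lower(), word.capitalize()}
    alternatives.discard(word)  # never re-propose the original shape
    return sorted(alternatives)

print(uppercase_alternatives("ESTado"))    # case-mangled word
print(uppercase_alternatives("Blimunda"))  # proper name, skipped
print(uppercase_alternatives("carro"))     # already lower-case, skipped
```

In the real pipeline each alternative would still be validated against the LexMan lexicon before being accepted.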
Another problem identified, also related to the Upper Case Converter, was that this module always generated an alternative and verified it against the LexMan lexicon, even if the alternative had the same shape as the original word. This resulted in a significant increase in processing time. To solve this, in the new Guesser solution the Upper Case Converter only generates an alternative if there are uppercase letters in the unknown word, and this alternative is added to the alternatives file, as is done by the other modules.

The original Diacritics module did not address problems related to the non-placement of ç. To solve this, the new solution stores a set of possible substitutions, listed in Table 1. These substitutions occur only at the end of words; for example, for the unknown word cancão, the generated alternative is canção.

original  substitution
cão       ção
cões      ções
ca        ça
co        ço
cu        çu

Table 1. Possible substitutions of c to ç.

To keep the execution time from increasing with the addition of the new modules, the Upper Case Converter, Hyphen, Language Model, Phonemes, and Diacritics modules are executed in different processes. The exceptions are the Direct Substitution and Compounds modules, because when either of them creates an alternative, the execution of the others is not required.

3.2 New Modules

The following points describe the new modules added. All these modules are executed simultaneously and concurrently in different threads, and all of them write alternatives to the same alternatives file.

Hyphen. During the analysis of the unknown words identified in the corpora used in this work, some errors originating from the absence of hyphens were detected, like PrimeiroMinistro or DecretoLei. These words can be identified by the presence of an uppercase letter that is neither the first nor the last.
In these cases, the module generates an alternative by placing a hyphen before the uppercase letter, for example, Primeiro-Ministro or Decreto-Lei.

Fig. 1. New Guesser solution architecture - stage 1.

Phonemes. This module creates alternatives based on common phonetic errors of the Portuguese language. Table 2 contains the set of substitutions this module is allowed to make. When the unknown word contains a letter or sequence of letters present in Table 2, the module generates an alternative by replacing it with the corresponding substitution. For example, if the unknown word is ascenção, the module generates ascensão by replacing ç with s.

original  substitution
ç         ss
ss        ç
ç         s
s         ç
c         ss
ss        c
x         ch
ch        x
z         s
ão        am
am        ão

Table 2. Phonetic substitutions stored.

Language Model. The goal of this module is to generate alternatives for various types of errors. To do this, the module uses a second-order Markov Model, and for each unknown word the two preceding words in the text are also stored. With these two words, the Language Model searches the trigrams for possible words that can appear after this sequence. To select the best alternative to generate, it uses the minimum edit distance [6] between each candidate and the unknown word; the candidate with the lowest value is added to the alternatives file. The maximum number of edits allowed is adjusted according to the size of the unknown word, so that no alternatives are generated that do not match the correct form of the unknown word.

Select Alternative. This module is executed in the second processing stage of the Guesser, with the goal of selecting the best alternative when there is more than one correct alternative for the same unknown word. To do this, the module uses trigrams, unigrams, and also the edit distance between the alternatives and the unknown word.
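The edit-distance part of this selection can be sketched in Python. The trigram and unigram scores used by the actual module are omitted, and the fixed threshold below stands in for the size-dependent limit described above:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (minimum number of single-character
    insertions, deletions, and substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def select_alternative(unknown: str, candidates: list[str], max_edits: int = 2):
    """Pick the candidate closest to the unknown word; candidates further
    away than max_edits are discarded (the real module varies this limit
    with the word's length)."""
    scored = [(edit_distance(unknown, c), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d <= max_edits]
    return min(scored)[1] if scored else None

print(select_alternative("ascenção", ["ascensão", "atenção"]))
```

Here ascensão is chosen because it is one edit away from the unknown word, while atenção is two edits away.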
In the original Guesser solution, even if there were many correct alternatives, the first one found was always chosen. With the new implementation, the most appropriate alternative is selected.

3.3 New Guesser Architecture

The new architecture of the Guesser's first stage is illustrated in Figure 1, where it is possible to observe the new positions of the original modules as well as the new ones. The most significant changes are the addition of three new modules and the parallel execution of the modules that write alternatives to the file, except for the Compounds module. The architecture of the Guesser's second stage is presented in Figure 2; the only change made in this stage was the addition of the Select Alternative module.

Fig. 2. New Guesser solution architecture - stage 2.

4 Evaluation

This section presents the results obtained in the evaluation of the new solution regarding processing time and precision in the correct assignment of pairs (lemma + tag) to unknown words. To do this, an excerpt with 4000 sentences containing unknown words was first created from two corpora: one from a Portuguese newspaper, Público [11], and the other from the Portuguese Parliament [9]. In order to evaluate the correct identification of unknown words by the Guesser, a script was created to annotate these unknown words. This annotation task was done by a Doctorate in Linguistics, a Doctorate in Electrical and Computer Engineering, and a master's student in Computer Science Engineering.

4.1 Time

To evaluate the processing time, two approaches were used: one to evaluate each Guesser module using the created excerpt, and another to evaluate the Guesser in a real context, using files of different sizes from the Público corpus.

Fig. 3.
Processing time results using the excerpt with 4000 sentences.

Fig. 4. Word execution time by number of sentences processed. Graph A - files 1 - 1; 2 - 10; 3 - 100 sentences, and Graph B - files 4 - 500; 5 - 1000; 6 - 5000; 7 - 10000; 8 - 50000 sentences.

The graph in Figure 3 presents the processing-time results for the excerpt with 4000 sentences. The analysis of the obtained results shows that the improvements made to the initial architecture slightly decreased the processing time; however, with the addition of the new modules, the situation was reversed.

The graphs in Figure 4 present the processing-time results for files with different numbers of sentences, in order to simulate a real context. The analysis of the obtained results shows that, both for smaller and for bigger files, the new solution obtains better results than the original one.

4.2 Precision

[Table 3: 18 tests with different module combinations; correct assignments range from 576 (11%), with all modules off, to 1132 (23%) for the new solution.]

Table 3. Results obtained for the correct assignment of pairs (lemma + tag) to unknown words, varying the modules used in the processing of the excerpt with 4000 sentences. Each column corresponds to a module used in the tests: Off - all modules off, except Termination, which is always active; DS - Direct Substitution; C - Compounds; UC - Upper Case Converter; H - Hyphen; LM - Language Model; P - Phonemes; D - Diacritics; S - Select Alternative.
The results show that the modules contributing the most identified words are the Language Model and Diacritics. The new solution for the Guesser produces significantly better results than the original: it correctly identifies 4% more unknown words.

5 Conclusions

This work addresses the problem of improving the Guesser module in the identification of words that are not in the LexMan lexicon. To achieve this, different strategies were added to the original solution so as to create more alternatives for unknown words. Compared with the baseline, the results obtained were better regarding precision in unknown-word identification than regarding processing time; nevertheless, the processing time was better when processing big texts. In general, this work improved the computational performance of the Guesser module.

References

1. Almeida, H., Baptista, J., Mamede, N.: Suffix identification in Portuguese using transducers. In: INForum 2015 – Simpósio de Informática. pp. 38–47. IST, Lisboa, Portugal (2015)
2. Bick, E.: The Parsing System "PALAVRAS". Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. PhD thesis, University of Aarhus, Aarhus, Denmark (December 2000)
3. Diniz, C., Mamede, N.: LexMan - lexical morphological analyser. Tech. rep., L2F / INESC-ID Lisboa, Lisboa (2011)
4. Diniz, C., Mamede, N.J., Pereira, J.C.S.D.: RuDriCo2 - A faster disambiguator and segmentation modifier. In: II Simpósio de Informática (INForum). pp. 573–584. Universidade do Minho, Portugal (September 2010)
5. Hagège, C., Tannier, X.: XRCE-T: XIP temporal module for TempEval campaign. In: Proceedings of the 4th International Workshop on Semantic Evaluations. pp. 492–495. SemEval '07, Association for Computational Linguistics, Stroudsburg, PA, USA (2007)
6.
Beijering, K., Gooskens, C., Heeringa, W.: Predicting intelligibility and perceived linguistic distance by means of the Levenshtein algorithm. pp. 4–6 (2008)
7. Mamede, N., Baptista, J., Cláudio, D.: STRING - A Hybrid Statistical and Rule-Based Natural Language Processing Chain for Portuguese. In: PROPOR 2012. Springer (2012)
8. Medeiros, J.C.D.: Processamento Morfológico e Correcção Ortográfica do Português. Master's thesis in Electrical and Computer Engineering, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa (February 1995)
9. Parlamento: Debates Parlamentares - 3ª República (1976-2015), http://debates.parlamento.pt/catalogo/r3/dar
10. Ribeiro, R.: Anotação Morfossintáctica Desambiguada do Português. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa (March 2003)
11. Rocha, P., Santos, D.: CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In: Actas do V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000). pp. 131–140. Atibaia, São Paulo, Brasil (November 2000)
12. Sharma, S., Gupta, S.: A correction model for real-word errors. In: Procedia Computer Science. pp. 99–106. Elsevier B.V., Kurukshetra, Haryana, India (December 2015)
13. Vicente, A.M.F.: LexMan: um Segmentador e Analisador Morfológico com Transdutores. Master's thesis, Instituto Superior Técnico (June 2013)
14. Viterbi, A.J.: Error Bounds for Convolutional Codes and an Asymptotically Optimal Decoding Algorithm. IEEE Transactions on Information Theory 13(2), 260–269 (1967)