
This Publication has to be referred as: Boiangiu, C.-A.; Cananau, D.-C.; Petrescu, S. & Moldoveanu, A. (2009). OCR Post Processing Based on Character Pattern Matching, Annals of DAAAM for 2009 & Proceedings of the 20th International DAAAM Symposium, Volume 20, No. 1, ISBN 978-3-901509-70-4, ISSN 1726-9679, pp. 162, Editor B[ranko] Katalinic, Published by DAAAM International, Vienna, Austria, EU, 2009, www.daaam.com

OCR POST PROCESSING BASED ON CHARACTER PATTERN MATCHING

BOIANGIU, C[ostin]-A[nton]; CANANAU, D[an]-C[ristian]; PETRESCU, S[erban] & MOLDOVEANU, A[lin]

Abstract: This paper presents an OCR post-processing approach based on character pattern matching. The approach can either use a given alphabet for word matching, making it dependent on the language used, or it can generate an alphabet in order to perform the processing independently of the language. The second approach can prove useful for the creation of new dictionaries and the understanding of unknown languages.

Key words: OCR, alphabet, character recognition, word processing

1. INTRODUCTION

The domain of digital image processing is in continuous expansion, and for this reason more and more solutions appear for different problems.
However, each problem solved creates new possible improvements, and so new methods are developed all the time. Digital image processing can be divided into a large number of subcategories that deal with different aspects of a digital image. There are algorithms dealing with conversions from one colour scheme to another. Other algorithms deal with the recognition of different elements found in an image. The goal of this last type of algorithm is to simulate the human eye by matching different patterns against already known models. The matched patterns can vary from simple lines to more complicated characters that contain a variety of geometrical shapes.

One of the most important areas of interest in digital image processing is optical character recognition, or simply OCR. There are numerous software programs that implement such OCR engines, some of them open source, and each program has its advantages and disadvantages. For example, some OCR engines are oriented towards letter recognition, while others perform better at text-line recognition. Whatever their characteristics are, all OCR engines have to deal with word recognition and correction.

The result of the OCR software is usually a page or a fragment in which the characters, words and other elements found have been associated correctly with an understandable language. For example, a deteriorated newspaper page written in English should be reconstructed in order to be readable.

Even though there are numerous programs that deal with different aspects and have different features, there is no such thing as a perfect OCR. Each algorithm has its own flaws.
Some of the algorithms used may fail on a certain type of document, and combining all of them in order to solve the resulting problems is simply not recommended because of the huge running time that such a combined algorithm may have. And so the necessity for pre- and post-processing has appeared.

2. PROBLEM STATEMENT

The two types of processing solve different problems, before and after passing a page or a fragment through the OCR engine. The pre-processing deals with everything connected to page or fragment preparation for the actual processing. Methods used in this stage are mostly oriented towards image conversions and image processing such as smoothing or noise removal. Post-processing techniques are widely spread and do not deal only with a certain aspect of the document. While some processing deals with image improvement, other processing may deal with word recognition and so on. Almost all algorithms in the second category require a dictionary or an alphabet which will be used for word correction. A part of these algorithms can work for any language as long as a dictionary is given (Kolak & Resnik, 2005), (Perez-Cortes et al., 2000), (Zhuang & Zhu, 2005). Other algorithms are used only for a given language, for example German (Wiedenhofer et al., 1995) or Chinese (Long et al., 2006). A more complex approach could use both pre- and post-processing together with a feedback function that assesses the final result of the two processing steps.

The aim of the algorithm presented in this paper is to improve the results of the OCR through the post-processing step, by using both dictionary-dependent and dictionary-independent variations.

3. THE ALGORITHM

Most of the errors produced by the OCR engine are a result of erroneous character recognition or of badly scanned documents.
Most of the words are detected; some percentage of them are not real words, but variations of them. Variations are words that are very similar to real words, but differ by a group of characters. The dictionary approach presented in this paper improves the character results obtained by the actual OCR processing.

3.1 The dictionary approach

In order to remove the variations and to find the correct form of each word, in the first stage the algorithm takes each word in the document and inserts it into a hash table, with the hash function given by the sum of the squares of the Unicode values of the characters in the word. The square has been chosen for optimization purposes. The hash function assures a good spreading of the values in the hash table, in order to improve the data allocation of the entire algorithm. In the same stage an input dictionary is necessary for pattern recognition. The same operation of inserting the words into a hash table is performed for the dictionary. After this stage has finished, the hash values of the document words are compared with the hash values of the dictionary, indexed in a similar hash table. If the values match, the word is checked for the correct form and, if that exists, the word is removed from the hash table because it does not need correction. At the end of this step there will be a hash table that contains all the erroneous words.

The next step is the actual correction. In another table, all the letters of the given alphabet and the possible combinations of two or three letters are inserted. This will be called the alphabet of the language. Each erroneous word is taken in turn, and every letter or group of two or three letters is replaced with a value from the alphabet and then cross-referenced with the dictionary for a match.
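The indexing stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the use of Python sets for hash buckets are assumptions of this sketch.

```python
from collections import defaultdict

def word_hash(word):
    """Hash function from the paper: sum of squares of Unicode values."""
    return sum(ord(c) ** 2 for c in word)

def build_table(words):
    """Group words into buckets keyed by their hash value."""
    table = defaultdict(set)
    for w in words:
        table[word_hash(w)].add(w)
    return table

def find_erroneous(document_words, dictionary):
    """Return document words whose exact form is absent from the dictionary."""
    doc_table = build_table(document_words)
    dict_table = build_table(dictionary)
    errors = set()
    for h, bucket in doc_table.items():
        for w in bucket:
            # A matching hash bucket is only a hint; the word itself must
            # also be present in the bucket for it to count as correct.
            if w not in dict_table.get(h, set()):
                errors.add(w)
    return errors
```

For example, find_erroneous(["the", "cal", "sat"], ["the", "cat", "sat"]) leaves only "cal" in the table of erroneous words.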
If only one match is found, the word is corrected with a high degree of certainty. If more words are found at this point, the algorithm computes a certainty degree and chooses the word with the highest certainty as the solution. The certainty degree can be computed in several manners depending on the desired output. For example, a fast but unreliable solution would be to simply choose the first word found in the dictionary and accept it as the correct solution. However, this approach is not recommended, because errors are likely to occur. The certainty degree should also be computed using semantic information if possible.

By using the algorithm presented so far, the OCR correction has been improved by 60%. However, further improvements can be made. The operation of replacing a character or group of characters with another character or group from the alphabet is called a substitution. In order to improve the correction, two more operations are introduced: the addition and the subtraction. The subtraction means eliminating a character from the word and cross-referencing the result with the dictionary. The addition means adding one character from the alphabet to the word and cross-referencing the result with the dictionary. This is done for each position in the word in turn and for each character in the alphabet.

By using all three presented operations, a large variety of choices will be available for one word. The previous certainty degree now has to be calculated by taking into account the fact that there are three different operations instead of just one. Again, a fast but less reliable approach would be to compute the degree by assigning a higher value to simple operations, with the highest one being the substitution, and a lower value to combined operations. However, an algorithm that takes into account semantic information is again recommended. In order to illustrate the functionality of this algorithm, the following example is presented.
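The three operations and the simple operation-based ranking can be sketched as below. The paper leaves the certainty function open (semantic information is recommended), so the numeric weights here are purely illustrative placeholders, and single-character operations stand in for the two- and three-letter groups.

```python
import string

ALPHABET = string.ascii_lowercase  # stand-in for the language's alphabet

def candidates(word):
    """Generate (variant, operation) pairs using the paper's three
    operations: substitution, subtraction (deletion) and addition."""
    out = set()
    for i in range(len(word)):
        # subtraction: eliminate the character at position i
        out.add((word[:i] + word[i + 1:], "subtraction"))
        for c in ALPHABET:
            # substitution: replace the character at position i
            out.add((word[:i] + c + word[i + 1:], "substitution"))
    for i in range(len(word) + 1):
        for c in ALPHABET:
            # addition: insert a character at position i
            out.add((word[:i] + c + word[i:], "addition"))
    return out

# Illustrative weights only: the paper ranks the substitution highest.
WEIGHTS = {"substitution": 3, "addition": 2, "subtraction": 1}

def correct(word, dictionary):
    """Pick the in-dictionary variant with the highest operation weight;
    leave the word unchanged if no variant is found."""
    matches = [(v, op) for v, op in candidates(word) if v in dictionary]
    if not matches:
        return word
    return max(matches, key=lambda m: WEIGHTS[m[1]])[0]
```

With this sketch, correct("cal", {"cat"}) yields "cat" via a substitution, and correct("cal", {"calf"}) yields "calf" via an addition.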
In an English document the word "cat" is not recognized correctly and is approximated as "cal" by the OCR engine. By using the dictionary approach, the erroneous word is easily found when cross-referencing it against the English dictionary. By using only the substitution operation, the variants "car" and "cat" are found. By using all three operations, a large variety of solutions is provided: "car", "cat", "calf", "cold", "fall", "ball", "mall", "hall" and many other variants. From all of these variants, a good semantic certainty-allocation algorithm will assign "cat" the highest probability, and so the word will be corrected. A similar approach has been presented in (Reynaert, 2008).

3.2 An independent approach

The presented algorithm is based on a given dictionary or alphabet, and so it assumes that the language used in the document is known. But what happens when the language is unknown, or when the purpose of passing the document through the OCR is to discover a new language? A new, language-independent approach has to be developed, and so the next algorithm is used. Like the previous one, it starts from the words found by the OCR. It adds all of them into a table and computes the frequency of each one. After this step the algorithm finds all the similar variants in the table. At this stage, only the words that differ by one character are considered variants. After finding all the variants, the most frequent one is considered correct and all the others are corrected using this value.

The following example is considered. The words "cat", "cas" and "ca" are found in a document, with "cat" having the highest frequency. All three variants are initially included in the dictionary; after processing, the "ca" and "cas" occurrences found in the document are replaced with "cat" and removed from the dictionary. In the end the result is a newly created dictionary for the given document. Such an approach can correct a document written in an unknown language and can also create a new dictionary for that language.

However, this approach has its downside. If the language is known by the human reader, but not by the OCR, the variant found as correct by the post-processing may be erroneous because of the OCR's detection of characters. If, in the previous example, "cas" had been found with the highest frequency, "cat" would have been removed and replaced by "cas", which would have led to an erroneous result.

4. CONCLUSIONS

The algorithm presents an interesting approach for OCR improvement in the post-processing stage. The main advantage is that the dictionary approach is reliable and fast, and improves the OCR correction by up to 90%. On the other side, the independent approach is more intuitive and, even though it can prove unreliable in some cases, it is a very good algorithm for unknown languages. The problems with the independent approach appear on documents written in a known language. The first improvement that can be made to this algorithm is to create some form of language recognition for documents: when the correction is used, the algorithm should signal that the language is known and that a dictionary is required for proper functioning. This could be done by inserting into the algorithm a small database with a few of the most common words found in different languages. However, the independent approach should perform well in the cases where discovering a new language, or a dictionary for a language, is a must.

The approaches require a certainty-degree association function, as stated in the paper. One improvement, and a direction for further research related to the subject, could be the creation of such a function with respect to the semantic knowledge found in the sentences. Further improvements can be made on the operations: introducing a new operation, or creating a selection algorithm that decides whether it is necessary to apply an operation at all. Also, an algorithm for finding a stopping criterion for the approaches may prove to be a good optimization. To sum up, the algorithm is very useful in its current state and can prove to be a good direction for further research in the OCR post-processing domain.

5. REFERENCES

Kolak, O. & Resnik, P. (2005). OCR post-processing for low density languages, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 867-874, Vancouver, Canada, 2005, Association for Computational Linguistics, Morristown, USA

Long, C.; Zhu, X.; Huang, K.; Sun, J.; Hotta, Y. & Naoi, S. (2006). An efficient post-processing approach for off-line handwritten Chinese address recognition, Proceedings of The 8th International Conference on Signal Processing, pp. 16-20, ISBN: 0-7803-9736-3, 2006, Beijing

Perez-Cortes, J.C.; Amengual, J.C.; Arlandis, J. & Llobet, R. (2000). Stochastic error-correcting parsing for OCR post-processing, Proceedings of the 15th International Conference on Pattern Recognition (ICPR), vol. 4, ISSN: 1051-4651, 2000, IEEE Computer Society, Washington, DC, USA

Reynaert, M. (2008). Lecture Notes in Computer Science, No. 4919, pp. 617-630, ISSN 0302-9743, Springer-Verlag, 2008

Wiedenhofer, L.; Hein, H.-G. & Dengel, A. (1995). Post-processing of OCR results for automatic indexing, Proceedings of The Third International Conference on Document Analysis and Recognition (ICDAR'95), vol. 2, pp. 592-596, ISBN: 0-8186-7128-9, 1995, IEEE Computer Society, Washington, DC, USA

Zhuang, L. & Zhu, X. (2005). Lecture Notes in Computer Science, No. 3681, pp. 346-352, ISSN 0302-9743, Springer-Verlag, 2005
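As a closing illustration, the language-independent approach of Section 3.2 can be sketched as follows. The function names and the exact one-character variant test (substitution or single deletion) are assumptions of this sketch, not the paper's implementation.

```python
from collections import Counter

def differ_by_one(a, b):
    """True if two equal-length words differ in exactly one position,
    or one word is the other with a single character removed."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        for i in range(len(longer)):
            if longer[:i] + longer[i + 1:] == shorter:
                return True
    return False

def independent_correct(words):
    """Replace each word with its most frequent one-character variant in
    the document; return the corrected text and the derived dictionary."""
    freq = Counter(words)
    corrected = []
    for w in words:
        variants = [v for v in freq if v == w or differ_by_one(v, w)]
        # The most frequent variant is assumed to be the correct form.
        corrected.append(max(variants, key=lambda v: freq[v]))
    return corrected, set(corrected)
```

On the paper's example, a document containing "cat" three times plus "cas" and "ca" once each collapses to "cat" everywhere, and the derived dictionary contains only "cat".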