
CN101882148B - Method and system thereof for automatically identifying Uyghur in web page - Google Patents


Info

Publication number
CN101882148B
Authority
CN
China
Prior art keywords
language
webpage
feature
tuple
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010101898517A
Other languages
Chinese (zh)
Other versions
CN101882148A (en
Inventor
倪耀群
许洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN2010101898517A
Publication of CN101882148A
Application granted
Publication of CN101882148B


Abstract

The present invention relates to a method and system for automatically identifying Uyghur in web pages. The method comprises two steps. Step 1: determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, count the frequency of occurrence of each n-tuple of that language in the training web pages written in that language, take the frequency as a weight value, and generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding. Step 2: count the number of occurrences of each n-tuple in the web page to be identified and generate the identification ID corresponding to each count from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products. The sum is the score of the web page to be identified for that language, and the language of the web page is the language with the highest score. The invention improves identification accuracy.

Figure 201010189851

Description

Method and system for automatically identifying Uyghur in web pages
Technical field
The present invention relates to the field of web page processing, and in particular to a method and system for automatically identifying Uyghur in web pages.
Background art
At present the scripts of more than 60 nations worldwide are based on the Arabic alphabet; the Uyghur, Kazakh and Kyrgyz scripts used in China's Xinjiang region all belong to this class. Uyghur written in the Arabic alphabet is called ASU (Arabic-Script Uyghur). How to identify Uyghur among scripts such as Arabic, Persian (Farsi), Kazakh and Kyrgyz is the problem to be solved.
To distinguish Uyghur text, especially Uyghur appearing on web pages, the prior art offers two approaches. The first inspects the character codes to see whether any of the 18 letters that are peculiar to Uyghur and absent from Arabic appear. The second inspects the font files referenced by the web page to see whether a font name commonly used for Uyghur appears.
Identifying Uyghur by its special letters has two shortcomings. First, at least one of these 18 special letters must appear in the page before a decision can be made. Second, the codes of these 18 letters may also be used by other Arabic-alphabet scripts such as Kazakh, which causes misidentification.
Some Uyghur web pages use WEFT (Microsoft Web Embedding Fonts Tool), which packs the fonts used in a page into a compressed font file in EOT (Embedded OpenType) format and uses that font file to display the contextual letter forms of Uyghur. Different websites use different EOT file names. Identifying Uyghur solely from information such as the names of these font files has three shortcomings. First, in practice some Tibetan web pages also use WEFT, and there is no guarantee that Tibetan EOT file names differ from Uyghur ones. Second, many future Uyghur websites will use new EOT file names, which cannot be predicted. Third, browsers not based on the IE kernel, such as Firefox and Google Chrome, do not support WEFT, so the method fails on them.
The scripts of many ethnic groups are now mainly displayed, transmitted and stored on web pages in UTF-8. In theory UTF-8 can represent all languages, but it does not indicate which language a text is written in. To a Chinese-speaking user, Arabic, Uyghur, Kazakh and Kyrgyz look very similar; without a recognition program these scripts are almost impossible to tell apart. The problem to be solved is therefore the automatic identification of Uyghur.
Summary of the invention
To solve the above problem, the invention provides a method and system for automatically identifying Uyghur in web pages, which use the n-tuples (n-grams) of each language to decide whether the language of a web page is Uyghur, thereby improving identification accuracy.
The invention discloses a method for automatically identifying Uyghur in web pages, comprising:
Step 1: determine the value of n for the n-tuples used as identification features. For each language used in the training web pages, count the frequency of occurrence of each n-tuple of that language in the training web pages written in that language and take the frequency as a weight value; generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding. The languages used in the training web pages comprise Uyghur and scripts similar to Uyghur; the similar scripts comprise Arabic, Kazakh and Kyrgyz.
Step 2: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to each count from the significant bits of the n-tuple in the preset standard encoding. For each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products; the sum is the score of the web page to be identified for that language. The language used by the web page to be identified is the language with the highest score.
Before step 1, the method further comprises:
Step 21: pre-process the training web pages and the web page to be identified: remove the web page tags, obtain the text of the characters displayed by the page, and restore escape characters (character entities) to the characters they denote.
Step 21 further comprises:
Step 31: convert each ligature code that represents two characters in the text into the codes of the two corresponding characters after splitting.
Step 32: check whether a hamza precedes each word-initial vowel; if not, add one before the word-initial vowel.
Step 33: check whether the word-initial vowel is a compound vowel letter; if so, split the compound vowel letter into its two corresponding characters.
Step 34: convert letters in the Arabic extended area into the letters of corresponding shape in the Arabic basic area.
Determining the value of n for the n-tuples used as identification features in step 1 further comprises:
Step 41: for each language used in the training web pages, compute the probability of occurrence of the i-tuples of that language, where i ranges from 1 to m and m is a preset value.
Step 42: according to these probabilities, preferentially select the i-tuples with a higher probability of occurrence as identification features.
Step 1 further comprises:
Step 51: for each language, sort the weight values in descending order, select the first K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of that language. The sum of the first K weight values of the language exceeds a preset threshold, and K is the value selected for that language.
In step 2, multiplying, for each language used in the training web pages, the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID further comprises:
Step 52: for each language used in the training web pages, multiply the weight value of each feature ID in the feature weight table of that language by the number of occurrences of the identification ID equal to that feature ID.
In step 2, after counting the occurrences of each n-tuple in the web page to be identified and generating the corresponding identification IDs from the significant bits of the n-tuples in the preset standard encoding, the method further comprises:
Step 61: store the number of occurrences of each n-tuple in a storage cell whose index is the identification ID; the storage cells together form the n-tuple array.
In step 2, multiplying, for each language used in the training web pages, the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID further comprises:
Step 62: for each language, traverse the feature weight table of that language; for each row of the table, read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, and multiply the value stored in that cell by the weight value of the row.
The invention also discloses a system for automatically identifying Uyghur in web pages, comprising:
A training module, configured to determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, to count the frequency of occurrence of each n-tuple of that language in the training web pages written in that language and take the frequency as a weight value; and to generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding. The languages used in the training web pages comprise Uyghur and scripts similar to Uyghur; the similar scripts comprise Arabic, Kazakh and Kyrgyz.
An identification module, configured to count the number of occurrences of each n-tuple in the web page to be identified and generate the identification ID corresponding to each count from the significant bits of the n-tuple in the preset standard encoding; and, for each language used in the training web pages, to multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products. The sum is the score of the web page to be identified for that language, and the language used by the web page is the language with the highest score.
The system further comprises a pre-processing module, which runs before the training module and the identification module start.
The pre-processing module is configured to pre-process the training web pages and the web page to be identified: remove the web page tags, obtain the text of the characters displayed by the page, and restore escape characters.
The pre-processing module is further configured to convert each ligature code that represents two characters in the text into the codes of the two corresponding characters after splitting; to check whether a hamza precedes each word-initial vowel and, if not, add one before it; to check whether the word-initial vowel is a compound vowel letter and, if so, split it into its two corresponding characters; and to convert letters in the Arabic extended area into the letters of corresponding shape in the Arabic basic area.
When determining the value of n for the n-tuples used as identification features, the training module is further configured to compute, for each language used in the training web pages, the probability of occurrence of the i-tuples of that language, where i ranges from 1 to m and m is a preset value, and to preferentially select the i-tuples with a higher probability of occurrence as identification features.
The training module is further configured, for each language, to sort the weight values in descending order, select the first K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of that language. The sum of the first K weight values of the language exceeds a preset threshold, and K is the value selected for that language.
When multiplying the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID, the identification module is further configured, for each language used in the training web pages, to take the weight values and feature IDs from the feature weight table of that language.
After counting the occurrences of each n-tuple in the web page to be identified and generating the corresponding identification IDs, the identification module is further configured to store the number of occurrences of each n-tuple in a storage cell whose index is the identification ID; the storage cells together form the n-tuple array.
When multiplying the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID, the identification module is further configured, for each language, to traverse the feature weight table of that language and, for each row of the table, to read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, read the value stored in that cell, and multiply it by the weight value of the row.
The beneficial effects of the invention are as follows: using the n-tuples of each language to decide whether the language of a web page is Uyghur improves identification accuracy; pre-processing the training web pages and the web page to be identified further improves accuracy; building the feature weight tables improves identification efficiency; and looking up counts by feature ID and identification ID also improves identification efficiency.
Description of drawings
Fig. 1 is a flow chart of the method of the invention for automatically identifying Uyghur in web pages;
Fig. 2 is a flow chart of an embodiment of the method of the invention for automatically identifying Uyghur in web pages;
Fig. 3 shows results obtained with the method of the invention for automatically identifying Uyghur in web pages;
Fig. 4 is a structural diagram of the system of the invention for automatically identifying Uyghur in web pages;
Fig. 5 is a structural diagram of the system for automatically identifying Uyghur in web pages in a preferred embodiment of the invention.
Embodiments
The invention is described in further detail below with reference to the accompanying drawings.
The flow of the method of the invention for automatically identifying Uyghur in web pages is shown in Fig. 1.
Step S100: determine the value of n for the n-tuples used as identification features. For each language used in the training web pages, count the frequency of occurrence of each n-tuple of that language in the training web pages written in that language and take the frequency as a weight value; generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding. The languages used in the training web pages comprise Uyghur and scripts similar to Uyghur.
Step S200: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to each count from the significant bits of the n-tuple in the preset standard encoding. For each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products; the sum is the score of the web page to be identified for that language. The language used by the web page to be identified is the language with the highest score.
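The scoring rule of step S200 can be written compactly as a formula. The symbols are introduced here for illustration and do not appear in the source: T_L is the set of feature IDs recorded for language L, w_L(f) is the weight value of feature ID f for language L, and c(f) is the number of occurrences in the web page to be identified of the n-tuple whose identification ID equals f.

```latex
\mathrm{score}(L) \;=\; \sum_{f \in T_L} w_L(f)\, c(f),
\qquad
\hat{L} \;=\; \operatorname*{arg\,max}_{L}\ \mathrm{score}(L)
```

The page is judged to be Uyghur when the maximising language \hat{L} is Uyghur.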
The flow of an embodiment of the method of the invention for automatically identifying Uyghur in web pages is shown in Fig. 2.
Step S301: pre-process the training web pages and the web page to be identified: remove the web page tags, obtain the text of the characters displayed by the page, and restore escape characters. The languages used in the training web pages comprise Uyghur and scripts similar to Uyghur.
Pre-processing converts a web page into plain text. Existing HTML parsers analyse the page as a DOM tree; the text they finally produce sits on the leaf nodes, and escape characters are not parsed but copied verbatim. Many Uyghur characters, however, are written in web pages as decimal or hexadecimal character entities; for example, the letter ئ may appear as the decimal entity "&#1574;" or as an equivalent hexadecimal entity such as "&#x626;".
Therefore, when the web page tags are removed during pre-processing, escape characters, both decimal and hexadecimal, are restored to the characters they denote.
For example:
&#DDDDD: the decimal number DDDDD is converted to an unsigned short int;
&#XHHHH: the hexadecimal number HHHH is converted to an unsigned short int;
other escape characters, such as &nbsp, are treated as spaces and therefore also act as word separators.
Pre-processing thus yields the text of the characters actually displayed by the page, so that the language can subsequently be determined with n-grams.
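A minimal Python sketch of this pre-processing step follows. It assumes that tags can be stripped with a simple regular expression, which is only an approximation of real HTML parsing, and the function name `page_to_text` is hypothetical. Numeric entities are restored to characters, and any other named entity is treated as a space, as described above.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # crude tag matcher (an assumption, not a full HTML parser)
NUMERIC_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?")
NAMED_RE = re.compile(r"&[A-Za-z][A-Za-z0-9]*;?")  # e.g. &nbsp, with or without ';'


def _restore_numeric(match):
    """Turn a decimal or hexadecimal character entity into its character."""
    hex_digits, dec_digits = match.group(1), match.group(2)
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if code > 0x10FFFF:  # not a valid code point: treat it as a separator
        return " "
    return chr(code)


def page_to_text(html):
    """Strip tags, restore numeric entities, map other entities to spaces."""
    text = TAG_RE.sub(" ", html)
    text = NUMERIC_RE.sub(_restore_numeric, text)
    text = NAMED_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For instance, `page_to_text("<p>&#1574;&nbsp;&#x626;</p>")` yields the letter ئ twice, separated by one space.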
The training pages include pages written in scripts similar to Uyghur, where a script counts as similar if a reader who does not know Uyghur cannot tell it apart from Uyghur; for example, the scripts similar to Uyghur include Arabic, Kazakh and Kyrgyz.
In a preferred embodiment, the pre-processing of the training web pages and the web page to be identified also comprises the following operations.
Two characters in one code. For example, the ligature encoded as 0xFEFB (in Unicode, the lam-alef ligature ﻻ) actually represents two characters (ل and ا). This is handled as follows: a code that represents two characters in the text is split into the two corresponding characters.
Inconsistent use of the hamza (the symbol representing a glottal stop) before word-initial vowels. This is handled as follows: check whether a hamza precedes the word-initial vowel and, if not, add one before it; check whether the word-initial vowel is a compound vowel letter and, if so, split the compound vowel letter into its two corresponding characters. (The example character images of the original are not reproduced here.)
n codes for one letter, where n is greater than 2. This problem arises mainly because Uyghur is written with the Arabic alphabet but has some language phenomena of its own, and the Unicode standard does not give Uyghur a separate code area; website designers who needed to display the special Uyghur letters have therefore chosen different codes to represent the same Uyghur letter. Letters in the Arabic extended area are converted into the letters of corresponding shape in the Arabic basic area, which keeps the encoding consistent.
Pre-processing ensures that pages in Uyghur and in scripts similar to Uyghur are analysed and judged against a consistent standard, which avoids confusion and improves the accuracy of the judgment.
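The normalisation operations above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's own mapping table. Unicode NFKC normalisation is swapped in for the ligature splitting and for the extended-to-basic letter conversion, on the assumption that the "extended area" corresponds to the Arabic presentation forms. The hamza is assumed to be written with the carrier letter U+0626, and the `VOWELS` set is a hypothetical list of Uyghur vowel letters.

```python
import unicodedata

HAMZA_CARRIER = "\u0626"  # assumed hamza form: yeh with hamza above
# Hypothetical set of vowel letters that may begin a Uyghur word.
VOWELS = set("\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649")


def normalize_letters(text):
    """Split ligatures and map presentation forms to basic letters.

    NFKC is a stand-in for the patent's conversion rules; it may also
    change characters outside the Arabic blocks.
    """
    return unicodedata.normalize("NFKC", text)


def add_initial_hamza(text):
    """Prefix the hamza carrier to every word that starts with a bare vowel."""
    words = []
    for word in text.split():
        if word[0] in VOWELS:
            word = HAMZA_CARRIER + word
        words.append(word)
    return " ".join(words)


def normalize_text(text):
    """Apply letter normalisation first, then the word-initial hamza rule."""
    return add_initial_hamza(normalize_letters(text))
```

With these definitions, the ligature U+FEFB becomes the two letters U+0644 and U+0627, and a word that already starts with the hamza carrier is left unchanged.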
Step S302: determine the value of n for the n-tuples used as identification features.
In a simple embodiment, n is supplied as a configuration input.
A preferred way of determining n is as follows.
For each language used in the training web pages, compute the probability of occurrence of the i-tuples of that language, where i ranges from 1 to m and m is a preset value. For each value of i, the probability of occurrence of each i-tuple is computed over the training web pages of each language.
The i-tuples with a higher probability of occurrence are preferentially selected as identification features. One way is to add up, for each i-tuple, its probabilities of occurrence in all languages and to select the i-tuples with the largest sums as identification features; the probabilities may also be multiplied by a configured weight for each language before being added. Alternatively, for each language the i-tuples are sorted by probability of occurrence, each position in the ranking is assigned a score, the scores of each i-tuple in the rankings of all languages are added, and identification features are selected by total score.
For example, the probabilities of occurrence of the bigrams (2-tuples), trigrams (3-tuples), 4-grams and 5-grams of each language are computed from the text of all pre-processed training web pages, and the i-tuples with a higher probability of occurrence are preferentially selected as identification features according to those probabilities.
In the specific embodiment, n is 2.
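The first selection option described for step S302 (summing each i-tuple's probability of occurrence over all languages) can be sketched as follows. The function names and the dictionary layout of the training corpora are assumptions made for illustration.

```python
from collections import Counter


def ngram_probabilities(text, i):
    """Probability of occurrence of every i-tuple (i-gram) in one text."""
    grams = [text[k:k + i] for k in range(len(text) - i + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def summed_probabilities(corpora, i):
    """Sum each i-tuple's probability over all languages.

    corpora maps a language name to its pre-processed training text.
    """
    summed = Counter()
    for text in corpora.values():
        for gram, probability in ngram_probabilities(text, i).items():
            summed[gram] += probability
    return summed
```

Calling `summed_probabilities(corpora, i).most_common(k)` for each i from 1 to m then lists the candidate features with the largest sums.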
Step S303: in the training stage, generate the feature weight table of each language from the frequencies of occurrence in the training web pages of that language.
For each language used in the training web pages, count the frequency of occurrence of each n-tuple of that language in the training web pages written in that language and take the frequency as its weight value. For each language, rank the n-tuples by weight value in descending order and select the first K.
In the embodiment, for each language the number of occurrences of each bigram is counted from the text of the training web pages in that language; the count of each bigram divided by the total count of all bigrams in the training web pages of that language is the frequency of occurrence of that bigram. For each language, the K bigrams with the highest frequencies are taken to generate the feature weight table of that language. K is chosen so that the sum of the frequencies of these K bigrams exceeds a preset threshold. For example, with a threshold of 95%, K is 1000 for Uyghur and 400 for Arabic.
In theory the number of distinct bigrams in this application is bounded by 65536. In the actual experiment 1130 distinct Uyghur bigrams were found; the most frequent occurred 5106348 times and the least frequent once. The 100-odd least frequent bigrams occurred between 1 and 30 times each, which shows that they are rare letter combinations of this script that contribute little to expressing the language and can be treated as invalid bigrams; their weights are extremely small, close to machine zero, and cannot be represented as floating-point numbers. Keeping these bigrams would do little to help distinguish languages and would increase storage overhead and computation time. The discarded bigrams account for about 1300 occurrences in total, less than 1% of all bigram occurrences, so discarding them has very little effect.
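A short Python sketch of how such a feature weight table could be built: count the bigrams, turn the counts into frequencies, and keep the most frequent bigrams until their cumulative frequency exceeds the threshold. The function name is hypothetical, and for readability each row holds the bigram itself rather than its feature ID.

```python
from collections import Counter


def build_weight_table(text, threshold=0.95):
    """Return [(bigram, weight), ...] in descending weight order.

    Rows are added until the cumulative weight exceeds the threshold,
    so the length of the result plays the role of K.
    """
    bigrams = [text[k:k + 2] for k in range(len(text) - 1)]
    total = len(bigrams)
    table, cumulative = [], 0.0
    for gram, count in Counter(bigrams).most_common():
        weight = count / total  # frequency of occurrence, used as the weight
        table.append((gram, weight))
        cumulative += weight
        if cumulative > threshold:
            break
    return table
```

Rare bigrams at the tail of the ranking are never reached by the loop, which corresponds to discarding the low-frequency bigrams discussed above.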
Each row of the feature weight table contains the frequency of occurrence of an n-tuple, which is its weight value, and the feature ID corresponding to that n-tuple. The feature ID is generated from the significant bits of the n-tuple in the preset standard encoding. For example, if the preset encoding standard is Unicode, the two letters of the bigram ل ى are encoded as 0x0644 and 0x0649; their low-order bytes 0x44 and 0x49 are combined into 0x4449, i.e. decimal 17481, which serves as the ID of this bigram.
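The ID construction in this example amounts to two bit operations. The sketch below assumes, as in the example, that the significant bits are the low-order byte of each code point; the function name is hypothetical.

```python
def feature_id(bigram):
    """Combine the low-order bytes of a bigram's two code points into one ID."""
    first, second = bigram
    return ((ord(first) & 0xFF) << 8) | (ord(second) & 0xFF)
```

For the bigram made of U+0644 and U+0649, this gives 0x4449, which is 17481 in decimal.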
Taking the Uyghur feature weight table as an example, the table has 1000 rows and records the frequencies of occurrence of 1000 Uyghur bigrams, as shown in Table 1.
Feature ID    Weight value
24358         0.027939
18783         0.027746
10079         0.016921
24362         0.014487
17481         0.013896
24360         0.013854
12639         0.013757
...           ...
Table 1
The first column of each row is the feature ID, an unsigned number that also serves as the array index for the bigram counts of the web page to be identified during identification. For example, "17481" in the fifth row denotes the identification feature bigram ل ى (the letters are separated by a space here for clarity; in real text they are written joined), whose feature ID is 17481. The second column is the weight value of the identification feature, i.e. the proportion of this feature among all bigrams of the Uyghur training web pages during training; for example, "0.013896" in the fifth row means that the weight value of this bigram is 0.013896.
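To read a row of Table 1, a feature ID can be turned back into its two letters. The sketch below assumes that both letters lie in the Unicode range U+0600 to U+06FF, so that the discarded high byte is 0x06; this holds for the fifth row but is not guaranteed for every ID. The function name is hypothetical.

```python
ARABIC_HIGH_BYTE = 0x06  # assumption: both letters lie in U+0600..U+06FF


def decode_feature_id(fid):
    """Recover the two letters of a bigram from its 16-bit feature ID."""
    high, low = fid >> 8, fid & 0xFF
    first = chr((ARABIC_HIGH_BYTE << 8) | high)
    second = chr((ARABIC_HIGH_BYTE << 8) | low)
    return first + second
```

Decoding 17481 this way returns the letters U+0644 and U+0649, the bigram of the fifth row.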
Step S304: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to each count from the significant bits of the n-tuple in the preset standard encoding.
The number of occurrences of each n-tuple is stored in a storage cell whose index is the corresponding identification ID; the storage cells of all n-tuples form the n-tuple array.
All bigrams in the web page to be identified are counted in a single pass, with time complexity O(n), i.e. linear in the length of the text. This requires 256 × 256 = 65536 storage cells, held in an integer array. For each bigram, the identification ID is generated from the significant bits of the bigram in the preset standard encoding and used as the index of its storage cell; the value stored in each cell is the number of occurrences of that bigram in the web page to be identified. This array is called the bigram array.
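A minimal sketch of the bigram array, assuming the identification ID is built from the low-order bytes of the two code points in the same way as the feature ID. The function name is hypothetical.

```python
def count_bigrams(text):
    """Count every bigram of the text in a 65536-cell integer array."""
    counts = [0] * 65536  # 256 * 256 cells, indexed by identification ID
    for first, second in zip(text, text[1:]):
        ident = ((ord(first) & 0xFF) << 8) | (ord(second) & 0xFF)
        counts[ident] += 1
    return counts
```

The single loop over adjacent character pairs is what makes the pass linear in the length of the text.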
Step S305: for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products; the sum is the score of the web page to be identified for that language. The language used by the web page to be identified is the language with the highest score.
For each language, traverse the feature weight table of that language; for each row of the table, read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, and multiply the value stored in that cell by the weight value of the row. A zero product means that the feature does not occur in the web page to be identified.
Specifically, for each row of the Arabic bigram feature weight table, take the weight value of the row as the multiplicand; look up in the bigram array the storage cell whose index is the feature ID of the row, take the bigram count stored there as the multiplier, and multiply the two.
The Arabic bigram feature weight table has 400 rows; the sum of these 400 products is the Arabic score of the web page to be identified and serves as a measure of how likely the page is to be in Arabic.
Likewise, the Uyghur score is computed as the sum of 1000 product terms; if the Uyghur score is the largest, the web page to be identified is in Uyghur.
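The scoring loop can be sketched in a few lines of Python. It assumes a bigram array of 65536 counts and, for each language, a feature weight table stored as a list of (feature ID, weight value) rows; the function name and data layout are illustrative only.

```python
def score_page(counts, weight_tables):
    """Score a page against every language and pick the best one.

    counts        : list of 65536 bigram counts, indexed by identification ID
    weight_tables : dict mapping language -> [(feature_id, weight), ...]
    """
    scores = {}
    for language, table in weight_tables.items():
        # One multiplication per table row; absent features contribute zero.
        scores[language] = sum(weight * counts[fid] for fid, weight in table)
    best = max(scores, key=scores.get)
    return best, scores
```

Because the feature ID is used directly as the array index, each row costs one lookup and one multiplication.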
The beneficial effects are that the feature weight tables are small (at most 1000 rows) and the feature ID is used directly as the index into the bigram array of the web page to be identified, so lookups are fast; the score, which indicates how likely the page is to belong to a given language, requires few multiplications, so execution is efficient.
Two tests below demonstrate the accuracy of Uyghur identification. The items of test 1 are shown in Table 2.
Figure GSB00000641301900101
Table 2
As shown in Fig. 3, the pages whose final score is greater than 0 are Uyghur pages: uy_12.htm is a page of a Uyghur software company, and the other Uyghur page comes from the website of the senior high school of Zepu County, Xinjiang. Both are identified correctly.
The items of test 2 are shown in Table 3.
Figure GSB00000641301900102
Figure GSB00000641301900111
Table 3
The structure of the system of Uighur is as shown in Figure 4 in a kind of automatic identification webpage.
Training module 200; Be used for confirming value as the n tuple n of recognition feature; For the every kind of language that uses in the training webpage, add up the frequency of occurrences of each n tuple in the training webpage that uses said language of said language, be a weighted value with the said frequency of occurrences; And get the significance bit of said n tuple in preset standard coding and generate said weighted value characteristic of correspondence ID, the language that uses in the said training webpage comprise Uighur with the similar literal of Uighur.
Identification module 300 is used to count the number of occurrences of each n-tuple in the webpage to be identified and to take the significant bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to that count. For each language used by the training webpages, it multiplies the weight value of each feature ID by the occurrence count of the identification ID equal to that feature ID and sums the products. The resulting sum is the score of the webpage to be identified for that language, and the language of the webpage is the one with the highest score.
In a preferred scheme, the system also comprises preprocessing module 100, which runs before training module 200 and identification module 300 start, as shown in Figure 5.
Preprocessing module 100 is used to preprocess the training webpages and the webpage to be identified: it removes the webpage tags, obtains the body text of the characters displayed by the webpage, and restores escape characters.
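The preprocessing step can be sketched with the Python standard library as follows. This is an illustration under assumptions, not the patent's implementation: it treats "escape characters" as HTML character references, uses regular expressions rather than a full HTML parser, and the function name `extract_visible_text` is hypothetical.

```python
import html
import re

# Script and style bodies are not displayed text, so drop them with their tags.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def extract_visible_text(page: str) -> str:
    """Remove markup and restore escaped characters in a web page."""
    text = _SCRIPT_STYLE.sub(" ", page)
    text = _TAG.sub(" ", text)
    text = html.unescape(text)       # "&amp;" -> "&", "&#1576;" -> U+0628
    return " ".join(text.split())    # collapse runs of whitespace
```

A regex-based tag stripper is fragile on malformed markup; a production system would more likely walk a parsed document tree, but the effect on the n-tuple statistics is the same.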
Further, preprocessing module 100 is also used to: convert the code of a ligature letter that represents two characters in the text, for example (Figure GSB00000641301900112) and (Figure GSB00000641301900113), into the codes of the two corresponding characters after splitting; judge whether a word-initial vowel letter is preceded by a hamza and, if not, add one before that vowel letter; judge whether a word-initial vowel letter is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters; and convert letters in the Arabic alphabet extended area into the letters of corresponding shape in the Arabic alphabet basic area.
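Part of this normalization can be approximated with Unicode compatibility normalization, as in the sketch below. This is a swapped-in technique, not the patent's own conversion tables: NFKC maps the Arabic presentation forms to base letters and splits ligatures such as lam-alef (U+FEFB) into their two component letters. The vowel set and the choice of U+0626 as the hamza carrier are assumptions made for illustration, and `normalize_word` is a hypothetical name.

```python
import unicodedata

HAMZA_ON_YEH = "\u0626"
# Assumed set of Uighur vowel letters (illustrative, not taken from the patent).
VOWELS = frozenset("\u0627\u06D5\u0648\u06C7\u06C6\u06C8\u06D0\u0649")


def normalize_word(word: str) -> str:
    """Fold presentation forms to base letters and ensure an initial hamza."""
    # NFKC replaces presentation-form code points with base letters and
    # decomposes two-letter ligatures into their component letters.
    word = unicodedata.normalize("NFKC", word)
    # Add the hamza carrier before a word-initial vowel that lacks one.
    if word and word[0] in VOWELS:
        word = HAMZA_ON_YEH + word
    return word
```

For example, `normalize_word("\uFEFB")` returns the two letters U+0644 U+0627. Whether NFKC agrees with the patent's mapping for every extended-area letter has not been verified here.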
In a preferred scheme, when determining the value of n for the n-tuples used as recognition features, training module 200 is further used, for each language used in the training webpages, to compute the probability of occurrence of the i-tuples in that language, where i ranges from 1 to m and m is a preset value, and to preferentially select the i-tuples with higher probability of occurrence as recognition features.
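The statistics behind this choice can be sketched as follows: for each length i from 1 to m, compute the relative frequency of every i-tuple and rank the tuples by it. The function names are hypothetical, and how the patent compares tuples of different lengths is not stated, so the sketch only ranks within one length.

```python
from collections import Counter


def ngram_probabilities(text: str, m: int) -> dict[int, dict[str, float]]:
    """For i = 1..m, map each i-tuple in text to its relative frequency."""
    result: dict[int, dict[str, float]] = {}
    for i in range(1, m + 1):
        counts = Counter(text[k:k + i] for k in range(len(text) - i + 1))
        total = sum(counts.values())
        result[i] = {g: c / total for g, c in counts.items()} if total else {}
    return result


def rank_by_probability(table: dict[str, float]) -> list[str]:
    """Order tuples from most to least probable (ties broken alphabetically)."""
    return sorted(table, key=lambda g: (-table[g], g))
```

Running this over the training text of each language and inspecting the top of each ranking shows which tuple length gives the most discriminating features for that set of languages.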
In a preferred scheme, training module 200 is also used, for each language, to rank the weight values from largest to smallest, select the top K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of that language. The sum of the top K weight values of the language is greater than a preset threshold, and K is the selected value for that language.
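A sketch of building the feature weight table is shown below. The text only requires that the top K weights sum to more than the threshold; taking the smallest such K is an assumption made here, and `build_weight_table` is a hypothetical name.

```python
def build_weight_table(weights: dict[int, float], threshold: float) -> list[tuple[int, float]]:
    """Keep the largest weights until their running sum exceeds the threshold.

    Returns rows of (feature ID, weight), largest weight first.
    Assumption: K is the smallest count that satisfies the threshold.
    """
    table: list[tuple[int, float]] = []
    running = 0.0
    for fid, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        table.append((fid, w))
        running += w
        if running > threshold:
            break
    return table
```

Because K depends on how concentrated each language's distribution is, different languages end up with tables of different lengths, which is consistent with the 400-row Arabic table and the 1000-row Uighur table mentioned in the description.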
When multiplying, for each language used by the training webpages, the weight value of a feature ID by the occurrence count of the identification ID equal to that feature ID, identification module 300 is further used to multiply the weight value of each feature ID in the feature weight table of the language by the occurrence count of the identification ID equal to that feature ID.
Further, after counting the occurrences of each n-tuple in the webpage to be identified and generating the corresponding identification ID from the significant bits of the n-tuple in the preset standard encoding, identification module 300 is also used to store the occurrence count of the n-tuple in a storage unit whose subscript is the identification ID; the storage units together form the n-tuple array.
When performing the multiplication for each language used by the training webpages, identification module 300 is further used, for each language, to traverse the feature weight table of the language: for each row of the table, it reads the feature ID of the row, looks up the storage unit in the n-tuple array whose subscript is that feature ID, reads the value stored in that unit, and multiplies the value by the weight value of the row.
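The array lookup described above can be sketched as follows. As before, the ID packing (low 8 bits of each Unicode code point) is an assumption, and the function names and the language labels used in the usage note are hypothetical.

```python
def count_ngrams(text: str, n: int = 2) -> list[int]:
    """Build the n-tuple array: occurrence counts indexed by identification ID."""
    counts = [0] * (1 << (8 * n))          # 65,536 cells for bigrams
    for i in range(len(text) - n + 1):
        ident = 0
        for ch in text[i:i + n]:
            # Assumed ID scheme: low byte of each code point, packed in order.
            ident = (ident << 8) | (ord(ch) & 0xFF)
        counts[ident] += 1
    return counts


def score(counts: list[int], table: list[tuple[int, float]]) -> float:
    """Traverse one feature weight table; each row costs one array lookup."""
    return sum(weight * counts[fid] for fid, weight in table)


def identify(text: str, tables: dict[str, list[tuple[int, float]]], n: int = 2) -> str:
    """Return the language whose feature weight table gives the highest score."""
    counts = count_ngrams(text, n)
    return max(tables, key=lambda lang: score(counts, tables[lang]))
```

With `tables = {"ug": [...], "ar": [...]}` built during training, `identify(page_text, tables)` counts the page's bigrams once and then touches only the rows of each table, which is why the number of multiplications is bounded by the table sizes rather than by the page length.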
Those skilled in the art can also make various modifications to the above content without departing from the spirit and scope of the present invention as defined by the claims. Therefore, the scope of the present invention is not limited to the above description but is determined by the scope of the claims.

Claims (8)

1. A method for automatically identifying Uighur in webpages, characterized by comprising:
Step 1, determining the value of n for the n-tuples used as recognition features; for each language used in the training webpages, counting the frequency of occurrence of each n-tuple of the language in the training webpages that use the language, taking the frequency of occurrence as a weight value, and taking the significant bits of the n-tuple in a preset standard encoding to generate the feature ID corresponding to the weight value; the languages used in the training webpages include Uighur and scripts similar to Uighur, the scripts similar to Uighur including Arabic, Kazakh, and Kyrgyz;
Step 2, counting the number of occurrences of each n-tuple in the webpage to be identified, and taking the significant bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences; for each language used by the training webpages, multiplying the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID and summing the products, the resulting sum being the score of the webpage to be identified for the language; the language used by the webpage to be identified is the language corresponding to the highest score;
wherein, before step 1, the method further comprises:
Step 21, preprocessing the training webpages and the webpage to be identified, removing the webpage tags, obtaining the body text of the characters displayed by the webpage, and restoring escape characters;
wherein step 21 further comprises:
Step 31, converting the code of a ligature letter that represents two characters in the body text into the codes of the two corresponding characters after splitting;
Step 32, judging whether a word-initial vowel letter is preceded by a hamza and, if not, adding one before the word-initial vowel letter;
Step 33, judging whether the word-initial vowel letter is a compound vowel letter and, if so, splitting the compound vowel letter into the two corresponding characters;
Step 34, converting letters in the Arabic alphabet extended area into the letters of corresponding shape in the Arabic alphabet basic area.

2. The method for automatically identifying Uighur in webpages according to claim 1, characterized in that
determining, in step 1, the value of n for the n-tuples used as recognition features is further:
Step 41, for each language used in the training webpages, computing the probability of occurrence of the i-tuples in the language, where i ranges from 1 to m and m is a preset value;
Step 42, according to the probability of occurrence, preferentially selecting the i-tuples with higher probability of occurrence as recognition features.

3. The method for automatically identifying Uighur in webpages according to claim 1, characterized in that
step 1 further comprises:
Step 51, for each language, ranking the weight values from largest to smallest, selecting the top K weight values, and recording the selected weight values and the feature IDs corresponding to the weight values in the feature weight table of the language; the sum of the top K weight values of the language is greater than a preset threshold, and K is the selected value corresponding to the language;
in step 2, for each language used by the training webpages, multiplying the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID is further:
Step 52, for each language used by the training webpages, multiplying the weight value of a feature ID in the feature weight table of the language by the number of occurrences of the identification ID identical to the feature ID.

4. The method for automatically identifying Uighur in webpages according to claim 3, characterized in that
step 2, after counting the number of occurrences of each n-tuple in the webpage to be identified and taking the significant bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences, further comprises:
Step 61, storing the number of occurrences of the n-tuple in a storage unit, with the identification ID as the subscript of the storage unit, the storage units forming an n-tuple array;
in step 2, for each language used by the training webpages, multiplying the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID is further:
Step 62, for each language, traversing the feature weight table of the language; for each row of the feature weight table, reading the feature ID of the row, looking up in the n-tuple array the storage unit whose subscript is the feature ID, and multiplying the value stored in the storage unit found by the weight value of the row.

5. A system for automatically identifying Uighur in webpages, characterized by comprising:
a training module, used to determine the value of n for the n-tuples used as recognition features; for each language used in the training webpages, to count the frequency of occurrence of each n-tuple of the language in the training webpages that use the language, taking the frequency of occurrence as a weight value, and to take the significant bits of the n-tuple in a preset standard encoding to generate the feature ID corresponding to the weight value; the languages used in the training webpages include Uighur and scripts similar to Uighur, the scripts similar to Uighur including Arabic, Kazakh, and Kyrgyz;
an identification module, used to count the number of occurrences of each n-tuple in the webpage to be identified, and to take the significant bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences; for each language used by the training webpages, to multiply the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID and sum the products, the resulting sum being the score of the webpage to be identified for the language; the language used by the webpage to be identified is the language corresponding to the highest score; and
a preprocessing module, which runs before the training module and the identification module start; the preprocessing module is used to preprocess the training webpages and the webpage to be identified, remove the webpage tags, obtain the body text of the characters displayed by the webpage, and restore escape characters;
wherein the preprocessing module is also used to convert the code of a ligature letter that represents two characters in the body text into the codes of the two corresponding characters after splitting; to judge whether a word-initial vowel letter is preceded by a hamza and, if not, add one before the word-initial vowel letter; to judge whether the word-initial vowel letter is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters; and to convert letters in the Arabic alphabet extended area into the letters of corresponding shape in the Arabic alphabet basic area.

6. The system for automatically identifying Uighur in webpages according to claim 5, characterized in that
the training module, when determining the value of n for the n-tuples used as recognition features, is further used, for each language used in the training webpages, to compute the probability of occurrence of the i-tuples in the language, where i ranges from 1 to m and m is a preset value, and, according to the probability of occurrence, to preferentially select the i-tuples with higher probability of occurrence as recognition features.

7. The system for automatically identifying Uighur in webpages according to claim 5, characterized in that
the training module is also used, for each language, to rank the weight values from largest to smallest, select the top K weight values, and record the selected weight values and the feature IDs corresponding to the weight values in the feature weight table of the language; the sum of the top K weight values of the language is greater than a preset threshold, and K is the selected value corresponding to the language;
the identification module, when multiplying, for each language used by the training webpages, the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID, is further used, for each language used by the training webpages, to multiply the weight value of a feature ID in the feature weight table of the language by the number of occurrences of the identification ID identical to the feature ID.

8. The system for automatically identifying Uighur in webpages according to claim 7, characterized in that
the identification module, after counting the number of occurrences of each n-tuple in the webpage to be identified and taking the significant bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences, is also used
to store the number of occurrences of the n-tuple in a storage unit, with the identification ID as the subscript of the storage unit, the storage units forming an n-tuple array;
the identification module, when multiplying, for each language used by the training webpages, the weight value of a feature ID by the number of occurrences of the identification ID identical to the feature ID, is further used, for each language, to traverse the feature weight table of the language; for each row of the feature weight table, to read the feature ID of the row, look up in the n-tuple array the storage unit whose subscript is the feature ID, read the value stored in the storage unit found, and multiply the value by the weight value of the row.
CN2010101898517A 2010-05-24 2010-05-24 Method and system thereof for automatically identifying Uyghur in web page Active CN101882148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101898517A CN101882148B (en) 2010-05-24 2010-05-24 Method and system thereof for automatically identifying Uyghur in web page

Publications (2)

Publication Number Publication Date
CN101882148A CN101882148A (en) 2010-11-10
CN101882148B true CN101882148B (en) 2012-01-04

Family

ID=43054163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101898517A Active CN101882148B (en) 2010-05-24 2010-05-24 Method and system thereof for automatically identifying Uyghur in web page

Country Status (1)

Country Link
CN (1) CN101882148B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528521A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Method and device for screening social application data
CN111124481A (en) * 2019-12-24 2020-05-08 广州市百果园信息技术有限公司 Installation package generation method and device of webpage application program, storage medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089791A (en) * 2006-06-15 2007-12-19 彭建强 System and method for multilingual character and number input
CN101201820A (en) * 2007-11-28 2008-06-18 北京金山软件有限公司 A bilingual corpus filtering method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6477488B1 (en) * 2000-03-10 2002-11-05 Apple Computer, Inc. Method for dynamic context scope selection in hybrid n-gram+LSA language modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
薛亚平,袁保社.全文检索系统中语种识别与索引技术研究.《网络安全技术与应用》.2009,(第12期),49-51. *
靳简明,王华,丁晓青.维汉英混排文档识别.《电子与信息学报》.2006,第28卷(第7期),1188-1191. *

Also Published As

Publication number Publication date
CN101882148A (en) 2010-11-10


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Ni Yaoqun

Inventor after: Xu Hongbo

Inventor after: Cheng Xueqi

Inventor before: Ni Yaoqun

Inventor before: Xu Hongbo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: NI YAOQUN XU HONGBO TO: NI YAOQUN XU HONGBO CHENG XUEQI

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20101110

Assignee: Branch DNT data Polytron Technologies Inc

Assignor: Institute of Computing Technology, Chinese Academy of Sciences

Contract record no.: 2018110000033

Denomination of invention: Method and system thereof for automatically identifying Uyghur in web page

Granted publication date: 20120104

License type: Common License

Record date: 20180807
