Summary of the invention
To address the above problem, the invention provides a method and a system for automatically identifying Uighur in webpages. Language n-grams are used to identify whether the language of a webpage is Uighur, which improves identification accuracy.
The invention discloses a method for automatically identifying Uighur in webpages, comprising:
Step 1: determine the value of n for the n-grams used as recognition features. For each language used in the training webpages, count the frequency of occurrence of each n-gram of that language in the training webpages that use the language, take that frequency as a weight, and generate the feature ID corresponding to the weight from the significant bits of the n-gram in a preset standard encoding. The languages used in the training webpages include Uighur and scripts similar to Uighur; the scripts similar to Uighur include Arabic, Kazakh, and Kyrgyz.
Step 2: count the number of occurrences of each n-gram in the webpage to be identified, and generate the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding. For each language used in the training webpages, multiply the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, and sum the products; the resulting sum is the score of the webpage to be identified for that language. The language of the webpage to be identified is the language with the highest score.
Before step 1, the method further comprises:
Step 21: preprocess the training webpages and the webpage to be identified: remove the webpage tags, obtain the text of the characters displayed by the webpage, and restore escape characters.
Step 21 further comprises:
Step 31: convert the encoding of a ligature letter that represents two characters in the text into the encodings of the two corresponding separate characters;
Step 32: determine whether a hamza precedes a word-initial vowel and, if not, add one before that vowel;
Step 33: determine whether a word-initial vowel is a composite vowel letter and, if so, split the composite vowel letter into the two corresponding characters;
Step 34: convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.
Determining the value of n for the n-grams used as recognition features in step 1 further comprises:
Step 41: for each language used in the training webpages, compute the probability of occurrence of the i-grams in that language, where i ranges from 1 to m and m is a preset value;
Step 42: preferentially select the i-grams with higher probability of occurrence as recognition features.
Step 1 further comprises:
Step 51: for each language, rank the weights in descending order, select the top K weights, and record the selected weights and their corresponding feature IDs in the feature weight table of that language; the sum of the top K weights of the language is greater than a preset threshold, and K is the value selected for that language.
In step 2, multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID further comprises:
Step 52: for each language used in the training webpages, multiply the weight of each feature ID in the feature weight table of that language by the occurrence count of the identification ID equal to that feature ID.
In step 2, after counting the occurrences of each n-gram in the webpage to be identified and generating the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding, the method further comprises:
Step 61: store the occurrence count of each n-gram in a storage unit whose index is the identification ID; the storage units together form an n-gram array.
In step 2, multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID further comprises:
Step 62: for each language, traverse the feature weight table of that language; for each row of the table, read the feature ID of the row, look up the storage unit in the n-gram array whose index is that feature ID, and multiply the value stored in that unit by the weight of the row.
The invention also discloses a system for automatically identifying Uighur in webpages, comprising:
A training module, configured to determine the value of n for the n-grams used as recognition features; for each language used in the training webpages, to count the frequency of occurrence of each n-gram of that language in the training webpages that use the language, take that frequency as a weight, and generate the feature ID corresponding to the weight from the significant bits of the n-gram in a preset standard encoding. The languages used in the training webpages include Uighur and scripts similar to Uighur; the scripts similar to Uighur include Arabic, Kazakh, and Kyrgyz.
An identification module, configured to count the number of occurrences of each n-gram in the webpage to be identified and generate the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding; for each language used in the training webpages, to multiply the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID and sum the products, the resulting sum being the score of the webpage to be identified for that language. The language of the webpage to be identified is the language with the highest score.
The system further comprises a preprocessing module, which runs before the training module and the identification module start.
The preprocessing module is configured to preprocess the training webpages and the webpage to be identified: remove the webpage tags, obtain the text of the characters displayed by the webpage, and restore escape characters.
The preprocessing module is further configured to: convert the encoding of a ligature letter that represents two characters in the text into the encodings of the two corresponding separate characters; determine whether a hamza precedes a word-initial vowel and, if not, add one before that vowel; determine whether a word-initial vowel is a composite vowel letter and, if so, split the composite vowel letter into the two corresponding characters; and convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.
When determining the value of n for the n-grams used as recognition features, the training module is further configured to: for each language used in the training webpages, compute the probability of occurrence of the i-grams in that language, where i ranges from 1 to m and m is a preset value; and preferentially select the i-grams with higher probability of occurrence as recognition features.
The training module is further configured to: for each language, rank the weights in descending order, select the top K weights, and record the selected weights and their corresponding feature IDs in the feature weight table of that language; the sum of the top K weights of the language is greater than a preset threshold, and K is the value selected for that language.
When multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, the identification module is further configured to use the weights of the feature IDs in the feature weight table of that language.
After counting the occurrences of each n-gram in the webpage to be identified and generating the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding, the identification module is further configured to:
store the occurrence count of each n-gram in a storage unit whose index is the identification ID, the storage units together forming an n-gram array.
When multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, the identification module is further configured to: for each language, traverse the feature weight table of that language; for each row of the table, read the feature ID of the row, look up the storage unit in the n-gram array whose index is that feature ID, read the value stored in that unit, and multiply the value by the weight of the row.
The beneficial effects of the invention are as follows: using language n-grams to identify whether the language of a webpage is Uighur improves identification accuracy; preprocessing the training webpages and the webpage to be identified further improves identification accuracy; building feature weight tables improves identification efficiency; and looking up entries by feature ID and identification ID improves identification efficiency.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings.
The flow of the method of the invention for automatically identifying Uighur in webpages is shown in Figure 1.
Step S100: determine the value of n for the n-grams used as recognition features. For each language used in the training webpages, count the frequency of occurrence of each n-gram of that language in the training webpages that use the language, take that frequency as a weight, and generate the feature ID corresponding to the weight from the significant bits of the n-gram in a preset standard encoding. The languages used in the training webpages include Uighur and scripts similar to Uighur.
Step S200: count the number of occurrences of each n-gram in the webpage to be identified, and generate the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding. For each language used in the training webpages, multiply the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, and sum the products; the resulting sum is the score of the webpage to be identified for that language. The language of the webpage to be identified is the language with the highest score.
The flow of an embodiment of the method of the invention for automatically identifying Uighur in webpages is shown in Figure 2.
Step S301: preprocess the training webpages and the webpage to be identified: remove the webpage tags, obtain the text of the characters displayed by the webpage, and restore escape characters. The languages used in the training webpages include Uighur and scripts similar to Uighur.
Preprocessing converts a webpage into plain text. Existing HTML parsers analyze the page as a DOM tree; the final text is generated from the leaf nodes, and escape characters are not parsed but copied verbatim. Many Uighur characters, however, are written in webpages as decimal or hexadecimal character entities. For example, the letter ئ (U+0626) may appear in the page source as the decimal entity &#1574; or the hexadecimal entity &#x626;.
Therefore, during preprocessing, when the webpage tags are removed, escape characters in both decimal and hexadecimal form are restored.
For example:
&#DDDDD converts the decimal number DDDDD into an unsigned short int;
&#XHHHH converts the hexadecimal number HHHH into an unsigned short int;
other escape characters, such as &nbsp;, are treated as a space, which also serves as a word separator.
Preprocessing thus produces the text of the characters actually displayed by the webpage, which makes the subsequent n-gram language identification straightforward.
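The tag removal and escape restoration described above can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser` module, not the parser used in the embodiment; the names `VisibleTextExtractor` and `page_to_text` are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects displayed text and restores numeric character entities."""

    def __init__(self):
        # convert_charrefs=False routes entities to the handlers below.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> (assumption: not displayed)

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def handle_charref(self, name):
        # name is "1574" for a decimal entity or "x626" for a hexadecimal one.
        if self._skip:
            return
        code = int(name[1:], 16) if name[0] in "xX" else int(name)
        self.parts.append(chr(code))

    def handle_entityref(self, name):
        # Other escapes such as &nbsp; are treated as a space (word separator).
        if not self._skip:
            self.parts.append(" ")


def page_to_text(html_source):
    parser = VisibleTextExtractor()
    parser.feed(html_source)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

Both the decimal and the hexadecimal entity forms end up as the same Unicode character, so later n-gram counting sees one consistent encoding.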
Webpages in scripts similar to Uighur are also used. Whether a script is similar to Uighur is judged by whether a reader who does not know Uighur can tell it apart from Uighur; for example, scripts similar to Uighur include Arabic, Kazakh, and Kyrgyz.
In a preferred implementation, preprocessing of the training webpages and the webpage to be identified also includes the following operations.
For the problem of a single ligature letter that stands for two characters: for example, the ligature ﻻ, encoded as 0xFEFB, actually represents the two characters ل and ا. This is handled as follows: an encoding that represents two characters in the text is split into the two corresponding characters; for example, ﻻ is split into the two corresponding characters ل and ا.
For the problem that a hamza (the symbol representing the glottal stop) is added inconsistently before word-initial vowels, the operation is as follows: determine whether a hamza precedes the word-initial vowel and, if not, add one before that vowel; determine whether the word-initial vowel is a composite vowel letter and, if so, split the composite vowel letter into its two corresponding characters.
For the problem of one letter having n encodings, where n is greater than 2: the main cause is that Uighur is written with the Arabic alphabet but has some language features of its own, and the Unicode standard does not give Uighur a separate encoding area. To display the special Uighur letters, different website designers have therefore chosen different encodings for the same Uighur letter. Letters in the extended Arabic letter area are converted into the letters of corresponding shape in the basic Arabic letter area, which keeps the encoding consistent.
Through preprocessing, webpages in Uighur and in scripts similar to Uighur are analyzed and judged against a consistent standard, which avoids confusion and improves the accuracy of the judgment.
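A rough sketch of these normalization operations is given below. It substitutes Unicode NFKC normalization for the embodiment's own conversion table, on the assumption that the "extended area" corresponds to the Arabic presentation-form code points; NFKC maps those to basic Arabic letters and splits ligatures such as U+FEFB into U+0644 U+0627. The hamza character `HAMZA`, the vowel set `VOWELS`, and the function name `normalize_text` are illustrative assumptions, and the splitting of composite vowel letters is not covered.

```python
import re
import unicodedata

# Assumption: U+0626 is taken as the hamza that precedes word-initial vowels.
HAMZA = "\u0626"
# Assumption: an illustrative set of vowel letters; adjust to the orthography in use.
VOWELS = "\u0627\u06d5\u0648\u06c7\u06c6\u06c8\u06d0\u0649"

# A vowel at the start of the text or directly after whitespace.
_INITIAL_VOWEL = re.compile(r"(?<!\S)([" + VOWELS + r"])")


def normalize_text(text):
    # NFKC replaces presentation forms with base letters and splits
    # ligatures into their component letters.
    text = unicodedata.normalize("NFKC", text)
    # Add a hamza before any word-initial vowel that does not already have one.
    return _INITIAL_VOWEL.sub(HAMZA + r"\1", text)
```

Because a word that already begins with a hamza no longer begins with a vowel letter, applying `normalize_text` twice gives the same result as applying it once.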
Step S302: determine the value of n for the n-grams used as recognition features.
A simple way to determine n is to read it from configuration input.
A preferred way to determine n is as follows.
For each language used in the training webpages, compute the probability of occurrence of the i-grams in that language, where i ranges from 1 to m and m is a preset value. For each value of i, compute the probability of occurrence of each i-gram in the training webpages of each language.
The i-grams with higher probability of occurrence are preferentially selected as recognition features. One selection method is, for each i-gram, to sum its probabilities of occurrence over all languages and select the i-gram with the largest sum as the recognition feature; the probabilities may also be multiplied by a configured weight for each language before summing. Alternatively, for each language, the i-grams are sorted by probability of occurrence and a score is assigned to each position in the ranking; the scores of each i-gram across the rankings of all languages are summed, and recognition features are selected by score.
For example, from the text of all preprocessed training webpages, the probabilities of occurrence of bigrams (2-grams), trigrams (3-grams), 4-grams, and 5-grams are computed for each language, and the i-grams with higher probability of occurrence in the training webpages of each language are preferentially selected as recognition features.
In the specific embodiment, n is 2.
Step S303: in the training stage, generate a feature weight table for each language from the frequencies of occurrence in its training webpages.
For each language used in the training webpages, count the frequency of occurrence of each n-gram of that language in the training webpages that use the language, and take that frequency as a weight. For each language, rank the n-grams by weight in descending order and select the top K n-grams.
In the embodiment, for each language, the number of occurrences of each bigram is counted from the text of the training webpages that use the language. For each bigram, its occurrence count is divided by the total occurrence count of all bigrams in the training webpages of that language; the quotient is the frequency of occurrence of that bigram. For each language, the K bigrams with the highest frequency are taken to generate the feature weight table of that language. K is chosen so that the sum of the frequencies of these K bigrams is greater than a preset threshold. For example, with a preset threshold of 95%, K is 1000 for Uighur and 400 for Arabic.
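The training computation described above can be illustrated with a short sketch. The function names `bigram_weights` and `top_k_features` are hypothetical, and the training text is assumed to be a single preprocessed string for one language.

```python
from collections import Counter


def bigram_weights(text):
    """Relative frequency of each bigram in the training text of one language."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def top_k_features(weights, threshold=0.95):
    """Keep the highest-weight bigrams until their summed weight exceeds the threshold."""
    selected = []
    cumulative = 0.0
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    for gram, weight in ranked:
        selected.append((gram, weight))
        cumulative += weight
        if cumulative > threshold:
            break
    return selected
```

Here K is not fixed in advance; it is simply the length of the list returned by `top_k_features`, which is why it differs between languages for the same threshold.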
In theory, the number of bigrams in this application is bounded above by 65536. In the actual experiment, 1130 Uighur bigrams were found; the most frequent occurred 5106348 times and the least frequent occurred once. The more than 100 least frequent bigrams each occurred between 1 and 30 times, which indicates that they are rare letter combinations in this script, contribute little to expressing the language, and are invalid bigrams. Their weights are extremely small, close to machine zero, and cannot be represented with floating-point numbers. Keeping these bigrams would help little in distinguishing languages and would increase storage overhead and computation time. The discarded bigrams account for about 1300 occurrences in total, less than 1% of all bigram occurrences, so the effect of discarding them is very small.
Each row of the feature weight table contains the frequency of occurrence of an n-gram, that is, its weight, and the feature ID corresponding to the n-gram. The feature ID is generated from the significant bits of the n-gram in the preset standard encoding. For example, if the preset encoding standard is Unicode, the bigram لى is encoded as 0x0644 and 0x0649; the low-order hexadecimal bytes of each, 0x44 and 0x49, are combined into 0x4449, that is, decimal 17481, which serves as the ID of the bigram لى.
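The ID construction in this example can be written as a small function. The name `feature_id` is hypothetical, and the sketch assumes that the significant bits are the low-order byte of each character's Unicode code point, as in the example above.

```python
def feature_id(bigram):
    """Combine the low-order byte of each character into a 16-bit ID."""
    first, second = bigram
    return ((ord(first) & 0xFF) << 8) | (ord(second) & 0xFF)


# U+0644 and U+0649 give 0x44 and 0x49, combined as 0x4449.
example_id = feature_id("\u0644\u0649")
```

Since each character contributes one byte, every ID falls in the range 0 to 65535, which matches the theoretical upper bound of 65536 bigrams.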
Taking the Uighur feature weight table as an example, the table has 1000 rows and records the frequencies of occurrence of 1000 Uighur bigrams, as shown in Table 1.
| Feature ID | Weight |
| --- | --- |
| 24358 | 0.027939 |
| 18783 | 0.027746 |
| 10079 | 0.016921 |
| 24362 | 0.014487 |
| 17481 | 0.013896 |
| 24360 | 0.013854 |
| 12639 | 0.013757 |
| ... | ... |

Table 1
The first column of each row is the feature ID, expressed as an unsigned number; during identification it also serves as the index into the array that counts bigram occurrences in the webpage to be identified. For example, 17481 in the fifth row denotes the recognition feature bigram ل ى (the letters are separated by a space here for clarity; in actual text the bigram appears as لى), whose feature ID is 17481. The second column is the weight of the recognition feature, that is, the proportion of this feature among all bigrams of the Uighur training webpages during training; for example, 0.013896 in the fifth row means that the weight of لى is 0.013896.
Step S304: count the number of occurrences of each n-gram in the webpage to be identified, and generate the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding.
The occurrence count of each n-gram is stored in a storage unit whose index is the corresponding identification ID; the storage units of all n-grams form the n-gram array.
All bigrams in the webpage to be identified are counted in a single pass, so the time complexity is O(n) in the length of the text. A total of 256 × 256 = 65536 storage units are needed, stored in an integer array. For each bigram, the corresponding identification ID is generated from its significant bits in the preset standard encoding; the index of its storage unit is this identification ID, and the value stored in the unit is the number of occurrences of the bigram in the webpage to be identified. This array is called the bigram array.
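A minimal sketch of building the bigram array follows. The function name `count_bigrams` is hypothetical, a plain Python list stands in for the integer array, and the identification ID is again assumed to combine the low-order byte of each character.

```python
def count_bigrams(text):
    """Count bigram occurrences in one pass, indexed by identification ID."""
    counts = [0] * (256 * 256)  # one storage unit per possible 16-bit ID
    for i in range(len(text) - 1):
        # Identification ID: low-order byte of each of the two characters.
        ident = ((ord(text[i]) & 0xFF) << 8) | (ord(text[i + 1]) & 0xFF)
        counts[ident] += 1
    return counts
```

Using the ID directly as the array index means that no hashing or searching is needed when a count is later looked up by feature ID.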
Step S305: for each language used in the training webpages, multiply the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, and sum the products; the resulting sum is the score of the webpage to be identified for that language. The language of the webpage to be identified is the language with the highest score.
For each language, traverse the feature weight table of that language. For each row of the table, read the feature ID of the row, look up the storage unit in the n-gram array whose index is that feature ID, and multiply the value stored in that unit by the weight of the row. If the product is zero, the feature does not occur in the webpage to be identified.
Specifically, for each row of the Arabic bigram feature weight table, the weight of the row is the multiplicand. The storage unit indexed by the feature ID of the row is looked up in the bigram array, and the bigram occurrence count stored there is taken as the multiplier. The multiplier and the multiplicand are then multiplied.
The Arabic bigram feature weight table has 400 rows. The sum of these 400 products is the Arabic score of the webpage to be identified, which represents the likelihood that the webpage is Arabic.
Likewise, the Uighur score is computed as the sum of 1000 product terms. If the Uighur score is the largest, the webpage to be identified is Uighur.
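The scoring and decision step can be sketched as below. The names `score` and `identify` are hypothetical, each feature weight table is assumed to be a list of (feature ID, weight) rows, and the small tables in the usage lines are made-up illustrative values, not data from the embodiment.

```python
def score(counts, weight_table):
    """Sum of weight times occurrence count over the rows of one language's table."""
    return sum(weight * counts[fid] for fid, weight in weight_table)


def identify(counts, tables):
    """Return the language whose feature weight table gives the highest score."""
    return max(tables, key=lambda language: score(counts, tables[language]))


# Usage with made-up values: a bigram array and two tiny weight tables.
counts = [0] * 65536
counts[17481] = 3
counts[100] = 1
tables = {
    "uighur": [(17481, 0.5)],
    "arabic": [(100, 0.2), (17481, 0.1)],
}
best = identify(counts, tables)
```

The cost per language is one multiplication per table row, so with tables of at most 1000 rows the decision needs only a few thousand multiplications regardless of page length.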
The beneficial effects are as follows. The feature weight tables are small (at most 1000 rows), and the feature ID is used as the index into the bigram array of the webpage to be identified, so lookups are fast. The computed score represents the likelihood that the webpage to be identified belongs to a given language. Few multiplications are needed, so execution is efficient.
The accuracy of Uighur identification is shown below through two tests. The items of test 1 are shown in Table 2.
Table 2
As shown in Figure 3, the pages whose final score is greater than 0 are Uighur. Among them, uy_12.htm is a page of a Uighur software company, and the other Uighur page comes from the website of a Uighur senior middle school in Zepu County, Xinjiang; both are identified accurately.
The items of test 2 are shown in Table 3.
Table 3
The structure of a system for automatically identifying Uighur in webpages is shown in Figure 4.
Training module 200 is configured to determine the value of n for the n-grams used as recognition features; for each language used in the training webpages, to count the frequency of occurrence of each n-gram of that language in the training webpages that use the language, take that frequency as a weight, and generate the feature ID corresponding to the weight from the significant bits of the n-gram in a preset standard encoding. The languages used in the training webpages include Uighur and scripts similar to Uighur.
Identification module 300 is configured to count the number of occurrences of each n-gram in the webpage to be identified and generate the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding; for each language used in the training webpages, to multiply the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID and sum the products, the resulting sum being the score of the webpage to be identified for that language. The language of the webpage to be identified is the language with the highest score.
In a preferred scheme, the system further comprises preprocessing module 100, which runs before training module 200 and identification module 300 start, as shown in Figure 5.
Preprocessing module 100 is configured to preprocess the training webpages and the webpage to be identified: remove the webpage tags, obtain the text of the characters displayed by the webpage, and restore escape characters.
Further, preprocessing module 100 is also configured to: convert the encoding of a ligature letter that represents two characters in the text into the encodings of the two corresponding separate characters; determine whether a hamza precedes a word-initial vowel and, if not, add one before that vowel; determine whether a word-initial vowel is a composite vowel letter and, if so, split the composite vowel letter into the two corresponding characters; and convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.
In a preferred scheme, when determining the value of n for the n-grams used as recognition features, training module 200 is further configured to: for each language used in the training webpages, compute the probability of occurrence of the i-grams in that language, where i ranges from 1 to m and m is a preset value; and preferentially select the i-grams with higher probability of occurrence as recognition features.
In a preferred scheme, training module 200 is also configured to: for each language, rank the weights in descending order, select the top K weights, and record the selected weights and their corresponding feature IDs in the feature weight table of that language; the sum of the top K weights of the language is greater than a preset threshold, and K is the value selected for that language.
When multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, identification module 300 is further configured to use the weights of the feature IDs in the feature weight table of that language.
Further, after counting the occurrences of each n-gram in the webpage to be identified and generating the identification ID corresponding to each occurrence count from the significant bits of the n-gram in the preset standard encoding, identification module 300 is also configured to store the occurrence count of each n-gram in a storage unit whose index is the identification ID; the storage units together form an n-gram array.
When multiplying, for each language used in the training webpages, the weight of each feature ID by the occurrence count of the identification ID equal to that feature ID, identification module 300 is further configured to: for each language, traverse the feature weight table of that language; for each row of the table, read the feature ID of the row, look up the storage unit in the n-gram array whose index is that feature ID, read the value stored in that unit, and multiply the value by the weight of the row.
Those skilled in the art may make various modifications to the above content without departing from the spirit and scope of the invention as defined by the claims. The scope of the invention is therefore not limited to the above description but is determined by the scope of the claims.