CN101882148B

CN101882148B - Method and system for automatically identifying Uyghur in web pages

Info

Publication number: CN101882148B
Application number: CN2010101898517A
Authority: CN
Inventors: 倪耀群; 许洪波
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2010-05-24
Filing date: 2010-05-24
Publication date: 2012-01-04
Anticipated expiration: 2030-05-24
Also published as: CN101882148A

Abstract

The present invention relates to a method and system for automatically identifying Uyghur in web pages. The method includes: step 1, determining the value of n for the n-tuples used as identification features; for each language used in the training web pages, counting the frequency of occurrence of each n-tuple of that language in the training web pages that use that language, taking the frequency as a weight value, and generating the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding; step 2, counting the number of occurrences of each n-tuple in the web page to be identified, and generating the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiplying the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and summing the products, the sum being the score of the web page to be identified for that language; the language of the web page to be identified is the language with the highest score. The invention can improve identification accuracy.

Description

Method and system for automatically identifying Uyghur in web pages

Technical field

The present invention relates to the field of web page processing, and in particular to a method and system for automatically identifying Uyghur in web pages.

Background technology

At present, the scripts of more than 60 nationalities around the world are written on the basis of the Arabic alphabet. The Uyghur, Kazakh and Kyrgyz scripts used in China's Xinjiang region all belong to this class. Uyghur written with the Arabic alphabet is called ASU (Arabic-Script Uyghur). The problem to be solved is how to identify Uyghur among scripts such as Arabic, Persian, Kazakh and Kyrgyz.

To identify Uyghur text, especially on web pages, the prior art offers two approaches. The first checks the character codes to see whether any of the 18 letters that are peculiar to Uyghur and absent from Arabic appear; the second checks the font files referenced by the web page to see whether a font name commonly used for Uyghur appears.

Identifying Uyghur by its special letters has two shortcomings: first, at least one of the 18 special letters must appear in the web page before a decision can be made; second, the codes of these 18 letters may also be used by other scripts based on the Arabic alphabet, such as Kazakh, which causes identification errors.

Some Uyghur web pages use WEFT (Microsoft Web Embedding Fonts Tool), which packages the fonts used by a page into a compressed font file in EOT (Embedded OpenType) format; the page uses this font file to display the special letter forms of Uyghur. Different websites use different EOT file names. Identifying Uyghur solely from information such as the names of these font files has three shortcomings: first, some Tibetan web pages also use WEFT in practice, and there is no guarantee that the EOT file names of Tibetan pages differ from those of Uyghur pages; second, many future Uyghur websites will use new EOT file names, which cannot be predicted; third, browsers not based on the IE kernel, such as Firefox and Google Chrome, do not support WEFT, so the method fails for them.

Today the scripts of many ethnic groups are mostly encoded in UTF-8 for display, transmission and storage on web pages. In theory UTF-8 can represent all languages, but it does not distinguish between them. To a Chinese-speaking user, Arabic, Uyghur, Kazakh and Kyrgyz look very similar, and without an identification program these scripts are almost impossible to tell apart. The problem of automatically identifying Uyghur therefore needs to be solved.

Summary of the invention

To solve the above problem, the invention provides a method and system for automatically identifying Uyghur in web pages, which use language n-tuples (n-grams) to determine whether the language used in a web page is Uyghur and thereby improve identification accuracy.

The invention discloses a method for automatically identifying Uyghur in web pages, comprising:

Step 1: determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, take the frequency of occurrence as a weight value, and generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding; the languages used in the training web pages comprise Uyghur and scripts similar to Uyghur, the scripts similar to Uyghur comprising Arabic, Kazakh and Kyrgyz;

Step 2: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products, the sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language with the highest score;

The method further comprises, before step 1:

Step 21: preprocess the training web pages and the web page to be identified by removing the web page tags, obtaining the text of the characters displayed by the page, and restoring escape sequences to the characters they represent;

Step 21 further comprises:

Step 31: convert the code of any ligature letter that represents two characters in the text into the codes of the two corresponding characters obtained by splitting it;

Step 32: determine whether a hamza is present before a word-initial vowel and, if not, add one before the word-initial vowel;

Step 33: determine whether the word-initial vowel is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters;

Step 34: convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.

Determining the value of n for the n-tuples used as identification features in step 1 further comprises:

Step 41: for each language used in the training web pages, count the probability of occurrence of the i-tuples in the language, where i ranges from 1 to m and m is a preset value;

Step 42: according to the probabilities of occurrence, preferentially select the i-tuples with the higher probability of occurrence as identification features.

Step 1 further comprises:

Step 51: for each language, sort the weight values in descending order, select the first K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of the language; K is chosen separately for each language such that the sum of the first K weight values of the language exceeds a preset threshold;

In step 2, multiplying, for each language used in the training web pages, the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID further comprises:

Step 52: for each language used in the training web pages, multiply the weight value of each feature ID in the feature weight table of the language by the number of occurrences of the identification ID equal to that feature ID.

In step 2, after counting the number of occurrences of each n-tuple in the web page to be identified and generating the corresponding identification ID from the significant bits of the n-tuple in the preset standard encoding, the method further comprises:

Step 61: store the number of occurrences of each n-tuple in a storage cell whose index is the identification ID, the storage cells together forming an n-tuple array;

Step 62: for each language, traverse the feature weight table of the language; for each row of the feature weight table, read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, and multiply the value stored in that cell by the weight value of the row.

The invention also discloses a system for automatically identifying Uyghur in web pages, comprising:

A training module, configured to determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, take the frequency of occurrence as a weight value, and generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding; the languages used in the training web pages comprise Uyghur and scripts similar to Uyghur, the scripts similar to Uyghur comprising Arabic, Kazakh and Kyrgyz;

An identification module, configured to count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products, the sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language with the highest score;

The system further comprises a preprocessing module, which runs before the training module and the identification module are started;

The preprocessing module is configured to preprocess the training web pages and the web page to be identified by removing the web page tags, obtaining the text of the characters displayed by the page, and restoring escape sequences to the characters they represent;

The preprocessing module is further configured to convert the code of any ligature letter that represents two characters in the text into the codes of the two corresponding characters obtained by splitting it; determine whether a hamza is present before a word-initial vowel and, if not, add one before the word-initial vowel; determine whether the word-initial vowel is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters; and convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.

When determining the value of n for the n-tuples used as identification features, the training module is further configured to, for each language used in the training web pages, count the probability of occurrence of the i-tuples in the language, where i ranges from 1 to m and m is a preset value, and to preferentially select the i-tuples with the higher probability of occurrence as identification features.

The training module is further configured to, for each language, sort the weight values in descending order, select the first K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of the language; K is chosen separately for each language such that the sum of the first K weight values of the language exceeds a preset threshold;

When multiplying, for each language used in the training web pages, the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID, the identification module is further configured to multiply the weight value of each feature ID in the feature weight table of the language by the number of occurrences of the identification ID equal to that feature ID.

After counting the number of occurrences of each n-tuple in the web page to be identified and generating the corresponding identification ID from the significant bits of the n-tuple in the preset standard encoding, the identification module is further configured to:

store the number of occurrences of each n-tuple in a storage cell whose index is the identification ID, the storage cells together forming an n-tuple array;

When multiplying, for each language used in the training web pages, the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID, the identification module is further configured to, for each language, traverse the feature weight table of the language; for each row of the feature weight table, read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, read the value stored in that cell, and multiply the value by the weight value of the row.

The beneficial effects of the present invention are as follows: using language n-tuples to determine whether the language used in a web page is Uyghur improves identification accuracy; preprocessing the training web pages and the web page to be identified further improves accuracy; building the feature weight tables improves identification efficiency; and looking up entries by feature ID and identification ID also improves identification efficiency.

Description of drawings

Fig. 1 is a flowchart of the method of the present invention for automatically identifying Uyghur in web pages;

Fig. 2 is a flowchart of an embodiment of the method of the present invention for automatically identifying Uyghur in web pages;

Fig. 3 shows the results obtained with the method of the present invention for automatically identifying Uyghur in web pages;

Fig. 4 is a structural diagram of the system of the present invention for automatically identifying Uyghur in web pages;

Fig. 5 is a structural diagram of the system for automatically identifying Uyghur in web pages in a preferred embodiment of the present invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings.

The flow of the method of the present invention for automatically identifying Uyghur in web pages is shown in Fig. 1.

Step S100: determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, take the frequency of occurrence as a weight value, and generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding; the languages used in the training web pages comprise Uyghur and scripts similar to Uyghur.

Step S200: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products, the sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language with the highest score.
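The scoring rule of step S200 can be written compactly as follows. The symbols are introduced here for illustration only and do not appear in the original text: T_L is the set of feature IDs recorded for language L, w_L(f) is the weight value of feature ID f for language L, and c(f) is the number of occurrences, in the web page to be identified, of the n-tuple whose identification ID equals f.

```latex
\mathrm{score}(L) = \sum_{f \in T_L} w_L(f)\, c(f),
\qquad
\hat{L} = \arg\max_{L}\ \mathrm{score}(L)
```

Here \hat{L} is the language reported for the page.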

It is as shown in Figure 2 that the present invention discerns the flow process of the embodiment of the method for Uighur in the webpage automatically.

Step S301: preprocess the training web pages and the web page to be identified by removing the web page tags, obtaining the text of the characters displayed by the page, and restoring escape sequences to the characters they represent; the languages used in the training web pages comprise Uyghur and scripts similar to Uyghur.

Preprocessing converts a web page into plain text. Existing HTML parsers analyze the page as a DOM tree, and the final text is produced at the leaf nodes; escape sequences are not parsed but copied verbatim. Many Uyghur characters, however, are written as decimal or hexadecimal character entities; in a web page the same letter may appear as "&#1574;", as "&#X0626;", or as "ئ".

Therefore, during preprocessing, when the web page tags are removed, escape sequences in both decimal and hexadecimal form are restored.

For example:

&#DDDDD; converts the decimal number DDDDD into an unsigned short int;

&#XHHHH; converts the hexadecimal number HHHH into an unsigned short int;

Other escape sequences, such as &nbsp;, are treated as spaces and also serve to separate words.

Preprocessing yields the text of the characters that the web page actually displays, ready for the subsequent n-gram language identification.
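A minimal sketch of this preprocessing step is given below, assuming Python and its standard `html` and `re` modules. The function name `page_to_text` and the regular-expression tag stripper are illustrative choices, not part of the described method; a real page would also need its script and style contents removed, which this sketch does not attempt.

```python
import html
import re

# Naive tag matcher: anything between '<' and '>' (an assumption for brevity).
TAG_RE = re.compile(r"<[^>]+>")

def page_to_text(page: str) -> str:
    """Strip tags, restore character entities, and treat &nbsp; as a space."""
    no_tags = TAG_RE.sub(" ", page)
    # html.unescape restores decimal (&#1574;) and hexadecimal (&#x0626;) entities
    # as well as named ones such as &nbsp; (which becomes U+00A0).
    text = html.unescape(no_tags)
    text = text.replace("\u00a0", " ")
    # Collapse runs of whitespace so that words are separated by single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "<p>&#1574;&#X0626;&nbsp;abc</p>"
print(page_to_text(sample))
```

The decimal and hexadecimal entities in the sample both decode to the same letter (U+0626), which is the consistency the preprocessing is meant to achieve.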

For web pages that use scripts similar to Uyghur, the criterion for similarity is whether a reader who does not know Uyghur can tell the script apart from Uyghur; for example, scripts similar to Uyghur include Arabic, Kazakh and Kyrgyz.

In a preferred embodiment, the preprocessing of the training web pages and the web page to be identified further comprises the following operations.

For the problem of one letter having two encodings: for example, the ligature encoded as 0xfefb actually represents two characters, and it is handled as follows. A code that represents two characters in the text is split into the two corresponding characters; for example, the ligature 0xfefb is split into its two component characters.

For the problem of hamza (the symbol representing the glottal stop) being added inconsistently before word-initial vowels, the handling is as follows: determine whether a hamza is present before the word-initial vowel and, if not, add one before the word-initial vowel; determine whether the word-initial vowel is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters.

For the problem of one letter having n encodings, where n is greater than 2: this arises mainly because Uyghur is written with the Arabic alphabet but has some linguistic phenomena of its own, and the Unicode standard does not give Uyghur a separate encoding area; to display special Uyghur letters, different website designers have therefore chosen different codes to represent the same Uyghur letter. Letters in the extended Arabic letter area are converted into the letters of corresponding shape in the basic Arabic letter area, which keeps the encoding consistent.

Through preprocessing, the analysis and judgment of web pages in Uyghur and in scripts similar to Uyghur are based on a consistent standard, which avoids confusion and improves the accuracy of the judgment.
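The ligature splitting and the folding of extended-area letters can be sketched as follows. This is not the mapping table of the described method, which the text does not give; it swaps in Unicode NFKC compatibility normalization from Python's `unicodedata` module, applied only to the Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF), on the assumption that these blocks correspond to the "extended area". In Unicode, U+FEFB is the isolated lam-alef ligature, and NFKC decomposes it into U+0644 followed by U+0627. The function name is hypothetical, and hamza insertion and compound-vowel splitting are not covered.

```python
import unicodedata

def fold_presentation_forms(text: str) -> str:
    """Map Arabic presentation-form code points to basic-block letters.

    Assumption: NFKC compatibility decomposition stands in for the
    patent's (unspecified) extended-area to basic-area mapping.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        in_forms_a = 0xFB50 <= cp <= 0xFDFF   # Arabic Presentation Forms-A
        in_forms_b = 0xFE70 <= cp <= 0xFEFF   # Arabic Presentation Forms-B
        if in_forms_a or in_forms_b:
            # May expand one code point into several (e.g. a ligature).
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

# The ligature U+FEFB becomes the two letters U+0644 U+0627.
print([hex(ord(c)) for c in fold_presentation_forms("\ufefb")])
```

Restricting the normalization to the two presentation-form blocks leaves all other characters of the page untouched, so Latin text or digits are not altered as a side effect.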

Step S302: determine the value of n for the n-tuples used as identification features.

In a simple embodiment, the value of n is supplied through a configuration input.

A preferred embodiment for determining n is as follows.

For each language used in the training web pages, count the probability of occurrence of the i-tuples in the language, where i ranges from 1 to m and m is a preset value. For each value of i, count the probability of occurrence of each i-tuple in the training web pages of each language.

According to the probabilities of occurrence, preferentially select the i-tuples with the higher probability of occurrence as identification features. One selection method is, for each i-tuple, to add up its probabilities of occurrence in the individual languages and to select the i-tuples with the largest sum as identification features; the probabilities of occurrence may also be multiplied by a configured weight for each language before being added. Alternatively, for each language, the i-tuples are sorted by probability of occurrence and a score is assigned to each position in the ranking; the scores of each i-tuple in the rankings of the individual languages are added, and the identification features are selected by score.

For example, from the text of all preprocessed training web pages, count for each language the probabilities of occurrence of bigrams (2-tuples), trigrams (3-tuples), 4-grams (4-tuples) and 5-grams (5-tuples). According to the probabilities of occurrence of the bigrams, trigrams, 4-grams and 5-grams in the training web pages of each language, preferentially select the i-tuples with the higher probability of occurrence as identification features.

In the specific embodiment, n is 2.

Step S303: in the training stage, generate a feature weight table for each language from the frequencies of occurrence in the training web pages.

For each language used in the training web pages, count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, and take the frequency of occurrence as a weight value. For each language, rank the n-tuples by weight value in descending order and select the first K n-tuples.

In the embodiment, for each language, the number of occurrences of each bigram is counted from the text of the training web pages that use the language; for each bigram, its number of occurrences is divided by the total number of occurrences of all bigrams in the training web pages of the language, and the quotient is the frequency of occurrence of the bigram. For each language, the K bigrams with the highest frequency of occurrence are taken to generate the feature weight table of the language. The value of K must be such that the sum of the frequencies of occurrence of the K bigrams exceeds a preset threshold. For example, with a preset threshold of 95%, K is 1000 for Uyghur and 400 for Arabic.
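The training computation described above can be sketched as follows, assuming Python. The function names are hypothetical, bigrams are counted within whitespace-separated words (an assumption; the text does not say how word boundaries are treated), and the table here keeps the bigram itself rather than its feature ID to keep the sketch short.

```python
from collections import Counter

def bigram_frequencies(text: str) -> dict:
    """Relative frequency of each bigram within whitespace-separated words."""
    counts = Counter()
    for word in text.split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: c / total for bigram, c in counts.items()}

def top_k_by_coverage(freqs: dict, threshold: float = 0.95) -> list:
    """Keep the most frequent bigrams until their summed frequency exceeds threshold."""
    table = []
    covered = 0.0
    for bigram, weight in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        table.append((bigram, weight))
        covered += weight
        if covered > threshold:
            break
    return table

freqs = bigram_frequencies("abab ab")
print(freqs)                          # frequencies of "ab" and "ba"
print(top_k_by_coverage(freqs, 0.7))  # smallest prefix whose sum exceeds 0.7
```

With this stopping rule K is not fixed in advance; it falls out of the threshold, which is consistent with the text reporting a different K for each language at the same 95% threshold.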

In theory, the number of bigrams in this application has an upper limit of 65536. The number of Uyghur bigrams obtained in the actual experiment is 1130; the most frequent bigram occurred 5106348 times and the least frequent occurred once. The 100-odd least frequent bigrams each occurred between 1 and 30 times, which indicates that they are rare combinations of letters in this script, contribute little to expressing the language, and are invalid bigrams; their weights are extremely small, close to machine zero, and cannot be represented as floating-point numbers. Keeping these bigrams would not only help little in distinguishing languages but would also incur greater storage overhead and computation time. The discarded bigrams account for about 1300 occurrences in total, less than 1% of all bigram occurrences, so the effect of discarding them is very small.

Each row of the feature weight table contains the frequency of occurrence of an n-tuple, i.e. its weight value, and the feature ID corresponding to the n-tuple. The feature ID is generated from the significant bits of the n-tuple in the preset standard encoding. For example, if the preset encoding standard is Unicode and the two letters of a bigram are encoded as 0x0644 and 0x0649, their low-order hexadecimal bytes, 0x44 and 0x49, are combined into 0x4449, i.e. decimal 17481, which serves as the ID of the bigram.
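The ID computation in this example can be reproduced with a few lines of Python. The function name `feature_id` is hypothetical, and the sketch assumes, as the example does, that the "significant bits" are the low-order byte of each letter's Unicode code point.

```python
def feature_id(bigram: str) -> int:
    """Combine the low-order bytes of the two code points into a 16-bit ID."""
    first, second = bigram  # raises ValueError unless exactly two characters
    return ((ord(first) & 0xFF) << 8) | (ord(second) & 0xFF)

# The letters 0x0644 and 0x0649 give 0x4449, i.e. 17481.
print(feature_id("\u0644\u0649"), hex(feature_id("\u0644\u0649")))
```

Two letters with equal low-order bytes but different high-order bytes would map to the same ID under this scheme; the sketch does not try to detect or resolve such collisions.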

Taking the Uyghur feature weight table as an example, the table has 1000 rows and records the frequencies of occurrence of 1000 Uyghur bigrams, as shown in Table 1.

Feature ID	Weight value
24358	0.027939
18783	0.027746
10079	0.016921
24362	0.014487
17481	0.013896
24360	0.013854
12639	0.013757
...	...

Table 1

The first column of each row is the feature ID, represented as an unsigned number; during identification it also serves as the index into the array that counts the occurrences of bigrams in the web page to be identified. For example, "17481" in the fifth row denotes the identification-feature bigram whose two letters are encoded as 0x0644 and 0x0649; the two letters are written separated by a space for clarity, whereas in real text they appear joined, and the corresponding feature ID is 17481. The second column is the weight value of the identification feature, i.e. the proportion of this feature among all bigrams of the Uyghur training web pages during training; for example, "0.013896" in the fifth row means that the weight value of this bigram is 0.013896.

Step S304: count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding.

The number of occurrences of each n-tuple is stored in a storage cell whose index is the corresponding identification ID; the storage cells of all n-tuples together form the n-tuple array.

All bigrams in the web page to be identified are counted in a single pass, with time complexity O(n). A total of 256 * 256 = 65536 storage cells are needed, held in an integer array. For each bigram, the identification ID is generated from the significant bits of the bigram in the preset standard encoding; this identification ID is the index of the bigram's storage cell, and the value stored in each cell is the number of occurrences of that bigram in the web page to be identified. This array is called the bigram array.
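A sketch of the bigram array, assuming Python's standard `array` module, is shown below. The names `bigram_id` and `count_bigrams` are hypothetical, the ID is again assumed to be built from the low-order byte of each code point, and bigrams are counted within whitespace-separated words, which is an assumption of this sketch.

```python
from array import array

def bigram_id(a: str, b: str) -> int:
    """16-bit ID from the low-order bytes of two code points (assumed scheme)."""
    return ((ord(a) & 0xFF) << 8) | (ord(b) & 0xFF)

def count_bigrams(text: str) -> array:
    """Single pass over the text; counts[id] holds the occurrences of that bigram."""
    counts = array("L", [0]) * 65536  # 256 * 256 unsigned integer cells
    for word in text.split():
        for a, b in zip(word, word[1:]):
            counts[bigram_id(a, b)] += 1
    return counts

counts = count_bigrams("\u0644\u0649 \u0644\u0649\u0644")
print(counts[17481], counts[0x4944], sum(counts))
```

Because the ID is used directly as the array index, looking up the count for a feature ID later is a constant-time operation and needs no hashing or searching.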

Step S305: for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products, the sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language with the highest score.

For each language, traverse the feature weight table of the language; for each row of the feature weight table, read the feature ID of the row, look up in the n-tuple array the storage cell whose index is that feature ID, and multiply the value stored in that cell by the weight value of the row. If the product is zero, the feature does not occur in the web page to be identified.

Specifically, for each row of the Arabic bigram feature weight table, the weight value of the row is the multiplicand; the storage cell indexed by the feature ID of the row is looked up in the bigram array, and the number of occurrences stored in that cell is taken as the multiplier; the multiplier and the multiplicand are multiplied.

The Arabic bigram feature weight table has 400 rows; the sum of these 400 products is the Arabic score of the web page to be identified, which expresses the probability that the page is Arabic.

Likewise, the Uyghur score is computed as the sum of 1000 product terms; if the Uyghur score is the largest, the web page to be identified is Uyghur.
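The table traversal and score comparison can be sketched as follows, assuming Python. The function names and the table layout (a list of `(feature_id, weight)` rows per language) are illustrative, the two Uyghur rows reuse values from Table 1, and the Arabic weight is a made-up placeholder, since the text does not list the Arabic table.

```python
def score(counts, weight_table) -> float:
    """Sum of weight * occurrence count over the rows of one language's table."""
    return sum(weight * counts[fid] for fid, weight in weight_table)

def identify(counts, tables):
    """Return the best-scoring language and all scores."""
    scores = {lang: score(counts, table) for lang, table in tables.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Bigram array of the page to be identified (65536 cells, mostly zero).
counts = [0] * 65536
counts[17481] = 3
counts[24358] = 1

tables = {
    "uyghur": [(24358, 0.027939), (17481, 0.013896)],  # rows taken from Table 1
    "arabic": [(17481, 0.002)],                        # hypothetical weight
}
best, scores = identify(counts, tables)
print(best, scores)
```

Each language costs one multiplication per table row, so the work per page is bounded by the total number of rows across all tables rather than by the 65536 possible bigram IDs.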

The beneficial effects are that the feature weight table is small (at most 1000 rows); the feature ID is used directly as the index into the bigram array of the web page to be identified, so lookup is fast; the score, which expresses the probability that the web page to be identified belongs to a given language, requires few multiplications; and execution efficiency is high.

The accuracy of Uyghur identification is shown below through two tests; the items of test 1 are shown in Table 2.

Table 2

As shown in Fig. 3, the pages whose final score is greater than 0 are Uyghur. Among them, uy_12.htm is a page of a Uyghur software company (Weiruan), and the other Uyghur page comes from the website of a senior middle school in Zepu County, Xinjiang; both are identified accurately.

The items of test 2 are shown in Table 3.

Table 3

The structure of a system for automatically identifying Uyghur in web pages is shown in Fig. 4.

Training module 200 is configured to determine the value of n for the n-tuples used as identification features; for each language used in the training web pages, count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, take the frequency of occurrence as a weight value, and generate the feature ID corresponding to the weight value from the significant bits of the n-tuple in a preset standard encoding; the languages used in the training web pages comprise Uyghur and scripts similar to Uyghur.

Identification module 300 is configured to count the number of occurrences of each n-tuple in the web page to be identified, and generate the identification ID corresponding to the number of occurrences from the significant bits of the n-tuple in the preset standard encoding; for each language used in the training web pages, multiply the weight value of each feature ID by the number of occurrences of the identification ID equal to that feature ID and sum the products, the sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language with the highest score.

In a preferred scheme, the system further comprises preprocessing module 100, which runs before training module 200 and identification module 300 are started, as shown in Fig. 5.

Preprocessing module 100 is configured to preprocess the training web pages and the web page to be identified by removing the web page tags, obtaining the text of the characters displayed by the page, and restoring escape sequences to the characters they represent.

Further, preprocessing module 100 is also configured to convert the codes of ligature letters that represent two characters in the text into the codes of the two corresponding characters obtained by splitting them; determine whether a hamza is present before a word-initial vowel and, if not, add one before the word-initial vowel; determine whether the word-initial vowel is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters; and convert letters in the extended Arabic letter area into the letters of corresponding shape in the basic Arabic letter area.

In a preferable scheme; Be further used for every kind of language using in the training webpage during value of training module 200 n in confirming as the n tuple of recognition feature; Add up the probability of occurrence of i tuple in the said language, the value of i is 1 to m, and m is a preset value; Preferentially select the higher i tuple of probability of occurrence as recognition feature according to said probability of occurrence.

In a preferred scheme, training module 200 is also configured, for each language, to rank the weight values from largest to smallest, select the top K weight values, and record the selected weight values and their corresponding feature IDs in the feature weight table of that language. The sum of the top K weight values of the language is greater than a preset threshold, and K is the value selected for that language.
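One way to build such a feature weight table is sketched below. The text only requires that the top K weights sum to more than the threshold; choosing the smallest such K is an assumption of this sketch, and the name `build_feature_weight_table` is hypothetical.

```python
def build_feature_weight_table(weights, threshold):
    """Keep the highest-weighted features until their sum exceeds threshold.

    weights maps feature ID -> weight value (e.g. relative frequency).
    Returns a list of (feature_id, weight) rows, largest weight first.
    If the threshold is never exceeded, every feature is kept.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    table = []
    total = 0.0
    for feature_id, weight in ranked:
        table.append((feature_id, weight))
        total += weight
        if total > threshold:
            break  # smallest K whose cumulative weight exceeds the threshold
    return table
```

Because K depends on how concentrated each language's weight distribution is, different languages generally end up with tables of different lengths.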

When multiplying, for each language used by the training web pages, the weight value of a feature ID by the number of occurrences of the identification ID identical to that feature ID, identification module 300 is further configured to multiply the weight value of each feature ID in the feature weight table of that language by the number of occurrences of the identification ID identical to that feature ID.

Further, after counting the number of occurrences of each n-tuple in the web page to be identified and taking the significant bits of the n-tuple in the preset standard encoding to generate the corresponding identification ID, identification module 300 is also configured to store the number of occurrences of the n-tuple in a storage unit whose subscript is the identification ID; the storage units together form an n-tuple array.

When multiplying, for each language used by the training web pages, the weight value of a feature ID by the number of occurrences of the identification ID identical to that feature ID, identification module 300 is further configured, for each language, to traverse the feature weight table of that language and, for each row of the table, to read the feature ID of the row, look up in the n-tuple array the storage unit whose subscript is that feature ID, read the value stored in that storage unit, and multiply that value by the weight value of the row.
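The array-based lookup described above might look like the following sketch. It assumes, purely for illustration, that the "significant bits" are the low 8 bits of each character's Unicode code point (letters in the basic Arabic block U+0600..U+06FF differ only in those bits) and that n = 2; the function names are hypothetical. Characters outside that block that share the same low bits would collide in this simplified scheme.

```python
def feature_id(ngram, bits=8):
    """Pack the low `bits` bits of each character's code point into one int."""
    mask = (1 << bits) - 1
    value = 0
    for ch in ngram:
        value = (value << bits) | (ord(ch) & mask)
    return value


def count_array(text, n=2, bits=8):
    """Build the n-tuple array: counts[ID] is the occurrence count of that ID."""
    counts = [0] * (1 << (bits * n))
    for i in range(len(text) - n + 1):
        counts[feature_id(text[i:i + n], bits)] += 1
    return counts


def score(counts, weight_table):
    """Traverse the (feature_id, weight) rows and sum weight * count."""
    return sum(weight * counts[fid] for fid, weight in weight_table)
```

Indexing the array directly by ID makes each row lookup constant time, so scoring one language costs time proportional to the length of its feature weight table rather than to the length of the page.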

Those skilled in the art may make various modifications to the above without departing from the spirit and scope of the invention as defined by the claims. The scope of the invention is therefore not limited to the above description but is determined by the scope of the claims.

Claims

1. A method for automatically identifying Uyghur in web pages, characterized in that it comprises:

Step 1: determining the value of n in the n-tuples used as identification features; for each language used in the training web pages, counting the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, taking the frequency of occurrence as a weight value, and taking the effective bits of the n-tuple in a preset standard encoding to generate the feature ID corresponding to the weight value; the languages used in the training web pages include Uyghur and scripts similar to Uyghur, the scripts similar to Uyghur including Arabic, Kazakh, and Kyrgyz;

Step 2: counting the number of occurrences of each n-tuple in the web page to be identified, and taking the effective bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences; for each language used in the training web pages, multiplying the weight value of each feature ID by the number of occurrences of the identification ID identical to the feature ID and summing the products, the resulting sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language corresponding to the highest score;

wherein, before step 1, the method further comprises:

Step 21: preprocessing the training web pages and the web page to be identified, removing the web page tags, obtaining the body text of the characters displayed by the web page, and restoring escaped characters;

wherein step 21 further comprises:

Step 31: converting the code of each ligature letter that represents two characters in the text into the codes of the two corresponding characters obtained by splitting it;

Step 32: judging whether there is a hamza before the initial vowel and, if not, adding a hamza before the initial vowel;

Step 33: judging whether the initial vowel is a compound vowel letter and, if so, splitting the compound vowel letter into the two corresponding characters;

Step 34: converting letters in the Arabic extended area into letters of corresponding shape in the Arabic basic area.

2. The method for automatically identifying Uyghur in web pages as claimed in claim 1, characterized in that

determining the value of n in the n-tuples used as identification features in step 1 further comprises:

Step 41: for each language used in the training web pages, counting the occurrence probability of the i-tuples in the language, where i ranges from 1 to m and m is a preset value;

Step 42: according to the occurrence probability, preferentially selecting i-tuples with higher occurrence probability as identification features.

3. The method for automatically identifying Uyghur in web pages as claimed in claim 1, characterized in that

step 1 further comprises:

Step 51: for each language, ranking the weight values from largest to smallest, selecting the top K weight values, and recording the selected weight values and the feature IDs corresponding to them in the feature weight table of the language; the sum of the top K weight values of the language is greater than a preset threshold, and K is the value selected for the language;

and in step 2, multiplying, for each language used in the training web pages, the weight value of the feature ID by the number of occurrences of the identification ID identical to the feature ID further comprises:

Step 52: for each language used in the training web pages, multiplying the weight value of each feature ID in the feature weight table of the language by the number of occurrences of the identification ID identical to the feature ID.

4. The method for automatically identifying Uyghur in web pages as claimed in claim 3, characterized in that

in step 2, counting the number of occurrences of each n-tuple in the web page to be identified and taking the effective bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences further comprises:

Step 61: storing the number of occurrences of the n-tuple in a storage unit, with the identification ID as the subscript of the storage unit, the storage units together forming an n-tuple array;

Step 62: for each language, traversing the feature weight table of the language and, for each row of the feature weight table, reading the feature ID of the row, looking up in the n-tuple array the storage unit whose subscript is the feature ID, reading the value stored in the found storage unit, and multiplying that value by the weight value of the row.

5. A system for automatically identifying Uyghur in web pages, characterized in that it comprises:

a training module, configured to determine the value of n in the n-tuples used as identification features and, for each language used in the training web pages, to count the frequency of occurrence of each n-tuple of the language in the training web pages that use the language, take the frequency of occurrence as a weight value, and take the effective bits of the n-tuple in a preset standard encoding to generate the feature ID corresponding to the weight value; the languages used in the training web pages include Uyghur and scripts similar to Uyghur, the scripts similar to Uyghur including Arabic, Kazakh, and Kyrgyz;

an identification module, configured to count the number of occurrences of each n-tuple in the web page to be identified and take the effective bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences; and, for each language used in the training web pages, to multiply the weight value of each feature ID by the number of occurrences of the identification ID identical to the feature ID and sum the products, the resulting sum being the score of the web page to be identified for that language; the language used by the web page to be identified is the language corresponding to the highest score; and

a preprocessing module, which runs before the training module and the identification module start, and which is configured to preprocess the training web pages and the web page to be identified, remove the web page tags, obtain the body text of the characters displayed by the web page, and restore escaped characters;

wherein the preprocessing module is also configured to: convert the code of each ligature letter that represents two characters in the text into the codes of the two corresponding characters obtained by splitting it; judge whether there is a hamza before the initial vowel and, if not, add a hamza before the initial vowel; judge whether the initial vowel is a compound vowel letter and, if so, split the compound vowel letter into the two corresponding characters; and convert letters in the Arabic extended area into letters of corresponding shape in the Arabic basic area.

6. The system for automatically identifying Uyghur in web pages as claimed in claim 5, characterized in that

when determining the value of n in the n-tuples used as identification features, the training module is further configured to count, for each language used in the training web pages, the occurrence probability of the i-tuples in the language, where i ranges from 1 to m and m is a preset value, and, according to the occurrence probability, to preferentially select i-tuples with higher occurrence probability as identification features.

7. The system for automatically identifying Uyghur in web pages as claimed in claim 5, characterized in that

the training module is also configured, for each language, to rank the weight values from largest to smallest, select the top K weight values, and record the selected weight values and the feature IDs corresponding to them in the feature weight table of the language; the sum of the top K weight values of the language is greater than a preset threshold, and K is the value selected for the language;

when multiplying, for each language used in the training web pages, the weight value of the feature ID by the number of occurrences of the identification ID identical to the feature ID, the identification module is further configured to multiply, for each language used in the training web pages, the weight value of each feature ID in the feature weight table of the language by the number of occurrences of the identification ID identical to the feature ID.

8. The system for automatically identifying Uyghur in web pages as claimed in claim 7, characterized in that

after counting the number of occurrences of each n-tuple in the web page to be identified and taking the effective bits of the n-tuple in the preset standard encoding to generate the identification ID corresponding to the number of occurrences, the identification module is also configured to

store the number of occurrences of the n-tuple in a storage unit, with the identification ID as the subscript of the storage unit, the storage units together forming an n-tuple array;

when multiplying, for each language used in the training web pages, the weight value of the feature ID by the number of occurrences of the identification ID identical to the feature ID, the identification module is further configured, for each language, to traverse the feature weight table of the language and, for each row of the feature weight table, to read the feature ID of the row, look up in the n-tuple array the storage unit whose subscript is the feature ID, read the value stored in the found storage unit, and multiply that value by the weight value of the row.