CN105468584A

CN105468584A - Filtering method and system for bad literal information in text

Info

Publication number: CN105468584A
Application number: CN201511027950.4A
Authority: CN
Inventors: 高玉环; 喻西香; 朱山; 朱光喜
Original assignee: Wuhan Hongruida Information Technology Co Ltd
Current assignee: Wuhan Hongruida Information Technology Co Ltd
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2016-04-06

Abstract

The invention relates to the technical field of text processing, in particular to a filtering method and system for bad literal information in text. The filtering method includes the following steps: 1, the text to be filtered is extracted; 2, the value of the word length Maxlen in the maximum matching algorithm is determined dynamically from the lengths of the entries in a dictionary, and the text to be filtered is segmented into words according to Maxlen; 3, each word obtained by segmentation is checked in a loop to see whether it is a sensitive word, and if it is, the sensitive word is replaced by a non-sensitive word, after which the text with the sensitive words replaced is output. This solves the problems that arise when the initial value of Maxlen stays fixed during segmentation: long words are segmented incorrectly, running time is long, and efficiency is low. The word strings obtained by segmentation are then checked for sensitivity, and the text is output according to the result. Because the improved segmentation is used, overall filtering speed and filtering accuracy are improved.

Description

Filtering method and filtering system for bad literal information in text

Technical field

The present invention relates to the technical field of text processing, and in particular to a filtering method and a filtering system for bad literal information in text.

Background

In the Internet era, text chat is ubiquitous. Some lawbreakers use the Internet to spread negative information, comment on current politics, fabricate and spread rumors, or attack other network users, all of which has a negative effect. To create a civil and harmonious environment for online text chat, filtering sensitive words is essential.

Existing Internet text filtering systems and methods do provide some word screening and filtering, but they fall short in overall filtering accuracy, filtering efficiency and the ability to handle high concurrency. Conventional methods lack intelligence in accurate word segmentation and cannot upgrade themselves automatically by learning user characteristics.

Summary of the invention

The object of the present invention is to provide a filtering method and a filtering system for bad literal information in text, so as to solve the low accuracy and low speed of existing Internet text filtering.

The invention provides a filtering method for bad literal information in text, comprising:

Step 1: extracting the text to be filtered;

Step 2: dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in a dictionary, and segmenting the text to be filtered according to Maxlen;

Step 3: judging in a loop whether each word obtained by segmentation is a sensitive word; if it is, replacing the sensitive word with a non-sensitive word, and then outputting the text with the sensitive words replaced.

In some embodiments, preferably, step 2 comprises:

if the text to be filtered contains English, segmenting the English into English character strings;

if the text to be filtered contains digits, segmenting the digits into digit strings;

extracting from the text to be filtered the character string to be segmented, which contains neither English nor digits;

dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in the dictionary, and segmenting the character string to be segmented according to Maxlen.

In some embodiments, preferably, dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in the dictionary, and segmenting the character string to be segmented according to Maxlen, comprises:

01) assigning an initial value to the character string to be segmented, S1;

02) judging whether S1 is empty; if it is empty, outputting the segmented word string and the unregistered word string;

03) if S1 is not empty, judging whether S1 is a single character; if it is, outputting the single character directly;

04) if S1 is not a single character, taking the leftmost character W of S1 and looking up, through the hash table in the dictionary, the word length WLen corresponding to the first character W;

05) judging whether the length of the character string to be segmented is less than WLen; if it is, proceeding to 07);

06) if it is not, taking the string Word of length WLen from the character string to be segmented and matching it one by one against the dictionary entries of length WLen; when the string matches an entry, outputting the string;

07) if there is no match, judging whether WLen is 2; if not, advancing the word-length pointer PLen (PLen++) and returning to 04);

08) if the result of 07) is yes, outputting the string as an unregistered word string.

In some embodiments, preferably, step 3 comprises:

loading a sensitive-word lexicon;

matching every segmented word string in turn against the sensitive-word lexicon and, if a match succeeds, masking the matched word string;

replacing the sensitive words with non-sensitive words, and then outputting the text with the sensitive words replaced.

In some embodiments, preferably, if no match is found, the text is output directly.

In some embodiments, preferably, the method further comprises, between step 2 and step 3: adding the output unregistered word strings to the dictionary.

The invention also provides a filtering system for bad literal information in text, comprising:

an extraction module for extracting the text to be filtered;

a word segmentation module for dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in a dictionary, and segmenting the text to be filtered according to Maxlen;

a filtering module for judging in a loop whether each word obtained by segmentation is a sensitive word and, if it is, replacing the sensitive word with a non-sensitive word and then outputting the text with the sensitive words replaced.

In some embodiments, preferably, the word segmentation module comprises:

an English string segmentation unit for segmenting the English into English character strings if the text to be filtered contains English;

a digit segmentation unit for segmenting the digits into digit strings if the text to be filtered contains digits;

an extraction unit for extracting from the text to be filtered the character string to be segmented, which contains neither English nor digits;

a dynamic segmentation unit for dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in the dictionary, and segmenting the character string to be segmented according to Maxlen.

In some embodiments, preferably, the segmentation procedure executed by the dynamic segmentation unit is:

01) assigning an initial value to the character string to be segmented, S1;

07) if there is no match, judging whether WLen is 2; if not, advancing the word-length pointer PLen (PLen++) and returning to 04).

Compared with the prior art, the filtering method and filtering system for bad literal information in text provided by the embodiments of the present invention use the lengths of the entries in the dictionary to determine the value of the word length MaxLen dynamically, and take a string of the determined length from the left of the string to be segmented to match against the dictionary. This solves the problems caused by a constant initial value of MaxLen during segmentation: long words being segmented incorrectly, long running time and low efficiency. The segmented word strings are then checked for sensitivity, and the text is output according to the result. Because the improved segmentation is used, overall filtering speed and filtering accuracy are improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the filtering method for bad literal information in text in one embodiment of the invention.

Fig. 2 is a schematic diagram of the data structure of the dictionary in one embodiment of the invention.

Detailed description

The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings.

In view of the slow processing speed and low accuracy with which the Internet and various information communication platforms currently handle bad words in text, the invention provides a filtering method and a filtering system for bad literal information in text.

The filtering method for bad literal information in text comprises:

Step 1: extracting the text to be filtered;

Step 2: dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in a dictionary, and segmenting the text to be filtered according to Maxlen;

Step 3: judging in a loop whether each word obtained by segmentation is a sensitive word; if it is, replacing the sensitive word with a non-sensitive word, and then outputting the text with the sensitive words replaced.

The filtering system for bad literal information in text comprises:

an extraction module for extracting the text to be filtered;

a word segmentation module for dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in a dictionary, and segmenting the text to be filtered according to Maxlen;

a filtering module for judging in a loop whether each word obtained by segmentation is a sensitive word and, if it is, replacing the sensitive word with a non-sensitive word and then outputting the text with the sensitive words replaced.

The lengths of the entries in the dictionary are used to determine the value of the word length MaxLen dynamically, and a string of the determined length is taken from the left of the string to be segmented and matched against the dictionary. This solves the problems caused by a constant initial value of MaxLen during segmentation: long words being segmented incorrectly, long running time and low efficiency. The segmented word strings are then checked for sensitivity, and the text is output according to the result. Because the improved segmentation is used, overall filtering speed and filtering accuracy are improved.

The technique is described in detail below.

The method is mainly used to filter text on an information communication platform before the text is transmitted. Specifically:

A filtering method for bad literal information in text comprises:

Step 101: extracting the text to be filtered.

When users communicate, a user usually sends information to the platform server; the platform server extracts this information and treats it as the text to be filtered.

Filtering this information must be fast enough not to delay communication between users, so that the exchange of information stays smooth.

Step 102: dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in a dictionary, and segmenting the text to be filtered according to Maxlen.

This is the key step: segmenting the text effectively speeds up the subsequent sensitive-word filtering. To this end, the inventors arrived at the following principles that segmentation should follow:

1) The coarser the granularity, the better: for text segmentation that serves semantic analysis, the segmentation result should have coarse granularity; the more characters a word has, the more definite the meaning it can express.

2) The fewer non-dictionary words and single-character dictionary words in the segmentation result, the better. Here a "non-dictionary word" is a single character not included in the dictionary, and a "single-character dictionary word" is a single character that can be used on its own.

3) The fewer words in total, the better: for the same number of characters, fewer words mean fewer semantic units, so each semantic unit carries relatively more weight and accuracy is higher.

Existing segmentation techniques usually use the forward maximum matching (FMM) algorithm. Its main drawback is that the initial word length MaxLen is fixed: before each word is segmented, MaxLen is always given the same initial value. A constant initial maximum word length tends to cause two problems: (1) if it is too short, long words are segmented incorrectly; (2) if it is too long, efficiency is low.

Initialize the string to be segmented S1, the output word string S2 and the maximum word length MaxLen. Suppose the longest entry in the segmentation dictionary contains MaxLen Chinese characters. Each time, a string W of length MaxLen is taken from the head of S1 and matched against the entries in the dictionary in turn. If an entry matches W exactly, W is cut from S1 as a word; another string of length MaxLen is then taken from the head of S1, and matching against the dictionary repeats until the string to be segmented is empty. If no entry matches W, one character is removed from the tail of W and the remaining string of length MaxLen-1 is matched against the dictionary; if it matches, a word is segmented, otherwise another character is removed from the tail of W and matching repeats until it succeeds. The workflow is shown in Fig. 1.
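The classic procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `fmm_segment`, the use of a plain Python set as the dictionary, and the rule that a single leftover character is always emitted as a word are assumptions made here.

```python
def fmm_segment(text, dictionary, max_len):
    """Classic forward maximum matching with a fixed initial word length.

    `dictionary` is assumed to be a set of words; `max_len` is the fixed
    MaxLen that the classic algorithm resets before every word.
    """
    words = []
    i = 0
    while i < len(text):
        # Start from the longest candidate and drop one character from the
        # tail after each failed lookup.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character is always accepted, so the
            # loop is guaranteed to make progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


sample_dictionary = {"中华人民共和国", "中华", "人民", "共和国"}
long_enough = fmm_segment("中华人民共和国人民", sample_dictionary, 7)
too_short = fmm_segment("中华人民共和国人民", sample_dictionary, 2)
```

With `max_len` set to 7 the long entry is found in one lookup, while with `max_len` set to 2 the same text is cut into short pieces, which is drawback (1) of a fixed MaxLen. A large `max_len` on a dictionary of mostly short words instead causes many failed lookups per word, which is drawback (2).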

Dictionary-based segmentation is simple to design, easy to implement and fast largely because of a well-designed dictionary structure. The dictionary is the basis for segmenting each word, and the number of words that can be segmented depends on the size of the dictionary. The dictionary structure is designed together with the segmentation algorithm, so the dictionary is adjusted here to fit the algorithm above.

Definition 1: a set of entries that begin with the same character and have the same length (the same number of Chinese characters) is called a phrase group.

Definition 2: a dictionary consists of many phrase groups.

When the algorithm searches the dictionary, a string is always matched first against the longest words beginning with the same Chinese character and, if that fails, against the next-longest entries. Matching the string against every entry that shares its first character would waste a great deal of time. Likewise, when matching fails, one character is removed from the end of the string and the dictionary is searched again, yet entry lengths do not always decrease by exactly 1, so time may be wasted here as well. Entries with the same first character and the same length can therefore be stored as one group, so that the whole dictionary is stored in an ordered table. A string is then compared only with the entries in the group that has the same first character and the same length, which greatly reduces the number of comparisons. A dictionary storage structure suited to the algorithm is designed on this basis.

All entries are first placed in a text file. Entries are sorted first by their first character in dictionary order; entries that begin with the same character are then sorted from longest to shortest. For example, the entries run from "United Arab Emirates" through "Albania" to "auntie", with entry length decreasing. The dictionary is read into memory before segmentation; its in-memory data structure is shown in Fig. 2.
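The grouping by first character and by length can be sketched as follows. The function name `build_dictionary`, the nested Python containers and the sample entries are illustrative assumptions; the patent itself stores the groups in an ordered table reached through pointers.

```python
from collections import defaultdict


def build_dictionary(entries):
    """Group entries by first character, then by length (longest first).

    Returns a mapping: first character -> list of (length, set of entries),
    with lengths in descending order. This is a hypothetical in-memory
    stand-in for the phrase groups of Definitions 1 and 2.
    """
    by_first = defaultdict(lambda: defaultdict(set))
    for entry in entries:
        if len(entry) >= 2:  # single characters need no dictionary lookup
            by_first[entry[0]][len(entry)].add(entry)
    return {
        first: [(length, groups[length]) for length in sorted(groups, reverse=True)]
        for first, groups in by_first.items()
    }


# Illustrative entries only.
dictionary = build_dictionary(["阿姨", "阿尔巴尼亚", "阿拉伯联合酋长国", "中华", "中国"])
lengths_for_a = [length for length, _ in dictionary["阿"]]
```

Because each group holds only entries with the same first character and the same length, a candidate string is compared against one small set per length instead of against every entry that shares its first character.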

The dictionary consists of three parts:

(1) First-character hash table

A Chinese character is stored in the computer as an internal code. The internal code is converted into a region-position code, which is then converted into a decimal number; this number is the index of the character in the hash table. The index gives the entry address of the word-length table for the words that begin with this character. The entry address is calculated as:

$$\text{offset} = (\text{ch1} - \text{0xB0}) \times 94 + (\text{ch2} - \text{0xA1}) \tag{1}$$

where offset is the index of the character in the hash table, and ch1 and ch2 are the high byte and the low byte of the character's internal code, respectively.

(2) Word index table

The word-length field records the lengths of the entries that share a first character; for example, the entries whose first character is the character rendered here as "debate" have four word lengths: 6, 4, 3 and 2. The entry pointer gives the first address of the entries with a given word length.

(3) Dictionary body

Entries are stored in units of groups.
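Formula (1) for the first-character hash table can be checked with a short Python sketch. It assumes the internal code is GB2312, which is consistent with the constants 0xB0, 0xA1 and 94 in the formula but is not stated explicitly in the text; the function name `first_char_offset` is hypothetical.

```python
def first_char_offset(ch):
    """Index of a Chinese character in the first-character hash table.

    Assumes the internal code is the two-byte GB2312 encoding, where
    Chinese characters start at high byte 0xB0 and low byte 0xA1 and
    each row holds 94 cells.
    """
    data = ch.encode("gb2312")  # raises UnicodeEncodeError outside GB2312
    if len(data) != 2 or data[0] < 0xB0:
        raise ValueError("not a GB2312 Chinese character: %r" % ch)
    ch1, ch2 = data  # high byte, low byte
    return (ch1 - 0xB0) * 94 + (ch2 - 0xA1)


first = first_char_offset("啊")   # internal code 0xB0A1
second = first_char_offset("阿")  # internal code 0xB0A2
```

The first Chinese character in GB2312, "啊" (0xB0A1), maps to index 0 and its neighbor "阿" (0xB0A2) to index 1, so the hash is a direct array index with no collisions.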

Based on the above idea for improving the algorithm and the corresponding dictionary layout, the improved forward maximum matching algorithm is as follows.

The FMM algorithm searches the dictionary sequentially. The initial value of the maximum word length MaxLen is the length of the longest entry in the segmentation dictionary, whereas Table 1 shows that most words have two, three or four characters, so sequential search performs many comparisons and wastes time. Narrowing the search range, reducing search time and segmenting long words first are the problems the improved algorithm sets out to solve. It is proposed here to use the lengths of the entries in the dictionary to determine the value of the word length MaxLen dynamically, and to take a string of the determined length from the left of the string to be segmented and match it against the dictionary. This solves the problems caused by a constant initial value of MaxLen during segmentation: long words being segmented incorrectly, long running time and low efficiency. The improved FMM algorithm is described as follows:

Input: S1: the character string to be segmented; the dictionary.

Output: S2: the segmented word string, with words separated by "/"; S3: the unregistered word string, with words separated by "/".

Algorithm:

1) Initialization: assign S1 its initial value; S2 and S3 are initially empty;

2) Is S1 empty? If it is not empty, continue; otherwise go to 15);

3) Is S1 a single character, that is, is the length S1Len of S1 equal to 1? If it is not a single character, continue; otherwise go to 14);

4) Take the leftmost character W of S1, find the position of W in the hash table through Hash(W), and from that position obtain the pointer PLen to the lengths WLen of the words beginning with W;

5) Read the word length WLen through PLen;

6) Is S1Len less than WLen? If it is not, continue; otherwise go to 11);

7) From the word-length pointer PLen, obtain the pointer Qw to the entries;

8) Take the string Word of length WLen from S1 and match it one by one against the dictionary entries of length WLen, that is, against the entries pointed to by Qw; Qw moves forward during matching and stops when it reaches "^";

9) Does the string of length WLen match a dictionary entry of the same length? If it does, continue; otherwise go to 11);

10) S2 = S2 + Word + "/"; S1 = S1 - Word; go to 2); // a word is segmented

11) Is WLen equal to 2? If so, continue; otherwise go to 13);

12) S2 = S2 + W + "/"; S3 = S3 + Word; S1 = S1 - W; go to 2); // S3 is the unregistered word string

13) PLen++; go to 5);

14) S2 = S2 + S1 + "/"; // S1 is a single character and is output directly

15) Output S2 and S3.

Because a word is always cut from the left of the string to be segmented, if a short word is part of a longer word, the longer word is segmented, so that words are as long as possible. For example, "People's Republic of China" contains four words: "China", "people", "republic" and "People's Republic of China"; the algorithm described here segments it as one whole word. This embodies the "long words first" principle, keeps the number of segmented words as small as possible and reduces ambiguity as far as possible. For a word that cannot be segmented, its first character is output to S2, and the string taken from S1 according to the length of the dictionary entries beginning with that character is output to the unregistered word string S3.
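Steps 1) to 15) of the improved algorithm can be sketched in Python as below. This is an illustration under stated assumptions rather than the patent's implementation: the dictionary is modeled as a mapping from first character to a list of (length, set of entries) pairs in descending length order, which stands in for the pointers PLen and Qw; S2 and S3 are returned as lists instead of "/"-separated strings; and a first character with no dictionary entries is treated like a failed match at length 2.

```python
def improved_fmm(s1, dictionary):
    """Forward maximum matching with word lengths taken from the dictionary.

    `dictionary` (assumed layout): first char -> [(length, {entries}), ...],
    lengths descending, every entry at least 2 characters long.
    Returns (s2, s3): segmented words and unregistered word candidates.
    """
    s2, s3 = [], []
    while s1:                                   # step 2
        if len(s1) == 1:                        # steps 3 and 14
            s2.append(s1)
            break
        w = s1[0]                               # step 4
        matched = False
        for wlen, entries in dictionary.get(w, []):  # steps 5 and 13
            if len(s1) < wlen:                  # step 6: string too short
                continue
            word = s1[:wlen]                    # step 8
            if word in entries:                 # step 9
                s2.append(word)                 # step 10
                s1 = s1[wlen:]
                matched = True
                break
        if not matched:                         # steps 11 and 12
            s2.append(w)
            s3.append(s1[:2])  # the last candidate tried has length 2
            s1 = s1[1:]
    return s2, s3                               # step 15


# Illustrative dictionary only.
sample = {
    "中": [(7, {"中华人民共和国"}), (2, {"中华", "中国"})],
    "人": [(2, {"人民"})],
}
words, unregistered = improved_fmm("中华人民共和国人民万岁", sample)
```

Only the lengths that actually occur for the current first character are tried, so a character whose longest entry has two characters costs a single lookup, whatever the longest entry in the whole dictionary is.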

The procedure for segmenting a string with this improved algorithm is as follows.

Segmenting a document goes through the following four stages:

(1) Dictionary loading stage: the segmentation dictionary is loaded into memory. The first characters are stored in the hash table first; the word lengths and entries are then stored depth-first. Fig. 2 shows the data structure of the dictionary in memory.

(2) Text preprocessing stage: punctuation marks are natural separators between sentences. The document is cut into sentences at punctuation marks, and sentences are separated by "/". If the text contains digits, adjacent digits are grouped into one word. Punctuation marks are stored in a punctuation filtering dictionary, and full-width and half-width Arabic numerals are stored in a digit dictionary; before use, both text files are loaded into memory and stored as strongly typed lists, List<T>.

(3) Segmentation stage: if the document contains English character strings, each English character string is treated as one word. If the document contains digits, each digit string is treated as one word. After this, only the Chinese character strings in the document remain unsegmented, and the improved forward maximum matching algorithm is then used to segment them.

(4) Result processing stage: every word that fails to match the dictionary is treated as an unregistered word. After review, unregistered words are added to the dictionary manually. The segmented words are then compared with the sensitive-word dictionary, the words present in the sensitive-word list are masked, and the resulting text is output.
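Stages (2) and (3) before the Chinese segmentation itself can be sketched with regular expressions. The punctuation set, the function name `preprocess` and the run labels are assumptions for illustration; the patent loads its punctuation and digit lists from dictionary files instead.

```python
import re

# Illustrative punctuation set (assumption); the patent reads its own list.
PUNCTUATION = "，。！？；：、,.!?;:"
SENTENCE_RE = re.compile("[" + re.escape(PUNCTUATION) + "]+")

# English runs, digit runs (half-width and full-width), and everything else.
RUN_RE = re.compile(
    r"(?P<english>[A-Za-z]+)|(?P<digit>[0-9０-９]+)|(?P<other>[^A-Za-z0-9０-９]+)"
)


def preprocess(text):
    """Split text into sentences, then each sentence into labeled runs.

    Returns one list per sentence of (kind, run) pairs, where kind is
    "english", "digit" or "other". Only the "other" runs would be passed
    on to the improved forward maximum matching step.
    """
    sentences = [s for s in SENTENCE_RE.split(text) if s]
    return [
        [(m.lastgroup, m.group()) for m in RUN_RE.finditer(sentence)]
        for sentence in sentences
    ]


runs = preprocess("YY和6房间，你好。")
```

Each English run and each digit run becomes a single word without any dictionary lookup, so the dictionary-based matching only has to handle the remaining Chinese runs.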

In practice, the packaged segmentation algorithm is embedded in a program, and a custom dictionary and a user-defined sensitive-word dictionary are loaded. The text is cut into words by segmentation; each word is checked in turn against the sensitive-word dictionary and is masked if it is found there.

For example:

Text: Elegant 90 is a large online entertainment video website following YY and Six Rooms.

Segmentation result: Elegant/90/is/following/YY/Six/Rooms/after/a/large-scale/network/entertainment/video/website

In this result, "Elegant 90" and "Six Rooms" have been split apart, which is not what we want. To avoid this, the desired words can be added to the custom dictionary. The segmentation result after adding "Elegant 90" and "Six Rooms" is:

Elegant 90/is/following/YY/Six Rooms/after/a/large-scale/network/entertainment/video/website

If YY and Six Rooms are added to the sensitive-word list:

Text: YY, Six Rooms beauties are numerous

Output: **, *** beauties are numerous

This achieves the purpose of text filtering.
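The masking step in the example can be sketched as follows. The function name `mask_sensitive`, the use of "*" as the replacement character and the Chinese sample words are assumptions made for illustration; the text only states that matched words are masked or replaced by non-sensitive words.

```python
def mask_sensitive(words, sensitive, mask_char="*"):
    """Rebuild the text from segmented words, masking sensitive ones.

    `words` is the segmented word list and `sensitive` a set of sensitive
    words; each sensitive word is replaced by one mask character per
    character of the word.
    """
    return "".join(
        mask_char * len(word) if word in sensitive else word
        for word in words
    )


# Illustrative input: a segmented sentence and a two-word sensitive list.
segmented = ["YY", "、", "六间房", "美女", "众多"]
filtered = mask_sensitive(segmented, {"YY", "六间房"})
```

Because the check is made on whole segmented words, a sensitive word is masked only when the segmenter has kept it as one unit, which is why adding such words to the custom dictionary matters.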

Step 104: adding the output unregistered word strings to the dictionary.

This step may be performed at any point after segmentation. Vocabulary grows by the day and new words keep emerging, so the dictionary must be updated continually. For example, when a word that does not appear in the dictionary is encountered during segmentation, it can be placed in the dictionary, which keeps the dictionary up to date and makes the word available for later segmentation.
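Updating the dictionary with a reviewed unregistered word can be sketched as below. The dictionary layout (first character mapped to a list of (length, set of entries) pairs in descending length order) and the function name `add_entry` are assumptions carried over for illustration, not structures named in the text.

```python
def add_entry(dictionary, entry):
    """Insert a new word into a grouped dictionary, keeping lengths sorted.

    `dictionary` (assumed layout): first char -> [(length, {entries}), ...]
    with lengths in descending order. Entries shorter than two characters
    are ignored because single characters are output directly.
    """
    if len(entry) < 2:
        return
    groups = dictionary.setdefault(entry[0], [])
    for length, entries in groups:
        if length == len(entry):
            entries.add(entry)  # a group of this length already exists
            return
    groups.append((len(entry), {entry}))
    groups.sort(key=lambda group: group[0], reverse=True)


# Illustrative dictionary and new words only.
dictionary = {"中": [(2, {"中华"})]}
add_entry(dictionary, "中华人民共和国")
add_entry(dictionary, "万岁")
```

Keeping the length groups in descending order after each insertion preserves the "long words first" behavior of the segmentation step for the newly added words.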

To carry out the above filtering method, a filtering system for bad literal information in text is also provided. The system may be loaded into hardware and executed there, or built as a software system. It comprises:

an extraction module for extracting the text to be filtered;

wherein the word segmentation module comprises:

an English string segmentation unit for segmenting the English into English character strings if the text to be filtered contains English;

a digit segmentation unit for segmenting the digits into digit strings if the text to be filtered contains digits;

an extraction unit for extracting from the text to be filtered the character string to be segmented, which contains neither English nor digits;

a dynamic segmentation unit for dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in the dictionary, and segmenting the character string to be segmented according to Maxlen.

The segmentation procedure executed by the dynamic segmentation unit is:

01) assigning an initial value to the character string to be segmented, S1;

04) if S1 is not a single character, taking the leftmost character W of S1 and looking up, through the hash table in the dictionary, the word length WLen corresponding to the first character W;

05) judging whether the length of the character string to be segmented is less than WLen; if it is, proceeding to 07);

06) if it is not, taking the string Word of length WLen from the character string to be segmented and matching it one by one against the dictionary entries of length WLen; when the string matches an entry, outputting the string;

07) if there is no match, judging whether WLen is 2; if not, advancing the word-length pointer PLen (PLen++) and returning to 04);

08) if the result of 07) is yes, outputting the string as an unregistered word string.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A filtering method for bad literal information in text, characterized by comprising:

Step 1: extracting the text to be filtered;

2. The filtering method for bad literal information in text as claimed in claim 1, characterized in that step 2 comprises:

3. The filtering method for bad literal information in text as claimed in claim 2, characterized in that dynamically determining the value of the word length Maxlen in the maximum matching algorithm from the lengths of the entries in the dictionary, and segmenting the character string to be segmented according to Maxlen, comprises:

01) assigning an initial value to the character string to be segmented, S1;

07) if there is no match, judging whether WLen is 2; if not, advancing the word-length pointer PLen (PLen++) and returning to 04);

4. The filtering method for bad literal information in text as claimed in claim 3, characterized in that step 3 comprises:

loading a sensitive-word lexicon;

5. The filtering method for bad literal information in text as claimed in claim 4, characterized in that, if no match is found, the text is output directly.

6. The filtering method for bad literal information in text as claimed in claim 3, characterized by further comprising, between step 2 and step 3: adding the output unregistered word strings to the dictionary.

7. A filtering system for bad literal information in text, characterized by comprising:

an extraction module for extracting the text to be filtered;

8. The filtering system for bad literal information in text as claimed in claim 7, characterized in that the word segmentation module comprises:

9. The filtering system for bad literal information in text as claimed in claim 8, characterized in that the segmentation procedure executed by the dynamic segmentation unit is:

01) assigning an initial value to the character string to be segmented, S1;

07) if there is no match, judging whether WLen is 2; if not, advancing the word-length pointer PLen (PLen++) and returning to 04).