CN102377690B - Anti-spam gateway system and method - Google Patents
- Publication number: CN102377690B
- Application number: CN201110304470.3A
- Authority: CN (China)
- Prior art keywords: sample, module, classification, features
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an anti-spam gateway system and an anti-spam method. The system comprises a mail sample database for storing various mail samples, and a mail feature mining module that obtains mail samples from the mail sample database and compares each sample with all center points, where each center point represents one class of samples; if the similarity measure between the sample and a center point is less than a certain threshold, the sample is added directly to that center point. To calculate the similarity between a mail sample and a center point, both are parsed into several parts, the two are compared part by part, and the part similarities are combined with weights to give the global similarity between the sample and the center point. With this system and method, the sample database and the feature database adapt well to sudden outbreaks of new spam types, so the spam miss rate is low, real-time performance is good, little manual intervention is needed, and the system scales well.
Description
Technical field
The present invention relates to the field of email processing, and in particular to an anti-spam gateway system and method based on clustering of massive volumes of mail content.
Background art
Spam is generally defined as email with the following attributes: (1) promotional email, such as advertisements, electronic publications, and publicity material of various kinds, that the recipient has not requested or agreed to receive; (2) email that the recipient cannot refuse; (3) email that conceals information such as the sender's identity, address, or subject; (4) email that contains false information about its source, sender, or routing.
Ever since the first spam message appeared, spam has been a persistent problem for mail users, and it has become an important consideration for mail operators seeking to improve the user experience and attract users. The task of an anti-spam system is to block spam outside the mail system or the user's inbox. Mainstream anti-spam techniques are based mainly on mail content and on mail-sending behavior.
Existing content-based anti-spam techniques mainly include: the open-source system Dspam (available at http://www.nuclearelephant.com); the patent application of Tencent Technology (Shenzhen) Co., Ltd., application number 200810227762, entitled "Method and apparatus for intercepting junk mail"; the patent application of Zhejiang University, application number 200810059602, entitled "Chinese spam filtering method based on logistic regression"; and the patent application of Peking University, application number 200810115584, entitled "A junk mail detection method".
These techniques mainly comprise two flows: training and online use. Dspam is taken below as an example to introduce the key steps of each; the other techniques are broadly similar. The Dspam training flow comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. decode the mail; 3. segment the message body into words; 4. count the frequency with which each word occurs; 5. train a naive Bayes classification model using Bayes' formula. Once the Dspam model has been trained, the online flow is relatively simple and comprises only two steps: 1. segment the incoming mail into words; 2. classify the mail with the trained naive Bayes classification model.
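The training and online-use flows described above can be illustrated with a minimal naive Bayes sketch. It is not Dspam's actual implementation: the function names are hypothetical, the mail is assumed to be already decoded and segmented into tokens, and Laplace smoothing is an assumption added so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (tokens, label) pairs, label being 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for tokens, label in samples:
        counts[label].update(tokens)   # step 4: per-class word frequencies
        docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, docs, vocab

def classify(tokens, model):
    """Return the label with the higher log posterior (assumes both classes were seen)."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_score = None, float("-inf")
    for label in ("spam", "ham"):
        total_words = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)  # class prior
        for tok in tokens:
            # Laplace-smoothed likelihood (an assumption, not from the source)
            score += math.log((counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    (["invoice", "cheap"], "spam"),
    (["invoice", "offer"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["project", "meeting"], "ham"),
])
print(classify(["invoice"], model))
```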
Anti-spam strategies based on real-time sending behavior differ considerably from content-based strategies. Anti-spam systems based on real-time behavior generally have no training step. Typical strategies based on mail-sending behavior include Checksum (available at http://www.rhyolite.com/dcc/) and the patent application of Harbin Engineering University, application number 200810064806, entitled "A method for judging spam based on topological behavior". Checksum is taken below as an example to introduce the basic flow. Its basic assumption is that mail repeated many times is spam, and its flow is roughly as follows: 1. compute a fingerprint for each mail; 2. count the fingerprints of all mail in the online system; 3. judge mail whose fingerprint is highly repeated to be spam directly.
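The three-step flow above can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and a SHA-256 digest of whitespace- and case-normalized text stands in for the fuzzy checksums a real system would use.

```python
import hashlib
from collections import Counter

def fingerprint(body: str) -> str:
    # Stand-in fingerprint: exact hash of normalized text (an assumption).
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class FingerprintCounter:
    def __init__(self, threshold: int):
        self.threshold = threshold      # hypothetical repetition limit
        self.counts = Counter()

    def observe(self, body: str) -> bool:
        """Record one message; return True once its fingerprint count exceeds the threshold."""
        fp = fingerprint(body)
        self.counts[fp] += 1
        return self.counts[fp] > self.threshold
```

With a threshold of 2, the third copy of the same normalized body is the first to be flagged.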
Combining mail content with real-time sending behavior is the mainstream approach in current commercial anti-spam systems. An effective way to combine the two is to convert content and real-time sending-behavior characteristics into rules, accumulate a score for each rule that matches, and decide whether a mail is spam by comparing the total score with a threshold. Representative technologies include the open-source system SpamAssassin (available at http://spamassassin.apache.org/); the patent application of South China University of Technology, application number 200710029369, entitled "Anti-spam false-filtering method and system based on integrated decision-making"; the commercial Brightmail system of Symantec Corporation (available at http://www.symantec.com/business/products/family.jsp?familyid=brightmail); and the KBAS system of Hanqi Technology (available at http://www.hanqinet.com/project1.html). SpamAssassin is taken as a representative to introduce the main flow, which comprises training and online use. Training for rule-based anti-spam techniques mainly comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. add rules manually and build a rule base; 3. use the manually labeled samples to assign a score to each rule. Online use comprises two steps: 1. determine which rules each mail matches; 2. sum the scores of all matched rules and decide by threshold whether the mail is spam.
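The online-use steps of a rule-based scorer can be sketched as below. The rule names, patterns, scores, and threshold are all hypothetical and chosen only for illustration; they are not SpamAssassin's rules or its configuration format.

```python
import re

# (rule name, pattern, score): hypothetical rules for illustration
RULES = [
    ("SUBJECT_INVOICE", re.compile(r"invoice", re.I), 2.5),
    ("LONG_DIGIT_RUN", re.compile(r"\d{8,}"), 1.5),
    ("ALL_CAPS_WORD", re.compile(r"\b[A-Z]{5,}\b"), 1.0),
]

def score_message(text, rules=RULES):
    """Step 1: find matching rules. Step 2: sum their scores."""
    matched = [(name, score) for name, rx, score in rules if rx.search(text)]
    return sum(score for _, score in matched), [name for name, _ in matched]

def is_spam(text, threshold=3.0):
    total, _ = score_message(text)
    return total >= threshold
```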
Existing anti-spam systems have shortcomings in several respects. A) They lack an effective feedback collection mechanism, so feedback cannot be used effectively. Although most mail systems have feedback mechanisms such as spam reporting, the feedback arriving through different channels (user reports, honeypot mailboxes, administrator review, and so on) is relatively independent and scattered, with no effective mechanism for collecting, integrating, and using it. A honeypot mailbox is a special mail account; all mail that arrives in it is spam. B) They lack an automatic learning mechanism, cannot respond in time to sudden outbreaks of spam, and are easily broken through by spammers. Existing systems judge the type of incoming mail using parameters learned or set in advance. This approach cannot deal effectively with new spam types that appear suddenly. Moreover, because the model in a conventional anti-spam system is relatively fixed, spammers can easily discover its characteristics, so that after some time the system is broken through and becomes ineffective. C) The miss rate and the false-positive rate are high. Existing systems cannot adapt to rapid changes in mail types, and some foreign anti-spam strategies do not take the particularities of Chinese into account; both lead to a high miss rate. At the same time, because existing systems lack an effective feedback mechanism for misjudgments, misjudgments are not corrected effectively and the false-positive rate is too high. D) The manual review workload is large. Two stages of existing systems require considerable manual review. First, mail that the system cannot classify must be reviewed manually, and this volume is large. Second, to adapt the system to new spam types, samples must be prepared and the system retrained; not only is the number of samples to review large, but the sample distribution must also meet strict requirements, which makes the task difficult.
Summary of the invention
To solve the above technical problems, the present invention proposes an anti-spam gateway system and method.
The anti-spam gateway system of the present invention comprises: a mail system interface, for obtaining online mail in real time from the mail transport agent and delivering it to the mail distribution module, returning the classification results of the online mail classification module to the mail transport agent, and returning the spam list of the offline mail classification module to the mail transport agent; a mail distribution module, for forwarding online mail requests to the online/offline mail classifiers and passing mail requests fed back through the various channels to the mail sample collection module; an online mail classification module, for classifying online mail according to the existing normal/spam features, returning the result to the mail transport agent in real time, and obtaining the latest mail features from the mail feature database at a certain time interval; an offline mail classification module, for obtaining the latest mail features from the mail feature database at a certain time interval, using the newly extracted features to classify the mail cached over the preceding period, and returning the classification results to the mail transport agent; a mail sample collection module, for responding to requests sent by the mail distribution module, establishing a connection, and obtaining the type and content of mail samples; a mail feature mining module, for obtaining mail samples from the mail sample database, mining from them the features of spam and normal mail, and entering the mined features into the mail feature database after review by the system administrator, and also for obtaining a mail sample from the mail sample database and comparing it with all center points, where each center point represents one class of samples, such that if the similarity measure is less than a certain threshold the sample is added directly to that center point, and such that, when the similarity between the mail sample and a center point is calculated, both are parsed into several parts, the two are compared part by part, and the part similarities are combined with weights to obtain the global similarity between the mail sample and the center point; and a mail sample database, for storing various mail samples.
The present invention also proposes an anti-spam method, comprising: through the mail system interface, obtaining online mail in real time from the mail transport agent and delivering it to the mail distribution module, returning the classification results of the online mail classification module to the mail transport agent, and returning the spam list of the offline mail classification module to the mail transport agent; through the mail distribution module, forwarding online mail requests to the online/offline mail classifiers and passing mail requests fed back through the various channels to the mail sample collection module; using the online mail classification module to classify online mail according to the existing normal/spam features, return the result to the mail transport agent in real time, and obtain the latest mail features from the mail feature database at a certain time interval; using the offline mail classification module to obtain the latest mail features from the mail feature database at a certain time interval, classify the mail cached over the preceding period with the newly extracted features, and return the classification results to the mail transport agent; through the mail sample collection module, responding to requests sent by the mail distribution module, establishing a connection, and obtaining the type and content of mail samples; through the mail feature mining module, obtaining mail samples from the mail sample database, mining from them the features of spam and normal mail, and entering the mined features into the mail feature database after review by the system administrator; also through the mail feature mining module, obtaining a mail sample from the mail sample database and comparing it with all center points, where each center point represents one class of samples, such that if the similarity measure is less than a certain threshold the sample is added directly to that center point, and such that, when the similarity between the mail sample and a center point is calculated, both are parsed into several parts, the two are compared part by part, and the part similarities are combined with weights to obtain the global similarity between the mail sample and the center point; and storing various mail samples in the mail sample database.
The anti-spam gateway system and method of the present invention have the following advantages. 1) Good adaptability to sudden outbreaks of new spam types. The feedback collection mechanism proposed by the invention gathers mail from honeypot mailboxes, user reports, and administrator review promptly and in a unified way, so the latest trends in spam on the live system can be obtained in real time; and through the online/offline learning of mail features, the latest feature profile of online mail is obtained in time, so the system can adapt to rapid changes in spam types. 2) Low spam miss rate and good real-time performance. The invention provides anti-spam modules at two levels: the online mail classification module and the offline mail classification module. The online classifier improves the real-time responsiveness of the system at the cost of some detection rate; the offline classifier makes up for the online classifier's shortfall, gaining a higher spam detection rate at the cost of greater delay, and thus acts as a remedial measure. Used together, the online and offline classifiers give the gateway a low miss rate and good real-time performance. 3) Little manual intervention. Through the feedback collection mechanism and the mail feature mining algorithm, the invention extracts mail features automatically and effectively, with no need to review samples manually; the administrator only needs to review some of the mined mail features, and this volume is very small. The manual review workload is therefore very small. 4) Good scalability. By changing the settings of the mail distribution server, the number of servers for each classification module can be increased or decreased dynamically, so the system can serve anti-spam deployments of many different sizes.
Brief description of the drawings
Fig. 1 is an architecture diagram of the anti-spam gateway system of the present invention;
Fig. 2 is a flowchart of the anti-spam method of the present invention;
Fig. 3 is a schematic diagram of the implementation of the feedback acquisition step in the anti-spam method of the present invention;
Fig. 4 is a schematic diagram of the implementation of the mail feature mining step in the anti-spam method of the present invention;
Fig. 5 is a schematic diagram of the implementation of the mail classification step in the anti-spam method of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the architecture of the anti-spam gateway system of the present invention, which is based on clustering of massive volumes of mail content.
Referring to Fig. 1, the gateway system of the present invention comprises a mail system interface, a mail distribution module, an online mail classification module, an offline mail classification module, a mail sample collection module, a mail feature mining module, a system management module, an administrator interface, a database interface, a mail sample database, and a mail feature database.
Mail system interface: handles all communication between the anti-spam gateway and the mail system. This includes obtaining online mail in real time from the mail transport agent and delivering it to the mail distribution module; returning the classification results of the online mail classification module to the mail transport agent; returning the spam list of the offline mail classification module to the mail transport agent; and establishing bulk mail export connections to obtain, from the mail server, mail of types such as user mailbox reports and honeypot mailbox mail.
Mail distribution module: distributes requests entering the gateway system to the appropriate module according to their type. Online mail requests are forwarded to the online/offline mail classifiers, and feedback mail requests (user reports, honeypot mail, administrator review, and so on) are passed to the mail sample collection module. The mail distribution module is also responsible for load balancing across the online/offline mail classification modules and the mail sample collection module.
Online mail classification module: responds to a request sent by the mail distribution module, establishes a connection with the mail distribution module, and obtains the mail content. It then classifies the online mail according to the existing normal/spam features and returns the result (spam or not) to the mail transport agent in real time over the original connection, that is, the mail transmission connection established when responding to the request from the mail distribution module. The online mail classification module also connects to the mail feature database through the database interface and obtains the latest mail features from it at a certain time interval. The mail features in the mail feature database are updated in real time; the latest mail features are those added since the previous update.
Offline mail classification module: connects to the mail feature database through the database interface and obtains the latest mail features at a certain time interval. It then uses the newly extracted features to classify the mail cached over the preceding period, and returns the classification results to the mail transport agent as a list of mail that requires a move-mail operation.
Mail sample collection module: responds to requests sent by the mail distribution module, establishes a connection, and obtains the type and content of mail samples. It collects samples on the principle of keeping the proportions of the different mail types in the mail sample database balanced. The sample types collected include spam reported by users, normal mail reported by users, mail from honeypots, and administrator review results.
Mail feature mining module: called by the system management module to obtain mail samples from the mail sample database and mine spam and normal-mail features from them. The module first connects to the mail sample database through the database interface and obtains the feedback samples, then analyzes these samples; the mined mail features enter the mail feature database after review by the system administrator.
The mail feature mining module uses a clustering algorithm to extract various types of mail features from the feedback samples. Specifically, it extracts from the feedback mail samples the mail whose report count reaches a certain threshold, and rejects feedback that is due to interference or to individual user preference. For example, if a class of spam on the subject of invoices is found and its report count exceeds a threshold (such as 100), that mail is judged to be spam and its features are added to the spam feature database. Conversely, if some users report mail such as a newsletter as spam while others consider it normal mail, that mail cannot serve as a spam sample.
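The filtering rule in this paragraph can be sketched as a small predicate. The names are hypothetical; the report threshold of 100 comes from the example above, while the 10% cap on ham reports is an assumed value used only to illustrate how conflicting feedback might be rejected.

```python
from dataclasses import dataclass

@dataclass
class FeedbackStats:
    spam_reports: int   # times users reported the mail class as spam
    ham_reports: int    # times users reported it as normal mail

def is_spam_sample(stats, min_reports=100, max_ham_ratio=0.1):
    """Accept a mail class as a spam sample only if it is reported often
    and users do not disagree about it (max_ham_ratio is an assumption)."""
    if stats.spam_reports <= min_reports:
        return False
    total = stats.spam_reports + stats.ham_reports
    return stats.ham_reports / total < max_ham_ratio
```

Under these assumed values, an invoice campaign with 150 spam reports and 2 ham reports qualifies, while a newsletter with 150 spam reports and 120 ham reports does not.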
The clustering algorithm adopted by the present invention is preferably an improved center-point clustering algorithm. Each center point represents one class of samples and contains the following information: a mail title template; a short-text template for short texts; the mean of the corresponding fingerprints for long texts; the mean of the attachment fingerprints; an IP set; and a sender set. A typical center point is as follows: the mail title template is "issuing * invoices * on your behalf" (* is a wildcard); the short-text template is "our com*pany issues * * * all kinds of VAT invoices for *; those in * need * contact QQ 92342*"; the long-text fingerprint and the attachment fingerprint are the nilsimsa hash values of the corresponding content; the IP set is the list of sender IPs, such as "199.1.1.1"; and the sender set is the list of sending mailboxes, such as asdf@163.com. When a new mail sample arrives, it is compared with all current center points; if the similarity measure is less than a certain threshold, the sample is added directly to that center point and the center point is updated. Each mail center point obtained by clustering is one mail feature. After clustering, if the number of samples in a class exceeds a threshold n and the proportion of samples reported as ham (normal mail) is less than a threshold t, the center point of that class is extracted as a spam sample. The improved center-point clustering algorithm can be implemented as a program.
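The program listing for the clustering algorithm is not reproduced in this text. What follows is a minimal sketch of the incremental procedure described above, under several assumptions: the threshold test is applied to a distance-like measure (smaller means more alike), a Jaccard distance over token sets stands in for the weighted multi-part measure, and a center point is reduced to a token template plus a sender set. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Center:
    tokens: set                                  # simplified template
    senders: set = field(default_factory=set)
    size: int = 0                                # samples assigned so far
    ham_reports: int = 0                         # samples reported as ham

def distance(sample_tokens, center):
    """Jaccard distance; a stand-in for the patent's weighted measure."""
    union = sample_tokens | center.tokens
    if not union:
        return 0.0
    return 1.0 - len(sample_tokens & center.tokens) / len(union)

def add_sample(centers, tokens, sender, is_ham, threshold=0.5):
    tokens = set(tokens)
    best = min(centers, key=lambda c: distance(tokens, c), default=None)
    if best is None or distance(tokens, best) >= threshold:
        best = Center(tokens=set(tokens))        # start a new class
        centers.append(best)
    else:
        best.tokens &= tokens                    # update: keep the common part
    best.senders.add(sender)
    best.size += 1
    best.ham_reports += int(is_ham)
    return best

def spam_features(centers, n=3, t=0.2):
    """Centers with more than n samples and a ham ratio below t."""
    return [c for c in centers if c.size > n and c.ham_reports / c.size < t]
```

Keeping only the shared tokens when a sample joins a class mirrors the idea of a template whose differing parts become wildcards.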
In the center-point clustering algorithm above, the similarity between a mail sample and a center point is calculated as follows. To calculate the similarity between a mail sample and a given center point, the following steps are performed: parse the mail into several major parts, such as the mail title, sending IP, sender, body, and attachments; remove interference from the body and extract five blocks of content, namely the mail structure frame, Chinese text, English text, text in other languages, and body structure information; for enumerated variables such as IP, measure similarity directly by whether the sets intersect; for long text and attachments, compute similarity using fingerprints; for short text, determine similarity using the Needleman–Wunsch algorithm; and combine the part similarities with weights to obtain the global similarity of the two messages.
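The final weighted combination can be sketched as below. The source does not give the weights or say whether they are normalized; the part names, the weight values, and the division by the total weight are assumptions made for illustration.

```python
def set_similarity(a, b):
    """1.0 if the two collections share any element, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0

def global_similarity(part_sims, weights):
    """Weighted average of per-part similarities (normalization is an assumption)."""
    total_weight = sum(weights[name] for name in part_sims)
    return sum(weights[name] * sim for name, sim in part_sims.items()) / total_weight

# Hypothetical weights and part scores
weights = {"title": 0.3, "ip": 0.2, "body": 0.5}
parts = {
    "title": 1.0,
    "ip": set_similarity(["199.1.1.1"], ["199.1.1.1", "10.0.0.1"]),
    "body": 0.5,
}
print(global_similarity(parts, weights))
```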
The similarity measures for the individual parts are as follows. 1) Enumerated variables such as IP and sender: the sending IPs of all mail in a center point form a set. When the similarity of two IP sets is measured, if their intersection is non-empty (that is, they share an IP), the similarity is defined as 1; otherwise it is 0. Enumerated variables such as the sender are handled similarly. 2) Short text: the Needleman–Wunsch algorithm is used to determine the optimal alignment of the two sequences. For the principle of the algorithm and pseudocode, see http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm. The algorithm requires match and mismatch scores for three types of character (Chinese, English, and wildcard), which can be obtained from rough statistics on the data. After alignment, the common part of the two strings forms their template, and the differing parts are represented by wildcards. 3) Long text: the nilsimsa fingerprint technique is used to compare the similarity of the denoised texts. It can be implemented with the open-source code at http://ixazon.dynip.com/~cmeclax/nilsimsa.html.
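The short-text step can be illustrated with a compact Needleman–Wunsch alignment that also builds the wildcard template. This sketch simplifies the description in one respect: it uses a single match, mismatch, and gap score for every character rather than separate scores for Chinese, English, and wildcard characters. The function name and score values are hypothetical.

```python
def nw_template(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (template, alignment score).
    Aligned equal characters are kept, everything else becomes '*'."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out.append(a[i - 1] if a[i - 1] == b[j - 1] else "*")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out.append("*")
            i -= 1
        else:
            out.append("*")
            j -= 1
    # Collapse runs of wildcards into a single '*'.
    template = []
    for ch in reversed(out):
        if ch == "*" and template and template[-1] == "*":
            continue
        template.append(ch)
    return "".join(template), score[n][m]
```

For two subjects that differ only in a trailing number, the shared prefix survives and the differing digits collapse into one wildcard.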
When a new mail arrives, the anti-spam gateway first uses the online mail classification module to compare it against the spam queue. If any entry in the spam queue has a similarity measure with this mail of less than a threshold t, the mail is judged to be spam and the result is returned. The spam queue is a member of the online mail classification module; its contents are obtained from the mail feature database. The specific algorithm is as follows:
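The algorithm listing did not survive in this text. A minimal sketch of the comparison described above, assuming the threshold test is applied to a distance-like measure and using a token-set Jaccard distance as a stand-in for the multi-part measure; all names are hypothetical.

```python
def token_distance(a, b):
    """Jaccard distance over whitespace tokens (a stand-in measure)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0

class OnlineClassifier:
    def __init__(self, distance, threshold):
        self.distance = distance
        self.threshold = threshold      # the threshold t in the text
        self.spam_queue = []            # spam features held in memory

    def refresh(self, features):
        """Replace the queue with the latest features from the feature database."""
        self.spam_queue = list(features)

    def classify(self, mail):
        for feature in self.spam_queue:
            if self.distance(mail, feature) < self.threshold:
                return "spam"
        return "ham"
```

In a deployment, `refresh` would be called at the configured time interval so that the in-memory queue tracks the feature database.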
When a new spam feature enters the spam feature database, the offline mail classification module compares the newly found spam features with all mail in the cache queue. If the similarity measure between a cached mail and a new spam feature is less than the threshold t, the mail is judged to be spam, removed from the mail queue, and the result is returned. The specific algorithm is as follows:
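The algorithm listing did not survive in this text. The sweep over the cache queue might look like the sketch below, which again assumes a distance-like measure with a token-set Jaccard distance as a stand-in; the function names and the (id, body) layout of the buffer are hypothetical.

```python
from collections import deque

def token_distance(a, b):
    """Jaccard distance over whitespace tokens (a stand-in measure)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0

def offline_sweep(buffer, new_features, distance, threshold):
    """Remove buffered mails that match any new spam feature.
    buffer is a deque of (mail_id, body); returns the ids judged spam."""
    spam_ids, kept = [], []
    for mail_id, body in buffer:
        if any(distance(body, f) < threshold for f in new_features):
            spam_ids.append(mail_id)
        else:
            kept.append((mail_id, body))
    buffer.clear()
    buffer.extend(kept)     # mails not matched stay cached for later sweeps
    return spam_ids
```

The returned list of ids corresponds to the spam list that the offline module hands back to the mail transport agent.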
The mail distribution server hosting the mail distribution module is the master server. It maintains the configuration and processing delay of each server. For each newly arrived mail, the master server polls the delay of each server and distributes the mail to the server with the smallest delay. Each slave server reports its latest processing delay to the distribution server after it finishes processing a mail.
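The minimum-delay dispatch rule can be sketched in a few lines. The class and method names are hypothetical, and the sketch ignores networking, server configuration, and failure handling.

```python
class Dispatcher:
    """Master server's view of slave servers and their latest delays."""

    def __init__(self, servers):
        # Every server starts with zero recorded delay.
        self.delay = {name: 0.0 for name in servers}

    def pick(self):
        """Return the server with the smallest recorded processing delay."""
        return min(self.delay, key=self.delay.get)

    def report(self, server, latest_delay):
        """Called by a slave server after it finishes processing a mail."""
        self.delay[server] = latest_delay
```

Because each slave reports only after finishing a mail, the recorded delay reflects the most recent completed job rather than the current queue length.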
Continuing with Fig. 1, the system management module handles the setting of the various algorithm parameters, the distribution of configuration files, and server performance monitoring and optimization.
Administrator interface: used by the system administrator to manually review and confirm the mail features mined by the system, to review some suspicious mail, and to set the various parameters.
Database interface: provides a unified interface and access control for database operations such as reading and updating the various mail samples and mail features.
Mail sample database: stores the various labeled mail obtained from user reports, administrator review, and honeypot mailboxes.
Mail feature database: stores the various mail features obtained by the mail feature mining module.
In summary, the anti-spam gateway of the present invention consists of the mail system interface, the mail distribution module, the online/offline mail classification modules, the mail sample collection module, the mail feature mining module, the system management module, the administrator interface, the database interface, the mail sample database, and the mail feature database. Together these modules perform three functions: mail classification, feedback collection, and mail feature mining. For mail classification, the gateway obtains information such as mail content and user behavior from the mail transport agent through the mail system interface, classifies the mail with the online/offline mail classification modules, and returns the mail category to the mail transfer server. For feedback collection, mail samples such as user feedback, honeypot mail, and administrator review results enter the gateway through the mail distributor and the mail sample collection module and become learning samples. For mail feature mining, the gateway uses the mail feature mining module to mine the latest spam features from the feedback samples and distributes them to the online/offline mail classification models.
The anti-spam gateway of the present invention extracts spam and normal-mail features from feedback information. User feedback, such as reporting spam, reporting normal mail, and moving messages between folders, contains a large amount of useful information but also a large amount of noise. Removing that noise and extracting spam and normal-mail features in a timely manner is the key to the gateway's self-learning.
The anti-spam gateway of the present invention uses a spam classification algorithm. Specifically, it combines the existing normal/spam features to classify the mail assigned by the mail distribution module, with three goals: a low spam false-positive rate, a high spam detection rate, and a fast response time.
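The split between a real-time classifier and a classifier that revisits cached mail once newer features are available can be illustrated with a minimal sketch. The patent does not say how a mail feature is matched against a message, so the substring matching below is purely an assumption, and the names `classify`, `OfflineClassifier`, `remember`, and `reclassify` are hypothetical.

```python
from collections import deque


def classify(mail_text, spam_features):
    """Label a mail 'spam' if any known spam feature occurs in it.

    Assumption: a feature is a lowercase substring. The patent does not
    define the matching rule.
    """
    text = mail_text.lower()
    return "spam" if any(f in text for f in spam_features) else "normal"


class OfflineClassifier:
    """Keeps recent mail and re-checks it against newer features."""

    def __init__(self, max_cached=1000):
        # Oldest entries are dropped automatically once the cache is full.
        self.cache = deque(maxlen=max_cached)

    def remember(self, mail_id, mail_text):
        self.cache.append((mail_id, mail_text))

    def reclassify(self, latest_features):
        """Return the ids of cached mail that the latest features flag as spam."""
        return [mail_id for mail_id, text in self.cache
                if classify(text, latest_features) == "spam"]


offline = OfflineClassifier()
old_features = {"cheap pills"}
mail = "Claim your lottery winnings today"

online_result = classify(mail, old_features)   # decided immediately
offline.remember("m1", mail)

new_features = old_features | {"lottery winnings"}
late_spam = offline.reclassify(new_features)   # reported asynchronously
```

In this sketch the online path answers at once with whatever features it has, while the offline path produces a spam list later, which corresponds to the asynchronous result the offline module returns to the mail transfer agent.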
The anti-spam gateway of the present invention uses the dispatching algorithm of the mail distribution module to distribute the high-speed, high-volume stream of online mail to the individual processors in real time. This achieves distributed deployment of the processing logic for the various kinds of mail and load balancing across the servers and services.
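The patent does not specify which scheduling policy the mail distribution module uses. As one possible illustration of load balancing, the sketch below always hands the next mail to the processor with the fewest mails in flight. The least-loaded policy and the names `Dispatcher`, `dispatch`, and `done` are assumptions, not taken from the patent.

```python
class Dispatcher:
    """Least-loaded dispatch over a fixed set of processors (assumed policy)."""

    def __init__(self, processor_names):
        # Number of mails currently being handled by each processor.
        self.in_flight = {name: 0 for name in processor_names}

    def dispatch(self, mail_id):
        """Pick the processor with the fewest in-flight mails.

        Ties are broken by name so the choice is deterministic.
        """
        target = min(self.in_flight, key=lambda n: (self.in_flight[n], n))
        self.in_flight[target] += 1
        return target

    def done(self, processor_name):
        """Record that a processor finished one mail."""
        self.in_flight[processor_name] -= 1


dispatcher = Dispatcher(["online-1", "online-2"])
first = dispatcher.dispatch("m1")
second = dispatcher.dispatch("m2")
third = dispatcher.dispatch("m3")
dispatcher.done("online-2")
fourth = dispatcher.dispatch("m4")
```

A production dispatcher would also have to route by mail type (online, offline, feedback sample), which this sketch leaves out.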
Fig. 2 is a flow chart of the anti-spam method of the present invention, which is based on content clustering of bulk mail. Fig. 3 is a schematic diagram of the implementation of the feedback acquisition step. Fig. 4 is a schematic diagram of the implementation of the mail feature mining step. Fig. 5 is a schematic diagram of the implementation of the mail classification step.
With reference to Fig. 2, the method comprises the following steps.
S201: Through the mail system interface, obtain online mail from the mail transfer agent in real time and deliver it to the mail distribution module; return the classification results of the online mail classification module to the mail transfer agent; and return the spam list of the offline mail classification module to the mail transfer agent. With further reference to Fig. 3, in this step normal-mail and spam samples can be obtained from three sources (the system administrator, users, and honeypots), and these messages enter the mail sample database after passing through the mail distribution module and the mail sample collection module. The mail sample collection module collects samples on the principle of keeping the proportions of the various mail types in the mail sample database balanced.
S202: Through the mail distribution module, forward online mail requests to the online/offline mail classifiers, and pass mail requests fed back through the various channels to the mail sample collection module.
S203: Use the online mail classification module to classify online mail according to the existing normal/spam features, return the identification result to the mail transfer agent in real time, and obtain the latest mail features from the mail feature database at a certain time interval. With reference to Fig. 5, the mail classification process can be understood further. The mail transfer agent passes mail information into the anti-spam gateway through the mail system interface. The mail system interface forwards the mail to the mail distribution module. The mail distribution module distributes the mail to the online mail classification module, the offline mail classification module, and the sample collection module according to the configured policy. The online/offline mail classification modules classify the mail according to the information in the mail feature database and return the result to the mail transfer agent along the path "mail distribution module, mail system interface, mail transfer agent". The mail distribution module also forwards mail to the sample collection module, which decides according to its own policy whether to add the mail to the sample database. The difference between the two classification modules is that the online module returns its verdict in real time, whereas the offline module returns its verdict to the mail transfer agent asynchronously.
S204: Use the offline mail classification module to obtain the latest mail features from the mail feature database at a certain time interval, classify the mail cached over the preceding period with the newly extracted features, and return the classification results to the mail transfer agent.
S205: Through the mail sample collection module, respond to the requests sent by the mail distribution module, accept the connection, and obtain the type and content of the mail sample.
S206: Through the mail feature mining module, obtain mail samples from the mail sample database, mine the features of spam and normal mail from them, and enter the mined features into the mail feature database after review by the system administrator. With further reference to Fig. 4, in this step the system first extracts the mail samples of the most recent period from the mail sample database; the mail feature mining module then performs cluster analysis on the samples, and the mined mail features are added to the mail feature database after review by the system administrator. During cluster analysis, each mail sample is compared with all center points; if the similarity is less than a certain threshold, the sample is added directly to that center point, where each center point is the representative of one class of samples. To calculate the similarity between a mail sample and a center point, both are parsed into several parts, the two are compared part by part, and the per-part similarities are combined by weighting to obtain the overall similarity. When comparing each part, enumerated variables are measured by whether their sets intersect, long text and attachments are compared by fingerprints, and short text is compared with the Needleman-Wunsch algorithm. The mined mail features are manually reviewed and confirmed, some suspicious mail is audited, and the various parameters are set.
S207: Store the various mail samples in the mail sample database.
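The cluster analysis described above can be sketched as follows. This is an illustration under several assumptions rather than the patented implementation: the part names (`sender_domains`, `subject`, `body`), the weights, the threshold value, the fingerprint scheme (hashed character 5-grams compared by Jaccard overlap), and the normalization of the Needleman-Wunsch score are all hypothetical. The text says a sample joins a center point when the measure is below a threshold; the sketch reads that measure as a distance (1 minus the weighted similarity). Opening a new center when nothing is close enough is likewise an assumption.

```python
import hashlib


def set_similarity(a, b):
    """Enumerated variables: 1.0 if the two sets intersect, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0


def fingerprints(text, k=5):
    """Hashed character k-grams of whitespace-normalized text (assumed scheme)."""
    norm = " ".join(text.lower().split())
    grams = [norm[i:i + k] for i in range(max(1, len(norm) - k + 1))]
    return {hashlib.md5(g.encode("utf-8")).hexdigest()[:8] for g in grams if g}


def fingerprint_similarity(a, b):
    """Long text and attachments: Jaccard overlap of the fingerprint sets."""
    fa, fb = fingerprints(a), fingerprints(b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, computed one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def short_text_similarity(a, b):
    """Short text: alignment score scaled to [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch(a, b) / longest)


# Hypothetical parts and weights; the patent does not list them.
WEIGHTS = {"sender_domains": 0.2, "subject": 0.3, "body": 0.5}


def overall_similarity(sample, center):
    """Weighted combination of the per-part similarities."""
    parts = {
        "sender_domains": set_similarity(sample["sender_domains"],
                                         center["sender_domains"]),
        "subject": short_text_similarity(sample["subject"], center["subject"]),
        "body": fingerprint_similarity(sample["body"], center["body"]),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def assign(sample, centers, clusters, max_distance=0.4):
    """Add the sample to the nearest center, or open a new cluster."""
    best, best_distance = None, None
    for index, center in enumerate(centers):
        distance = 1.0 - overall_similarity(sample, center)
        if best_distance is None or distance < best_distance:
            best, best_distance = index, distance
    if best is not None and best_distance < max_distance:
        clusters[best].append(sample)
        return best
    centers.append(sample)          # assumption: the sample becomes a new center
    clusters.append([sample])
    return len(centers) - 1


body = "Click here to claim your free prize today, limited offer for you"
spam_a = {"sender_domains": ["promo.example"],
          "subject": "Win a free prize now", "body": body}
spam_b = {"sender_domains": ["promo.example"],
          "subject": "Win a free prize now!", "body": body}
ham = {"sender_domains": ["corp.example"], "subject": "Meeting notes",
       "body": "Attached are the notes from Tuesday's planning meeting."}

centers, clusters = [], []
placements = [assign(m, centers, clusters) for m in (spam_a, spam_b, ham)]
```

With these inputs the two near-identical promotional messages end up in the same cluster and the unrelated message opens a second one. In practice the weights and the threshold would be among the parameters the administrator sets.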
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. An anti-spam gateway system, comprising:
A mail system interface, configured to obtain online mail from the mail transfer agent in real time and deliver it to the mail distribution module, to return the classification results of the online mail classification module to the mail transfer agent, and to return the spam list of the offline mail classification module to the mail transfer agent;
A mail distribution module, configured to forward online mail requests to the online/offline mail classifiers and to pass mail requests fed back through the various channels to the mail sample collection module;
An online mail classification module, configured to classify online mail according to the existing normal/spam features, to return the identification result to the mail transfer agent in real time, and to obtain the latest mail features from the mail feature database at a certain time interval;
An offline mail classification module, configured to obtain the latest mail features from the mail feature database at a certain time interval, to classify the mail cached over the preceding period with the newly extracted mail features, and to return the classification results to the mail transfer agent;
A mail sample collection module, which responds to the requests sent by the mail distribution module, accepts the connection, and obtains the type and content of the mail sample;
A mail feature mining module, configured to obtain mail samples from the mail sample database, to mine the features of spam and normal mail from them, and to enter the mined mail features into the mail feature database after review by the system administrator; and further configured to obtain a mail sample from the mail sample database and compare the mail sample with all center points and, if the similarity is less than a certain threshold, add the sample directly to that center point, wherein each center point is the representative of one class of samples; when calculating the similarity between the mail sample and a center point, the mail sample and the center point are each parsed into a plurality of parts, the similarity of the two is compared for each part, and the similarities of the parts are combined by weighting to obtain the overall similarity between the mail sample and the center point; when a new mail sample arrives, the mail sample is compared with all center points and, if the similarity is less than a certain threshold, the sample is added directly to that center point; after clustering, when the number of samples in a class exceeds one threshold and the proportion of samples reported as normal mail is less than another threshold, the center of that class is extracted as a spam sample;
A mail sample database, configured to store the various mail samples.
2. The mail gateway system as claimed in claim 1, characterized in that, when comparing the similarity between the mail sample and the center point for each part, enumerated variables are measured by whether their sets intersect, long text and attachments are compared by fingerprints, and short text is compared with the Needleman-Wunsch algorithm.
3. The mail gateway system as claimed in claim 1, characterized in that the system further comprises:
An administrator interface, through which the system administrator manually reviews and confirms the mail features mined by the gateway system, audits some suspicious mail, and sets the various parameters.
4. An anti-spam method, comprising the steps of:
Obtaining online mail from the mail transfer agent in real time through the mail system interface and delivering it to the mail distribution module, returning the classification results of the online mail classification module to the mail transfer agent, and returning the spam list of the offline mail classification module to the mail transfer agent;
Forwarding, through the mail distribution module, online mail requests to the online/offline mail classifiers, and passing mail requests fed back through the various channels to the mail sample collection module;
Using the online mail classification module to classify online mail according to the existing normal/spam features, returning the identification result to the mail transfer agent in real time, and obtaining the latest mail features from the mail feature database at a certain time interval;
Using the offline mail classification module to obtain the latest mail features from the mail feature database at a certain time interval, classifying the mail cached over the preceding period with the newly extracted mail features, and returning the classification results to the mail transfer agent;
Responding, through the mail sample collection module, to the requests sent by the mail distribution module, accepting the connection, and obtaining the type and content of the mail sample;
Obtaining, through the mail feature mining module, mail samples from the mail sample database, mining the features of spam and normal mail from them, and entering the mined mail features into the mail feature database after review by the system administrator; and further obtaining, through the mail feature mining module, a mail sample from the mail sample database and comparing the mail sample with all center points and, if the similarity is less than a certain threshold, adding the sample directly to that center point, wherein each center point is the representative of one class of samples; when calculating the similarity between the mail sample and a center point, the mail sample and the center point are each parsed into a plurality of parts, the similarity of the two is compared for each part, and the similarities of the parts are combined by weighting to obtain the overall similarity between the mail sample and the center point; when a new mail sample arrives, the mail sample is compared with all center points and, if the similarity is less than a certain threshold, the sample is added directly to that center point; after clustering, when the number of samples in a class exceeds one threshold and the proportion of samples reported as normal mail is less than another threshold, the center of that class is extracted as a spam sample;
Storing the various mail samples in the mail sample database.
5. The method as claimed in claim 4, characterized in that, when comparing the similarity between the mail sample and the center point for each part, enumerated variables are measured by whether their sets intersect, long text and attachments are compared by fingerprints, and short text is compared with the Needleman-Wunsch algorithm.
6. The method as claimed in claim 4, characterized in that it further comprises:
Manually reviewing and confirming the mined mail features, auditing some suspicious mail, and setting the various parameters.
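Claims 1 and 4 end with a rule for turning a cluster into a spam sample: the class must contain more samples than one threshold, and the proportion of its samples reported as normal mail must be below another threshold. A minimal sketch of that rule follows. The data layout (a `center` plus a list of `reports` labeled `"spam"` or `"normal"`), the function name `extract_spam_centers`, and the threshold values are assumptions chosen for illustration.

```python
def extract_spam_centers(clusters, min_size=3, max_normal_ratio=0.2):
    """Return the centers of clusters that qualify as spam samples.

    A cluster qualifies when it holds more than `min_size` samples and the
    share of samples reported as normal mail is below `max_normal_ratio`.
    Both thresholds are hypothetical values.
    """
    selected = []
    for cluster in clusters:
        reports = cluster["reports"]
        if len(reports) <= min_size:
            continue  # too few samples to trust the cluster
        normal_ratio = reports.count("normal") / len(reports)
        if normal_ratio < max_normal_ratio:
            selected.append(cluster["center"])
    return selected


clusters = [
    # Large, almost unanimously reported as spam: qualifies.
    {"center": "prize-campaign",
     "reports": ["spam"] * 9 + ["normal"]},
    # Large, but many users call it normal mail: rejected.
    {"center": "newsletter",
     "reports": ["spam"] * 5 + ["normal"] * 5},
    # Unanimous but too small: rejected.
    {"center": "one-off",
     "reports": ["spam", "spam"]},
]

spam_centers = extract_spam_centers(clusters)
```

The two conditions work together: the size threshold filters out isolated reports, and the ratio threshold filters out bulk mail that a meaningful share of recipients consider legitimate.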
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110304470.3A CN102377690B (en) | 2011-10-10 | 2011-10-10 | Anti-spam gateway system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102377690A (en) | 2012-03-14 |
CN102377690B (en) | 2014-09-17 |
Family
ID=45795681
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110304470.3A Active CN102377690B (en) | 2011-10-10 | 2011-10-10 | Anti-spam gateway system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102377690B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103441924B (en) * | 2013-09-03 | 2016-06-08 | 盈世信息科技(北京)有限公司 | A kind of rubbish mail filtering method based on short text and device |
CN103744888A (en) * | 2013-12-23 | 2014-04-23 | 新浪网技术(中国)有限公司 | Method and system for anti-spam gateway to query database |
CN103841006A (en) * | 2014-02-25 | 2014-06-04 | 汉柏科技有限公司 | Method and device for intercepting junk mails in cloud computing system |
CN104796318A (en) * | 2014-07-30 | 2015-07-22 | 北京中科同向信息技术有限公司 | Behavior pattern identification technology |
CN108197638B (en) * | 2017-12-12 | 2020-03-20 | 阿里巴巴集团控股有限公司 | Method and device for classifying sample to be evaluated |
CN108737255B (en) * | 2018-05-31 | 2020-07-10 | 北京明朝万达科技股份有限公司 | Load balancing method, load balancing device and server |
CN112579733B (en) * | 2019-09-30 | 2023-10-20 | 华为技术有限公司 | Rule matching method, rule matching device, storage medium and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1716293A (en) * | 2004-06-29 | 2006-01-04 | 微软公司 | Incremental Antispam Lookup and Update Service |
GB2425855A (en) * | 2005-04-25 | 2006-11-08 | Messagelabs Ltd | Detecting and filtering of spam emails |
CN101094197A (en) * | 2006-06-23 | 2007-12-26 | 腾讯科技(深圳)有限公司 | Method and mail server of anti garbage mail |
CN101136874A (en) * | 2007-07-25 | 2008-03-05 | 华南理工大学 | Anti-spam false filtering method and system based on comprehensive decision |
CN101227435A (en) * | 2008-01-28 | 2008-07-23 | 浙江大学 | Chinese Spam Filtering Method Based on Logistic Regression |
CN101295381A (en) * | 2008-06-25 | 2008-10-29 | 北京大学 | A spam detection method |
CN101299729A (en) * | 2008-06-25 | 2008-11-05 | 哈尔滨工程大学 | Method for judging rubbish mail based on topological action |
CN101415159A (en) * | 2008-12-02 | 2009-04-22 | 腾讯科技(深圳)有限公司 | Method and apparatus for intercepting junk mail |
CN101588558A (en) * | 2009-03-30 | 2009-11-25 | 网易(杭州)网络有限公司 | Spam filtering method and system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1696943A (en) * | 2004-05-13 | 2005-11-16 | 上海极软软件技术有限公司 | Self-adaptive method for filtering out garbage E-mails safely |
CN101083630A (en) * | 2006-06-01 | 2007-12-05 | 珠海金山软件股份有限公司 | Anti-rubbish E-mail system and method |
CN101119341B (en) * | 2007-09-20 | 2011-02-16 | 腾讯科技(深圳)有限公司 | Mail identifying method and apparatus |
CN102075447B (en) * | 2009-11-25 | 2015-08-12 | 中兴通讯股份有限公司 | The method and system of anti-rubbish mail |
Also Published As
Publication number | Publication date |
---|---|
CN102377690A (en) | 2012-03-14 |
Similar Documents
Publication | Title |
---|---|
CN102377690B (en) | Anti-spam gateway system and method |
KR101117866B1 (en) | Intelligent quarantining for spam prevention |
US6928465B2 (en) | Redundant email address detection and capture system |
US7930353B2 (en) | Trees of classifiers for detecting email spam |
Toolan et al. | Feature selection for spam and phishing detection |
CN101674264B (en) | Spam detection device and method based on user relationship mining and credit evaluation |
CN102413076A (en) | Spam mail judging system based on behavior analysis |
US20060036693A1 (en) | Spam filtering with probabilistic secure hashes |
CN101087259A (en) | A system for filtering spam in Internet and its implementation method |
EP2649535A2 (en) | Electronic communications triage |
CN101637002A (en) | A method and system for collecting addresses for remotely accessible information sources |
CN102124485B (en) | Apparatus, and associated method, for detecting fraudulent text message |
CN104040963A (en) | System and methods for spam detection using frequency spectra of character strings |
Bhat et al. | Classification of email using BeaKS: Behavior and keyword stemming |
CN103595614A (en) | User feedback based junk mail detection method |
Mishra et al. | Analysis of random forest and Naive Bayes for spam mail using feature selection catagorization |
JP2009104400A (en) | E-mail filtering device, e-mail filtering method and program |
US8819142B1 (en) | Method for reclassifying a spam-filtered email message |
CN118250248A (en) | Processing method, device, equipment and medium for sending mails in batches |
Chien et al. | Email Feature Classification and Analysis of Phishing Email Detection Using Machine Learning Techniques |
Gonzalez-Talavan | A simple, configurable SMTP anti-spam filter: Greylists |
Daisy et al. | Email spam behavioral sieving technique using hybrid algorithm |
KR20140127036A (en) | Server and method for spam filtering |
Sajwan et al. | Email spam filteration with machine learning |
Revathi et al. | Email Spam Detection Using Naive Bayes Algorithm |
Legal Events
Code | Title |
---|---|
C06 | Publication |
PB01 | Publication |
C10 | Entry into substantive examination |
SE01 | Entry into force of request for substantive examination |
C14 | Grant of patent or utility model |
GR01 | Patent grant |