
CN102377690B - Anti-spam gateway system and method - Google Patents


Info

Publication number
CN102377690B
Authority
CN
China
Prior art keywords
mail
sample
module
classification
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110304470.3A
Other languages
Chinese (zh)
Other versions
CN102377690A (en)
Inventor
蔡瑞初
向东
熊卫华
洪陆驾
谭景峰
乔斌
潘雷明
周达和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201110304470.3A
Publication of CN102377690A
Application granted
Publication of CN102377690B
Anticipated expiration


Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an anti-spam gateway system and an anti-spam method. The system comprises a mail sample database for storing various mail samples, and a mail feature mining module that obtains mail samples from the mail sample database and compares each sample with all central points. If the similarity measure between the sample and a central point is less than a certain threshold, the sample is added directly to that central point, where each central point represents one class of samples. To calculate the similarity between a mail sample and a central point, both are parsed into several parts, the similarity of the two is computed for each part, and the per-part similarities are combined by weighting to obtain the overall similarity. With this system and method, the sample database and the feature database adapt well to sudden outbreaks of new spam types; as a result, the spam miss rate is low, real-time performance is good, little manual intervention is needed, and the system scales well.

Description

Anti-spam gateway system and method
Technical field
The present invention relates to the field of email processing, and in particular to an anti-spam gateway system and method based on large-scale mail content clustering.
Background art
Spam is generally defined as email with the following attributes: (1) advertisements, electronic publications, promotional material of any form, and similar email that the recipient has not requested or agreed to receive in advance; (2) email that the recipient cannot refuse; (3) email that hides information such as the sender's identity, address, or subject; (4) email that contains false information about its source, sender, or route.
Since the first spam message appeared, spam has been a persistent problem for mail users, and it has become an important consideration for mail operators seeking to improve user experience and attract users. The task of anti-spam is to keep spam out of the mail system or the user's inbox. Mainstream anti-spam technology is based mainly on mail content and on mail sending behavior.
Existing content-based anti-spam technologies mainly include: the open-source system Dspam (available from http://www.nuclearelephant.com); patent application No. 200810227762 of Tencent Technology (Shenzhen) Co., Ltd., entitled "Method and apparatus for intercepting spam"; patent application No. 200810059602 of Zhejiang University, entitled "Chinese spam filtering method based on logistic regression"; and patent application No. 200810115584 of Peking University, entitled "A spam detection method".
The anti-spam technologies above mainly comprise two flows: training and online use. Dspam is taken as an example below to introduce the main steps of each; the other related technologies are broadly similar. The training flow of Dspam comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. decode the mail; 3. segment the message body into words; 4. count the frequency of each word; 5. train a naive Bayes classification model using Bayes' formula. Once the Dspam model has been trained, the online flow is relatively simple and comprises only two steps: 1. segment the online mail into words; 2. classify the mail with the trained naive Bayes classification model.
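The training and online steps described above can be sketched as follows. This is a minimal illustration of a naive Bayes text classifier, not Dspam's actual code: the function names `train` and `classify` are hypothetical, mail is assumed to be already decoded and segmented into tokens, and Laplace smoothing is an added assumption so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter


def train(samples):
    """Count word frequencies per class from (tokens, label) pairs.

    Labels are "spam" or "ham"; both must occur at least once.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log posterior under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in tokens:
            # Laplace (add-one) smoothing, an assumption of this sketch.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A model trained on a handful of labeled token lists can then be applied to each incoming mail with `classify(tokens, model)`.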
Anti-spam strategies based on real-time sending behavior differ considerably from content-based ones. Anti-spam systems based on real-time behavior generally have no training step. Typical anti-spam strategies based on sending behavior include Checksum (available from http://www.rhyolite.com/dcc/) and patent application No. 200810064806 of Harbin Engineering University, entitled "A method for identifying spam based on topological behavior". Checksum is taken as an example below to introduce the basic flow. The basic assumption of Checksum is that mail repeated many times is spam. Its flow is roughly as follows: 1. compute a fingerprint for each mail; 2. count the fingerprints of all mail in the online system; 3. directly judge mail whose fingerprint is highly repeated to be spam.
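The three-step flow above can be sketched as a fingerprint counter. This is an illustration only and does not reproduce the Checksum software: the class name `FingerprintCounter`, the repetition threshold, and the use of a SHA-256 digest over whitespace-normalized, lower-cased text are all assumptions of the sketch (real systems typically use fuzzier fingerprints).

```python
import hashlib
from collections import Counter


class FingerprintCounter:
    """Flags a mail as spam once its fingerprint has been seen `threshold` times."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    @staticmethod
    def fingerprint(body):
        # Step 1: compute a fingerprint (here an exact hash of normalized text).
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def observe(self, body):
        # Step 2: count the fingerprint; step 3: judge by repetition count.
        fp = self.fingerprint(body)
        self.counts[fp] += 1
        return self.counts[fp] >= self.threshold
```

Each incoming mail is passed to `observe`, which returns `True` once the same normalized content has been seen often enough.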
Combining mail content with real-time sending behavior is the mainstream approach of current commercial anti-spam systems. An effective way to combine the two is to convert mail content and real-time sending behavior characteristics into rules, accumulate a score for each matching rule, and decide whether the mail is spam according to a score threshold. Representative technologies include the open-source system SpamAssassin (available from http://spamassassin.apache.org/); patent application No. 200710029369 of South China University of Technology, entitled "Anti-spam mis-filtering method and system based on integrated decision-making"; the commercial Brightmail system of Symantec Corporation (available from http://www.symantec.com/business/products/family.jsp?familyid=brightmail); and the KBAS system of Hanqi Technology (available from http://www.hanqinet.com/project1.html). SpamAssassin is taken as a representative to introduce the main flow. SpamAssassin comprises two flows: training and online use. Training of rule-based anti-spam technology mainly comprises the following steps: 1. obtain a large number of mail samples and manually label them as spam or normal mail; 2. manually add rules to build a rule base; 3. use the manually labeled samples to assign scores to the rules. Online use comprises two steps: 1. compute the rules matched by each mail; 2. sum the scores of all matched rules and decide by threshold whether the mail is spam.
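The online steps of such a rule-based scorer can be sketched as below. The rules, their scores, and the threshold are invented for illustration and are not SpamAssassin's rules or API; in a real deployment the scores would come from the training flow described above.

```python
import re

# Hypothetical rule base: (name, predicate over a mail dict, score).
RULES = [
    ("subject_has_invoice", lambda m: "invoice" in m["subject"].lower(), 2.5),
    ("body_has_url", lambda m: re.search(r"https?://", m["body"]) is not None, 1.0),
    ("many_recipients", lambda m: len(m["recipients"]) > 20, 1.5),
]


def score(mail, rules=RULES):
    """Step 1 and 2a: find the matching rules and sum their scores."""
    return sum(weight for _, predicate, weight in rules if predicate(mail))


def is_spam(mail, threshold=3.0, rules=RULES):
    """Step 2b: compare the accumulated score with the threshold."""
    return score(mail, rules) >= threshold
```

Keeping the rules in a plain list makes it easy to add, remove, or re-score rules without touching the scoring logic.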
Existing anti-spam systems have deficiencies in several respects.
A) Lack of an effective feedback collection mechanism; feedback information cannot be used effectively. Although most mail systems have feedback mechanisms such as spam reporting, feedback from channels such as user reports, honeypot mailboxes, and administrator review is relatively independent and scattered, and there is no effective mechanism for collecting, integrating, and using it. A honeypot mailbox is a special mail account in which all incoming mail is spam.
B) Lack of an automatic learning mechanism; the system cannot respond in time to sudden outbreaks of spam, and it is easily defeated by spammers. Existing anti-spam systems judge the type of newly arriving mail using parameters learned or set in advance. This approach cannot effectively handle new spam types that break out suddenly. Moreover, because the model in a conventional anti-spam system is relatively fixed, spammers can easily discover the system's characteristics, so that after a while the system is defeated and becomes ineffective.
C) High miss rate and high false-positive rate. Existing anti-spam systems cannot adapt to rapid changes in mail types, and some foreign anti-spam strategies do not take the special circumstances of Chinese into account; both cause a high miss rate. Moreover, because existing systems lack an effective mechanism for feeding back misjudgments, misjudgments cannot be corrected effectively and the false-positive rate is too high.
D) Heavy manual review workload. Two stages of existing systems require substantial manual review. First, results that the system cannot decide need manual review, and this volume is large. Second, adapting the system to new spam types requires preparing samples and retraining; not only is the number of samples to review large, but the sample distribution must also meet strict requirements, which makes the task difficult.
Summary of the invention
To solve the above technical problems, the present invention proposes an anti-spam gateway system and method.
The anti-spam gateway system of the present invention comprises:
a mail system interface, for obtaining online mail from the mail transfer agent in real time and delivering it to the mail distribution module, returning the mail classification results of the online mail classification module to the mail transfer agent, and returning the spam list of the offline mail classification module to the mail transfer agent;
a mail distribution module, for forwarding online mail requests to the online/offline mail classifiers and passing mail requests fed back through various channels to the mail sample collection module;
an online mail classification module, for classifying online mail according to existing normal/spam features, returning the result to the mail transfer agent in real time, and obtaining the latest mail features from the mail feature database at a certain time interval;
an offline mail classification module, for obtaining the latest mail features from the mail feature database at a certain time interval, using the newly extracted mail features to classify the mail buffered over the preceding period, and returning the classification results to the mail transfer agent;
a mail sample collection module, for responding to requests sent by the mail distribution module, establishing a connection, and obtaining the type and content of mail samples;
a mail feature mining module, for obtaining mail samples from the mail sample database, mining the features of spam and normal mail from them, and entering the mined mail features into the mail feature database after review by the system administrator; and also for obtaining a mail sample from the mail sample database and comparing it with all central points, where, if the similarity measure is less than a certain threshold, the sample is added directly to that central point, each central point being the representative of one class of samples; when the similarity between a mail sample and a central point is calculated, the mail sample and the central point are each parsed into several parts, the similarity of the two is compared for each part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the central point;
a mail sample database, for storing various mail samples.
In addition, the present invention proposes an anti-spam method, comprising:
obtaining, through the mail system interface, online mail from the mail transfer agent in real time and delivering it to the mail distribution module, returning the mail classification results of the online mail classification module to the mail transfer agent, and returning the spam list of the offline mail classification module to the mail transfer agent;
forwarding, by the mail distribution module, online mail requests to the online/offline mail classifiers, and passing mail requests fed back through various channels to the mail sample collection module;
classifying, by the online mail classification module, online mail according to existing normal/spam features, returning the result to the mail transfer agent in real time, and obtaining the latest mail features from the mail feature database at a certain time interval;
obtaining, by the offline mail classification module, the latest mail features from the mail feature database at a certain time interval, using the newly extracted mail features to classify the mail buffered over the preceding period, and returning the classification results to the mail transfer agent;
responding, by the mail sample collection module, to requests sent by the mail distribution module, establishing a connection, and obtaining the type and content of mail samples;
obtaining, by the mail feature mining module, mail samples from the mail sample database, mining the features of spam and normal mail from them, and entering the mined mail features into the mail feature database after review by the system administrator; and also obtaining, by the mail feature mining module, a mail sample from the mail sample database and comparing it with all central points, where, if the similarity measure is less than a certain threshold, the sample is added directly to that central point, each central point being the representative of one class of samples; when the similarity between a mail sample and a central point is calculated, the mail sample and the central point are each parsed into several parts, the similarity of the two is compared for each part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the central point;
storing various mail samples in the mail sample database.
The anti-spam gateway system and method of the present invention have the following advantages:
1) Good adaptability to sudden outbreaks of new spam types. The feedback collection mechanism proposed by the invention gathers mail from honeypot mailboxes, user reports, and administrator review in a unified and timely manner, so the latest developments in online spam are obtained in real time. Through the online/offline learning of mail features, the latest features of online mail are obtained promptly, so the system can adapt to rapid changes in spam types.
2) Low spam miss rate and good real-time performance. The invention provides anti-spam modules at two levels: the online mail classification module and the offline mail classification module. The online classifier improves the system's real-time response at the cost of some detection rate; the offline classifier compensates for the online classifier's shortcomings, obtaining a higher spam detection rate at the cost of a larger delay and acting as a remedial measure. Used together, the online and offline classifiers give the gateway a low miss rate and good real-time performance.
3) Little manual intervention. Through the feedback collection mechanism and the mail feature mining algorithm, the invention extracts mail features automatically and effectively, without manual review of samples. The administrator only needs to review some of the mined mail features, which is a very small amount, so the manual review workload is very small.
4) Good scalability. By modifying the settings of the mail distribution server, the number of classification module servers of each kind can be increased or decreased dynamically, so the system can serve anti-spam deployments of many scales.
Brief description of the drawings
Fig. 1 is an architecture diagram of the anti-spam gateway system of the present invention;
Fig. 2 is a flow chart of the anti-spam method of the present invention;
Fig. 3 is a schematic diagram of the implementation of the feedback acquisition step in the anti-spam method of the present invention;
Fig. 4 is a schematic diagram of the implementation of the mail feature mining step in the anti-spam method of the present invention;
Fig. 5 is a schematic diagram of the implementation of the mail classification step in the anti-spam method of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the architecture of the anti-spam gateway system of the present invention, based on large-scale mail content clustering.
With reference to Fig. 1, the gateway system of the present invention comprises a mail system interface, a mail distribution module, an online mail classification module, an offline mail classification module, a mail sample collection module, a mail feature mining module, a system management module, an administrator interface, a database interface, a mail sample database, and a mail feature database.
The mail system interface implements all communication between the anti-spam gateway and the mail system. This includes obtaining online mail from the mail transfer agent in real time and delivering it to the mail distribution module; returning the mail classification results of the online mail classification module and the spam list of the offline mail classification module to the mail transfer agent; and establishing bulk mail export connections to obtain mail of types such as user reports and honeypot mailboxes from the mail server.
The mail distribution module distributes requests entering the gateway system to the appropriate module according to their type: online mail requests are forwarded to the online/offline mail classifiers, and feedback mail requests from user reports, honeypots, administrators, and the like are passed to the mail sample collection module. The mail distribution module is also responsible for load balancing across the online/offline mail classification modules and the mail sample collection modules.
The online mail classification module responds to requests sent by the mail distribution module, establishes a connection with the mail distribution module, and obtains the mail content. It then classifies the online mail according to existing normal/spam features and returns the result, spam or not, to the mail transfer agent in real time over the original connection, that is, the mail transmission connection established when responding to the request from the mail distribution module. The online mail classification module also connects to the mail feature database through the database interface and obtains the latest mail features from it at a certain time interval. The mail features in the mail feature database are updated in real time; the latest mail features are those added since the last update.
The offline mail classification module connects to the mail feature database through the database interface and obtains the latest mail features from it at a certain time interval. It then uses the newly extracted mail features to classify the mail buffered over the preceding period, and returns the classification results to the mail transfer agent as a list of mails that need to be moved.
The mail sample collection module responds to requests sent by the mail distribution module, establishes a connection, and obtains the type and content of mail samples. It collects mail samples on the principle of keeping the proportions of the mail types in the mail sample database balanced. The types of mail samples collected include spam reported by users, normal mail reported by users, mail from honeypots, and administrator review results.
The mail feature mining module is invoked by the system management module. It obtains mail samples from the mail sample database and mines the features of spam and normal mail from them. The module first connects to the mail sample database through the database interface and obtains the feedback samples, then analyzes these samples; the mined mail features enter the mail feature database after review by the system administrator.
The mail feature mining module uses a clustering algorithm to extract mail features of various types from the feedback samples. Specifically, from the feedback mail samples it extracts mail whose report count reaches a certain threshold, and discards feedback affected by interference or by individual user preference. For example, if a class of spam with an invoice theme is found and its report count exceeds a threshold (for example 100), that mail is judged to be spam and its features are added to the spam feature database. On the other hand, for mail such as a newsletter that some users report as spam while others consider normal, the mail cannot be used as a spam sample.
The clustering algorithm adopted by the invention is preferably an improved central point clustering algorithm. Each central point is the representative of one class of samples and contains the following information: mail subject template, short-text template, mean of the long-text fingerprints, mean of the attachment fingerprints, IP set, and sender set. A typical central point is as follows: the mail subject template is "issue * invoices * on behalf" (* is a wildcard); the short-text template is "our * company issues various VAT invoices * * * on your behalf *, if needed * contact QQ 92342*"; the long-text fingerprint and attachment fingerprint are the nilsimsa hash values of the corresponding content; the IP set is the list of sender IPs, such as "199.1.1.1"; the sender set is the list of sending mailboxes, such as asdf@163.com. When a new mail sample arrives, it is compared with all current central points; if the similarity measure is less than a certain threshold, the sample is added directly to that central point and the central point is updated. Each mail central point obtained by clustering is one mail feature. After clustering, if the number of samples in a class exceeds a threshold n and the proportion of samples reported as ham (normal mail) is less than a threshold t, the central point of that class is extracted as a spam sample. The improved central point clustering algorithm can be implemented by the program below.
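The program listing itself is not reproduced in this text. As a rough illustration only, the assignment and extraction steps described above could look like the sketch below. Several points are assumptions rather than the patented method: the "similarity measure" is treated as a dissimilarity (smaller means closer), a sample that matches no central point starts a new one, the central point keeps its first sample as representative instead of updating templates and mean fingerprints, and `subject_distance` is a hypothetical stand-in for the weighted multi-part measure.

```python
from dataclasses import dataclass, field


@dataclass
class Center:
    """One cluster; `representative` stands in for the templates and fingerprints."""
    representative: dict
    members: list = field(default_factory=list)


def subject_distance(a, b):
    """Hypothetical dissimilarity: 1 minus the Jaccard overlap of subject words."""
    wa = set(a["subject"].lower().split())
    wb = set(b["subject"].lower().split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0


def assign(sample, centers, distance, threshold):
    """Add the sample to the closest center below the threshold, else open a new one."""
    best, best_d = None, threshold
    for center in centers:
        d = distance(sample, center.representative)
        if d < best_d:
            best, best_d = center, d
    if best is None:  # assumption: unmatched samples become new central points
        best = Center(representative=sample)
        centers.append(best)
    best.members.append(sample)
    return best


def spam_centers(centers, n, t):
    """Keep centers with more than n samples and a ham-report ratio below t."""
    selected = []
    for center in centers:
        size = len(center.members)
        ham = sum(1 for s in center.members if s["label"] == "ham")
        if size > n and ham / size < t:
            selected.append(center)
    return selected
```

Samples are dictionaries with at least a `subject` and a `label` ("spam" or "ham") in this sketch; the centers returned by `spam_centers` correspond to the spam features that would go to administrator review.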
In the central point clustering algorithm above, the similarity between a mail sample and a central point is calculated as follows. When calculating the similarity between a mail sample and a given central point, the following steps are performed: parse the mail into several major parts, such as the subject, sending IP, sender, body, and attachments; remove interference from the body and extract five blocks of content, namely the mail structural framework, Chinese text, English text, text in other languages, and body structure information; for enumerated variables such as IP, measure similarity directly by whether the sets intersect; for long text and attachments, use fingerprints to calculate the similarity of the two; for short text, use the Needleman-Wunsch algorithm to determine the similarity of the two; finally, combine the per-part similarities by weighting to obtain the overall similarity of the two mails.
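The final weighted-combination step can be sketched as a normalized weighted sum. The part names and weight values below are hypothetical, since the text does not give the weights, and skipping parts that are absent from a mail (for example, no attachment) is a design choice of this sketch.

```python
# Hypothetical per-part weights; the source does not specify actual values.
DEFAULT_WEIGHTS = {
    "subject": 0.3,
    "body": 0.4,
    "attachment": 0.1,
    "ip": 0.1,
    "sender": 0.1,
}


def overall_similarity(part_scores, weights=DEFAULT_WEIGHTS):
    """Combine per-part similarities into one score, renormalizing over present parts."""
    used = {part: w for part, w in weights.items() if part in part_scores}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(part_scores[part] * w for part, w in used.items()) / total
```

Renormalizing over the parts that are present keeps the result in the same range as the per-part scores, so one threshold can be applied to mails with and without attachments.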
The similarity measures for the individual parts are as follows.
1) Enumerated variables such as IP and sender: within one mail central point, the sending IPs of all mails form a set. When measuring the similarity of two IP sets, if their intersection is non-empty (that is, they share an IP), the similarity is defined as 1; otherwise it is 0. Other enumerated variables, such as the sender, are handled in the same way.
2) Short text: the Needleman-Wunsch algorithm is used to determine the optimal alignment of the two sequences. The principle of the algorithm and pseudocode for it can be found at http://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm. The algorithm requires match and mismatch scores for three types of characters (Chinese, English, and wildcard), which can be obtained from rough statistics on the data. After alignment, the common part of the two strings is their template, and the differing parts are represented by wildcards.
3) Long text: the nilsimsa fingerprint technique is used to compare the similarity of the text after denoising. The open-source code at http://ixazon.dynip.com/~cmeclax/nilsimsa.html can be used for the implementation.
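Measures 1) and 2) can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) are placeholders applied uniformly to all characters; the text calls for separate scores for Chinese, English, and wildcard characters estimated from data, which this sketch does not attempt. The function names are hypothetical.

```python
import re


def set_similarity(a, b):
    """Enumerated variables: 1.0 if the two sets share any element, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two strings; return (alignment score, wildcard template)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell; equal characters are kept,
    # everything else (mismatch or gap) becomes a wildcard.
    out = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out.append(a[i - 1] if a[i - 1] == b[j - 1] else "*")
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out.append("*")
            i -= 1
        else:
            out.append("*")
            j -= 1
    template = re.sub(r"\*+", "*", "".join(reversed(out)))  # collapse wildcard runs
    return score[n][m], template
```

The returned template plays the role of the subject or short-text template stored in a central point; the raw score would still need to be normalized (for example by string length) before it is combined with the other per-part measures.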
When a new mail arrives, the anti-spam gateway first uses the online mail classification module to compare it: if some mail in the spam queue has a similarity measure to the new mail below the threshold t, the new mail is judged to be spam and the result is returned. The spam queue is a member of the online mail classification module, and its content is obtained from the mail feature database. The specific algorithm is as follows:
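The original listing is not reproduced in this text. A minimal sketch of the behavior described, under stated assumptions, is given below: the "similarity measure" is treated as a dissimilarity supplied by the caller as `distance`, `fetch_features` is a hypothetical callable standing in for the mail feature database, and the refresh interval and injectable clock are illustrative choices.

```python
import time


class OnlineClassifier:
    """Compares each new mail with a spam queue refreshed at a fixed interval."""

    def __init__(self, fetch_features, distance, t,
                 refresh_seconds=60.0, clock=time.monotonic):
        self.fetch_features = fetch_features  # hypothetical feature-database reader
        self.distance = distance
        self.t = t
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self.spam_queue = list(fetch_features())
        self.last_refresh = clock()

    def classify(self, mail):
        # Pull the latest features once the refresh interval has elapsed.
        now = self.clock()
        if now - self.last_refresh >= self.refresh_seconds:
            self.spam_queue = list(self.fetch_features())
            self.last_refresh = now
        # Spam if any queued spam feature is closer than the threshold t.
        if any(self.distance(mail, feature) < self.t for feature in self.spam_queue):
            return "spam"
        return "ham"
```

Because the queue is held in memory and only refreshed periodically, each classification is a simple scan, which matches the emphasis on real-time response at the cost of possibly missing very recent features.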
When a new spam feature enters the spam feature database, the offline mail classification module compares the newly found spam features with all mail in the buffer queue. If the similarity measure between a buffered mail and a new spam feature is less than the threshold t, that mail is judged to be spam, deleted from the mail queue, and the result is returned. The specific algorithm is as follows:
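The original listing is not reproduced in this text. The following sketch illustrates the described re-check of the buffer under the same assumptions as elsewhere: the "similarity measure" is treated as a caller-supplied dissimilarity `distance`, and the function name `reclassify_buffer` and the use of a `deque` as the buffer queue are hypothetical.

```python
from collections import deque


def reclassify_buffer(buffer, new_features, distance, t):
    """Remove and return buffered mails that match any newly found spam feature.

    `buffer` is a deque of recently delivered mails and is modified in place.
    """
    spam, kept = [], deque()
    while buffer:
        mail = buffer.popleft()
        if any(distance(mail, feature) < t for feature in new_features):
            spam.append(mail)  # to be reported back to the mail transfer agent
        else:
            kept.append(mail)
    buffer.extend(kept)  # mails that did not match stay buffered, order preserved
    return spam
```

The returned list corresponds to the spam list that the offline module sends back so that already delivered mail can be moved.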
The mail distribution server that hosts the mail distribution module is the master server. It maintains the configuration and the processing delay of each server. For each newly arrived mail, the master server polls the delay of each server and distributes the new mail to the server with the minimum delay. Each slave server reports its latest processing delay to the distribution server after it finishes processing a mail.
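The minimum-delay dispatch described above can be sketched in a few lines. The class and method names are hypothetical, and starting every server at a delay of zero is an assumption made so that new servers receive traffic immediately.

```python
class Dispatcher:
    """Master-side view of slave servers and their last reported delays."""

    def __init__(self, servers):
        # Assumption: servers start with zero delay until they report one.
        self.delays = {name: 0.0 for name in servers}

    def report(self, server, delay):
        """Called by a slave server after it finishes processing a mail."""
        self.delays[server] = delay

    def pick(self):
        """Return the server with the smallest reported delay."""
        return min(self.delays, key=self.delays.get)
```

Adding or removing entries in `delays` is all that is needed to grow or shrink the pool, which is in line with the scalability claim made for the distribution server settings.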
Continuing with Fig. 1, the system management module is responsible for setting the various algorithm parameters, distributing configuration files, and monitoring and optimizing server performance.
The administrator interface lets the system administrator manually review and confirm the mail features mined by the system, review some suspicious mail, and set various parameters.
The database interface provides a unified interface and access control for database operations such as accessing and updating mail samples and mail features.
The mail sample database stores the labeled mail obtained from user reports, administrator review, and honeypot mailboxes.
The mail feature database stores the mail features obtained by the mail feature mining module.
In summary, the anti-spam gateway of the present invention consists of the mail system interface, mail distribution module, online/offline mail classification modules, mail sample collection module, mail feature mining module, system management module, administrator interface, database interface, mail sample database, and mail feature database. Together these modules perform three functions: mail classification, feedback collection, and mail feature mining. For mail classification, the gateway obtains information such as mail content and user behavior from the mail transfer agent through the mail system interface, classifies the mail with the online/offline mail classification modules, and returns the mail category to the mail transfer server. For feedback collection, mail samples from user feedback, honeypot mailboxes, and system administrator review results enter the gateway through the mail distributor and the mail sample collection module and become learning samples. For mail feature mining, the gateway uses the mail feature mining module to mine the latest spam features from the feedback samples and distributes them to the online/offline mail classification models.
The anti-spam gateway of the present invention extracts spam/normal mail features from feedback information. Feedback such as users reporting spam, reporting normal mail, and moving mail contains a great deal of useful information, but also a great deal of noise. Removing the noise from the feedback and extracting spam/normal mail features in time is the key to the gateway's self-learning.
The anti-spam gateway of the present invention adopts a spam classification algorithm: using the existing normal/spam features, it classifies the mail assigned by the mail distributor, with three goals: a low false-positive rate, a high spam detection rate, and a fast response.
The anti-spam gateway of the present invention adopts a scheduling algorithm in the mail distributor that distributes the large volume of rapidly arriving online mail to the processors in real time, achieving distributed configuration of the mail processing logic and load balancing across the servers and services.
Fig. 2 is a flow chart of the anti-spam method of the present invention based on large-scale mail content clustering. Fig. 3 is a schematic diagram of the implementation of the feedback acquisition step. Fig. 4 is a schematic diagram of the implementation of the mail feature mining step. Fig. 5 is a schematic diagram of the implementation of the mail classification step.
With reference to Fig. 2, the method comprises the following steps.
S201: Through the mail system interface, online mail is obtained in real time from the mail transfer agent and delivered to the mail distribution module; the classification results of the online mail classification module are returned to the mail transfer agent, and the spam list of the offline mail classification module is returned to the mail transfer agent. With further reference to Fig. 3, in this step normal mail and spam samples can be obtained from three sources: system administrators, users, and honeypots. These mails pass through the mail distribution module and the mail sample collection module and then enter the mail sample database. The mail sample collection module collects samples on the principle of keeping the proportions of the various mail types in the mail sample database balanced.
S202: The mail distribution module forwards online mail requests to the online/offline mail classifiers and passes mail requests fed back through the various channels to the mail sample collection module.
S203: The online mail classification module classifies online mail according to the existing normal-mail and spam features, returns the identification result to the mail transfer agent in real time, and fetches the latest mail features from the mail feature database at a certain time interval. The classification process can be further understood with reference to Fig. 5. The mail transfer agent passes the mail information into the anti-spam gateway through the mail system interface. The mail system interface forwards the mail to the mail distribution module. The mail distribution module distributes the mail to the online mail classification module, the offline mail classification module, and the sample collection module according to the configured policy. The online/offline classification modules classify the mail according to the information in the mail feature database and return the result to the mail transfer agent along the path "mail distribution module, mail system interface, mail transfer agent". The mail distribution module also forwards the mail to the sample collection module, which decides according to the corresponding policy whether to add the mail to the sample database. The online and offline mail classification modules differ in that the online module returns its decision in real time, whereas the offline module returns its decision to the mail transfer agent asynchronously.
S204: The offline mail classification module fetches the latest mail features from the mail feature database at a certain time interval, uses the newly extracted features to classify the cached mail of the preceding period, and returns the classification results to the mail transfer agent.
S205: The mail sample collection module responds to the requests sent by the mail distribution module and obtains the mail sample type and content.
S206: The mail feature mining module obtains mail samples from the mail sample database and mines the features of spam and normal mail from them; the mined features enter the mail feature database after being reviewed by the system administrator. With further reference to Fig. 4, in this mail feature mining step the system first extracts the mail samples of the most recent period from the mail sample database. The mail feature mining module then performs cluster analysis on the samples, and the mined features are added to the mail feature database after review by the system administrator. During cluster analysis, a mail sample is compared with all the center points; if the similarity is less than a certain threshold, the sample is added directly to that center point, where each center point represents one class of samples. To compute the similarity between a mail sample and a center point, both are parsed into several content parts, the two are compared part by part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the center point. When the parts are compared, the similarity of enumerated variables is measured by whether their sets intersect, the similarity of long text and attachments is computed from fingerprints, and the similarity of short text is determined with the Needleman-Wunsch algorithm. The mined mail features are manually reviewed and confirmed, some suspicious mail is reviewed, and the various parameters are set.
S207: The various mail samples are stored in the mail sample database.
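The cluster analysis in the mail feature mining step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field names (`senders`, `subject`, `body`), the weights, the threshold value, the shingle-hash fingerprint, and the Needleman-Wunsch scoring constants are all hypothetical. The text says a sample joins a center point when the compared value is below a threshold; the sketch reads that value as a distance (1 minus the weighted similarity), and it starts a new center when no existing one qualifies. Both readings are assumptions.

```python
import zlib

# Hypothetical per-part weights; the source does not give values.
WEIGHTS = {"senders": 0.2, "subject": 0.3, "body": 0.5}


def enum_similarity(a, b):
    """Enumerated variables: 1.0 if the two value sets intersect, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0


def fingerprint(text, k=5):
    """Assumed fingerprint: CRC32 hashes of all k-character shingles."""
    shingles = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}


def long_text_similarity(a, b):
    """Long text: Jaccard overlap of the two fingerprint sets."""
    fa, fb = fingerprint(a), fingerprint(b)
    return len(fa & fb) / len(fa | fb)


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (row-by-row dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def short_text_similarity(a, b):
    """Short text: alignment score scaled by the longer length, floored at 0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch(a, b) / longest)


def overall_similarity(sample, center):
    """Weighted combination of the per-part similarities."""
    parts = {
        "senders": enum_similarity(sample["senders"], center["senders"]),
        "subject": short_text_similarity(sample["subject"], center["subject"]),
        "body": long_text_similarity(sample["body"], center["body"]),
    }
    return sum(WEIGHTS[name] * score for name, score in parts.items())


def assign(sample, clusters, max_distance=0.4):
    """Add sample to the closest center if near enough, else open a new one."""
    best, best_dist = None, None
    for cluster in clusters:
        dist = 1.0 - overall_similarity(sample, cluster["center"])
        if best_dist is None or dist < best_dist:
            best, best_dist = cluster, dist
    if best is not None and best_dist < max_distance:
        best["members"].append(sample)
        return best
    new_cluster = {"center": sample, "members": [sample]}
    clusters.append(new_cluster)
    return new_cluster
```

Keeping one comparison function per kind of part makes it possible to add further parts, such as attachments compared by fingerprint, by extending `WEIGHTS` and `overall_similarity` without touching the assignment loop.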
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. An anti-spam gateway system, comprising:
a mail system interface, for obtaining online mail in real time from a mail transfer agent and delivering the mail to a mail distribution module, returning the mail classification results of an online mail classification module to the mail transfer agent, and returning the spam list of an offline mail classification module to the mail transfer agent;
the mail distribution module, for forwarding online mail requests to the online/offline mail classifiers and passing mail requests fed back through various channels to a mail sample collection module;
the online mail classification module, for classifying online mail according to existing normal-mail and spam features, returning the identification result to the mail transfer agent in real time, and obtaining the latest mail features from a mail feature database at a certain time interval;
the offline mail classification module, for obtaining the latest mail features from the mail feature database at a certain time interval, using the newly extracted mail features to classify the cached mail of the preceding period, and returning the classification results to the mail transfer agent;
the mail sample collection module, for responding to requests sent by the mail distribution module and obtaining the mail sample type and content;
a mail feature mining module, for obtaining mail samples from a mail sample database, mining the features of spam and normal mail from them, and entering the mined mail features into the mail feature database after review by a system administrator; and further for obtaining a mail sample from the mail sample database and comparing the mail sample with all center points, wherein, if the similarity is less than a certain threshold, the sample is added directly to that center point, each center point representing one class of samples; wherein, when the similarity between the mail sample and a center point is computed, the mail sample and the center point are each parsed into a plurality of content parts, the two are compared for each part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the center point; wherein, when a new mail sample arrives, the mail sample is compared with all center points and, if the similarity is less than a certain threshold, the sample is added directly to that center point; and wherein, after clustering, if the number of samples in a class exceeds one threshold and the proportion of normal mail samples is less than another threshold, the center of that class is extracted and reported as a spam sample;
the mail sample database, for storing the various mail samples.
2. The mail gateway system as claimed in claim 1, characterized in that, when the similarity between the mail sample and the center point is compared for each part, the similarity of enumerated variables is measured by whether their sets intersect, the similarity of long text and attachments is computed from fingerprints, and the similarity of short text is determined with the Needleman-Wunsch algorithm.
3. The mail gateway system as claimed in claim 1, characterized in that the system further comprises:
an administrator interface, through which the system administrator manually reviews and confirms the mail features mined by the gateway system, reviews some suspicious mail, and sets the various parameters.
4. An anti-spam method, comprising the steps of:
obtaining, through a mail system interface, online mail in real time from a mail transfer agent and delivering the mail to a mail distribution module, returning the mail classification results of an online mail classification module to the mail transfer agent, and returning the spam list of an offline mail classification module to the mail transfer agent;
forwarding, by the mail distribution module, online mail requests to the online/offline mail classifiers, and passing mail requests fed back through various channels to a mail sample collection module;
classifying, by the online mail classification module, online mail according to existing normal-mail and spam features, returning the identification result to the mail transfer agent in real time, and obtaining the latest mail features from a mail feature database at a certain time interval;
obtaining, by the offline mail classification module, the latest mail features from the mail feature database at a certain time interval, using the newly extracted mail features to classify the cached mail of the preceding period, and returning the classification results to the mail transfer agent;
responding, by the mail sample collection module, to requests sent by the mail distribution module, and obtaining the mail sample type and content;
obtaining, by a mail feature mining module, mail samples from a mail sample database, mining the features of spam and normal mail from them, and entering the mined mail features into the mail feature database after review by a system administrator; and further obtaining, by the mail feature mining module, a mail sample from the mail sample database and comparing the mail sample with all center points, wherein, if the similarity is less than a certain threshold, the sample is added directly to that center point, each center point representing one class of samples; wherein, when the similarity between the mail sample and a center point is computed, the mail sample and the center point are each parsed into a plurality of content parts, the two are compared for each part, and the per-part similarities are combined by weighting to obtain the overall similarity between the mail sample and the center point; wherein, when a new mail sample arrives, the mail sample is compared with all center points and, if the similarity is less than a certain threshold, the sample is added directly to that center point; and wherein, after clustering, if the number of samples in a class exceeds one threshold and the proportion of normal mail samples is less than another threshold, the center of that class is extracted and reported as a spam sample;
storing the various mail samples in the mail sample database.
5. The method as claimed in claim 4, characterized in that, when the similarity between the mail sample and the center point is compared for each part, the similarity of enumerated variables is measured by whether their sets intersect, the similarity of long text and attachments is computed from fingerprints, and the similarity of short text is determined with the Needleman-Wunsch algorithm.
6. The method as claimed in claim 4, characterized in that it further comprises:
manually reviewing and confirming the mined mail features, reviewing some suspicious mail, and setting the various parameters.
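The reporting condition at the end of claims 1 and 4 (a class is reported as spam when it is large enough and contains few normal samples) can be sketched as below. The cluster layout, the `label` field, and both threshold values are hypothetical; the claims do not state concrete numbers.

```python
def spam_centers(clusters, min_size=50, max_normal_ratio=0.05):
    """Return the centers of clusters that qualify as spam classes.

    A cluster qualifies when its sample count exceeds min_size and the
    share of samples labelled "normal" is below max_normal_ratio.
    Both thresholds are illustrative assumptions.
    """
    reported = []
    for cluster in clusters:
        members = cluster["members"]
        if len(members) <= min_size:
            continue  # class too small to report
        normal = sum(1 for m in members if m["label"] == "normal")
        if normal / len(members) < max_normal_ratio:
            reported.append(cluster["center"])
    return reported
```

Requiring both conditions means that a large class made up mostly of normal mail, such as a legitimate newsletter, is not reported, and neither is a small class of spam that has not yet grown past the size threshold.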
CN201110304470.3A 2011-10-10 2011-10-10 Anti-spam gateway system and method Active CN102377690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110304470.3A CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110304470.3A CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Publications (2)

Publication Number Publication Date
CN102377690A CN102377690A (en) 2012-03-14
CN102377690B true CN102377690B (en) 2014-09-17

Family

ID=45795681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110304470.3A Active CN102377690B (en) 2011-10-10 2011-10-10 Anti-spam gateway system and method

Country Status (1)

Country Link
CN (1) CN102377690B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103441924B (en) * 2013-09-03 2016-06-08 盈世信息科技(北京)有限公司 A kind of rubbish mail filtering method based on short text and device
CN103744888A (en) * 2013-12-23 2014-04-23 新浪网技术(中国)有限公司 Method and system for anti-spam gateway to query database
CN103841006A (en) * 2014-02-25 2014-06-04 汉柏科技有限公司 Method and device for intercepting junk mails in cloud computing system
CN104796318A (en) * 2014-07-30 2015-07-22 北京中科同向信息技术有限公司 Behavior pattern identification technology
CN108197638B (en) * 2017-12-12 2020-03-20 阿里巴巴集团控股有限公司 Method and device for classifying sample to be evaluated
CN108737255B (en) * 2018-05-31 2020-07-10 北京明朝万达科技股份有限公司 Load balancing method, load balancing device and server
CN112579733B (en) * 2019-09-30 2023-10-20 华为技术有限公司 Rule matching method, rule matching device, storage medium and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716293A (en) * 2004-06-29 2006-01-04 微软公司 Incremental Antispam Lookup and Update Service
GB2425855A (en) * 2005-04-25 2006-11-08 Messagelabs Ltd Detecting and filtering of spam emails
CN101094197A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and mail server of anti garbage mail
CN101136874A (en) * 2007-07-25 2008-03-05 华南理工大学 Anti-spam false filtering method and system based on comprehensive decision
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Chinese Spam Filtering Method Based on Logistic Regression
CN101295381A (en) * 2008-06-25 2008-10-29 北京大学 A spam detection method
CN101299729A (en) * 2008-06-25 2008-11-05 哈尔滨工程大学 Method for judging rubbish mail based on topological action
CN101415159A (en) * 2008-12-02 2009-04-22 腾讯科技(深圳)有限公司 Method and apparatus for intercepting junk mail
CN101588558A (en) * 2009-03-30 2009-11-25 网易(杭州)网络有限公司 Spam filtering method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1696943A (en) * 2004-05-13 2005-11-16 上海极软软件技术有限公司 Self-adaptive method for filtering out garbage E-mails safely
CN101083630A (en) * 2006-06-01 2007-12-05 珠海金山软件股份有限公司 Anti-rubbish E-mail system and method
CN101119341B (en) * 2007-09-20 2011-02-16 腾讯科技(深圳)有限公司 Mail identifying method and apparatus
CN102075447B (en) * 2009-11-25 2015-08-12 中兴通讯股份有限公司 The method and system of anti-rubbish mail

Also Published As

Publication number Publication date
CN102377690A (en) 2012-03-14

Similar Documents

Publication Publication Date Title
CN102377690B (en) Anti-spam gateway system and method
KR101117866B1 (en) Intelligent quarantining for spam prevention
US6928465B2 (en) Redundant email address detection and capture system
US7930353B2 (en) Trees of classifiers for detecting email spam
Toolan et al. Feature selection for spam and phishing detection
CN101674264B (en) Spam detection device and method based on user relationship mining and credit evaluation
CN102413076A (en) Spam mail judging system based on behavior analysis
US20060036693A1 (en) Spam filtering with probabilistic secure hashes
CN101087259A (en) A system for filtering spam in Internet and its implementation method
EP2649535A2 (en) Electronic communications triage
CN101637002A (en) A method and system for collecting addresses for remotely accessible information sources
CN102124485B (en) Apparatus, and associated method, for detecting fraudulent text message
CN104040963A (en) System and methods for spam detection using frequency spectra of character strings
Bhat et al. Classification of email using BeaKS: Behavior and keyword stemming
CN103595614A (en) User feedback based junk mail detection method
Mishra et al. Analysis of random forest and Naive Bayes for spam mail using feature selection catagorization
JP2009104400A (en) E-mail filtering device, e-mail filtering method and program
US8819142B1 (en) Method for reclassifying a spam-filtered email message
CN118250248A (en) Processing method, device, equipment and medium for sending mails in batches
Chien et al. Email Feature Classification and Analysis of Phishing Email Detection Using Machine Learning Techniques
Gonzalez-Talavan A simple, configurable SMTP anti-spam filter: Greylists
Daisy et al. Email spam behavioral sieving technique using hybrid algorithm
KR20140127036A (en) Server and method for spam filtering
Sajwan et al. Email spam filteration with machine learning
Revathi et al. Email Spam Detection Using Naive Bayes Algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant