CN102833085B - Communication network message classification system and method based on massive user behavior data - Google Patents
- Publication number
- CN102833085B
- Authority
- CN
- China
- Prior art keywords
- message
- data
- classification model
- classification algorithm
- communication network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a communication network message classification system and method based on massive user behavior data. The system comprises a user data acquisition system that transfers the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and transfers it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model used for comparison with messages. The system and method can accurately identify all types of messages, meet the fine-grained data requirements of message analysis, and, through message classification, enable detailed analysis of user behavior data, including users' access and search data.
Description
Technical field
The present invention relates to the analysis of communication network messages produced when massive numbers of users access the network through various network devices and terminals. Message features are derived from user behavior, and data mining and machine learning techniques are used to correctly classify and predict communication network messages. In particular, the invention concerns a communication network message classification system and method based on massive user behavior data.
Background art
Most traditional message classification relies on rule-based systems: the keywords that occur in different messages are counted to form a rule base, and when a new message arrives it is matched against the rule base to obtain its approximate category.
The shortcomings of this approach are clear: (1) because the number of messages is very large, a highly accurate rule base cannot be obtained; (2) rules may be repeated across different rule bases, so the matching strategy may yield inaccurate classifications; (3) when the message volume is huge, the matching strategy cannot meet timeliness requirements.
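Shortcoming (2) can be illustrated with a minimal sketch of keyword matching. The rule base, keywords, category names, and URL below are hypothetical and are not taken from the patent; the point is only that a keyword shared by two rule sets makes the match ambiguous.

```python
# Hypothetical rule base: category -> keywords counted from earlier messages.
# "click" appears in two rule sets, mirroring the repeated-rule problem.
RULE_BASE = {
    "search": ["search", "click"],
    "advertisement": ["ad", "click"],
}

def match_rules(url):
    """Return every category whose rule set has a keyword occurring in the URL."""
    return [
        category
        for category, keywords in RULE_BASE.items()
        if any(keyword in url for keyword in keywords)
    ]

# One message matches two categories, so the rule base cannot decide.
hits = match_rules("http://example.com/click?q=shoes")
print(hits)
```

Matching also scans every rule for every message, which is why the cost grows with both the rule base and the message volume.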
Summary of the invention
The object of the present invention is to provide a communication network message classification system and method based on massive user behavior data. The system and method can accurately identify all types of messages, meet the fine-grained data requirements of message analysis, and, through message classification, enable detailed analysis of user behavior data, including users' access and search data.
The technical solution of the present invention is as follows:
A communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transfers the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and transfers it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model used for comparison with messages.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives the data of the training dataset, and the classification model also receives the validation data of the evaluation dataset.
A communication network message classification method based on massive user behavior data implements message classification through the following steps:
(1) Import the information in the user data acquisition module into the data cleaning module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) At the same time, manually label the category of each communication network message to establish a training dataset and an evaluation dataset. The feature matrix generated from the training dataset is also input to the classification algorithm module, which learns the classification model for messages from the training dataset. The feature matrix produced from the evaluation dataset is input to the intermediate classification model, the model output is checked against the manual labels, and the correctness of the model is judged by the resulting accuracy rate and recall rate;
(3) Feed the parameters obtained after validating the classification model back to the classification algorithm module, and continuously optimize the module to improve the robustness of the system under real, complex conditions and the precision of the model;
(4) Establish the final model and output it through the model output module so that it can be applied to new messages to predict the category of communication network messages.
The manually assigned network message category labels include search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
The technical effect of the present invention is as follows:
Communication network messages contain a large number of widely varying message types. To analyze and mine these messages in depth, each type of message must be identified correctly. Because the data volume is huge, completing this task within the target time and at the target accuracy becomes very difficult. The present invention carefully analyzes communication network messages and extracts message features according to user behavior. It then uses data mining and machine learning techniques to build a complete system that accurately identifies all types of messages, covering the entire flow from raw message collection to final online use, and thereby ensures that messages are accurately identified within the target time.
Brief description of the drawings
Fig. 1 is a flow chart of the communication network message classification system and method based on massive user behavior data according to the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, a communication network message classification system based on massive user behavior data comprises a user data acquisition system. The user data acquisition system transfers the collected data to a data cleaning module. The data cleaning module cleans the data, extracts message features, generates a feature matrix, and transfers it to a classification algorithm module. The classification algorithm module and the classification model exchange data with each other, and the classification model outputs, through a model output module, the final model used for comparison with messages.
The user data acquisition module stores the data collected from the network in a user data storage system.
The classification algorithm module also receives the data of the training dataset, and the classification model also receives the validation data of the evaluation dataset.
A communication network message classification method based on massive user behavior data implements message classification through the following steps:
(1) Import the information in the user data acquisition module into the data cleaning module to clean the user data, extract the features of the users' communication network messages, generate a feature matrix, and import it into the classification algorithm module to generate a classification model;
(2) At the same time, manually label the category of each communication network message to establish a training dataset and an evaluation dataset. The feature matrix generated from the training dataset is also input to the classification algorithm module, which learns the classification model for messages from the training dataset. The feature matrix produced from the evaluation dataset is input to the intermediate classification model, the model output is checked against the manual labels, and the correctness of the model is judged by the resulting accuracy rate and recall rate;
(3) Feed the parameters obtained after validating the classification model back to the classification algorithm module, and continuously optimize the module to improve the robustness of the system under real, complex conditions and the precision of the model;
(4) Establish the final model and output it through the model output module so that it can be applied to new messages to predict the category of communication network messages.
The manually assigned network message category labels include search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
The user data acquisition module collects user behavior data and stores the information in the user data storage system.
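The check in step (2), which compares model output with manual labels, can be sketched as a per-category computation. This is only an illustration: it assumes that the "accuracy rate" in the text corresponds to per-category precision, and the labels and predictions below are invented.

```python
def precision_recall(predicted, labeled, category):
    """Compare model output with manual labels for one message category."""
    pairs = list(zip(predicted, labeled))
    tp = sum(1 for p, y in pairs if p == category and y == category)
    fp = sum(1 for p, y in pairs if p == category and y != category)
    fn = sum(1 for p, y in pairs if p != category and y == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented evaluation data: manual labels versus model output.
labeled = ["ad", "ad", "search", "browse", "ad"]
predicted = ["ad", "search", "search", "ad", "ad"]
p, r = precision_recall(predicted, labeled, "ad")
print(f"precision={p:.2f} recall={r:.2f}")
```

Reporting both numbers per category shows whether a model over-predicts a category (low precision) or misses its members (low recall).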
Optimization process of the classification algorithm module: the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling, and generates the classification model. The classification model receives the validation message classification feature matrix generated from the manually entered evaluation dataset, and then feeds the validated data back to the classification algorithm module, so that the module can be optimized and subsequently classify more accurately.
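One way to picture this feedback loop is parameter selection against the evaluation dataset. The patent does not name a specific classification algorithm, so the sketch below substitutes a tiny k-nearest-neighbour classifier purely as a stand-in; the data, the parameter `k`, and all function names are assumptions.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(
        train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(train, evaluation, k):
    """Fraction of evaluation samples whose prediction matches the manual label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return hits / len(evaluation)

def optimize(train, evaluation, candidate_ks):
    """Feed evaluation results back and keep the best-scoring parameter."""
    best_k, best_score = None, -1.0
    for k in candidate_ks:
        score = evaluate(train, evaluation, k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Invented two-feature samples with manual labels.
train = [((0.0, 0.0), "browse"), ((0.1, 0.2), "browse"), ((0.2, 0.1), "browse"),
         ((1.0, 1.0), "ad"), ((0.9, 0.8), "ad")]
evaluation = [((0.05, 0.1), "browse"), ((0.95, 0.9), "ad")]
best_k, best_score = optimize(train, evaluation, [1, 3, 5])
print(best_k, best_score)
```

In a real deployment the loop would repeat as new labeled data arrives, and the score would more likely be the precision and recall described in step (2) than plain accuracy.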
The role of the cleaning module is to remove noise from the data, in two parts: (1) removing unnecessary samples; (2) removing noisy information within individual samples.
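The two cleaning parts can be sketched as follows. The record fields (`user_id`, `url`) and the choice of what counts as noise (records without a URL, surrounding whitespace, URL fragments) are assumptions made for illustration; the patent does not specify them.

```python
def clean(samples):
    """Drop unusable samples, then strip noisy parts from the remaining ones."""
    cleaned = []
    for sample in samples:
        url = sample.get("url", "").strip()
        if not url:
            # Part (1): remove unnecessary samples (here: records without a URL).
            continue
        # Part (2): remove noise inside a sample (here: the URL fragment).
        url = url.split("#", 1)[0]
        cleaned.append({**sample, "url": url})
    return cleaned

# Hypothetical raw records from the acquisition module.
samples = [
    {"user_id": 1, "url": " http://a.example/page#top "},
    {"user_id": 2, "url": ""},
    {"user_id": 3},
]
cleaned = clean(samples)
print(cleaned)
```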
The training dataset comprises two parts: one is the manually labeled network message categories, and the other is the feature vectors representing the network messages, which are generally represented as sparse vectors. To meet the requirements of a specific classification algorithm, the corresponding format conversion can be performed.
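As an illustration of the sparse representation and a format conversion, the sketch below stores only non-zero features and serializes a labeled sample into the LIBSVM text format (`label index:value ...` with 1-based ascending indices). The choice of LIBSVM as the target format is an assumption; the patent does not say which formats were used.

```python
def to_sparse(dense):
    """Keep only non-zero entries, keyed by 0-based feature index."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_libsvm_line(label, sparse):
    """Serialize one labeled sparse vector as a LIBSVM-format text line."""
    features = " ".join(f"{i + 1}:{v}" for i, v in sorted(sparse.items()))
    return f"{label} {features}"

# Category 3 is a hypothetical numeric code for a manually labeled category.
sparse = to_sparse([0, 1, 0, 0, 0.25])
line = to_libsvm_line(3, sparse)
print(line)
```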
Features are information that can distinguish the various types of messages; they are derived through manual analysis and statistics. For example, the features of an advertisement URL can consist of three parts: (1) the URL contains particular keywords, such as alimama, doubleclick, or ad; (2) the URL is generally a leaf node of the user's access tree; (3) the proportion of direct user input is generally small.
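The three advertisement URL features could be computed roughly as below. The visit log layout (pairs of referrer and visited URL, with `None` meaning the user typed the URL directly) and the plain substring test for keywords are assumptions for illustration only.

```python
AD_KEYWORDS = ("alimama", "doubleclick", "ad")

def ad_url_features(url, visits):
    """visits: (referrer_or_None, visited_url) pairs; None means direct input."""
    # (1) Keyword feature. Substring matching is crude: "ad" also hits "download".
    has_keyword = int(any(keyword in url.lower() for keyword in AD_KEYWORDS))
    # (2) Leaf feature: the URL never acts as a referrer for another visit.
    is_leaf = int(all(referrer != url for referrer, _ in visits))
    # (3) Share of visits to this URL that were typed in directly.
    referrers = [referrer for referrer, target in visits if target == url]
    if referrers:
        direct_ratio = sum(r is None for r in referrers) / len(referrers)
    else:
        direct_ratio = 0.0
    return [has_keyword, is_leaf, direct_ratio]

# Hypothetical access log.
visits = [
    (None, "http://news.example/"),
    ("http://news.example/", "http://doubleclick.example/banner"),
    ("http://news.example/", "http://news.example/story"),
    (None, "http://news.example/"),
]
ad_row = ad_url_features("http://doubleclick.example/banner", visits)
home_row = ad_url_features("http://news.example/", visits)
print(ad_row, home_row)
```

Features (2) and (3) come from user behavior rather than message content, which is what distinguishes this approach from the keyword rule base described in the background.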
The feature matrix is the matrix formed by the feature values of each sample.
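In code, such a matrix has one row per sample and one column per feature. The extractor functions below are hypothetical placeholders, not features named in the patent.

```python
def feature_matrix(samples, extractors):
    """Build one row per sample and one column per feature extractor."""
    return [[extract(sample) for extract in extractors] for sample in samples]

# Hypothetical extractors over URL strings.
extractors = [
    len,                         # URL length
    lambda url: int("?" in url), # has a query string
    lambda url: url.count("/"),  # number of slashes
]
samples = ["http://a.example/x?q=1", "http://b.example/"]
matrix = feature_matrix(samples, extractors)
print(matrix)
```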
The performance of a classification system is evaluated in two respects: model precision and algorithm efficiency. The key factor affecting model precision is the adequacy of the features, including their strength and number. Building on an in-depth analysis of massive communication network messages, the present invention classifies messages in detail according to user behavior and carefully extracts the features of each type of message, thereby ensuring the precision of the model and the accuracy of prediction. In addition, algorithm efficiency has been extensively optimized, which ensures that massive data can be processed in a timely manner.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A communication network message classification system based on massive user behavior data, characterized in that: it comprises a user data acquisition system; the user data acquisition system transfers the collected data to a data cleaning module; the data cleaning module cleans the data, extracts message features, generates a feature matrix, and transfers it to a classification algorithm module; the classification algorithm module and a classification model exchange data with each other; at the same time, the category of each communication network message is labeled manually to establish a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is also input to the classification algorithm module, and the classification algorithm module learns the classification model for messages from the training dataset; the feature matrix produced from the evaluation dataset is input to the intermediate classification model, the output of the classification model is checked against the manual labels, and the correctness of the classification model is judged by the resulting accuracy rate and recall rate; the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates the classification model; the classification model receives the validation message classification feature matrix generated from the manually entered evaluation dataset, and then feeds the validated data back to the classification algorithm module; and the classification model outputs, through a model output module, the final model used for comparison with messages.
2. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the user data acquisition system stores the data collected from the network in a user data storage system.
3. The communication network message classification system based on massive user behavior data according to claim 1, characterized in that: the classification algorithm module also receives the data of the manually entered training dataset, and the classification model also receives the validation data of the evaluation dataset.
4. A communication network message classification method based on massive user behavior data, characterized in that message classification is implemented through the following steps:
(1) importing the information in a user data acquisition system into a data cleaning module to clean the user data, extracting the features of the users' communication network messages, generating a feature matrix, and importing it into a classification algorithm module to generate a classification model;
(2) at the same time, manually labeling the category of each communication network message to establish a training dataset and an evaluation dataset; the feature matrix generated from the training dataset is also input to the classification algorithm module, and the classification algorithm module learns the classification model for messages from the training dataset; the feature matrix produced from the evaluation dataset is input to the intermediate classification model, the output of the classification model is checked against the manual labels, and the correctness of the classification model is judged by the resulting accuracy rate and recall rate;
(3) feeding the parameters obtained after validating the classification model back to the classification algorithm module, and continuously optimizing the classification algorithm module to improve the robustness of the system under real, complex conditions and the precision of the model; the optimization process of the classification algorithm module is as follows: the classification algorithm module receives the message classification feature matrices generated by computer and by manual labeling and generates the classification model, the classification model receives the validation message classification feature matrix generated from the manually entered evaluation dataset, and the classification model then feeds the validated data back to the classification algorithm module;
(4) establishing the final model and outputting it through the classification model output module so that it can be applied to new messages to predict the category of communication network messages.
5. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the manually assigned communication network message category labels include search engine messages, web page browsing messages, resource download page messages, and advertising material messages.
6. The communication network message classification method based on massive user behavior data according to claim 4, characterized in that: the user data acquisition system collects user behavior data and stores the information in a user data storage system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110162097.2A CN102833085B (en) | 2011-06-16 | 2011-06-16 | Communication network message classification system and method based on massive user behavior data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102833085A | 2012-12-19 |
CN102833085B | 2015-09-16 |
Family
ID=47336064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110162097.2A Expired - Fee Related CN102833085B (en) | Communication network message classification system and method based on massive user behavior data | 2011-06-16 | 2011-06-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102833085B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C56 | Change in the name or address of the patentee | ||
CP01 | Change in the name or title of a patent holder | Address after: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B. Patentee after: Izp (China) Network Technology Co. Ltd. Address before: 100081, Beijing, Zhongguancun, Haidian District South Avenue, No. 18, International Building, Beijing, block 18, B. Patentee before: Beijing IZP Technologies Co., Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20150916. Termination date: 20160616 |