Microblog blogger community classification method based on interest analysis
Technical field
The invention belongs to the field of community classification techniques, and specifically relates to a microblog blogger community classification method based on label extraction.
Background technology
With the rapid development of information technology and networks, people can choose from more and more Web 2.0 services. Among them, microblog websites emerged to meet people's need to share information instantly and to communicate with others. A microblog is an information sharing and dissemination platform based on users following one another and forwarding posts. Its bloggers can publish posts of up to 140 characters through many channels, such as computers, mobile phones, IM software and external API interfaces, largely free of constraints of time and place. Information is updated conveniently and quickly, which gives microblogs a high degree of immediacy and strong interactivity. Compared with traditional blogs, the main characteristics of microblogs are "short, nimble and fast", which better suits the fast pace of modern life. As a result, microblogs have quickly become popular worldwide thanks to their speed and convenience. At present, the number of microblog users in China exceeds 300 million.
In the Web 2.0 era, users are not only viewers of website content but also its creators. Users are the soul of the network, so analyzing user characteristics has become a focus of network development. On a microblog platform, which is built on information published by bloggers, bloggers' behavioral characteristics directly affect the development of the platform, which makes analysis of bloggers' interests particularly important.
Apart from some "lurkers" who basically never post or seldom forward, bloggers mainly use microblogs to publish their status, mood, events and similar information anytime and anywhere. A blogger's habitual expressions, points of interest and personality traits can be seen from the posts the blogger publishes. Analysis of a blogger's published content can therefore reveal, to a large extent, the blogger's current interests. The blogger can be given personalized labels and then classified by label, which supports personalized services for the blogger in the future (for example, recommending similar posts or bloggers with similar interests).
Existing microblog research already includes many theories on classifying bloggers. A prominent one divides bloggers into three classes: mass media, celebrities and grassroots users. Bloggers can also be classified by their professional field. Such classifications, however, are far from sufficient for studying individual bloggers. They are only rough divisions, and bloggers within each class differ greatly. The ordinary bloggers who make up most of the grassroots class cannot be divided well. On domestic microblog platforms, although bloggers can choose to join different groups or micro-groups to communicate with like-minded bloggers, the intrinsic differences among many bloggers are still ignored. Bloggers can only classify themselves actively and may well overlook many of their own characteristics, so a more objective and complete partitioning mechanism is lacking.
The present invention proposes a microblog blogger community classification method based on interest analysis, which classifies bloggers objectively and directly according to the content of their posts. The invention uses suitable API access techniques to capture microblog information, analyzes the posts published by each blogger, extracts from them a number of manual tags that suit the blogger (to avoid repetition, "label" is used below in place of "manual tag"), and classifies bloggers according to their labels. The invention provides a new method for classifying microblog bloggers and a reference for later recommendations to them. For example, if most of a blogger's labels belong to a certain class, other labels in that class can be recommended to the blogger; and two bloggers whose labels mostly fall in the same few classes can be recommended to each other as friends.
Summary of the invention
The present invention uses existing open microblog platforms. Through open APIs it accesses and captures the content of microblog bloggers over a specific time period, including published posts, comments and forwards, performs text analysis on this content, and extracts a number of suitable labels, thereby finally achieving feature-based classification of microblog bloggers.
At present, all domestic microblog platforms are open, and these open platforms have successively published APIs that can be used with them. These APIs provide the basis and the means for capturing microblog data. As shown in Fig. 2, the overall design framework of the microblog data capture program mainly comprises: the microblog open platform, OAuth authentication and authorization, API access, the API source program, the access queue control program, the storage control program and a SQL Server database.
After the data are obtained, the microblog content is preprocessed; the posts must also be segmented into words and have their stop words removed.
Labels are then extracted from the microblog data with stop words removed. There are two types of label:
1. keywords extracted from the blogger's microblog content, such as hobbies and living habits;
2. common words extracted from the blogger's microblog content, such as pet phrases and habitual expressions.
The existing TF-IDF method can be used for keyword extraction. The method is as follows:
1. TF-IDF is a statistical method used mainly to evaluate the importance of a word to a document in a document collection or corpus. In the present invention it is used to assess the importance of the words in a post; the keywords of the post are extracted by ranking the words by importance.
2. TF (Term Frequency) represents the relevance of a word to a document. Here it is the frequency with which a word appears in a post. The formula is N/Nt, where N is the number of times the word appears in the post and Nt is the total number of words in the post.
3. IDF (Inverse Document Frequency) represents how much weight a word carries in representing the topic of a document. It is obtained mainly by comparing the number of posts that contain the word with the total number of posts: the more posts a word appears in, the smaller its weight. The formula is -log(D/Dt), where D is the number of this blogger's posts that contain the word and Dt is the blogger's total number of posts.
4. Finally, the value TF*IDF of each word is taken as its TF-IDF score, and the words are sorted in descending order of score and taken as the blogger's keywords.
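The TF and IDF formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`tf`, `idf`, `keywords`), the `top_k` parameter and the tie-breaking by alphabetical order are hypothetical and do not come from the text, and each post is assumed to be already segmented into a list of words.

```python
import math
from collections import Counter


def tf(words):
    """TF = N / Nt for every word of one segmented post."""
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()}


def idf(posts):
    """IDF = -log(D / Dt) over all posts of one blogger."""
    total_posts = len(posts)
    doc_freq = Counter()
    for words in posts:
        doc_freq.update(set(words))  # count each word once per post
    return {w: -math.log(d / total_posts) for w, d in doc_freq.items()}


def keywords(posts, top_k=3):
    """Rank the words of each post by TF*IDF, highest score first."""
    idf_scores = idf(posts)
    ranked = []
    for words in posts:
        scores = {w: t * idf_scores[w] for w, t in tf(words).items()}
        # Ties are broken alphabetically (an assumption of this sketch).
        ranked.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return ranked
```

Note that a word appearing in every one of a blogger's posts has D = Dt, so its IDF, and therefore its TF-IDF score, is zero.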
The common word extraction method is as follows:
1. All content words, auxiliary words and interjections in a preprocessed post are extracted. These include words that reflect the blogger's writing habits and personality traits.
2. These words are deduplicated and are likewise taken as the blogger's keywords.
The keywords and common words above are combined as each blogger's keywords. The number of times each keyword is used in the posts the blogger published in the most recent 30 days (excluding forwarded posts) is counted and sorted in descending order, and the top L keywords are taken as the blogger's labels (if there are fewer than L, that number is used directly; for convenience it is still denoted L).
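A minimal sketch of this label selection step follows. The tuple layout `(posted_at, words, is_repost)`, the function name `top_labels` and the alphabetical tie-breaking are assumptions made for the example; keywords that are never used inside the 30-day window are simply left out, which is one possible reading of "if there are fewer than L".

```python
from collections import Counter
from datetime import datetime, timedelta


def top_labels(posts, keywords, now, L=30, days=30):
    """Pick the L most used keywords from the blogger's recent original posts.

    posts: list of (posted_at, words, is_repost) tuples (hypothetical layout).
    """
    cutoff = now - timedelta(days=days)
    keyword_set = set(keywords)
    usage = Counter()
    for posted_at, words, is_repost in posts:
        # Skip forwarded posts and posts older than the time window.
        if is_repost or posted_at < cutoff:
            continue
        usage.update(w for w in words if w in keyword_set)
    ranked = sorted(usage, key=lambda w: (-usage[w], w))
    return ranked[:L]  # may hold fewer than L labels
```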
Each blogger thus obtains L labels. For all U bloggers (that is, all selected bloggers), the label sets are gathered into the total label collection of all bloggers, and the following algorithm is used to classify the U bloggers into communities. Two concepts are defined first:
1. Co-occurrence: when two labels appear together in the label set of the same blogger, the two labels are said to co-occur once.
2. Classification degree: the ratio of the number of nodes already assigned to a class to the total number of nodes.
The algorithm is a network analysis based on label co-occurrence:
1. The L-label sets of all bloggers obtained above are gathered, the total label collection is deduplicated, and the frequency of each label is counted and sorted in descending order. Frequency here means the number of different bloggers for whom the label appears. A frequency threshold of 3 is set, and labels with frequency 1 or 2 are removed. The rationale for this processing is that the higher a label's frequency, the more labels co-occur with it, and vice versa. Removing labels with frequency 1 or 2 therefore allows the remaining labels to be analyzed more clearly.
2. The data are processed to obtain the number of co-occurrences between every pair of labels; the maximum co-occurrence count is denoted C.
3. The labels are placed in a network graph for analysis, with each label as a node. In the graph, every pair of nodes whose co-occurrence value is C is first joined by an undirected edge, then every pair whose co-occurrence value is C-1, and so on. This step stops when the classification degree reaches 90%.
4. Isolated nodes and components consisting of only two connected nodes are removed from the network graph, and each remaining connected component is taken as one class.
5. Once the label classes have been obtained, bloggers can be analyzed. A blogger's main characteristics can be judged from which classes the blogger's L labels mainly belong to.
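Steps 1 to 4 of the co-occurrence analysis can be sketched as below. This is an illustration and not the patented implementation: the function name `classify_labels` and its parameters are hypothetical, connected components are tracked with a union-find structure chosen for brevity, and a node is counted as "assigned to a class" as soon as it has at least one edge, which is one interpretation of the classification degree.

```python
from collections import Counter
from itertools import combinations


def classify_labels(blogger_labels, min_freq=3, target=0.9, min_size=3):
    """Group labels into classes from a dict mapping blogger -> set of labels."""
    # Step 1: frequency = number of different bloggers that carry the label.
    freq = Counter(l for labels in blogger_labels.values() for l in set(labels))
    kept = {l for l, f in freq.items() if f >= min_freq}
    if not kept:
        return []

    # Step 2: co-occurrence count for every pair of kept labels.
    co = Counter()
    for labels in blogger_labels.values():
        for pair in combinations(sorted(set(labels) & kept), 2):
            co[pair] += 1

    # Step 3: connect pairs in descending co-occurrence order (union-find),
    # stopping once enough nodes have at least one connection.
    parent = {l: l for l in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    connected = set()
    for value in sorted(set(co.values()), reverse=True):
        for (a, b), count in co.items():
            if count == value:
                parent[find(a)] = find(b)
                connected.update((a, b))
        if len(connected) / len(kept) >= target:
            break

    # Step 4: drop isolated nodes and two-node components.
    groups = {}
    for l in kept:
        groups.setdefault(find(l), set()).add(l)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)
```

With the defaults, `min_freq=3` reproduces the frequency threshold, `target=0.9` the 90% classification degree, and `min_size=3` the removal of isolated nodes and two-node components.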
After the label classes are obtained, other labels in a class can be recommended to a blogger, or bloggers with similar labels can be recommended as friends. The specific rules are as follows:
Rule 1: for a specific blogger who has been given L labels, first recommend the label that has the largest value in the frequent set (that is, the co-occurrence value) and is not in the blogger's label set, then recommend further labels in decreasing order of that value;
Rule 2: for a specific blogger who has been given L labels, determine the class to which most of the blogger's labels belong, and recommend the other labels in that class to the blogger;
Rule 3: for a specific blogger who has been given L labels, determine the class to which most of the blogger's labels belong, and recommend as friends those bloggers most of whose labels also belong to that class;
Rule 4 (special case 1): if the blogger's labels are evenly distributed across the classes, randomly select from each class labels that are not in the blogger's label set and recommend them to the blogger;
Rule 5 (special case 2): if all labels have the same frequency, then for a specific blogger it suffices to randomly select labels that are not in the blogger's label set and recommend them.
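Rules 2 and 3 can be illustrated with the following sketch. All names (`dominant_class`, `recommend_labels`, `recommend_friends`) and the sample data are hypothetical. Ties between classes are resolved in favor of the class listed first, which is an assumption of this sketch; the evenly distributed case of rule 4 and the random selection of rule 5 are not handled here.

```python
def dominant_class(labels, classes):
    """Index of the class that holds most of the blogger's labels, or None."""
    hits = [len(set(labels) & set(members)) for members in classes]
    if not hits or max(hits) == 0:
        return None
    return hits.index(max(hits))  # first class wins on a tie


def recommend_labels(labels, classes):
    """Rule 2: the other labels of the blogger's dominant class."""
    idx = dominant_class(labels, classes)
    if idx is None:
        return []
    return sorted(set(classes[idx]) - set(labels))


def recommend_friends(blogger, blogger_labels, classes):
    """Rule 3: bloggers whose dominant class matches this blogger's."""
    idx = dominant_class(blogger_labels[blogger], classes)
    if idx is None:
        return []
    return sorted(
        other
        for other, labels in blogger_labels.items()
        if other != blogger and dominant_class(labels, classes) == idx
    )
```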
To implement the method, the following steps are carried out:
Step 1: obtain the basic information of a number of microblog bloggers to form a basic blogger list.
Step 2: capture all posts published in the most recent 6 months by every blogger in the blogger list. If a blogger's total number of posts in the most recent 6 months is greater than 100, the blogger is considered to post frequently, that is, to be a heavyweight blogger, and the following steps continue; otherwise the blogger is discarded and the next blogger is processed.
Step 3: because forwarded content does not reflect the current blogger's characteristics, remove the forwarded content from each blogger's posts and preprocess the remaining posts.
Step 4: segment the preprocessed microblog content into words and remove stop words.
Step 5: extract the keywords of each post from the microblog data with stop words removed.
Step 5.1: for posts longer than 30 characters, extract keywords using the TF-IDF method described above.
Step 5.2: for posts of 30 characters or fewer, extract keywords using the common word extraction method described above.
Step 6: for each blogger's keywords obtained above, count the number of times each keyword is used in the posts the blogger published in the most recent 30 days (excluding forwarded posts), sort in descending order, and take the top L keywords as the blogger's labels (if there are fewer than L, that number is used directly; for convenience it is still denoted L).
Step 7: gather the L-label sets of the U bloggers and classify the total label collection into communities using the algorithm described above. With this method, which resembles the frequent sets of association rule mining, the labels that most often appear together are assigned to the same class, and the associations among these labels and bloggers are mined.
Step 8: obtain the classification result and display it graphically. Analyze the bloggers according to rules 1 to 5 above.
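The selection logic of steps 2, 3, 5.1 and 5.2 amounts to three small checks, sketched below. The function names, the dictionary key `is_repost` and the returned method names are hypothetical; only the thresholds (more than 100 posts, 30 characters) come from the steps above.

```python
def is_heavy_blogger(posts, min_posts=100):
    """Step 2: keep a blogger only if the 6-month post count exceeds 100."""
    return len(posts) > min_posts


def original_posts(posts):
    """Step 3: drop forwarded posts; each post is a dict with an is_repost flag."""
    return [p for p in posts if not p["is_repost"]]


def extraction_method(text, limit=30):
    """Steps 5.1 and 5.2: choose the extraction method by post length."""
    return "tfidf" if len(text) > limit else "common_words"
```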
Brief description of the drawings
Fig. 1 is a schematic diagram of Tencent Weibo.
Fig. 2 is the system flowchart for capturing microblog information.
Fig. 3 is the system flowchart of the present invention.
Fig. 4.1 and Fig. 4.2 are examples of the database listing of the microblog blogger list in the present invention. Because the data rows are long, the listing is shown in two parts, database listing (1) and database listing (2).
Fig. 5.1 and Fig. 5.2 are word segmentation examples. Fig. 5.1 is the original paragraph before segmentation, and Fig. 5.2 is an example segmentation result.
Fig. 6.1, Fig. 6.2 and Fig. 6.3 are examples of TF-IDF results. Fig. 6.1 shows the TF values of some of a blogger's words, Fig. 6.2 shows some of the blogger's IDF values, Fig. 6.3 shows some of the TF-IDF scores, and Fig. 6.4 shows some of the keywords of user "liubulang".
Fig. 7.1 to Fig. 7.6 are examples of application results, showing class 1 to class 6 respectively.
Embodiment
The method of the invention is further described below with reference to the drawings and an embodiment.
Each step of the method is demonstrated below, taking Tencent Weibo as an example:
Step 1: obtain the basic information of a number of microblog bloggers to form a basic blogger list.
Step 2: obtain microblog content. This step is implemented in Java using the Eclipse development environment. Microblog data are obtained by calling the open platform's API through the platform's OAuth authorized access. All posts published in the most recent 6 months (from December 2011 to May 2012) by every blogger in the blogger list are captured. Only heavyweight bloggers are added to the table. Two tables are created in the database: a blogger information table and a microblog information table. Because the volume of captured data is too large, the data of 52 bloggers (U=52) are selected at random from the database tables for the demonstration that follows. The table formats are as follows:
Table 1: userinfo // blogger information table
Table 2: status // microblog information table
Step 3: remove the forwarded content from the selected bloggers' posts, that is, remove the records whose zhuanfa field in the database table has the value "1". In addition, some parts of the microblog content do not help the subsequent processing yet appear frequently, so preprocessing is carried out first.
1) Remove symbols from the posts, such as emoticons, the "@" sign together with the blogger nickname that follows it, and the topic marker "##".
2) Also remove punctuation, spaces, carriage returns and similar content from the posts.
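The two preprocessing rules above could be sketched with regular expressions as follows. The patterns are assumptions, not the platform's actual markup: nicknames are assumed to consist of ASCII letters, digits, underscores and hyphens, and emoticon removal is left out because the text does not say how emoticons are encoded.

```python
import re

# Assumed nickname alphabet: ASCII letters, digits, underscore and hyphen.
MENTION = re.compile(r"@[A-Za-z0-9_\-]+")
# Anything that is not a letter or digit: punctuation, topic markers "#",
# spaces, tabs, carriage returns and line feeds.
NON_WORD = re.compile(r"[\W_]+")


def clean_post(text):
    """Strip mentions first, then all punctuation and whitespace."""
    text = MENTION.sub("", text)
    return NON_WORD.sub("", text)
```

Mentions are removed before the punctuation pass so that the nickname disappears together with its "@" sign instead of being merged into the remaining text.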
Step 4: segment the bloggers' posts into words and remove stop words. The implementation is as follows.
1) Because microblog wording is distinctive, some words are specific to microblogs, and some of the bloggers' words are therefore absent from the segmentation dictionary. Such words are first added to a blogger dictionary to complete it.
2) Many word segmentation tools are currently available. Here the Chinese Academy of Sciences word segmentation system ICTCLAS is chosen, and the posts are segmented using the imported blogger dictionary. Examples of segmentation results are shown in Fig. 5.1 and Fig. 5.2.
3) Remove stop words from the segmented posts according to a given stop word list, which removes the pronouns, articles, conjunctions and similar words in the posts. These words occur very frequently but do not help label extraction, and removing them does not affect the topic of a post. In addition, because some of a blogger's habitual expressions are selected as labels during label extraction, care must be taken not to add these words to the stop word list. Examples of stop words include "he" and "with".
4) content of having divided word, remove after stop words is put into different files according to different blogers, every microblogging is put into a txt document.
Step 5: write a Java program according to the method described above to extract keywords from the microblog data with stop words removed.
Step 5.1: for posts longer than 30 characters, extract keywords using the TF-IDF method described above. Fig. 6.1, Fig. 6.2 and Fig. 6.3 are examples of TF-IDF results. The example shows the results for the blogger whose name field has the value "liubulang". Fig. 6.1 shows the TF values of some of this blogger's words, Fig. 6.2 shows some of this blogger's IDF values, Fig. 6.3 shows the TF-IDF scores, and Fig. 6.4 shows the keywords.
Step 5.2: for posts of 30 characters or fewer, extract keywords using the common word extraction method described above.
Step 6: after each blogger's keywords are obtained, count the number of times each keyword is used in the posts the blogger published in the most recent 30 days (May 1 to May 31), excluding forwarded posts, sort in descending order, and take the top 30 (L=30) as the blogger's labels. The 52 bloggers thus obtain 1560 labels in total, which form the total label collection of all bloggers.
Step 7: apply the following classification steps to the total label collection obtained above, using the method described above.
Step 7.1: store all these labels in an array in the form (label value, owning user), that is, array[]=(tag, user); count the frequencies, sort them in descending order, and store them in an array frequency[]=(tag, frequency). A frequency threshold of 3 is set, and labels with frequency 1 or 2 are removed. This finally leaves 1347 labels with frequency >= 3.
Step 7.2: process the data. The array above is compared in a loop, and whenever two labels have the same user value their co-occurrence value is increased by 1, which gives the number of co-occurrences between every pair of labels. From this a 1347x1347 label-label co-occurrence matrix is built, in which both dimensions are labels and each entry is the number of co-occurrences between two labels. That is, if the co-occurrence count of label i and label j is n, then k_ij = n. The maximum co-occurrence count obtained is 43.
Step 7.3: place the labels in a network for analysis, with each label as a node. A two-dimensional array co_occur[i][j] stores the co-occurrence count of label i and label j, and a two-dimensional array link[i][j], initialized to 0, stores whether label i and label j are connected. A numeric variable num is used as a counter. Starting from the maximum co-occurrence value 43, the co-occurrence values are searched in descending order in a loop; whenever a pair with the current value exists, the link value of the two nodes is set to 1 and num is increased by 1. The loop stops when the classification degree reaches 90%, that is, when num/2 exceeds 1347*90%, meaning that 1212 labels have found their class, and the classification is complete. (num is halved because connections are undirected: a connection between node i and node j sets both link[i][j] and link[j][i] to 1, so num counts twice the actual number.)
Step 7.4: draw the above result as a network graph using a graph drawing algorithm. Remove isolated nodes and components consisting of only two connected nodes from the network graph, and take each remaining connected component as one class. In this example, 6 classes are obtained.
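The linking loop of step 7.3 can be written almost literally with nested lists, as sketched below. The array names `co_occur`, `link` and `num` follow the text; the function name, the `target` parameter and the check of the stopping condition after each co-occurrence level are assumptions. The sketch keeps the stopping counter exactly as described, so `num / 2` counts undirected links.

```python
def link_by_cooccurrence(co_occur, target=0.9):
    """Connect label pairs from the highest co-occurrence value downwards.

    co_occur: symmetric square matrix (list of lists) of co-occurrence counts.
    Returns the link matrix and the counter num.
    """
    n = len(co_occur)
    link = [[0] * n for _ in range(n)]
    num = 0
    max_value = max(
        (co_occur[i][j] for i in range(n) for j in range(n) if i != j),
        default=0,
    )
    for value in range(max_value, 0, -1):
        for i in range(n):
            for j in range(n):
                if i != j and co_occur[i][j] == value and link[i][j] == 0:
                    link[i][j] = 1
                    num += 1  # each undirected link is counted twice
        if num / 2 > n * target:
            break
    return link, num
```

In the embodiment the matrix would be 1347x1347 and the loop would start from the maximum co-occurrence value 43.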
Step 8: obtain the classification result and display it graphically; see Fig. 7.1 to Fig. 7.6. The labels in Fig. 7.1 are mainly related to entertainment and also include some labels related to media and news. For example, if the labels this method gives a blogger include the word "outdoors", we infer that the blogger is also interested in "spring outings", "media", "film and television", "tourism" and so on, and recommend such products to the blogger. The labels in Fig. 7.2 are mainly related to the Internet and also include labels such as marketing and management, which shows that the Internet has brought great opportunities to new industries and that bloggers pay considerable attention to this area. Fig. 7.3 mainly contains labels related to fashion, beauty and show business, which relate more to women. The labels in Fig. 7.4 are related to activities that bloggers often carry out over the network, and include network words such as "praising" and "practical joke". The labels in Fig. 7.5 mainly concern sports, public welfare and finance and, compared with Fig. 7.3, have a more masculine character. Fig. 7.6 is a huge graph containing a great many labels that touch on every aspect.
After these classes have been obtained, recommendations can be made according to each blogger's 30 labels: whenever a label given to the blogger falls in a class, products related to the other words of that class are recommended to the blogger, and the more of the blogger's labels fall in a class, the higher the priority given to products of that class. For example, for blogger "yueguangtaotao", most of whose labels, such as "fashion" and "beauty", belong to class 3, other labels of that class, such as "star" and "shopping", are recommended to her according to rule 2. If most of the labels of another blogger "xingganxuexue" also belong to class 3, then according to rule 3 "xingganxuexue" is recommended as a friend of "yueguangtaotao".