Disclosure of Invention
The invention provides a text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence, aiming at improving the effect and quality of text clustering.
In order to solve the technical problems, the technical scheme of the invention is as follows:
In a first aspect, the invention provides a word co-occurrence-based text non-negative matrix tri-factorization joint clustering method, comprising the following steps:
S1: acquire a text data set and preprocess it.
S2: construct a text feature matrix, a word co-occurrence matrix and a word importance matrix from the preprocessed text data set.
S3: construct a non-negative matrix tri-factorization joint clustering model, input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduce a global word vector representation regularization term to iteratively update the model, obtaining a document clustering matrix and a word clustering matrix.
Preferably, the text feature matrix X is a tf-idf feature matrix, given by:
$X_{ij} = tf_{i,j} \times idf_i$
where $tf_{i,j}$ is the number of times word i appears in document j, and $idf_i$ is the inverse document frequency of word i, computed from N, the number of documents in the preprocessed text data set, and $df_i$, the number of documents in the data set that contain word i (in its standard form, $idf_i = \log(N / df_i)$).
Preferably, the word importance matrix H, with elements $H_{ij}$, is constructed as follows:
where α is a weighting exponent, $C_{ij}$ is the number of times word i and word j appear in the same window, and $C_{max}$ is the value of the largest element in the word co-occurrence matrix.
Preferably, S3 specifically comprises the following steps:
S3.1: construct the non-negative matrix tri-factorization joint clustering model and initialize its parameters.
S3.2: input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model to obtain a document topic matrix, a word topic matrix and a background word topic matrix.
S3.3: introduce the global word vector representation regularization term and iteratively update the model using the multiplicative update method to obtain the document clustering matrix and the word clustering matrix.
Preferably, S3.3 specifically comprises the following steps:
S3.3.1: iteratively update the non-negative matrix tri-factorization joint clustering model using the multiplicative update method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix.
S3.3.2: add the word clustering matrix and the background word clustering matrix and take their average to obtain the final word clustering matrix.
Preferably, the objective function of the non-negative matrix tri-factorization joint clustering model that is iteratively updated by the multiplicative update method is as follows:
where X is the document-word representation matrix of the text, Z is the document clustering matrix, W is the word clustering matrix, S is the intermediate matrix, H is the word importance matrix, C is the word co-occurrence matrix, and η is the parameter of the global word vector representation regularization term; the model further includes a background word clustering matrix. The constraints are Z ≥ 0, S ≥ 0 and W ≥ 0, with the background word clustering matrix likewise non-negative. The objective function contains the introduced global word vector representation regularization term and a newly added non-negative matrix tri-factorization term based on the background words.
Preferably, the multiplicative update formulas are as follows:
where $Z^{k-1}$ and $Z^{k}$ denote the document topic matrix before and after an iteration; $S^{k-1}$ and $S^{k}$ denote the intermediate matrix before and after an iteration; $W^{k-1}$ and $W^{k}$ denote the word topic matrix before and after an iteration; and the background word topic matrix before and after an iteration is denoted in the same way.
Preferably, preprocessing the text data set comprises word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.
In a second aspect, the invention further provides a word co-occurrence-based text non-negative matrix tri-factorization joint clustering system, which applies the word co-occurrence-based text non-negative matrix tri-factorization joint clustering method according to any of the above schemes and comprises:
a data processing module, configured to acquire a text data set and preprocess it;
a feature construction module, configured to construct the text features of the text data set;
a clustering module, configured to construct a non-negative matrix tri-factorization joint clustering model, input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduce a global word vector representation regularization term to iteratively update the model, obtaining a document clustering matrix and a word clustering matrix.
Preferably, the text features include a text feature matrix, a word co-occurrence matrix and a word importance matrix.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention introduces a word co-occurrence matrix, a word importance matrix and a global word vector representation regularization term; by mining word co-occurrence patterns it improves the expressiveness of the word vectors, makes fuller use of global text information and the semantic relations between words, improves the joint clustering capability of non-negative matrix tri-factorization, and effectively improves the effect and quality of text clustering.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The technical solution of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
This embodiment provides a word co-occurrence-based text non-negative matrix tri-factorization joint clustering method, comprising the following steps:
S1: acquire a text data set and preprocess it by word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal to obtain a preprocessed text data set.
S2: construct a text feature matrix X, a word co-occurrence matrix C and a word importance matrix H from the preprocessed text data set.
S3: construct a non-negative matrix tri-factorization joint clustering model, input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduce a global word vector representation regularization term to iteratively update the model, obtaining a document clustering matrix and a word clustering matrix.
In a specific implementation, the method introduces a word co-occurrence matrix, a word importance matrix and a global word vector representation regularization term. By mining word co-occurrence patterns it improves the expressiveness of the word vectors, makes fuller use of global text information and the semantic relations between words, and can uncover deeper information in the text. It improves the joint clustering capability of non-negative matrix tri-factorization while retaining strong interpretability, and effectively improves the effect and quality of text clustering.
Example 2
This embodiment provides a word co-occurrence-based text non-negative matrix tri-factorization joint clustering method, comprising the following steps:
S1: acquire the classic4 text data set and preprocess it by word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.
The text data set used in this embodiment is the classic4 English text data set, which contains 7095 abstract documents in four document collections: CACM (computer science, 3204 documents), CISI (information science, 1460 documents), MED (medicine, 1033 documents) and CRAN (aeronautics, 1398 documents).
Words and punctuation in the classic4 English text are segmented; English text is usually segmented using spaces as separators. Segmenting the data set facilitates subsequent text processing.
Stop words such as 'the', 'is', 'at', 'which' and 'on' are removed from the classic4 English text. Removing stop words has little effect on the results of the method, while saving storage space and improving efficiency.
High-frequency words, i.e. words that appear in large numbers throughout the text data set, contribute little to the effectiveness of the method, and low-frequency words are usually treated as noise. The method therefore removes low-frequency words that occur fewer than 5 times and high-frequency words whose occurrence frequency exceeds 0.95 in the classic4 text data set.
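The preprocessing in S1 can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the embodiment: the names `STOP_WORDS`, `tokenize`, `preprocess`, `min_count` and `max_doc_ratio` are hypothetical, the stop-word list is deliberately tiny, and the sketch assumes that "fewer than 5" refers to a total occurrence count and that "0.95" refers to the fraction of documents containing the word.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real run would use a fuller list.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only,
    which drops punctuation and digits in one pass."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(docs, min_count=5, max_doc_ratio=0.95):
    """Tokenize, remove stop words, then drop rare and very common words.

    Assumption: min_count is a total occurrence count and max_doc_ratio
    is the largest allowed fraction of documents containing a word.
    """
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    total = Counter(t for doc in tokenized for t in doc)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    n_docs = len(docs)
    keep = {
        w for w in total
        if total[w] >= min_count and doc_freq[w] / n_docs <= max_doc_ratio
    }
    return [[t for t in doc if t in keep] for doc in tokenized]
```

Each returned document is a list of surviving tokens, which is the form assumed by later feature-construction steps.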
S2: construct a text feature matrix X, a word co-occurrence matrix C and a word importance matrix H from the preprocessed text data set.
The text feature matrix X constructed in this embodiment is a tf-idf feature matrix, $X_{ij} = tf_{i,j} \times idf_i$, where $tf_{i,j}$ is the number of times word i appears in document j, and $idf_i$ is the inverse document frequency of word i, computed from N, the number of documents in the preprocessed text data set, and $df_i$, the number of documents in the data set that contain word i (in its standard form, $idf_i = \log(N / df_i)$).
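As a rough illustration of the tf-idf construction, the sketch below builds the matrix from tokenized documents. It assumes the unsmoothed textbook form idf(i) = log(N / df(i)); the function name `tfidf_matrix` is hypothetical, and rows index words while columns index documents, following the index convention of the formula.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, X) where X[i][j] = tf(i, j) * idf(i).

    docs is a list of token lists. Rows index words and columns index
    documents. Assumption: idf(i) = log(N / df(i)), without smoothing.
    """
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # df[w]: number of documents that contain word w at least once.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    X = [
        [counts[j][w] * math.log(n_docs / df[w]) for j in range(n_docs)]
        for w in vocab
    ]
    return vocab, X
```

Under this form a word that occurs in every document receives weight zero, since log(N / N) = 0.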
The word co-occurrence matrix C records the number of co-occurrences of different words in the data set; element $C_{ij}$ is the number of times word i and word j appear in the same window. The window size of the word co-occurrence matrix is set to 10.
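A windowed co-occurrence count of this kind might be computed as in the sketch below. The exact window definition is not spelled out in the text, so the sketch assumes that two tokens co-occur when they fall within the same span of `window` consecutive tokens of one document, and that a word is not counted as co-occurring with itself. The function name `cooccurrence_counts` is hypothetical, and the result is stored sparsely as a dictionary rather than as a dense matrix.

```python
from collections import defaultdict


def cooccurrence_counts(docs, window=10):
    """Count symmetric co-occurrences of word pairs.

    Assumption: two tokens co-occur when they lie within the same span of
    `window` consecutive tokens of one document. Each pair of positions is
    counted once, and the count is stored for both (a, b) and (b, a).
    """
    C = defaultdict(int)
    for doc in docs:
        for p, a in enumerate(doc):
            # Tokens that share a window of `window` tokens starting at p.
            for b in doc[p + 1 : p + window]:
                if a != b:
                    C[(a, b)] += 1
                    C[(b, a)] += 1
    return dict(C)
```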
The word importance matrix H, with elements $H_{ij}$, is constructed as follows:
where α is a weighting exponent, set to 0.75 as determined experimentally, and $C_{max}$ is a scaling denominator, set to 10 as determined experimentally.
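The construction formula for H does not appear in the text above. The roles of α and $C_{max}$ resemble the weighting function used in GloVe, so the sketch below shows that form purely as an assumption: the weight grows as $(C_{ij} / C_{max})^{\alpha}$ and is capped at 1 once $C_{ij}$ reaches $C_{max}$. The function name `importance_weight` is hypothetical.

```python
def importance_weight(c_ij, c_max=10.0, alpha=0.75):
    """GloVe-style weight: (c / c_max) ** alpha below c_max, capped at 1.

    Assumption: this functional form is borrowed from GloVe; the text
    only names the exponent alpha and the scaling denominator C_max.
    """
    if c_ij >= c_max:
        return 1.0
    return (c_ij / c_max) ** alpha
```

With this form, rare co-occurrences are down-weighted and frequent ones all receive the same full weight, so very common pairs do not dominate.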
S3: construct a non-negative matrix tri-factorization joint clustering model, input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduce a global word vector representation regularization term to iteratively update the model, obtaining a document clustering matrix and a word clustering matrix. This specifically comprises the following steps:
S3.1: construct the non-negative matrix tri-factorization joint clustering model and initialize its parameters.
S3.2: input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model to obtain a document topic matrix, a word topic matrix and a background word topic matrix.
S3.3: introduce the global word vector representation regularization term and iteratively update the model using the multiplicative update method to obtain the document clustering matrix and the word clustering matrix. This specifically comprises the following steps:
S3.3.1: iteratively update the non-negative matrix tri-factorization joint clustering model using the multiplicative update method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix.
S3.3.2: add the word clustering matrix and the background word clustering matrix and take their average to obtain the final word clustering matrix.
In this embodiment, the parameters of the non-negative matrix tri-factorization joint clustering model to be initialized are {η, $x_{max}$, α, I, k, l}, where η is the parameter of the global word vector representation regularization term, I is the number of multiplicative update iterations, k is the number of document topics and l is the number of word topics. The document clustering matrix, the word clustering matrix, the background word clustering matrix and the intermediate matrix are each randomly initialized with values between 0 and 1, with η = 0.01, I = 200, k = 20 and l = 20.
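The initialization described here could look like the following sketch. It assumes NumPy, a uniform draw from [0, 1), and factor shapes chosen so that X (documents by words) is approximated by the product of Z, S and the transpose of W. The names `init_factors`, `W_bg` and `params` are hypothetical.

```python
import numpy as np


def init_factors(n_docs, n_words, k=20, l=20, seed=0):
    """Draw every factor uniformly from [0, 1) so all entries start non-negative.

    Shapes assume X is (n_docs x n_words) and X ~ Z @ S @ W.T.
    W_bg is a hypothetical name for the background word clustering matrix.
    """
    rng = np.random.default_rng(seed)
    Z = rng.random((n_docs, k))      # document clustering matrix
    S = rng.random((k, l))           # intermediate matrix
    W = rng.random((n_words, l))     # word clustering matrix
    W_bg = rng.random((n_words, l))  # background word clustering matrix
    return Z, S, W, W_bg


# Hyperparameter values reported in this embodiment.
params = {"eta": 0.01, "alpha": 0.75, "iterations": 200, "k": 20, "l": 20}
```

Starting from non-negative values matters because multiplicative updates can only rescale entries and never change their sign.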
In this embodiment, the objective function of the non-negative matrix tri-factorization joint clustering model that is iteratively updated by the multiplicative update method is as follows:
where X is the document-word representation matrix of the text, Z is the document clustering matrix, W is the word clustering matrix, S is the intermediate matrix, H is the word importance matrix, C is the word co-occurrence matrix, and η is the parameter of the global word vector representation regularization term; the model further includes a background word clustering matrix. The constraints are Z ≥ 0, S ≥ 0 and W ≥ 0, with the background word clustering matrix likewise non-negative. The objective function contains the introduced global word vector representation regularization term and a newly added non-negative matrix tri-factorization term based on the background words, which is used to increase the stability of clustering. The objective function as a whole is denoted by a single symbol.
The purpose of non-negative matrix tri-factorization joint clustering is to make $ZSW^T$ fit X as closely as possible, with additional document information introduced as regularization terms to improve the clustering effect; the objective function should therefore be made as small as possible.
To minimize the objective function, the invention uses multiplicative update rules, derived as follows:
First, taking α, β, γ and δ as the Lagrange multipliers of the objective function, the Lagrangian L of the objective function is obtained:
Then the partial derivatives of the Lagrangian L with respect to Z, S, W and the background word clustering matrix are taken:
Finally, the multiplicative update rules are obtained using the Karush-Kuhn-Tucker (KKT) conditions, and the document topic matrix, the word topic matrix and the background word topic matrix are iteratively updated according to the following formulas:
where $Z^{k-1}$ and $Z^{k}$ denote the document topic matrix before and after an iteration; $S^{k-1}$ and $S^{k}$ denote the intermediate matrix before and after an iteration; $W^{k-1}$ and $W^{k}$ denote the word topic matrix before and after an iteration; and the background word topic matrix before and after an iteration is denoted in the same way.
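To illustrate the general pattern of a multiplicative update, the sketch below fits plain non-negative matrix tri-factorization, minimizing only the squared Frobenius error between X and the product of Z, S and the transpose of W. It is not the update rule of the invention: the word co-occurrence, regularization and background-word terms are omitted, and the names `nmtf` and `eps` are hypothetical. Each factor is multiplied element-wise by a ratio of non-negative quantities, so entries that start non-negative stay non-negative.

```python
import numpy as np


def nmtf(X, k, l, iterations=200, eps=1e-10, seed=0):
    """Plain NMTF, X ~ Z @ S @ W.T, fitted by multiplicative updates.

    This omits the word co-occurrence, regularization and background-word
    terms of the model described above; it only illustrates the update
    pattern "factor *= numerator / denominator", which keeps entries >= 0.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    Z, S, W = rng.random((n, k)), rng.random((k, l)), rng.random((m, l))
    losses = []
    for _ in range(iterations):
        # eps guards against division by zero in each denominator.
        Z *= (X @ W @ S.T) / (Z @ S @ W.T @ W @ S.T + eps)
        W *= (X.T @ Z @ S) / (W @ S.T @ Z.T @ Z @ S + eps)
        S *= (Z.T @ X @ W) / (Z.T @ Z @ S @ W.T @ W + eps)
        # Squared Frobenius reconstruction error after this sweep.
        losses.append(np.linalg.norm(X - Z @ S @ W.T) ** 2)
    return Z, S, W, losses
```

The returned `losses` list makes it easy to check that the reconstruction error goes down over the iterations.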
The word clustering matrix W and the background word clustering matrix are added and averaged to obtain the final word clustering matrix $W_f$.
In this embodiment, the document clustering matrix Z is regarded as the document-topic distribution, and the word clustering matrix $W_f$ is regarded as the word-topic distribution. The column containing the maximum value of each row of Z indicates the class to which the document represented by that row belongs, and the column containing the maximum value of each row of $W_f$ indicates the class to which the word represented by that row belongs.
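The averaging and the row-wise maximum rule can be sketched as below, assuming NumPy arrays in which each row of Z is a document and each row of the word matrices is a word. The names `assign_clusters` and `W_bg` (for the background word clustering matrix) are hypothetical.

```python
import numpy as np


def assign_clusters(Z, W, W_bg):
    """Average the two word matrices, then label each row by its largest entry."""
    W_f = (W + W_bg) / 2.0            # final word clustering matrix
    doc_labels = Z.argmax(axis=1)     # one cluster index per document
    word_labels = W_f.argmax(axis=1)  # one cluster index per word
    return doc_labels, word_labels, W_f
```

For example, a document whose row in Z is (0.1, 0.9) is assigned to cluster 1, the column holding the larger value.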
The document-topic distribution and the word-topic distribution obtained in this embodiment can be widely applied in fields such as sentiment analysis, short-text topic recognition, text classification and clustering, and machine translation.
Table 1 compares the document clustering results of the invention on the classic4 data set with those of NMTF (Non-negative Matrix Tri-Factorization) and WCNMTF (Word Co-occurrence Regularized Non-negative Matrix Tri-Factorization).
Table 1. Document clustering results of the invention on the classic4 data set compared with other models
Model | ARI | NMI
NMTF | 0.481351 | 0.620045
WCNMTF | 0.578662 | 0.643512
The invention | 0.739819 | 0.733148
Table 1 shows the performance of the different models on the document clustering metrics ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information). As Table 1 shows, the ARI and NMI of the method are both higher than those of NMTF and WCNMTF, indicating that the method outperforms the other models in document clustering, i.e. it is more likely than the other models to assign documents on the same topic to the same class.
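For reference, ARI can be computed from two label sequences as in the sketch below, which uses the standard pair-counting definition over the contingency table of true and predicted labels. It is only an illustration of the metric, not the evaluation code behind Table 1; the function name `adjusted_rand_index` is hypothetical, and the sketch assumes at least two labeled items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts of the contingency table.

    Assumes len(labels_true) == len(labels_pred) >= 2.
    """
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table cells
    rows = Counter(labels_true)                     # true class sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)
```

The score depends only on which items are grouped together, so relabeling the clusters leaves it unchanged.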
Example 3
Referring to Fig. 2, this embodiment provides a word co-occurrence-based text non-negative matrix tri-factorization joint clustering system, which comprises a data processing module, a feature construction module and a clustering module.
In a specific implementation, the data processing module preprocesses the text data set by word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal. The feature construction module constructs a text feature matrix, a word co-occurrence matrix and a word importance matrix from the preprocessed text data set. The clustering module constructs a non-negative matrix tri-factorization joint clustering model, inputs the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduces a global word vector representation regularization term to iteratively update the model, obtaining a document clustering matrix and a word clustering matrix.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.
It should be understood that the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.