CN114003723A

CN114003723A - Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence

Info

Publication number: CN114003723A
Application number: CN202111358208.7A
Authority: CN
Inventors: 饶洋辉; 刘海
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-02-01

Abstract

The present invention proposes a word co-occurrence based text non-negative matrix tri-factorization joint clustering method and system, comprising: acquiring a text data set and preprocessing the text data set; constructing a text feature matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set; and constructing a non-negative matrix tri-factorization joint clustering model, inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, while introducing a global word vector representation regularization term to iteratively update the model, thereby obtaining a document clustering matrix and a word clustering matrix. By introducing the word co-occurrence matrix, the word importance matrix and the global word vector representation regularization term, the invention improves the expressiveness of the word vectors by discovering word co-occurrence patterns, makes fuller use of global text information and the semantic relationships between words, improves the joint clustering capability of non-negative matrix tri-factorization, and effectively improves the effect and quality of text clustering.

Description

Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence

Technical Field

The invention relates to the field of natural language processing, in particular to a text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence.

Background

With the development of society, the volume of digitized text data keeps growing. Against this background, using clustering techniques to group text data helps people find useful information within massive amounts of information. Text co-clustering is the process of clustering documents and words simultaneously. It exploits the inherent duality between the rows and columns of a data matrix, which makes it possible to improve the clustering in both dimensions at once and to produce more meaningful and interpretable results.

Non-negative matrix factorization is a classic joint clustering method and has also been applied to text joint clustering.

However, text clustering methods that rely on non-negative matrix factorization alone ignore the semantic relationships between words. As a result, documents on the same topic, or documents using similar words, are not guaranteed to be mapped to the same direction in the latent space, which degrades the text clustering effect.

Disclosure of Invention

The invention provides a text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence, aiming at improving the effect and quality of text clustering.

In order to solve the technical problems, the technical scheme of the invention is as follows:

In a first aspect, the invention provides a text non-negative matrix tri-factorization joint clustering method based on word co-occurrence, which comprises the following steps:

S1: Acquiring a text data set, and preprocessing the text data set.

S2: Constructing a text feature matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set.

S3: Constructing a non-negative matrix tri-factorization joint clustering model, inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introducing a global word vector representation regularization term to iteratively update the model, so as to obtain a document clustering matrix and a word clustering matrix.

Preferably, the text feature matrix X is a tf-idf feature matrix, given by:

X_ij = tf_{i,j} × idf_i

wherein tf_{i,j} denotes the number of times word i appears in document j, and idf_i is the inverse document frequency of word i, computed from N and df_i, where N denotes the number of documents in the preprocessed text data set and df_i is the number of documents in the data set that contain word i.
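The tf-idf construction described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `tfidf_matrix` is hypothetical, and the form idf_i = log(N / df_i) is an assumption, since the text defines N and df_i but the exact idf expression is not shown. Rows index documents and columns index words, following the later description of X as a document-word matrix.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, X) with X[j][i] = tf(i, j) * idf(i) for document j and word i.

    Assumption: idf_i = log(N / df_i); the source does not show the exact form.
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)
    # df[w]: number of documents that contain word w at least once
    df = Counter(w for doc in docs for w in set(doc))
    idf = [math.log(n_docs / df[w]) for w in vocab]
    X = [[0.0] * len(vocab) for _ in docs]
    for j, doc in enumerate(docs):
        for w, tf in Counter(doc).items():
            X[j][index[w]] = tf * idf[index[w]]
    return vocab, X

# Toy corpus of already-tokenized documents (illustrative only)
docs = [["matrix", "cluster", "matrix"], ["cluster", "text"], ["text", "word", "word"]]
vocab, X = tfidf_matrix(docs)
```

A word that occurs in every document receives idf = 0 under this assumed form, so it contributes nothing to X.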

Preferably, the elements H_ij of the word importance matrix are constructed as follows:

wherein α denotes the weight index, C_ij denotes the number of times that word i and word j appear in the same window, and C_max denotes the value of the largest element in the word co-occurrence matrix.

Preferably, S3 specifically includes the following steps:

S3.1: Constructing a non-negative matrix tri-factorization joint clustering model, and initializing the parameters of the model.

S3.2: Inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix tri-factorization joint clustering model to obtain a document topic matrix, a word topic matrix and a background word topic matrix.

S3.3: Introducing a global word vector representation regularization term, and iteratively updating the non-negative matrix tri-factorization joint clustering model by a multiplicative update method to obtain a document clustering matrix and a word clustering matrix.

Preferably, S3.4 specifically comprises the steps of:

S3.4.1: Iteratively updating the non-negative matrix tri-factorization joint clustering model by the multiplicative update method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix.

S3.4.2: Averaging the word clustering matrix and the background word clustering matrix to obtain the final word clustering matrix.

Preferably, the objective function of the non-negative matrix tri-factorization joint clustering model that is iteratively updated by the multiplicative update method is as follows:

wherein X is the document-word representation matrix of the text, Z is the document clustering matrix, W is the word clustering matrix, S is the intermediate matrix, H is the word importance matrix, C is the word co-occurrence matrix, and η is the parameter of the global word vector representation regularization term; the model further includes a background word clustering matrix. The factors are constrained to be non-negative, that is, Z ≥ 0, S ≥ 0, W ≥ 0, and likewise for the background word clustering matrix. The objective function contains the introduced global word vector representation regularization term and a newly added non-negative matrix tri-factorization term based on background words.

Preferably, the multiplicative update formulas are expressed as follows:

wherein Z_{k-1} denotes the document topic matrix before the iteration and Z_k the document topic matrix after the iteration; S_{k-1} denotes the intermediate matrix before the iteration and S_k the intermediate matrix after the iteration; W_{k-1} denotes the word topic matrix before the iteration and W_k the word topic matrix after the iteration; and the background word topic matrix before and after the iteration is denoted analogously.

Preferably, preprocessing the text data set comprises word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.

In a second aspect, the present invention further provides a text non-negative matrix tri-factorization joint clustering system based on word co-occurrence, which applies the text non-negative matrix tri-factorization joint clustering method based on word co-occurrence according to any of the above schemes, and includes:

A data processing module, configured to acquire the text data set and preprocess the text data set.

A feature construction module, configured to construct the text features from the text data set.

A clustering module, configured to construct a non-negative matrix tri-factorization joint clustering model, input the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduce a global word vector representation regularization term to iteratively update the model, so as to obtain a document clustering matrix and a word clustering matrix.

Preferably, the text features include a text feature matrix, a word co-occurrence matrix, and a word importance matrix.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the invention introduces a word co-occurrence matrix, a word importance matrix and a global word vector representation regularization term, improves the expressiveness of the word vectors by discovering word co-occurrence patterns, makes fuller use of global text information and the semantic relationships between words, improves the joint clustering capability of non-negative matrix tri-factorization, and effectively improves the effect and quality of text clustering.

Drawings

FIG. 1 is a flow chart of a text non-negative matrix tri-factorization joint clustering method based on word co-occurrence.

FIG. 2 is a block diagram of a text non-negative matrix tri-decomposition joint clustering system architecture based on word co-occurrence.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

The embodiment provides a text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence, which comprises the following steps:

S1: Acquiring a text data set, and preprocessing the text data set by word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal to obtain a preprocessed text data set.

S2: Constructing a text feature matrix X, a word co-occurrence matrix C and a word importance matrix H based on the preprocessed text data set.

In a specific implementation, the text non-negative matrix tri-factorization joint clustering method based on word co-occurrence introduces a word co-occurrence matrix, a word importance matrix and a global word vector representation regularization term. It improves the expressiveness of the word vectors by discovering word co-occurrence patterns, makes fuller use of global text information and the semantic relationships between words, and can uncover deeper information in the text. While retaining strong interpretability, it improves the joint clustering capability of non-negative matrix tri-factorization and effectively improves the effect and quality of text clustering.

Example 2

S1: Acquiring the classic4 text data set, and preprocessing the text data set, including word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.

The text data set adopted in this embodiment is the classic4 English text data set. It comprises 7095 abstract documents in four document collections: CACM (computer science) with 3204 documents, CISI (information science) with 1460, MED (medical science) with 1033 and CRAN (aeronautical science) with 1398.

The words and punctuation in the classic4 English text are split appropriately. English text is usually segmented using the space as the separator, and segmenting the data set facilitates the subsequent text processing.

Stop words such as 'the', 'is', 'at', 'which' and 'on' are removed from the classic4 English text. Removing stop words has little influence on the method, while it saves storage space and improves the efficiency of the method.

High-frequency words, that is, words appearing in large numbers in the text data set, contribute little to improving the effect of the method, and low-frequency words are usually treated as noise. The method removes low-frequency words whose occurrence frequency is less than 5 and high-frequency words whose occurrence frequency is more than 0.95 in the classic4 text data set.
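The preprocessing steps of S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing code of the embodiment: the function name `preprocess`, its parameters and the five-word stop list are hypothetical, "frequency less than 5" is read as a total occurrence count below 5, and "frequency more than 0.95" is read as appearing in more than 95% of the documents.

```python
import re
from collections import Counter

# Tiny illustrative stop list taken from the examples in the text; a real list is longer.
STOP_WORDS = {"the", "is", "at", "which", "on"}

def preprocess(raw_docs, min_count=5, max_doc_ratio=0.95):
    """Tokenize, drop stop words, then drop low- and high-frequency words."""
    # Keeping only runs of letters removes punctuation and digits in the same step.
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in raw_docs]
    tokenized = [[w for w in doc if w not in STOP_WORDS] for doc in tokenized]
    counts = Counter(w for doc in tokenized for w in doc)          # total occurrences
    doc_freq = Counter(w for doc in tokenized for w in set(doc))   # documents containing w
    n_docs = len(tokenized)
    keep = {w for w in counts
            if counts[w] >= min_count and doc_freq[w] / n_docs <= max_doc_ratio}
    return [[w for w in doc if w in keep] for doc in tokenized]

raw = ["The wing is on the aircraft, 42 wings!", "The library index is at the library."]
tokens = preprocess(raw, min_count=1, max_doc_ratio=1.0)
```

The thresholds are relaxed in the usage line only because the toy corpus has two documents; the defaults mirror the values 5 and 0.95 given in the embodiment.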

S2: Constructing a text feature matrix X, a word co-occurrence matrix C and a word importance matrix H based on the preprocessed text data set.

The text feature matrix X constructed in this embodiment is the tf-idf feature matrix X_ij = tf_{i,j} × idf_i, where tf_{i,j} denotes the number of times word i appears in document j and idf_i is the inverse document frequency of word i.

The word co-occurrence matrix C records the number of co-occurrences of different words in the data set. The element C_ij denotes the number of times that word i and word j appear in the same window, and the window size of the word co-occurrence matrix is set to 10.
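The windowed co-occurrence count can be sketched as follows. This is a minimal illustration: the function name `cooccurrence` is hypothetical, the matrix is stored sparsely as a dictionary keyed by word pairs, and "same window" is assumed to mean that two words lie within `window` consecutive tokens of the same document, since the text does not define the window more precisely.

```python
from collections import defaultdict

def cooccurrence(docs, window=10):
    """Count how often each pair of distinct words falls inside a sliding window."""
    C = defaultdict(float)
    for doc in docs:
        for pos, w in enumerate(doc):
            # Look back at most window - 1 tokens so every pair is counted once per position.
            for u in doc[max(0, pos - window + 1):pos]:
                if u != w:
                    C[(w, u)] += 1.0
                    C[(u, w)] += 1.0  # keep the matrix symmetric
    return C

C_small = cooccurrence([["a", "b", "c"]], window=2)
C_wide = cooccurrence([["a", "b", "c"]], window=3)
```

With `window=2` only adjacent words co-occur, so the pair ("a", "c") is absent from `C_small` but present in `C_wide`.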

The elements H_ij of the word importance matrix are constructed as follows:

where α is a suitable weight index determined experimentally and set to 0.75, and C_max is a suitable scaling denominator determined experimentally and set to 10.
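One plausible reading of the word importance weight is sketched below. The exact expression is not shown in the text, so the form used here is an assumption: it is the weighting function known from GloVe (global vectors for word representation), which also uses a weight index of 0.75 and a scaling denominator. The function name `importance_weight` is hypothetical.

```python
def importance_weight(c_ij, c_max=10.0, alpha=0.75):
    """Assumed GloVe-style weight: (c / c_max) ** alpha below the cap, 1 at or above it."""
    if c_ij <= 0:
        return 0.0  # word pairs that never co-occur get no weight
    return (c_ij / c_max) ** alpha if c_ij < c_max else 1.0

weights = [importance_weight(c) for c in (0, 5, 10, 50)]
```

Under this assumed form, rare co-occurrences are down-weighted and very frequent ones are capped at 1, so no single word pair dominates the regularization term.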

S3: Constructing a non-negative matrix tri-factorization joint clustering model, inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introducing a global word vector representation regularization term to iteratively update the model, so as to obtain a document clustering matrix and a word clustering matrix. This specifically comprises the following steps:

S3.3: Introducing a global word vector representation regularization term, and iteratively updating the non-negative matrix tri-factorization joint clustering model by a multiplicative update method to obtain a document clustering matrix and a word clustering matrix. This specifically comprises the following steps:

In this embodiment, the parameters of the non-negative matrix tri-factorization joint clustering model that need to be initialized are {η, x_max, α, I, k, l}, where η is the parameter of the global word vector representation regularization term, I is the number of multiplicative update iterations, k is the number of document topics, and l is the number of word topics. The document clustering matrix, the word clustering matrix, the background word clustering matrix and the intermediate matrix are each randomly initialized with values between 0 and 1, and η = 0.01, I = 200, k = 20 and l = 20.
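The parameter set of this embodiment can be collected in a small configuration object, sketched below. The class name `ModelConfig`, its field names and the helper `factor_shapes` are hypothetical. Two points are assumptions: that x_max is the same quantity as the scaling denominator set to 10 for the word importance matrix, and that the background word clustering matrix has the same shape as the word clustering matrix (which the later averaging step requires).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    eta: float = 0.01         # weight of the global word vector regularization term
    x_max: float = 10.0       # scaling denominator of the word importance matrix (assumed)
    alpha: float = 0.75       # weight index of the word importance matrix
    iterations: int = 200     # I: number of multiplicative update iterations
    n_doc_topics: int = 20    # k
    n_word_topics: int = 20   # l

    def factor_shapes(self, n_docs, n_words):
        """Shapes of the factors for an n_docs x n_words matrix X approximated by Z S W^T."""
        k, l = self.n_doc_topics, self.n_word_topics
        return {"Z": (n_docs, k), "S": (k, l), "W": (n_words, l), "W_bg": (n_words, l)}

cfg = ModelConfig()
shapes = cfg.factor_shapes(n_docs=7095, n_words=5000)  # 5000 is an arbitrary vocabulary size
```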

In this embodiment, the objective function of the non-negative matrix tri-factorization joint clustering model that is iteratively updated by the multiplicative update method is as follows:

wherein X is the document-word representation matrix of the text, Z is the document clustering matrix, W is the word clustering matrix, S is the intermediate matrix, H is the word importance matrix, C is the word co-occurrence matrix, and η is the parameter of the global word vector representation regularization term; the model further includes a background word clustering matrix. The factors are constrained to be non-negative, that is, Z ≥ 0, S ≥ 0, W ≥ 0, and likewise for the background word clustering matrix. The objective function contains the introduced global word vector representation regularization term and a newly added non-negative matrix tri-factorization term based on background words, the latter being used to increase the stability of clustering.

The objective function can be expressed as:

The purpose of non-negative matrix tri-factorization joint clustering is to make ZSW^T fit X as closely as possible, while introducing more document information as a regularization term to improve the clustering effect. The objective function therefore needs to be made as small as possible.

In order to minimize the objective function, the present invention uses multiplicative update rules, which are derived as follows:

First, α, β, γ and δ are used as the Lagrange multipliers of the objective function to obtain the Lagrangian L of the objective function:

Then the partial derivatives of the Lagrangian L with respect to Z, S, W and the background word clustering matrix are taken:

Finally, the multiplicative update rules are obtained using the Kuhn-Tucker conditions, and the document topic matrix, the word topic matrix and the background word topic matrix are iteratively updated according to the following formulas:
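To illustrate how multiplicative update rules of this kind operate, the following is a minimal sketch for plain non-negative matrix tri-factorization only, that is, for minimizing ||X - Z S W^T||_F^2 subject to Z, S, W ≥ 0. It deliberately omits the global word vector regularization term and the background word term, so it is not the update rule of the invention. The function name `nmtf`, the small constant `eps` that guards against division by zero, and the fixed random seed are assumptions made for the example. Each rule multiplies a factor by the ratio of the negative part to the positive part of the gradient, which keeps every entry non-negative.

```python
import numpy as np

def nmtf(X, k, l, iterations=200, eps=1e-9, seed=0):
    """Plain NMTF by multiplicative updates: X (n x m) is approximated by Z S W^T."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random initialization in [0, 1), as in the embodiment
    Z, S, W = rng.random((n, k)), rng.random((k, l)), rng.random((m, l))
    for _ in range(iterations):
        # Gradient w.r.t. Z is 2 (Z S W^T W S^T - X W S^T); the ratio below sets it to zero.
        Z *= (X @ W @ S.T) / (Z @ S @ W.T @ W @ S.T + eps)
        # Gradient w.r.t. S is 2 (Z^T Z S W^T W - Z^T X W).
        S *= (Z.T @ X @ W) / (Z.T @ Z @ S @ W.T @ W + eps)
        # Gradient w.r.t. W is 2 (W S^T Z^T Z S - X^T Z S).
        W *= (X.T @ Z @ S) / (W @ S.T @ Z.T @ Z @ S + eps)
    return Z, S, W

X_demo = np.random.default_rng(1).random((6, 5))
Z_demo, S_demo, W_demo = nmtf(X_demo, k=2, l=3, iterations=50)
```

Because every factor starts non-negative and is only ever multiplied by non-negative ratios, the constraints Z ≥ 0, S ≥ 0 and W ≥ 0 hold throughout without any projection step.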

The word clustering matrix W and the background word clustering matrix are added and averaged to obtain the final word clustering matrix W_f.

In this embodiment, the document clustering matrix Z is regarded as the document-topic distribution, and the word clustering matrix W_f is regarded as the word-topic distribution. For each row of Z, the column holding the maximum value indicates the class to which the corresponding document belongs; for each row of W_f, the column holding the maximum value indicates the class to which the corresponding word belongs.
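The averaging and row-wise maximum assignment described above can be sketched as follows. The function names `average_matrices` and `assign_clusters` and the toy matrices are hypothetical; matrices are plain lists of rows, and ties are broken in favor of the first column, which is a choice made for the example.

```python
def average_matrices(W, W_bg):
    """Element-wise average of the word and background word clustering matrices."""
    return [[(a + b) / 2 for a, b in zip(row, row_bg)] for row, row_bg in zip(W, W_bg)]

def assign_clusters(M):
    """For each row, return the index of the column holding the maximum value."""
    return [max(range(len(row)), key=row.__getitem__) for row in M]

Z = [[0.1, 0.9], [0.7, 0.2]]        # toy document clustering matrix (2 documents, 2 topics)
W = [[0.2, 0.8], [0.6, 0.4]]        # toy word clustering matrix (2 words, 2 topics)
W_bg = [[0.4, 0.2], [0.2, 0.6]]     # toy background word clustering matrix
doc_labels = assign_clusters(Z)
word_labels = assign_clusters(average_matrices(W, W_bg))
```

In the toy data the second word would be assigned to column 0 by W alone, but after averaging with the background matrix it moves to column 1.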

The document-topic distribution and the word-topic distribution obtained in this embodiment can be widely applied in fields such as sentiment analysis, short-text topic recognition, text classification and clustering, and machine translation.

Table 1 compares the document clustering results of the invention on the classic4 data set with those of NMTF (Non-negative Matrix Tri-Factorization) and WCNMTF (Word Co-occurrence Regularized Non-negative Matrix Tri-Factorization).

TABLE 1 Document clustering results of the invention on the classic4 data set compared with other models

Model	ARI	NMI
NMTF	0.481351	0.620045
WCNMTF	0.578662	0.643512
The invention	0.739819	0.733148

Table 1 shows the performance of the different models on the document clustering indexes ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information). As Table 1 shows, both the ARI and the NMI of the method are higher than those of NMTF and WCNMTF, which indicates that the method outperforms the other models in document clustering; that is, the method is more likely than the other models to assign documents on the same topic to the same class.
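For readers who want to reproduce an index of this kind, the Adjusted Rand Index can be computed from two label lists as sketched below. This is a standard pair-counting formulation written for illustration, not the evaluation code behind Table 1; the function name `adjusted_rand_index` is hypothetical, and NMI is left out because the text does not say which normalization was used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts over the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items that share a cell, a true class, or a predicted cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which items are grouped together, not on the label values, so renaming the clusters as in `same_partition` still yields a perfect score.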

Example 3

Referring to FIG. 2, this embodiment provides a text non-negative matrix tri-factorization joint clustering system based on word co-occurrence, which includes a data processing module, a feature construction module and a clustering module.

In a specific implementation, the data processing module preprocesses the text data set, including word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal. The feature construction module constructs a text feature matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set. The clustering module constructs a non-negative matrix tri-factorization joint clustering model, inputs the text feature matrix, the word co-occurrence matrix and the word importance matrix into the model for clustering, and introduces a global word vector representation regularization term to iteratively update the model, so as to obtain a document clustering matrix and a word clustering matrix.

The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A text non-negative matrix tri-factorization joint clustering method based on word co-occurrence, characterized by comprising the following steps:

S1: acquiring a text data set, and preprocessing the text data set;

S2: constructing a text feature matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set;

2. The text non-negative matrix tri-factorization joint clustering method based on word co-occurrence according to claim 1, wherein the text feature matrix X is a tf-idf feature matrix, given by:

X_ij = tf_{i,j} × idf_i

wherein tf_{i,j} denotes the number of times word i appears in document j, and idf_i is the inverse document frequency of word i.

3. The method according to claim 1, wherein the elements H_ij of the word importance matrix H are constructed as follows:

4. The text non-negative matrix tri-factorization joint clustering method based on word co-occurrence according to claim 1, wherein S3 specifically comprises the following steps:

S3.1: constructing a non-negative matrix tri-factorization joint clustering model, and initializing the parameters of the model;

S3.2: inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix tri-factorization joint clustering model to obtain a document topic matrix, a word topic matrix and a background word topic matrix;

5. The text non-negative matrix tri-factorization joint clustering method based on word co-occurrence according to claim 4, wherein S3.4 specifically comprises the following steps:

S3.4.1: iteratively updating the non-negative matrix tri-factorization joint clustering model by the multiplicative update method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix;

6. The method according to claim 5, wherein the objective function of the non-negative matrix tri-factorization joint clustering model that is iteratively updated by the multiplicative update method is:

wherein S is the intermediate matrix, H is the word importance matrix, C is the word co-occurrence matrix, and η is the parameter of the global word vector representation regularization term, the model further including a background word clustering matrix;

the objective function including the introduced global word vector representation regularization term,

7. The text non-negative matrix tri-factorization joint clustering method based on word co-occurrence according to claim 6, wherein the multiplicative update formulas are expressed as follows:

wherein Z_{k-1} denotes the document topic matrix before the iteration and Z_k the document topic matrix after the iteration; S_{k-1} denotes the intermediate matrix before the iteration and S_k the intermediate matrix after the iteration; W_{k-1} denotes the word topic matrix before the iteration and W_k the word topic matrix after the iteration;

8. The method according to claim 1, wherein the preprocessing of the text data set comprises word segmentation, stop-word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.

9. A text non-negative matrix tri-factorization joint clustering system based on word co-occurrence, characterized by comprising:

a data processing module, configured to acquire a text data set and preprocess the text data set;

a feature construction module, configured to construct text features from the text data set;

10. The text non-negative matrix tri-factorization joint clustering system based on word co-occurrence according to claim 9, wherein the text features comprise a text feature matrix, a word co-occurrence matrix and a word importance matrix.