[go: up one dir, main page]

CN114003723A - Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence - Google Patents

Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence Download PDF

Info

Publication number
CN114003723A
CN114003723A CN202111358208.7A CN202111358208A CN114003723A CN 114003723 A CN114003723 A CN 114003723A CN 202111358208 A CN202111358208 A CN 202111358208A CN 114003723 A CN114003723 A CN 114003723A
Authority
CN
China
Prior art keywords
matrix
word
clustering
text
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111358208.7A
Other languages
Chinese (zh)
Inventor
饶洋辉
刘海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111358208.7A priority Critical patent/CN114003723A/en
Publication of CN114003723A publication Critical patent/CN114003723A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明提出一种基于词共现的文本非负矩阵三分解联合聚类方法及系统,包括:获取文本数据集,对文本数据集进行预处理;基于预处理后的文本数据集构造文本特征矩阵、词共现矩阵和词重要性矩阵;构建非负矩阵三分解联合聚类模型,将文本特征矩阵、词共现矩阵和词重要性矩阵输入非负矩阵三分解联合聚类模型进行聚类,同时引入全局词向量表示正则项对非负矩阵三分解联合聚类模型进行迭代更新,得到文档聚类矩阵和词聚类矩阵。本发明引入词共现矩阵、词重要性矩阵和全局词向量表示正则项,通过寻找词共现规律提高词向量的表达效果,更加充分地利用全局文本信息和单词间的语义关系,提升非负矩阵三分解的联合聚类能力,有效提高文本聚类的效果与质量。

Figure 202111358208

The present invention proposes a non-negative matrix three-decomposition joint clustering method and system for text based on word co-occurrence, including: acquiring a text data set, preprocessing the text data set; constructing a text feature matrix based on the preprocessed text data set , word co-occurrence matrix and word importance matrix; construct a non-negative matrix triple decomposition joint clustering model, input the text feature matrix, word co-occurrence matrix and word importance matrix into the non-negative matrix triple decomposition joint clustering model for clustering, At the same time, the global word vector is introduced to represent the regular term to iteratively update the non-negative matrix triple factorization joint clustering model, and the document clustering matrix and the word clustering matrix are obtained. The invention introduces a word co-occurrence matrix, a word importance matrix and a global word vector to represent the regular term, improves the expression effect of the word vector by finding the word co-occurrence rule, more fully utilizes the global text information and the semantic relationship between words, and improves the non-negative The joint clustering ability of matrix triple decomposition can effectively improve the effect and quality of text clustering.

Figure 202111358208

Description

Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence
Technical Field
The invention relates to the field of natural language, in particular to a text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence.
Background
With the development of society, digitalized text data is also increasing. Under the background, the clustering technology is utilized to classify the text data, which is helpful for people to find useful information in massive information. Text co-clustering is a process of clustering documents and words simultaneously, which makes use of the inherent duality of rows and columns of a data matrix, makes it possible to improve the clustering effect of two dimensions simultaneously, and can produce more meaningful and interpretable results.
The non-negative matrix factorization method is a classic joint clustering method and is also applied to text joint clustering. The Markov patent refers to the field of 'electric digital data processing'.
However, the text clustering method using only non-negative matrix factorization ignores semantic relationships between words, resulting in that documents of the same subject or using similar words cannot be guaranteed to be mapped to the same direction in a potential space, resulting in a reduction in the effect of text clustering.
Disclosure of Invention
The invention provides a text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence, aiming at improving the effect and quality of text clustering.
In order to solve the technical problems, the technical scheme of the invention is as follows:
in a first aspect, the invention provides a text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence, which comprises the following steps:
s1: and acquiring a text data set, and preprocessing the text data set.
S2: and constructing a text characteristic matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set.
S3: and constructing a non-negative matrix tri-decomposition combined clustering model, inputting the text characteristic matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix tri-decomposition combined clustering model for clustering, and introducing a global word vector to represent a regular item to perform iterative update on the non-negative matrix tri-decomposition combined clustering model to obtain a document clustering matrix and a word clustering matrix.
Preferably, the text feature matrix X is a tf-idf feature matrix, and the formula is as follows:
Xij=tfi,j×idfi
wherein, tfi,jIndicating the number of times the word i appears in document j,
Figure BDA0003358054770000021
where N denotes the number of documents of the preprocessed text data set, dfxIs the number of documents in the data set that contain the word i.
Preferably, the word importance matrix HijThe construction method of (2) is as follows:
Figure BDA0003358054770000022
wherein α represents a weight index, CijRepresenting the number of times that the word i and the word j appear in the same window, CmaxThe value representing the largest element in the word co-occurrence matrix.
Preferably, S3 specifically includes the following steps:
s3.1: and constructing a non-negative matrix three-decomposition combined clustering model, and initializing parameters of the non-negative matrix three-decomposition combined clustering model.
S3.2: and inputting the text characteristic matrix, the word co-occurrence matrix and the word importance matrix into a non-negative matrix tri-decomposition combined clustering model to obtain a document theme matrix, a word theme matrix and a background word theme matrix.
S3.3: introducing a global word vector to represent a regular term, and performing a non-negative matrix three-decomposition joint clustering model by using a multiplication updating method to obtain a document clustering matrix and a word clustering matrix.
Preferably, S3.4 specifically comprises the steps of:
s3.4.1: and iteratively updating the non-negative matrix trisection solution joint clustering model by using a multiplication updating method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix.
S3.4.2: and adding and averaging the word clustering matrix and the background word clustering matrix to obtain a final word clustering matrix.
Preferably, the objective function of the non-negative matrix tri-decomposition joint clustering model updated by using the multiplication update iteration is as follows:
Figure BDA0003358054770000031
wherein X is a document-word representation matrix of the text, Z is a document clustering matrix, W is a word clustering matrix,
Figure BDA0003358054770000032
the method comprises the steps of taking a background word clustering matrix, taking S as an intermediate matrix, taking H as a word importance matrix, taking C as a word co-occurrence matrix, and taking eta as a parameter of a global word vector representing a regular term. Z is more than or equal to 0, S is more than or equal to 0, W is more than or equal to 0,
Figure BDA0003358054770000033
Figure BDA0003358054770000034
Figure BDA0003358054770000035
to represent the regular term for the incoming global word vector,
Figure BDA0003358054770000036
and decomposing the matrix for the newly added non-negative matrix three-component matrix based on the background words.
Preferably, the more up-to-date formula for the multiplication is represented as follows:
Figure BDA0003358054770000037
Figure BDA0003358054770000038
Figure BDA0003358054770000039
Figure BDA00033580547700000310
wherein Z isk-1Representing the document topic matrix before iteration, ZkRepresenting the document theme matrix after iteration; sk-1Representing the intermediate matrix before iteration, SkRepresenting the intermediate matrix after iteration; wk-1Representing the word topic matrix before iteration, WkRepresenting the iterated word topic matrix;
Figure BDA00033580547700000311
representing a background word topic matrix, W, before iterationkRepresenting the iterated word topic matrix.
Preferably, the preprocessing the text data set comprises word segmentation, word deactivation, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.
In a second aspect, the present invention further provides a text non-negative matrix tri-decomposition joint clustering system based on word co-occurrence, which is applied to the text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence according to any of the above schemes, and includes:
and the data processing module is used for acquiring the text data set and preprocessing the text data set.
And the characteristic construction module is used for constructing the text characteristics in the text data set.
And the clustering module is used for constructing a non-negative matrix three-component solution joint clustering model, inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix three-component solution joint clustering model for clustering, and introducing a global word vector to represent a regular term to perform iterative update on the non-negative matrix three-component solution joint clustering model to obtain a document clustering matrix and a word clustering matrix.
Preferably, the text features include a text feature matrix, a word co-occurrence matrix, and a word importance matrix.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that: the invention introduces a word co-occurrence matrix, a word importance matrix and a global word vector to represent regular items, improves the expression effect of the word vector by searching a word co-occurrence rule, more fully utilizes the semantic relation between global text information and words, improves the joint clustering capability of non-negative matrix tri-decomposition, and effectively improves the effect and quality of text clustering.
Drawings
FIG. 1 is a flow chart of a text non-negative matrix tri-factorization joint clustering method based on word co-occurrence.
FIG. 2 is a block diagram of a text non-negative matrix tri-decomposition joint clustering system architecture based on word co-occurrence.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence, which comprises the following steps:
s1: and acquiring a text data set, and preprocessing the text data set by word segmentation, stop word removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal to obtain a preprocessed text data set.
S2: and constructing a text characteristic matrix X, a word co-occurrence matrix C and a word importance matrix H based on the preprocessed text data set.
S3: and constructing a non-negative matrix tri-decomposition combined clustering model, inputting the text characteristic matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix tri-decomposition combined clustering model for clustering, and introducing a global word vector to represent a regular item to perform iterative update on the non-negative matrix tri-decomposition combined clustering model to obtain a document clustering matrix and a word clustering matrix.
In the specific implementation process, the word co-occurrence-based text non-negative matrix tri-decomposition combined clustering method introduces a word co-occurrence matrix, a word importance matrix and a global word vector to represent regular items, improves the expression effect of the word vector by searching for a word co-occurrence rule, more fully utilizes the semantic relation between global text information and words, can explore deeper information of a text, improves the combined clustering capability of the non-negative matrix tri-decomposition on the basis of strong interpretability, and effectively improves the text clustering effect and quality.
Example 2
The embodiment provides a text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence, which comprises the following steps:
s1: obtaining class 4 text data set, and preprocessing the text data set including word segmentation, word deactivation, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal.
The text data set adopted in the embodiment is classic4 English text data set, the classic4 English text data set comprises 7095 abstract documents, and the four abstract documents comprise four types of document sets, namely CACM (computer science) 3204, CISI (information science) 1460, MED (medical science) 1033 and CRAN (aeronautical science) 1398.
Words and punctuation in class 4 English text are reasonably cut, English segmentation is usually cut by taking a space as a separator, and segmentation is carried out on a data set to facilitate the subsequent text processing.
The stop words such as 'the', 'is', 'at', 'which', 'on' and the like in the English text of class 4 are removed, the method is not greatly influenced by the removal of the stop words, the storage space can be saved, and the method efficiency can be improved.
High-frequency words, i.e., words appearing in large numbers in the text data set, are difficult to help the method to improve the effect. Low frequency words are usually treated as noise. The method removes low-frequency words with the occurrence frequency of less than 5 and high-frequency words with the occurrence frequency of more than 0.95 in the classic4 text data set.
S2: constructing a text characteristic matrix X, a word co-occurrence matrix C and a word importance matrix H based on the preprocessed text data set;
the text feature matrix X constructed in the embodiment is tf-idf feature matrix Xij=fi,j×idfi。tfi,jIndicating the number of times the word i appears in document j,
Figure BDA0003358054770000051
where N denotes the number of documents of the preprocessed text data set, dfxIs the number of documents in the data set that contain the word i.
The word co-occurrence matrix C records the number of co-occurrences of different words in the data set, element CijRepresenting the number of times that the word i and the word j appear in the same window, the window size of the word co-occurrence matrix is set to 10.
Word importance matrix HijThe construction method of (2) is as follows:
Figure BDA0003358054770000061
where α is a suitable weighting index measured experimentally and set at 0.75; cmaxThe appropriate scaling denominator, as determined experimentally, was set to 10.
S3: the method comprises the following steps of constructing a non-negative matrix three-decomposition combined clustering model, inputting a text feature matrix, a word co-occurrence matrix and a word importance matrix into the non-negative matrix three-decomposition combined clustering model for clustering, introducing a global word vector to represent a regular item to iteratively update the non-negative matrix three-decomposition combined clustering model, and obtaining a document clustering matrix and a word clustering matrix, wherein the method specifically comprises the following steps:
s3.1: and constructing a non-negative matrix three-decomposition combined clustering model, and initializing parameters of the non-negative matrix three-decomposition combined clustering model.
S3.2: and inputting the text characteristic matrix, the word co-occurrence matrix and the word importance matrix into a non-negative matrix tri-decomposition combined clustering model to obtain a document theme matrix, a word theme matrix and a background word theme matrix.
S3.3: introducing a global word vector to represent a regular term, and performing a non-negative matrix three-decomposition joint clustering model by using a multiplication updating method to obtain a document clustering matrix and a word clustering matrix, wherein the method specifically comprises the following steps:
s3.4.1: and iteratively updating the non-negative matrix trisection solution joint clustering model by using a multiplication updating method to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix.
S3.4.2: and adding and averaging the word clustering matrix and the background word clustering matrix to obtain a final word clustering matrix.
In this embodiment, the parameters that need to initialize the non-negative matrix tri-decomposition joint clustering model are { η, xmaxα, I, k, l }, where η is a parameter of the global word vector representing the regular term, I is the number of multiplicative update iterations, k is the number of document topics, and l is the number of word topics; in this embodiment, the document clustering matrix, the word clustering matrix, the background word clustering matrix, and the intermediate matrix are respectively initialized at random with 0-1, where η is 0.01, I is 200, k is 20, and l is 20.
In this embodiment, the objective function of iteratively updating the non-negative matrix tri-decomposition joint clustering model by using the multiplicative update method is as follows:
Figure BDA0003358054770000071
wherein X is a document-word representation matrix of the text, Z is a document clustering matrix, W is a word clustering matrix,
Figure BDA0003358054770000072
a background word clustering matrix, an intermediate matrix, a word importance matrix, a word co-occurrence matrix, a global word vector and a regular term are used as the word co-occurrence matrix; z is more than or equal to 0, S is more than or equal to 0, W is more than or equal to 0,
Figure BDA0003358054770000073
Figure BDA0003358054770000074
Figure BDA0003358054770000075
to represent the regular term for the incoming global word vector,
Figure BDA0003358054770000076
and a newly added non-negative matrix three-decomposition matrix based on the background words is used for increasing the stability of clustering. The objective function is available
Figure BDA0003358054770000077
And (4) showing.
The purpose of non-negative matrix tri-decomposition joint clustering is to expect ZSWTFitting X as much as possible, and introducing more document information to form a regular item so as to improve the clustering effect, so that the objective function needs to be as small as possible.
In order to minimize the objective function, the present invention uses a multiplicative update rule, specifically derived as follows:
firstly, using α, β, γ, δ as lagrangian multipliers of the objective function to obtain a lagrangian equation L of the objective function:
Figure BDA0003358054770000078
then lagrange' S equation L applies the Z, S, W,
Figure BDA0003358054770000079
derivation:
Figure BDA00033580547700000710
Figure BDA00033580547700000711
Figure BDA00033580547700000712
Figure BDA00033580547700000713
and finally, obtaining a multiplication updating model by utilizing a Kuhn-Tucker condition, and carrying out iterative updating on the document theme matrix, the word theme matrix and the background word theme matrix, wherein the specific formula is as follows:
Figure BDA0003358054770000081
Figure BDA0003358054770000082
Figure BDA0003358054770000083
Figure BDA0003358054770000084
wherein Z isk-1Representing the document topic matrix before iteration, ZkRepresenting the document theme matrix after iteration; sk-1Representing the intermediate matrix before iteration, SkRepresenting the intermediate matrix after iteration; wk-1Representing the word topic matrix before iteration, WkRepresenting the iterated word topic matrix;
Figure BDA0003358054770000085
representing a background word topic matrix, W, before iterationkRepresenting the iterated word topic matrix.
Clustering the word clustering matrix W and the background word clustering matrix
Figure BDA0003358054770000086
Adding and averaging to obtain final word clustering matrix Wf
In this embodiment, the document clustering matrix Z is regarded as the document-topic distribution, and the word clustering matrix WfConsidered a word-topic distribution. The column where the maximum value of each row in Z is positioned indicates that the document represented by the maximum value belongs to the column class, WfThe column in which the maximum value of each row is located indicates that the word represented by the maximum value belongs to the column class.
The document-theme distribution and the word-theme distribution obtained by the embodiment can be widely applied to the fields of emotion analysis, short text theme recognition, text classification clustering, machine translation and the like.
The comparison of the document clustering experiment result on the classic4 data set with NMTF (non-negative Matrix Tri-Factorization) and WCNMTF (Word Co-objective regulated non-negative Matrix Tri-Factorization) is shown in Table 1.
TABLE 1 results of the inventive document clustering experiments on the classic4 data set compared to other models
Model (model) ARI NMI
NMTF 0.481351 0.620045
WCNMTF 0.578662 0.643512
The invention 0.739819 0.733148
Table 1 shows the performance of the different models on the document clustering Index ARI (Adjusted Rand Index) and NMI (Normalized Mutual Information). It can be seen from table 1 that the document clustering indexes ARI and NMI of the method are higher than NMTF and WCNMTF, which indicates that the result obtained by the method is superior to other models in document clustering, i.e. the probability of the method dividing documents with the same theme into the same class is higher than that of other models.
Example 3
Referring to fig. 2, the embodiment provides a text non-negative matrix tri-decomposition joint clustering system based on word co-occurrence, which includes a data processing module, a feature construction module and a clustering module.
In the specific implementation process, the data processing module carries out preprocessing including word segmentation, word stop removal, high-frequency word removal, low-frequency word removal, punctuation removal and digit removal on the text data set. The feature construction module constructs a text feature matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set. The clustering module constructs a non-negative matrix three-decomposition combined clustering model, inputs a text characteristic matrix, a word co-occurrence matrix and a word importance matrix into the non-negative matrix three-decomposition combined clustering model for clustering, and introduces a global word vector to represent a regular term to perform iterative update on the non-negative matrix three-decomposition combined clustering model to obtain a document clustering matrix and a word clustering matrix.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. The text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence is characterized by comprising the following steps of:
s1: acquiring a text data set, and preprocessing the text data set;
s2: constructing a text characteristic matrix, a word co-occurrence matrix and a word importance matrix based on the preprocessed text data set;
s3: and constructing a non-negative matrix tri-decomposition combined clustering model, inputting the text characteristic matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix tri-decomposition combined clustering model for clustering, and introducing a global word vector to represent a regular item to perform iterative update on the non-negative matrix tri-decomposition combined clustering model to obtain a document clustering matrix and a word clustering matrix.
2. The text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence according to claim 1, wherein the text feature matrix X is tf-idf feature matrix, and the formula is as follows:
Xij=tfi,j×idfi
wherein, tfi,jIndicating the number of times the word i appears in document j,
Figure FDA0003358054760000011
where N denotes the number of documents of the preprocessed text data set, dfxIs the number of documents in the data set that contain the word i.
3. The method according to claim 1, wherein the word importance matrix H comprises an element HijThe construction method of (2) is as follows:
Figure FDA0003358054760000012
wherein α represents a weight index, CijRepresenting the number of times that the word i and the word j appear in the same window, CmaxThe value representing the largest element in the word co-occurrence matrix.
4. The text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence according to claim 1, wherein S3 specifically comprises the following steps:
s3.1: constructing a non-negative matrix three-decomposition joint clustering model, and initializing parameters of the non-negative matrix three-decomposition joint clustering model;
s3.2: inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into a non-negative matrix tri-decomposition joint clustering model to obtain a document theme matrix, a word theme matrix and a background word theme matrix;
s3.3: introducing a global word vector to represent a regular term, and performing a non-negative matrix three-decomposition joint clustering model by using a multiplication updating method to obtain a document clustering matrix and a word clustering matrix.
5. The text non-negative matrix tri-decomposition joint clustering method based on word co-occurrence according to claim 4, wherein S3.4 specifically comprises the following steps:
s3.4.1: using a multiplicative updating type iterative updating non-negative matrix three-decomposition combined clustering model to obtain a document clustering matrix, a word clustering matrix and a background word clustering matrix;
s3.4.2: and adding and averaging the word clustering matrix and the background word clustering matrix to obtain a final word clustering matrix.
6. The method of claim 5, wherein the updating the objective function of the non-negative matrix tri-decomposition co-clustering model using multiplicative update iteration is:
Figure FDA0003358054760000021
wherein X is a document-word representation matrix of the text, Z is a document clustering matrix, W is a word clustering matrix,
Figure FDA0003358054760000022
a background word clustering matrix, S is an intermediate matrix, a piece is a word importance matrix, C is a word co-occurrence matrix, and eta is a parameter of a global word vector representing a regular term;
Figure FDA0003358054760000023
Figure FDA0003358054760000024
to represent the regular term for the incoming global word vector,
Figure FDA0003358054760000025
and decomposing the matrix for the newly added non-negative matrix three-component matrix based on the background words.
7. The word co-occurrence based text non-negative matrix tri-decomposition joint clustering method according to claim 6, wherein the multiplicative updated formula is represented as follows:
Figure FDA0003358054760000026
Figure FDA0003358054760000027
Figure FDA0003358054760000031
Figure FDA0003358054760000032
wherein Z isk-1Representing the document topic matrix before iteration, ZkRepresenting the document theme matrix after iteration; sk_1Representing the intermediate matrix before iteration, SkRepresenting the intermediate matrix after iteration; wk_1Representing the word topic matrix before iteration, WkRepresenting the iterated word topic matrix;
Figure FDA0003358054760000033
representing a background word topic matrix, W, before iterationkRepresenting the iterated word topic matrix.
8. The method according to claim 1, wherein the preprocessing of the text data set comprises word segmentation, word deactivation, high frequency word removal, low frequency word removal, punctuation removal and digit removal.
9. A text non-negative matrix tri-decomposition joint clustering system based on word co-occurrence is characterized by comprising the following steps:
the data processing module is used for acquiring a text data set and preprocessing the text data set;
the characteristic construction module is used for constructing text characteristics in the text data set;
and the clustering module is used for constructing a non-negative matrix three-component solution joint clustering model, inputting the text feature matrix, the word co-occurrence matrix and the word importance matrix into the non-negative matrix three-component solution joint clustering model for clustering, and introducing a global word vector to represent a regular term to perform iterative update on the non-negative matrix three-component solution joint clustering model to obtain a document clustering matrix and a word clustering matrix.
10. The word co-occurrence based text non-negative matrix tri-decomposition joint clustering system according to claim 9, wherein the text features comprise a text feature matrix, a word co-occurrence matrix, and a word importance matrix.
CN202111358208.7A 2021-11-16 2021-11-16 Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence Pending CN114003723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111358208.7A CN114003723A (en) 2021-11-16 2021-11-16 Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111358208.7A CN114003723A (en) 2021-11-16 2021-11-16 Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence

Publications (1)

Publication Number Publication Date
CN114003723A true CN114003723A (en) 2022-02-01

Family

ID=79929228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111358208.7A Pending CN114003723A (en) 2021-11-16 2021-11-16 Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence

Country Status (1)

Country Link
CN (1) CN114003723A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599917A (en) * 2022-10-31 2023-01-13 盐城工学院(Cn) Text double-clustering method based on improved bat algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
CN109960730A (en) * 2019-04-19 2019-07-02 广东工业大学 A method, device and device for short text classification based on feature extension

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
CN109960730A (en) * 2019-04-19 2019-07-02 广东工业大学 A method, device and device for short text classification based on feature extension

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SALAH 等: "Word co-occurrence regularized non-negative matrix tri-factorization for text data co-clustering", THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 32, no. 1, 29 April 2018 (2018-04-29), pages 3992 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599917A (en) * 2022-10-31 2023-01-13 盐城工学院(Cn) Text double-clustering method based on improved bat algorithm

Similar Documents

Publication Publication Date Title
Ivanov et al. Anonymous walk embeddings
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
US9984147B2 (en) System and method for probabilistic relational clustering
CN101079026B (en) Text similarity, acceptation similarity calculating method and system and application system
CN111680089B (en) Text structuring method, device and system and non-volatile storage medium
CN106484797B (en) Sparse learning-based emergency abstract extraction method
CN111666350B (en) Medical text relation extraction method based on BERT model
CN109960763A (en) A method for personalized friend recommendation in photography community based on user's fine-grained photography preference
CN111104801B (en) Text segmentation method, system, equipment and media based on website domain name
CN109508697B (en) Face recognition method, system and storage medium based on semi-nonnegative matrix factorization of E auxiliary function
Torres et al. A similarity measure for clustering and its applications
CN112364141A (en) Scientific literature key content potential association mining method based on graph neural network
CN104166684A (en) Cross-media retrieval method based on uniform sparse representation
CN114462392A (en) Short text feature expansion method based on topic relevance and keyword association
Jiang et al. Learningword embeddings for low-resource languages by pu learning
CN102722578B (en) Unsupervised cluster characteristic selection method based on Laplace regularization
CN112836014A (en) A multi-field and interdisciplinary expert selection method
CN114003723A (en) Text non-negative matrix tri-factorization joint clustering method and system based on word co-occurrence
CN111274537B (en) Document representation method based on punishment matrix decomposition
Zhang et al. Text classification of public feedbacks using convolutional neural network based on differential evolution algorithm
CN114265943A (en) A method and system for extracting causal relation event pairs
CN111899832A (en) Medical theme management system and method based on context semantic analysis
CN112149405B (en) A feature extraction method for program compilation error information based on convolutional neural network
CN112084298B (en) Public opinion topic processing method and device based on fast BTM
CN111310066B (en) A friend recommendation method and system based on topic model and association rule algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination