
CN103488662A - Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit - Google Patents


Info

Publication number
CN103488662A
CN103488662A
Authority
CN
China
Prior art keywords
parallel
neuron
clustering
processing unit
graphics processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310112420.4A
Other languages
Chinese (zh)
Inventor
叶允明
张金超
黄晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN201310112420.4A priority Critical patent/CN103488662A/en
Publication of CN103488662A publication Critical patent/CN103488662A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to a parallelized self-organizing map (SOM) neural network clustering method and system based on a graphics processing unit (GPU). Compared with traditional serial clustering methods, the invention achieves faster clustering of large-scale data through algorithm parallelization and a GPU-based parallel acceleration system. The invention mainly covers two aspects: (1) first, targeting the GPU's high parallel computing capability, a parallelized SOM clustering method is designed, which obtains the word frequency matrix by counting document keyword frequencies in parallel, generates the feature matrix of the data set by computing text feature vectors in parallel, and obtains the cluster structure of massive data objects through parallelized SOM clustering; (2) second, exploiting the complementary computing strengths of the graphics processing unit (GPU) and the central processing unit (CPU), a parallelized text clustering system based on a CPU/GPU cooperation framework is designed.

Description

Self-organizing map neural network clustering method and system based on a graphics processing unit

Technical Field

The present invention relates to a parallelized self-organizing map neural network clustering method and system, and in particular to a parallelized self-organizing map neural network clustering method and system based on a graphics processing unit.

Background Art

At present, with the popularization of computers, the number of Internet users keeps growing, and these users generate a huge volume of information on the network every day. Likewise, social media systems with large user bases accumulate large amounts of new data daily. Data mining and machine learning algorithms provide feasible means of extracting valuable information from these data, but most such algorithms have complex, iterative learning procedures and take a long time to process massive data. By the time useful information is extracted, it may no longer be timely, which calls for faster algorithms or higher-performance computing devices. High-performance machines or CPU clusters can certainly accelerate the computation, but they require enterprises to bear huge capital investment. Multi-core technology is now relatively mature, and the numerical throughput of the graphics processing unit (GPU) far exceeds that of the CPU; exploiting the GPU's many-core architecture to uncover the parallelism in algorithms has therefore become a research hotspot in computer science.

In the field of data mining, several algorithms have already been adapted to run on GPU devices, achieving speedups of at least 5-6x, and in some cases 20-30x. An important research direction in data mining is the mining of text data, in which text clustering plays a central role. Clustering groups data into different text clusters according to the features of the data and the degree of similarity between items. According to statistics, 80% of the information in human society exists in textual form. Text clustering technology makes it possible to organize, summarize, and navigate text data effectively.

The SOM (Self-Organizing Map) network is an artificial neural network designed by imitating how the human brain processes external information. It is an unsupervised learning method that is well suited to clustering high-dimensional text data. An SOM network does not require the user to specify the number of clusters; the network clusters adaptively during training, is insensitive to outlier noise, and therefore has strong noise resistance. SOM clusters according to the distribution of the training samples and is insensitive to the shape of the data. However, when processing high-dimensional data, existing SOM algorithms suffer from slow network convergence and long clustering times.

Text clustering is a data mining technique that partitions text document resources into several clusters according to a specified similarity criterion, so that documents within a cluster are as similar as possible while the similarity between different clusters is as small as possible. It rests on the well-known clustering hypothesis: documents of the same class are highly similar, while documents of different classes are not. As an unsupervised machine learning method, clustering needs neither a prior training phase nor manually labeled categories, so it offers flexibility and a high degree of automation. It has become an important means of effectively organizing, summarizing, and navigating text information and is attracting the attention of more and more researchers.

Summary of the Invention

The technical problem solved by the present invention is: to construct a parallelized self-organizing map neural network clustering method and system based on a graphics processing unit (Graphics Processing Unit, "GPU"), overcoming the slow computation caused by large data volumes in the text clustering process of the prior art.

The technical scheme of the present invention provides a GPU-based parallel self-organizing map neural network clustering method comprising the following steps:

Parallel keyword frequency counting: segment the text content into words to obtain the keyword set, then count keyword frequencies over the documents in parallel to obtain the word frequency matrix;

Parallel feature vector computation: convert the keyword frequency matrix into the corresponding feature vector matrix, each feature vector representing one document;

Parallel SOM clustering: design the SOM network structure according to the feature vector matrix and initialize the network; compute in parallel the distances between an input sample and the weight vectors of all output neurons; compare these distances to find the best-matching neuron J with the minimal distance; update the best neuron, the weight vectors of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron; then compute the network error rate E_t in parallel on the GPU. If the network error rate E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, SOM training ends; otherwise a new training round begins. Each learning step moves the neighborhood of the best-matching neuron toward the input data vector, so that input feature vectors with small mutual distances gather into the same cluster; the resulting set of clusters is the final clustering result.
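The training loop described in this step can be sketched sequentially in Python as follows. This is a minimal illustration under assumed defaults (grid size, decay schedules, Euclidean distance, a Manhattan grid neighborhood), not the patent's GPU implementation; in the parallel version, each per-neuron distance and error computation would run in its own GPU thread.

```python
import math
import random

def train_som(samples, rows=2, cols=2, max_iters=50, target_error=1e-4, seed=0):
    """Sequential sketch of the parallel SOM training loop described above.
    On the GPU, the distance from the input to every neuron's weight vector
    is computed by one thread per neuron; here we loop over neurons instead."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # One weight vector per output neuron on a rows x cols grid.
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for t in range(max_iters):
        old = [w[:] for w in weights]
        lr = 0.5 * (1.0 - t / max_iters)                       # decaying learning rate
        radius = max(1, int((rows + cols) / 2 * (1.0 - t / max_iters)))
        for x in samples:
            # Best-matching neuron J: the neuron with minimal distance to x.
            j = min(range(len(weights)), key=lambda k: dist(x, weights[k]))
            jr, jc = divmod(j, cols)
            for k, w in enumerate(weights):
                kr, kc = divmod(k, cols)
                if abs(kr - jr) + abs(kc - jc) <= radius:      # inside neighborhood
                    for d in range(dim):
                        w[d] += lr * (x[d] - w[d])
        # Network error E_t: total weight change between two consecutive iterations.
        err = sum(dist(w, o) for w, o in zip(weights, old))
        if err <= target_error:
            break

    # Cluster assignment: each sample joins the cluster of its best neuron.
    return {i: min(range(len(weights)), key=lambda k: dist(x, weights[k]))
            for i, x in enumerate(samples)}

clusters = train_som([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
```

Each call of `dist` in the best-matching-neuron search and in the error reduction corresponds to one thread's work in the GPU version; the termination test mirrors the E_t <= ε or t >= T condition above.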

A further technical feature of the invention is: since the keyword frequency counting of each document is independent of the others, the invention assigns one thread per document and performs the counting in parallel with the GPU's many threads.

A further technical feature of the invention is: the feature vector computation of each document is likewise independent, so the invention assigns one thread per document and the computations execute concurrently on the GPU. The feature vector is computed with the formula

x_ij = log2(tf_ij + 1.0) * log(m / m_j),

and normalized as x_ij = x_ij / √( Σ_{p=1}^{n} x_ip² ).

In these formulas, x_ij is the feature vector value of the j-th feature word in document d_i, tf_ij is the number of occurrences of the j-th feature word in document d_i, m/m_j is the inverse document frequency of the j-th feature word, m is the total number of documents, and m_j is the number of documents containing the j-th feature word.
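As a concrete illustration of the two formulas above, the following sequential Python sketch computes the LTC weights and L2-normalizes each document row. The function and variable names are ours, not the patent's; the GPU version performs these loops with one thread per document.

```python
import math

def ltc_feature_matrix(tf):
    """Compute LTC weights x_ij = log2(tf_ij + 1) * log(m / m_j), then
    L2-normalize each document row, following the formulas above."""
    m = len(tf)                       # total number of documents
    n = len(tf[0])                    # vocabulary size
    # m_j: number of documents containing the j-th feature word.
    mj = [sum(1 for i in range(m) if tf[i][j] > 0) for j in range(n)]
    x = [[math.log2(tf[i][j] + 1.0) * math.log(m / mj[j]) if mj[j] else 0.0
          for j in range(n)] for i in range(m)]
    for row in x:                     # normalize so document length cancels out
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 0:
            for j in range(n):
                row[j] /= norm
    return x
```

Note that a word occurring in every document gets log(m/m_j) = 0 and hence weight 0, which is the intended idf behaviour.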

A further technical feature of the invention is: in the parallel feature vector computation step, GPU-based multithreading computes the feature vector of every document in parallel.

A further technical feature of the invention is: the computations of the distances between the input feature vector and the weight vectors of the individual output neurons are mutually independent, so multiple GPU threads compute these distances in parallel; the system opens one thread per neuron.

A further technical feature of the invention is: the computation of each neuron's weight vector error between two consecutive iterations is likewise independent, so multiple GPU threads compute these errors in parallel; the system opens one thread per neuron.

The technical scheme of the present invention also constructs a GPU-based self-organizing map neural network clustering system comprising a hardware part and a software part. Hardware part: a CPU/GPU cooperation framework in which serial code runs on the CPU, parallel code runs on the GPU, and data is exchanged between device memory and host memory through the transfer mechanisms provided by the GPU. Software part: three modules, namely a parallelized keyword frequency counting module, a parallelized feature vector computation module, and a parallelized SOM clustering module. The parallelized keyword frequency counting module segments the text content into words to obtain the keyword set and counts keyword frequencies over the documents in parallel to obtain the word frequency matrix. The parallelized feature vector computation module converts the keyword frequency matrix into the corresponding feature vector matrix, each feature vector representing one document. The parallelized SOM clustering module designs the SOM network structure according to the feature vector matrix, initializes the network, computes in parallel the distances between an input sample and the weight vectors of all output neurons, compares these distances to find the best-matching neuron J with the minimal distance, updates the best neuron, the weight vectors of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron, and then computes the network error rate E_t in parallel on the GPU. If the network error rate E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, SOM training ends; otherwise a new training round begins. Each learning step moves the neighborhood of the best-matching neuron toward the input data vector, so that input feature vectors with small mutual distances gather into the same cluster; the resulting set of clusters is the final clustering result.

A further technical feature of the invention is: several kernel functions are designed in each of the parallelized keyword frequency counting module, the parallelized feature vector computation module, and the parallelized SOM clustering module to accelerate the algorithm in parallel.

A further technical feature of the invention is: in the parallel keyword frequency counting module, one kernel function is designed for keyword frequency counting; in the parallel feature vector computation module, two kernel functions are designed for feature vector computation and two for feature vector normalization.

A further technical feature of the invention is: in the parallel SOM clustering module, three kernel functions are designed: one that computes the distances between the input feature vector and the output neurons, one that computes each neuron's network weight vector error between two consecutive iterations, and one that reduces the errors of the network weight vectors.

The technical effect of the present invention is: the invention provides a GPU-based parallel self-organizing map neural network clustering method and system. By designing a parallelized text clustering algorithm and exploiting the complementary computing strengths of the graphics processing unit (GPU) and the central processing unit (CPU), a parallelized text clustering system based on a CPU/GPU cooperation framework is built. Specifically, the invention contains two parts. First, a GPU-based parallelized self-organizing neural network clustering method is designed, in which three stages are parallelized: document keyword frequency counting, document feature vector computation, and the SOM clustering algorithm. Second, a parallelized text clustering system based on the CPU/GPU cooperation framework is developed, containing three computation modules: the parallelized keyword frequency counting module, the parallelized feature vector computation module, and the parallelized SOM clustering module, each with several kernel functions that accelerate the algorithm. Through algorithm parallelization and the GPU-based parallel acceleration system, the invention clusters large-scale data faster and is well suited to clustering problems involving high-dimensional data such as text.

Brief Description of the Drawings

Fig. 1 is a flowchart of the present invention.

Fig. 2 is a schematic diagram of multi-threaded word frequency counting in the present invention.

Fig. 3 is a schematic diagram of serially counting the number of documents containing each keyword.

Fig. 4 is a diagram of the parallel feature matrix computation process of the present invention.

Fig. 5 is a schematic diagram of counting in parallel the number of documents containing each keyword in the present invention.

Fig. 6 is a schematic diagram of multi-threaded feature matrix computation in the present invention.

Fig. 7 is a schematic diagram of multi-threaded vector norm computation in the present invention.

Fig. 8 is a schematic diagram of multi-threaded normalization in the present invention.

Fig. 9 shows the topology of the SOM network of the present invention.

Fig. 10 is the CPU/GPU hardware architecture diagram of the present invention.

Fig. 11 is a schematic diagram of the CPU/GPU cooperation framework of the parallel SOM algorithm of the present invention.

Fig. 12 is the flowchart of the kernel function that computes the document word frequency matrix in parallel.

Fig. 13 is the flowchart of the kernel function that counts in parallel the number of documents containing each keyword.

Fig. 14 is a schematic diagram of computing the distances between an input feature vector and the neurons.

Fig. 15 is the flowchart of the kernel function that computes the distances between an input feature vector and the neurons.

Fig. 16 is a schematic diagram of the element-wise difference operation on the data.

Fig. 17 is a schematic diagram of summing the error matrix by rows or columns.

Fig. 18 is the flowchart of the kernel function that sums the error matrix by rows or columns.

Detailed Description of Embodiments

The technical solutions of the present invention are further described below in conjunction with specific embodiments.

As shown in Fig. 1, a specific embodiment of the present invention provides a GPU-based parallelized self-organizing map neural network clustering method comprising the following steps:

Step 1: parallel keyword frequency counting. Segment the text content into words and obtain the keyword set; for large-scale text data, the massive computing units of the graphics processing device can provide one thread per text document, counting keyword frequencies in the documents in parallel to obtain the word frequency matrix.

The specific implementation is as follows. A computer does not possess human intelligence: after reading an article, a person forms an approximate understanding of its content according to their own comprehension, whereas a computer cannot easily "read" an article; fundamentally it only recognizes 0 and 1, so text must be converted into a format the computer can process. In information processing, text is mainly represented with the Vector Space Model (VSM). The basic idea of the VSM is to represent a text as a vector (X_1, X_2, ..., X_n), where X_j is the weight of the j-th feature item. What should serve as feature items? Characters or words may be chosen; experimental results generally show that words work better than characters or phrases. Therefore, to represent a text as a word-based vector space model, the text must first be segmented into words, and the segmented words serve as the dimensions of the vector space. After word segmentation of the document content, denoising is applied to obtain the keyword set corresponding to the text.

Parallel keyword frequency counting segments and denoises each document, counts how often each keyword appears in that document, and then forms the word frequency matrix of the whole data set. Since the counting for each document is independent of the others, one GPU thread can be opened per document, achieving a high degree of parallelism, as shown in Fig. 2. A row of the word frequency matrix represents a document, a column represents a keyword, and the entry at a row-column intersection is the occurrence frequency of that keyword in that document. If a keyword does not appear in a document, the corresponding entry in the word frequency matrix is set to zero.
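The per-document counting just described can be sketched sequentially in Python; the function name and the assumption that documents arrive already segmented into keyword lists are ours, and in the GPU version the body of the outer loop is executed by one thread per document.

```python
def term_frequency_matrix(docs, vocabulary):
    """Build the word frequency matrix described above: one row per document,
    one column per keyword, entry = occurrences of that keyword in that
    document. On the GPU each document's row is filled by its own thread;
    here a plain loop over documents stands in for the thread grid."""
    index = {word: j for j, word in enumerate(vocabulary)}
    tf = [[0] * len(vocabulary) for _ in docs]
    for i, doc in enumerate(docs):      # each iteration == one GPU thread
        for word in doc:                # doc is already segmented and denoised
            j = index.get(word)
            if j is not None:           # entries stay zero for absent keywords
                tf[i][j] += 1
    return tf
```

Because no thread writes to another document's row, the parallel version needs no synchronization between threads, which is what makes the one-thread-per-document mapping attractive.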

Step 2: parallel feature vector computation. Through parallel computation on the GPU, convert the keyword frequency matrix into the corresponding feature vector matrix; each feature vector represents one text document.

The specific implementation is as follows. Based on the keyword frequency matrix, the feature vectors of the documents are computed in parallel to generate the feature vector matrix: a row represents a document, a column represents a feature, and the entry at a row-column intersection is the value of that feature for that document.

The present invention describes documents with the vector space model, which considers only the words appearing in a document, not the relations between words or the structure of the document. In the VSM, the document space is viewed as a vector space spanned by feature vectors: a document is one feature vector in the space, and the feature words of the documents can be seen as the dimensions of the space. The feature vector is written d_i = (x_i1, x_i2, ..., x_in), where x_ij is the weight of word j in document d_i. A crude weighting scheme uses a Boolean value, 0 or 1, to indicate whether a feature word occurs in a document. tf*idf (term frequency * inverse document frequency) is a commonly used document feature weighting method; it considers the term frequency tf of a feature word, its inverse document frequency idf, and a normalization factor. To ensure clustering quality, the present invention uses the LTC weight as the weight of a feature word, given by the following formula:

x_ij = log2(tf_ij + 1.0) * log(m / m_j).

In this formula, x_ij is the feature vector value of the j-th feature word in document d_i, tf_ij is the number of occurrences of the j-th feature word in document d_i, m/m_j is the inverse document frequency of the j-th feature word, m is the total number of documents, and m_j is the number of documents containing the j-th feature word.

The LTC weight formula builds on tf*idf by taking the logarithm of the term frequency tf, further reducing the influence of tf on the feature vector, which makes the formula more reasonable in practice. At the same time, since differing document lengths affect the vector weights, the weights computed by the formula must be normalized:

x_ij = x_ij / √( Σ_{p=1}^{n} x_ip² ).

For high-dimensional text data, computing every dimension of the text feature vectors is very time-consuming, so the present invention designs a parallel weight computation method to accelerate the process. When computing the LTC weight, the value m_j, i.e., the number of documents in which a keyword appears, is not yet available, and on high-dimensional large-scale data sets the traditional serial counting is very slow, as shown in Fig. 3. After m_j is obtained, a weight must be computed for every feature word of every document; because the dimensionality of text data is very high, a traditional serial algorithm would again have high time complexity. After the weight vectors are obtained, normalizing them requires computing the length of every vector and then dividing each weight by it, which also takes considerable time. The present invention therefore parallelizes these three time-consuming parts of the feature vector weight computation in the GPU environment; the parallel feature vector computation process is shown in Fig. 4.

(1) Counting in parallel, with multiple threads, the document frequency m_j of each keyword in the document collection

For each feature word, the corresponding value of m_j must be counted for use in the subsequent weight computation. We convert this problem into counting the non-zero values in each column of the term-frequency matrix, computed as follows:

$$m_j = \sum_{i=1}^{m} x_{ij}, \qquad \text{where } x_{ij} = \begin{cases} 0, & tf_{ij} = 0, \\ 1, & tf_{ij} > 0, \end{cases}$$

where m is the number of documents in the data set. The formula above counts m_j: when the term frequency is zero, the feature word does not appear in the corresponding document and is not counted; when the term frequency is greater than zero, the word appears in the document and the m_j count is incremented by one. Since the m_j value of each feature word is counted independently, the matrix-form data admits a multi-threaded implementation that fully exploits the parallel processing capability of the graphics processing unit to accelerate this step. Figure 5 is a schematic diagram of the parallel count of the number of documents containing each keyword.
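As an illustration of this column-wise count, the following is a minimal serial Python sketch in which the loop over columns stands in for the per-column GPU threads; all names are illustrative, not taken from the patent.

```python
# Count, for each term (column j), the number of documents (rows) in which
# it appears: m_j = number of rows i with tf[i][j] > 0.
# On the GPU each column would be handled by one thread; the comprehension
# over j plays that role here.
def document_frequencies(tf):
    m = len(tf)        # number of documents
    n = len(tf[0])     # number of feature words
    return [sum(1 for i in range(m) if tf[i][j] > 0) for j in range(n)]

tf = [
    [2, 0, 1],
    [0, 0, 3],
    [1, 1, 0],
]
print(document_frequencies(tf))  # [2, 1, 2]
```

Because each column's count is independent, the GPU version needs no synchronization between threads, which is exactly why the patent maps one thread per feature word.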

(2) Multi-threaded parallel computation of the feature matrix

In this stage, the input is the term-frequency matrix together with the values of m_j, and the output is the feature matrix, as shown in Figure 6. The feature matrix is computed as follows:

$$x_{ij} = \log_2(tf_{ij} + 1.0) \cdot \log(m / m_j).$$

Figure 6 is a schematic diagram of the multi-threaded parallel computation of the feature matrix. If there are m documents and the feature-word dimensionality is n, the formula is evaluated m*n times. For application scenarios with a large document collection and high text dimensionality, the feature-matrix computation is very expensive, so a parallel multi-threaded execution method is designed to accelerate this operation. Since the computation of each feature vector is independent of the others, each thread is responsible for one feature vector, and multiple threads execute concurrently to improve the computation speed of the feature matrix.
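A serial sketch of this per-element formula (the outer loop over documents stands in for the per-vector threads; function names and the small example are ours):

```python
import math

# x_ij = log2(tf_ij + 1.0) * log(m / m_j), evaluated for every (i, j).
# Terms that appear in no document (m_j = 0) are given weight 0 to avoid
# division by zero; this guard is our addition, not stated in the patent.
def feature_matrix(tf, mj):
    m, n = len(tf), len(tf[0])
    return [[math.log2(tf[i][j] + 1.0) * math.log(m / mj[j]) if mj[j] > 0 else 0.0
             for j in range(n)]
            for i in range(m)]

tf = [[2, 0, 1], [0, 0, 3], [1, 1, 0]]
mj = [2, 1, 2]            # document frequencies of the three terms
X = feature_matrix(tf, mj)
```

Note that a zero term frequency yields log2(1.0) = 0, so absent terms automatically get weight zero, matching the sparsity of the term-frequency matrix.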

(3) Multi-threaded feature matrix normalization

After the feature-matrix computation above, a feature value is obtained for every keyword of every document. Because document lengths differ, the feature vector of a long document may strongly dominate those of other documents, so the feature vectors are normalized to balance them. The normalization formula is as follows:

$$x_{ij} = \frac{x_{ij}}{\sqrt{\sum_{p=1}^{n} x_{ip}^{2}}}.$$

This formula divides the weight of each feature word in a document by the modulus (Euclidean norm) of that document's feature vector, balancing the vector. The present invention implements the normalization with two kernel functions on the graphics processing unit: one kernel computes the modulus of each vector, and the other performs the normalization of the weights.

Figure 7 is a schematic diagram of the multi-threaded computation of the vector moduli. Each dotted line in the figure represents one thread, whose task is to sum the squares of the elements in one row of the feature matrix, yielding the squared modulus of one document's feature vector. Once all threads finish, the squared modulus of every document feature vector is available for the subsequent normalization. Without multi-threaded acceleration on the graphics processing unit, the time complexity of the modulus computation is O(m*n), where m is the number of documents and n the dimensionality of the document feature vectors. The improved algorithm, running in parallel on the GPU, has time complexity O(n).

Once the modulus of each document's feature vector is obtained, the weight matrix itself must be normalized: every element of the matrix is divided by the modulus of the corresponding document feature vector. This kind of matrix problem is likewise well suited to acceleration by the multi-threading of the graphics processing unit. Figure 8 is a schematic diagram of the multi-threaded normalization; each dotted line in Figure 8 represents one thread, and once all threads finish, the final normalized document feature vectors are obtained for the subsequent clustering. Without multi-threaded acceleration on the graphics processing unit, the time complexity of this step is O(m*n), where m is the number of documents and n the dimensionality of the document feature vectors. The improved algorithm, running in parallel on the graphics processing unit, has time complexity O(1).
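The two-kernel normalization can be sketched serially as two functions, one per kernel (the row loop stands in for the m threads of the first kernel, and the element loop for the m*n threads of the second; the zero-norm guard is our addition):

```python
import math

# Kernel 1 analogue: one "thread" per document row computes that row's
# modulus (Euclidean norm).
def row_norms(X):
    return [math.sqrt(sum(v * v for v in row)) for row in X]

# Kernel 2 analogue: one "thread" per matrix element divides it by its
# row's modulus, yielding unit-length feature vectors.
def normalize(X, norms):
    return [[v / norms[i] if norms[i] > 0.0 else 0.0 for v in row]
            for i, row in enumerate(X)]

X = [[3.0, 4.0], [0.0, 2.0]]
N = normalize(X, row_norms(X))
print(N)  # [[0.6, 0.8], [0.0, 1.0]]
```

Splitting the work into two kernels mirrors the data dependency: every element of a row needs the whole row's norm, so the norms must be complete before the second kernel launches.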

Step 3: Parallel SOM text clustering. The SOM network structure is designed according to the feature-vector matrix and the SOM network is initialized. The distances between an input sample and the weight vectors of all output neurons are computed in parallel on the graphics processing unit; the distances are compared, and the best neuron, the one at minimum distance, is obtained. The weight vectors of the best neuron and of the neurons in its neighborhood are updated, along with the learning rate and the neighborhood size of the best neuron J; the network error rate E_t is then computed in parallel on the graphics processing unit. If E_t <= the target error ε, or the iteration count t >= the maximum number of training iterations T, SOM training ends; otherwise a new round of training begins. Each learning step moves the neighborhood of the best matching neuron closer to the input data vector, so input feature vectors that are close to one another are gathered into the same cluster; the resulting set of clusters is the final clustering result.

The specific implementation process is as follows:

The SOM network is a neural network model that simulates the human brain. Its most important property is self-organization: in response to external input, the neurons in the network automatically adjust their connection weights, and the corresponding neurons cluster to form a response to the input. The SOM network simulates this self-organizing property of brain cells to achieve clustering, recognition, ordering, topology-preserving mapping, and so on. A neuron node in the SOM network can accept input data from other neurons and can also output data to other neurons. After training, the SOM network forms its own conceptual patterns for the external input patterns. SOM is well suited to processing nonlinear, uncertain data.

The SOM network is an unsupervised learning network consisting of two layers: an input layer and an output layer. The input layer computes the distance between the input vector and the weight vectors; since this distance reflects the degree of matching, the input layer is also called the matching layer. The output layer is also called the competition layer: the neurons compete according to their degree of matching, and the best-matching neuron is called the winning neuron. The weight vectors of the winning neuron and of the neurons in its neighborhood are updated toward the input vector; after many iterations of competition and updating, a stable network is formed in which the neurons hold the corresponding weight vectors. The trained network can then be used for clustering, spatial mapping, and similar operations. Training the SOM network is a self-organizing learning process divided into two parts: selecting the best matching neuron and updating the network weight vectors. In a common SOM network the input-layer neurons are arranged in one dimension and the output-layer neurons in two dimensions; the topology is shown in Figure 9.

The SOM network learns by self-organizing mapping, an unsupervised form of learning: each learning step moves the neighborhood of the best matching neuron closer to the input data vector, so inputs that are close together are gathered into clusters. After large-scale training, the connection weights between neurons represent the characteristics of the input patterns, and merging input vectors with similar patterns into one class completes the automatic clustering of the SOM network. The core of the SOM algorithm is the selection of the best matching neuron and the update of the weights in its neighborhood. The best matching neuron is the one at minimum distance from the input vector among all neurons; the self-organizing adjustment of the connection weights updates the weight of each neuron in the neighborhood of the best matching neuron according to it. Each time the SOM network learns, it performs a self-organizing adaptation to the input vector, strengthening the mapping of the new matching pattern and weakening the old one. In the traditional serial SOM clustering algorithm, two steps consume 80% of the running time: (1) computing the distances between the input sample and all output neuron weight vectors, comparing them, and obtaining the best neuron at minimum distance; (2) computing the network error rate E_t between two consecutive iterations. For these two steps, the present invention designs a parallel SOM clustering algorithm based on the graphics processing unit. The logical flow of the parallel SOM algorithm is as follows:

Step 1: Suppose there are m input samples, each of dimension n. Design the SOM network structure: the input layer has n neurons and the output layer has k neurons; set the maximum number of training iterations T and the target error ε.

Step 2: Initialize the SOM network, including the connection weight vectors W_0 = (W_{10}, W_{20}, ..., W_{k0}), the learning rate α_i(0) ∈ (0,1), the neighborhood sizes N_i(0), i ∈ {1, 2, ..., n}, and the iteration counter t = 1.

Step 3: Compute in parallel the distance d_j between the input sample X_i and the weight vector W_j of every output neuron:

$$d_j = \left\| X_i - W_{j,t-1} \right\| = \sqrt{\sum_{p=1}^{n} \left( x_{ip} - w_{jp,t-1} \right)^2}. \qquad (2\text{-}1)$$

Step 4: Compare the distances; the neuron with the smallest distance is the best neuron J.

Step 5: Update the weight vectors of the best neuron J and of the neurons within its neighborhood N_j(t−1):

Wj,t=Wj,t-1i(t-1)(Xi-Wj,t-1)。    (2-2)W j,t =W j,t-1i (t-1)(X i -W j,t-1 ). (2-2)

Step 6: Update the learning rate α_i(t) and the neighborhood size N_j(t) of the best neuron J. The learning rate is updated as follows:

$$\alpha_i(t) = \alpha_i(0)\left( 1 - t/T \right). \qquad (2\text{-}3)$$

Let the coordinates of competition-layer neuron g in the two-dimensional grid be (X_g, Y_g). The neighborhood is then a square region whose upper-right vertex is (X_g + N_j(t), Y_g + N_j(t)) and whose lower-left vertex is (X_g − N_j(t), Y_g − N_j(t)). The neighborhood is updated as follows:

$$N_j(t) = \mathrm{INT}\left[ N_j(0) \cdot \left( 1 - t/T \right) \right], \qquad (2\text{-}4)$$

where INT[X] denotes taking the integer part of X.
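The two decay schedules (2-3) and (2-4) can be written out directly; the parameter values in the demonstration below are illustrative only.

```python
# Learning-rate decay, formula (2-3): alpha_i(t) = alpha_i(0) * (1 - t/T).
def learning_rate(t, T, alpha0):
    return alpha0 * (1 - t / T)

# Neighborhood decay, formula (2-4): N_j(t) = INT[N_j(0) * (1 - t/T)],
# where INT[.] takes the integer part.
def neighborhood_size(t, T, N0):
    return int(N0 * (1 - t / T))

print(learning_rate(5, 10, 0.8))    # 0.4
print(neighborhood_size(5, 10, 3))  # 1
```

Both schedules reach zero at t = T, so late iterations make only small adjustments to an ever-shrinking neighborhood, which is what lets the network settle into a stable state.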

Step 7: Compute in parallel the error between two consecutive iterations of the neurons, as follows:

$$E_t = \left\| W_t - W_{t-1} \right\| = \sum_{i=1}^{k} \left\| W_{i,t} - W_{i,t-1} \right\|. \qquad (2\text{-}5)$$

Step 8: If E_t <= ε or t >= T, i.e. the SOM network has converged to the expected error rate or the maximum number of iterations has been reached, SOM training ends; otherwise return to Step 3 for a new round of training. Each learning step moves the neuron weights in the neighborhood of the best matching neuron closer to the input data vector, gathering input feature vectors that are close together into text clusters.
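Steps 1 through 8 can be condensed into a minimal serial sketch. This is a simplification under stated assumptions: the neighborhood radius is held at zero (only the best neuron is updated, omitting the square neighborhood of Steps 5 and 6), the default alpha0 is ours, and the two GPU-parallel steps (the distance search of Steps 3 and 4 and the error of Step 7) run as plain loops.

```python
import math
import random

def train_som(samples, k, T, eps, alpha0=0.5):
    """Serial sketch of the Step 1-8 training loop (neighborhood radius 0)."""
    n = len(samples[0])
    random.seed(0)                                   # Step 2: initialize W
    W = [[random.random() for _ in range(n)] for _ in range(k)]
    for t in range(1, T + 1):
        prev = [row[:] for row in W]
        alpha = alpha0 * (1 - t / T)                 # formula (2-3)
        for x in samples:
            # Steps 3-4: squared distance to every neuron (GPU step)
            d = [sum((xp - wp) ** 2 for xp, wp in zip(x, Wj)) for Wj in W]
            J = d.index(min(d))
            # Step 5: pull the best neuron toward the input (formula (2-2))
            W[J] = [wp + alpha * (xp - wp) for xp, wp in zip(x, W[J])]
        # Step 7: E_t as the sum of row-wise norms, formula (2-5) (GPU step)
        Et = sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(Wi, Pi)))
                 for Wi, Pi in zip(W, prev))
        if Et <= eps:                                # Step 8
            break
    return W

W = train_som([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]], k=2, T=50, eps=1e-9)
```

On this toy input of two tight clusters, each of the two neurons converges toward one cluster, illustrating how the trained weight vectors come to represent the input patterns.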

As shown in Figure 10 and Figure 11, the specific embodiment of the present invention is to build a self-organizing map neural network clustering system based on a graphics processing unit, characterized by comprising a hardware part and a software part. Hardware part: a CPU/GPU cooperation framework is adopted; serially executed code runs on the CPU, code executed in parallel runs on the GPU, and data is exchanged between device memory and host memory through the data-transfer facilities provided by the GPU. The software part is divided into three modules: a parallelized keyword term-frequency statistics module, a parallelized feature-vector computation module, and a parallelized SOM clustering module. The parallelized keyword term-frequency statistics module segments the text content into words to obtain the set of keywords, and counts the keyword frequencies of the documents in parallel to obtain the term-frequency matrix. The parallelized feature-vector computation module converts the keyword term-frequency matrix into the corresponding feature-vector matrix, in which each feature vector represents one document. The parallelized SOM clustering module designs the SOM network structure according to the feature-vector matrix, initializes the SOM network, computes in parallel the distances between an input sample and the weight vectors of all output neurons, compares the distances, and obtains the best neuron J at minimum distance; it updates the best neuron, the weight vectors of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron, and then computes the network error rate E_t in parallel on the graphics processing unit. If E_t <= the target error ε or the iteration count t >= the maximum number of training iterations T, SOM training ends; otherwise a new round of training begins. Each learning step moves the neighborhood of the best matching neuron closer to the input data vector, gathering input feature vectors that are close together into the same cluster; the resulting set of clusters is the final clustering result.

The specific clustering process is as follows. The parallel self-organizing map neural network clustering system based on a graphics processing unit of the present invention adopts a CPU/GPU framework design; Figure 10 shows the hardware framework of the system. The CPU controls the scheduling of the system, assigns tasks to the graphics processing unit, and prepares the running space for it; the graphics processing unit executes the computing tasks in parallel in the environment prepared by the CPU. Figure 11 shows the software cooperation framework of the SOM clustering. The system uses the Compute Unified Device Architecture ("CUDA") programming platform to accelerate the application of the SOM algorithm to the text-data clustering process.

In the design based on the CPU/GPU cooperation framework, the cooperative tasks of the CPU and GPU are allocated reasonably so that the respective strengths of the CPU and GPU are fully exploited to accelerate the algorithm. The system divides its tasks into two kinds: those with a clear running advantage on the CPU, and those with a clear running advantage on the graphics processing unit. Tasks suited to the CPU mainly include: initialization of the SOM network, data I/O operations, control of the logical flow of the algorithm, and invocation of the kernel functions. Tasks suited to the graphics processing unit are mainly data-computation tasks, including: parallel term-frequency counting, parallel feature-vector computation, computation of the distances between input samples and the neurons, and computation of the error of the network weights.

On the software side, the accelerated operation of the algorithm is realized mainly by designing kernel functions for each module. In the parallel term-frequency statistics module, the system designs one kernel function that allocates one thread on the graphics processing unit per document, opening m threads in total, where m is the number of documents, and counts the frequency with which each keyword appears in each document. The computation flow of this kernel function is shown in Figure 12.

In the parallel feature-vector computation module, the system designs three kernel functions to handle the three time-consuming parts of the algorithm: (1) a kernel that counts the number of documents containing each keyword, as in Figure 4, for which n threads are opened, where n is the number of keywords; its computation flow is shown in Figure 13; (2) a kernel that computes the document feature vectors, as in Figure 6, for which m*n threads are opened; its computation formula is

$$x_{ij} = \log_2(tf_{ij} + 1.0) \cdot \log(m / m_j).$$

(3) Since in practice document lengths may vary greatly, two kernel functions are designed in this module to normalize the document feature vectors: a kernel that computes the modulus of each feature vector, as in Figure 7, for which m threads are opened; and a kernel that normalizes the feature vectors, as in Figure 8, for which m*n threads are opened.

The specific flow of the parallel SOM clustering module is shown in Figure 11. Analysis of the execution logic of the serial SOM algorithm shows that its large-scale computation falls into two parts: one is the module that, when the SOM network receives a new input vector, computes the distance from the input vector to every output neuron and selects the neuron at minimum distance as the best neuron; the other is the module that computes the network error between two consecutive iterations and compares it with the acceptable error rate preset by the algorithm. Both parts are large-scale floating-point computations at which the CPU is weak; when executed serially, they account for more than 80% of the algorithm's running time, so in module three, three kernel functions are designed for these two parts to speed up execution and improve efficiency. For the two parts above, the system designs two sub-modules that run on the graphics processing unit.

(1) Parallel-computation sub-module for finding the best neuron. The best neuron is the one closest to the input pattern vector, so on receiving an input pattern vector the algorithm must compute the distance between that vector and every neuron in the network. The distance between a neuron and the input vector is computed as follows:

$$d_j = \left\| X_i - W_{j,t-1} \right\| = \sqrt{\sum_{p=1}^{n} \left( x_{ip} - w_{jp,t-1} \right)^2}.$$

Since the square root would have to be taken for every distance and does not change their ordering, the formula can be simplified to

$$d_j = \sum_{p=1}^{n} \left( x_{ip} - w_{jp,t-1} \right)^2.$$

The computation steps of the distance formula are simple, but when the SOM network has many neurons, the input samples have high dimensionality, and there are many input samples, the amount of computation is enormous. With sample dimension n, k neurons, and T algorithm iterations, the difference operation in this formula is executed k*n*T times, so parallelizing this formula is very effective for improving the performance of the program. In this sub-module, the system designs a kernel function to compute the distance between the input feature vector and every neuron in the network, fully exploiting the parallel computing capability of the graphics processing unit to speed up the algorithm.

As shown in Figure 14, to facilitate the parallelized redesign of the serial program, the distance computation is described below in matrix terms. The distance operation between the neuron weight matrix and the input pattern vector is defined as follows: the neuron weight matrix W is a k*n matrix, where k is the number of output neurons and n the dimension of the input pattern vector; the input pattern vector x_k is a vector of dimension n; and the result d_j of the distance operation is a vector of dimension k. To accelerate the algorithm, the system can launch k threads to compute in parallel; the computation flow of this kernel function is shown in Figure 15.
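A per-thread view of this k-thread distance kernel can be sketched serially (the loop over rows stands in for launching k GPU threads at once; names and the small example are ours):

```python
# One "thread" per output neuron: thread j computes the squared distance
# d_j between the input vector x and neuron weight row W[j].
def distance_kernel(W, x):
    return [sum((xp - wp) ** 2 for xp, wp in zip(x, Wj)) for Wj in W]

W = [[0.0, 0.0], [3.0, 4.0]]      # k = 2 neurons, n = 2
d = distance_kernel(W, [3.0, 0.0])
print(d)                # [9.0, 16.0]
print(d.index(min(d)))  # 0 -> index of the best neuron
```

Each thread reads the shared input vector and its own row of W, with no inter-thread communication; only the final argmin over the k distances needs a comparison step.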

(2) Sub-module for computing in parallel the error between the network weights of two consecutive rounds. The SOM network index E_t represents the difference between the current weight values of the network and the weight values at the end of the previous round of training. If the value of E_t is smaller than the error rate defined before initialization of the algorithm, the weight values of the neurons no longer change on a large scale during iterative training; the network is considered to have converged to some state, and training is complete. The formula for judging whether the current SOM network has been trained to within the allowed error is as follows:

$$E_t = \left\| W_t - W_{t-1} \right\| = \sum_{i=1}^{k} \left\| W_{i,t} - W_{i,t-1} \right\|.$$

Observation and analysis of the formula above show that it is a variant of the difference operation between two matrices. The serial execution of such a large-scale matrix operation on the CPU is far slower than its parallel execution on the graphics processing unit, so in this sub-module the error computation of the network is designed as a kernel function running on the graphics processing unit to accelerate the algorithm. At the same time, when computing the sum of the network errors, the system computes them in parallel by row or by column to speed up the algorithm. Therefore, in this sub-module, a further kernel function is designed to compute the network error in parallel.

As shown in Figure 16, the sub-module that computes the SOM network error rate is divided into two steps of parallel computation on the graphics processing unit. The first part is the difference operation between the two matrices, i.e. the element-wise computation W_t − W_{t−1} = E_t; the second part is the reduction of the sum of the absolute values of every element of the resulting matrix. Both parts are accelerated on the graphics processing unit. For the difference in Figure 16, a kernel function can be designed to compute it, for which the system opens m*n threads, each thread computing the Euclidean distance between two corresponding neurons.

The difference operation above is usually not the final result we want: we must further sum the absolute values of the elements of each row of the difference matrix to obtain an n-dimensional vector, and finally accumulate the values of that vector by a serial computation on the CPU to obtain the actual amount of change of the SOM network, which is compared with the error rate. If it is greater than the error rate, the next round of training continues; if it is smaller, training stops, showing that the network has converged to some extent. Summing the matrix takes two steps: sum by row or by column, and then sum the results of the first step; the process is shown in Figure 17. The system can design a kernel function to compute the first step in parallel, opening n threads for it; the computation flow of this kernel function is shown in Figure 18.

Note that the summation in Figure 17 that reduces the matrix elements to an n*1-dimensional vector need not be a reduction by row; it can also be done by column, depending on the relative sizes of the numbers of rows and columns in the actual application. If the matrix has far more rows than columns, reduce by row, because reduction by row can open more parallel threads at once; conversely, if the matrix has far more columns than rows, reduce by column.
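The two-step reduction can be sketched serially as follows. It follows the absolute-value row sums described here (rather than the row-norm form of formula (2-5)): the first step mimics the per-row GPU kernel, the second the final serial accumulation on the CPU; names are illustrative.

```python
# Step 1 (GPU kernel analogue, one "thread" per row): reduce each row of
# the element-wise difference |W_t - W_{t-1}| to a partial sum.
# Step 2 (CPU analogue, serial): accumulate the partial sums into a scalar.
def network_error(Wt, Wprev):
    partial = [sum(abs(a - b) for a, b in zip(rt, rp))   # per-row reduction
               for rt, rp in zip(Wt, Wprev)]
    return sum(partial)                                  # final CPU sum

Wt    = [[1.0, 2.0], [3.0, 4.0]]
Wprev = [[1.0, 1.5], [2.0, 4.0]]
print(network_error(Wt, Wprev))  # 1.5
```

The split matches the hardware: the many independent row sums go to the GPU, while the short final accumulation over a single vector is cheap enough to do serially on the CPU.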

Several data structures must be maintained for the computation above: the two-dimensional matrix W storing the SOM network neuron weight values, the training sample matrix X, and the distance vector D. Since the value Et must be computed, the two-dimensional matrix of SOM neuron weight values from the previous round must also be kept.
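The buffers listed above can be grouped into one state object. The class below is an assumed layout for illustration only — field names, shapes, and the random initialization are not taken from the patent:

```python
import numpy as np

class SomBuffers:
    """Buffers named in the description: W (current neuron weights),
    the previous round's weights (kept so Et can be formed), the
    training sample matrix X, and the distance vector D.
    """
    def __init__(self, n_neurons, dim, n_samples, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.random((n_neurons, dim))    # current neuron weights
        self.W_prev = self.W.copy()              # last round's weights
        self.X = rng.random((n_samples, dim))    # training samples
        self.D = np.zeros(n_neurons)             # sample-to-neuron distances
```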

The graphics-processing-unit-based parallel self-organizing map neural network clustering method and system of the present invention provide a parallelized SOM text clustering algorithm. Exploiting the complementary computing capabilities of the graphics processing unit (GPU) and the central processing unit (CPU), the present invention further designs a parallelized text clustering system based on a CPU/GPU cooperation framework. The hardware part of the system is designed as a CPU/GPU cooperation framework; the software part is divided into three modules: a parallel word frequency statistics module, a parallel feature vector calculation module, and a parallel SOM clustering module. The text clustering based on the GPU self-organizing map neural network of the present invention can make full use of the high parallelism of graphics processing devices, effectively improving the clustering speed of the algorithm, and is well suited to clustering high-dimensional text data.

The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention shall not be considered limited to these descriptions. For a person of ordinary skill in the technical field of the present invention, several simple deductions or substitutions may be made without departing from the concept of the present invention, all of which shall be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A graphics-processing-unit-based parallelized self-organizing map neural network clustering method, comprising the following steps: parallel keyword word frequency statistics: segmenting the text content to obtain a set of keywords, and counting the keyword frequencies in the documents in parallel to obtain a word frequency matrix; parallel feature vector calculation: converting the keyword word frequency matrix into a corresponding feature vector matrix, each feature vector representing one document; parallel SOM clustering: designing the SOM network structure according to the feature vector matrix, initializing the SOM network, computing in parallel the distances between an input sample and the weight vectors of all output neurons, comparing the distances, and obtaining the best neuron J with the smallest distance; updating the best neuron, the weight vectors of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron; then computing the network error rate Et in parallel on the graphics processing unit; if the network error rate Et <= target error ε, or the iteration count t >= the maximum number of training iterations T, the SOM network training ends, otherwise a new round of training is performed; each learning step moves the neighborhood of the best matching neuron closer to the input data vector, so that input feature vectors at small distances from one another are gathered into the same cluster, and the resulting set of clusters is the final clustering result.

2. The graphics-processing-unit-based self-organizing map neural network clustering method according to claim 1, characterized in that, in the step of obtaining the keyword word frequencies of the documents, graphics-processing-unit-based multi-threaded parallel word frequency counting is adopted.

3. The graphics-processing-unit-based self-organizing map neural network clustering method according to claim 1, characterized in that, in the parallel feature vector calculation step, the feature vector of each document is computed by graphics-processing-based multi-threaded parallel computation.

4. The graphics-processing-unit-based self-organizing map neural network clustering method according to claim 1, characterized in that the computations of the distance between the input feature vector and the weight vector of each output neuron are mutually independent; multiple graphics-processing-based threads compute the distance between the input feature vector and each output neuron vector in parallel, the system opening one thread for each neuron for multi-threaded parallel computation.

5. The graphics-processing-unit-based self-organizing map neural network clustering method according to claim 1, characterized in that the computations of the weight vector error of each neuron between two adjacent iterations are mutually independent; multiple graphics-processing-based threads compute the weight vector error of each neuron in parallel, the system opening one thread for each neuron for multi-threaded parallel computation.

6. A graphics-processing-unit-based self-organizing map neural network clustering system, characterized in that it comprises a hardware part and a software part; the hardware part is designed on a CPU/GPU cooperation framework, with serially executed code running on the CPU and parallel code running on the GPU, data being exchanged between video memory and main memory through the data transfer mechanisms provided by the GPU; the software part is divided into three modules: a parallelized keyword word frequency statistics module, a parallelized feature vector calculation module, and a parallelized SOM clustering module; the parallelized keyword word frequency statistics module segments the text content to obtain a set of keywords, and counts the keyword frequencies in the documents in parallel to obtain a word frequency matrix; the parallelized feature vector calculation module converts the keyword word frequency matrix into a corresponding feature vector matrix, each feature vector representing one document; the parallelized SOM clustering module designs the SOM network structure according to the feature vector matrix, initializes the SOM network, computes in parallel the distances between an input sample and the weight vectors of all output neurons, compares the distances, obtains the best neuron J with the smallest distance, updates the best neuron, the weight vectors of the neurons in its neighborhood, the learning rate, and the neighborhood size of the best neuron, and then computes the network error rate Et in parallel on the graphics processing unit; if the network error rate Et <= target error ε, or the iteration count t >= the maximum number of training iterations T, the SOM network training ends, otherwise a new round of training is performed; each learning step moves the neighborhood of the best matching neuron closer to the input data vector, input feature vectors at small distances from one another are gathered into the same cluster, and the resulting cluster set is the final clustering result.

7. The clustering system of a graphics-processing-unit-based parallelized self-organizing map neural network according to claim 6, characterized in that several kernel functions are designed in each of the parallelized keyword word frequency statistics module, the parallelized feature vector calculation module, and the parallelized SOM clustering module to accelerate the operation of the algorithm in parallel.

8. The clustering system of a graphics-processing-unit-based parallelized self-organizing map neural network according to claim 6, characterized in that one kernel function for keyword word frequency statistics is designed in the parallel keyword word frequency statistics module, and two kernel functions for feature vector calculation and two kernel functions for feature vector normalization are designed in the parallel feature vector calculation module.

9. The clustering system of a graphics-processing-unit-based parallelized self-organizing map neural network according to claim 6, characterized in that, in the parallel SOM clustering module, one kernel function for computing the distances between the input feature vector and the output neurons, one kernel function for computing the error of each neuron's network weight vector between two adjacent iterations, and one kernel function for reducing the errors of the network weight vectors are designed.
CN201310112420.4A 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit Pending CN103488662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310112420.4A CN103488662A (en) 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310112420.4A CN103488662A (en) 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit

Publications (1)

Publication Number Publication Date
CN103488662A true CN103488662A (en) 2014-01-01

Family

ID=49828900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310112420.4A Pending CN103488662A (en) 2013-04-01 2013-04-01 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit

Country Status (1)

Country Link
CN (1) CN103488662A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036451A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Parallel model processing method and device based on multiple graphics processing units
WO2015104691A3 (en) * 2014-01-13 2015-11-19 Brightsource Industries (Israel) Ltd. Systems, methods, and devices for detecting anomalies in an industrial control system
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of degree of depth learning method based on GPU and system
CN105653520A (en) * 2015-12-30 2016-06-08 北京奇艺世纪科技有限公司 Graphic processing unit (GPU) based word segmentation method and apparatus
CN108267945A (en) * 2018-01-16 2018-07-10 电子科技大学 A kind of method of the elimination optical scanner holography defocus noise based on self-organizing map neural network
CN108304379A (en) * 2018-01-15 2018-07-20 腾讯科技(深圳)有限公司 A kind of article recognition methods, device and storage medium
CN108427967A (en) * 2018-03-13 2018-08-21 范大昭 A kind of real-time imaging clustering method
CN108830378A (en) * 2018-06-11 2018-11-16 东北师范大学 SOM neural network configurable module hardware implementation method based on FPGA
CN109102157A (en) * 2018-07-11 2018-12-28 交通银行股份有限公司 A kind of bank's work order worksheet processing method and system based on deep learning
CN109255439A (en) * 2017-07-12 2019-01-22 北京图森未来科技有限公司 A kind of DNN model training method and device that multiple GPU are parallel
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN110097179A (en) * 2018-01-29 2019-08-06 上海寒武纪信息科技有限公司 Computer equipment, data processing method and storage medium
CN110196911A (en) * 2019-06-06 2019-09-03 申林森 A kind of people's livelihood data automatic classification management system
CN110362676A (en) * 2018-04-08 2019-10-22 彩数(上海)商务咨询有限公司 A kind of CDRNN neural network nature semantic parsing system and method
CN110390100A (en) * 2019-07-16 2019-10-29 广州小鹏汽车科技有限公司 Processing method, the first electric terminal, the second electric terminal and processing system
CN110795619A (en) * 2019-09-18 2020-02-14 贵州广播电视大学(贵州职业技术学院) A personalized recommendation system and method for educational resources integrating multi-objective
CN111241289A (en) * 2020-01-17 2020-06-05 北京工业大学 SOM algorithm based on graph theory
CN111405605A (en) * 2020-03-24 2020-07-10 东南大学 Wireless network interruption detection method based on self-organizing mapping
CN111552563A (en) * 2020-04-20 2020-08-18 南昌嘉研科技有限公司 Multithreading data architecture, multithreading message transmission method and system
CN111638707A (en) * 2020-06-07 2020-09-08 南京理工大学 Intermittent process fault monitoring method based on SOM clustering and MPCA
CN111949779A (en) * 2020-07-29 2020-11-17 交控科技股份有限公司 Intelligent rail transit response method and system based on knowledge graph
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112488228A (en) * 2020-12-07 2021-03-12 京科互联科技(山东)有限公司 Bidirectional clustering method for wind control system data completion
CN113378870A (en) * 2020-03-10 2021-09-10 南京邮电大学 Method and device for predicting radiation source distribution of printed circuit board based on neural network
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN114372527A (en) * 2022-01-10 2022-04-19 辽宁工业大学 A Differential Privacy Publishing Method for High-Dimensional Data Using Principal Component Analysis
CN115455150A (en) * 2022-09-27 2022-12-09 北京潞晨科技有限公司 A Dual Tensor Parallel Method for Hybrid Word Frequency Embedding
CN116738354A (en) * 2023-08-15 2023-09-12 国网江西省电力有限公司信息通信分公司 A method and system for detecting abnormal behavior of power Internet of Things terminals

Citations (2)

Publication number Priority date Publication date Assignee Title
US6260036B1 (en) * 1998-05-07 2001-07-10 Ibm Scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems
CN1808474A (en) * 2006-03-02 2006-07-26 哈尔滨工业大学 Self-organized mapping network based document clustering method

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
US6260036B1 (en) * 1998-05-07 2001-07-10 Ibm Scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems
CN1808474A (en) * 2006-03-02 2006-07-26 哈尔滨工业大学 Self-organized mapping network based document clustering method

Non-Patent Citations (2)

Title
LIU Ying et al.: "Research on GPU parallel data mining technology based on the CUDA architecture", e-Science Technology & Application *
SUN Aixiang: "Application of an improved SOM algorithm in text clustering", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (45)

Publication number Priority date Publication date Assignee Title
WO2015104691A3 (en) * 2014-01-13 2015-11-19 Brightsource Industries (Israel) Ltd. Systems, methods, and devices for detecting anomalies in an industrial control system
CN104036451B (en) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 Model method for parallel processing and device based on multi-graphics processor
US9607355B2 (en) 2014-06-20 2017-03-28 Tencent Technology (Shenzhen) Company Limited Model parallel processing method and apparatus based on multiple graphic processing units
CN104036451A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Parallel model processing method and device based on multiple graphics processing units
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of degree of depth learning method based on GPU and system
CN105653520A (en) * 2015-12-30 2016-06-08 北京奇艺世纪科技有限公司 Graphic processing unit (GPU) based word segmentation method and apparatus
CN105653520B (en) * 2015-12-30 2018-11-06 北京奇艺世纪科技有限公司 A kind of segmenting method and device based on graphics processor GPU
CN109255439B (en) * 2017-07-12 2021-04-02 北京图森智途科技有限公司 A DNN model training method and device paralleled by multiple GPUs
CN109255439A (en) * 2017-07-12 2019-01-22 北京图森未来科技有限公司 A kind of DNN model training method and device that multiple GPU are parallel
CN108304379A (en) * 2018-01-15 2018-07-20 腾讯科技(深圳)有限公司 A kind of article recognition methods, device and storage medium
CN108304379B (en) * 2018-01-15 2020-12-01 腾讯科技(深圳)有限公司 Article identification method and device and storage medium
CN108267945A (en) * 2018-01-16 2018-07-10 电子科技大学 A kind of method of the elimination optical scanner holography defocus noise based on self-organizing map neural network
CN110097179B (en) * 2018-01-29 2020-03-10 上海寒武纪信息科技有限公司 Computer equipment, data processing method and storage medium
CN110097179A (en) * 2018-01-29 2019-08-06 上海寒武纪信息科技有限公司 Computer equipment, data processing method and storage medium
CN108427967A (en) * 2018-03-13 2018-08-21 范大昭 A kind of real-time imaging clustering method
CN108427967B (en) * 2018-03-13 2021-08-27 中国人民解放军战略支援部队信息工程大学 Real-time image clustering method
CN110362676A (en) * 2018-04-08 2019-10-22 彩数(上海)商务咨询有限公司 A kind of CDRNN neural network nature semantic parsing system and method
CN108830378A (en) * 2018-06-11 2018-11-16 东北师范大学 SOM neural network configurable module hardware implementation method based on FPGA
CN109102157A (en) * 2018-07-11 2018-12-28 交通银行股份有限公司 A kind of bank's work order worksheet processing method and system based on deep learning
CN109886407B (en) * 2019-02-27 2021-10-22 上海商汤智能科技有限公司 Data processing method, apparatus, electronic device, and computer-readable storage medium
CN109886407A (en) * 2019-02-27 2019-06-14 上海商汤智能科技有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN110196911B (en) * 2019-06-06 2022-04-22 申林森 Automatic classification management system for civil data
CN110196911A (en) * 2019-06-06 2019-09-03 申林森 A kind of people's livelihood data automatic classification management system
CN110390100A (en) * 2019-07-16 2019-10-29 广州小鹏汽车科技有限公司 Processing method, the first electric terminal, the second electric terminal and processing system
CN110390100B (en) * 2019-07-16 2023-10-31 广州小鹏汽车科技有限公司 Processing method, first electronic terminal, second electronic terminal and processing system
CN110795619B (en) * 2019-09-18 2022-02-18 贵州开放大学(贵州职业技术学院) Multi-target-fused educational resource personalized recommendation system and method
CN110795619A (en) * 2019-09-18 2020-02-14 贵州广播电视大学(贵州职业技术学院) A personalized recommendation system and method for educational resources integrating multi-objective
CN111241289B (en) * 2020-01-17 2022-05-03 北京工业大学 Text clustering method based on graph theory and SOM network
CN111241289A (en) * 2020-01-17 2020-06-05 北京工业大学 SOM algorithm based on graph theory
CN113378870B (en) * 2020-03-10 2022-08-12 南京邮电大学 A method and device for predicting the distribution of radiation sources on a printed circuit board based on a neural network
CN113378870A (en) * 2020-03-10 2021-09-10 南京邮电大学 Method and device for predicting radiation source distribution of printed circuit board based on neural network
CN111405605A (en) * 2020-03-24 2020-07-10 东南大学 Wireless network interruption detection method based on self-organizing mapping
CN111552563A (en) * 2020-04-20 2020-08-18 南昌嘉研科技有限公司 Multithreading data architecture, multithreading message transmission method and system
CN111638707A (en) * 2020-06-07 2020-09-08 南京理工大学 Intermittent process fault monitoring method based on SOM clustering and MPCA
CN111638707B (en) * 2020-06-07 2022-05-20 南京理工大学 Intermittent process fault monitoring method based on SOM clustering and MPCA
CN111949779A (en) * 2020-07-29 2020-11-17 交控科技股份有限公司 Intelligent rail transit response method and system based on knowledge graph
CN112347246B (en) * 2020-10-15 2024-04-02 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectrum decomposition
CN112347246A (en) * 2020-10-15 2021-02-09 中科曙光南京研究院有限公司 Self-adaptive document clustering method and system based on spectral decomposition
CN112488228A (en) * 2020-12-07 2021-03-12 京科互联科技(山东)有限公司 Bidirectional clustering method for wind control system data completion
CN113420623B (en) * 2021-06-09 2022-07-12 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN114372527A (en) * 2022-01-10 2022-04-19 辽宁工业大学 A Differential Privacy Publishing Method for High-Dimensional Data Using Principal Component Analysis
CN115455150A (en) * 2022-09-27 2022-12-09 北京潞晨科技有限公司 A Dual Tensor Parallel Method for Hybrid Word Frequency Embedding
CN116738354A (en) * 2023-08-15 2023-09-12 国网江西省电力有限公司信息通信分公司 A method and system for detecting abnormal behavior of power Internet of Things terminals
CN116738354B (en) * 2023-08-15 2023-12-08 国网江西省电力有限公司信息通信分公司 Method and system for detecting abnormal behavior of electric power Internet of things terminal

Similar Documents

Publication Publication Date Title
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
Jiang et al. Tensorial multi-view clustering via low-rank constrained high-order graph learning
Wang et al. Deep CNNs meet global covariance pooling: Better representation and generalization
CN104462253B (en) A kind of topic detection or tracking of network-oriented text big data
CN107292341B (en) An adaptive multi-view clustering method based on pairwise co-regularization and NMF
CN112270345B (en) Clustering algorithm based on self-supervision dictionary learning
CN113468227A (en) Information recommendation method, system, device and storage medium based on graph neural network
CN110046671A (en) A kind of file classification method based on capsule network
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
Wang et al. Worst-case discriminative feature learning via max-min ratio analysis
CN110020435B (en) Method for optimizing text feature selection by adopting parallel binary bat algorithm
CN114357200A (en) A Cross-modal Hash Retrieval Method Based on Supervised Graph Embedding
Wang et al. Decoupled representation learning for attributed networks
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
Zhang et al. Efficient multiview representation learning with correntropy and anchor graph
Chen et al. FINC: An efficient and effective optimization method for normalized cut
CN106127260A (en) A kind of multi-source data fuzzy clustering algorithm of novelty
CN120476408A (en) Node Symmetry in Machine Learning Compiler Optimizations
CN109614581B (en) Nonnegative matrix factorization clustering method based on dual local learning
CN113033641B (en) A semi-supervised classification method for high-dimensional data
Shu et al. Design of deep learning accelerated algorithm for online recognition of industrial products defects
CN114723980A (en) A Multi-View Clustering Method Based on Deep Semantic Mining
Deng et al. RETARCTED ARTICLE: Multimedia data stream information mining algorithm based on jointed neural network and soft clustering
CN116029187A (en) Deep learning model parallel strategy space representation method
CN110516068B (en) A Multidimensional Text Clustering Method Based on Metric Learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140101