CN114386466A

CN114386466A - Parallel hybrid clustering method for candidate signal mining in pulsar search

Info

Publication number: CN114386466A
Application number: CN202210036692.XA
Authority: CN
Inventors: 游子毅; 刘莹; 马智; 李思瑶; 王培�; 童超
Original assignee: Guizhou Education University
Current assignee: Guizhou Education University
Priority date: 2022-01-13
Filing date: 2022-01-13
Publication date: 2022-04-22
Anticipated expiration: 2042-01-13
Also published as: CN114386466B

Abstract

The invention discloses a parallel hybrid clustering method for candidate signal mining in pulsar searching, comprising the following steps: performing cluster analysis on pulsar candidate signals; grouping the data set with a sliding-window grouping strategy, in which the data set is divided according to a fixed window value Batchsize = 1160 and the sliding-window size is set to w = 2; selecting from real samples a reasonably complete group of 1600 feature records covering the various types of pulsar candidates, and adding this group to the data to be detected under each sliding window to form one data block, so that the data set is divided into multiple parallel data blocks of the same size; and parallelizing the clustering over the data blocks on the MapReduce/Spark computing model. The invention can improve clustering performance, raise the screening recall rate, and reduce execution time.

Description

A Parallel Hybrid Clustering Method for Candidate Signal Mining in Pulsar Searching

Technical field

The invention belongs to the technical field of astronomy, and in particular relates to a parallel hybrid clustering method for candidate signal mining in pulsar searching.

Background art

Discoveries in the pulsar field have strongly advanced astronomy, physics, navigation, and related fields. With the completion of the 500-meter-aperture spherical radio telescope FAST and its 19-beam receiver sky survey, high sensitivity and wider sky coverage extend the range of pulsar signal searching, but they also bring an enormous growth in observation data. How to screen pulsar candidates effectively from massive data has become the key to pulsar searching.

The basic task in a pulsar search is to look for stable periodic pulse signals in the two-dimensional space formed by P (period) and DM (dispersion measure). Traditional methods that rely on graphical tools or on statistics can no longer cope with such a large volume of data. Artificial intelligence techniques applied to pulsar candidate screening fall into three main categories according to their principle. The first category is candidate ranking algorithms based on empirical formulas. These algorithms depend on assumptions about, for example, signal-to-noise ratio and pulse profile shape. In practice many candidates do not fit these assumptions well, so pulsars with unusual pulse shapes, such as wide pulses, offset DM curves, or low flux, may be missed. The second category is neural network image recognition models that extract features automatically from candidate diagnostic plots. These generalize better than traditional machine learning methods, but every subplot of each training sample must be labeled by hand and a large number of training samples is needed, which demands considerable extra labor. The third category is classification algorithms based on machine learning. Feature selection guided by human experience is the key factor in the binary classification result of pulsar screening, and an incomplete feature design can weaken model performance, so feature design is particularly critical. In addition, some hybrid models that integrate multiple methods have also achieved notable results.

In real large-scale pulsar data computation and searching, most of the input data is unlabeled, and the ratio of pulsar to non-pulsar samples is extremely imbalanced. As a result, identifying pulsar candidates with supervised classification methods carries a considerable cost in time and labor.

The experimental data set HTRU2 comes from multibeam (13-beam) observations with the Parkes telescope in Australia. The DM range of the pulsar search pipeline was set to 0 to 2000 cm⁻³ pc. The data set describes pulsar candidate samples collected during the High Time Resolution Universe survey and processed with PRESTO (Pulsar Exploration and Search Toolkit). PRESTO is a pulsar search and analysis suite developed at the NRAO radio astronomy observatory in the United States; it has been used in several sky surveys and handles short-integration-time data and X-ray data. HTRU2 contains 17,898 samples in total: 16,259 spurious examples caused by RFI or noise and 1,639 real pulsar examples. Each sample has eight features: the mean, standard deviation, excess kurtosis, and skewness of the pulse profile, and the mean, standard deviation, excess kurtosis, and skewness of the DM-S/N curve. HTRU2 is an open data set with relatively abundant samples and is well recognized, so it is widely used to evaluate the performance of pulsar candidate classification algorithms.

Clustering is one of the key methods for large-scale data mining problems and includes partition-based, density-based, and grid-based algorithms. k-means is a widely used partition-based clustering algorithm, but the original k-means has several defects: the clustering result depends on the choice of initial centers, it handles only numerical data, and it is strongly affected by outliers. Many researchers have therefore worked on improving it. (Privacy-Preserving Mechanisms for k-Modes Clustering, Computers and Security, 2018) proposed the K-MODES algorithm to address the limitation that k-means handles only numerical data. Density-based methods, such as the typical DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, can find clusters of arbitrary shape, but with large samples convergence takes a long time, and the results are poor when cluster density is uneven. (Clustering by fast search and find of density peaks, Science, 2014) proposed a fast-search clustering algorithm based on density peaks. Its main idea is that a cluster center should have a higher density than its surrounding neighbors and that different cluster centers should be relatively far apart. Because the algorithm only takes points with high density and relatively large distance as centers, it tends to split a single cluster that contains several high-density points into multiple clusters. To overcome this defect, (McDPC: multi-center density peak clustering, Neural Computing and Applications, 2020) further proposed a multi-center clustering method based on density hierarchy division. Hierarchical clustering does not need the number of clusters in advance and can reveal the hierarchical relationships between classes, but its computational complexity is too high.

Summary of the invention

The purpose of the invention is to overcome the above shortcomings by providing a parallel hybrid clustering method for candidate signal mining in pulsar searching that improves clustering performance, raises the screening recall rate, and reduces execution time.

The purpose of the invention is achieved, and its main technical problem solved, by the following technical solution:

A parallel hybrid clustering method for candidate signal mining in pulsar searching according to the invention comprises the following steps:

(1) Cluster analysis of pulsar candidate signals:

The density of each data point is computed with a K-nearest-neighbor polynomial kernel function, and samples whose density is below the threshold 0.01 are screened out. These samples are later examined through their candidate diagnostic plots to decide whether they are noise or a new astronomical phenomenon, which removes the interference of outliers with very low density.

The characteristics of density-peak and hierarchical clustering are combined to divide the data set into multi-density cluster levels, to merge micro-clusters in the same region that have similar density and lie close together, and to determine the initial cluster centers.

k-means iteration based on the Gaussian radial basis function (RBF) kernel distance is then used to assign all data points and optimize the cluster centers. Computing the similarity between sample points with the RBF kernel function maps the distance measure into a high-dimensional space.

(2) The data set is grouped with a sliding-window grouping strategy. It is divided according to a fixed window value Batchsize = 1160, and the sliding-window size is set to w = 2. A reasonably complete group of 1600 feature records covering the various types of pulsar candidates is selected from real samples and added to the data to be detected under each round of the sliding window to form one data block, so that the data set is divided into multiple parallel data blocks of the same size.

(3) The clustering is implemented by parallelizing over the data blocks on the MapReduce/Spark computing model.

In the above parallel hybrid clustering method for candidate signal mining in pulsar searching, the cluster analysis of step (1) is carried out as follows:

① Data preprocessing. Feature selection and dimensionality reduction are applied to the pulsar candidate data produced by the pulsar search pipeline based on PRESTO (Pulsar Exploration and Search Toolkit), using the feature extraction method of (Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA). This yields an input data set in a new feature space with feature vector b. The candidate physical features that may be used include pulse emission (single-peak, double-peak, and multi-peak), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, sum of incoherent power, and coherent power.

② The Mahalanobis distance d_ij between data points i and j is computed according to formula (1), where S is the covariance matrix of the multidimensional random variable. The local polynomial kernel density ρ_i of each data point, based on its K nearest neighbors, is then computed according to formula (2), where c is the bias coefficient and d is the order of the polynomial. The global nature of the polynomial kernel function gives it strong generalization performance.

To remove the effect of differing variation and magnitude in the data, both d_ij and ρ_i are min-max normalized, where min_d and min_ρ are the minimum values of d_ij and ρ_i, and max_d and max_ρ are their maximum values.
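The formula images for this step do not survive in this text. Assuming the standard definitions of the Mahalanobis distance and of min-max normalization, which agree with the symbols defined in this step, the distance formula (1) and the two normalization formulas can be written as below. The primes marking normalized values are notation introduced here for illustration. Formula (2) is left out because the text does not fix the exact form of the polynomial kernel density.

```latex
d_{ij} = \sqrt{(x_i - x_j)^{\mathsf{T}} \, S^{-1} \, (x_i - x_j)}

d_{ij}' = \frac{d_{ij} - \mathrm{min}_d}{\mathrm{max}_d - \mathrm{min}_d}
\qquad
\rho_i' = \frac{\rho_i - \mathrm{min}_\rho}{\mathrm{max}_\rho - \mathrm{min}_\rho}
```

After normalization both quantities lie in [0, 1], which is what makes a fixed density threshold such as 0.01 meaningful across data sets.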

③ Outliers are removed according to formula (5), and the distance δ_i for the remaining non-outlier points is then computed according to formula (6). Removing outliers helps the selection of cluster centers. Data points with very low density are few and lie at the margins of the distribution. Because they are scarce and sparse, they appear as anomalies in the data distribution, and such anomalies may be pure noise or new astronomical phenomena (for example, unusual pulsars). These points are examined further through their candidate diagnostic plots.

inlier = {ρ_i > ρ_threshold}, ρ_threshold = 0.01    (5)

④ All data points whose distance δ exceeds the threshold λ are used to generate a two-dimensional decision graph, with density ρ on the horizontal axis and distance δ on the vertical axis. Micro-clusters are merged by density level on this decision graph as follows. If the partition of the ρ axis or the δ axis contains two or more intervals with no data points, that gap is called an empty region. The empty region divides all data points into two density regions; the rightmost one is called the maximum-density region and the rest are low-density regions.

(A) In the low-density regions, where discrimination is poor, all the corresponding micro-clusters are merged into one cluster.

(B) In the maximum-density region, if all representative points lie in the same δ region, each of them is chosen as an independent cluster center. If they do not lie in the same δ region, the distances between these representative points discriminate poorly and the points may belong to the same cluster, so the corresponding micro-clusters are merged into one large cluster.

⑤ The number of clusters k and the center center_i of each cluster C_i (1 ≤ i ≤ k) are determined.

⑥ Each data point x_j is assigned, on the nearest-center principle, to the cluster of the closest center_i. Similarity is measured with the RBF kernel distance given in formula (7). The RBF kernel function is local in nature and has strong learning ability, and the RBF kernel distance maps the distance measure into a high-dimensional space.

Here η is the width of the kernel function. The mean of all data points in the new cluster C_i' is computed according to formula (8) and taken as the new center center_i', where n_i is the number of data points belonging to C_i'.

⑦ The sum of squared errors (SSE) over all objects in the data set is computed.

The algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.
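The formula images for the center update and the SSE do not survive in this text. From the description, formula (8) is the mean of the members of the new cluster. The SSE is assumed here to be the usual sum of squared distances to the assigned centers, with dist standing for the RBF kernel distance of formula (7), whose exact form the text does not give:

```latex
center_i' = \frac{1}{n_i} \sum_{x_j \in C_i'} x_j

SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \mathrm{dist}(x_j, center_i)^2
```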

In the above parallel hybrid clustering method for candidate signal mining in pulsar searching, the sliding-window grouping of step (2) is carried out as follows. To screen candidates as accurately as the data structure allows, the data is divided using the sliding-window idea. First, the window size is fixed (Batchsize = 1160) and the data set to be detected is divided equally into L blocks; if the last block is short of data, it may be filled with data from block 1. The sliding-window size is set to w = 2. Round 1 starts with blocks 1 and 2, and in each round the window advances by one position to the corresponding blocks. The last round covers the combination of the last block and block 1, so L rounds of division are needed in total. A reasonably complete group of 1600 feature records covering the various types of pulsar candidates is selected from real samples, and in each round it is added to the data under the sliding window to form a data block to be detected. The data set is thus divided into L parallel data blocks to be detected. Clustering rests on a basic assumption: examples in the same cluster are likely to share the same label. Decision boundaries are therefore set according to the dense and sparse regions of each class of data, which determines the region where pulsar data is distributed and separates pulsar signals from non-pulsar interference signals. The distribution density of pulsar samples within each cluster is computed as a measure of similarity, and clusters in which pulsar samples make up more than 50% are added to the pulsar candidate list. The list of noise points excluded in step ③ of the cluster analysis may lead to the discovery of new phenomena.
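The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `sliding_window_blocks` and its parameters are hypothetical, records are treated as opaque list items, and the data set is assumed to hold at least one full chunk so that the padding from block 1 is sufficient.

```python
def sliding_window_blocks(data, reference, batch_size=1160, w=2):
    """Build L blocks, each holding the reference samples plus w consecutive chunks.

    data      -- records to be detected (any sequence)
    reference -- known pulsar records added to every block (1600 in the text)
    """
    if not data:
        return []
    # Cut the data into chunks of batch_size records.
    chunks = [list(data[i:i + batch_size]) for i in range(0, len(data), batch_size)]
    # Pad a short final chunk with records taken from the first chunk.
    short = batch_size - len(chunks[-1])
    chunks[-1].extend(chunks[0][:short])
    L = len(chunks)
    blocks = []
    for r in range(L):
        # Round r covers chunks r, r+1, ..., wrapping to chunk 0 in the last round.
        window = [rec for k in range(w) for rec in chunks[(r + k) % L]]
        blocks.append(list(reference) + window)
    return blocks


# Toy run: 10 records, chunks of 4, two reference records.
demo_blocks = sliding_window_blocks(list(range(10)), ["p1", "p2"], batch_size=4, w=2)
```

With the values stated in the text, each block would hold 1600 + 2 × 1160 = 3920 records, so all L blocks have the same size.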

In the above parallel hybrid clustering method for candidate signal mining in pulsar searching, the parallelization of step (3) over data blocks on the MapReduce/Spark computing model is carried out as follows. For large-scale pulsar data processing, and in light of the Sun-Ni theorem, it is necessary to study a parallel implementation of the clustering algorithm on the MapReduce computing model. On the one hand, this can improve the accuracy of the clustering result; on the other hand, it can reduce the number of data comparisons. The Sun-Ni theorem introduces a function G(p) that represents the increase in workload when storage capacity is limited. It states that, provided the time limit set by fixed-time speedup is met and enough memory is available, scaling the problem makes effective use of the memory space. First, the data is divided into L data blocks (Block(1), ..., Block(L)) by the sliding-window method above, and the blocks are processed in parallel. Next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1 ≤ i ≤ L) and select the initial cluster centers. In the Map stage the <key, value> input is as follows: key is the row number, and value is the list of the current sample's values in each dimension. In the Reduce stage the output key.id is the initial cluster center. Finally, the Map2 and Reduce2 functions iterate: they compute the distance from every data point in Block(i) to the cluster centers (cluster centers(i)) and relabel the cluster each point belongs to, and the Reduce2 function computes the new cluster centers in preparation for the next round of clustering. The distance between each current center and the corresponding center of the previous round is compared; if the change is below a given threshold the run ends, otherwise the new centers are used as the cluster centers for the next round. After clustering ends, the pulsar clusters and the anomalous noise points are extracted. Spark is a general-purpose computing engine for large-scale data processing, and its computation proceeds in a similar way to MapReduce.

Compared with the prior art, the invention has clear advantages and beneficial effects. As the above technical solution shows, the invention first computes data point density with a K-nearest-neighbor polynomial kernel function and removes the interference of very low-density outliers. Second, it combines density peaks and hierarchy to divide multi-density cluster levels and thereby determine the initial cluster centers. Third, it uses k-means iteration based on the Gaussian radial basis function (RBF) kernel distance to assign data points and optimize cluster centers. The sliding-window data partitioning strategy and the parallel design on the MapReduce/Spark model greatly improve running time while preserving the quality of candidate clustering. In experiments on the Parkes High Time Resolution Universe survey data set (The High Time Resolution Universe Survey 2, HTRU2) comparing the method with other commonly used machine learning classification methods, the proposed scheme achieved better precision and recall, 0.946 and 0.905 respectively. According to the Sun-Ni theorem, when there are enough parallel execution nodes and communication cost is negligible, the total running time of the algorithm will in theory fall markedly. Because clustering groups samples by similarity, the method improves the efficiency of candidate screening and can also produce groupings that are more informative for discovering new phenomena (for example, unusual pulsar signals).

Description of the drawings

Figure 1a is the ρ partition of the two-dimensional decision graph for density-hierarchy clustering;

Figure 1b is the δ partition of the two-dimensional decision graph for density-hierarchy clustering;

Figure 2 shows the sliding-window data allocation;

Figure 3 is the MapReduce flow chart;

Figure 4 compares average running times.

Detailed description of the embodiments

Embodiment:

Referring to Figures 1 to 3, a parallel hybrid clustering method for candidate signal mining in pulsar searching according to the invention comprises the following steps:

1. Hybrid cluster analysis

(1) Data preprocessing. Feature selection and dimensionality reduction are applied to the pulsar candidate data produced by the pulsar search pipeline based on PRESTO (Pulsar Exploration and Search Toolkit), using the feature extraction method of (Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and principal component analysis (PCA). This yields an input data set in a new feature space with feature vector b. The candidate physical features that may be used include pulse emission (single-peak, double-peak, and multi-peak), period, dispersion measure, signal-to-noise ratio, noise signal, signal ramp, sum of incoherent power, coherent power, and so on.

(2) The Mahalanobis distance d_ij between data points i and j is computed according to formula (1), where S is the covariance matrix of the multidimensional random variable. The local polynomial kernel density ρ_i of each data point, based on its K nearest neighbors, is then computed according to formula (2), where c is the bias coefficient and d is the order of the polynomial. The global nature of the polynomial kernel function gives it strong generalization performance.

To remove the effect of differing variation and magnitude in the data, both d_ij and ρ_i are min-max normalized, where min_d and min_ρ are the minimum values of d_ij and ρ_i, and max_d and max_ρ are their maximum values.
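As an illustration of this step, the NumPy sketch below computes pairwise Mahalanobis distances and applies min-max normalization. It assumes the standard Mahalanobis definition with S estimated from the samples themselves; the function names `mahalanobis_matrix` and `min_max` are hypothetical, and the pseudo-inverse is a choice made here to tolerate a singular covariance matrix. The polynomial kernel density of formula (2) is not sketched because its exact form is not given in the text.

```python
import numpy as np


def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances d_ij for the rows of X (n samples x m features)."""
    S = np.cov(X, rowvar=False)            # covariance matrix S of the features
    S_inv = np.linalg.pinv(S)              # pseudo-inverse tolerates a singular S
    diff = X[:, None, :] - X[None, :, :]   # diff[i, j] = x_i - x_j, shape (n, n, m)
    sq = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))    # clip tiny negatives caused by round-off


def min_max(a):
    """Min-max normalization to [0, 1]; a constant array maps to zeros."""
    lo, hi = a.min(), a.max()
    if hi == lo:
        return np.zeros_like(a, dtype=float)
    return (a - lo) / (hi - lo)


# Toy data: 6 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
D = min_max(mahalanobis_matrix(X))         # normalized distance matrix
```

The intermediate array `diff` needs n × n × m floats, so for block sizes in the thousands a chunked computation would be preferable.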

(3) Outliers are removed according to formula (5), and the distance δ_i for the remaining non-outlier points is then computed according to formula (6). Removing outliers helps the selection of cluster centers. Data points with very low density are few and lie at the margins of the distribution. Because they are scarce and sparse, they appear as anomalies in the data distribution, and such anomalies may be pure noise or new astronomical phenomena (for example, unusual pulsars). These points are examined further through their candidate diagnostic plots.
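A minimal sketch of this step follows. The threshold rule matches the inlier condition ρ_i > 0.01 stated in the summary. Formula (6) is not reproduced in this text, so the sketch assumes the standard density-peak definition: δ_i is the distance to the nearest point of higher density, and the densest point receives its largest distance. The names `split_outliers` and `delta_distance` are hypothetical.

```python
import numpy as np


def split_outliers(rho, threshold=0.01):
    """Return (inlier indices, outlier indices) for normalized densities rho."""
    rho = np.asarray(rho, dtype=float)
    return np.flatnonzero(rho > threshold), np.flatnonzero(rho <= threshold)


def delta_distance(rho, D):
    """delta_i = distance to the nearest denser point (assumed density-peak rule)."""
    rho = np.asarray(rho, dtype=float)
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = np.flatnonzero(rho > rho[i])
        # The densest point has no denser neighbor and takes its largest distance.
        delta[i] = D[i, higher].min() if higher.size else D[i].max()
    return delta


# Toy data: four points on a line, one of them with very low density.
pos = np.array([0.0, 1.0, 5.0, 3.0])
rho = np.array([0.9, 0.5, 0.005, 0.7])
inlier, outlier = split_outliers(rho)
D_in = np.abs(pos[inlier][:, None] - pos[inlier][None, :])
delta = delta_distance(rho[inlier], D_in)
```

The indices in `outlier` form the list of low-density points that the text sets aside for inspection through their diagnostic plots.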

(4) All data points whose distance δ exceeds the threshold λ are used to generate a two-dimensional decision graph. For example, Figure 1 shows the two-dimensional decision graph of a randomly generated data set, with density ρ on the horizontal axis and distance δ on the vertical axis.

Suppose the ρ axis and the δ axis of this example decision graph are divided into intervals of size θ and γ respectively.

Figure 1. ρ partition and δ partition of the randomly generated data set. Left: ρ partition; right: δ partition. θ = 2γ = 0.2.

If the partition of the ρ axis or the δ axis contains two or more intervals with no data points, that gap is called an empty region. In Figures 1(a) and 1(b), the empty region divides all data points into two density regions; the rightmost one is called the maximum-density region and the rest are low-density regions.

1) In the low-density regions, where discrimination is poor, all the corresponding micro-clusters are merged into one cluster;

2) In the maximum-density region, if all representative points lie in the same δ region, each of them is chosen as an independent cluster center. If they do not lie in the same δ region, the distances between these representative points discriminate poorly and the points may belong to the same cluster, so the corresponding micro-clusters are merged into one large cluster.
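The empty-region test on the ρ axis can be sketched as follows. This is an interpretation rather than the patented procedure: it assumes ρ is normalized to [0, 1], that an empty region means a run of at least two consecutive empty intervals of width θ lying between occupied intervals, and that the maximum-density region is everything to the right of the rightmost such run. The function name `max_density_mask` is hypothetical, and the merging rules 1) and 2) are not implemented here.

```python
import numpy as np


def max_density_mask(rho, theta=0.2, min_empty=2):
    """Boolean mask of the points lying right of the rightmost empty region."""
    rho = np.asarray(rho, dtype=float)
    n_bins = int(round(1.0 / theta))
    counts, edges = np.histogram(rho, bins=n_bins, range=(0.0, 1.0))
    occupied = np.flatnonzero(counts > 0)
    # Walk neighboring occupied intervals from right to left and look for a gap
    # of at least min_empty empty intervals between them.
    for right, left in zip(occupied[::-1], occupied[-2::-1]):
        if right - left - 1 >= min_empty:
            return rho >= edges[right]
    # No empty region found: treat every point as one density region.
    return np.ones(rho.shape, dtype=bool)


# Toy densities: a low-density group and a high-density group separated by a gap.
rho_demo = np.array([0.05, 0.1, 0.3, 0.85, 0.9, 1.0])
mask_demo = max_density_mask(rho_demo, theta=0.2)
```

The same scan applied to δ with interval γ would give the δ regions used in rule 2).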

(5) The number of clusters k and the center center_i of each cluster C_i (1 ≤ i ≤ k) are determined.

(6) Each data point x_j is assigned, on the nearest-center principle, to the cluster of the closest center_i. Similarity is measured with the RBF kernel distance given in formula (7). The RBF kernel function is local in nature and has strong learning ability, and the RBF kernel distance maps the distance measure into a high-dimensional space.

Here η is the width of the kernel function. The mean of all data points in the new cluster C_i' is computed according to formula (8) and taken as the new center center_i', where n_i is the number of data points belonging to C_i'.

(7) The sum of squared errors (SSE) over all objects in the data set is computed.

The algorithm stops when the SSE value no longer changes; otherwise it returns to step (6).
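Steps (6) and (7) can be sketched as the loop below. Formula (7) is not reproduced in this text, so the sketch assumes one common form of the RBF kernel distance, the squared feature-space distance 2 − 2·exp(−‖x − c‖² / (2η²)). Centers are updated as plain means of their members, as step (6) describes, and the SSE is taken as the sum of these squared kernel distances. The names `rbf_distance_sq` and `rbf_kmeans` are hypothetical.

```python
import numpy as np


def rbf_distance_sq(X, centers, eta=1.0):
    """Assumed squared RBF kernel distance between every point and every center."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return 2.0 - 2.0 * np.exp(-sq / (2.0 * eta ** 2))


def rbf_kmeans(X, init_centers, eta=1.0, max_iter=100):
    """Assign points to the nearest center and update centers until SSE stops changing."""
    centers = np.array(init_centers, dtype=float)
    prev_sse = None
    for _ in range(max_iter):
        d2 = rbf_distance_sq(X, centers, eta)
        labels = d2.argmin(axis=1)                    # step (6): nearest center
        sse = d2[np.arange(len(X)), labels].sum()     # step (7): SSE
        if prev_sse is not None and np.isclose(sse, prev_sse):
            break
        prev_sse = sse
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members):                          # keep the old center if a cluster is empty
                centers[i] = members.mean(axis=0)
    return labels, centers, sse


# Toy data: two well-separated groups and one initial center in each.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels_demo, centers_demo, sse_demo = rbf_kmeans(X_demo, [[0, 0], [10, 10]], eta=2.0)
```

In the method described here the initial centers would come from step (5) rather than being supplied by hand.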

2.基于滑动窗口的数据集划分策略2. Dataset partition strategy based on sliding window

To delineate a more comprehensive pulsar identification range and to screen candidates as accurately as possible according to the data structure, the data are divided using a sliding window. As shown in Figure 2, the window value is fixed (Batchsize=1160), and a relatively complete set of 1600 pulsar candidate feature records of various types is selected from real samples as the sample set. In each round, the sample set is added to the data covered by the sliding window (size w=2) to form a data block to be detected. Clustering rests on a basic assumption: examples in the same cluster are more likely to share the same label. Decision boundaries are therefore set according to the dense and sparse regions of the data distribution, which determines the region occupied by pulsar data and separates pulsar signals from non-pulsar interference signals. The distribution density of pulsar samples within each cluster is computed as a measure of similarity, and clusters whose pulsar sample occupancy is greater than 50% are entered into the pulsar candidate list. The list of noise points excluded in step (3) of the hybrid clustering analysis may lead to the discovery of new astronomical phenomena.
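The cluster selection rule can be sketched in Python as follows. This is a sketch under stated assumptions: "pulsar sample occupancy" is taken to be the fraction of a cluster's members that come from the known pulsar sample set, and the function names `select_candidate_clusters` and `candidates_from_clusters` are hypothetical.

```python
from collections import Counter


def select_candidate_clusters(labels, is_known_pulsar, threshold=0.5):
    """Return ids of clusters whose known-pulsar occupancy exceeds threshold.

    labels[i] is the cluster id of record i; is_known_pulsar[i] is True when
    record i belongs to the known pulsar sample set.
    """
    totals = Counter(labels)
    pulsars = Counter(lab for lab, known in zip(labels, is_known_pulsar) if known)
    return sorted(c for c in totals if pulsars[c] / totals[c] > threshold)


def candidates_from_clusters(labels, is_known_pulsar, selected):
    """Indices of to-be-detected records that fall in a selected cluster."""
    chosen = set(selected)
    return [
        i for i, (lab, known) in enumerate(zip(labels, is_known_pulsar))
        if lab in chosen and not known
    ]
```

Records returned by `candidates_from_clusters` are the unlabeled points that share a cluster with mostly known pulsars, which matches the assumption that members of one cluster tend to share a label.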

3. Parallel design based on the MapReduce/Spark model

For large-scale pulsar data processing, and in view of the Sun-Ni theorem, it is necessary to study a parallel implementation of the clustering algorithm on the MapReduce computing model. On the one hand, this can improve the accuracy of the clustering results; on the other hand, it can reduce the number of data comparisons. As shown in Figure 3, the data are first divided into L data blocks (Block(1),...,Block(L)) by the sliding-window method above, and the blocks are processed in parallel. Next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1≤i≤L) and select the initial cluster centers. (In the Map stage, the <key, value> input is: key is the row number and value is the list of the current sample's values in each dimension. In the Reduce stage, the output key.id is the initial cluster center.) Finally, the Map2 and Reduce2 functions iteratively compute the distance from each data point in Block(i) to the cluster centers (cluster centers(i)) and relabel the cluster to which the point belongs; the Reduce2 function computes the new cluster centers in preparation for the next round of clustering. The distance between each cluster center of the current round and the corresponding center of the previous round is compared: if the change is less than a given threshold, the run ends; otherwise the new cluster centers are used as the centers for the next round. After clustering, the pulsar clusters and abnormal noise points are extracted.
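The Map2/Reduce2 loop for a single block can be illustrated without a cluster, using plain Python functions in the map/reduce style. This is a sketch, not the Hadoop or Spark job of the invention: the names `map_assign`, `reduce_centers` and `cluster_block` are hypothetical, and Euclidean distance is used as a stand-in for the RBF kernel distance to keep the example short.

```python
import math


def map_assign(record, centers):
    """Map2: emit (nearest center id, (sample, 1)) for one <row, sample> record."""
    _row, sample = record
    cid = min(range(len(centers)), key=lambda i: math.dist(sample, centers[i]))
    return cid, (sample, 1)


def reduce_centers(pairs, centers):
    """Reduce2: average the samples emitted under each center id."""
    sums = {}
    for cid, (sample, count) in pairs:
        acc, n = sums.get(cid, ([0.0] * len(sample), 0))
        sums[cid] = ([a + s for a, s in zip(acc, sample)], n + count)
    # A center that received no samples keeps its previous position.
    return [
        [a / sums[i][1] for a in sums[i][0]] if i in sums else list(centers[i])
        for i in range(len(centers))
    ]


def cluster_block(block, centers, tol=1e-6, max_rounds=50):
    """Repeat Map2/Reduce2 until every center moves by less than tol."""
    centers = [list(c) for c in centers]
    pairs = []
    for _ in range(max_rounds):
        pairs = [map_assign(rec, centers) for rec in block]
        new_centers = reduce_centers(pairs, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return [cid for cid, _ in pairs], centers
```

Because each Block(i) is processed independently, `cluster_block` could be invoked once per block by separate workers; how the blocks are scheduled is left open here.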

Experimental example:

The hardware environment is a Linux cluster with 4 physical computing nodes: 2 with Intel Core i7-9700K @ 3.6 GHz CPUs, 1 with an Intel Core i7-1065G7 @ 1.5 GHz CPU, and 1 with an Intel Core i5-9300H @ 2.4 GHz CPU, with 32 CPU cores (total RAM 68 GB, total disk space 3 TB). The software environment is Anaconda3-4.2.0, Hadoop-2.7.6 and the Spark-2.3.1-bin-hadoop2.6 framework on CentOS 7.

1. Data division

The public data set HTRU2 is used; it was obtained with the feature extraction method of (Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016). The window value Batchsize is set to 1160. From the known pulsars, 1600 are randomly selected as the pulsar sample set s, and the remaining 39 are randomly mixed into the non-pulsar samples to form the data set to be detected. According to the data division strategy of Section 4.1 (the sliding-window strategy described above), the data set to be detected is divided evenly by Batchsize into (t₁,t₂,...,t₁₄), so the experimental data are divided into 14 data blocks: {Block(1):[s,t₁,t₂], Block(2):[s,t₂,t₃], ..., Block(13):[s,t₁₃,t₁₄], Block(14):[s,t₁₄,t₁]}. Each Block(i) is clustered separately. When clustering is complete, clusters with a pulsar sample occupancy ≥50% are entered into the pulsar candidate list.
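The block layout {Block(1):[s,t₁,t₂], ..., Block(14):[s,t₁₄,t₁]} can be reproduced with a short Python sketch. The function names `split_batches` and `build_blocks` are hypothetical. Padding a short last batch with data from the first batch follows the grouping strategy as stated in claim 3; everything else is a plain wrap-around window of size w.

```python
def split_batches(data, batch_size):
    """Cut the to-be-detected data into consecutive batches t1, t2, ...

    If the last batch is short, it is filled with records from the first batch.
    """
    batches = [list(data[i:i + batch_size]) for i in range(0, len(data), batch_size)]
    if not batches:
        return []
    short = batch_size - len(batches[-1])
    if short > 0:
        batches[-1].extend(batches[0][:short])
    return batches


def build_blocks(sample_set, batches, w=2):
    """Form Block(i) = [s, t_i, ..., t_(i+w-1)], wrapping around at the end."""
    count = len(batches)
    blocks = []
    for start in range(count):
        block = list(sample_set)  # every block carries the shared sample set s
        for k in range(w):
            block.extend(batches[(start + k) % count])
        blocks.append(block)
    return blocks
```

With 14 batches and w=2 this yields 14 blocks of equal size, the last one combining t₁₄ with t₁.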

2. Evaluation indicators

Candidate classification is commonly evaluated with four indicators: Accuracy, Precision, Recall and F1-score.

Accuracy roughly reflects whether the overall judgment is correct, but it does not objectively reflect classification performance when the data are imbalanced. Precision is the proportion of true positive samples among the samples judged positive, and Recall is the proportion of correctly identified positive samples among all positive samples. Because Precision and Recall often conflict in clustering, the F1-score is used as a combined measure of the two. Table 1 shows the confusion matrix for classification.

Table 1 Confusion matrix

The experiment uses the overall Precision, Recall and F1-score as evaluation indicators, defined as follows:

where L is the number of data blocks, UTP=TP₁∪TP₂∪TP₃∪…∪TP_L is the union of the pulsars identified in the individual data blocks, Recall_O is the recall of a single data block, and Recall_total is the overall recall over all data blocks.
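The indicators can be sketched in Python. Precision, Recall and F1 follow their standard confusion-matrix definitions. The overall-recall formula is not reproduced in this text, so taking Recall_total as |UTP| divided by the number of pulsars in the data to be detected is an assumption, and the function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard Precision, Recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def overall_recall(tp_sets, all_pulsars):
    """Assumed Recall_total: share of pulsars found in at least one block.

    tp_sets holds, for each block, the set of pulsar ids identified in it;
    their union is UTP.
    """
    utp = set().union(*tp_sets)
    targets = set(all_pulsars)
    return len(utp & targets) / len(targets)
```

Because every batch appears in two overlapping blocks, a pulsar missed in one block can still be recovered in the other, which is why the union is used rather than a per-block average.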

3. Parameter settings

The parameters involved in the experiment are the K-nearest-neighbor parameter used to compute data point density, the density threshold ρ_threhold, the Polynomial kernel parameters c and d, the RBF kernel parameter η, the threshold λ for screening small clusters, the value θ for dividing density regions, and the value γ for dividing distance regions. The specific settings are listed in the table below.

Table 2 Parameter settings

4. Analysis of clustering results

Table 3 compares the performance of different supervised and unsupervised learning algorithms on the HTRU2 data set. Among the unsupervised algorithms, the parallel hybrid clustering algorithm has the highest Recall, 90.5%. Compared with the supervised learning algorithms, its Recall is lower only than that of GMO_SNNNNNNNNN (Pulsar candidate selection based on self-normalizing neural networks, Acta Physica Sinica, 2020); its F1-score is lower than those of GMO_SNNNNNNNNN, Random Forest (Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society, 2016) and KNN, but higher than those of SVM and PNCN (Pulsar candidate selection using pseudo-nearest centroid neighbour classifier, Monthly Notices of the Royal Astronomical Society, 2020). In addition, in control experiments with multiple rounds of randomly choosing the 39 pulsars that form the data set to be detected, the algorithm detected at most 36 pulsars in a single round, with a mean of 34. Because hybrid clustering is unsupervised and converges quickly, it is suited to fast classification and mining of large-scale pulsar data. The experimental results indicate that the proposed hybrid-clustering scheme is feasible and effective. In a real pulsar search scenario, the clustering performance is expected to improve further as the parameters, the pulsar sample set and the data division strategy are optimized.

Table 3 Performance of different methods on the HTRU2 data set

5. Time complexity analysis

Let n be the number of samples in the experimental data set. The time complexities of the other algorithms (k-means++, McDPC, PNCN) are listed in Table 4. The time complexity of k-means++ is O(nkTM); k, T and M are usually regarded as constants, so it simplifies to O(n). For McDPC, computing ρ and δ takes O(n²) and clustering based on the different density levels also takes O(n²), so the whole algorithm is O(n²). The time complexity of PNCN is taken from its worst-case cost O(2nMK+FMK²/2), with F and M set as constants. The serial time complexity of the hybrid clustering algorithm is O(n²+nkTM); since k, T and M are constants, it simplifies to O(n²). On a parallel computing platform, according to the Sun-Ni theorem, the complexity becomes O((G(P)m)²), where G(P) is a factor and m is the number of samples in Block(i), with m≪n. When the number of parallel nodes P is sufficient (P approaches the number of data blocks L, reaching a certain threshold) and the communication overhead is negligible, G(P)→1 and the complexity approaches O(m²), which is slightly worse than k-means++ and PNCN but better than McDPC. This indicates that the proposed scheme improves clustering performance while markedly reducing running time.

Table 4 Algorithm complexity

^a T is the number of iterations, M is the number of features per element, F is the number of categories, m is the number of samples in Block(i), and k is the number of cluster centers.
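To give a feel for the gap between O(n²) and O(m²), the following Python sketch compares pair counts using figures from the experimental example (1600 samples, Batchsize=1160, w=2, 14 blocks). It is an illustration only: the value used for n (the sample set plus 14 full batches) is an assumption derived from those figures, constants hidden by the O-notation are ignored, and the function names are hypothetical.

```python
def block_size(sample_count=1600, batch_size=1160, w=2):
    """Samples per block m: the shared sample set plus w window batches."""
    return sample_count + w * batch_size


def pairwise_cost(samples):
    """Proxy for an O(samples^2) step: the number of ordered sample pairs."""
    return samples * samples


def cost_comparison(num_blocks=14, sample_count=1600, batch_size=1160, w=2):
    """Compare pair counts for the whole data set and for the blocks."""
    m = block_size(sample_count, batch_size, w)
    # Assumption: n is the sample set plus num_blocks full batches.
    n = sample_count + num_blocks * batch_size
    return {
        "n": n,
        "m": m,
        "whole_set_serial": pairwise_cost(n),
        "all_blocks_serial": num_blocks * pairwise_cost(m),
        "one_block_per_node": pairwise_cost(m),
    }
```

The `one_block_per_node` entry corresponds to the limiting case described above, where P reaches L, G(P)→1 and communication overhead is negligible.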

Taken together, the experimental analysis and the time complexity analysis show that the proposed scheme is feasible and effective; its performance indicators are expected to improve further as data grouping and the related parameters are optimized in practical scenarios. The unsupervised clustering method is better suited to classifying large unlabeled data sets and to cases where the ratio of pulsar to non-pulsar samples is extremely imbalanced.

6. Actual running time

Figure 4 compares the average running time of the proposed method (parallel and serial) with that of McDPC, k-means++ and KNN on the same experimental equipment. The figure shows that serial hybrid clustering has the longest average running time, whereas parallel hybrid clustering (23.07 s) is very short compared with the other methods. The proposed parallel scheme therefore significantly reduces execution time while maintaining classification performance.

The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims

1. A parallel hybrid clustering method for candidate signal mining in pulsar search, comprising the following steps:

(1) Cluster analysis of pulsar candidate signals:

The density of data points is calculated using a polynomial kernel function over K nearest neighbors, and samples whose density value is less than the threshold 0.01 are screened out; these samples are further judged to be noise or a new astronomical phenomenon by means of the candidate diagnostic plot, thereby excluding the interference of outliers with too small a density;

The characteristics of density-peak and hierarchical clustering are combined to divide the multi-density cluster levels in the data set, merge micro-clusters with similar densities and distances in the same region, and determine the initial cluster center points;

A k-means iteration based on the Gaussian radial basis kernel distance is used to assign all data points and optimize the cluster centers; the kernel function is used to calculate the similarity between sample data points, which maps the distance measure into a high-dimensional space;

(2) Group the data set using a sliding-window grouping strategy: divide it according to a specific window value Batchsize=1160 and set the sliding window size to w=2; select a relatively complete set of 1600 pulsar candidate feature records of various types from real samples as a group of samples, and add it to the data to be detected corresponding to the sliding window in each round to form one data block, so that the data set is divided into multiple parallel data blocks of the same size;

(3) Parallelize over the data blocks based on the MapReduce/Spark computing model to realize the clustering.

2. The parallel hybrid clustering method for candidate signal mining in pulsar search as claimed in claim 1, wherein the cluster analysis method in step (1) is:

① Perform data preprocessing: using a feature extraction method and principal component analysis (PCA), perform feature selection and dimension reduction on the pulsar candidate data produced by a PRESTO-based pulsar search process, so as to obtain an input data set in a feature space with b new feature vectors; optional candidate physical features include pulsed radiation (single-peak, double-peak and multi-peak), period, dispersion value, signal-to-noise ratio, noise signal, signal ramp, sum of incoherent power, and coherent power;

② Calculate the Mahalanobis distance d_ij between data points i and j according to formula (1):

where S is the covariance matrix of the multi-dimensional random variables; then calculate the local Polynomial kernel density of each data point based on its K nearest neighbors according to formula (2);

where c is the bias coefficient and d is the order of the polynomial; to eliminate the influence of differences in the scale and magnitude of the data, min-max (dispersion) standardization is applied to both d_ij and ρ_i as follows:

where min_d and min_ρ are the minimum values of d_ij and ρ_i respectively, and max_d and max_ρ are the maximum values of d_ij and ρ_i respectively;

③ Eliminate outliers according to formula (5), then calculate the distance δ_i between non-outlier points by formula (6); eliminating outliers helps the selection of cluster center points; in addition, data points with too small a density are few in number and lie at the margins of the distribution; because of their scarcity and low density they are anomalous in the data distribution, and the anomaly may be pure noise or a new astronomical phenomenon (such as a special pulsar); these data are further assessed through the corresponding candidate diagnostic plots;

inlier={ρ_i>ρ_threhold}, ρ_threhold=0.01 (5)

④ All data points whose distance δ is greater than the threshold λ are used to generate a two-dimensional decision graph, in which the horizontal axis is the density ρ and the vertical axis is the distance δ; micro-clusters at the same density level are merged on the two-dimensional decision graph as follows: if the partitioned regions along the ρ axis or the δ axis contain two or more regions without data points, each such region is called an empty region; the empty regions divide all data points into two density regions: the rightmost density region is called the maximum-density region, and the rest are low-density regions;

(A) In the low-density regions, because discrimination is low, the corresponding micro-clusters are merged into one cluster;

(B) In the maximum-density region, if all representative points lie in the same δ region, each of them is selected as an independent cluster center; if they do not lie in the same δ region, the distances between them are not sufficiently discriminative and the points may belong to the same cluster, so the corresponding micro-clusters are merged into one large cluster;

⑤ Determine the number of clusters k and the center center_i of each cluster C_i (1≤i≤k);

⑥ Following the nearest-center principle, assign each data point x_j to the cluster of the nearest center_i; similarity is measured by the RBF kernel distance, as shown in formula (7); the RBF kernel function has local characteristics and strong learning ability, and measuring distance through the RBF kernel maps the distance measure into a high-dimensional space;

where η is the kernel function width; according to formula (8), the mean of all data points in the new cluster C_i' is taken as the new center center_i', where n_i is the total number of data points belonging to C_i';

⑦ Calculate the sum of squared errors (SSE) over all objects in the data set:

The algorithm stops when the SSE value no longer changes; otherwise it returns to step ⑥.

3. The parallel hybrid clustering method for candidate signal mining in pulsar search as claimed in claim 2, wherein the sliding-window grouping strategy of step (2) groups the data set as follows: to screen candidates as accurately as possible according to the data structure, the data are divided using a sliding window; first, the window value is fixed (Batchsize=1160) and the data set to be detected is divided into L blocks (if the last block does not contain enough data, it may be filled with data from the first block); the sliding window size is set to w=2; the first round starts from blocks 1 and 2, and the sliding window advances by one position in each round, pointing to the corresponding data blocks; the last round points to the last block combined with the first block, so that L rounds of division are performed in total; a relatively complete set of 1600 pulsar candidate feature records of various types is selected from real samples as the sample set, and in each round it is added to the data covered by the sliding window to form a data block to be detected, so that the data set is divided into L parallel data blocks to be detected; clustering rests on a basic assumption, namely that examples in the same cluster are more likely to share the same label; therefore, decision boundaries are set according to the dense and sparse regions of the data distribution, which determines the region occupied by pulsar data and separates pulsar signals from non-pulsar interference signals; the distribution density of pulsar samples within each cluster is computed as a measure of similarity, and clusters whose pulsar sample occupancy is greater than 50% are entered into the pulsar candidate list; the list of noise points excluded in step ③ of the cluster analysis method may lead to the discovery of new phenomena.

4. The parallel hybrid clustering method for candidate signal mining in pulsar search as claimed in claim 1 or 2, wherein in step (3) the clustering is realized by data block parallelization based on the MapReduce/Spark computing model as follows: for large-scale pulsar data processing, and in view of the Sun-Ni theorem, it is necessary to study a parallel implementation of the clustering algorithm on the MapReduce computing model; on the one hand, this can improve the accuracy of the clustering results, and on the other hand, it can reduce the number of data comparisons; the Sun-Ni theorem introduces a function G(p) to represent the increase in workload when storage capacity is limited; the theorem proposes that, provided the time limit specified for fixed-time speedup is satisfied and sufficient memory is available, scaling up the problem makes effective use of the memory space; first, the data are divided into L data blocks (Block(1),...,Block(L)), which are processed in parallel; next, the Map1 and Reduce1 functions compute the density of the data points in each Block(i) (1≤i≤L) and select the initial cluster centers (in the Map stage, the <key, value> input is: key is the row number and value is the list of the current sample's values in each dimension; in the Reduce stage, the output key.id is the initial cluster center); finally, the Map2 and Reduce2 functions iteratively compute the distance from each data point in Block(i) to the cluster centers (cluster centers(i)) and relabel the cluster to which the point belongs, and the Reduce2 function computes the new cluster centers in preparation for the next round of clustering; the distance between each cluster center of the current round and the corresponding center of the previous round is compared: if the change is less than a given threshold, the run ends; otherwise the new cluster centers are used as the centers for the next round; after clustering, the pulsar clusters and abnormal noise points are extracted; Spark, as a general-purpose computing engine for large-scale data processing, has a computing process similar to that of MapReduce.