CN110060735A

CN110060735A - A biological sequence clustering method based on k-mer group segmentation

Info

Publication number: CN110060735A
Application number: CN201910271872.4A
Authority: CN
Inventors: 江育娥; 俞婷婷; 林劼
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-07-26
Anticipated expiration: 2039-04-04
Also published as: CN110060735B

Abstract

The invention discloses a biological sequence clustering method based on k-mer group segmentation, comprising the following steps: step 1, segment the sequences in the data set and count the k-mer frequencies after segmentation; step 2, construct a bipartite graph according to the relationship between the sequences and the k-mers; step 3, randomly group the k-mers and calculate the importance of the sequences under each group of k-mers; step 4, sort the sequences in descending order of importance, screen out candidate sequences, and deduplicate them; step 5, cluster the candidate sequences to find the sequence centers; step 6, cluster all sequences. By constructing a bipartite graph model, the invention performs cluster analysis on biological sequences to obtain deeper information and reliable conclusions from biological sequence data, and effectively solves the following problems of the prior art: the high complexity of computing node weights, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Description

A biological sequence clustering method based on k-mer group segmentation

Technical field

The invention relates to the field of bioinformatics, and in particular to a biological sequence clustering method based on k-mer group segmentation.

Background

With the continuous development of next-generation sequencing technology, biological databases keep growing. These huge databases offer researchers broad opportunities but also pose a challenge: how to mine useful information from biological information databases on the scale of tens of billions. Data mining provides a basic means for such research. The similarity of biological sequences often reflects the relatedness of their functions, and cluster analysis is a commonly used data mining technique.

In biology, sequence alignment arranges biological sequences to identify similar regions that can describe the functional, structural, and evolutionary relationships between sequences. Clustering assigns similar sequences to the same group and dissimilar sequences to different groups, so that distances between sequences within a group are minimized and distances between groups are maximized. If two similar sequences are clustered into the same group, this indicates to some extent that they are homologous, which greatly saves the time and effort of re-determining the structure and function of unknown sequences. In addition, sequence alignment generally determines the analysis results of many bioinformatics techniques and programs and affects the conclusions and biological interpretation of many sequence comparison studies; it is an important part of research such as biological sequence cluster analysis.

The bipartite graph is one of the more widely used classes of graphs. Real-life applications include: how to assign jobs so as to best meet each person's needs, given each person's competence for each job; and how to schedule courses so that the constraints among classrooms, teachers, and students are satisfied. These are bipartite graph matching problems and can be solved by building a graph model.

Combining a graph model with clustering of biological sequences groups functionally related sequences into one class. This helps researchers quickly understand the functions of biological sequences and clarify the diversity of their internal populations, thereby promoting biodiversity conservation and the rational development and use of biological resources.

Summary of the invention

The purpose of the present invention is to provide a biological sequence clustering method based on k-mer group segmentation.

The technical scheme adopted by the present invention is:

A biological sequence clustering method based on k-mer group segmentation, comprising the following steps:

Step 1: Set the sliding window size, segment the sequences in the data set, and count the k-mer frequencies after segmentation;

Step 2: Construct a bipartite graph according to the relationship between the sequences and the k-mers, and for any sequence s_i count the frequencies of the k-mers it shares with each other sequence;

Step 3: Divide the k-mers randomly and uniformly into t groups g₁, g₂, ..., g_t, and calculate the importance of the sequences under each group of k-mers;

Step 4: Sort the sequences in descending order of importance: let k be the number of centers of the m sequences, and select k sequences at uniform intervals from each of the t groups as candidate sequences; deduplicate the resulting t*k candidate sequences to obtain the distinct candidate sequences;

Step 5: Perform K-means clustering on the candidate sequences, using the k-mer set obtained by segmenting the DNA sequences with the set sliding window size; in each cluster of the K-means result, select the point closest to the current centroid as the sequence center; proceeding in this way yields k central sequences;

Step 6: Cluster the m sequences S={s_i|i=1, 2, ..., m} based on the k central sequences.

Further, the data set in step 1 is a sequence set S={s_i|i=1, 2, ..., m} of size m, the sliding window size is L, and the k-mer set after segmentation is K={k_j|j=1, 2, ..., n}.

Further, the bipartite graph constructed in step 2 is G=(V, E), also called the sequence-k-mers graph. G=(V, E) is an undirected graph model composed of a node set and an edge set. V is the node set and can be decomposed into two subsets, V=S∪K, where S={s_i|i=1, 2, ..., m} is the sequence set, s_i is the i-th sequence, K={k_j|j=1, 2, ..., n} is the k-mer set, and k_j is the j-th k-mer. E is the set of edges formed by the interactions between nodes, and the two endpoints of each edge in E lie in subset S and subset K respectively, that is, E={e(s_i, k_j)|s_i∈S, k_j∈K}, where e(s_i, k_j) indicates a membership relationship between sequence s_i and k-mer k_j.

Further, the importance of a sequence in step 3 is determined as follows:

Step 3.1: Calculate the edge weights. When two sequences v_i and v_j share a k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the edge weight w_ji is the co-occurrence frequency of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share k-mers, w_ji represents the undirected interaction between the two nodes.

Step 3.2: Calculate the node weights.

For any two nodes v_i and v_j, node v_i acts on node v_j through the edge w_ji connecting them, and the edge weight determines how strongly v_i acts on v_j. When node v_j has edges to several nodes, that is, v_j has several adjacent nodes, the weight of v_j is the sum of the effects it receives from the other nodes.

Step 3.3: Iteratively calculate the weight of each node v_i to obtain the importance of node v_i.

Further, in step 3.1 the weight of the edge between adjacent nodes v_i and v_j is w_ji, calculated by the following formula:

$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\ |kmer_{v_j}|\right) \tag{1}$$

where kmer∈v_i & kmer∈v_j means that the k-mer is present in both node v_i and node v_j, and |kmer_{v_i}| and |kmer_{v_j}| denote the frequency of the current k-mer in node v_i and node v_j respectively. Since the interaction between nodes is undirected, w_ji=w_ij.

Further, in step 3.2 w_j. denotes the effect that node v_j receives from the other nodes, calculated by the following formula:

$$w_{j\cdot} = \sum_{v_i \in V} w_{ji} \tag{2}$$

where w_j. aggregates the contribution of each node in the node set V to node v_j.

Further, the importance of each node v_i in step 3.3 is WS(v_i), and the corresponding SeqRank formula is:

$$WS(v_i) = (1-d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j) \tag{3}$$

where d is the damping coefficient (0≤d≤1), representing the probability of walking from one node to another at any moment, so that each node has probability (1-d) of walking randomly to other nodes. v_j∈e(v_i, v_j) means that node v_j shares an edge with node v_i; in v_k∈e(v_j, v_k), v_k is a node that shares an edge with node v_j. w_ij (or w_ji) is the weight of the edge connecting node v_i and node v_j, that is, the sum of the co-occurrence frequencies of the k-mers present in both node v_i and node v_j. The denominator is the sum of the weights of the edges from node v_j to its adjacent nodes v_k, taken over v_k∈e(v_j, v_k). WS(v_j) is the importance of node v_j after the previous iteration.

Because computing a node's weight requires the weights of the nodes themselves, the calculation must be iterative. If WS(v_i)^t denotes the importance of node v_i after t iterations, formula (3) can be written as:

$$WS(v_i)^{t} = (1-d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)^{t-1} \tag{4}$$

The SeqRank algorithm iterates over the graph model until the convergence condition is met.

Further, d in step 3.3 satisfies 0≤d≤1.

Further, the value of d is 0.85.

Further, the central sequences in step 5 are determined as follows:

Step 5.1: Cluster the candidate sequences with the K-means algorithm, where the number of K-means centers is k and the features are the k-mer frequencies of the candidate sequences;

Step 5.2: For each cluster, select the point closest to the current center as the sequence center.

Further, the sequence clustering in step 6 is determined as follows:

Step 6.1: Label the k central sequences μ₁, μ₂, ..., μ_k;

Step 6.2: For each sequence s_i in the sequence set S, calculate its predicted category with the following formula:

$$pre_i = \arg\min_{1 \le j \le k} \|w_i - \mu_j\|^2 \tag{5}$$

where pre_i is the cluster among the k clusters nearest to sequence s_i, that is, the predicted category of the i-th sequence; ||w_i-μ_j||² is the squared Euclidean distance between the importance value w_i of sequence s_i and each center point; and the arg min indicates that the center point closest to w_i determines the predicted category of the i-th sequence. This yields the predicted categories of all m sequences.

By adopting the above technical scheme, the present invention builds a bipartite graph model and performs cluster analysis on biological sequences, aiming to derive deeper information and reliable conclusions from biological sequence data at the level of cluster analysis. It effectively solves problems in the prior art such as the high complexity of computing node weights, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

Description of drawings

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments;

Fig. 1 is a schematic flowchart of the biological sequence clustering method based on k-mer group segmentation of the present invention;

Fig. 2 is a schematic diagram of group g1 when the k-mers are grouped randomly and uniformly according to the present invention;

Fig. 3 is a schematic diagram of group g2 when the k-mers are grouped randomly and uniformly according to the present invention;

Fig. 4 is a schematic diagram of the bipartite graph of the three sequences X, Y and Z of the present invention;

Fig. 5 is a schematic diagram of the result of K-means clustering of the sequences according to the present invention.

Detailed description

As shown in any one of Figs. 1 to 5, the present invention discloses a biological sequence clustering method based on k-mer group segmentation. According to the relationship between sequences and k-mers under different sliding window sizes, the invention constructs a bipartite graph, groups the k-mers, calculates the importance of the sequences under the different groups, finds the sequence centers, and clusters the sequences. This effectively solves problems in the prior art such as the high complexity of computing node weights, the insufficient representativeness of node importance, and the strong influence of sequence length on node importance.

As shown in Fig. 1, the present invention discloses a clustering algorithm based on k-mer group segmentation, comprising the following steps:

Step 1: Segment the given sequence set S={s_i|i=1, 2, ..., m} with sliding window size L. The k-mer set after segmentation is K={k_j|j=1, 2, ..., n}, where m and n are the sizes of the sequence set S and the k-mer set respectively. Count the k-mer frequencies.

Specifically, the number of n-gram elements of length L in sequence s_i is (|s_i|-L+1).

For each sequence s_i, count the frequency of each k-mer. The set of possible k-mers is ∑^L, containing |∑|^L elements: for DNA sequences with 4 kinds of bases there are 4^L possible k-mers, and for protein sequences with 20 kinds of amino acids there are 20^L possible k-mers.

For example, for the three sequences X="ACAGT", Y="ACACG" and Z="CACGT", with the k-mer sliding window size set to 2, each of the three sequences contains 4 n-gram elements of length 2, and the occurrences of the k-mers in each sequence are shown in Table 1:

Table 1 Occurrences of k-mers in each sequence

k-mers       AC  AG  CA  CG  GT  sum
Sequence X    1   1   1   0   1    4
Sequence Y    2   0   1   1   0    4
Sequence Z    1   0   1   1   1    4
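The counts in Table 1 can be reproduced with a short Python sketch. The function name `count_kmers` is illustrative and does not come from the patent; it simply slides a window of size L over a sequence and tallies each k-mer.

```python
from collections import Counter

def count_kmers(seq: str, L: int) -> Counter:
    """Slide a window of size L over seq and count each k-mer."""
    # A sequence of length |s| yields |s| - L + 1 windows.
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}
table1 = {name: count_kmers(seq, 2) for name, seq in sequences.items()}

for name, counts in table1.items():
    print(name, dict(counts), "sum =", sum(counts.values()))
```

A `Counter` returns 0 for k-mers that never occur, which matches the zero entries of the table.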

Step 2: Construct the bipartite graph G=(V, E) according to the relationship between the sequences and the k-mers. G=(V, E) is an undirected graph model composed of a node set and an edge set, where V is the node set and E is the set of edges formed by the interactions between nodes, with E={e(s_i, k_j)|s_i∈S, k_j∈K}, where e(s_i, k_j) indicates a membership relationship between sequence s_i and k-mer k_j.

Specifically, Table 1 shows that sequences X and Y share the k-mers AC and CA, so X and Y are considered connected by an edge; sequences X and Z share the k-mers AC, CA and GT, so X and Z are considered connected by an edge; and sequences Y and Z share the k-mers AC, CA and CG, so Y and Z are considered connected by an edge.
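As a minimal sketch of this step, the bipartite edges e(s_i, k_j) and the shared k-mers of each sequence pair can be derived as follows. The names `kmer_set`, `edges` and `shared` are hypothetical helpers chosen for illustration.

```python
from itertools import combinations

def kmer_set(seq: str, L: int) -> set:
    """Distinct k-mers of length L in seq."""
    return {seq[i:i + L] for i in range(len(seq) - L + 1)}

sequences = {"X": "ACAGT", "Y": "ACACG", "Z": "CACGT"}

# Edges e(s_i, k_j) of the bipartite graph: one per (sequence, k-mer) membership.
edges = {(name, kmer) for name, seq in sequences.items() for kmer in kmer_set(seq, 2)}

# Two sequences are adjacent when they share at least one k-mer.
shared = {
    (a, b): kmer_set(sequences[a], 2) & kmer_set(sequences[b], 2)
    for a, b in combinations(sequences, 2)
}
for pair, kmers in shared.items():
    print(pair, sorted(kmers))
```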

Step 3: Calculate the importance of the m sequences.

Step 3.1: Divide the n k-mers randomly and uniformly into t groups g₁, g₂, ..., g_t. As shown in Fig. 2, sequence s₁ and sequence s_i share the k-mer k₁, and sequence s₁ and sequence s_m share the k-mer k₃; that is, s₁ interacts with s_i, and s₁ interacts with s_m. Taking the sequences and k-mers as nodes, the graph model G=(V, E) can then be built.
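The text does not say how the random, uniform split is carried out. One plausible sketch, assuming a shuffle followed by round-robin dealing, is shown below; the function name `split_kmers` and its `seed` parameter are assumptions added for reproducibility.

```python
import random

def split_kmers(kmers, t, seed=None):
    """Shuffle the k-mers and deal them into t groups of near-equal size."""
    shuffled = list(kmers)
    random.Random(seed).shuffle(shuffled)
    # Dealing round-robin keeps group sizes within one of each other.
    return [shuffled[g::t] for g in range(t)]

groups = split_kmers(["AC", "AG", "CA", "CG", "GT"], t=2, seed=0)
print(groups)
```

Every k-mer lands in exactly one group, so the importance calculation can then be run once per group.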

Step 3.2: Calculate the edge weights. When two sequences v_i and v_j share a k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the edge weight w_ji is the co-occurrence frequency of the k-mers present in both sequences. That is, for any two sequences v_i and v_j that share k-mers, w_ji represents the undirected interaction between the two nodes and is calculated by the following formula:

$$w_{ji} = \sum_{kmer \in v_i \,\&\, kmer \in v_j} \min\left(|kmer_{v_i}|,\ |kmer_{v_j}|\right)$$

Specifically, for the three sequences X="ACAGT", Y="ACACG" and Z="CACGT", with the k-mer sliding window size set to 2, the bipartite graph of the three sequences is shown in Fig. 3:

The k-mer frequencies of the three sequences X, Y and Z are shown in Table 2.

Table 2 Occurrences of k-mers in the three sequences

k-mers       AC  AG  CA  CG  GT
Sequence X    1   1   1   0   1
Sequence Y    2   0   1   1   0
Sequence Z    1   0   1   1   1

Table 2 shows that sequences X and Y share the k-mers AC and CA. Since the edge weight is the co-occurrence frequency of the k-mers present in both sequences, the relationship between sequence X and sequence Y in Table 2 reduces to:

Table 3 Co-occurrence frequencies of the k-mers shared by sequences X and Y

k-mers       AC  CA
Sequence X    1   1
Sequence Y    2   1

The interaction w_YX between sequence X and sequence Y is then calculated as:

w_YX = min(|AC_X|, |AC_Y|) + min(|CA_X|, |CA_Y|) = 1 + 1 = 2
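The worked value w_YX = 2 can be checked with a small sketch that sums, over the shared k-mers, the smaller of the two frequencies. The helper names `count_kmers` and `edge_weight` are illustrative only.

```python
from collections import Counter

def count_kmers(seq, L):
    """Frequency of each k-mer of length L in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

def edge_weight(counts_i, counts_j):
    """Sum, over shared k-mers, of the smaller of the two frequencies."""
    return sum(min(counts_i[k], counts_j[k]) for k in counts_i.keys() & counts_j.keys())

x, y = count_kmers("ACAGT", 2), count_kmers("ACACG", 2)
w_yx = edge_weight(x, y)
print(w_yx)  # 2
```

Because `min` is symmetric in its arguments, the function returns the same value for either argument order, in line with w_ji = w_ij.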

Step 3.3: Calculate the node weights. For any two nodes v_i and v_j, node v_i acts on node v_j through the edge w_ji connecting them, and the edge weight determines how strongly v_i acts on v_j. When node v_j has edges to several nodes, that is, v_j has several adjacent nodes, w_j. denotes the effect that v_j receives from the other nodes and is calculated by the following formula:

$$w_{j\cdot} = \sum_{v_i \in V} w_{ji}$$

Computing the co-occurrence frequencies of the k-mers shared among the three sequences X, Y and Z, in the same way as for Table 3, gives the following relationship matrix M:

Table 4 Co-occurrence frequencies of the k-mers shared among sequences X, Y and Z

             Sequence X  Sequence Y  Sequence Z
Sequence X            0           2           3
Sequence Y            2           0           3
Sequence Z            3           3           0

In Table 4, the matrix M has size |V|*|V|, where |V| is the number of nodes. Each value is the interaction between two sequences; for example, M[1, 3] indicates that the interaction between sequence X and sequence Z is 3.
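The matrix in Table 4 can be rebuilt from the raw sequences with the sketch below. `relation_matrix` is a hypothetical helper; it relies on the fact that the `&` operator on two `Counter` objects keeps the minimum count of each common key.

```python
from collections import Counter

def count_kmers(seq, L):
    """Frequency of each k-mer of length L in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))

def relation_matrix(seqs, L):
    """Symmetric matrix of pairwise edge weights; the diagonal stays 0."""
    counts = [count_kmers(s, L) for s in seqs]
    n = len(seqs)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Counter & Counter keeps the minimum count of each shared k-mer.
            w = sum((counts[i] & counts[j]).values())
            M[i][j] = M[j][i] = w
    return M

M = relation_matrix(["ACAGT", "ACACG", "CACGT"], 2)
print(M)  # [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
```

Python lists are 0-based, so the entry written M[1, 3] in the text is `M[0][2]` here.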

Iteratively calculating the weight of each node v_i gives the importance WS(v_i) of node v_i; the corresponding SeqRank formula is:

$$WS(v_i) = (1-d) + d \sum_{v_j \in e(v_i, v_j)} \frac{w_{ji}}{\sum_{v_k \in e(v_j, v_k)} w_{jk}}\, WS(v_j)$$
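The iteration can be sketched as a TextRank-style weighted update, which is the form suggested by the definitions of d, w_ji and WS(v_j) given in this document. The starting value of 1.0, the tolerance `tol`, and the iteration cap `max_iter` are assumptions of this sketch, not values stated in the patent.

```python
def seqrank(M, d=0.85, tol=1e-8, max_iter=1000):
    """Iterate WS(v_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(v_j)."""
    n = len(M)
    out_weight = [sum(row) for row in M]   # sum_k w_jk for each node j
    ws = [1.0] * n                          # assumed starting importance
    for _ in range(max_iter):
        new = []
        for i in range(n):
            # Only neighbours (w_ji > 0) contribute, so out_weight[j] > 0 here.
            incoming = sum(
                M[j][i] / out_weight[j] * ws[j]
                for j in range(n)
                if M[j][i] > 0
            )
            new.append((1 - d) + d * incoming)
        converged = max(abs(a - b) for a, b in zip(new, ws)) < tol
        ws = new
        if converged:
            break
    return ws

M = [[0, 2, 3], [2, 0, 3], [3, 3, 0]]   # relationship matrix of X, Y and Z
scores = seqrank(M)
print(scores)
```

With this matrix, X and Y receive identical scores by symmetry, and Z, which has the largest total edge weight, scores highest.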

Step 4: Sort the sequences in descending order of importance. A t×m matrix I represents the importance of the m sequences under the different groups of k-mers, as shown in Table 5.

Table 5 Importance of the m sequences under the t groups

Specifically, in matrix I the rows correspond to the t groups of k-mers and the columns to the m sequences, and I[p, q] is the importance value of the q-th sequence (1≤q≤m) under the p-th group of k-mers (1≤p≤t). For example, I[1, ] is the importance of the m sequences under group g₁, and I[, q] is the importance of the q-th sequence calculated under all groups.

Note that the same sequence has different importance under different groups: for example, sequence s₁ in matrix I has importance 0.9696823 under group g₁ and 1.040769 under group g₂. Conversely, the most important sequence can differ between groups: for example, the most important sequence under group g₁ is s_m, while under group g_p it is s₁.

Within each group, the sequences are sorted in descending order of importance; the form of the result is shown in Table 6.

Table 6 Sequences sorted in descending order of importance

Unlike Table 5, the values of matrix I in Table 6 are the sequence numbers after descending sorting. For example, I[2, 1]=7 means that sequence 7 is the most important sequence calculated under group g₂, and I[p, m]=7 means that sequence 7 is considered the least important sequence under group g_p. The most important sequence number differs between groups of k-mers.

Assuming k (k≤m) is the number of centers of the m sequences, and taking the sequence numbers SR₁ under group g₁ as the reference, k sequences are selected at uniform intervals from each of the t groups, as shown in the shaded part of Table 7.

Table 7 Screening by importance

For each group's k selected sequence numbers, the corresponding sequences are taken as candidate sequences, so the candidate set has size t*k. The t*k sequences may contain duplicates: for example, sequences 5 and 7 in group g₁ are also in the candidate set of group g_t, and sequence 78 in group g_t is also in the candidate set of group g₂. The t*k selected candidate sequences are therefore deduplicated, giving n (n≤t*k) distinct candidate sequences.
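The ranking, uniform-interval selection and deduplication can be sketched as below. The spacing rule (a step of m // k between picked ranks) and the choice to rank every group independently are assumptions, since the tables that illustrate the exact selection are not reproduced here; the importance values are hypothetical.

```python
def select_candidates(I, k):
    """I[p][q] is the importance of sequence q under group p (0-based)."""
    m = len(I[0])
    step = max(m // k, 1)  # assumed spacing between picked ranks
    candidates = []
    for row in I:
        # Sequence indices sorted from most to least important.
        ranked = sorted(range(m), key=lambda q: row[q], reverse=True)
        candidates.extend(ranked[::step][:k])
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))

I = [
    [0.9, 1.2, 0.7, 1.1, 1.0, 0.8],   # hypothetical importances under g1
    [1.3, 0.6, 1.0, 0.9, 1.1, 0.7],   # hypothetical importances under g2
]
print(select_candidates(I, k=3))  # [1, 4, 5, 0, 2]
```

Here t*k = 6 picks shrink to 5 distinct candidates because sequence 5 is selected under both groups.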

Step 5: Cluster the candidate sequences with the K-means algorithm. The number of K-means centers is k, and the features are the k-mer frequencies of the candidate sequences. With sliding window size L, the set of k-mers that may occur in DNA sequences is ∑^L. Assuming L=2, we have the following matrix O:

Table 8 Frequencies of k-mers in the candidate sequences

The matrix O has size n*|∑|^L, where n is the number of candidate sequences and ∑^L is the k-mer set for the current sliding window size L. When L=2, for DNA sequences ∑^L={AA, AC, AG, ..., TT}. In matrix O, O[s_i, ] gives the frequency of each k-mer in the i-th sequence, for example 140, 122, 200, ..., 101.
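One row of the matrix O can be built as in the following sketch, which enumerates every possible k-mer over the alphabet and records its frequency in a sequence. The function name `kmer_features` and the lexicographic column order are assumptions made for illustration.

```python
from itertools import product

def kmer_features(seq, L, alphabet="ACGT"):
    """Frequency of every possible k-mer in alphabet^L, in lexicographic order."""
    vocab = ["".join(p) for p in product(alphabet, repeat=L)]
    counts = dict.fromkeys(vocab, 0)
    for i in range(len(seq) - L + 1):
        kmer = seq[i:i + L]
        if kmer in counts:  # skip windows with symbols outside the alphabet
            counts[kmer] += 1
    return vocab, [counts[k] for k in vocab]

vocab, row = kmer_features("ACACG", 2)
print(len(vocab), vocab[:3], vocab[-1])
print(row)
```

For DNA with L=2 the row has 4^2 = 16 entries; passing a 20-letter amino acid alphabet would give 20^L columns instead.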

The K-means result is shown in Fig. 4. In Fig. 4, the n candidate sequences are clustered into k classes: Cluster1, Cluster2, ..., Clusterk. For each cluster, the point closest to the current centroid is selected as the sequence center. For Cluster2, for example, under some distance measure the point closest to the centroid of Cluster2 is point A, so the sequence corresponding to point A is taken as the central sequence of Cluster2. Proceeding in this way yields k central sequences.
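A minimal pure-Python sketch of this step follows: plain Lloyd iterations for K-means, then the member nearest to each centroid is returned as that cluster's center. Seeding the centroids with the first k points, using squared Euclidean distance, and the toy 2-dimensional points are all assumptions; the text leaves the distance measure and initialization open.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def center_sequences(points, k):
    """Index of the member closest to each cluster centroid."""
    centroids, labels = kmeans(points, k)
    centers = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            centers.append(min(members, key=lambda i: sq_dist(points[i], centroids[c])))
    return centers

# Hypothetical 2-dimensional rows standing in for k-mer frequency vectors.
points = [[0, 0], [10, 10], [1, 1], [2, 2], [9, 10], [11, 10], [10, 13]]
print(center_sequences(points, k=2))  # [2, 1]
```

Returning an actual member rather than the centroid itself matters because a centroid is an average and generally does not correspond to any real sequence.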

Step 6: Cluster the m sequences S={s_i|i=1, 2, ..., m}.

Specifically, when the sliding window size is L, the k-mer set of the m sequences is K={k_j|j=1, 2, ..., |∑|^L}, and the k-mer frequency matrix O built from it has size m*|∑|^L.

With the k centers, and with the frequency matrix O of the k-mers in the m sequences as features, the clustering process of the K-means algorithm is as follows:

(1) Label the k selected centers μ₁, μ₂, ..., μ_k;

(2) For each point w_i,j in matrix O, calculate the category it belongs to using formula (5):

$$pre_i = \arg\min_{1 \le j \le k} \|w_i - \mu_j\|^2 \tag{5}$$

In formula (5), pre_i is the cluster among the k clusters nearest to sequence s_i, that is, the predicted category of the i-th sequence; ||w_i-μ_j||² is the squared Euclidean distance between the importance value w_i of sequence s_i and each center point; and the arg min indicates that the center point closest to w_i determines the predicted category of the i-th sequence. This yields the predicted categories of all m sequences.
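The final assignment is a nearest-center lookup, sketched below under the assumption that each sequence is represented by its row of the frequency matrix O. The function name `assign_clusters` and the sample rows are hypothetical.

```python
def assign_clusters(rows, centers):
    """pre_i = argmin_j ||w_i - mu_j||^2 for every row w_i."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda j: sq_dist(w, centers[j])) for w in rows]

# Hypothetical frequency rows and two center rows mu_1, mu_2.
rows = [[2, 0, 1], [0, 3, 1], [1, 0, 1], [0, 2, 2]]
centers = [[2, 0, 1], [0, 3, 1]]
print(assign_clusters(rows, centers))  # [0, 1, 0, 1]
```

The returned labels are 0-based indices into `centers`, so label 0 corresponds to μ₁ and label 1 to μ₂.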

Claims

1. A biological sequence clustering method based on k-mer group segmentation, characterized in that it comprises the following steps:

Step 1: Obtain the data set of sequences to be processed, and segment the sequences in the data set according to the set sliding window size to obtain the k-mer set;

Step 2: Construct a bipartite graph according to the relationship between the sequences and the k-mer set, and for any sequence s_i count the frequencies of the k-mers it shares with each other sequence;

Step 3: Divide the k-mers randomly and uniformly into t groups g₁, g₂, ..., g_t, and calculate the importance of the sequences under each group of k-mers;

Step 4: Sort the sequences in descending order of importance: let k be the number of centers of the m sequences, and select k sequences at uniform intervals from each of the t groups as candidate sequences; deduplicate the candidate sequences to obtain the distinct candidate sequences;

Step 5: Perform K-means clustering on the candidate sequences, using the k-mer set obtained by segmenting the DNA sequences with the set sliding window size; in each cluster of the K-means result, select the point closest to the current centroid as the sequence center; proceeding in this way yields k central sequences;

Step 6: Cluster the m sequences S={s_i|i=1, 2, ..., m}.

2. a kind of biological sequence clustering method based on k-mer group segmentation according to claim 1 is characterized in that: the data set in step 1 is a sequence set S={s _i |i with a length of m =1, 2,...,m}, the size of the sliding window is L, and the k-mers set after segmentation is K={k _j |j=1,2,...,n}.

3. The biological sequence clustering method based on k-mer group segmentation according to claim 1, characterized in that: the bipartite graph constructed in step 2 is G=(V, E), also called the sequence-k-mers graph; G=(V, E) is an undirected graph model composed of a node set and an edge set,

where V is the set of nodes and can be decomposed into two subsets, namely V=S∪K; S={s_i | i=1,2,...,m} is the sequence set, s_i is the i-th sequence, K={k_j | j=1,2,...,n} is the k-mers set, and k_j is the j-th k-mer;

E is the set of edges formed by the relationships between nodes, and the two endpoints of each edge in E lie in subset S and subset K respectively, that is, E={e(s_i, k_j) | s_i∈S, k_j∈K}, where e(s_i, k_j) means that there is a membership relationship between the sequence s_i and the k-mer k_j.
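One possible in-memory representation of the sequence-k-mers graph, given purely as a sketch, keeps two adjacency maps, one per side of the bipartition. Indexing sequences by their position in the input list and the function name are assumptions of the example.

```python
from collections import Counter, defaultdict

def build_bipartite_graph(sequences, window):
    """Return (seq_to_kmers, kmer_to_seqs) adjacency maps for G = (V, E), V = S ∪ K."""
    seq_to_kmers = {}
    kmer_to_seqs = defaultdict(set)
    for i, seq in enumerate(sequences):
        counts = Counter(seq[p:p + window] for p in range(len(seq) - window + 1))
        # An edge e(s_i, k_j) exists exactly when k_j occurs in s_i;
        # the stored count is the word frequency of k_j in s_i.
        seq_to_kmers[i] = counts
        for kmer in counts:
            kmer_to_seqs[kmer].add(i)
    return seq_to_kmers, dict(kmer_to_seqs)
```

Two sequences share a k-mer when both of their indices appear in the same entry of `kmer_to_seqs`.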

4. The biological sequence clustering method based on k-mer group segmentation according to claim 1, characterized in that: the sequence importance in step 3 is determined as follows:

Step 3.1: Calculate the weight of each edge: when two sequences v_i and v_j share a common k-mer, v_i and v_j are considered adjacent nodes connected by an edge, and the edge weight w_ji is the co-occurrence frequency of the k-mers present in both sequences;

Step 3.2: Calculate the weight of each node:

For any two nodes v_i and v_j, node v_i transmits its effect to node v_j through the edge w_ji connecting them, and the size of the edge weight determines the effect of v_i on v_j;

When node v_j has edges to multiple nodes, that is, node v_j has multiple adjacent nodes, the weight of node v_j is the sum of the effects that node v_j receives from the other nodes;

Step 3.3: Iteratively calculate the weight of each node v_i to obtain the importance of node v_i.

5. The biological sequence clustering method based on k-mer group segmentation according to claim 4, characterized in that: in step 3.1, the weight of the edge between adjacent nodes v_i and v_j is w_ji, which is calculated by the following formula:

Where kmer∈v_i & kmer∈v_j means that the k-mer is present in both node v_i and node v_j; the two frequency terms respectively denote the frequency of the current k-mer in node v_i and in node v_j; and w_ji=w_ij.
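A hedged sketch of an edge-weight computation is given below. The exact way the two per-node frequencies are combined is not recoverable from the text above, so the example assumes they are added for every shared k-mer; this choice is symmetric, consistent with w_ji=w_ij, but it remains an assumption, as does the function name.

```python
from collections import Counter

def edge_weight(counts_i, counts_j):
    """Weight of the edge between two sequence nodes from their k-mer counts.

    ASSUMPTION: for every k-mer present in both nodes, add its frequency in
    v_i and its frequency in v_j, then sum over the shared k-mers.
    """
    shared = counts_i.keys() & counts_j.keys()
    return sum(counts_i[kmer] + counts_j[kmer] for kmer in shared)
```

A weight of zero under this sketch means the two sequences share no k-mer and are therefore not adjacent.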

6. The biological sequence clustering method based on k-mer group segmentation according to claim 4, characterized in that: in step 3.2, w_j. denotes the effect that node v_j receives from the other nodes, and w_j. is calculated by the following formula:

$$ w_{j\cdot} = \sum_{v_i \in V} w_{ji} $$

Where the sum runs over the contributions w_ji of each node v_i in the node set V to node v_j.
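For illustration, the node weight can be accumulated from a table of edge weights as sketched here. Storing each undirected edge once under a hypothetical (i, j) key is an assumption of the example.

```python
def node_weights(edge_weights, nodes):
    """Sum, for every node, the weights of all edges incident to it.

    `edge_weights` maps an unordered pair (i, j), stored once, to its weight.
    Because the graph is undirected, each edge contributes to both endpoints.
    """
    totals = {node: 0 for node in nodes}
    for (i, j), weight in edge_weights.items():
        totals[i] += weight
        totals[j] += weight
    return totals
```

For example, `node_weights({(0, 1): 3, (1, 2): 2}, [0, 1, 2])` gives node 1 the weight 5, the sum of its two incident edges.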

7. The biological sequence clustering method based on k-mer group segmentation according to claim 4, characterized in that: the importance of each node v_i in step 3.3 is WS(v_i), and the corresponding SeqRank formula for WS(v_i) is as follows:

Where d is the damping coefficient with 0≤d≤1, representing the probability of walking from one node to another node at any time; v_j ∈ e(v_i, v_j) means that node v_i and node v_j share an edge; in v_k ∈ e(v_j, v_k), v_k is a node that shares an edge with node v_j; w_ij or w_ji is the weight of the edge connecting node v_i and node v_j, that is, the sum of the co-occurrence frequencies of the k-mers present in both node v_i and node v_j; the denominator is the sum of the weights of the edges from node v_j to each node v_k with v_k ∈ e(v_j, v_k); and WS(v_j) is the importance of node v_j after the previous iteration;

Using WS(v_i)^t to denote the importance of node v_i after t iterations, formula (3) can be expressed as:

The SeqRank algorithm iteratively calculates the graph model until the convergence conditions are met.
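The iteration can be sketched as follows, under stated assumptions. The update used here is the TextRank-style form WS(v_i) = (1 - d) + d * Σ_j (w_ji / Σ_k w_jk) * WS(v_j); the constant term (1 - d), the initial score of 1.0, the tolerance, and the iteration cap are assumptions of the example and are not taken from the claim.

```python
def seqrank(weights, d=0.85, tol=1e-6, max_iter=100):
    """Iterate a weighted, damped rank over an undirected sequence graph.

    `weights[i][j]` holds w_ij for each neighbour j of i (symmetric input).
    ASSUMPTION: WS(v_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(v_j).
    """
    scores = {v: 1.0 for v in weights}
    # Denominator of the update: total weight of the edges leaving each node.
    out_sum = {v: sum(nbrs.values()) for v, nbrs in weights.items()}
    for _ in range(max_iter):
        new_scores = {}
        for i, nbrs in weights.items():
            received = sum(w / out_sum[j] * scores[j]
                           for j, w in nbrs.items() if out_sum[j] > 0)
            new_scores[i] = (1 - d) + d * received
        # Stop once no score moved by more than the tolerance.
        converged = max(abs(new_scores[v] - scores[v]) for v in weights) < tol
        scores = new_scores
        if converged:
            break
    return scores
```

In a star-shaped graph the hub receives the full score of every leaf while each leaf receives only a share of the hub's score, so the hub ends up with the highest importance.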

8. The biological sequence clustering method based on k-mer group segmentation according to claim 7, characterized in that: the value of d is 0.85.

9. The biological sequence clustering method based on k-mer group segmentation according to claim 1, characterized in that: the central sequences in step 5 are determined as follows:

Step 5.1: Use the K-means algorithm to cluster the candidate sequences, where the number of K-means centers is k and the features are the k-mers frequencies of the candidate sequences;

Step 5.2: For each cluster, select the point closest to the current center as the sequence center.
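Steps 5.1 and 5.2 might be sketched as below, using a plain Lloyd-style K-means written in standard Python so that the example stays self-contained. Seeding the centroids with the first k points, the fixed iteration count, and the function names are assumptions of the example; each point is taken to be the k-mers frequency vector of one candidate sequence.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50):
    """Plain Lloyd iteration; ASSUMPTION: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def center_sequences(points, k):
    """Index of the actual candidate closest to each centroid (Step 5.2)."""
    labels, centroids = kmeans(points, k)
    centers = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            centers.append(min(members,
                               key=lambda i: squared_distance(points[i], centroids[c])))
    return centers
```

Returning an actual candidate rather than the centroid itself ensures that every center corresponds to a real sequence in the data set.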

10. The biological sequence clustering method based on k-mer group segmentation according to claim 1, characterized in that: the sequence clustering in step 6 is determined as follows:

Step 6.1: Mark the k center sequences as μ_1, μ_2, ..., μ_k respectively;

Step 6.2: For each sequence s_i in the sequence set S, use the following formula to calculate its predicted category:

$$ pre_i = \arg\min_{1 \le j \le k} \lVert w_i - \mu_j \rVert^2 $$

Where pre_i is the category, among the k clusters, that is closest to the sequence s_i, that is, the predicted category of the i-th sequence; w_i is the importance value of the sequence s_i; ||w_i - μ_j||² is the squared Euclidean distance between w_i and the center point μ_j; and arg min indicates that the center point closest to w_i is taken as the predicted category of the i-th sequence.
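The assignment of Step 6.2 can be illustrated with the short sketch below. It assumes that each w_i and each μ_j is available as a numeric vector of the same length, for example one importance value per k-mer group, and it returns 0-based category indices; both points are assumptions of the example.

```python
def predict_categories(vectors, centers):
    """Assign every vector w_i to the index j of its nearest center mu_j."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(centers)), key=lambda j: squared_distance(w, centers[j]))
            for w in vectors]
```

When two centers are equally close, this sketch keeps the one with the lower index.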