CN110060735A - A kind of biological sequence clustering method based on the segmentation of k-mer group - Google Patents
A kind of biological sequence clustering method based on the segmentation of k-mer group Download PDFInfo
- Publication number
- CN110060735A CN110060735A CN201910271872.4A CN201910271872A CN110060735A CN 110060735 A CN110060735 A CN 110060735A CN 201910271872 A CN201910271872 A CN 201910271872A CN 110060735 A CN110060735 A CN 110060735A
- Authority
- CN
- China
- Prior art keywords
- sequence
- node
- mers
- sequences
- importance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Analytical Chemistry (AREA)
- Probability & Statistics with Applications (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
本发明公开一种基于k‑mer组群分割的生物序列聚类方法,其包括以下步骤:步骤1,对数据集中的序列进行分割,统计分割后的k‑mers词频;步骤2,根据序列与k‑mers间的关系,构造二部图;步骤3,将k‑mers随机分组,并计算每组k‑mers下序列的重要性;步骤4,对序列的重要性进行逆序排序,筛选出候选序列,并对其去重;步骤5,对候选序列聚类,寻找序列中心;步骤6,对所有序列进行聚类。本发明通过构造二部图模型,对生物序列进行聚类分析,从生物序列数据中得到深层信息含义及可靠的结论,有效地解决现有计算结点权重复杂度高、结点重要性的代表性不够以及结点重要性受序列长度影响较大的问题。
The invention discloses a biological sequence clustering method based on k-mer group segmentation, which comprises the following steps: step 1, segment the sequences in the data set, and count the k-mer word frequencies after the segmentation; step 2, according to the sequence and The relationship between the k-mers, construct a bipartite graph; step 3, randomly group the k-mers, and calculate the importance of the sequences under each group of k-mers; step 4, sort the importance of the sequences in reverse order, and filter out the candidates sequence, and deduplicate it; step 5, cluster candidate sequences to find sequence centers; step 6, cluster all sequences. By constructing a bipartite graph model, the present invention performs cluster analysis on biological sequences, obtains deep information meaning and reliable conclusions from biological sequence data, and effectively solves the problems of high complexity of existing computing node weights and representative nodes of importance. The problem of insufficient sex and the importance of nodes is greatly affected by the length of the sequence.
Description
技术领域technical field
本发明涉及生物信息学领域,尤其涉及一种基于k-mer组群分割的生物序列聚类方法。The invention relates to the field of bioinformatics, in particular to a biological sequence clustering method based on k-mer group segmentation.
背景技术Background technique
随着二代测序技术的不断发展,生物学数据库的规模在日益扩大。庞大的生物信息数据库为科研人员提供了广阔的机遇,也带来了挑战,如何在这上百亿的生物信息数据库中挖掘出有用的信息,数据挖掘为科研工作提供了基本手段。生物序列的相似性往往体现其功能的相关性,而聚类分析是数据挖掘中常用的技术。With the continuous development of next-generation sequencing technology, the scale of biological databases is expanding. The huge biological information database provides researchers with broad opportunities and challenges. How to mine useful information from the tens of billions of biological information databases, data mining provides a basic means for scientific research. The similarity of biological sequences often reflects the correlation of their functions, and cluster analysis is a commonly used technique in data mining.
在生物学中,序列比对通过排列生物序列的方式,识别可以描述序列间的功能、结构以及进化关系的相似序列区域。聚类是把相似序列划分到相同的组中,不相似的序列划分到不同组中,使得同一组间的序列距离最小,而不同组内的距离最大。如果两条相似的序列被聚到同一组内,在一定程度上说明二者具有同源性,这将会大大节省重新测定未知序列结构和功能的时间和精力。此外,序列比对一般决定了许多生物信息学技术及程序的分析结果,影响着很多序列比较研究的结论和生物解释,是生物序列聚类分析等研究中的一个重要内容。In biology, sequence alignment identifies similar sequence regions that describe the function, structure, and evolutionary relationship between sequences by aligning biological sequences. Clustering is to divide similar sequences into the same group, and dissimilar sequences into different groups, so that the sequence distance between the same group is the smallest, and the distance between different groups is the largest. If two similar sequences are clustered into the same group, it indicates that the two have homology to a certain extent, which will greatly save the time and energy of re-determining the structure and function of unknown sequences. In addition, sequence alignment generally determines the analysis results of many bioinformatics techniques and programs, and affects the conclusions and biological interpretations of many sequence comparison studies. It is an important content in studies such as biological sequence cluster analysis.
二部图(Bipartite graph)是运用比较广泛的图类之一,在实际生活中的运用有:如何安排工作才能最大程度满足每个人的需求,知道每个人的工作胜任情况;如何安排课程,才能使教室、教师与学生三者之间的条件得到满足。这些都涉及到二部图的匹配问题,可以通过建立图模型来解决此类问题。Bipartite graph is one of the more widely used graph classes. The applications in real life include: how to arrange work to meet the needs of each person to the greatest extent, and to know the competence of each person's work; how to arrange courses to So that the conditions between classrooms, teachers and students are met. These all involve the matching problem of bipartite graphs, which can be solved by establishing a graph model.
结合图模型,对生物序列进行聚类研究,把功能相关的序列聚为一类,这样可以帮助科研人员快速了解生物序列的功能,明确其内部种群的多样性,从而促进生物多样性保护,合理开发利用生物资源。Combined with graphical models, clustering research on biological sequences, and grouping functionally related sequences into one category, can help researchers quickly understand the function of biological sequences and clarify the diversity of their internal populations, thereby promoting biodiversity conservation. Reasonable Development and utilization of biological resources.
发明内容SUMMARY OF THE INVENTION
本发明的目的在于提供一种基于k-mer组群分割的生物序列聚类方法。The purpose of the present invention is to provide a biological sequence clustering method based on k-mer group segmentation.
本发明采用的技术方案是:The technical scheme adopted in the present invention is:
一种基于k-mer组群分割的生物序列聚类方法,其包括以下步骤:A biological sequence clustering method based on k-mer group segmentation, comprising the following steps:
步骤1:设定滑动窗口大小,对数据集中的序列进行分割,并统计分割后的k-mers词频;Step 1: Set the size of the sliding window, segment the sequences in the dataset, and count the k-mers word frequency after segmentation;
步骤2:根据序列与k-mers间的关系,构造二部图,并分别统计任意序列si与其他序列间共同出现的k-mers的词频;Step 2: According to the relationship between the sequence and k-mers, construct a bipartite graph, and separately count the word frequencies of k-mers co-occurring between any sequence si and other sequences;
步骤3:将k-mers随机均匀地分成t组:g1,g2,…,gt,计算每组k-mers下序列的重要性;Step 3: Divide the k-mers into t groups randomly and uniformly: g 1 , g 2 ,..., g t , and calculate the importance of the sequence under each group of k-mers;
步骤4:对序列的重要性进行逆序排序:设k为m条序列的中心数,按照均匀的间隔从t组中各筛选出k条序列,作为候选序列;并对t*k条序列中对候选序列去重得到不重复的候选序列;Step 4: Sort the importance of the sequences in reverse order: let k be the number of centers of m sequences, and screen out k sequences from the t group according to uniform intervals as candidate sequences; Candidate sequences are deduplicated to obtain non-repeated candidate sequences;
步骤5:对候选序列进行k-mers聚类;基于设定的滑动窗口大小对于DNA序列进行聚类得到k-mers集合;在K-means的聚类结果中的每一个簇筛选出与当前质心最接近的点作为序列中心;以此类推,可得到k条中心序列;Step 5: Perform k-mers clustering on the candidate sequences; perform clustering on the DNA sequences based on the set sliding window size to obtain a k-mers set; in each cluster in the K-means clustering result, filter out the cluster with the current centroid. The closest point is used as the sequence center; and so on, k center sequences can be obtained;
步骤6:基于k条中心序列对m条序列S={si|i=1,2,...,m}进行聚类。Step 6: Perform clustering on m sequences S={s i |i=1, 2, . . . , m} based on the k central sequences.
进一步地,步骤1中的数据集为一个长度为m的序列集合S={si|i=1,2,...,m},滑动窗口大小为L,且分割后的k-mers集合为K={kj|j=1,2,...,n}。Further, the data set in step 1 is a sequence set S={s i |i=1, 2, ..., m} of length m, the sliding window size is L, and the segmented k-mers set is K={k j |j=1,2,...,n}.
进一步地,步骤2中构造的二部图为G=(V,E),也作序列-k-mers图。G=(V,E)是由结点集和边集组成的一个无向图模型,其中V为结点集,且V可以分解为两个子集,即V=S∪K,且S={si|i=1,2,...,m}为序列集合,si为第i条序列,K={kj|j=1,2,...,n}为k-mers集合,kj为第j个k-mers;E代表结点间相互作用关系形成的边的集合,且E中每条边的两个端点分别在子集S和子集K中,即E={e(si,kj)|si∈S,kj∈K},其中e(si,kj)表示序列si与k-merskj间存在隶属关系。Further, the bipartite graph constructed in step 2 is G=(V, E), which is also referred to as a sequence-k-mers graph. G=(V, E) is an undirected graph model composed of node sets and edge sets, where V is the node set, and V can be decomposed into two subsets, namely V=S∪K, and S={s i |i=1,2,...,m} is the sequence set, s i is the i-th sequence, K={k j |j=1,2,...,n} is k -mers set, k j is the jth k-mers; E represents the set of edges formed by the interaction relationship between nodes, and the two endpoints of each edge in E are in subset S and subset K respectively, that is, E ={e(s i , k j )|s i ∈S, k j ∈K}, where e(s i , k j ) indicates that there is a membership relationship between sequence si and k-mersk j .
进一步地,步骤3中序列重要性的确定方法为:Further, the method for determining sequence importance in step 3 is:
步骤3.1:计算边的权重。当两序列vi和vj存在一个共同k-mers,则认为vi和vj为相邻结点,且存在一条边相连接,边的权重wji为两序列存在的k-mers共同出现频度的数量,即:对于任意两条序列vi和vj,若它们存在共同的k-mers,可以用wji表示结点间无向的相互作用。Step 3.1: Calculate the weights of the edges. When two sequences vi and v j have a common k -mers, it is considered that vi and v j are adjacent nodes, and there is an edge connected, and the weight of the edge w ji is that the k-mers existing in the two sequences co - occur The number of frequencies, that is: for any two sequences vi and v j , if they have a common k -mers, w ji can be used to represent the undirected interaction between nodes.
步骤3.2:计算结点的权重。Step 3.2: Calculate the weight of the node.
对于任意两个结点vi和vj,结点vi是通过连接它们的边wji向结点vj传递作用的,边权重的大小决定了vi对vj的作用大小。当结点vj与多个结点存在边的关系,即结点vj具有多个相邻结点,此时结点vj的权重为结点vj接收到来自其它结点的作用之和。For any two nodes vi and v j , the node v i transmits the effect to the node v j through the edge w ji connecting them, and the size of the edge weight determines the effect of v i on v j . When the node v j has an edge relationship with multiple nodes, that is, the node v j has multiple adjacent nodes, the weight of the node v j at this time is the effect that the node v j receives from other nodes. and.
步骤3.3:迭代计算每个结点vi的权重,可得到结点vi的重要性。Step 3.3: Iteratively calculate the weight of each node v i to obtain the importance of the node v i .
进一步地,步骤3.1中相邻结点vi和vj边的权重为wji,可以通过以下公式计算wji:Further, in step 3.1, the weights of the edges of adjacent nodes vi and v j are w ji , and w ji can be calculated by the following formula:
其中,kmer∈vi&kmer∈vj表示k-mers既存在于结点vi又存在于结点vj中。与分别表示当前k-mers在结点vi与结点vj中的出现频度。由于结点间的相互作用是无向的,所以有wji=wij。Among them, kmer∈vi & kmer∈v j means that k-mers exist in both node vi and node v j . and Represent the frequency of the current k-mers in node v i and node v j , respectively. Since the interaction between nodes is undirected, there is w ji =w ij .
进一步地,步骤3.2中wj.表示结点vj接收到来自其它结点的作用,通过以下公式计算wj.:Further, w j. in step 3.2 indicates that the node v j receives the action from other nodes, and w j. is calculated by the following formula:
其中,wj.表示结点集合V中每个结点对结点vj的贡献程度。Among them, w j. represents the contribution of each node in the node set V to the node v j .
进一步地,步骤3.3中每个结点vi的重要性为WS(vi),WS(vi)对应的SeqRank计算公式如下:Further, the importance of each node v i in step 3.3 is WS(vi ) , and the SeqRank calculation formula corresponding to WS(vi ) is as follows:
其中,d为阻尼系数(0≤d≤1),表示在任意时刻,从一个结点游走至另一个结点的概率,即每个结点都有(1-d)的概率随机游走到其它结点。vj∈e(vi,vj)表示结点vi与结点vj存在共同的边;在vk∈e(vj,vk)中,vk是与结点vj存在共同的边的结点。wij(或wji)表示连接结点vi与结点vj的边的权重,即结点vi与结点vj存在的k-mers共同出现频度之和。分母表示vk∈e(vj,vk)时结点vj指向结点vk的边的权重的加权和。WS(vj)为上一次迭代后结点vj的重要性。Among them, d is the damping coefficient (0≤d≤1), which represents the probability of walking from one node to another at any time, that is, each node has a probability of (1-d) random walk to other nodes. v j ∈ e(v i , v j ) indicates that node vi and node v j have a common edge; in v k ∈ e(v j , v k ), v k is a common edge with node v j the edge nodes. w ij (or w ji ) represents the weight of the edge connecting the node v i and the node v j , that is, the sum of the common occurrence frequencies of the k-mers existing at the node v i and the node v j . denominator Represents the weighted sum of the weights of the edges of node v j pointing to node v k when v k ∈ e(v j , v k ). WS(v j ) is the importance of node v j after the last iteration.
由于计算结点权重时又需要用到结点本身的权重,因此需要进行迭代计算。若用WS(vi)t表示结点vi经过t次迭代后的重要性,则公式(3)可表示为:Since the weight of the node itself needs to be used when calculating the weight of the node, iterative calculation is required. If WS( vi ) t is used to represent the importance of node vi after t iterations, then formula (3) can be expressed as:
SeqRank算法对图模型进行迭代计算直至满足收敛条件。The SeqRank algorithm iteratively calculates the graph model until the convergence conditions are met.
进一步地,步骤3.3中d满足0≤d≤1。Further, in step 3.3, d satisfies 0≤d≤1.
进一步地,d的取值为0.85。Further, the value of d is 0.85.
进一步地,步骤5中中心序列的确定方法为:Further, the determination method of the center sequence in step 5 is:
步骤5.1:用K-means算法对候选序列进行聚类,K-means中心数为k,特征为候选序列的k-mers频度;Step 5.1: Use the K-means algorithm to cluster the candidate sequences, the number of K-means centers is k, and the feature is the k-mers frequency of the candidate sequence;
步骤5.2:对每个簇,筛选出与当前中心最接近的点作为序列中心。Step 5.2: For each cluster, filter out the point closest to the current center as the sequence center.
进一步地,步骤6中序列聚类的确定方法为:Further, the determination method of sequence clustering in step 6 is:
步骤6.1:将k条中心序列,分别标记为μ1,μ2,...,μk;Step 6.1: Mark the k center sequences as μ 1 , μ 2 , ..., μ k respectively ;
步骤6.2:对序列集合S中的每条序列si,使用以下公式计算其预测类别:Step 6.2: For each sequence si in the sequence set S, use the following formula to calculate its predicted category:
其中,prei表示序列si距离k个簇中最近的那一类,即第i条序列的预测类别,||wi-μj||2表示对序列si的重要性数值wi,计算其与每个中心点的欧氏距离,表示将距离wi最接近的中心点确定为第i条序列的预测类别。由此可得m条序列所对应的预测类别。Among them, pre i represents the closest category of the sequence s i to the k clusters, that is, the predicted category of the i-th sequence, ||w i -μ j || 2 represents the importance value w i of the sequence s i , Calculate its Euclidean distance from each center point, Indicates that the center point closest to wi is determined as the prediction category of the i-th sequence. From this, the prediction categories corresponding to the m sequences can be obtained.
本发明采用以上技术方案,通过构造二部图模型,对生物序列进行聚类分析,试图在聚类分析的层次上从生物序列数据中得到深层信息含义及可靠的结论,有效地解决现有技术中存在的计算结点权重复杂度高、结点重要性的代表性不够以及结点重要性受序列长度影响较大等问题。The present invention adopts the above technical scheme, and by constructing a bipartite graph model, performs cluster analysis on biological sequences, attempts to obtain deep information meaning and reliable conclusions from biological sequence data at the level of cluster analysis, and effectively solves the problem of the prior art. There are problems such as high complexity of calculating node weights, insufficient representation of node importance, and node importance being greatly affected by sequence length.
附图说明Description of drawings
以下结合附图和具体实施方式对本发明做进一步详细说明;The present invention will be described in further detail below in conjunction with the accompanying drawings and specific embodiments;
图1为本发明的一种基于k-mer组群分割的生物序列聚类方法的流程示意图;1 is a schematic flowchart of a biological sequence clustering method based on k-mer group segmentation of the present invention;
图2为本发明对k-mers进行随机均匀地分组的g1组示意图;Fig. 2 is the g1 group schematic diagram that the present invention carries out random and uniform grouping to k-mers;
图3为本发明对k-mers进行随机均匀地分组的g2组示意图;Fig. 3 is the g2 group schematic diagram that the present invention carries out random and uniform grouping to k-mers;
图4为本发明XYZ三条序列的二部图示意图;4 is a bipartite schematic diagram of three sequences of XYZ of the present invention;
图5为本发明对序列进行K-means聚类的结果示意图。FIG. 5 is a schematic diagram of the result of performing K-means clustering on sequences according to the present invention.
具体实施方式Detailed ways
如图1-5之一所示,本发明公开了一种基于k-mer组群分割的生物序列聚类方法。本发明根据不同滑动窗口大小下序列与k-mers间的关系,构造二部图,并对k-mers进行分组,计算不同组下序列的重要性,并找出序列中心,对其聚类,有效地解决了现有技术中存在的计算结点权重复杂度高、结点重要性的代表性不够以及结点重要性受序列长度影响较大等问题。As shown in one of Figures 1-5, the present invention discloses a biological sequence clustering method based on k-mer group segmentation. According to the relationship between sequences and k-mers under different sliding window sizes, the invention constructs a bipartite graph, groups k-mers, calculates the importance of sequences under different groups, finds out sequence centers, and clusters them. It effectively solves the problems existing in the prior art, such as high complexity of calculating node weights, insufficient representation of node importance, and large influence of sequence length on node importance.
如图1所示,本发明公开基于k-mer组群分割的聚类算法,其包括以下步骤:As shown in Figure 1, the present invention discloses a clustering algorithm based on k-mer group segmentation, which comprises the following steps:
步骤1:按照滑动窗口大小L,对给定的序列集合S={si|i=1,2,...,m}进行分割,分割后的k-mers集合为K={kj|j=1,2,...,n},m和n分别为序列集合S和k-mers集合的大小,统计k-mers词频。Step 1: According to the sliding window size L, divide the given sequence set S={s i |i=1, 2, ..., m}, and the segmented k-mers set is K={k j | j=1, 2, .
具体地,在序列si中所有n-Gram长度为L的元素个数有(|si|-L+1)个。Specifically, there are (|s i |-L+1) elements of all n-Grams of length L in the sequence si .
对于每条序列si,统计每个k-mers的出现词频。序列间的k-mers集合为∑L,集合内元素个数为|∑|L,即:对于碱基种类为4的DNA序列,其可能存在的k-mers个数为4L;对于氨基酸种类为20的蛋白质序列,其可能存在的k-mers个数为20L。For each sequence s i , count the frequency of occurrences of each k-mers. The set of k-mers between sequences is ∑ L , and the number of elements in the set is |∑| L , that is: for a DNA sequence with base type 4, the number of possible k-mers is 4 L ; for amino acid types is 20 protein sequences, the number of possible k-mers is 20 L.
例如对于三条序列X=″ACAGT″、Y=″ACACG″及Z=″CACGT″,当k-mer元素的滑动窗口大小设置为2时,三条序列中所有序列间n-Gram长度为2的元素都为4,且每条序列中k-mers的出现情况如表1所示:For example, for three sequences X="ACAGT", Y="ACACG" and Z="CACGT", when the sliding window size of the k-mer element is set to 2, the elements with n-Gram length of 2 between all sequences in the three sequences are all 4, and the occurrence of k-mers in each sequence is shown in Table 1:
表1 序列间的k-mers出现情况Table 1 The occurrence of k-mers between sequences
步骤2:根据序列与k-mers间的关系,构造二部图G=(V,E)。G=(V,E)是由结点集和边集组成的一个无向图模型。其中,V为结点集,E代表结点间相互作用关系形成的边的集合,且E={e(si,kj)|si∈S,kj∈K},其中e(si,kj)表示序列si与k-merskj间存在隶属关系。Step 2: According to the relationship between the sequence and k-mers, construct a bipartite graph G=(V, E). G=(V, E) is an undirected graph model consisting of node sets and edge sets. Among them, V is the set of nodes, E represents the set of edges formed by the interaction relationship between nodes, and E={e(s i , k j )|s i ∈S, k j ∈K}, where e(s i , k j ) means that there is a affiliation between the sequence si and k-mersk j .
具体地,由表1可知,对于序列X与序列Y,两序列间存在共同k-mers:AC和CA,则认为序列X与序列Y存在一条边相连接;对于序列X与序列Z,两序列间存在共同k-mers:AC、CA和GT,则认为序列X与序列Z存在一条边相连接;对于序列Y与序列Z,两序列间存在共同k-mers:AC、CA和CG,则认为序列Y与序列Z存在一条边相连接。Specifically, it can be seen from Table 1 that for sequence X and sequence Y, there are common k-mers between the two sequences: AC and CA, then it is considered that sequence X and sequence Y are connected by an edge; for sequence X and sequence Z, the two sequences If there are common k-mers between the two sequences: AC, CA and GT, it is considered that there is an edge connecting the sequence X and the sequence Z; for the sequence Y and the sequence Z, there are common k-mers between the two sequences: AC, CA and CG, it is considered that Sequence Y and sequence Z are connected by an edge.
步骤3:计算m条序列的重要性。Step 3: Calculate the importance of m sequences.
步骤3.1:将n个k-mers随机均匀地分成t组:g1,g2,...,gt,如图2所示,序列s1与序列si存在共同k-mersk1,序列s1与序列sm存在共同k-mersk3,即序列s1与序列si、序列s1与序列sm存在相互作用关系。此时以序列、k-mers为结点,可构建图模型G=(V,E)。Step 3.1: Divide the n k-mers into t groups evenly and randomly: g 1 , g 2 , ..., gt , as shown in Figure 2, the sequence s 1 and the sequence s i have a common k-mersk 1 , the sequence There is a common k-mersk 3 between s 1 and the sequence s m , that is, the sequence s 1 and the sequence s i , and the sequence s 1 and the sequence s m have an interaction relationship. At this time, with sequences and k-mers as nodes, a graph model G=(V, E) can be constructed.
步骤3.2:计算边的权重。当两序列vi和vj存在一个共同k-mers,则认为vi和vj为相邻结点,且存在一条边相连接,边的权重wji为两序列存在的k-mers共同出现频度的数量,即:对于任意两条序列vi和vj,若它们存在共同的k-mers,可以用wji表示结点间无向的相互作用,通过以下公式计算wji:Step 3.2: Calculate the weights of the edges. When two sequences vi and v j have a common k -mers, it is considered that vi and v j are adjacent nodes, and there is an edge connected, and the weight of the edge w ji is that the k-mers existing in the two sequences co - occur The number of frequencies, that is: for any two sequences vi and v j , if they have common k -mers, w ji can be used to represent the undirected interaction between nodes, and w ji is calculated by the following formula:
具体地,对于三条序列X=″ACAGT″、Y=″ACACG″及Z=″CACGT″,当k-mer元素的滑动窗口大小设置为2时,关于三条序列的二部图如图3所示:Specifically, for the three sequences X="ACAGT", Y="ACACG" and Z="CACGT", when the sliding window size of the k-mer element is set to 2, the bipartite graph of the three sequences is shown in Figure 3 :
X、Y和Z三条序列的k-mers频数情况如表2所示。The frequencies of k-mers for the three sequences of X, Y and Z are shown in Table 2.
表2 三条序列的k-mers出现情况Table 2 The occurrence of k-mers of the three sequences
由表2可知,对于序列X与序列Y,两序列间存在共同k-mers:AC和CA。由于边的权重为两序列存在的k-mers共同出现频度的数量,表2中序列X与序列Y的关系可简化为:It can be seen from Table 2 that for sequence X and sequence Y, there are common k-mers between the two sequences: AC and CA. Since the weight of the edge is the number of co-occurrence frequencies of k-mers existing in the two sequences, the relationship between sequence X and sequence Y in Table 2 can be simplified as:
表3 X、Y两序列存在的k-mers共同出现频度Table 3 Co-occurrence frequency of k-mers in X and Y sequences
此时序列X与序列Y的相互作用wYX可由下式计算而得:At this time, the interaction w YX between the sequence X and the sequence Y can be calculated by the following formula:
wYX=min(|ACX|,|ACY|)+min(|CAX|,|CAY|)=1+1=2w YX = min(|AC X |, |AC Y |)+min(|CA X |, |CA Y |)=1+1=2
步骤3.3:计算结点的权重。对于任意两个结点vi和vj,结点vi是通过连接它们的边wji向结点vj传递作用的,边权重的大小决定了vi对vj的作用大小。当结点vj与多个结点存在边的关系,即结点vj具有多个相邻结点,wj.表示结点vj接收到来自其它结点的作用,通过以下公式计算wj.:Step 3.3: Calculate the weight of the node. For any two nodes vi and v j , the node v i transmits the effect to the node v j through the edge w ji connecting them, and the size of the edge weight determines the effect of v i on v j . When there is an edge relationship between node v j and multiple nodes, that is, node v j has multiple adjacent nodes, w j. indicates that node v j receives action from other nodes, and w is calculated by the following formula j .:
对于表3,通过计算X、Y和Z三条序列存在的k-mers共同出现频度,可得到如下关系矩阵M:For Table 3, by calculating the co-occurrence frequency of k-mers existing in the three sequences X, Y and Z, the following relationship matrix M can be obtained:
表4 X、Y和Z三条序列存在的k-mers共同出现频度Table 4 Co-occurrence frequency of k-mers in the three sequences of X, Y and Z
在表4中,矩阵M的大小为|V|*|V|,其中|V|表示结点的数量。矩阵中的数值表示两序列间的相互作用,如M[1,3]表示序列X与序列Z的相互作用为3。In Table 4, the size of the matrix M is |V|*|V|, where |V| represents the number of nodes. The values in the matrix represent the interaction between the two sequences, for example, M[1, 3] means that the interaction between sequence X and sequence Z is 3.
迭代计算每个结点vi的权重,可得到结点vi的重要性WS(vi),WS(vi)对应的SeqRank计算公式为:By iteratively calculating the weight of each node v i , the importance WS(vi ) of the node v i can be obtained, and the SeqRank calculation formula corresponding to WS(vi ) is:
其中,d为阻尼系数(0≤d≤1),表示在任意时刻,从一个结点游走至另一个结点的概率,即每个结点都有(1-d)的概率随机游走到其它结点。vj∈e(vi,vj)表示结点vi与结点vj存在共同的边;在vk∈e(vj,vk)中,vk是与结点vj存在共同的边的结点。wij(或wji)表示连接结点vi与结点vj的边的权重,即结点vi与结点vj存在的k-mers共同出现频度之和。分母表示vk∈e(vj,vk)时结点vj指向结点vk的边的权重的加权和。WS(vj)为上一次迭代后结点vj的重要性。Among them, d is the damping coefficient (0≤d≤1), which represents the probability of walking from one node to another at any time, that is, each node has a probability of (1-d) random walk to other nodes. v j ∈ e(v i , v j ) indicates that node vi and node v j have a common edge; in v k ∈ e(v j , v k ), v k is a common edge with node v j the edge nodes. w ij (or w ji ) represents the weight of the edge connecting the node v i and the node v j , that is, the sum of the common occurrence frequencies of the k-mers existing at the node v i and the node v j . denominator Represents the weighted sum of the weights of the edges of node v j pointing to node v k when v k ∈ e(v j , v k ). WS(v j ) is the importance of node v j after the last iteration.
由于计算结点权重时又需要用到结点本身的权重,因此需要进行迭代计算。若用WS(vi)t表示结点vi经过t次迭代后的重要性,则公式(3)可表示为:Since the weight of the node itself needs to be used when calculating the weight of the node, iterative calculation is required. If WS( vi ) t is used to represent the importance of node vi after t iterations, then formula (3) can be expressed as:
步骤4:对序列的重要性进行逆序排序。用t×m维的矩阵I表示m条序列在不同组k-mers下的重要性,矩阵I如表5所示。Step 4: Sort the sequences in reverse order of importance. The importance of m sequences under different groups of k-mers is represented by a t×m-dimensional matrix I, which is shown in Table 5.
表5 m条序列在t组下的重要性Table 5 Importance of m sequences under t group
具体地,在矩阵I中,矩阵的行表示t组k-mers,矩阵的列表示m条序列,而矩阵中I[p,q]表示在第p(1≤p≤t)组k-mers下,第q(1≤q≤m)条序列的重要性数值,如I[1,]表示在组g1下m条序列的重要性,I[,q]表示在所有组下计算出的第q条序列的重要性。Specifically, in matrix I, the rows of the matrix represent t groups of k-mers, the columns of the matrix represent m sequences, and I[p, q] in the matrix represents the k-mers in the pth (1≤p≤t) group Below, the importance value of the q (1≤q≤m)-th sequence, such as I[1, ] represents the importance of m sequences under group g 1 , I[, q] represents the calculated value under all groups The importance of the qth sequence.
值得注意的是,不同组别下同一条序列会呈现出不同的重要性,如矩阵I中的序列s1,在组g1下的重要性为0.9696823,而在组g2下的重要性为1.040769;反之,不同组别下进行重要性计算所得出的最大值可以是不同的序列,如组g1下最重要的序列为sm,组gp下最重要的序列为s1。It is worth noting that the same sequence under different groups will show different importance, such as the sequence s 1 in matrix I, the importance under group g 1 is 0.9696823, and the importance under group g 2 is 1.040769; on the contrary, the maximum value obtained by the importance calculation under different groups can be different sequences, for example, the most important sequence under group g 1 is s m , and the most important sequence under group g p is s 1 .
以组为单位,对序列的重要性进行逆序排序,得到的结果形式如表6所示。Taking the group as a unit, the importance of the sequence is sorted in reverse order, and the result is shown in Table 6.
表6 对重要性进行逆序排序Table 6 sorts the importance in reverse order
与表5不同的是,在表6中,矩阵I中的数值表示经过逆序排序后的序列号,如I[2,1]=7表示的是:在组g2下计算出来的最重要的是7序列;I[p,m]=7表示的是:7序列在组gp下被认为是最不重要的序列。不同组k-mers下计算出来的最重要的序列号是不一样的。The difference from Table 5 is that in Table 6, the values in the matrix I represent the sequence numbers sorted in reverse order. For example, I[2, 1]=7 represents: the most important value calculated under the group g 2 is a 7-sequence; I[ p ,m]=7 means that the 7-sequence is considered to be the least important sequence under group gp. The most important sequence numbers calculated under different groups of k-mers are different.
假设k(k≤m)为m条序列的中心数,以组g1下的序列号SR1为准,按照均匀的间隔,从t组中各筛选出k条序列,如表7阴影部分所示。Assuming that k (k≤m) is the number of centers of m sequences, the sequence number SR 1 under group g 1 shall prevail, and k sequences are screened from each of the t groups according to uniform intervals, as shown in the shaded part of Table 7. Show.
表7 对重要性进行筛选Table 7 Screening for Importance
对于每组被筛选出来的k个序列号,将其所在序列作为候选序列,该候选序列集合大小为t*k。在t*k条序列中,可能会存在重复序列,如组g1中序列5和7也在组gt的候选序列集合中;组gt中序列78的序列也在组g2的候选序列集合中。因此我们要对筛选出的t*k条候选序列去重,得到n(n≤t*k)条不重复的候选序列。For each group of k sequence numbers screened out, the sequence in which it is located is taken as a candidate sequence, and the size of the candidate sequence set is t*k. In t *k sequences, there may be repeated sequences. For example, sequences 5 and 7 in group g 1 are also in the candidate sequence set of group gt ; sequence 78 in group gt is also a candidate sequence in group g 2 . in the collection. Therefore, we need to deduplicate the selected t*k candidate sequences to obtain n (n≤t*k) non-repeated candidate sequences.
步骤5:用K-means算法对候选序列进行聚类。K-means中心数为k,特征为候选序列的k-mers频度。滑动窗口大小为L时,对于DNA序列,序列间可能出现的k-mers集合为∑L。假设L=2,则有如下矩阵O:Step 5: Cluster the candidate sequences with K-means algorithm. The number of K-means centers is k, and the feature is the k-mers frequency of the candidate sequence. When the sliding window size is L, for DNA sequences, the set of k-mers that may appear between sequences is ∑ L . Assuming L = 2, there is the following matrix O:
表8 k-mers在候选序列中的出现频度Table 8 Frequency of occurrence of k-mers in candidate sequences
矩阵O的大小为n*|∑|L,其中n为候选序列数,∑L为当前滑动窗口大小为L时序列对应的k-mers集合。当L=2时,对于DNA序列,有∑L={AA,AC,AG,...,TT}。在矩阵O中,0[si,]表示在第i条序列中各个k-mers的出现频度,即:140,122,200,...,101。The size of the matrix O is n*|∑| L , where n is the number of candidate sequences, and ∑ L is the set of k-mers corresponding to the sequence when the current sliding window size is L. When L=2, for a DNA sequence, Σ L = {AA, AC, AG, . . . , TT}. In matrix O, 0[s i , ] represents the frequency of occurrence of each k-mers in the i-th sequence, namely: 140, 122, 200, ..., 101.
K-means得到的结果如图4所示。在图4中,n条候选序列被聚成k类:Cluster1,Cluster2,...,Clusterk。对每个Cluster,筛选出与当前质心最接近的点作为序列中心,即对于Cluster2,经过某种距离方式度量,此时与Cluster2最为接近的是点A,我们认为点A所在的序列为Cluster2的中心序列。以此类推,可得到k条中心序列。The results obtained by K-means are shown in Figure 4. In Figure 4, n candidate sequences are clustered into k categories: Cluster1, Cluster2, ..., Clusterk. For each cluster, the point closest to the current centroid is selected as the sequence center, that is, for Cluster2, after a certain distance measurement, the closest point to Cluster2 is point A, and we think that the sequence where point A is located is Cluster2. Center sequence. By analogy, k central sequences can be obtained.
步骤6:对m条序列S={si|i=1,2,...,m}进行聚类。Step 6: Perform clustering on m sequences S={s i |i=1, 2, . . . , m}.
具体地,当滑动窗口大小为L时,m条序列下的k-mers集合为K={kj|j=1,2,...,|∑|L},此时构建的k-mers频度矩阵O的大小为m*|∑|L。Specifically, when the size of the sliding window is L, the set of k-mers under m sequences is K={k j |j=1, 2, ..., |∑| L }, the k-mers constructed at this time The size of the frequency matrix O is m*|∑| L .
对于k条中心,特征为代表m条序列中k-mers的频度矩阵O,K-means算法的聚类过程描述具体为:For k centers, the feature is the frequency matrix O representing k-mers in m sequences. The description of the clustering process of the K-means algorithm is as follows:
(1)对筛选出的k条中心,分别标记为μ1,μ2,...,μk;(1) Mark the selected k centers as μ 1 , μ 2 , . . . , μ k respectively ;
(2)对矩阵O中每一个点wi,j,使用公式(5)计算其所属的类别:(2) For each point wi ,j in matrix O, use formula (5) to calculate the category to which it belongs:
在公式(3)中,prei表示序列si距离k个簇中最近的那一类,即第i条序列的预测类别,||wi-μj||2表示对序列si的重要性数值wi,计算其与每个中心点的欧氏距离,表示将距离wi最接近的中心点确定为第i条序列的预测类别。由此可得m条序列所对应的预测类别。In formula (3), pre i represents the closest category of the sequence s i to the k clusters, that is, the predicted category of the i-th sequence, and ||w i -μ j || 2 represents the importance of the sequence s i property value w i , calculate its Euclidean distance from each center point, Indicates that the center point closest to wi is determined as the prediction category of the i-th sequence. From this, the prediction categories corresponding to the m sequences can be obtained.
本发明采用以上技术方案,通过构造二部图模型,对生物序列进行聚类分析,试图在聚类分析的层次上从生物序列数据中得到深层信息含义及可靠的结论,有效地解决现有技术中存在的计算结点权重复杂度高、结点重要性的代表性不够以及结点重要性受序列长度影响较大等问题。The present invention adopts the above technical scheme, and by constructing a bipartite graph model, performs cluster analysis on biological sequences, attempts to obtain deep information meaning and reliable conclusions from biological sequence data at the level of cluster analysis, and effectively solves the problem of the prior art. There are problems such as high complexity of calculating node weights, insufficient representation of node importance, and node importance being greatly affected by sequence length.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271872.4A CN110060735B (en) | 2019-04-04 | 2019-04-04 | Biological sequence clustering method based on k-mer group segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271872.4A CN110060735B (en) | 2019-04-04 | 2019-04-04 | Biological sequence clustering method based on k-mer group segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110060735A true CN110060735A (en) | 2019-07-26 |
CN110060735B CN110060735B (en) | 2023-02-10 |
Family
ID=67318406
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271872.4A Expired - Fee Related CN110060735B (en) | 2019-04-04 | 2019-04-04 | Biological sequence clustering method based on k-mer group segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110060735B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851655A (en) * | 2019-11-07 | 2020-02-28 | 中国银联股份有限公司 | A method and system for simplifying complex networks |
CN114822699A (en) * | 2022-04-07 | 2022-07-29 | 天津大学四川创新研究院 | Clustering algorithm-based high-performance k-mer frequency counting method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20030032345A (en) * | 2001-10-17 | 2003-04-26 | 한국과학기술원 | The Devices and Methods for Hyper-rectangle Based Multidimensional Data Segmentation and Clustering |
CN107480471A (en) * | 2017-07-19 | 2017-12-15 | 福建师范大学 | The method for the sequence similarity analysis being characterized based on wavelet transformation |
CN109326327A (en) * | 2018-08-28 | 2019-02-12 | 福建师范大学 | A Sequence Clustering Method Based on SeqRank Graph Algorithm |
-
2019
- 2019-04-04 CN CN201910271872.4A patent/CN110060735B/en not_active Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20030032345A (en) * | 2001-10-17 | 2003-04-26 | 한국과학기술원 | The Devices and Methods for Hyper-rectangle Based Multidimensional Data Segmentation and Clustering |
CN107480471A (en) * | 2017-07-19 | 2017-12-15 | 福建师范大学 | The method for the sequence similarity analysis being characterized based on wavelet transformation |
CN109326327A (en) * | 2018-08-28 | 2019-02-12 | 福建师范大学 | A Sequence Clustering Method Based on SeqRank Graph Algorithm |
Non-Patent Citations (1)
Title |
---|
徐彭娜等: "基于位置信息熵的局部敏感哈希聚类方法", 《计算机应用与软件》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851655A (en) * | 2019-11-07 | 2020-02-28 | 中国银联股份有限公司 | A method and system for simplifying complex networks |
CN110851655B (en) * | 2019-11-07 | 2024-05-17 | 中国银联股份有限公司 | A method and system for simplifying complex networks |
CN114822699A (en) * | 2022-04-07 | 2022-07-29 | 天津大学四川创新研究院 | Clustering algorithm-based high-performance k-mer frequency counting method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110060735B (en) | 2023-02-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862179A (en) | A kind of miRNA disease association Relationship Prediction methods decomposed based on similitude and logic matrix | |
CN103745258B (en) | Complex network community mining method based on the genetic algorithm of minimum spanning tree cluster | |
CN112259167B (en) | Pathogen analysis method, device and computer equipment based on high-throughput sequencing | |
Rasheed et al. | Metagenomic taxonomic classification using extreme learning machines | |
CN102722578B (en) | Unsupervised cluster characteristic selection method based on Laplace regularization | |
CN105574547A (en) | Integrated learning method and device adapted to weight of dynamically adjustable base classifier | |
CN109492682A (en) | A kind of multi-branched random forest data classification method | |
Filkov et al. | Heterogeneous data integration with the consensus clustering formalism | |
CN106228035A (en) | Based on local sensitivity Hash and the efficient clustering method of imparametrization bayes method | |
Celik et al. | Biological cartography: Building and benchmarking representations of life | |
CN103793600B (en) | Classifier model generating method for gene microarray data | |
CN110060735B (en) | Biological sequence clustering method based on k-mer group segmentation | |
Sarmah et al. | An effective technique for clustering incremental gene expression data | |
CN115512772B (en) | A high-precision single-cell clustering method and system based on marker genes and ensemble learning | |
CN109376790A (en) | A binary classification method based on seepage analysis | |
Saha et al. | Improvement of new automatic differential fuzzy clustering using SVM classifier for microarray analysis | |
CN105139037B (en) | Integrated multi-target evolution automatic clustering method based on minimum spanning tree | |
CN108154189A (en) | Grey relational cluster method based on LDTW distances | |
CN109326327B (en) | Biological sequence clustering method based on SeqRank graph algorithm | |
CN112071362A (en) | A method for detection of protein complexes that fuse global and local topologies | |
Dinu et al. | Clustering based on Rank Distance with Applications on DNA | |
Zhou et al. | A new method for classification in DNA sequence | |
Misra et al. | Finding optimum width of discretization for gene expressions using functional annotations | |
Priscilla et al. | A semi-supervised hierarchical approach: Two-dimensional clustering of microarray gene expression data | |
Hossen et al. | Identification of robust clustering methods in gene expression data analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20230210 |