CN106841012A - Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model - Google Patents
Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model Download PDFInfo
- Publication number
- CN106841012A CN106841012A CN201710007719.1A CN201710007719A CN106841012A CN 106841012 A CN106841012 A CN 106841012A CN 201710007719 A CN201710007719 A CN 201710007719A CN 106841012 A CN106841012 A CN 106841012A
- Authority
- CN
- China
- Prior art keywords
- node
- graph
- data
- edges
- community
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N15/00—Investigating characteristics of particles; Investigating permeability, pore-volume or surface-area of porous materials
- G01N15/10—Investigating individual particles
- G01N15/14—Optical investigation techniques, e.g. flow cytometry
- G01N15/1425—Optical investigation techniques, e.g. flow cytometry using an analyser being characterised by its control arrangement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2323—Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
Landscapes
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Biochemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Immunology (AREA)
- Pathology (AREA)
- Discrete Mathematics (AREA)
- Analytical Chemistry (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Dispersion Chemistry (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
本发明公开了一种基于分布式图模型的流式细胞计数据自动门控方法,将分布式技术应用于流式细胞计数据的门控计算;对于图模型所有按元素操作的执行并行化计算,提高程序执行性能;基于随机投影树的kNN搜索策略,降低构造图的时间到线性复杂度;Spark的分布式存储方式可以处理更大规模的数据集。利用随机投影树加快了kNN的搜索效率;对于边权值的操作进行了并行化;割图算法采用了并行化的社群划分算法。本发明通过分布式计算框架实现对流式细胞计数据的门控过程进行基于图模型的聚类分析,对原始数据进行自动的划分,从而提升数据分析效率和准确度,降低划分过程的重复劳动和人为主观因素。
The invention discloses an automatic gating method for flow cytometer data based on a distributed graph model, which applies distributed technology to the gating calculation of flow cytometer data; and executes parallel calculations for all element-based operations of the graph model , to improve program execution performance; the kNN search strategy based on random projection trees reduces the time to linear complexity of constructing graphs; Spark's distributed storage method can handle larger-scale data sets. The random projection tree is used to speed up the search efficiency of kNN; the operation of edge weights is parallelized; the cut graph algorithm adopts a parallelized community partition algorithm. The present invention realizes the clustering analysis based on the graphical model for the gating process of the flow cytometer data through the distributed computing framework, and automatically divides the original data, thereby improving the efficiency and accuracy of data analysis, and reducing the duplication of labor and labor involved in the division process. Human subjective factors.
Description
技术领域technical field
本发明涉及流式细胞计技术,尤其涉及一种基于分布式图模型的流式细胞计数据自动门控方法。The invention relates to flow cytometer technology, in particular to an automatic gating method for flow cytometer data based on a distributed graph model.
背景技术Background technique
流式细胞计技术是目前较为先进的单细胞测量技术。是对悬液中的单细胞或其他生物粒子,通过检测标记的荧光信号,实现高速、逐一的细胞定量分析和分选的技术。质谱流式细胞计技术是在流式细胞计基础上发展出的新一代单细胞测量技术。利用稀土元素同位素标记特定抗原,控制抗原和单细胞表面的抗体特异性结合,接着气化细胞,控制其等离子体进入质谱,质谱对于同位数的技术结果代表着该细胞上被标记抗原的表达程度。而抗原表达程度和细胞的功能、种类密切相关,这个尺度的测量结果为免疫系统的研究,包括细胞的筛选、分型等提供了巨大的帮助。Flow cytometry is a relatively advanced single-cell measurement technology at present. It is a technology for high-speed, one-by-one quantitative analysis and sorting of single cells or other biological particles in suspension by detecting the labeled fluorescent signal. Mass cytometry technology is a new generation of single cell measurement technology developed on the basis of flow cytometry. Use rare earth element isotopes to label specific antigens, control the specific binding of antigens and antibodies on the surface of single cells, then gasify the cells, control their plasma into the mass spectrometer, and the technical results of mass spectrometry for the isotope represent the expression level of the labeled antigen on the cell . The degree of antigen expression is closely related to the function and type of cells. The measurement results of this scale provide great help for the research of the immune system, including cell screening and typing.
大部分研究的基础是依靠这些标志物,逐步筛选出感兴趣的细胞亚群进而进行研究,这个过程称为门控。传统的门控方法是利用已有的领域知识,依靠具有关键信息的标记物的二维图,按照一定的层次,逐步从原始数据中手动分离出感兴趣的细胞亚群。几个商业化软件也是在这个基础上为使用者提供速度更快、交互性更强的操作:FCS Express、FlowJo、CytoBank等。随着单细胞数据维度的升高,门控变得越来越不可行。The basis of most studies is to rely on these markers to gradually screen out interested cell subpopulations for further research. This process is called gating. The traditional gating method is to use the existing domain knowledge, rely on the two-dimensional map of the markers with key information, and gradually manually separate the cell subpopulations of interest from the raw data according to a certain level. Several commercial software also provide users with faster and more interactive operations on this basis: FCS Express, FlowJo, CytoBank, etc. Gating becomes less feasible as the dimensionality of single-cell data increases.
学术界引入计算方法,着重在用非监督性学习——聚类来替代对数据集的人为的门控工作,大致方向包括基于密度,基于几何等。其中Cytobank的提供的聚类算法中只有比较基础的k-means和层次聚类等。而相较于这些方法,基于图的方法由于不需要对数据有着预先的形态假设,输入参数在较大的范围内都输出稳定的结果,对于稀少亚群具有识别和保留能力,能还可以衔接进一步的半监督学习和迁移数据挖掘模型,成为较好的选择。The academic community introduces computational methods, focusing on using unsupervised learning-clustering to replace artificial gating work on data sets, and the general directions include density-based, geometry-based, etc. Among the clustering algorithms provided by Cytobank, there are only relatively basic k-means and hierarchical clustering. Compared with these methods, the graph-based method does not need to have a prior morphological assumption on the data, and the input parameters can output stable results in a large range. It has the ability to identify and retain rare subgroups, and can also be connected. Further semi-supervised learning and migration data mining models become better choices.
传统的人工门控都是采样一个二维散点图划分策略,费时费力,人为主观因素大,不具可重复性,对于高维度无能为力,复杂度为O(m^2),m为数据点的维数。而这种方法,对于定义不好的组织(比如肝、脾)也没有金标准。Traditional manual gating is to sample a two-dimensional scatter diagram division strategy, which is time-consuming and labor-intensive, with large human subjective factors and no repeatability. It is powerless for high dimensions, and the complexity is O(m^2), where m is the number of data points. dimension. And this method has no gold standard for poorly defined tissues (such as liver and spleen).
传统的聚类算法没有办法处理好模型假设和数据分布之间的一致性。比如k‐means是一个几何空间的多边形划分,很容易将一个多边形空间内明显分割的两块数据区域合并成一类;DBSCAN等依靠密度的聚类算法需要预估计密度核宽,因而对于尺寸小于这个核宽的区域就存在不适应性等。相较于这些方法,基于图的方法则很大程度上同时保持了全局和局部的数据特征。Traditional clustering algorithms have no way to deal with the consistency between model assumptions and data distribution. For example, k-means is a polygonal division of a geometric space. It is easy to merge two data areas that are clearly divided into one category in a polygonal space; clustering algorithms that rely on density such as DBSCAN need to estimate the density kernel width in advance, so for sizes smaller than this In the area of nuclear width, there is incompatibility and so on. Compared with these methods, graph-based methods largely preserve both global and local data characteristics.
常规的图算法计算过程往往会需要全局遍历节点和边,面对大规模的实验数据时会面临很大的时间和内存消耗。比如第一步通常需要构造kNN图,则需要遍历计算所有数据点之间两两距离,复杂度为O(n^2),n为数据点个数,同时预存边的权值。其他操作中,比如权值更新,边的增删,也存在许多按元素的操作方法,按照单线程的方式进行效率低下。当数据点到达百万数量级以后,常规的单机配置已经让内存消耗和计算时间变得难以忍受,这会限制通量越来越高的实验数据处理。The calculation process of conventional graph algorithms often requires global traversal of nodes and edges, which will consume a lot of time and memory when faced with large-scale experimental data. For example, the first step usually needs to construct a kNN graph, and it is necessary to traverse and calculate the pairwise distance between all data points, the complexity is O(n^2), n is the number of data points, and the weight of the edge is pre-stored at the same time. In other operations, such as weight update, addition and deletion of edges, there are also many element-by-element operation methods, which are inefficient in a single-threaded manner. When the data points reach the order of millions, the conventional stand-alone configuration has made the memory consumption and computing time unbearable, which will limit the experimental data processing with higher and higher throughput.
发明内容Contents of the invention
本发明的目的在于针对现有技术的不足,提供一种基于分布式图模型的流式细胞计数据自动门控方法。The object of the present invention is to provide an automatic gating method for flow cytometry data based on a distributed graph model to address the deficiencies in the prior art.
本发明的目的是通过以下技术方案来实现的:一种基于分布式图模型的流式细胞计数据自动门控方法,包括以下步骤:The object of the present invention is achieved by the following technical solutions: a method for automatic gating of flow cytometry data based on a distributed graph model, comprising the following steps:
(1)根据GraphLab框架将输入的细胞计数据文件变换为RDD类型,得到流式细胞计数据X={x1,x2,...,xN};(1) Transform the input cytometer data file into an RDD type according to the GraphLab framework, and obtain the flow cytometer data X={x 1 ,x 2 ,...,x N };
(2)输入流式细胞计数据X={x1,x2,...,xN}、邻域大小k、随机重复的次数r;N为数据点个数。输出C={c1,c2,...,cN},代表每个数据点被分配到的标签,即门控后的结果;(2) Input the flow cytometry data X={x 1 ,x 2 ,...,x N }, the size of the neighborhood k, and the number of random repetitions r; N is the number of data points. Output C={c 1 ,c 2 ,...,c N }, which represents the label assigned to each data point, that is, the result after gating;
(3)基于随机投影树和GraphLab框架分布式构造kNN图:(3) Distributed construction of kNN graph based on random projection tree and GraphLab framework:
在Gather阶段,每个线程上随机生成若干超平面v,将每个数据点视为节点,本线程上的节点x选择满足最远点距离要求x·v≤median({z·v:z∈S}+δ)的节点自动成为邻居,在Scatter阶段节点x与邻居节点之间的边的权值为1,与非邻居节点之间的边的权值为0,当发现每个节点邻居个数达到k时,自动跳出,从而构建得到kNN图;S为节点的集合,δ为任意常数,median()表示取中值。为了达到需要的精度,反复执行2‐3次,取图的交集以保证准确度,得到最终的kNN图;In the Gather stage, several hyperplanes v are randomly generated on each thread, and each data point is regarded as a node. The node x on this thread is selected to meet the distance requirement of the farthest point x·v≤median({z·v:z∈ S}+δ) nodes automatically become neighbors. In the Scatter stage, the weight of the edge between node x and the neighbor node is 1, and the weight of the edge between the non-neighbor node is 0. When each node is found to have neighbors When the number reaches k, it will automatically jump out to construct a kNN graph; S is the set of nodes, δ is an arbitrary constant, and median() means to take the median value. In order to achieve the required accuracy, it is repeated 2-3 times, and the intersection of the graphs is taken to ensure the accuracy, and the final kNN graph is obtained;
(4)通过GraphLab框架的Gather阶段遍历每一个节点x的所有边;通过GraphLab框架的Scatter阶段,根据所有的和该节点连接的节点Vk(x)计算Jaccard系数Jk(xi,xj),将kNN图变成带权值图;(4) Through the Gather stage of the GraphLab framework, traverse all the edges of each node x; through the Scatter stage of the GraphLab framework, calculate the Jaccard coefficient J k ( xi , x j ) according to all nodes V k (x) connected to the node ), turn the kNN graph into a graph with weights;
(5)对得到的带权值图执行图割算法,得到一次划分C={c1,c2,...,cN}和目标函数Q值,Q值的计算如(2):(5) Execute the graph cut algorithm on the obtained weighted graph to obtain a division C={c 1 ,c 2 ,...,c N } and the objective function Q value. The calculation of the Q value is as in (2):
其中,J代表整幅带权值图的权值,δ(ci,cj)表示delta函数,当输入相等取1,否则取0。deg(vi)代表节点i的度,本模型中没有负权值,m表示带权值图中边的权值之和的1/2。Among them, J represents the weight of the entire weighted graph, and δ( ci,c j ) represents the delta function. When the input is equal, it takes 1, otherwise it takes 0. deg(v i ) represents the degree of node i, there is no negative weight in this model, and m represents 1/2 of the sum of the weights of the edges in the weighted graph.
(6)重复r次步骤4,选择r次中最大Q值对应的C作为门控后的结果。(6) Repeat step 4 r times, and select C corresponding to the largest Q value in r times as the result after gating.
进一步,所述步骤5包括以下子步骤:Further, said step 5 includes the following sub-steps:
a)初始化每一个节点自成一个社群;a) Initialize each node to form a community;
b)并行遍历节点,对于节点i的邻域Vk(x)的每一个节点j∈Vk(x),计算如果将i归入j所属的社群cj中Q值的增加值ΔQ,寻找ΔQ最大的归入方式。在计算ΔQ的过程中通过Gather阶段,读取边权值和边所对应的节点,在Apply阶段将每条边的权值加到对应节点的属性上,计算出每个节点的度,并给予节点一个临时标签代表其所属的社群。根据每个节点的度和临时标签计算得到ΔQ。b) Traversing nodes in parallel, for each node j∈V k (x) in the neighborhood V k (x) of node i, calculate the increment value ΔQ of Q value if i is classified into the community c j to which j belongs, Find the way in which ΔQ is the largest. In the process of calculating ΔQ, through the Gather stage, read the edge weight and the node corresponding to the edge, and add the weight of each edge to the attribute of the corresponding node in the Apply stage, calculate the degree of each node, and give A temporary label of a node represents the community it belongs to. ΔQ is calculated according to the degree and temporary label of each node.
其中∑in是社区C内部节点之间边的权值和,∑tot是该归入方式执行后所有被连接归入C的边的权值之和,ki是节点i所有连接边的权值之和,ki,in是节点i连接进入社群C后所有带入边的权值之和,m是整个网络所有边的权值之和。Where ∑in is the sum of the weights of the edges between the internal nodes of community C, ∑tot is the sum of the weights of all the edges connected to C after the implementation of the subsuming method, k i is the weight of all the connected edges of node i The sum of k i,in is the sum of the weights of all edges brought in after node i connects to community C, and m is the sum of the weights of all edges in the entire network.
c)选择ΔQ最大的归入方式后,所有社群内部的点聚合成一个新的点,社群增加一条自连边,自连边的权值是社区C内部节点之间边的权值和,社群和社群之间边的权值是聚合之前两个社群的有连接的成员之间边的权值之和。将聚合后的图回带到步骤b)中执行相同的操作。c) After selecting the largest ΔQ incorporation method, all points within the community are aggregated into a new point, and a self-connected edge is added to the community, and the weight of the self-connected edge is the sum of the weights of the edges between the internal nodes of community C , the weight of the edge between the community and the community is the sum of the weight of the edge between the connected members of the two communities before aggregation. Bring the aggregated graph back into step b) to do the same.
d)重复b)和c),直到ΔQ收敛。d) Repeat b) and c) until ΔQ converges.
本发明的有益效果是:将分布式技术应用于流式细胞计数据的门控计算;对于图模型所有按元素操作的并行化计算,提高程序执行性能。基于随机投影树的kNN搜索策略,降低构造图的时间到线性复杂度。Spark的分布式存储方式可以处理更大规模的数据集(百万数据点以上)。利用随机投影树加快了kNN的搜索效率。对于边权值的操作进行了并行化。割图算法采用了并行化的社群划分算法。本发明通过分布式计算框架实现对流式细胞计数据的门控过程进行基于图模型的聚类分析,对原始数据进行自动的划分,从而提升数据分析效率和准确度,降低划分过程的重复劳动和人为主观因素。The beneficial effects of the invention are: the distributed technology is applied to the gating calculation of the flow cytometry data; all the parallel calculations of the graphic model are operated by elements, and the program execution performance is improved. A kNN search strategy based on random projection trees reduces the time-to-linear complexity of constructing graphs. Spark's distributed storage method can handle larger-scale data sets (more than one million data points). The search efficiency of kNN is accelerated by using random projection tree. The operation of edge weights is parallelized. The cut graph algorithm uses a parallelized community partition algorithm. The present invention realizes the clustering analysis based on the graphical model for the gating process of the flow cytometer data through the distributed computing framework, and automatically divides the original data, thereby improving the efficiency and accuracy of data analysis, and reducing the duplication of labor and effort in the division process. Human subjective factors.
附图说明Description of drawings
图1为GraphLab构造存储图的方法;Figure 1 shows how GraphLab constructs a storage graph;
图2为Gather‐Apply‐scatter架构;Figure 2 shows the Gather‐Apply‐scatter architecture;
图3为聚类算法流程图;Figure 3 is a flow chart of the clustering algorithm;
图4中(a)为k‐d树原理,(b)为随机投影树原理。In Figure 4 (a) is the principle of k-d tree, (b) is the principle of random projection tree.
具体实施方式detailed description
下面结合附图和具体实施例对本发明作进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and specific embodiments.
本发明使用基于图的聚类算法来实现流式细胞计数据的自动门控,即通过在该数据集上构造、修饰并且划分图来实现数据的聚类。通过假设数据整体的分布模式和选择相似性度量,通过内部指标(或称为目标函数)的优化来实现稳定聚类结果的输出,依靠外部指标(和人为标注进行对比)来说明自身的准确性。非监督学习过程直接,易于高效实现,相较于传统的门控策略可以极大地解放人力,同时产生结果具有可重复性。The present invention uses a graph-based clustering algorithm to realize automatic gating of flow cytometry data, that is, realizes data clustering by constructing, modifying and dividing graphs on the data set. By assuming the overall distribution pattern of the data and selecting a similarity measure, the output of stable clustering results is achieved through the optimization of internal indicators (or called objective functions), and the accuracy of the clustering results is achieved by relying on external indicators (compared with artificial labels) . The unsupervised learning process is straightforward and easy to implement efficiently. Compared with the traditional gating strategy, it can greatly liberate manpower, and the results are repeatable.
而为了提升计算性能,使得该方法可以在未来适应更为庞大的数据集,更快地给予研究者反馈信息,本发明采用基于Spark的GraphLab分布式计算框架,该框架提供了成熟的基于图的数据挖掘模型函数借口,可以帮我们在分布式的框架上实现这一算法。GraphLab分布式计算框架具体如下:In order to improve computing performance so that the method can adapt to larger data sets in the future and give researchers feedback information faster, the present invention adopts the Spark-based GraphLab distributed computing framework, which provides a mature graph-based The data mining model function excuse can help us realize this algorithm on a distributed framework. The GraphLab distributed computing framework is as follows:
GraphLab的并行化主要依靠于顶点,顶点是最小并行粒度和通信粒度,而边是机器学习算法中数据依赖性的表现方式。GraphLab的存储模式是将单个顶点存储到多个机器上,分为master‐mirror,而对于某条边,GraphLab将其部署到某一台机器上,从而每一台机器上有一个本地图,如图1所示。而本地id到全局id存在一个映射表,图的节点是一个进程上所有线程共享的,在并行计算过程中,各个线程分摊所有顶点的gather→apply→scatter操作,如图2所示。The parallelization of GraphLab mainly relies on vertices, which are the minimum parallel granularity and communication granularity, while edges are the representation of data dependence in machine learning algorithms. GraphLab's storage mode is to store a single vertex on multiple machines, divided into master-mirror, and for a certain edge, GraphLab will deploy it to a certain machine, so that each machine has a local map, such as Figure 1 shows. There is a mapping table from the local id to the global id. The nodes in the graph are shared by all threads in a process. In the process of parallel computing, each thread shares the gather→apply→scatter operation of all vertices, as shown in Figure 2.
每个顶点每一轮迭代经过gather‐>apple‐>scatter三个阶段。Each iteration of each vertex goes through three stages of gather->apple->scatter.
1)Gather阶段1) Gather stage
工作顶点的边(可能是所有边,也有可能是入边或者出边)从领接顶点和自身收集数据,记为gather_data_i,各个边的数据graphlab会求和,记为sum_data。这一阶段对工作顶点、边都是只读的。The edge of the working vertex (it may be all edges, or it may be the incoming edge or the outgoing edge) collects data from the leading vertex and itself, which is recorded as gather_data_i, and graphlab will sum the data of each edge, which is recorded as sum_data. This phase is read-only for working vertices and edges.
2)Apply阶段2) Apply stage
Mirror将gather计算的结果sum_data发送给master顶点,master进行汇总为total。Master利用total和上一步的顶点数据,按照业务需求进行进一步的计算,然后更新master的顶点数据,并同步mirror。Apply阶段中,工作顶点可修改,边不可修改。Mirror sends the result sum_data calculated by gather to the master vertex, and the master summarizes it as total. The master uses the total and the vertex data in the previous step to perform further calculations according to business requirements, then updates the vertex data of the master and synchronizes the mirror. In the Apply phase, the working vertices can be modified, but the edges cannot be modified.
3)Scatter阶段3) Scatter stage
工作顶点更新完成之后,更新边上的数据,并通知对其有依赖的邻结顶点更新状态。这scatter过程中,工作顶点只读,边上数据可写。After the update of the working vertex is completed, the data on the edge is updated, and the adjacent node vertices that depend on it are notified to update the status. During the scatter process, the working vertices are read-only, and the data on the edges can be written.
在执行模型中,GraphLab通过控制三个阶段的读写权限来达到互斥的目的。在gather阶段只读,apply对顶点只写,scatter对边只写。并行计算的同步通过master和mirror来实现,mirror相当于每个顶点对外的一个接口人,将复杂的数据通信抽象成顶点的行为。In the execution model, GraphLab achieves mutual exclusion by controlling the read and write permissions of the three stages. Read-only in the gather phase, apply only writes to vertices, and scatter writes only to edges. The synchronization of parallel computing is realized through the master and mirror. The mirror is equivalent to an interface for each vertex, abstracting complex data communication into the behavior of the vertex.
对于常规的图操作,Graphlab也提供了可直接调用的API,例如元素访问graphlab.Edge、元素的融合graphlab.aggregate()等。For conventional graph operations, Graphlab also provides APIs that can be called directly, such as element access graphlab.Edge, element fusion graphlab.aggregate(), etc.
本发明提供的一种基于分布式图模型的流式细胞计数据自动门控方法,如图3所示,包括以下步骤:A method for automatic gating of flow cytometer data based on a distributed graph model provided by the present invention, as shown in Figure 3, comprises the following steps:
(1)根据GraphLab框架将输入的细胞计数据文件变换为RDD类型,得到流式细胞计数据X={x1,x2,...,xN};(1) Transform the input cytometer data file into an RDD type according to the GraphLab framework, and obtain the flow cytometer data X={x 1 ,x 2 ,...,x N };
(2)输入流式细胞计数据X={x1,x2,...,xN}、邻域大小k、随机重复的次数r;N为数据点个数。输出C={c1,c2,...,cN},代表每个数据点被分配到的标签,即门控后的结果;(2) Input the flow cytometry data X={x 1 ,x 2 ,...,x N }, the size of the neighborhood k, and the number of random repetitions r; N is the number of data points. Output C={c 1 ,c 2 ,...,c N }, which represents the label assigned to each data point, that is, the result after gating;
(3)基于随机投影树和GraphLab框架分布式构造kNN图:(3) Distributed construction of kNN graph based on random projection tree and GraphLab framework:
在Gather阶段,每个线程上随机生成若干超平面v,将每个数据点视为节点,本线程上的节点x选择满足最远点距离要求x·v≤median({z·v:z∈S}+δ)的节点自动成为邻居,在Scatter阶段节点x与邻居节点之间的边的权值为1,与非邻居节点之间的边的权值为0,当发现每个节点邻居个数达到k时,自动跳出,从而构建得到kNN图;S为节点的集合,δ为任意常数,median()表示取中值。为了达到需要的精度,反复执行2‐3次,取图的交集以保证准确度,得到最终的kNN图;In the Gather stage, several hyperplanes v are randomly generated on each thread, and each data point is regarded as a node. The node x on this thread is selected to meet the distance requirement of the farthest point x·v≤median({z·v:z∈ S}+δ) nodes automatically become neighbors. In the Scatter stage, the weight of the edge between node x and the neighbor node is 1, and the weight of the edge between the non-neighbor node is 0. When each node is found to have neighbors When the number reaches k, it will automatically jump out to construct a kNN graph; S is the set of nodes, δ is an arbitrary constant, and median() means to take the median value. In order to achieve the required accuracy, it is repeated 2-3 times, and the intersection of the graphs is taken to ensure the accuracy, and the final kNN graph is obtained;
(4)通过GraphLab框架的Gather阶段遍历每一个节点x的所有边;通过GraphLab框架的Scatter阶段,根据所有的和该节点连接的节点Vk(x)计算Jaccard系数Jk(xi,xj),将kNN图变成带权值图;(4) Through the Gather stage of the GraphLab framework, traverse all the edges of each node x; through the Scatter stage of the GraphLab framework, calculate the Jaccard coefficient J k ( xi , x j ) according to all nodes V k (x) connected to the node ), turn the kNN graph into a graph with weights;
(5)对得到的带权值图执行图割算法,得到一次划分C={c1,c2,...,cN}和目标函数Q值,Q值的计算如(2):(5) Execute the graph cut algorithm on the obtained weighted graph to obtain a division C={c 1 ,c 2 ,...,c N } and the objective function Q value. The calculation of the Q value is as in (2):
其中,J代表整幅带权值图的权值,δ(ci,cj)表示delta函数,当输入相等取1,否则取0。deg(vi)代表节点i的度,本模型中没有负权值,m表示带权值图中边的权值之和的1/2。该步骤的具体实现步骤如下:Among them, J represents the weight of the entire weighted graph, and δ( ci,c j ) represents the delta function. When the input is equal, it takes 1, otherwise it takes 0. deg(v i ) represents the degree of node i, there is no negative weight in this model, and m represents 1/2 of the sum of the weights of the edges in the weighted graph. The specific implementation steps of this step are as follows:
a)初始化每一个节点自成一个社群;a) Initialize each node to form a community;
b)并行遍历节点,对于节点i的邻域Vk(x)的每一个节点j∈Vk(x),计算如果将i归入j所属的社群cj中Q值的增加值ΔQ,寻找ΔQ最大的归入方式。在计算ΔQ的过程中通过Gather阶段,读取边权值和边所对应的节点,在Apply阶段将每条边的权值加到对应节点的属性上,计算出每个节点的度,并给予节点一个临时标签代表其所属的社群。根据每个节点的度和临时标签计算得到ΔQ。b) Traversing nodes in parallel, for each node j∈V k (x) in the neighborhood V k (x) of node i, calculate the increment value ΔQ of Q value if i is classified into the community c j to which j belongs, Find the way in which ΔQ is the largest. In the process of calculating ΔQ, through the Gather stage, read the edge weight and the node corresponding to the edge, and add the weight of each edge to the attribute of the corresponding node in the Apply stage, calculate the degree of each node, and give A temporary label of a node represents the community it belongs to. ΔQ is calculated according to the degree and temporary label of each node.
其中∑in是社区C内部节点之间边的权值和,∑tot是该归入方式执行后所有被连接归入C的边的权值之和,ki是节点i所有连接边的权值之和,ki,in是节点i连接进入社群C后所有带入边的权值之和,m是整个网络所有边的权值之和。Where ∑in is the sum of the weights of the edges between the internal nodes of community C, ∑tot is the sum of the weights of all the edges connected to C after the implementation of the subsuming method, k i is the weight of all the connected edges of node i The sum of k i,in is the sum of the weights of all edges brought in after node i connects to community C, and m is the sum of the weights of all edges in the entire network.
c)选择ΔQ最大的归入方式后,所有社群内部的点聚合成一个新的点,社群增加一条自连边,自连边的权值是社区C内部节点之间边的权值和,社群和社群之间边的权值是聚合之前两个社群的有连接的成员之间边的权值之和。将聚合后的图回带到步骤b)中执行相同的操作。c) After selecting the largest ΔQ incorporation method, all points within the community are aggregated into a new point, and a self-connected edge is added to the community, and the weight of the self-connected edge is the sum of the weights of the edges between the internal nodes of community C , the weight of the edge between the community and the community is the sum of the weight of the edge between the connected members of the two communities before aggregation. Bring the aggregated graph back into step b) to do the same.
d)重复b)和c),直到ΔQ收敛。d) Repeat b) and c) until ΔQ converges.
(6)重复r次步骤4,选择r次中最大Q值对应的C作为门控后的结果。(6) Repeat step 4 r times, and select C corresponding to the largest Q value in r times as the result after gating.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710007719.1A CN106841012B (en) | 2017-01-05 | 2017-01-05 | Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710007719.1A CN106841012B (en) | 2017-01-05 | 2017-01-05 | Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106841012A true CN106841012A (en) | 2017-06-13 |
CN106841012B CN106841012B (en) | 2019-05-21 |
Family
ID=59117755
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710007719.1A Active CN106841012B (en) | 2017-01-05 | 2017-01-05 | Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106841012B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110412287A (en) * | 2019-07-11 | 2019-11-05 | 上海宸安生物科技有限公司 | One kind being based on single celled immunocyte parting quantitative analysis method |
EP4027131A4 (en) * | 2019-09-02 | 2023-10-04 | H.U. Group Research Institute G.K. | GRID REGION ESTIMATION PROGRAM, GRID REGION ESTIMATION DEVICE, AND LEARNING MODEL GENERATION METHOD |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6057107A (en) * | 1995-10-11 | 2000-05-02 | Luminex Corporation | Methods and compositions for flow cytometric determination of DNA sequences |
WO2006089190A3 (en) * | 2005-02-18 | 2006-12-28 | Hematologics Inc | System, method, and article for detecting abnormal cells using multi-dimensional analysis |
US7551763B2 (en) * | 2004-01-14 | 2009-06-23 | Luminex Corporation | Methods for altering one or more parameters of a measurement system |
CN102279146A (en) * | 2011-03-11 | 2011-12-14 | 桂林优利特医疗电子有限公司 | Blood cell five classification method based on laser sheath flow technology |
-
2017
- 2017-01-05 CN CN201710007719.1A patent/CN106841012B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6057107A (en) * | 1995-10-11 | 2000-05-02 | Luminex Corporation | Methods and compositions for flow cytometric determination of DNA sequences |
US7551763B2 (en) * | 2004-01-14 | 2009-06-23 | Luminex Corporation | Methods for altering one or more parameters of a measurement system |
WO2006089190A3 (en) * | 2005-02-18 | 2006-12-28 | Hematologics Inc | System, method, and article for detecting abnormal cells using multi-dimensional analysis |
CN102279146A (en) * | 2011-03-11 | 2011-12-14 | 桂林优利特医疗电子有限公司 | Blood cell five classification method based on laser sheath flow technology |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110412287A (en) * | 2019-07-11 | 2019-11-05 | 上海宸安生物科技有限公司 | One kind being based on single celled immunocyte parting quantitative analysis method |
EP4027131A4 (en) * | 2019-09-02 | 2023-10-04 | H.U. Group Research Institute G.K. | GRID REGION ESTIMATION PROGRAM, GRID REGION ESTIMATION DEVICE, AND LEARNING MODEL GENERATION METHOD |
Also Published As
Publication number | Publication date |
---|---|
CN106841012B (en) | 2019-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | A stable community detection approach for complex network based on density peak clustering and label propagation | |
CN104239501B (en) | Mass video semantic annotation method based on Spark | |
CN104573124B (en) | A kind of education cloud application statistical method based on parallelization association rule algorithm | |
Singh et al. | Performance evaluation of k-means and heirarichal clustering in terms of accuracy and running time | |
CN106682116A (en) | OPTICS point sorting clustering method based on Spark memory computing big data platform | |
CN103559016A (en) | Frequent subgraph excavating method based on graphic processor parallel computing | |
CN105117488B (en) | A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster | |
CN104156463A (en) | Big-data clustering ensemble method based on MapReduce | |
CN115019123A (en) | Self-distillation contrast learning method for remote sensing image scene classification | |
Meirom et al. | Optimizing tensor network contraction using reinforcement learning | |
Ledur et al. | Towards a domain-specific language for geospatial data visualization maps with big data sets | |
Gao et al. | Uncovering overlapping community structure in static and dynamic networks | |
CN105631465A (en) | Density peak-based high-efficiency hierarchical clustering method | |
CN106841012A (en) | Flow cytometry data automatically-controlled door prosecutor method based on distributed graph model | |
Sardar | Reflecting on a Decade of Evolution: MapReduce‐Based Advances in Partitioning‐Based, Hierarchical‐Based, and Density‐Based Clustering (2013–2023) | |
CN108198084A (en) | A kind of complex network is overlapped community discovery method | |
CN113139556A (en) | Manifold multi-view image clustering method and system based on self-adaptive composition | |
CN108510080A (en) | A kind of multi-angle metric learning method based on DWH model many-many relationship type data | |
CN115393378B (en) | A low-cost and high-efficiency cell nucleus image segmentation method | |
CN112685661B (en) | Information organization method, device and equipment based on information granule space | |
Yadav et al. | Study Of Existing Methods & Techniques Of K-Means Clustering | |
Shi et al. | Machine learning under big data | |
Zhao et al. | A supervised learning community detection method based on attachment graph model | |
Jiang | Sculpture 3D printing realization system based on multi-dimensional image mining | |
US12361002B2 (en) | High-performance large object operations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |