WO2015069713A2 - Systems and methods for automated multiplex assay design - Google Patents
- Publication number
- WO2015069713A2 (PCT/US2014/064050)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- probe
- assays
- probes
- processor
- computing device
- Prior art date
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6813—Hybridisation assays
- C12Q1/6834—Enzymatic or biochemical coupling of nucleic acids to a solid phase
- C12Q1/6837—Enzymatic or biochemical coupling of nucleic acids to a solid phase using probe arrays or probe chips
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/30—Microarray design
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Definitions
- the present invention relates generally to automated design of an assay, for example, a biological multiplex assay.
- Bioassays are widely used in life sciences research to elucidate cell function, as well as in clinical diagnostics to identify medical conditions and monitor therapeutic response. While most current diagnostic tests interrogate single biomarkers, it is becoming evident that the simultaneous measurement of multiple markers is significantly more informative and will become the norm for future generations of tests. Multiplexed assays, those that simultaneously quantify different targets in the same sample, are particularly valuable as they provide not just a single data point but a whole snapshot of the vast network of organized and interacting molecules that constitute the typical makeup of a living organism.
- The results of multi-probe, multi-sample assays can be analyzed using a two-dimensional matrix, e.g., a heatmap. Each row of the matrix corresponds to a sample, while each column corresponds to a probe. Analysis software is available that groups probes together based on similar behavior. However, the interpretation of such a heatmap currently must be performed manually. A researcher must study the pattern of clusters and draw conclusions about the state of the biological system under study, based on his or her knowledge of the scientific literature, previous experience, hypotheses being tested, intuition, and the like. Such interpretation is subjective and prone to error.
- Described herein are methods and systems for automated assay design. For example, a statistically significant (and/or biologically meaningful) cluster of probes resulting from an experiment is identified, then analyzed for association with additional probes (different from the experimentally identified probes) by performing a query of one or more databases.
- mapping is performed between multiple types of probes, such that an experimentally observed probe is mapped into a database to identify associations with a different kind of probe, thereby designing a multiplex assay.
- an experimentally observed microRNA probe for which limited biological data are available, can be mapped into a database of mRNA and/or proteins, the expression of which is regulated by microRNA. Multiple iterations can be performed (e.g., evolutionary experimental design).
- the method provides a platform, for example, for acceleration of assay development and optimization, for the investigation of the dysregulation of probes in disease conditions, and for the monitoring of the productivity of a cell colony in a bio-reactor.
- the systems, methods, and apparatus utilize or include a tablet computer, a mobile phone device, or any other computer device or system capable of receiving input.
- a web site interface, mobile device application, customized computer application, or other electronic system is used to connect the user (e.g., at the tablet computer, mobile phone device, or other computer device) with a query mechanism for researching one or more literature sources to identify relevant additional probes for designing a multiplex assay.
- the systems, methods, and apparatus have applications in a wide variety of industries that supply scientific research products, testing systems, or testing services.
- Elements of embodiments described with respect to a given aspect of the invention may be used in various embodiments of another aspect of the invention. For example, it is contemplated that features of dependent claims depending from one independent claim can be used in apparatus, articles, systems, and/or methods of any of the other independent claims.
- the invention is directed to a method including the steps of: providing a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device; receiving, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data comprising an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; accessing, by the processor, one or more databases [e.g., biological or medical databases, scientific literature (e.g., PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, KEGG database (Kyoto Encyclopedia of Genes and Genomes))]; and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first
- the probe is an analyte.
- the one or more databases include biological, medical, or scientific literature database(s).
- the database is at least one database selected from PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, and the KEGG database (Kyoto Encyclopedia of Genes and Genomes).
- performing, by the processor, said first query to identify said one or more additional probe(s) different from said experimentally observed probe(s) includes performing a mapping between said experimentally observed probe(s) and said additional probe(s), wherein said additional probe(s) are of a different type than said experimentally observed probe(s).
- the experimentally observed probe(s) and the additional probe(s) are of different types.
- the different types of probes include two or more categories selected from the following: microRNA, messenger RNA, protein, and SNP.
- mapping between said experimentally observed probe(s) and said additional probe(s) is derived from an automated analysis, by the processor, of published literature (and/or contents of public database(s)) identifying co-occurrence of pairs of probes in published documents.
- mapping includes a matrix of weights indicating a quantified degree to which one probe co-occurs with another.
- the weight in the matrix corresponding to one probe (e.g., probe A) and another probe (e.g., probe B) is, or is proportional to, a number of searched publications that mention both the one probe and the other probe (e.g., probe A and probe B).
- identifying co-occurrence of pairs of probes includes identifying co-occurrence of an experimentally observed probe with an additional probe.
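The co-occurrence weights described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes each searched publication has already been reduced to the set of probe names it mentions, and the function name `cooccurrence_weights` and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(documents):
    """Count, for every unordered pair of probes, the documents mentioning both.

    `documents` is an iterable of sets of probe names, one set per document.
    Keys of the result are frozensets so that (A, B) and (B, A) are one pair.
    """
    weights = Counter()
    for probes in documents:
        for a, b in combinations(sorted(set(probes)), 2):
            weights[frozenset((a, b))] += 1
    return weights


# Hypothetical toy corpus: each set lists the probes named in one publication.
docs = [
    {"A", "B"},
    {"A", "B", "C"},
    {"C", "D"},
]
weights = cooccurrence_weights(docs)
print(weights[frozenset(("A", "B"))])  # number of documents naming both A and B
```

A `Counter` returns zero for pairs that never co-occur, which matches a sparse weight matrix in which most probe pairs have no shared publication.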
- the experimentally observed probe(s) include(s) one or more members of any one or more of the following categories: nucleic acids, proteins, cells, and viruses. In some embodiments, the experimentally observed probe(s) is/are quantified.
- the method also includes, following transmitting of the processing result related to the first query, receiving, via the network, data corresponding to a second set of one or more assays, said data including an identification of one or more probe(s) observed from experiments conducted using at least one of said additional probe(s) previously identified from said first query of said one or more databases, and performing, by the processor, a second query of the one or more databases using said data corresponding to the second set of one or more assays (with or without said data corresponding to the first set of one or more assays) to identify one or more additional probe(s) different from those identified in the first or second set of assays.
- the invention relates to a tangible product including an assay developed using the method of any aspect or embodiment discussed above.
- the assay includes one or more of the following substrates: microarrays, beads, hydrogel particles, and liquid phase probes.
- the one or more databases is a third party literature repository. In some embodiments, at least a portion of the one or more databases is downloaded and/or stored on a server associated with the processor. In some embodiments, the one or more databases is stored locally on a server associated with the processor.
- the method includes the steps of entering, via a network, data from a user computing device, said data corresponding to a first set of one or more assays, said data including an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; accessing, by a processor, one or more databases [e.g., biological or medical databases, scientific literature]; and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays; and displaying, on the user computing device, a processing result related to said first query, said processing result including an identification of said one or more additional
- a further aspect described herein relates to a system for automated design.
- the system includes a processor and a memory having instructions stored thereon.
- the instructions when executed by the processor, cause the processor to: provide a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device; receive, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data including an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; access, by the processor, one or more databases and perform, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one
- Figure 1 illustrates heatmap coloring of cells according to probes (columns) and samples (rows) according to some embodiments presented herein.
- Figure 2 illustrates a heatmap showing probe clustering and sample clustering according to some embodiments presented herein.
- Figure 3 illustrates panels for patients with diabetes according to some embodiments presented herein.
- Figure 4 illustrates an example cluster identification in biological databases according to some embodiments presented herein.
- On the left of Figure 4 is illustrated the set of probes clustered according to the experimental heatmap. On the right of Figure 4 is illustrated the set of four clustered probes mapped into a database.
- the four probes may either be "tightly" clustered (as shown), suggesting significance, or may be spread out over the database, suggesting that the cluster has no significance, at least in this database.
- Figure 5 illustrates on the left side a set of probes clustered according to the experimental heatmap according to some embodiments presented herein. On the right of Figure 5, the set of four clustered probes mapped into a database is illustrated. If a well-defined cluster is identified in the database, other members of the cluster become strong candidates for inclusion in future assays (e.g., to be studied in future experiments).
- Figure 6 illustrates a distribution of cluster tightness according to some embodiments presented herein.
- The "tightness" of the experimental cluster is evaluated in the context of the distribution of tightness for all clusters of that size in the database. The distribution is found by taking random samples of the same size from the database and calculating the tightness of each random cluster. If the experimental cluster is tighter than all but, for example, 1% (or, e.g., 0.1%-1%, 1%-5%, 5%-10%, 10%-20%, etc.) of random clusters, the experimentally identified cluster is likely to be significant.
- Figure 7 illustrates entity mapping between experimentally measured probes and literature or database entities according to some embodiments presented herein.
- a mapping can be defined between experimentally measured probes (for example, microRNAs) and literature or database entities (for example, Genes).
- the "tightness" of a cluster in the experiment can then be explored (e.g., analyzed) in the mapped space, to determine if it forms a tight cluster in that space, or a loose grouping that is no tighter than a random selection of genes in that space.
- Figure 8 is a block diagram of an example system for automating multiplex assay design based upon literature search results, in accordance with an illustrative embodiment of the present invention.
- Figure 9 is a block diagram of a network environment for creating software applications for computing devices, in accordance with an illustrative embodiment of the present invention.
- Figure 10 is a block diagram of a computing device and a mobile computing device for use in illustrative embodiments of the present invention.
- apparatus, systems, and methods of the claimed invention encompass variations and adaptations developed using information from the embodiments described herein. Adaptation and/or modification of the apparatus, systems, and methods described herein may be performed by those of ordinary skill in the relevant art.
- Targets in multiplexed assays can include, for example, DNA, messenger RNA, noncoding RNA, microRNA, polypeptides, proteins, metabolites, whole cells, viruses, or any combination thereof. Combining a multiplex probe with a multiplicity of samples yields further information. For instance, samples from a subject under test can be compared with a variety of healthy subjects, and subjects with a number of potential conditions, to see if the subject fingerprint matches any of the potential conditions.
- a cell colony in a bio-reactor can be sampled at a number of stages in its life-cycle.
- the pattern of probes will vary from stage to stage in the life-cycle.
- the samples from a production reactor can be periodically taken and the pattern of probes compared to the baseline series in order to monitor the life-cycle stage of the new colony, or the health of the colony, or the
- a research experiment might cast a wide net of probes to investigate what biological pathways are affected by the addition of an experimental drug to a cell colony, as compared to a control colony which does not receive the drug.
- the results of such multi-probe, multi-sample assays may be presented as a heatmap, a two-dimensional matrix. Each row of the matrix corresponds to a sample, each column of the matrix corresponds to a probe. The cell at the intersection of each row and column is colored according to the level of the probe in that sample, as shown in Figure 1.
- Analysis software may be used to group probes together that behave similarly under different sample conditions. For instance, at different life cycle stages of the bio-reactor colony, some probes may wax and some may wane together. The degree to which probes behave similarly suggests an association between probes, with some being closely related, and some being unrelated. Using such information, probe clusters can be identified, and a tree of probe relations, or dendrogram, can be built.
- Samples can also be clustered according to how similar or different their patterns of probes are.
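As an illustration of "behaving similarly", one common choice (an assumption here, since the source does not name a specific measure) is to turn the Pearson correlation between two probe profiles into a distance. The helper names and the toy profiles below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def profile_distance(x, y):
    """Near 0 for profiles that rise and fall together, near 2 for opposite ones."""
    return 1.0 - pearson(x, y)


# Hypothetical probe levels measured across four samples (one list per probe).
probe_a = [1.0, 2.0, 3.0, 4.0]
probe_b = [2.0, 4.0, 6.0, 8.0]   # waxes together with probe_a
probe_c = [4.0, 3.0, 2.0, 1.0]   # wanes as probe_a waxes
print(profile_distance(probe_a, probe_b), profile_distance(probe_a, probe_c))
```

The same function applies to samples by passing rows of the matrix instead of columns. Its pairwise output is the kind of input a hierarchical clustering routine needs to build a dendrogram.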
- An example of a heatmap clustered for both probes and samples is shown in Figure 2. The interpretation of such a heatmap is traditionally performed manually. A user
- Some embodiments presented herein relate to providing automated systems and methods for assisting in drawing conclusions or formulating new questions regarding experimental analysis, by using computer analysis of the scientific literature as well as biological and medical databases, in combination with the experimental data generated by a given set of experiments (or a single experiment), to identify significant clusters in the experimental data that warrant further investigation and refinement of the next set of experiments.
- the system or method is customized to a particular multiplex assay so that each experiment using the multiplex assays results in a set of suggestions relevant to the particular combination of probes and samples in future experiment(s) that will further elucidate (e.g., help with study and/or analysis of) the system under investigation (by, e.g., providing suggestions for further experiments).
- a particular study might investigate the level of microRNA probes in a group of patients (e.g., humans or mammals) who are known to have a particular condition, for example, diabetes, in order to identify a set of markers which may indicate that another patient is at risk of developing diabetes.
- Figure 3 illustrates an exemplary panel of probe clustering and sample clustering for patients with diabetes. As shown in Figure 3, the "full panel" contains an initial experiment with 38 probes, 6 of which are strongly over- or under-expressed in the patient group (e.g., patients having diabetes) relative to healthy controls.
- An optimized probe panel can be developed. Another potential outcome detectable by the software, though not present in this example, is that the six probes in the over/under-expressed group may all belong to a particular pathway, or be affected by a particular drug.
- the software provides suggestions that may be useful to spark new lines of inquiry, and to highlight unexpected and statistically significant correlations of probes.
- One advantage to the user is that the software can point out connections that the researcher may not have thought of otherwise. Rather than yielding a simple yes/no answer to a hypothesis, the software can suggest a range of alternative areas of investigation.
- the value of such software algorithms is further enhanced when used in combination with a multiplexing platform allowing scientists to choose any combination of targets they wish, and to change the probe set according to the suggestions offered by the software.
- Another advantage is that the software parses a very large fraction of the published literature, rather than the necessarily limited subset any given researcher can read in a lifetime, and also has access to online databases which capture the results of thousands of computer simulations and/or experiments. Therefore it has the potential to identify connections and relations that may not be apparent to a researcher engaged in a particular field of study.
- An additional advantage is that the software can identify other probes that are related to the experimentally highlighted probes. Detecting numerous independent members of the same probe family in future experiments would increase confidence in the strength of the finding. An example of expanding a cluster is shown in Figure 5.
- Methods described herein are advantageous to the assay provider in that they can suggest new marker panels to the researcher that may result in additional assay reagent purchases to move the same experiment forward.
- the methods are also advantageous in that they can make the assay more useful to the user. Therefore the user is more likely to use the assay provider again for other unrelated experiments.
- using data derived by literature text analysis is
- Associating microRNAs to their respective gene targets based on literature/database analysis can be performed in various ways.
- the method creates and/or uses a database that links microRNAs to gene targets based on computer models of where targeting is expected.
- the method links microRNAs to their respective gene targets based on a textual analysis of abstracts and/or articles in the relevant literature, e.g., by identifying papers that mention a particular gene and a particular microRNA in the same abstract, which would be suggestive of an association between the two. The quality of the association can be weighted according to evidence, if available.
- Reporter assay, western blot (protein immunoblot), and qPCR validation methods may be strong evidence of an association between a microRNA and a target, and may be weighted accordingly, whereas microarray, NGS, pSILAC, or other validation methods may be weaker evidence and may be weighted accordingly.
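A weighting scheme of this kind might look like the following sketch. The numeric weights are illustrative assumptions only, since the source gives no values, and `association_score` is a hypothetical helper name.

```python
# Illustrative weights only; the source does not specify numeric values.
EVIDENCE_WEIGHT = {
    "reporter assay": 1.0,   # described as strong evidence
    "western blot": 1.0,
    "qPCR": 1.0,
    "microarray": 0.5,       # described as weaker evidence
    "NGS": 0.5,
    "pSILAC": 0.5,
}
DEFAULT_WEIGHT = 0.25  # assumed weight for any method not listed above


def association_score(evidence):
    """Sum the weights of the validation methods reported for one microRNA-target pair."""
    return sum(EVIDENCE_WEIGHT.get(method, DEFAULT_WEIGHT) for method in evidence)


# A pair supported by one reporter assay and one microarray study.
print(association_score(["reporter assay", "microarray"]))
```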
- Mapping databases is not limited to linking microRNAs to genes. For example, microRNAs can be mapped to disease conditions.
- the analysis software described herein first clusters the assay probes according to the result of the experiment, using known methods.
- the clusters in the experiment are analyzed according to whether they form clusters in the literature or in the different databases accessible to the analysis software.
- the branches of the treemap which correspond to significant clusters are annotated so that the user can click on them and be brought to a report describing the nature of the association and the literature or database supporting the association.
- the probes all belong to a common pathway, or that they are all up/down-regulated in a given disease, or that they are all influenced by a given drug, or that they all originate in a particular tissue, or that they are all mentioned in a particular publication, or in publications by a particular author, or often mentioned together with a particular gene, drug, disease, or other biological entity in the literature.
- the group of probes in the cluster is measured according to distance to each other in the context of all the potential probe distances in the database, as illustrated in Figure 4.
- a given database contains many probes.
- A distance metric over the probes can be defined using the links of the database. For instance, if the database is a gene ontology, a sibling link might have a distance of two, a cousin link might have a distance of four, and so on. In a literature search, the distance between two probes might be inversely related to the number of publications which mention them both.
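Both kinds of distance can be sketched briefly. The tree version below assumes a single-parent hierarchy stored as a child-to-parent dictionary, and the literature version uses 1/(1+n) as one possible inverse relation. Both choices, and all names, are assumptions made for illustration.

```python
def tree_distance(parent, a, b):
    """Number of links on the path between nodes a and b in a rooted tree.

    `parent` maps each node to its parent; the root maps to None.
    """
    # Record how many steps each ancestor of a lies above a.
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:
        steps_from_a[node] = steps
        node, steps = parent.get(node), steps + 1
    # Walk up from b until an ancestor shared with a is reached.
    node, steps = b, 0
    while node is not None:
        if node in steps_from_a:
            return steps_from_a[node] + steps
        node, steps = parent.get(node), steps + 1
    raise ValueError("nodes do not share a root")


def literature_distance(shared_publications):
    """Distance that shrinks as the number of shared publications grows."""
    return 1.0 / (1.0 + shared_publications)


# Hypothetical hierarchy: x and y are siblings, z is their cousin.
parent = {"root": None, "p1": "root", "p2": "root",
          "x": "p1", "y": "p1", "z": "p2"}
print(tree_distance(parent, "x", "y"), tree_distance(parent, "x", "z"))
```

With this convention siblings are two links apart and cousins four, matching the example in the text.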
- On the left side of Figure 4 is a schematic 410 showing a set 420 of four probes 431, 432, 433, 434 clustered according to the experimental heatmap.
- On the right of Figure 4 is a schematic 430 illustrating the set of four clustered probes 431, 432, 433, 434 mapped into a database.
- the four probes may either be "tightly" clustered in the database (as shown in schematic 430), suggesting significance, or may be spread out over the database, suggesting that the cluster has no significance, at least in the present database.
- a cluster identified by the experiment can be analyzed for association in a given database by comparing the distances between probes within the cluster to the distances of random pairs of probes.
- a measure of tightness can be defined for this purpose, for example the average over the distances for all pairs of probes as illustrated in Figure 5.
- Figure 5 shows the set 420 of four probes 431, 432, 433, 434 clustered according to the experimental heatmap.
- Figure 5 also shows the set of four clustered probes mapped into a database. If a well-defined cluster 510 is identified in the database, other members of the cluster 521, 522, 523, 524 become strong candidates for inclusion in future assays (e.g., to be studied in future experiments).
- A null distribution of tightness can be established by sampling random clusters of the same size from among all probes. For practical purposes, this distribution may be pre-computed for each database and cluster size so that it is instantly available for testing the significance of increased tightness for experimental clusters.
- Figure 6 illustrates a distribution of cluster tightness.
- the "tightness" of the experimental cluster is evaluated in the context of the distribution of tightness for all clusters of that size in the database.
- The distribution 610 is found by taking random samples of the same size from the database 620, 630, 640 and calculating the tightness of each random cluster. If the experimental cluster 650 is tighter than all but, for example, 1% (or, e.g., 0.1%-1%, 1%-5%, 5%-10%, 10%-20%, etc.) of random clusters, the experimentally identified cluster is likely to be significant.
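The random-sampling test can be sketched as follows. This is a minimal illustration under stated assumptions: tightness is taken to be the average pairwise distance, the database is a plain list of probes, and `distance` is any callable on two probes. The function names, the seed, and the toy data are hypothetical.

```python
import random
from itertools import combinations


def tightness(cluster, distance):
    """Average distance over all pairs of members (smaller means tighter)."""
    pairs = list(combinations(cluster, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def empirical_p_value(cluster, all_probes, distance, n_random=2000, seed=0):
    """Fraction of random same-size clusters that are at least as tight."""
    rng = random.Random(seed)
    observed = tightness(cluster, distance)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_probes, len(cluster))
        if tightness(sample, distance) <= observed:
            hits += 1
    return hits / n_random


# Toy database: 100 probes on a line, with distance equal to their separation.
probes = list(range(100))
separation = lambda a, b: abs(a - b)
tight = empirical_p_value([10, 11, 12, 13], probes, separation)
loose = empirical_p_value([0, 33, 66, 99], probes, separation)
print(tight, loose)
```

A cluster whose value falls below the chosen cut-off (for example 0.01) would be flagged as likely significant. The random tightness values could equally be pre-computed once per database and cluster size.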
- the method can be extended by mapping between experimentally measured probes and probes which are well represented in the literature or databases.
- the experimental probe might be a microRNA probe, where limited biological data are available, while much more is known about mRNA and proteins, the expression of which is regulated by microRNA.
- Figure 7 illustrates entity mapping between experimentally measured probes and literature or database entities according to some embodiments presented herein.
- a mapping can be defined between experimentally measured probes (for example, microRNAs) 710 and literature or database entities (for example, Genes) 720.
- the "tightness" of a cluster in the experiment can then be analyzed in the mapped space, to determine if it forms a tight cluster in that space, or a loose grouping that is no tighter than a random selection of genes in that space.
- the microRNA probes can be mapped into a database of proteins using a literature matching technique.
- The number of references which link a given protein with a given microRNA provides a basis for calculating the strength of an association between a protein and a microRNA.
- a cluster of microRNA probes in the experiment can be mapped into a cluster of proteins in a database, and the distance between the proteins calculated using the database distance metric.
- the mapping need not be one to one. For instance, the experiment might identify a cluster of four microRNAs, each of which has three leading protein targets. Either the twelve proteins could be analyzed as a protein cluster, or each combination of four proteins could be analyzed as a protein cluster.
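The two options for a many-to-many mapping can be made concrete with a small sketch. The microRNA and protein names are placeholders, and reading "each combination of four proteins" as one protein chosen per microRNA is an interpretation, not something the source states.

```python
from itertools import product

# Hypothetical leading protein targets for a cluster of four microRNAs.
targets = {
    "miR-a": ["P1", "P2", "P3"],
    "miR-b": ["P4", "P5", "P6"],
    "miR-c": ["P7", "P8", "P9"],
    "miR-d": ["P10", "P11", "P12"],
}

# Option 1: pool every target into a single protein cluster.
pooled = sorted({protein for plist in targets.values() for protein in plist})

# Option 2 (assumed reading): one candidate cluster for each way of
# picking exactly one target per microRNA.
combos = list(product(*targets.values()))

print(len(pooled), len(combos))
```

Under this reading the second option yields 3 to the power 4 candidate clusters, so any significance test applied to them would need a multiple-testing correction.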
- One more general way of representing this mapping relationship is by an N by M matrix, where N is the number of microRNA and M the number of genes.
- the matrix elements may be computed from literature text analysis, for example as the number of publications that mention both probes in the same sentence, or the same paragraph.
- Literature text analysis makes use of the methods and systems of International Patent Application No. PCT/US13/68584, entitled "Automated Product Customization Based Upon Literature Search Results," filed November 15, 2013, and published as WO2014/071404 on May 8, 2014, the text of which is incorporated herein by reference in its entirety.
- other approaches may be used for the text analysis.
- Assay. By performing a series of steps as prescribed by an appropriate microRNA assay protocol, the researcher prepares a mixture of particles that are
- each of a number of distinct classes of particles will emit an optical signature indicative of the quantity of each of a number of molecular species present in a biological sample.
- the molecular specificity is achieved by means of a distinct probe attached to each class of particle, an oligonucleotide in the present example.
- the optical signals are measured in a flow cytometer, which produces a file containing all relevant information.
- Analysis software reads raw data generated by the cytometer and performs a signal processing procedure. The result is a plurality of quantity
- N is the number of samples and M is the number of probes.
- Each data point contains a signal value proportional to the measured quantity, plus confidence intervals and other statistics on the plurality of particles that contributed.
- the software can optionally perform additional analysis on the processed data. For example, the software can identify patterns of similarity between samples (and between probes), resulting in a clustering of related items.
- Various methods for clustering items can be used.
- the neighbor-joining method can be used to generate a phylogenetic tree.
- Another method is k-means clustering.
- One use for this clustering is to rearrange the items such that similar items occur next to each other in sequence. Such a rearrangement is particularly effective in the display of a heat map, as illustrated in Figure 2.
- the software does not account for information about the nature of the samples that go into producing the data.
- the probes since they are designed to detect well-characterized biological molecules (microRNA, in this example) that have been well studied, published in the scientific literature, and have been collected in central bioinformatics repositories such as miRBase. It is possible to use this available information to test clusters derived from the researcher's own data for biological relevance. If, for example, a particular biological pathway is important for the mechanism or disease being studied, the microRNA that play a role in this pathway may have similar expression profiles among the samples being studied. If it can be statistically ascertained that an experimental cluster contains a preponderance of microRNA from one particular pathway, this can be reported to the researcher, who may not have considered this pathway previously.
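One standard way to test for such a preponderance is a hypergeometric tail probability. The source does not name the test it uses, so this choice is an assumption. The sketch below asks how likely it is to see at least `overlap` pathway members in a cluster drawn at random from the probe population. The function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, pathway_size, cluster_size, overlap):
    """P(X >= overlap) when cluster_size probes are drawn without replacement
    from `population` probes, of which `pathway_size` belong to the pathway."""
    total = comb(population, cluster_size)
    upper = min(pathway_size, cluster_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 probes in total, 3 in the pathway, and a cluster of 3
# probes that all belong to the pathway.
print(enrichment_p_value(10, 3, 3, 3))
```

A small value suggests the cluster contains more pathway members than chance alone would explain.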
- Distance-based comparison. An alternative approach is to define the tightness of a cluster according to the average distance between members of the cluster. First, a distance metric D(i,j) is defined for each pair of members of the database. Then a cluster tightness is defined by measuring the average distance between pairs of nodes in the cluster:
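The formula itself is missing from this copy of the text. One form consistent with the verbal definition, assuming a cluster C with n members and writing T(C) for its tightness (both symbols introduced here for illustration), is the mean of D over all unordered pairs:

```latex
T(C) = \frac{2}{n(n-1)} \sum_{\substack{i,\, j \in C \\ i < j}} D(i, j)
```

The factor 2/(n(n-1)) is the reciprocal of the number of unordered pairs, so a smaller T(C) corresponds to a tighter cluster.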
- the cluster tightness of a given experimentally identified cluster is compared to the distribution of tightness for all clusters of that size. For practical purposes, the distribution can be sampled by choosing several thousand random clusters and computing their tightness. If an experimentally identified cluster is tighter than, say, 99% of randomly chosen clusters, a significance value can be assigned. As before, a Bonferroni correction should be applied to compensate for multiple testing if many clusters and/or many databases are analyzed.
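The distance-based comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the add-one smoothing of the empirical p-value, and the default of 5,000 random clusters are all hypothetical choices, and `dist` stands in for whatever metric D(i,j) a given database supports.

```python
import itertools
import random


def cluster_tightness(members, dist):
    """Average pairwise distance D(i, j) over all unordered pairs in the cluster."""
    pairs = list(itertools.combinations(members, 2))
    if not pairs:
        raise ValueError("a cluster needs at least two members")
    return sum(dist(i, j) for i, j in pairs) / len(pairs)


def empirical_p_value(cluster, universe, dist, n_samples=5000, seed=0):
    """Estimate how often a random cluster of the same size is at least as tight."""
    rng = random.Random(seed)
    observed = cluster_tightness(cluster, dist)
    size = len(cluster)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(universe, size)  # random cluster of the same size
        if cluster_tightness(sample, dist) <= observed:
            hits += 1
    # Add-one smoothing (an assumption) keeps the estimate away from exactly zero.
    return (hits + 1) / (n_samples + 1)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value for n_tests clusters and/or databases."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    universe = list(range(100))           # hypothetical database entries
    distance = lambda a, b: abs(a - b)    # hypothetical distance metric
    p = empirical_p_value([10, 11, 12, 13], universe, distance)
    print(p, bonferroni(p, n_tests=20))
```

Sampling random clusters avoids enumerating every cluster of a given size, which is what makes the comparison practical for databases with thousands of entries.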
- the biological group to which it belongs can be derived in a way that depends on the database in question. For instance, in a tree-structured database such as the Gene Ontology, the closest common ancestor of all the nodes in the cluster might be a suitable choice. For a pathway database, the pathway to which all the genes belong would be a suitable choice.
- clusters which are significant are identified. Those clusters are brought to the user's attention by annotating the branches of the treemap corresponding to a significant cluster with an icon. Clicking on the icon brings the user to a report describing the nature of the association (common pathway, common ontology, common disease, for instance) complete with literature or database references supporting the association.
- the report may also include figures and tables suitable for inclusion in a scientific publication.
- Entity mapping: Because much more biological data is available for genes than for microRNA, it is desirable to first map clusters of microRNA to clusters of genes and then analyze gene clusters against the various databases. Such mappings can be chained, e.g., genes can be further mapped to pathways or diseases. Similarly, microRNA can be mapped to publications and publications can be mapped to authors which in turn can be mapped to institutions. Mapping from experimental probes to literature entities is illustrated in Figure 6.
- a gene mapping is described by an N by M matrix, where N is the number of microRNA (approximately 1,000) and M is the number of genes.
- microRNAs are associated with genes by parsing a large number of publications for mentions of each.
- the publications are selected from the entire body of literature, such as available from PubMed, by performing a search on the term "microRNA". Other ways to select from the database can be considered, or no selection performed and the entire database parsed.
- Each publication containing a pair of terms that identify one microRNA and one gene contributes a score to the corresponding element of the matrix.
- this score varies according to the textual proximity of the terms. There may be, for instance, three different scores: For the terms to occur in the same publication, in the same paragraph, or in the same sentence. The closer the terms, the higher the score. An additional score is added when both terms occur in the title of the publication. Many other ways of measuring textual proximity would be obvious to those skilled in the art.
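The proximity-based scoring can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the numeric score values, the title bonus, the blank-line paragraph delimiter, the regular-expression sentence splitter, and the naive case-insensitive substring matching (which would, for example, also match "miR-21" inside "miR-210"). The source only states that closer terms score higher and that a title co-occurrence adds to the score.

```python
import re

# Hypothetical score values; the text only requires that closer terms score higher.
SCORES = {"publication": 1.0, "paragraph": 2.0, "sentence": 4.0}
TITLE_BONUS = 3.0


def _mentions(text, term):
    """Naive case-insensitive substring match (an assumption, not a real term recognizer)."""
    return term.lower() in text.lower()


def proximity_score(title, body, mirna, gene):
    """Score one publication for a (microRNA, gene) pair by textual proximity."""
    score = 0.0
    if _mentions(title, mirna) and _mentions(title, gene):
        score += TITLE_BONUS  # additional score when both terms occur in the title

    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in body.split("\n\n") if p.strip()]
    sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p)]
    full_text = title + "\n\n" + body

    # Award only the closest level at which the two terms co-occur.
    if any(_mentions(s, mirna) and _mentions(s, gene) for s in sentences):
        score += SCORES["sentence"]
    elif any(_mentions(p, mirna) and _mentions(p, gene) for p in paragraphs):
        score += SCORES["paragraph"]
    elif _mentions(full_text, mirna) and _mentions(full_text, gene):
        score += SCORES["publication"]
    return score


if __name__ == "__main__":
    print(proximity_score("A study", "miR-21 represses PTEN.", "miR-21", "PTEN"))
```

Summing this score over all parsed publications would give one element of the N by M matrix; a production system would replace the substring match with proper entity recognition.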
- a given cluster of microRNA can be described by a vector U of length N.
- Such vectors may be called profiles, of which clusters are a special case with elements of only 1 (microRNA present in cluster) and 0 (microRNA not present in cluster).
- the gene profile can then be turned into a gene cluster by setting a suitable threshold for the weights, where a weight above the threshold indicates membership in the cluster.
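One way to read the profile mapping is as a vector-matrix product followed by a threshold. The sketch below assumes that the gene profile is obtained by multiplying the microRNA profile U by the N by M mapping matrix; the source does not spell out that step, so treat it as an assumption. The function names, the toy matrices, and the threshold value are hypothetical.

```python
def map_profile(profile, mapping):
    """Project a length-N profile through an N x M mapping matrix (lists of lists)."""
    n = len(profile)
    if len(mapping) != n:
        raise ValueError("mapping must have one row per profile element")
    m = len(mapping[0]) if mapping else 0
    return [sum(profile[i] * mapping[i][j] for i in range(n)) for j in range(m)]


def profile_to_cluster(profile, threshold):
    """Indices whose weight is above the threshold are members of the cluster."""
    return [j for j, weight in enumerate(profile) if weight > threshold]


if __name__ == "__main__":
    # Toy data: 3 microRNA x 3 genes, then 3 genes x 2 pathways (all hypothetical).
    mirna_to_gene = [[3.0, 0.0, 1.0],
                     [2.0, 0.0, 0.0],
                     [0.0, 5.0, 0.0]]
    gene_to_pathway = [[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]]

    mirna_cluster = [1, 1, 0]                      # microRNA 0 and 1 are in the cluster
    gene_profile = map_profile(mirna_cluster, mirna_to_gene)
    gene_cluster = profile_to_cluster(gene_profile, threshold=2.0)

    # Mappings can be chained: genes can be mapped further to pathways.
    pathway_profile = map_profile(gene_profile, gene_to_pathway)
    print(gene_profile, gene_cluster, pathway_profile)
```

Keeping the intermediate result as a weighted profile, rather than thresholding at every step, corresponds to the variant in which discrete clusters are dropped and profiles are used throughout.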
- discrete clusters are eliminated altogether, and the method instead involves working with profiles throughout.
- a software system has been described for automatically detecting biologically significant clusters in a multiplex assay.
- the integration of an assay platform including reagents and their association with markers that can be identified optically or electrically, the software needed to analyze the raw data, and the software for identifying biologically significant clusters and identifying related probes provides a powerful platform to carry out research studies, accelerate assay development and optimization, investigate dysregulation of probes in disease conditions, or monitor the productivity of a cell colony in a bio-reactor.
- FIG. 8 is a flow chart of an example method 800 for automated multiplex assay design.
- data entered by a user is received by the processor (802).
- the data corresponds to a first set of one or more assays.
- the data includes an identification of one or more experimentally observed probe(s).
- the probe is an analyte.
- the data includes an identification of experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms.
- a query is constructed using the data corresponding to the first set of one or more assays (804).
- the processor accesses one or more databases (e.g., biological or medical databases, or scientific literature, such as PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, or the KEGG database (Kyoto Encyclopedia of Genes and Genomes)) and queries the one or more databases (806).
- the resulting query is then sent to a remote search engine, such as PubMed.
- the query includes instructions relevant to a particular third party query server.
- a particular query server may accept instructions on results formatting, such as instructions on various relevant pieces of information for up to 200 of the top results of the search. If multiple third party repositories are queried, in some implementations, equivalent instructions may be provided to each repository. In some implementations, repositories may be copied in whole or in part to be stored locally on the server, to improve response time.
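As an illustration of query construction with server-specific instructions, the sketch below builds a search URL for PubMed. It assumes the NCBI E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, and `retmode` parameters; the source names PubMed but not this interface, and the function name and the way probe names are combined with `OR` are hypothetical. The code only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not named in the source).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(probes, max_results=200):
    """Build a search URL for publications mentioning any of the given probes."""
    if not probes:
        raise ValueError("at least one probe name is required")
    # Quote each probe name so multi-word names are searched as phrases.
    term = " OR ".join(f'"{name}"' for name in probes)
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,  # cap on returned results, e.g. the top 200
        "retmode": "json",      # results-formatting instruction for this server
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pubmed_query(["miR-21", "miR-155"]))
```

A different repository would need its own parameter names, which is why equivalent instructions have to be prepared per repository when several are queried.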
- the processor queries the one or more databases (806) using the data corresponding to the first set of one or more assays. In some implementations, the processor identifies one or more additional probe(s) different from those identified in the first set of one or more assays (808), the one or more additional probe(s) being associated with the one or more experimentally observed probes in the first set of one or more assays.
- the processor conducts mapping between experimentally observed probe(s) and additional probe(s), wherein the additional probe(s) are of a different type than the experimentally observed probe(s).
- the experimentally observed probe(s) and the additional probe(s) are of different types, said different types comprising two or more categories selected from microRNA, messenger RNA, protein, and SNP.
- mapping between the experimentally observed probe(s) and the additional probe(s) is derived from an automated analysis, by the processor, of published literature (and/or contents of public database(s)) identifying co-occurrence of pairs of probes (e.g., co-occurrence of an experimentally observed probe with an additional probe) in published documents.
- the processor transmits to a user computing device a processing result related to the first query (810), the processing result including an identification of said one or more additional probe(s).
- the cloud computing environment 900 may include one or more resource providers 902a, 902b, 902c (collectively, 902).
- Each resource provider 902 may include computing resources.
- computing resources may include any hardware and/or software used to process data.
- computing resources may include hardware and/or software capable of executing algorithms, computer programs, and/or computer applications.
- exemplary computing resources may include application servers and/or databases with storage and retrieval capabilities.
- Each resource provider 902 may be connected to any other resource provider 902 in the cloud computing environment 900.
- the resource providers 902 may be connected over a computer network 908.
- Each resource provider 902 may be connected to one or more computing devices 904a, 904b, 904c (collectively, 904), over the computer network 908.
- the cloud computing environment 900 may include a resource manager 906.
- the resource manager 906 may be connected to the resource providers 902 and the computing devices 904 over the computer network 908.
- the resource manager 906 may facilitate the provision of computing resources by one or more resource providers 902 to one or more computing devices 904.
- the resource manager 906 may receive a request for a computing resource from a particular computing device 904.
- the resource manager 906 may identify one or more resource providers 902 capable of providing the computing resource requested by the computing device 904.
- the resource manager 906 may select a resource provider 902 to provide the computing resource.
- the resource manager 906 may facilitate a connection between the resource provider 902 and a particular computing device 904.
- the resource manager 906 may establish a connection between a particular resource provider 902 and a particular computing device 904. In some implementations, the resource manager 906 may redirect a particular computing device 904 to a particular resource provider 902 with the requested computing resource.
- FIG. 10 shows an example of a computing device 1000 and a mobile computing device 1050 that can be used to implement the techniques described in this disclosure.
- the computing device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
- the mobile computing device 1050 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices.
- the components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting.
- the computing device 1000 includes a processor 1002, a memory 1004, a storage device 1006, a high-speed interface 1008 connecting to the memory 1004 and multiple high-speed expansion ports 1010, and a low-speed interface 1012 connecting to a low-speed expansion port 1014 and the storage device 1006.
- Each of the processor 1002, the memory 1004, the storage device 1006, the high-speed interface 1008, the high-speed expansion ports 1010, and the low-speed interface 1012 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate.
- the processor 1002 can process instructions for execution within the computing device 1000, including instructions stored in the memory 1004 or on the storage device 1006 to display graphical information for a GUI on an external input/output device, such as a display 1016 coupled to the high-speed interface 1008.
- multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
- multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
- the memory 1004 stores information within the computing device 1000.
- the memory 1004 is a volatile memory unit or units.
- the memory 1004 is a non-volatile memory unit or units.
- the memory 1004 may also be another form of computer-readable medium, such as a magnetic or optical disk.
- the storage device 1006 is capable of providing mass storage for the computing device 1000.
- the storage device 1006 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
- Instructions can be stored in an information carrier.
- the instructions when executed by one or more processing devices (for example, processor 1002), perform one or more methods, such as those described above.
- the instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 1004, the storage device 1006, or memory on the processor 1002).
- the high-speed interface 1008 manages bandwidth-intensive operations for the computing device 1000, while the low-speed interface 1012 manages lower bandwidth-intensive operations.
- Such allocation of functions is an example only.
- the high-speed interface 1008 is coupled to the memory 1004, the display 1016 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 1010, which may accept various expansion cards (not shown).
- the low-speed interface 1012 is coupled to the storage device 1006 and the low-speed expansion port 1014.
- the low-speed expansion port 1014, which may include various communication ports (e.g., USB, Bluetooth®, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
- the computing device 1000 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 1020, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 1022. It may also be implemented as part of a rack server system 1024. Alternatively, components from the computing device 1000 may be combined with other components in a mobile device (not shown), such as a mobile computing device 1050. Each of such devices may contain one or more of the computing device 1000 and the mobile computing device 1050, and an entire system may be made up of multiple computing devices communicating with each other.
- the mobile computing device 1050 includes a processor 1052, a memory 1064, an input/output device such as a display 1054, a communication interface 1066, and a transceiver 1068, among other components.
- the mobile computing device 1050 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
- Each of the processor 1052, the memory 1064, the display 1054, the communication interface 1066, and the transceiver 1068, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
- the processor 1052 can execute instructions within the mobile computing device 1050, including instructions stored in the memory 1064.
- the processor 1052 may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
- the processor 1052 may provide, for example, for coordination of the other components of the mobile computing device 1050, such as control of user interfaces, applications run by the mobile computing device 1050, and wireless communication by the mobile computing device 1050.
- the processor 1052 may communicate with a user through a control interface 1058 and a display interface 1056 coupled to the display 1054.
- the display 1054 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
- the display interface 1056 may comprise appropriate circuitry for driving the display 1054 to present graphical and other information to a user.
- the control interface 1058 may receive commands from a user and convert them for submission to the processor 1052.
- an external interface 1062 may provide communication with the processor 1052, so as to enable near area communication of the mobile computing device 1050 with other devices.
- the external interface 1062 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
- the memory 1064 stores information within the mobile computing device 1050.
- the memory 1064 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
- An expansion memory 1074 may also be provided and connected to the mobile computing device 1050 through an expansion interface 1072, which may include, for example, a SIMM (Single In Line Memory Module) card interface.
- the expansion memory 1074 may provide extra storage space for the mobile computing device 1050, or may also store applications or other information for the mobile computing device 1050.
- the expansion memory 1074 may include instructions to carry out or supplement the processes described above, and may include secure information also.
- the expansion memory 1074 may be provided as a security module for the mobile computing device 1050, and may be programmed with instructions that permit secure use of the mobile computing device 1050.
- secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
- the memory may include, for example, flash memory and/or NVRAM memory (nonvolatile random access memory), as discussed below.
- instructions are stored in an information carrier such that the instructions, when executed by one or more processing devices (for example, processor 1052), perform one or more methods, such as those described above.
- the instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the memory 1064, the expansion memory 1074, or memory on the processor 1052).
- the instructions can be received in a propagated signal, for example, over the transceiver 1068 or the external interface 1062.
- the mobile computing device 1050 may communicate wirelessly through the communication interface 1066, which may include digital signal processing circuitry where necessary.
- the communication interface 1066 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile Communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), MMS (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), and PDC (Personal Digital Cellular).
- a GPS (Global Positioning System) receiver module 1070 may provide additional navigation- and location-related wireless data to the mobile computing device 1050, which may be used as appropriate by applications running on the mobile computing device 1050.
- the mobile computing device 1050 may also communicate audibly using an audio codec 1060, which may receive spoken information from a user and convert it to usable digital information.
- the audio codec 1060 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 1050.
- Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 1050.
- the mobile computing device 1050 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 1080. It may also be implemented as part of a smart-phone 1082, personal digital assistant, or other similar mobile device.
- Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
- These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.
- the term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
- the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
- the systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Abstract
Described herein, in certain embodiments, are systems and methods for automated multiplex assay design. For example, a statistically significant (and/or biologically meaningful) cluster of probes resulting from an experiment is identified, then analyzed for association with additional probes (different from the experimentally identified probes) by performing a query of one or more databases. In some embodiments, mapping is performed between multiple types of probes, such that an experimentally observed probe is mapped into a database to identify associations with a different kind of probe, thereby designing a multiplex assay.
Description
SYSTEMS AND METHODS FOR AUTOMATED MULTIPLEX ASSAY DESIGN
Priority Application
[0001] The present application claims priority to, and the benefit of, U.S. Provisional Patent Application No. 61/900,360, filed November 5, 2013.
Technical Field
[0002] The present invention relates generally to automated design of an assay, for example, a biological multiplex assay.
Background
[0003] Biological assays are widely used in life sciences research to elucidate cell function, as well as in clinical diagnostics to identify medical conditions and monitor therapeutic response. While most current diagnostic tests interrogate single biomarkers, it is becoming evident that the simultaneous measurement of multiple markers is significantly more informative and will become the norm for future generations of tests. Multiplexed assays, those that simultaneously quantify different targets in the same sample, are particularly valuable as they provide not just a single data point but a whole snapshot of the vast network of organized and interacting molecules that constitute the typical makeup of a living organism.
[0004] The results of multi-probe, multi-sample assays can be analyzed using a two-dimensional matrix, e.g., a heatmap. Each row of the matrix corresponds to a sample, while each column corresponds to a probe. Analysis software is available that groups probes together based on similar behavior. However, the interpretation of such a heatmap currently must be performed manually. A researcher must study the pattern of clusters and draw conclusions about the state of the biological system under study, based on his or her knowledge of the scientific literature, previous experience, hypotheses being tested, intuition, and the like. Such interpretation is subjective and prone to error.
Summary
[0005] Described herein are methods and systems for automated assay design. For example, a statistically significant (and/or biologically meaningful) cluster of probes resulting from an experiment is identified, then analyzed for association with additional probes (different from the experimentally identified probes) by performing a query of one or more databases. In some embodiments, mapping is performed between multiple types of probes, such that an experimentally observed probe is mapped into a database to identify associations with a different kind of probe, thereby designing a multiplex assay. For example, an experimentally observed microRNA probe, for which limited biological data are available, can be mapped into a database of mRNA and/or proteins, the expression of which is regulated by microRNA. Multiple iterations can be performed (e.g., evolutionary experimental design). The method provides a platform, for example, for acceleration of assay development and optimization, for the investigation of the dysregulation of probes in disease conditions, and for the monitoring of the productivity of a cell colony in a bio-reactor.
[0006] In various embodiments, the systems, methods, and apparatus utilize or include a tablet computer, a mobile phone device, or any other computer device or system capable of receiving input. A web site interface, mobile device application, customized computer application, or other electronic system is used to connect the user (e.g., at the tablet computer, mobile phone device, or other computer device) with a query mechanism for researching one or more literature sources to identify relevant additional probes for designing a multiplex assay. The systems, methods, and apparatus have applications in a wide variety of industries that supply scientific research products, testing systems, or testing services.
[0007] Elements of embodiments described with respect to a given aspect of the invention may be used in various embodiments of another aspect of the invention. For example, it is contemplated that features of dependent claims depending from one independent claim can be used in apparatus, articles, systems, and/or methods of any of the other independent claims.
[0008] In one aspect, the invention is directed to a method including the steps of: providing a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device; receiving, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data comprising an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; accessing, by the processor, one or more databases [e.g., biological or medical databases, scientific literature (e.g., PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, KEGG database (Kyoto Encyclopedia of Genes and Genomes))] and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays; and transmitting, to the user computing device, a processing result related to said first query, said processing result including an identification of said one or more additional probe(s). In some embodiments, the probe is an analyte. In some embodiments, the one or more databases include biological, medical, or scientific literature database(s). In some embodiments, the database is at least one database selected from PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, and KEGG database (Kyoto Encyclopedia of Genes and Genomes).
[0009] In some embodiments, performing, by the processor, said first query to identify said one or more additional probe(s) different from said experimentally observed probe(s) includes performing a mapping between said experimentally observed probe(s) and said additional probe(s), wherein said additional probe(s) are of a different type than said experimentally observed probe(s).
[0010] In some embodiments, the experimentally observed probe(s) and the additional probe(s) are of different types. In some embodiments, the different types of probes include two or more categories selected from the following: microRNA, messenger RNA, protein, and SNP.
[0011] In some embodiments, mapping between said experimentally observed probe(s) and said additional probe(s) is derived from an automated analysis, by the processor, of published literature (and/or contents of public database(s)) identifying co-occurrence of pairs of probes in published documents. In some embodiments, mapping includes a matrix of weights indicating a quantified degree to which one probe co-occurs with another.
[0012] In some embodiments, the weight in the matrix corresponding to one probe (e.g., probe A) and another probe (e.g., probe B) is, or is proportional to, a number of searched publications that mention both the one probe and the other probe (e.g., probe A and probe B). In some embodiments, identifying co-occurrence of pairs of probes includes identifying co-occurrence of an experimentally observed probe with an additional probe.
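The co-occurrence weights described above can be sketched as a simple count over publications. This is a minimal illustration, not the claimed implementation: the function name, the probe names, and the representation of each publication as a set of already-recognized probe names are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(publications):
    """Count, for each unordered probe pair, the publications that mention both."""
    weights = Counter()
    for mentioned in publications:
        # A publication counts once per pair, however often a probe recurs in it.
        for a, b in combinations(sorted(set(mentioned)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical input: the probe names detected in each searched publication.
publications = [
    {"miR-21", "PTEN"},
    {"miR-21", "PTEN", "miR-155"},
    {"miR-155", "SOCS1"},
]
weights = cooccurrence_weights(publications)
```

Keys are stored in sorted order so that the pair (A, B) and the pair (B, A) share one matrix element; pairs that never co-occur are simply absent and read as zero.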
[0013] In some embodiments, the experimentally observed probe(s) include(s) one or more members of any one or more of the following categories: nucleic acids, proteins, cells, and viruses. In some embodiments, the experimentally observed probe(s) is/are quantified.
[0014] In some embodiments, the method also includes, following transmitting of the processing result related to the first query, receiving, via the network, data corresponding to a second set of one or more assays, said data including an identification of one or more probe(s) observed from experiments conducted using at least one of said additional probe(s) previously identified from said first query of said one or more databases, and performing, by the processor, a second query of the one or more databases using said data corresponding to the second set of one or more assays (with or without said data corresponding to the first set of one or more assays) to identify one or more additional probe(s) different from those identified in the first or second set of assays.
[0015] In some embodiments, the invention relates to a tangible product including an assay developed using the method of any aspect or embodiment discussed above. In some embodiments, the assay includes one or more of the following substrates: microarrays, beads, hydrogel particles, and liquid phase probes.
[0016] In some embodiments, the one or more databases is a third party literature repository. In some embodiments, at least a portion of the one or more databases is downloaded and/or stored on a server associated with the processor. In some embodiments, the one or more databases is stored locally on a server associated with the processor.
[0017] Another aspect described herein relates to a method for automated design. The method includes the steps of entering, via a network, data from a user computing device, said data corresponding to a first set of one or more assays, said data including an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; accessing, by a processor, one or more databases [e.g., biological or medical databases, scientific literature] and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays; and displaying, on the user computing device, a processing result related to said first query, said processing result including an identification of said one or more additional probe(s).
[0018] A further aspect described herein relates to a system for automated design. The system includes a processor and a memory having instructions stored thereon. The instructions, when executed by the processor, cause the processor to: provide a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device; receive, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data including an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms; access, by the processor, one or more databases and perform, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays; and transmit, to the user computing device, a processing result related to said first query, said processing result including an identification of said one or more additional probe(s).
Brief Description of the Drawings
[0019] The foregoing and other objects, aspects, features, and advantages of the invention will become more apparent and may be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:
[0020] Figure 1 illustrates heatmap coloring of cells according to probes (columns) and samples (rows) according to some embodiments presented herein.
[0021] Figure 2 illustrates a heatmap showing probe clustering and sample clustering according to some embodiments presented herein.
[0022] Figure 3 illustrates panels for patients with diabetes according to some embodiments presented herein.
[0023] Figure 4 illustrates an example cluster identification in biological databases according to some embodiments presented herein. On the left side of Figure 4 is illustrated the set of probes clustered according to the experimental heatmap. On the right of Figure 4 is illustrated the set of four clustered probes mapped into a database. The four probes may either be "tightly" clustered (as shown), suggesting significance, or may be spread out over the database, suggesting that the cluster has no significance, at least in this database.
[0024] Figure 5 illustrates on the left side a set of probes clustered according to the experimental heatmap according to some embodiments presented herein. On the right of Figure 5, the set of four clustered probes mapped into a database is illustrated. If a well-defined cluster is identified in the database, other members of the cluster become strong candidates for including in future assays (e.g., to be studied in future experiments).
[0025] Figure 6 illustrates a distribution of cluster tightness according to some embodiments presented herein. The "tightness" of the experimental cluster is evaluated in the context of the distribution of tightness for all clusters of that size in the database. The distribution is found by taking random samples of the same size from the database and calculating the tightness of each random cluster. If the experimental cluster is tighter than all but, for example, 1% (or, e.g., 0.1%-1%, 1%-5%, 5%-10%, 10%-20%, etc.) of random clusters, the experimentally identified cluster is likely to be significant.
[0026] Figure 7 illustrates entity mapping between experimentally measured probes according to some embodiments presented herein. A mapping can be defined between experimentally measured probes (for example, microRNAs) and literature or database entities (for example, Genes). The "tightness" of a cluster in the experiment can then be explored (e.g., analyzed) in the mapped space, to determine if it forms a tight cluster in that space, or a loose grouping that is no tighter than a random selection of genes in that space.
[0027] Figure 8 is a block diagram of an example system for automating multiplex assay design based upon literature search results, in accordance with an illustrative embodiment of the present invention.
[0028] Figure 9 is a block diagram of a network environment for creating software applications for computing devices, in accordance with an illustrative embodiment of the present invention.
[0029] Figure 10 is a block diagram of a computing device and a mobile computing device for use in illustrative embodiments of the present invention.
[0030] The features and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements.
Detailed Description
[0031] It is contemplated that apparatus, systems, and methods of the claimed invention encompass variations and adaptations developed using information from the embodiments described herein. Adaptation and/or modification of the apparatus, systems, and methods described herein may be performed by those of ordinary skill in the relevant art.
[0032] It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.
[0033] Headers are provided herein for the convenience of the reader and are not meant to limit the scope of the embodiments described herein.
[0034] Targets in multiplexed assays can include, for example, DNA, messenger RNA, noncoding RNA, microRNA, polypeptides, proteins, metabolites, whole cells, viruses, or any combination thereof. Combining a multiplex probe with a multiplicity of samples yields further information. For instance, samples from a subject under test can be compared with a variety of healthy subjects, and subjects with a number of potential conditions, to see if the subject fingerprint matches any of the potential conditions.
[0035] In another example, a cell colony in a bio-reactor can be sampled at a number of stages in its life-cycle. The pattern of probes will vary from stage to stage in the life-cycle. After a baseline series has been established, the samples from a production reactor can be periodically taken and the pattern of probes compared to the baseline series in order to monitor the life-cycle stage of the new colony, or the health of the colony, or the effectiveness in expressing the product of the bio-reactor, whether fuel or medication.
[0036] In yet another example, a research experiment might cast a wide net of probes to investigate what biological pathways are affected by the addition of an experimental drug to a cell colony, as compared to a control colony which does not receive the drug.
[0037] The results of such multi-probe, multi-sample assays may be presented as a heatmap, a two-dimensional matrix. Each row of the matrix corresponds to a sample, each column of the matrix corresponds to a probe. The cell at the intersection of each row and column is colored according to the level of the probe in that sample, as shown in Figure 1.
[0038] Analysis software may be used to group probes together that behave similarly under different sample conditions. For instance, at different life cycle stages of the bio-reactor colony, some probes may wax and some may wane together. The degree to which probes behave similarly suggests an association between probes, with some being closely related, and some being unrelated. Using such information, probe clusters can be identified, and a tree of probe relations, or dendrogram, can be built. Some conventional software tools exist for this purpose, including for instance the tool discussed in Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods, Molecular Biology and Evolution 28: 2731-2739, which is incorporated herein by reference in its entirety, or the clustering module of the statistics package R (as discussed for example in Team, R.D.C. (2010). R: A language and environment for statistical computing, Vienna, Austria: R Foundation for Statistical Computing; retrieved from http://www.R-project.org).
[0039] Samples can also be clustered according to how their patterns of probes are similar or different. An example of a heatmap clustered for both probes and samples is shown in Figure 2. The interpretation of such a heatmap is traditionally performed manually. A user (researcher, clinician, or bio-engineer) studies the pattern of clusters, and draws conclusions about the state of the biological system under study, based on his or her knowledge of the scientific literature, his or her previous experience, hypotheses being tested, intuition and hunches, and other criteria.
[0040] Such interpretation is necessarily subjective, and guided by the assumptions built into the design of the experiment. As an example, a study may be exploring the hypothesis that a disease works by undermining a particular biological pathway, for instance a defect in glucose processing. Instead, in reality, it might be the case that a different pathway is involved, or the immune system may play an important role, or the condition might be the harbinger of a progressive and more serious disease. Unexpected findings such as these might easily be overlooked in conventional methods, which require manual analysis of experimental data.
[0041] Some embodiments presented herein relate to providing automated systems and methods for assisting in drawing conclusions or formulating new questions regarding experimental analysis, by using computer analysis of the scientific literature as well as biological and medical databases, in combination with the experimental data generated by a given set of experiments (or a single experiment), to identify significant clusters in the experimental data that warrant further investigation and refinement of the next set of experiments. In some embodiments presented herein, the system or method is customized to a particular multiplex assay so that each experiment using the multiplex assays results in a set of suggestions relevant to the particular combination of probes and samples in future experiment(s) that will further elucidate (e.g., help with study and/or analysis of) the system under investigation (by, e.g., providing suggestions for further experiments).
[0042] In some embodiments, a particular study might investigate the level of microRNA probes in a group of patients (e.g., humans or mammals) who are known to have a particular condition, for example, diabetes, in order to identify a set of markers which may indicate that another patient is at risk of developing diabetes. Figure 3 illustrates an exemplary panel of probe clustering and sample clustering for patients with diabetes. As shown in Figure 3, the "full panel" contains an initial experiment with 38 probes, 6 of which are strongly over- or under-expressed in the patient group (e.g., patients having diabetes) relative to healthy controls. Using those six probes to seed a search in the pathway database KEGG (Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M.; KEGG for integration and interpretation of large-scale molecular datasets. Nucleic Acids Res. 40, D109-D114 (2012)) results in an additional 4 probes that are strongly associated with those probes. Using the same six probes to seed a search across the literature at PubMed results in an additional 8 probes that are frequently mentioned in the same articles as the over/under-regulated probes. A new experiment with 6 of the original probes and the 12 new probes may reveal a new cluster which has an even stronger association with diabetes. By iterating the process, an optimized probe panel can be developed. Not present in this example, but another potential outcome detectable by the software, is that the six probes in the over/under-expressed group may all belong to a particular pathway, or be affected by a particular drug.
[0043] The software provides suggestions that may be useful to spark new lines of inquiry, and to highlight unexpected and statistically significant correlations of probes. One advantage to the user is that it can point out connections that the researcher may not have thought of otherwise. Rather than yielding a simple yes/no answer to a hypothesis, the software can suggest a range of alternative areas of investigation. The value of such software algorithms is further enhanced when used in combination with a multiplexing platform allowing scientists to choose any combination of targets they wish, and to change the probe set according to the suggestions offered by the software.
[0044] Another advantage is that the software parses a very large fraction of the published literature, rather than the necessarily limited subset any given researcher can read in a lifetime, and also has access to online databases which capture the results of thousands of computer simulations and/or experiments. Therefore it has the potential to identify connections and relations that may not be apparent to a researcher engaged in a particular field of study. An additional advantage is that the software can identify other probes that are related to the experimentally highlighted probes. Detecting numerous independent members of the same probe family in future experiments would increase confidence in the strength of the finding. An example of expanding a cluster is shown in Figure 5.
[0045] Methods described herein are advantageous to the assay provider in that they can suggest new marker panels to the researcher that may result in additional assay reagent purchases to move the same experiment forward. The methods are also advantageous in that they can make the assay more useful to the user. Therefore the user is more likely to use the assay provider again for other unrelated experiments. Compared to other methods of linking and representing biological data, using data derived by literature text analysis is advantageous, because it allows the documentation of all associations derived by algorithm with literature references, thus freeing the user from having to trust the algorithm, and providing a direct way to follow up on and substantiate observations.
[0046] Associating microRNAs to their respective gene targets based on literature/database analysis can be performed in various ways. In certain embodiments, the method creates and/or uses a database that links microRNAs to gene targets based on computer models of where targeting is expected. In certain embodiments, the method links microRNAs to their respective gene targets based on a textual analysis of abstracts and/or articles in the relevant literature, e.g., by identifying papers that mention a particular gene and a particular microRNA in the same abstract, which would be suggestive of an association between the two. The quality of the association can be weighted according to evidence, if available. For example, reporter assay, western blot (protein immunoblot), and qPCR validation methods may be strong evidence of an association between microRNA and a target, and may be weighted accordingly, whereas microarray, NGS, pSILAC, or other validation methods may be less strong evidence and may be weighted accordingly. Mapping databases is not limited to linking microRNAs to genes. For example, microRNAs can be mapped to disease conditions.
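Evidence-based weighting of an association can be sketched as a lookup table applied to each supporting record. The text ranks the validation methods but gives no numbers, so every numeric weight below is a hypothetical placeholder, as are the function name, the record layout, and the microRNA/gene pairs used as sample data.

```python
# Hypothetical weights: stronger validation methods score higher than weaker ones.
EVIDENCE_WEIGHT = {
    "reporter assay": 1.0,
    "western blot": 1.0,
    "qPCR": 1.0,
    "microarray": 0.5,
    "NGS": 0.5,
    "pSILAC": 0.5,
}
DEFAULT_WEIGHT = 0.25  # assumed weight when no recognized validation method is given


def association_score(evidence):
    """Sum the weights of all evidence items supporting one microRNA-target link."""
    return sum(EVIDENCE_WEIGHT.get(method, DEFAULT_WEIGHT) for method in evidence)


# Illustrative records: (microRNA, target, validation methods reported).
records = [
    ("miR-21", "PTEN", ["reporter assay", "western blot"]),
    ("miR-21", "PDCD4", ["microarray"]),
    ("miR-155", "SOCS1", ["text co-mention"]),
]
scores = {(mirna, gene): association_score(ev) for mirna, gene, ev in records}
```

Summing is only one possible aggregation; taking the maximum weight per link would be an equally plausible reading of "weighted accordingly".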
[0047] In certain embodiments, the analysis software described herein first clusters the assay probes according to the result of the experiment, using known methods. The clusters in the experiment are analyzed according to whether they form clusters in the literature or in the different databases accessible to the analysis software. When statistically significant clustering is found, the branches of the treemap which correspond to significant clusters are annotated so that the user can click on them and be brought to a report describing the nature of the association and the literature or database supporting the association. Among the different associations that are interesting is that the probes all belong to a common pathway, or that they are all up/down-regulated in a given disease, or that they are all influenced by a given drug, or that they all originate in a particular tissue, or that they are all mentioned in a particular publication, or in publications by a particular author, or often mentioned together with a particular gene, drug, disease, or other biological entity in the literature.
[0048] To identify whether an experimental cluster also represents a biologically meaningful cluster, the group of probes in the cluster is measured according to distance to each other in the context of all the potential probe distances in the database, as illustrated in Figure 4. A given database contains many probes. A distance metric over the probes can be defined using the links of the database. For instance, if the database is a gene ontology, a sibling link might have a distance of two, a cousin link might have a distance of four, and so on. In a literature search, the distance between two probes might be reciprocally related to the number of publications which mention them both. On the left side of Figure 4 is a schematic 410 showing a set 420 of four probes 431, 432, 433, 434 clustered according to the experimental heatmap. On the right of Figure 4 is a schematic 430 illustrating the set of four clustered probes 431, 432, 433, 444 mapped into a database. The four probes may either be "tightly" clustered in the database (as shown in schematic 430), suggesting significance, or may be spread out over the database, suggesting that the cluster has no significance, at least in the present database.
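The two distance metrics mentioned above can be sketched as follows. Both functions are illustrative assumptions rather than the specified implementation: the ontology is simplified to a child-to-parent dictionary, the node names are invented, and treating probes that are never co-mentioned as infinitely distant is a choice made for the example.

```python
import math


def literature_distance(co_mentions):
    """Distance reciprocally related to the number of publications naming both probes."""
    if co_mentions <= 0:
        return math.inf  # assumption: never co-mentioned means maximally distant
    return 1.0 / co_mentions


def tree_distance(parent, a, b):
    """Number of links on the path between a and b in a child -> parent tree."""
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:
        steps_from_a[node] = steps
        node, steps = parent.get(node), steps + 1
    node, steps = b, 0
    while node is not None:
        if node in steps_from_a:
            return steps_from_a[node] + steps
        node, steps = parent.get(node), steps + 1
    raise ValueError("nodes do not share an ancestor")


# Hypothetical ontology: root -> {p1, p2}, p1 -> {g1, g2}, p2 -> {g3}.
parent = {"g1": "p1", "g2": "p1", "g3": "p2", "p1": "root", "p2": "root"}
sibling = tree_distance(parent, "g1", "g2")
cousin = tree_distance(parent, "g1", "g3")
```

On this toy tree the sibling and cousin distances come out as two and four links, matching the values given in the paragraph.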
[0049] A cluster identified by the experiment can be analyzed for association in a given database by comparing the distances between probes within the cluster to the distances of random pairs of probes. A measure of tightness can be defined for this purpose, for example the average over the distances for all pairs of probes as illustrated in Figure 5. As shown previously in Figure 4, Figure 5 shows the set 420 of four probes 431, 432, 433, 434 clustered according to the experimental heatmap. On the right of Figure 5, the set of four clustered probes mapped into a database is illustrated. If a well-defined cluster 510 is identified in the database, other members of the cluster 521, 522, 523, 524 become strong candidates for including in future assays (e.g., to be studied in future experiments).
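A minimal sketch of the tightness measure and of cluster expansion follows. The tightness function implements the average over all pairwise distances as described; the candidate-selection rule (accept a probe whose mean distance to the cluster members does not exceed a chosen radius) is an assumption introduced here, as are all names and the one-dimensional toy distance.

```python
from itertools import combinations


def tightness(cluster, distance):
    """Average distance over all pairs of probes in the cluster (smaller is tighter)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        raise ValueError("a cluster needs at least two probes")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def candidate_probes(cluster, all_probes, distance, radius):
    """Probes outside the cluster whose mean distance to its members is <= radius."""
    members = set(cluster)
    candidates = []
    for probe in all_probes:
        if probe in members:
            continue
        mean_distance = sum(distance(probe, m) for m in cluster) / len(cluster)
        if mean_distance <= radius:
            candidates.append(probe)
    return candidates


# Toy database: each probe has a position on a line; distance is the gap between them.
position = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 2.0, "e": 10.0}


def toy_distance(x, y):
    return abs(position[x] - position[y])


cluster = ["a", "b", "c"]
cluster_tightness = tightness(cluster, toy_distance)
suggested = candidate_probes(cluster, list(position), toy_distance, cluster_tightness)
```

Using the cluster's own tightness as the radius keeps the expanded cluster about as compact as the original; a real database would need a radius tuned to its own distance scale.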
[0050] To establish statistical significance of a deviation of cluster tightness from overall tightness, a distribution of tightness under the null hypothesis can be established by sampling random clusters of the same size from among all probes. For practical purposes, this distribution may be pre-computed for each database and cluster size to be instantly available for testing the significance of increased tightness for experimental clusters.
[0051] Figure 6 illustrates a distribution of cluster tightness. The "tightness" of the experimental cluster is evaluated in the context of the distribution of tightness for all clusters of that size in the database. The distribution 610 is found by taking random samples of the same size from the database 620, 630, 640 and calculating the tightness of each random cluster. If the experimental cluster 650 is tighter than all but, for example, 1% (or, e.g., 0.1%-1%, 1%-5%, 5%-10%, 10%-20%, etc.) of random clusters, the experimentally identified cluster is likely to be significant.
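The sampling procedure can be sketched as an empirical significance estimate: draw random clusters of the same size, compute their tightness, and report the fraction that are at least as tight as the experimental cluster. The function names, the default of 2000 samples, the fixed seed, and the add-one correction are assumptions made for this sketch; the toy database places 200 probes on a line.

```python
import random
from itertools import combinations


def tightness(cluster, distance):
    """Average pairwise distance within the cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def empirical_p_value(cluster, all_probes, distance, n_samples=2000, seed=0):
    """Estimated fraction of random same-size clusters at least as tight as `cluster`."""
    rng = random.Random(seed)
    observed = tightness(cluster, distance)
    as_tight = 0
    for _ in range(n_samples):
        sample = rng.sample(all_probes, len(cluster))
        if tightness(sample, distance) <= observed:
            as_tight += 1
    # Add-one correction so the estimate is never reported as exactly zero.
    return (as_tight + 1) / (n_samples + 1)


# Toy database: probes 0..199 on a line, distance = absolute difference.
all_probes = list(range(200))


def toy_distance(x, y):
    return abs(x - y)


p_tight = empirical_p_value([10, 11, 12, 13], all_probes, toy_distance)
p_spread = empirical_p_value([5, 60, 120, 190], all_probes, toy_distance)
```

A cluster would be flagged when its estimate falls below the chosen cut-off (1% in the example above). Pre-computing the sampled tightness values per database and cluster size, as suggested in paragraph [0050], would replace the loop with a lookup.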
[0052] The method can be extended by mapping between experimentally measured probes and probes which are well represented in the literature or databases. For instance, the experimental probe might be a microRNA probe, where limited biological data are available, while much more is known about mRNA and proteins, the expression of which is regulated by microRNA. For example, Figure 7 illustrates entity mapping between experimentally measured probes according to some embodiments presented herein. A mapping can be defined between experimentally measured probes (for example, microRNAs) 710 and literature or database entities (for example, Genes) 720. The "tightness" of a cluster in the experiment can then be analyzed in the mapped space, to determine if it forms a tight cluster in that space, or a loose grouping that is no tighter than a random selection of genes in that space.
[0053] The microRNA probes can be mapped into a database of proteins using a literature matching technique. The number of references which link a given protein with a given microRNA provides a basis for calculating the strength of an association between a protein and a microRNA. A cluster of microRNA probes in the experiment can be mapped into a cluster of proteins in a database, and the distance between the proteins calculated using the database distance metric. The mapping need not be one to one. For instance, the experiment might identify a cluster of four microRNAs, each of which has three leading protein targets. Either the twelve proteins could be analyzed as a protein cluster, or each combination of four proteins could be analyzed as a protein cluster. One more general way of representing this mapping relationship is by an N by M matrix, where N is the number of microRNA and M the number of genes.
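The two ways of turning a microRNA cluster into protein clusters can be sketched as below. This is an illustration under stated assumptions: the reference counts, the names miR-A, miR-B, and P1 to P4, and the function names are invented, and "leading targets" is interpreted as the targets with the most linking references.

```python
from itertools import product


def leading_targets(reference_counts, mirna, top=3):
    """The `top` proteins with the most references linking them to `mirna`."""
    counts = reference_counts.get(mirna, {})
    ranked = sorted(counts, key=lambda protein: (-counts[protein], protein))
    return ranked[:top]


def pooled_cluster(reference_counts, mirnas, top=3):
    """Option 1: analyze all leading targets of all microRNAs as one protein cluster."""
    pooled = set()
    for mirna in mirnas:
        pooled.update(leading_targets(reference_counts, mirna, top))
    return pooled


def one_per_mirna_clusters(reference_counts, mirnas, top=3):
    """Option 2: every combination taking one leading target per microRNA."""
    choices = [leading_targets(reference_counts, m, top) for m in mirnas]
    return [set(combo) for combo in product(*choices)]


# Hypothetical reference counts: microRNA -> {protein: number of linking references}.
reference_counts = {
    "miR-A": {"P1": 5, "P2": 3, "P3": 1},
    "miR-B": {"P2": 4, "P4": 2},
}
pooled = pooled_cluster(reference_counts, ["miR-A", "miR-B"], top=2)
combos = one_per_mirna_clusters(reference_counts, ["miR-A", "miR-B"], top=2)
```

With four microRNAs and three leading targets each, the second option yields 3^4 = 81 combinations, so the pooled cluster is the cheaper of the two to test. A combination collapses to fewer members when two microRNAs share a target.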
[0054] The matrix elements may be computed from literature text analysis, for example as the number of publications that mention both probes in the same sentence, or the same paragraph. In certain embodiments, such literature text analysis makes use of the methods and systems of International Patent Application No. PCT/US13/68584, entitled, "Automated Product Customization Based Upon Literature Search Results," filed November 15, 2013, and published as WO2014/071404 on May 8, 2014, the text of which is incorporated herein by reference in its entirety. However, other approaches may be used for the text analysis.
Example 1
[0055] For the purpose of describing an implementation in more detail, a particular example is used. There are many other embodiments of the invention, none of which shall be excluded by the choice of this example. This example is useful in the field of multiplex detection of biomolecules, more specifically for the analysis of data resulting from testing for 1-100 microRNA species simultaneously, but not limited to that number.
[0056] Assay: By performing a series of steps as prescribed by an appropriate microRNA assay protocol, the researcher prepares a mixture of particles that are fluorescently labeled such that each of a number of distinct classes of particles will emit an optical signature indicative of the quantity of each of a number of molecular species present in a biological sample. The molecular specificity is achieved by means of a distinct probe attached to each class of particle, an oligonucleotide in the present example. The optical signals are measured in a flow cytometer, which produces a file containing all relevant information.
[0057] Data Processing: Analysis software reads raw data generated by the cytometer and performs a signal processing procedure. The result is a plurality of quantity measurements of each molecular species in each of a plurality of samples, called the processed data. These data take the form of an N x M matrix of data points, where N is the number of samples and M is the number of probes. Each data point contains a signal value proportional to the measured quantity, plus confidence intervals and other statistics on the plurality of particles that contributed.
[0058] Data Analysis: The software can optionally perform additional analysis on the processed data. For example, the software can identify patterns of similarity between samples (and between probes), resulting in a clustering of related items. Various methods for clustering items can be used. In this example, the neighbor-joining method can be used to generate a phylogenetic tree. Another method is k-means clustering. One use for this clustering is to rearrange the items such that similar items occur next to each other in sequence. Such a rearrangement is particularly effective in the display of a heat map, as illustrated in Figure 2.
[0059] Cluster comparison: In certain embodiments, the software does not account for information about the nature of the samples that go into producing the data. However, there is a large amount of data available on the probes, since they are designed to detect well-characterized biological molecules (microRNA, in this example) that have been well studied, published in the scientific literature, and collected in central bioinformatics repositories such as miRBase. It is possible to use this available information to test clusters derived from the researcher's own data for biological relevance. If, for example, a particular biological pathway is important for the mechanism or disease being studied, the microRNA that play a role in this pathway may have similar expression profiles among the samples being studied. If it can be statistically ascertained that an experimental cluster contains a preponderance of microRNA from one particular pathway, this can be reported to the researcher, who may not have considered this pathway previously.
[0060] Given a clustering A of N probes from the experimental data (a set of clusters $a_1, a_2, \ldots, a_k$, which are sets of items that do not overlap and cover the set of all items completely) and another clustering B (with clusters $b_1, b_2, \ldots, b_m$) from the literature, it is desirable to ascertain whether one of the clusters $a_i$ contains a statistically significant overlap with one of the clusters $b_j$. There are various possible approaches to this problem.
[0061] One of these is to perform a test of proportions on a given cluster $a_i$ with all the clusters $b_j$. The null hypothesis of no relation between the clusterings would result in a distribution of the overlaps $N_{ij} = |a_i \cap b_j|$ of the same proportions as the cluster sizes $|b_j|$. Equivalently, the expected value for $N_{ij}$ is $|a_i|\,|b_j|/N$. A $\chi^2$-test of the actual vs. the expected values with $m-1$ degrees of freedom will then yield a p-value for the significance of the deviation for each cluster $a_i$. Because there are k experimental clusters for which the test is performed, a Bonferroni correction should be performed, either by setting a more stringent significance threshold for p-values of 0.05/k, or by multiplying each p-value by k.
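The overlap test can be sketched as follows, assuming the test intended is a chi-square comparison of observed and expected overlaps. The sketch computes only the test statistic and the Bonferroni adjustment; converting the statistic into a p-value requires the chi-square distribution with m-1 degrees of freedom and is left to a statistics library. Function names and the toy clusters are assumptions.

```python
def overlap_chi_square(a_i, literature_clusters, n_items):
    """Chi-square statistic comparing overlaps |a_i & b_j| with |a_i||b_j|/N."""
    statistic = 0.0
    for b_j in literature_clusters:
        observed = len(a_i & b_j)
        expected = len(a_i) * len(b_j) / n_items
        statistic += (observed - expected) ** 2 / expected
    return statistic


def bonferroni(p_values):
    """Multiply each p-value by the number of tests k, capping the result at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# Toy data: N = 12 items, literature clustering B with m = 3 clusters of 4 items.
B = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
a_1 = {0, 1, 2, 4}  # experimental cluster concentrated in the first cluster of B
statistic = overlap_chi_square(a_1, B, n_items=12)
adjusted = bonferroni([0.01, 0.2, 0.6])
```

With expected counts as small as in this toy example the chi-square approximation is unreliable; an exact test would be preferable for small clusters.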
[0062] Distance-based comparison: An alternative approach is to define the tightness of a cluster according to the average distance between members of the cluster. First, a distance metric D(i,j) is defined for each pair of members of the database. Then a cluster tightness is defined by measuring the average distance between pairs of nodes in the cluster:
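The equation itself does not survive in this text. A form consistent with the verbal definition (the average of D over all pairs of members) is given below; the symbol T for tightness, the symbol C for the cluster, and n for its size are notational assumptions introduced here.

```latex
T(C) = \frac{2}{n(n-1)} \sum_{\substack{i, j \in C \\ i < j}} D(i, j), \qquad n = |C|
```

The prefactor is the reciprocal of the number of unordered pairs, n(n-1)/2, so a smaller T corresponds to a tighter cluster.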
[0063] To identify significant clustering in the database, the cluster tightness of a given experimentally identified cluster is compared to the distribution of tightness for all clusters of that size. For practical purposes, the distribution can be sampled by choosing several thousand random clusters and computing their tightness. If an experimentally identified cluster is tighter than, say, 99% of randomly chosen clusters, a significance value can be assigned. As before, a Bonferroni correction should be applied to compensate for multiple testing if many clusters and/or many databases are analyzed.
[0064] After identifying a cluster as significant, the biological group to which it belongs can be derived in a way that depends on the database in question. For instance, in a tree-structured database such as the Gene Ontology, the closest common ancestor of all the nodes in the cluster might be a suitable choice. For a pathway database, the pathway to which all the genes belong would be a suitable choice.
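Finding the closest common ancestor in a tree-structured database can be sketched as below. The sketch assumes each node has a single parent, stored in a child-to-parent dictionary; an ontology in which a term may have several parents would need a different traversal. The term names are hypothetical.

```python
def ancestors(parent, node):
    """Return the node followed by its ancestors, ordered from nearest to the root."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def closest_common_ancestor(parent, nodes):
    """Deepest node that lies on the root path of every node in `nodes`."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    common = set(ancestors(parent, nodes[0]))
    for node in nodes[1:]:
        common &= set(ancestors(parent, node))
    # Walk upward from the first node; the first shared entry is the closest one.
    for candidate in ancestors(parent, nodes[0]):
        if candidate in common:
            return candidate
    return None


# Hypothetical ontology fragment (child -> parent).
parent = {
    "glycolysis": "glucose metabolism",
    "gluconeogenesis": "glucose metabolism",
    "glucose metabolism": "metabolism",
    "lipid oxidation": "lipid metabolism",
    "lipid metabolism": "metabolism",
}
narrow = closest_common_ancestor(parent, ["glycolysis", "gluconeogenesis"])
broad = closest_common_ancestor(parent, ["glycolysis", "lipid oxidation"])
```

A cluster whose closest common ancestor sits near the root carries little biological meaning, so the depth of the returned node is itself a useful check.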
[0065] By iterating through the experimental clusters and the different databases, clusters which are significant are identified. Those clusters are brought to the user's attention by annotating the branches of the treemap corresponding to a significant cluster with an icon. Clicking on the icon brings the user to a report describing the nature of the association (common pathway, common ontology, common disease, for instance) complete with literature or database references supporting the association. The report may also include figures and tables suitable for inclusion in a scientific publication.
[0066] Entity mapping: Because much more biological data is available for genes than for microRNA, it is desirable to first map clusters of microRNA to clusters of genes and then analyze gene clusters against the various databases. Such mappings can be chained, e.g., genes can be further mapped to pathways or diseases. Similarly, microRNA can be mapped to publications and publications can be mapped to authors which in turn can be mapped to institutions. Mapping from experimental probes to literature entities is illustrated in Figure 6.
[0067] In some embodiments, a gene mapping is described by an N by M matrix, where N is the number of microRNA (approximately 1,000) and M is the number of genes (approximately 20,000). The matrix is sparse, i.e., contains many elements that are zero and can be omitted for efficiency. Matrix elements are further annotated with references, e.g., a list of publications which are used as the basis for mapping a microRNA to a gene. In some embodiments, microRNAs are associated with genes by parsing a large number of publications for mentions of each. The publications are selected from the entire body of literature, such as available from PubMed, by performing a search on the term "microRNA". Other ways to select from the database can be considered, or no selection performed and the entire database parsed. Each publication containing a pair of terms that identify one microRNA and one gene contributes a score to the corresponding element of the matrix. In some embodiments, this score varies according to the textual proximity of the terms. There may be, for instance, three different scores: for the terms to occur in the same publication, in the same paragraph, or in the same sentence. The closer the terms, the higher the score. An additional score is added when both terms occur in the title of the publication. Many other ways of measuring textual proximity would be obvious to those skilled in the art.
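The proximity scoring can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the numeric scores, the title bonus, the publication layout (a dictionary with `id`, `title`, and `paragraphs`, where each paragraph is a list of sentences), and the use of plain substring matching, which would need refinement in practice (for example, "miR-21" also matches "miR-210"). The sparse matrix is stored as a dictionary keyed by (microRNA, gene) pairs so that zero elements are simply absent.

```python
from collections import defaultdict

# Hypothetical score values: closer co-occurrence scores higher.
SCORES = {"publication": 1.0, "paragraph": 2.0, "sentence": 3.0}
TITLE_BONUS = 1.0


def score_publication(pub, mirna, gene):
    """Score one publication for a (microRNA, gene) pair by textual proximity."""
    paragraphs = pub["paragraphs"]
    body = [s for para in paragraphs for s in para]
    if any(mirna in s and gene in s for s in body):
        score = SCORES["sentence"]
    elif any(
        any(mirna in s for s in para) and any(gene in s for s in para)
        for para in paragraphs
    ):
        score = SCORES["paragraph"]
    elif any(mirna in s for s in body) and any(gene in s for s in body):
        score = SCORES["publication"]
    else:
        score = 0.0
    if mirna in pub["title"] and gene in pub["title"]:
        score += TITLE_BONUS
    return score


def build_mapping(pubs, mirnas, genes):
    """Accumulate scores into a sparse matrix and record supporting references."""
    matrix = defaultdict(float)  # (mirna, gene) -> accumulated score
    refs = defaultdict(list)     # (mirna, gene) -> ids of contributing publications
    for pub in pubs:
        for mirna in mirnas:
            for gene in genes:
                score = score_publication(pub, mirna, gene)
                if score > 0:
                    matrix[(mirna, gene)] += score
                    refs[(mirna, gene)].append(pub["id"])
    return dict(matrix), dict(refs)
```

Keeping the reference lists alongside the scores mirrors the annotation of matrix elements with supporting publications described above.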
[0068] A given cluster of microRNA can be described by a vector U of length N. Such vectors may be called profiles, of which clusters are a special case with elements of 1 (microRNA present in cluster) and 0 (microRNA not present in cluster), only. Given a mapping G from microRNA to genes as described above, a gene profile W is generated from the cluster U by multiplying with the matrix, i.e., W = G*U. This will lead to a profile for genes that will contain non-zero weights for all genes that are mapped to at least one microRNA in the cluster, and give greater weight to those mapped to multiple microRNA in the cluster. The gene profile can then be turned into a gene cluster by setting a suitable threshold for the weights, where a weight above the threshold indicates membership in the cluster.
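The profile mapping and thresholding can be sketched with the same sparse, dictionary-based representation. The function names and the example weights are hypothetical. With G stored as an N by M matrix (microRNA by gene), the product is taken over the microRNA index, so each gene weight is the sum of G[microRNA, gene] times U[microRNA].

```python
from collections import defaultdict


def map_profile(mapping, profile):
    """Map a microRNA profile U to a gene profile W through a sparse mapping G.

    mapping : dict of (mirna, gene) -> weight (zero elements omitted)
    profile : dict of mirna -> weight (1 for cluster members, absent otherwise)
    """
    gene_profile = defaultdict(float)
    for (mirna, gene), weight in mapping.items():
        u = profile.get(mirna, 0.0)
        if u:
            gene_profile[gene] += weight * u
    return dict(gene_profile)


def to_cluster(profile, threshold):
    """Turn a profile into a cluster: members are entries above the threshold."""
    return {name for name, weight in profile.items() if weight > threshold}
```

For instance, with a mapping of `{("miR-21", "PTEN"): 5.0, ("miR-155", "PTEN"): 2.0, ("miR-155", "SOCS1"): 3.0}` and a cluster containing both microRNA, PTEN receives a weight of 7.0 and SOCS1 a weight of 3.0, so a threshold of 4.0 keeps only PTEN.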
[0069] In some embodiments, discrete clusters are eliminated altogether, and the method instead involves working with profiles throughout.
[0070] A software system has been described for automatically detecting biologically significant clusters in a multiplex assay. The integration of an assay platform including reagents and their association with markers that can be identified optically or electrically, the software needed to analyze the raw data, and the software for identifying biologically significant clusters and identifying related probes provides a powerful platform to carry out research studies, accelerate assay development and optimization, investigate dysregulation of probes in disease conditions, or monitor the productivity of a cell colony in a bio-reactor.
[0071] Throughout the description, where apparatus and systems are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are apparatus, and systems of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.
[0072] It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.
[0073] FIG. 8 is a flow chart of an example method 800 for automated multiplex assay design.
[0074] In some implementations, data entered by a user is received by the processor (802). In some implementations, the data corresponds to a first set of one or more assays. In some embodiments, the data includes an identification of one or more experimentally observed probe(s). In some embodiments, the probe is an analyte. In some embodiments, the data includes an identification of experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms.
[0075] In some implementations, a query is constructed using the data corresponding to the first set of one or more assays (804).
[0076] In some implementations, the processor accesses one or more databases (e.g., biological or medical databases, scientific literature (e.g., PubMed, NIH Gene Expression Omnibus (GEO) datasets, CMAP/Connectivity Map datasets, KEGG database (Kyoto Encyclopedia of Genes and Genomes))) and queries the one or more databases (806).
[0077] The resulting query, for example, is then sent to a remote search engine, such as PubMed. In some implementations, the query includes instructions relevant to a particular third party query server. For example, a particular query server may accept instructions on results formatting, such as instructions on various relevant pieces of information for up to 200 of the top results of the search. If multiple third party repositories are queried, in some implementations, equivalent instructions may be provided to each repository. In some implementations, repositories may be copied in whole or in part to be stored locally on the server, to improve response time.
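As one possible illustration of query construction, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint, which is a public programmatic interface to PubMed. The specification does not state which interface is used, so the choice of endpoint, the OR-combination of probe names, the restriction to titles and abstracts, and the function name `build_query_url` are all assumptions. The code only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (an assumed choice of interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query_url(probes, max_results=200):
    """Build a PubMed search URL for a list of probe names.

    The result-count limit plays the role of a server-specific formatting
    instruction, here capped at 200 results by default.
    """
    term = " OR ".join(f'"{probe}"[Title/Abstract]' for probe in probes)
    params = {
        "db": "pubmed",         # database to search
        "term": term,           # search expression
        "retmax": max_results,  # maximum number of results returned
        "retmode": "json",      # response format
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

A different repository would take the same probe list but its own parameter names, which is where the equivalent per-repository instructions mentioned above would come in.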
[0078] In some implementations, the processor queries the one or more databases (806) using the data corresponding to the first set of one or more assays. In some implementations, the processor identifies one or more additional probe(s) different from those identified in the first set of one or more assays (808), the one or more additional probe(s) being associated with the one or more experimentally observed probes in the first set of one or more assays.
[0079] In some embodiments, when the one or more databases is queried (806) to identify said one or more additional probe(s) different from said experimentally observed probe(s), the processor conducts mapping between experimentally observed probe(s) and additional probe(s), wherein the additional probe(s) are of a different type than the experimentally observed probe(s). In some embodiments, the experimentally observed probe(s) and the additional probe(s) are of different types, said different types comprising two or more categories selected from microRNA, messenger RNA, protein, and SNP. In some embodiments, mapping between the experimentally observed probe(s) and the additional probe(s) is derived from an automated analysis, by the processor, of published literature (and/or contents of public database(s)) identifying co-occurrence of pairs of probes (e.g., co-occurrence of an experimentally observed probe with an additional probe) in published documents.
[0080] In some implementations, the processor transmits to a user computing device a processing result related to the first query (810), the processing result including an identification of said one or more additional probe(s).
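The identification of additional probes (808) can be sketched against a local co-occurrence index. This is a simplified illustration under stated assumptions: each document is reduced to the set of probe names it mentions, candidates are ranked by the number of documents in which they co-occur with any observed probe, and the function name `suggest_additional_probes` and the `top_n` cutoff are hypothetical.

```python
from collections import Counter


def suggest_additional_probes(observed, publications, top_n=5):
    """Rank probes that co-occur with the observed probes in published documents.

    observed     : iterable of experimentally observed probe names
    publications : iterable of sets, each holding the probes one document mentions
    Returns up to top_n probe names not in the observed set, most frequent first.
    """
    observed = set(observed)
    counts = Counter()
    for mentioned in publications:
        if observed & mentioned:
            # Count every co-mentioned probe that was not already observed.
            counts.update(mentioned - observed)
    return [probe for probe, _ in counts.most_common(top_n)]
```

The returned list corresponds to the identification of additional probes that would be included in the processing result sent to the user computing device (810).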
[0081] FIG. 9 shows an implementation of an exemplary cloud computing environment 900 for automated product customization based on literature search results. The cloud computing environment 900 may include one or more resource providers 902a, 902b, 902c (collectively, 902). Each resource provider 902 may include computing resources. In some implementations, computing resources may include any hardware and/or software used to process data. For example, computing resources may include hardware and/or software capable of executing algorithms, computer programs, and/or computer applications. In some implementations, exemplary computing resources may include application servers and/or databases with storage and retrieval capabilities. Each resource provider 902 may be connected to any other resource provider 902 in the cloud computing environment 900. In some implementations, the resource providers 902 may be connected over a computer network 908. Each resource provider 902 may be connected to one or more computing devices 904a, 904b, 904c (collectively, 904), over the computer network 908.
[0082] The cloud computing environment 900 may include a resource manager 906. The resource manager 906 may be connected to the resource providers 902 and the computing devices 904 over the computer network 908. In some implementations, the resource manager 906 may facilitate the provision of computing resources by one or more resource providers 902 to one or more computing devices 904. The resource manager 906 may receive a request for a computing resource from a particular computing device 904. The resource manager 906 may identify one or more resource providers 902 capable of providing the computing resource requested by the computing device 904. The resource manager 906 may select a resource provider 902 to provide the computing resource. The resource manager 906 may facilitate a connection between the resource provider 902 and a particular computing device 904. In some implementations, the resource manager 906 may establish a connection between a particular resource provider 902 and a particular computing device 904. In some implementations, the resource manager 906 may redirect a particular computing device 904 to a particular resource provider 902 with the requested computing resource.
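The identify-and-select behaviour of the resource manager can be sketched as a small data structure. The class names, the representation of resources as strings, and the first-match selection rule are assumptions made for illustration; the specification leaves the selection policy open.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProvider:
    name: str
    resources: set  # names of the computing resources this provider offers


@dataclass
class ResourceManager:
    providers: list = field(default_factory=list)

    def capable_providers(self, resource):
        """Identify every provider able to supply the requested resource."""
        return [p for p in self.providers if resource in p.resources]

    def select(self, resource):
        """Select one capable provider (here simply the first), or None."""
        candidates = self.capable_providers(resource)
        return candidates[0] if candidates else None
```

A computing device's request would then be answered either by connecting it to the selected provider or by redirecting it there, as described above.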
[0083] FIG. 10 shows an example of a computing device 1000 and a mobile computing device 1050 that can be used to implement the techniques described in this disclosure. The computing device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 1050 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting.
[0084] The computing device 1000 includes a processor 1002, a memory 1004, a storage device 1006, a high-speed interface 1008 connecting to the memory 1004 and multiple high-speed expansion ports 1010, and a low-speed interface 1012 connecting to a low-speed expansion port 1014 and the storage device 1006. Each of the processor 1002, the memory 1004, the storage device 1006, the high-speed interface 1008, the high-speed expansion ports 1010, and the low-speed interface 1012, are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 1002 can process instructions for execution within the computing device 1000, including instructions stored in the memory 1004 or on the storage device 1006 to display graphical information for a GUI on an external input/output device, such as a display 1016 coupled to the high-speed interface 1008. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
[0085] The memory 1004 stores information within the computing device 1000. In some implementations, the memory 1004 is a volatile memory unit or units. In some implementations, the memory 1004 is a non-volatile memory unit or units. The memory 1004 may also be another form of computer-readable medium, such as a magnetic or optical disk.
[0086] The storage device 1006 is capable of providing mass storage for the computing device 1000. In some implementations, the storage device 1006 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 1002), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 1004, the storage device 1006, or memory on the processor 1002).
[0087] The high-speed interface 1008 manages bandwidth-intensive operations for the computing device 1000, while the low-speed interface 1012 manages lower bandwidth-intensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 1008 is coupled to the memory 1004, the display 1016 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 1010, which may accept various expansion cards (not shown). In the implementation, the low-speed interface 1012 is coupled to the storage device 1006 and the low-speed expansion port 1014. The low-speed expansion port 1014, which may include various communication ports (e.g., USB, Bluetooth®, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
[0088] The computing device 1000 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 1020, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 1022. It may also be implemented as part of a rack server system 1024. Alternatively, components from the computing device 1000 may be combined with other components in a mobile device (not shown), such as a mobile computing device 1050. Each of such devices may contain one or more of the computing device 1000 and the mobile computing device 1050, and an entire system may be made up of multiple computing devices communicating with each other.
[0089] The mobile computing device 1050 includes a processor 1052, a memory 1064, an input/output device such as a display 1054, a communication interface 1066, and a transceiver 1068, among other components. The mobile computing device 1050 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 1052, the memory 1064, the display 1054, the communication interface 1066, and the transceiver 1068, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
[0090] The processor 1052 can execute instructions within the mobile computing device 1050, including instructions stored in the memory 1064. The processor 1052 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 1052 may provide, for example, for coordination of the other components of the mobile computing device 1050, such as control of user interfaces, applications run by the mobile computing device 1050, and wireless communication by the mobile computing device 1050.
[0091] The processor 1052 may communicate with a user through a control interface 1058 and a display interface 1056 coupled to the display 1054. The display 1054 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 1056 may comprise appropriate circuitry for driving the display 1054 to present graphical and other information to a user. The control interface 1058 may receive commands from a user and convert them for submission to the processor 1052. In addition, an external interface 1062 may provide communication with the processor 1052, so as to enable near area communication of the mobile computing device 1050 with other devices. The external interface 1062 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
[0092] The memory 1064 stores information within the mobile computing device 1050. The memory 1064 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 1074 may also be provided and connected to the mobile computing device 1050 through an expansion interface 1072, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 1074 may provide extra storage space for the mobile computing device 1050, or may also store applications or other information for the mobile computing device 1050. Specifically, the expansion memory 1074 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, the expansion memory 1074 may be provided as a security module for the mobile computing device 1050, and may be programmed with instructions that permit secure use of the mobile computing device 1050. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
[0093] The memory may include, for example, flash memory and/or NVRAM memory (nonvolatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier, such that the instructions, when executed by one or more processing devices (for example, processor 1052), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the memory 1064, the expansion memory 1074, or memory on the processor 1052). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 1068 or the external interface 1062.
[0094] The mobile computing device 1050 may communicate wirelessly through the communication interface 1066, which may include digital signal processing circuitry where necessary. The communication interface 1066 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 1068 using a radio-frequency. In addition, short-range communication may occur, such as using a Bluetooth®, Wi-Fi™, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 1070 may provide additional navigation- and location-related wireless data to the mobile computing device 1050, which may be used as appropriate by applications running on the mobile computing device 1050.
[0095] The mobile computing device 1050 may also communicate audibly using an audio codec 1060, which may receive spoken information from a user and convert it to usable digital information. The audio codec 1060 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 1050. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 1050.
[0096] The mobile computing device 1050 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 1080. It may also be implemented as part of a smart-phone 1082, personal digital assistant, or other similar mobile device.
[0097] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0098] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
[0099] To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
[0100] The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
[0101] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[0102] In view of the structure, functions and apparatus of the systems and methods described here, in some implementations, environments and methods for automated multiplex assay design and customization based on literature search results are provided. Having described certain implementations of methods and apparatus for supporting automated multiplex assay design and customization based on literature search results, it will now become apparent to one of skill in the art that other implementations incorporating the concepts of the disclosure may be used. Therefore, the disclosure should not be limited to certain implementations, but rather should be limited only by the spirit and scope of the following claims.
Claims
1. A method for automated assay design, the method comprising the steps of:
providing a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device;
receiving, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data comprising an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms;
accessing, by the processor, one or more databases and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays;
transmitting, to the user computing device, a processing result related to said first query, said processing result comprising an identification of said one or more additional probe(s).
2. The method of claim 1, wherein performing, by the processor, said first query to identify said one or more additional probe(s) different from said experimentally observed probe(s) comprises performing a mapping between said experimentally observed probe(s) and said additional probe(s), wherein said additional probe(s) are of a different type than said experimentally observed probe(s).
3. The method of claim 2, wherein said experimentally observed probe(s) and said additional probe(s) are of different types, said different types comprising two or more categories selected from the following: microRNA, messenger RNA, protein, and SNP.
4. The method of claim 2 or 3, wherein said mapping between said experimentally observed probe(s) and said additional probe(s) is derived from an automated analysis, by the processor, of published literature (and/or contents of public database(s)) identifying co-occurrence of pairs of probes in published documents.
5. The method of claim 4, wherein said mapping comprises a matrix of weights indicating a quantified degree to which one probe co-occurs with another.
6. The method of claim 5, wherein the weight in the matrix corresponding to one probe and another probe is, or is proportional to, a number of searched publications that mention both the one probe and the other probe.
7. The method of claim 4, wherein identifying co-occurrence of pairs of probes comprises identifying co-occurrence of an experimentally observed probe with an additional probe.
8. The method of any one of claims 1 to 7, wherein said experimentally observed probe(s) comprise(s) one or more members of any one or more of the following categories: nucleic acids, proteins, cells, and viruses.
9. The method of claim 8, wherein said experimentally observed probe(s) is/are quantified.
10. The method of any one of claims 1 to 9, further comprising, following said transmitting of said processing result related to said first query, receiving, via the network, data corresponding to a second set of one or more assays, said data comprising an identification of one or more probe(s) observed from experiments conducted using at least one of said additional probe(s) previously identified from said first query of said one or more databases, and performing, by the processor, a second query of the one or more databases using said data corresponding to the second set of one or more assays (with or without said data corresponding to the first set of one or more assays) to identify one or more additional probe(s) different from those identified in the first or second set of assays.
11. A tangible product comprising an assay developed using the method of any one of claims 1 to 9.
12. The tangible product of claim 11, wherein said assay comprises one or more of the following substrates: microarrays, beads, hydrogel particles, and liquid phase probes.
13. A method for automated assay design, the method comprising the steps of:
entering, via a network, data from a user computing device, said data corresponding to a first set of one or more assays, said data comprising an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms;
accessing, by a processor, one or more databases [e.g., biological or medical databases, scientific literature] and performing, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays;
displaying, on the user computing device, a processing result related to said first query, said processing result comprising an identification of said one or more additional probe(s).
14. A system comprising:
a processor; and
a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to:
provide a graphical user interface for display on a user computing device, said interface being configured to accept input from the user computing device;
receive, via a network, data from the user computing device, said data corresponding to a first set of one or more assays, said data comprising an identification of one or more experimentally observed probe(s) and/or experimentally quantified level(s) of one or more probe(s) in one or more samples obtained from one or more living organisms;
access, by the processor, one or more databases and perform, by the processor, a first query of the one or more databases using said data corresponding to the first set of one or more assays to identify one or more additional probe(s) different from those identified in the first set of one or more assays, said one or more additional probe(s) being associated with said one or more experimentally observed probes in the first set of one or more assays;
transmit, to the user computing device, a processing result related to said first query, said processing result comprising an identification of said one or more additional probe(s).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361900360P | 2013-11-05 | 2013-11-05 | |
US61/900,360 | 2013-11-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2015069713A2 (en) | 2015-05-14 |
WO2015069713A3 (en) | 2015-11-12 |
Family
ID=53042309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2014/064050 (WO2015069713A2) | Systems and methods for automated multiplex assay design | 2013-11-05 | 2014-11-05 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2015069713A2 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6519583B1 (en) * | 1997-05-15 | 2003-02-11 | Incyte Pharmaceuticals, Inc. | Graphical viewer for biomolecular sequence data |
US20040018506A1 (en) * | 2002-01-25 | 2004-01-29 | Koehler Ryan T. | Methods for placing, accepting, and filling orders for products and services |
CN102712955A (en) * | 2009-11-03 | 2012-10-03 | Htg分子诊断有限公司 | Quantitative nuclease protection sequencing (qNPS) |
WO2013134633A1 (en) * | 2012-03-09 | 2013-09-12 | Firefly Bioworks, Inc. | Methods and apparatus for classification and quantification of multifunctional objects |
EP2915119A4 (en) * | 2012-11-05 | 2016-04-20 | Firefly Bioworks Inc | Automated product customization based upon literature search results |
2014-11-05: WO application PCT/US2014/064050 (WO2015069713A2, en), status: active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2015069713A3 (en) | 2015-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Advances in bulk and single-cell multi-omics approaches for systems biology and precision medicine | |
Zhou et al. | Metascape provides a biologist-oriented resource for the analysis of systems-level datasets | |
Xie et al. | It is time to apply biclustering: a comprehensive review of biclustering applications in biological and biomedical data | |
McCue et al. | The scope of big data in one medicine: unprecedented opportunities and challenges | |
Yu et al. | DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis | |
Zhu et al. | Targeted exploration and analysis of large cross-platform human transcriptomic compendia | |
Vaske et al. | Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM | |
Curtis et al. | Pathways to the analysis of microarray data | |
Garber et al. | Computational methods for transcriptome annotation and quantification using RNA-seq | |
Del Valle et al. | Disease networks and their contribution to disease understanding: A review of their evolution, techniques and data sources | |
Imbeaud et al. | ‘The 39 steps’ in gene expression profiling: critical issues and proposed best practices for microarray experiments | |
Ahmed | Precision medicine with multi-omics strategies, deep phenotyping, and predictive analysis | |
Cathryn et al. | A review of bioinformatics tools and web servers in different microarray platforms used in cancer research | |
CN107066835A (en) | A kind of utilization common data resource discovering and method and system and the application for integrating rectum cancer associated gene and its functional analysis | |
Su et al. | Method development for cross-study microbiome data mining: challenges and opportunities | |
Chen et al. | A multi-modal data harmonisation approach for discovery of COVID-19 drug targets | |
Long et al. | From function to translation: decoding genetic susceptibility to human diseases via artificial intelligence | |
Ahmed et al. | JWES: a new pipeline for whole genome/exome sequence data processing, management, and gene‐variant discovery, annotation, prediction, and genotyping | |
Chavda et al. | Introduction to Bioinformatics, AI, and ML for Pharmaceuticals | |
Jupiter et al. | A visual data mining tool that facilitates reconstruction of transcription regulatory networks | |
Sheikh et al. | Computational resources for oncology research: a comprehensive analysis | |
Caufield et al. | Cardiovascular informatics: building a bridge to data harmony | |
Wu et al. | Multi-omic analysis tools for microbial metabolites prediction | |
Kaur et al. | Multi-Omics and Its Clinical Application | |
Yarlagadda et al. | A guide to single‐cell RNA sequencing analysis using web‐based tools for non‐bioinformatician |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14859961; Country of ref document: EP; Kind code of ref document: A2 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 14859961; Country of ref document: EP; Kind code of ref document: A2 |