WO2002077288A1

WO2002077288A1 - Methods for identifying nucleic acid molecules of interest for use in hybridization arrays

Info

Publication number: WO2002077288A1
Application number: PCT/US2002/008705
Authority: WO
Inventors: Andrew J G. Simpson; Sandro J. De Souza; Ricardo R. Brentani
Original assignee: Ludwig Institute For Cancer Research
Priority date: 2001-03-23
Filing date: 2002-03-21
Publication date: 2002-10-03

Abstract

The invention relates to methods for improving the ability to obtain molecules which are useful in oligonucleotide screening assays. By assaying a library for appropriate molecules using a set of defined parameters, one can identify a set of molecules which will facilitate oligonucleotide hybridization assays more efficiently. The invention provides for solid phase arrays of molecules which can, in turn, be used in this way.

Description

METHODS FOR IDENTIFYING NUCLEIC ACID MOLECULES OF INTEREST FOR USE IN HYBRIDIZATION ARRAYS

RELATED APPLICATION

This application claims priority of provisional application 60/300,998, filed June 26, 2001, and provisional application 60/278,485, filed March 23, 2001, both of which are incorporated by reference.

FIELD OF THE INVENTION

The present invention is related to methods for identifying regions of nucleic acid molecules, such as cDNA, to be deposited onto a solid phase, hybridization array, such as a "biochip." The invention involves the use of a series of parameters that are chosen to permit identification of those nucleic acid molecules which satisfy these parameters. As a result of the methodology, use of the solid phase hybridization arrays leads to generation of uniform, homogeneous signals, which in turn permits direct comparison of hybridization intensities from sample to sample and permits direct assessment of e.g., absolute expression levels of the gene or genes under consideration.

BACKGROUND OF THE INVENTION

The present invention relates to methods of designing solid phase hybridization arrays, such as biochips. In particular, the invention involves methods which provide for selection of appropriate nucleic acid molecules, such as mRNA or cDNA, for placement onto solid phase hybridization arrays, which in turn allows the artisan to determine absolute levels of expression of one or more genes, and also permits comparison of gene expression levels in different samples.

Until recently, processes of gene discovery and analysis of genes for their chemical and functional characteristics have been difficult, expensive, and time consuming. Advances in methods for isolating arrays of biomolecules, such as nucleic acid molecules, have improved this field tremendously. In particular, solid phase hybridization supports, such as biochips, have improved the fields of nucleic acid molecule synthesis, sequencing, mutation studies, gene expression analysis, and gene discovery, to mention only a few relevant areas.

Hereafter, reference will be made to "biochips," but it is to be understood that the methodology under consideration as the invention may be applied to any solid phase array. In general, hybridization arrays utilize a pool of labeled molecules, which are then hybridized to a field of target molecules that have been immobilized on the biochip or other solid phase. The sample normally contains labeled cDNA molecules, which correspond to mRNA found in the sample, such as mRNA that has been isolated from a tissue. The number of copies of a given cDNA molecule is proportional to the number of copies of mRNA in the sample, and hence is a direct measurement of expression of a given gene. As a very general rule, the cDNA molecules hybridize to targets on the biochip, and the intensity of hybridization reflects the quantity of cDNA available, hence the original amount of mRNA, and the degree of expression of the particular gene. Similarly, carrying out this type of assay on two samples permits direct comparison, and permits the artisan to draw conclusions as to the two samples. These types of measurements have great practical importance for identifying markers for particular cell types, physiological states and so forth. Changes in expression levels also serve as indicia of changes related to a disease state per se, and can also serve as a way to identify targets for drug attack.

These biochip related methodologies do suffer from serious drawbacks. For example, very little work has been done in the area of target design, i.e., in the development of the immobilized molecules which are present on the biochip, and are used to hybridize to the sample. Generally, these targets have been partial, or complete gene fragments produced during genome sequencing surveys.

Additional issues arise due to the nature of nucleic acid molecules per se, as well as methodologies employed in preparing samples. For example, it is well known that different mRNA molecules may contain regions which are similar or identical to each other. If these different mRNA molecules are present in a given sample, they will compete for, and cross hybridize to, the same target on the biochip, leading to false or misleading results. Further, it is well known that the chemical composition of nucleic acid molecules affects their ability to hybridize. For example, if two different target molecules differ in their length and/or their GC content, this will impact hybridization. Careful standardization is required, and these facts preclude comparison to internal standards.

Further, when a sample is treated to yield, e.g., cDNA molecules, it is well known that the transcription necessary to manufacture the molecules takes place starting at the 3' end of the mRNA molecule, and efficiency of transcription decreases as it proceeds toward the 5' end of the molecule. Thus, if two different target molecules are used, which derive ultimately from the same gene but are located at different points along the length of the gene, very different hybridization values could result even though the gene is only transcribed at a particular level.

Yet another practical problem relates to the actual amount of target nucleic acid molecules on the chip. Regardless of efforts to equalize concentrations of molecules used on biochips, inevitably there are small differences which result. This leads to false values in the hybridization assays.

Hence, there are several areas where solid phase hybridization or biochip hybridization assays can be improved. This invention relates, in particular, to methods which improve the quality of the biochip or solid phase target, by providing for a more uniform, homogeneous, target base. How this is achieved will be described in the following materials.

"Bioinformatics" is the field of science wherein computer based technologies are applied to the analysis and management of nucleic acid molecule sequence data. In the case of humans, for example, publicly available data bases include over 3 million cDNA molecules. Creative and imaginative approaches to the analysis of this information are needed. One initiative, referred to as "UniGene," has attempted to "cluster" sequence information via various criteria.

Such an approach, based upon computer software, is used in this invention. To elaborate, the analysis commences with the use of computer software to screen data bases to determine nucleotide sequences for use as target molecules on a given surface, such as a biochip. The software scans the database for particular information. Exemplary of the criteria, but by no means the only criteria, are:

(i) defined length;

(ii) defined GC content;

(iii) defined position of a sequence in a cDNA molecule relative to the poly (A) tail of a cDNA molecule;

(iv) level of expression based upon size of a UniGene cluster;

(v) absence of repetitive elements;

(vi) differences in overall sequence from any other sequences in the same organism based upon a minimum threshold value of, e.g., 75%, and;

(vii) presence of a "synthetic" sequence, i.e., a sequence defined by two or more fragments of a cDNA molecule which are not contiguous in the available molecules.
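The first three criteria lend themselves to a short illustration. The following is a minimal sketch in Python, not the PERL program referred to in this description; the `Candidate` record, the function names, and the numeric limits used in the usage note are all hypothetical and are chosen only to show how a pool might be screened so that every criterion must hold at once.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate region of a cDNA."""
    name: str
    sequence: str          # region sequence, written 5' to 3'
    dist_from_3prime: int  # nucleotides between the region and the poly(A) tail


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def satisfies_criteria(c, min_len, max_len, gc_min, gc_max, max_dist):
    """True only when the candidate meets all three criteria together."""
    return (
        min_len <= len(c.sequence) <= max_len
        and gc_min <= gc_content(c.sequence) <= gc_max
        and c.dist_from_3prime <= max_dist
    )


def screen(pool, **limits):
    """Keep the candidates of the pool that satisfy every criterion."""
    return [c for c in pool if satisfies_criteria(c, **limits)]
```

Under these assumptions, a call such as `screen(pool, min_len=250, max_len=288, gc_min=0.4, gc_max=0.6, max_dist=1000)` would return only regions of uniform length and composition near the 3' end. The GC limits here are illustrative; the description does not state particular values.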

With respect to the present invention, software was written using the language PERL and can be run in any UNIX or LINUX environment. The program makes use of a relational database (MySQL), which stores all information related to the transcriptome of any given species. The database connection is opened with the following command: "$dbh = DBI->connect('DBI:mysql:database','user');"

The program prompts the user to enter values for several parameters, as in the example below:

$min = 250;
$max = 288;
$comp = 500;
$start = 300;
$end = 1000;
$less_than = 10;
$more_than = 0;
$tissue = 'brain';

where $min and $max refer to the minimum and maximum numbers of nucleotides; $comp refers to the sub-sequence to be analyzed; $start and $end refer to the distance in nucleotides from the 3' end of the cDNA; $less_than and $more_than refer to the UniGene cluster size that is used to indicate the level of expression of the respective cDNA; and $tissue refers to the tissue where that cDNA is expressed. Given these parameters, the program accesses the referred database and collects those sequences that have a stretch of A's at the end of the cDNA or a stretch of T's at the beginning of the cDNA. In the latter case, the program generates a new sequence that is complementary to and the inverse of the original sequence. By scanning the collected sequences, the program finds candidate regions for inclusion on the biochip.
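The orientation and scanning steps just described can be sketched as follows. This is an illustrative Python sketch, not the PERL program itself; the function names, the tail length of 10 used to recognize a stretch of A's or T's, and the reading of $start and $end as bounds on a region measured back from the 3' end are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the complementary sequence read in the inverse direction."""
    return sequence.translate(_COMPLEMENT)[::-1]


def orient_by_tail(sequence: str, run: int = 10):
    """Return the sequence with its poly(A) stretch at the 3' end.

    A sequence ending in `run` A's is kept as is; one beginning with
    `run` T's is reverse complemented; anything else yields None.
    The value of `run` is an assumption, not taken from the description.
    """
    seq = sequence.upper()
    if seq.endswith("A" * run):
        return seq
    if seq.startswith("T" * run):
        return reverse_complement(seq)
    return None


def candidate_windows(sequence: str, length: int, start: int, end: int):
    """Yield (distance_from_3prime_end, window) pairs.

    Windows of `length` nucleotides are taken from the part of the
    sequence lying between `start` and `end` nucleotides upstream of
    the 3' end. The distance reported is measured from the 3' edge of
    the window to the 3' end of the sequence.
    """
    n = len(sequence)
    lo = max(0, n - end)
    hi = n - start
    for i in range(lo, hi - length + 1):
        yield n - (i + length), sequence[i:i + length]
```

With the example values given above, one might call `candidate_windows(seq, 250, 300, 1000)` on each oriented sequence and then pass the resulting windows to whatever further criteria are in use.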

In addition to the identification of suitable standardized regions of genes, the program also lists the flanking DNA sequences. These sequences are used to produce small single stranded DNA molecules of between 18 and 20 nucleotides, which serve as primers to synthesize the identified DNA stretch by the well known technique of RT-PCR. Using the program, the researcher can select a set of DNA sequences that are highly similar in their chemical nature, as well as any other characteristic or characteristics.
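As an illustration of how primers of 18 to 20 nucleotides might be taken from the flanking sequences, the following Python sketch assumes that the forward primer is read directly from the upstream flank and that the reverse primer is the reverse complement of the downstream flank, as is conventional in PCR. The function name and the coordinate convention (0-based, end exclusive) are hypothetical, and no melting temperature or specificity checks are attempted.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the complementary sequence read in the inverse direction."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flanking_primers(transcript: str, region_start: int, region_end: int,
                     primer_len: int = 20):
    """Return (forward, reverse) primers taken from the region's flanks.

    `region_start` and `region_end` are 0-based, end-exclusive
    coordinates of the selected region within `transcript`.
    """
    if not 18 <= primer_len <= 20:
        raise ValueError("primer length must be between 18 and 20")
    if region_start < primer_len or region_end + primer_len > len(transcript):
        raise ValueError("not enough flanking sequence for primers")
    upstream = transcript[region_start - primer_len:region_start]
    downstream = transcript[region_end:region_end + primer_len]
    return upstream.upper(), reverse_complement(downstream.upper())
```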

Once the relevant target molecules are identified, they can be sequenced using methods well known in the art.

To further improve the accuracy of transcript abundance, the synthesis of these molecules is accomplished in connection with incorporation of a label therein. Various methodologies are preferred. In a first approach, when synthesizing the molecules, a label, preferably fluorescent, is attached to the 5' end of the oligonucleotide template primer. As an alternative, labeled molecules, again especially fluorescent molecules, are incorporated along the length of the target molecules. This is useful in situations where the target molecules vary in length. In preferred embodiments, however, the program employed utilizes a defined sequence length for identifying relevant targets, so that intensity of incorporated signal as a determination of length becomes irrelevant. A further embodiment of the invention provides for synthesis of the target molecules followed by the in situ incorporation of label.

The deposition of target molecules on the solid phase, or biochip, is then followed by routine measurement of the signal, which provides an accurate measurement of the quantity of the target.

Following the quantification of target, and quenching or extinguishing of the signals associated therewith or, in the alternative, using other signals having non-overlapping wavelengths, samples are prepared and contacted to the target using standard methods. The sample molecules can, e.g., have signal incorporated therein in the same way that the target molecules were prepared. Following hybridization, the signals are measured, and quantified using standard methodologies, taking the data regarding targets into account.
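The description does not state how the target data are taken into account. One plausible reading, offered here only as an assumption, is that the hybridization signal at each spot is divided by the signal measured for the deposited target at that spot, so that differences in the amount of target spotted are cancelled out. The Python sketch below illustrates that reading; the function name and the dictionary layout are hypothetical.

```python
def normalize_signals(sample_signal: dict, target_signal: dict) -> dict:
    """Return spot -> sample signal divided by target signal.

    Assumption: a simple ratio per spot. Spots with a missing or
    non-positive target signal are skipped, since no ratio can be formed.
    """
    normalized = {}
    for spot, sample in sample_signal.items():
        target = target_signal.get(spot)
        if target is None or target <= 0:
            continue
        normalized[spot] = sample / target
    return normalized
```

Under this reading, two spots that received different quantities of the same target would give the same normalized value for the same sample.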

A final consideration pertains to the approximate abundance of the nucleic acid molecule species corresponding to the targets on the biochip. Some of these are present at levels hundreds of times higher than others in living cells. Although the exact level of expression may vary between different physiological states, rarely, if ever, will such alterations attain such extremes. Thus, one can classify some molecules as generally being highly abundant while the majority are rather rare. The hybridization process is theoretically defined by a curve where the extent of binding eventually reaches a plateau. The curve is defined by the concentration of the binding component. Measurements that provide information concerning the concentration of the binding component can only be obtained in the ascending region of the curve. Thus, it is not possible to simultaneously measure very disparate concentrations of binding agents. To ensure informative data in accordance with this, or any other method involving determinations based upon hybridization to a solid phase array, it is also important to ensure as far as possible that the range of likely concentrations of the nucleic acid molecules corresponding to the targets is reasonably narrow. To this end, one compares molecules present at hundreds of copies to other molecules present at such levels, rather than rarely occurring molecules, and vice versa.

The invention described here overcomes all of the problems inherent in the biochips currently available by the very careful computer based design of the exact DNA sequence of the target spots to be placed in target arrays. The description of the invention pertains to biochips of expressed human genes but could also apply to any living organism.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 presents a solid phase display developed using a library of probes developed using the invention as described supra, when used to analyze a cDNA library from a colon cancer line.

Figure 2 shows results from experiments designed to determine if the length of the probe on the solid phase array was relevant to the assay.

Figure 3 presents the results of work designed to determine if the position of the probe, relative to the 3' end, was relevant.

Figure 4 sets forth the results of experiments designed to determine if cross hybridization within family members can be avoided when using the inventive method.

Figure 5 shows the results of experiments which are essentially the converse of those depicted in figure 4.

Figure 6 confirms the prior results, using a different gene family.

Figure 7 shows that by using the method, one can identify genes with higher levels of expression, as distinguished from those with lower ones.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

EXAMPLE 1

The protocols described supra, were used in an assay to determine expression frequency of genes in a colon cancer sample.

Using the computer algorithm described supra, a set of "representative gene fragments" or "RGSs" was produced. The members of the group were chosen to be about 250 base pairs in length. The RGSs were amplified via RT-PCR, using oligonucleotide primers based upon their flanking sequences as described supra. The column "spot" is correlated to figure 1, as discussed infra.

Once the RGSs were prepared, they were spotted eight times, in equal quantities around the perimeter of a square, with a middle spot being a non-human DNA control (i.e., a 450 base pair segment of PCR product of Xanthomonas citri PTHA gene). Spotting was done on a positively charged, nylon filter.

A library of ³²P labeled cDNA was prepared from a human colon cancer line, and hybridized to the array. Prior to hybridization, pre-hybridization was undertaken, by incubating the membrane in 5 ml of pre-hybridization solution, for 30 minutes at 65 °C, with constant rotation.

Double stranded probes were denatured by heating in a hot water bath (100°C), for 10 minutes, followed by transfer to ice. Probes were then added to the hybridization solution, and incubated, overnight, at 28 °C. Following hybridization, the membrane was washed, twice, for 5-10 minutes each time, with 2 x SSC, 0.1% SDS, at room temperature. After the second wash, the wash solution was removed, and the membrane was rinsed briefly in 2 x SSC. The blots were exposed at -80 °C using x-ray film and intensifying screens, as well as a photoimager.

The results are presented in figure 1, and show that each of the fragments exhibited a level of hybridization that was distinct from the others, and reflects the level of expression of the gene that it represented in the source cell line.

EXAMPLE 2

The explanation of the principles underlying the invention, set forth supra, discusses how hybridization intensity is length dependent, and is also dependent on the position within the transcript from which an RGS is derived.

To test these principles, 13 pairs of RGSs were analyzed, where one member of the pair was 250 base pairs long, and the other 500 base pairs, although each was derived from the same transcript.

The RGSs were arranged the same way as the RGSs of figure 1, and tested under exactly the same conditions.

The results are depicted in figure 2. In 12 of the 13 cases, there was stronger signal using the longer RGS. In the one case where this did not happen, it is believed that the gene was simply not expressed, and signal was only background. This is in agreement with other experiments, where other fragments with low levels of hybridization with the 250 base pair RGS showed much higher levels with the 500 base pair molecule.

EXAMPLE 3

In these experiments, five RGSs, of 250 base pairs, were prepared from the same, very large transcript, i.e., human ankyrin G, which can be found at GenBank Accession Number U13616. The positions from the transcript represented by the RGSs are shown in figure 3. Hybridization was carried out exactly in the manner discussed in example 1.

Figure 3 shows that the RGS that is closest to the 3' end of the ankyrin transcript showed a much higher level of hybridization than the other molecules.

EXAMPLE 4

One of the key advantages of the invention is that it avoids cross hybridization between members of the same family that would otherwise render it impossible to measure individual family member transcription levels. This principle was tested in these experiments.

Three fragments, of 288, 2869 and 1918 base pairs, derived from human solute carrier 12 family member 4 ("SLC12A4, NM_005072"), three fragments of 252, 1893, and 3083 base pairs, from human solute carrier 12 family member 6 ("SLC12A6, NM_005135"), and a 242 base pair fragment from human solute carrier 12 family member 7 ("SLC12A7, NM_006598"), were placed on a membrane. The 2689 base pair fragment referred to supra was amplified via PCR, cloned into pGEM, and used to produce a ³²P-cDNA probe, via in vitro transcriptions. Radiolabelled cDNA was used as a probe in hybridization assays.

While the genes are closely related, the 250 base pair fragments were chosen to have less than 75% nucleotide similarity. Each of the smaller fragments was contained within the longest one.
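The description does not say how nucleotide similarity was computed. Purely as an illustration of a 75% threshold, the Python sketch below uses the simplest possible measure, percent identity over two ungapped sequences of equal length. This is an assumption; in practice an alignment tool would be used, and the function names here are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def is_distinct(a: str, b: str, threshold: float = 75.0) -> bool:
    """True when similarity falls below the threshold (75% by default)."""
    return percent_identity(a, b) < threshold
```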

Hybridization was carried out, as described in example 1. The data, as set forth in figure 4, show that the ³²P-cDNA probe hybridized to itself and to the long SLC12A6 probe. There was detectable hybridization to the long SLC12A4 RGS, but not to the other family members. Normally, much more extensive cross-hybridization would be expected.

EXAMPLE 5

This example represents the converse of example 4. In these experiments, the ³²P-cDNA probe was a 3083 base pair fragment of SLC12A6. While there was, in fact, significant cross-hybridization between large fragments, the RGSs were completely specific within the limits of detection.

EXAMPLE 6

This example describes experiments similar to those presented in examples 4 and 5, using a different gene family. Fragments 246 and 979 base pairs long, derived from human cadherin 12 (CDH12, NM_00461), and fragments of 272 and 2167 base pairs derived from human cadherin 18 (CDH18, NM_004934) were used. The 979 base pair fragment was used for the ³²P-cDNA probe.

Hybridization was carried out as in example 1. The data, as presented in figure 6, show that there was hybridization to the complement and to the long CDH12 RGS, but not the others. Hence, the method can be used to determine individual members of families generally.

EXAMPLE 7

The results of the prior examples suggest that the RGSs have the capacity to faithfully reflect gene expression levels, in ways that would not occur were attention not paid to the transcript representations on the arrays.

To test this, measured levels of hybridization were compared to approximate levels of expression of the genes represented by the RGSs. Genes were divided into 3 groups, representing low, intermediate, and high levels of expression. These levels were estimated by counting EST sequences for the genes that were available in public data bases. See Camargo et al., PNAS 98:12103-12108 (2001). Genes with less than 20 ESTs are considered as having low levels of expression, while those with between 21 and 100 ESTs are deemed to have intermediate expression, and those above 100, high levels.
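The grouping by EST count can be written out as a short Python sketch. The function name is hypothetical. The text leaves a count of exactly 20 unassigned; the sketch treats it as low, which is an assumption.

```python
def expression_class(est_count: int) -> str:
    """Classify a gene as low, intermediate, or high by its EST count."""
    if est_count < 0:
        raise ValueError("EST count cannot be negative")
    if est_count <= 20:   # assumption: exactly 20 is grouped with "low"
        return "low"
    if est_count <= 100:  # 21 to 100 ESTs
        return "intermediate"
    return "high"         # above 100 ESTs
```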

The protocols developed in the prior examples were employed. Figure 7 plots hybridization intensities of the membrane of figure 1 against expression levels of the genes from which the transcripts were derived. The RGSs are the same length, i.e., 250 base pairs, are prepared from the same 3' region of the transcript, and have similar GC content. The data presented in figure 7 demonstrate, clearly, that there is a correlation between hybridization intensity and expression level that would not be evident if significant cross reaction occurred between clones, or if other factors, including length or GC content, were relevant.

The foregoing examples and discussion set forth the various aspects of the invention. One feature of the invention is a method for identifying appropriate nucleic acid probes. In essence, the method involves analyzing a field of information, such as a nucleic acid molecule library, to identify those members of the field which satisfy a particular set of parameters. The number of parameters chosen can vary. For example, one might choose as few as, e.g., 3, and as many as, e.g., 10 or more. In the embodiments described herein, it is preferred to use four to eight parameters. The particular parameters chosen can vary, depending upon the eventual use to which the defined molecules are going to be put. For example, the length of the molecule, its GC content, its position within a longer molecule relative to the longer molecule's 5' or 3' tail, or a poly (A) sequence, the relative frequency of expression of the molecule as defined in a library, the presence or absence of repetitive elements, disparity from known sequences, and/or presence of sequences defined by sequences which are not contiguous naturally, are all parameters which may be used. Any or all of these parameters can be used, possibly with others.

The source of the materials being analyzed can also vary. There are extensive libraries of human sequences available, which can be analyzed, as are libraries of various plant, bacterial, or other animal molecules. Any or all of these can be assayed to determine the relevant molecules of interest.

Once the molecules are identified, they can be used in various ways. They can be used in liquid phase diagnostic assays, as probes; however, as was shown, supra, their preferred use is in assays where they are immobilized to a solid phase, such as a bead, a biochip, etc. where they can be used to assay samples for the presence of target molecules, or the amount of these molecules. "Amount" as used herein, refers to both the absolute amount of the target nucleic acid molecule as well as the "relative" amount as compared to other molecules which have been expressed in the sample being tested. As the examples show, one can determine the intensity of expression of molecules very easily.

Examples of the solid phase type systems useful in the invention include biochips such as "GEarray", described at www.superarray.com, and "MY array DNA", described at www.resgen.com.

All of this information is hereby incorporated by reference.

Other aspects of the inventions will be clear to the skilled artisan and need not be elaborated further.

Claims

WE CLAIM

1. A method for identifying a nucleic acid molecule of interest, comprising screening a pool of nucleic acid molecules of known nucleotide sequences for nucleic acid molecules which satisfy all of a series of predefined criteria.

2. The method of claim 1 comprising screening said pool with a computer program.

3. The method of claim 1, wherein said predefined criteria comprise at least one of:

(i) defined length;

(ii) defined GC content;

(iii) defined position of a nucleotide sequence in a cDNA molecule relative to a poly (A) tail in said cDNA molecule;

(iv) level of expression of said nucleic acid molecule in a Unigene cluster;

(v) absence of repetitive elements;

(vi) overall disparity from known sequences greater than a predefined value, and

(vii) presence of a nucleotide sequence defined by two or more non contiguous nucleotide sequences.

4. The method of claim 1, wherein said pool is a pool of human nucleic acid molecules.

5. The method of claim 1, wherein said pool is a pool of plant nucleic acid molecules.

6. The method of claim 1, wherein said pool is a pool of nucleic acid molecules from a subject with a pathological condition.

7. The method of claim 6, wherein said pathological condition is cancer.

8. The method of claim 1, wherein said pool is a pool of mammalian nucleic acid molecules.

9. A method for making a nucleic acid molecule containing solid array, comprising labeling the nucleic acid molecules identified in accordance with claim 1, and applying these to a solid array.

10. The method of claim 9, wherein said solid array is a biochip.

11. The method of claim 9, comprising labeling said nucleic acid molecules by attaching a label to the 5' end of said nucleic acid molecules.

12. The method of claim 9, comprising incorporating labels along the lengths of said nucleic acid molecules.

13. The method of claim 9, wherein said labeling comprises incorporating labels in situ following deposition on said solid array.

14. The method of claim 9, comprising labeling said nucleic acid molecules with a fluorescent label.

15. A nucleic acid molecule containing solid array produced via the method of claim 9.