WO2025061922A1

WO2025061922A1 - Methods for sequencing

Info

Publication number: WO2025061922A1
Application number: PCT/EP2024/076400
Authority: WO
Inventors: Aathavan KARUNAKARAN; David Olmstead BRACHER; Gery VESSERE
Original assignee: Illumina Inc
Current assignee: Illumina Inc
Priority date: 2023-09-20
Filing date: 2024-09-20
Publication date: 2025-03-27
Anticipated expiration: 2026-03-20

Abstract

A method of base calling nucleobases of first and second polynucleotide sequence portions. The method comprises accessing intensity data for a current sequencing cycle of a sequencing run, wherein the intensity data is a combined intensity of a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion; base calling the first nucleobase based on the intensity data; accessing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein said adjustments are dependent on at least one preceding and/or succeeding nucleobase in said polynucleotide sequence portion; base calling the second nucleobase based on the intensity data, the base call of the first nucleobase and the plurality of mappings.

Description

METHODS FOR SEQUENCING

BACKGROUND

Field

The disclosed technology relates to the field of nucleic acid sequencing. More particularly, the disclosed technology relates to a scheme for improving the accuracy in determining the nucleotide sequences of two or more polynucleotide sequence portions in a single sequencing run.

Description of the Related Art

In some types of next-generation sequencing (NGS) technologies, a nucleic acid cluster is created on a flow cell by amplifying an original template nucleic acid strand. Sequencing cycles may be performed as complementary strands of the template nucleic acids are being synthesized, i.e., using sequencing-by-synthesis (SBS) processes.

In each sequencing cycle, deoxyribonucleic acid analogs conjugated to fluorescent labels are hybridized to the template nucleic acids, and excitation light sources are used to excite the fluorescent labels on the deoxyribonucleic acid analogs. Detectors capture fluorescent emissions from the fluorescent labels and identify the deoxyribonucleic acid analogs. As a result, the sequence of the template nucleic acids may be determined by repeatedly performing such sequencing cycles.

NGS allows for the sequencing of a number of different template nucleic acids simultaneously, significantly reducing the cost of sequencing in the last twenty years. However, there remains a need for improvements in sequencing technologies.

SUMMARY

The present application provides an improved method of base calling combined sequencing data based on the simultaneous sequencing of first and second polynucleotide sequence portions. In some simultaneous sequencing runs, a first signal based upon a first nucleobase of a first polynucleotide sequence portion can be higher than a second signal based upon a second nucleobase of a second polynucleotide sequence portion. We refer to the second polynucleotide sequence portion as the portion corresponding to the lower signal intensity, i.e. the “weaker read”. The present application improves the accuracy in base calling the weaker read in particular.

Previous approaches to base calling simultaneous sequencing data can involve a comparison of the per-sequencing-cycle intensity values with one or more classifications which map the identities of respective nucleobases to intensity values. However, as will be described in the present application, the per-sequencing-cycle intensity values can be affected by the base context of a nucleobase to be base called. The base context of a nucleobase is the identity of at least one preceding and/or succeeding nucleobase in the corresponding polynucleotide sequence portion.

Certain embodiments described herein relate to training and use of a model comprising mappings which define adjustments to signal intensity according to base context of each of the first and second polynucleotide sequence portions. Other embodiments described herein describe a method of base calling a nucleobase from the second polynucleotide sequence portion (i.e. the weaker read) using the model. A consideration of the adjustments to signal intensity based on base context allows for the relationship between intensity values and sequences of the first and second polynucleotide sequence portions to be more precisely defined. As a result, when the second nucleobase is base called using the model, an improvement in the signal-to-noise ratio can be achieved when compared to previous approaches.

According to a first aspect there is provided a method of base calling nucleobases of first and second polynucleotide sequence portions, the method comprising:

(a) accessing intensity data for a current sequencing cycle of a sequencing run, wherein said intensity data is a combined intensity of a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion;

(b) base calling the first nucleobase based on the intensity data;

(c) accessing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein said adjustments are dependent on at least one preceding and/or succeeding nucleobase in said polynucleotide sequence portion;

(d) base calling the second nucleobase based on the intensity data, the base call of the first nucleobase and the plurality of mappings.
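By way of non-limiting illustration only, steps (a) to (d) may be sketched as follows. The sketch is not the disclosed implementation: the two-channel dye scheme `NOMINAL`, the mixing fraction `alpha` (the assumed share of the combined signal contributed by the first polynucleotide sequence portion), the per-channel threshold of 0.5, and the representation of each mapping as a multiplicative adjustment keyed by a two-base context (`gains`) are all assumptions introduced for the example.

```python
# Hypothetical two-channel dye scheme: base -> nominal (ch1, ch2) signal.
NOMINAL = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}


def call_first_base(intensity):
    """Step (b): call the dominant base by thresholding each channel.

    Assumes the dominant read contributes more than half of a full-scale
    signal, so a channel reading of 0.5 or more means "dominant base is on".
    """
    on = tuple(1.0 if value >= 0.5 else 0.0 for value in intensity)
    return next(base for base, signal in NOMINAL.items() if signal == on)


def call_second_base(intensity, first_base, prev_second, gains, alpha):
    """Steps (c)-(d): remove the dominant contribution, then choose the
    weaker base whose context-adjusted signal best explains the remainder.

    `gains` maps a two-base context (preceding base + candidate base) to an
    assumed multiplicative intensity adjustment; missing contexts default
    to 1.0 (no adjustment).
    """
    residual = [
        (value - alpha * signal) / (1.0 - alpha)
        for value, signal in zip(intensity, NOMINAL[first_base])
    ]

    def cost(base):
        gain = gains.get(prev_second + base, 1.0)
        return sum(
            (r - gain * signal) ** 2 for r, signal in zip(residual, NOMINAL[base])
        )

    return min(NOMINAL, key=cost)


# Usage: dominant base C, weaker base A preceded by G, with an assumed
# quenching adjustment of 0.4 for the context "GA".
alpha = 2 / 3
gains = {"GA": 0.4}
intensity = (alpha * 1.0 + (1 - alpha) * 0.4, (1 - alpha) * 0.4)
first = call_first_base(intensity)
second = call_second_base(intensity, first, "G", gains, alpha)
```

In this toy configuration, calling the second nucleobase without any context adjustment (an empty `gains` dictionary) selects G rather than A, which illustrates why the mappings of step (c) can matter for the weaker read.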

In some embodiments, the second nucleobase is base called by comparing the intensity data with the base call of the first nucleobase and the plurality of mappings.

In some embodiments, the second nucleobase is base called based on an adjusted intensity data, wherein the intensity data is adjusted based on the base call of the first nucleobase.

In some embodiments, the method further comprises: accessing base context data for the current sequencing cycle of the first polynucleotide sequence portion comprising base calls of nucleobases for at least one of a preceding sequencing cycle and/or a succeeding sequencing cycle; and selecting a mapping for the first polynucleotide sequence portion based on the base call of the first nucleobase and the base context data for the first polynucleotide sequence portion; wherein the intensity data is further adjusted based on the selected mapping for the first polynucleotide sequence portion.

In some embodiments, said polynucleotide sequence portions have been selectively processed such that an intensity of the signals obtained based upon the respective first nucleobase is greater than an intensity of the signals obtained based upon the respective second nucleobase, wherein the intensity data is further adjusted based on the relative intensity of the signals obtained based upon the respective first nucleobase and the signals obtained based upon the respective second nucleobase.

In some embodiments, the relative intensity is determined based on the intensity data.

In some embodiments, the intensity data is further adjusted to correct for cluster dependent effects.

In some embodiments, the cluster dependent effects are determined based on the intensity data.

In some embodiments, the cluster dependent effects comprise at least one of phasing/prephasing, background, decay, scale, camera gain and laser ramp.

In some embodiments, the method further comprises: accessing base context data for the current sequencing cycle of the second polynucleotide sequence portion comprising base calls of nucleobases for at least one of a preceding sequencing cycle and/or a succeeding sequencing cycle; and selecting a subset of the plurality of the mappings based on the base context data for the second polynucleotide sequence portion; wherein said subset of the plurality of mappings is used to base call the second nucleobase.

In some embodiments, each of the plurality of mappings receives a k-mer sequence as input and generates an adjusted signal intensity as output.

In some embodiments, the k-mer sequence corresponds to a current nucleobase and at least one preceding and/or succeeding nucleobase of a polynucleotide sequence portion, and wherein the adjusted signal intensity corresponds to the signal obtained based upon the current nucleobase.
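For illustration only, a mapping of this kind may be sketched as a lookup keyed by k-mer. The function names, the use of preceding bases only, the multiplicative form of the adjustment, and the default of 1.0 for k-mers absent from the table are assumptions made for this example and are not taken from the disclosure.

```python
def context_kmers(sequence, n_preceding=2):
    """Return, for each cycle that has full context, the k-mer formed by
    the preceding bases followed by the current base (k = n_preceding + 1)."""
    return [
        sequence[i - n_preceding : i + 1]
        for i in range(n_preceding, len(sequence))
    ]


def adjusted_intensity(kmer, table, nominal=1.0):
    """Map a k-mer to an adjusted intensity for its final (current) base.

    `table` holds hypothetical multiplicative adjustments; k-mers that are
    not listed are left unadjusted.
    """
    return nominal * table.get(kmer, 1.0)


# Usage: the current base C in context AG receives an assumed adjustment.
table = {"AGC": 0.8}
kmers = context_kmers("AGCA")  # ["AGC", "GCA"]
signals = [adjusted_intensity(kmer, table) for kmer in kmers]
```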

In some embodiments, each of the plurality of mappings is modified by a respective phasing coefficient.

According to a second aspect there is provided a data processing device comprising means for carrying out a method for base calling nucleobases of first and second polynucleotide sequence portions.

In some embodiments, the data processing device is a polynucleotide sequencer.

According to a third aspect there is provided a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for base calling nucleobases of first and second polynucleotide sequence portions.

According to a fourth aspect there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out a method for base calling nucleobases of first and second polynucleotide sequence portions.

According to a fifth aspect there is provided a computer-readable data carrier having stored thereon a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for base calling nucleobases of first and second polynucleotide sequence portions.

According to a sixth aspect there is provided a data carrier signal carrying a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for base calling nucleobases of first and second polynucleotide sequence portions.

According to a seventh aspect there is provided a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion, the method comprising:

(a) accessing sequence information of first and second polynucleotide sequence portions;

(b) accessing intensity data for a sequencing run performed on at least one first polynucleotide sequence portion and at least one second polynucleotide sequence portion, the intensity data comprising per-sequencing cycle intensity values each combining a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion;

(c) initializing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein the adjustments are dependent on at least one preceding and/or succeeding nucleobase in said at least one polynucleotide sequence portion;

(d) iteratively updating said plurality of mappings based upon the intensity data and the sequence information of the first and second polynucleotide sequence portions.

In some embodiments, the plurality of mappings are updated based upon the intensity data and a predicted intensity data, wherein the predicted intensity data is based on the sequence information of the first and second polynucleotide sequence portions and the plurality of mappings.

In some embodiments the plurality of mappings are updated based upon a comparison of the intensity data with the predicted intensity data.

In some embodiments, each of the per-sequencing cycle intensity values of the intensity data are respectively compared with per-sequencing cycle intensity values of the predicted intensity data.
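As a minimal sketch of such an iterative update, and not as the disclosed training procedure, the mappings may be fitted by gradient descent on the mean squared difference between observed and predicted per-sequencing-cycle intensity values. The single-channel signal model, the fixed and known mixing fraction `alpha`, the multiplicative per-k-mer adjustments, and the choice of plain gradient descent with the given step size are all assumptions made for the example.

```python
def predict(kmers1, kmers2, gains, alpha):
    """Predicted combined intensity per cycle from both reads' contexts."""
    return [
        alpha * gains[a] + (1.0 - alpha) * gains[b]
        for a, b in zip(kmers1, kmers2)
    ]


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def fit_gains(kmers1, kmers2, observed, alpha, steps=500, lr=0.5):
    """Initialise every adjustment to 1.0, then repeatedly compare the
    predicted and observed intensities and move each adjustment against
    the gradient of the mean squared error."""
    gains = {kmer: 1.0 for kmer in set(kmers1) | set(kmers2)}
    n_cycles = len(observed)
    for _ in range(steps):
        predicted = predict(kmers1, kmers2, gains, alpha)
        grad = dict.fromkeys(gains, 0.0)
        for a, b, o, p in zip(kmers1, kmers2, observed, predicted):
            err = 2.0 * (p - o) / n_cycles
            grad[a] += err * alpha
            grad[b] += err * (1.0 - alpha)
        for kmer in gains:
            gains[kmer] -= lr * grad[kmer]
    return gains


# Usage with synthetic data generated from assumed "true" adjustments.
true_gains = {"GA": 0.4, "AA": 0.9, "CA": 1.0}
read1 = ["GA", "AA", "CA", "GA", "AA", "CA"]
read2 = ["AA", "CA", "GA", "CA", "GA", "AA"]
alpha = 2 / 3
observed = predict(read1, read2, true_gains, alpha)
fitted = fit_gains(read1, read2, observed, alpha)
```

The sketch holds `alpha` fixed for brevity; a relative intensity value or cluster dependent parameters could be added to the same loop as further quantities to be updated.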

In some embodiments, said polynucleotide sequence portions have been selectively processed such that an intensity of the signals obtained based upon the respective first nucleobase is greater than an intensity of the signals obtained based upon the respective second nucleobase, wherein the predicted intensity data is further based upon the relative intensity of the signals obtained based upon the respective first nucleobase and the signals obtained based upon the respective second nucleobase.

In some embodiments, a value for the relative intensity is initialized, wherein the value for the relative intensity is iteratively updated with the plurality of mappings.

In some embodiments, values for one or more parameters representing cluster dependent effects are initialized, wherein the values for the one or more parameters are iteratively updated with the plurality of mappings, and wherein the predicted intensity data is further based upon the one or more parameters.

In some embodiments, the one or more parameters representing cluster dependent effects comprise at least one of phasing/prephasing, background, decay, scale, camera gain and laser ramp.

In some embodiments, each of the plurality of mappings receives a k-mer sequence as input and generates an adjusted signal intensity as output.

In some embodiments, a respective phasing coefficient to modify each of the plurality of mappings is initialized, wherein each of the phasing coefficients is iteratively updated with the plurality of mappings.

According to an eighth aspect there is provided a system comprising a memory storing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein the adjustments are dependent on at least one preceding and/or succeeding nucleobase in said at least one polynucleotide sequence portion; wherein the plurality of mappings are learned according to a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

According to a ninth aspect there is provided a data processing device comprising means for carrying out a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

In some embodiments, the data processing device is a polynucleotide sequencer.

In some embodiments, the data processing device further comprises a memory for storing the plurality of mappings.

According to a tenth aspect there is provided a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

According to an eleventh aspect there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

According to a twelfth aspect there is provided a computer-readable data carrier having stored thereon a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

According to a thirteenth aspect there is provided a data carrier signal carrying a computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion.

In embodiments, generating the sequencing data comprising the first and second sequence reads may comprise:

(a) obtaining first intensity data comprising a combined intensity of a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion;

(b) obtaining second intensity data comprising a combined intensity of a third signal obtained based upon the respective first nucleobase of the at least one first polynucleotide sequence portion and a fourth signal obtained based upon the respective second nucleobase of the at least one second polynucleotide sequence portion;

(c) selecting one of a plurality of classifications based on the first and the second intensity data, wherein each classification represents a possible combination of respective first and second nucleobases; and

(d) based on the selected classification, base calling the respective first and second nucleobases, wherein said polynucleotide sequence portions have been selectively processed such that an intensity of the signals obtained based upon the respective first nucleobase is greater than an intensity of the signals obtained based upon the respective second nucleobase.

In embodiments, the first and second signals and/or the third and fourth signals may be obtained substantially simultaneously.

In embodiments, selecting the classification based on the first and second intensity data may comprise selecting the classification based on the combined intensity of the first and second signals and the combined intensity of the third and fourth signals.

In embodiments, the plurality of classifications may comprise sixteen classifications, each classification representing one of sixteen unique combinations of first and second nucleobases.
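The sixteen combinations can be enumerated as in the following sketch, which also computes an idealised two-channel intensity for each combination. The dye scheme `NOMINAL` and the mixing fraction `alpha` are hypothetical values chosen for the example, and the computed points are noise-free centroids rather than measured distributions.

```python
from itertools import product

# Hypothetical two-channel dye scheme: base -> nominal (ch1, ch2) signal.
NOMINAL = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}


def classification_centroids(alpha):
    """Return an idealised (ch1, ch2) intensity for every ordered pair of
    first and second nucleobases, weighting the first read by `alpha`."""
    centroids = {}
    for first, second in product("ACGT", repeat=2):
        centroids[first + second] = tuple(
            round(alpha * s1 + (1.0 - alpha) * s2, 6)
            for s1, s2 in zip(NOMINAL[first], NOMINAL[second])
        )
    return centroids


# Usage: a 2:1 intensity ratio between the first and second reads.
centroids = classification_centroids(2 / 3)
distinct_points = len(set(centroids.values()))
```

Under these assumptions, unequal contributions (for example `alpha = 2/3`) give sixteen distinct points, whereas equal contributions (`alpha = 0.5`) collapse the sixteen combinations onto nine distinct points, so that some combinations can no longer be told apart from intensity alone.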

In embodiments, the polynucleotide sequence portions may have been selectively processed such that, during sequencing, a greater number of the first polynucleotide sequence portions are capable of generating a signal than a number of the second polynucleotide sequence portions that are capable of generating a signal.

In embodiments, a ratio between the number of the first polynucleotide sequence portions capable of generating a signal and the number of the second polynucleotide sequence portions capable of generating a signal may be between 1.25:1 and 5:1, between 1.5:1 and 3:1, or about 2:1.

In embodiments, the first signal, second signal, third signal and fourth signal may be generated based on light emissions associated with the respective nucleobase.

In embodiments, the obtained signals are generated by: contacting a plurality of polynucleotide molecules comprising the first and second polynucleotide sequence portions with first primers for sequencing the first polynucleotide sequence portions and second primers for sequencing the second polynucleotide sequence portions; extending the first primers and the second primers by contacting the polynucleotide molecules with labeled nucleobases to form first labeled primers and second labeled primers; stimulating the light emissions from the first and second labeled primers; and detecting the light emissions at a sensor.

In embodiments: the first and second signals may be based on light emissions detected in a first range of optical frequencies; the third and fourth signals may be based on light emissions detected in a second range of optical frequencies; and the first range of optical frequencies and the second range of optical frequencies may be not identical.

For example, the first range of optical frequencies may correspond to the color red, e.g., 400-484 THz (or equivalently, 620-750 nm in terms of wavelength), and the second range of optical frequencies may correspond to the color green, e.g., 526-606 THz (or equivalently, 495-570 nm in terms of wavelength).
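The equivalence between the frequency and wavelength ranges quoted above follows from the relation wavelength = c / frequency, and can be checked with the short sketch below. The function names are illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def nm_to_thz(wavelength_nm):
    """Convert a vacuum wavelength in nanometres to a frequency in THz."""
    return SPEED_OF_LIGHT_M_PER_S / (wavelength_nm * 1e-9) / 1e12


def thz_to_nm(frequency_thz):
    """Convert a frequency in THz to a vacuum wavelength in nanometres."""
    return SPEED_OF_LIGHT_M_PER_S / (frequency_thz * 1e12) * 1e9


# Usage: band edges of the example red and green ranges, rounded to 1 THz.
red_thz = (round(nm_to_thz(750)), round(nm_to_thz(620)))
green_thz = (round(nm_to_thz(570)), round(nm_to_thz(495)))
```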

In embodiments, the plurality of polynucleotide molecules may be attached to a substrate, and the light emissions from the first labeled primers and the light emissions from the second labeled primers may be emitted from the same region or substantially overlapping regions of the substrate.

In embodiments, the light emissions detected at the sensor may be spatially unresolved.

In embodiments, the sensor may be configured to provide a single output based upon the first and second signals.

In embodiments, the sensor may comprise a single sensing element.

In embodiments, each polynucleotide molecule may comprise one or more copies of the first polynucleotide sequence portion and one or more copies of the second polynucleotide sequence portion.

In embodiments, the first and second polynucleotide sequence portions may be respective portions of different polynucleotide molecules.

In embodiments, the polynucleotide sequence portions may have been selectively processed by contacting the plurality of polynucleotide molecules with unblocked first primers and a predetermined fraction of second primers which have a blocked 3’ end.

In embodiments, selectively processing may comprise preparing for selective sequencing or conducting selective sequencing. For example, selective sequencing may be achieved using a mixture of unblocked and blocked sequencing primers.

In embodiments, the polynucleotide sequence portions may have been selectively processed to provide a greater total number of the first polynucleotide sequence portions than a total number of the second polynucleotide sequence portions.

In embodiments, selectively processing may comprise conducting selective amplification.

In embodiments, the at least one first polynucleotide sequence portion and the at least one second polynucleotide sequence portion may be present in a cluster.

In embodiments, the one of the plurality of classifications may be selected based on the first and the second intensity data using a Gaussian mixture model.

The systems, devices, kits, and methods disclosed herein each have several aspects, no single one of which is solely responsible for their desirable attributes. Numerous other embodiments are also contemplated, including embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits, and advantages. The components, aspects, and steps may also be arranged and ordered differently. After considering this discussion, and particularly after reading the section entitled “Detailed Description”, one will understand how the features of the devices and methods disclosed herein provide advantages over other known devices and methods.

It is to be understood that any features of the systems disclosed herein may be combined together in any desirable manner and/or configuration. Further, it is to be understood that any features of the methods disclosed herein may be combined together in any desirable manner. Moreover, it is to be understood that any combination of features of the methods and/or the systems may be used together, and/or may be combined with any of the examples disclosed herein. It should be appreciated that all combinations of the foregoing concepts and additional concepts discussed in greater detail below are contemplated as being part of the inventive subject matter disclosed herein and may be used to achieve the benefits and advantages described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

Features of examples of the present disclosure will become apparent by reference to the following detailed description and drawings, in which like reference numerals correspond to similar, though perhaps not identical, components. For the sake of brevity, reference numerals or features having a previously described function may or may not be described in connection with other drawings in which they appear.

Figure 1 shows a block diagram which schematically illustrates an improved base calling method.

Figure 2A shows a block diagram which schematically illustrates an example sequencing system that may be used to perform the disclosed methods.

Figure 2B shows a block diagram which schematically illustrates an example imaging system that may be used in conjunction with the example sequencing system of Figure 2A.

Figure 3 shows an example flow cell that may be used in conjunction with the example sequencing system of Figures 2A and 2B comprising an enlarged perspective of one of the tiles and the clusters within a tile.

Figure 4 shows a functional block diagram of an example computer system that may be used in the example sequencing system of Figures 2A and 2B.

Figure 5A and Figure 5B schematically illustrate nucleic acid clusters comprising two or more polynucleotide sequence portions for sequencing by the present methods.

Figure 6 is a chart showing example four-channel, two-channel and one-channel dye labelling schemes that may be used in conjunction with the present methods.

Figure 7A is a chart showing an example two-channel dye labeling scheme that may be used in conjunction with the present methods.

Figure 7B is a plot showing graphical representations of sixteen distributions of signals from a nucleic acid cluster based on first and second polynucleotide sequence portions.

Figure 8A is a plot showing graphical representations of sixteen distributions of signals from a nucleic acid cluster based on first and second polynucleotide sequence portions.

Figure 8B is a plot showing graphical representations of nine distributions of signals from a nucleic acid cluster based on first and second polynucleotide sequence portions.

Figure 9A illustrates the fitting of a Gaussian mixture model based on four and sixteen sources respectively to base call intensity data.

Figure 9B illustrates a simple base calling method based on dividing an intensity plot into quadrants.

Figure 10 is a flow diagram showing an example method of base calling.

Figure 11 is a flow diagram showing an example method of generating signals for use in the method of base calling shown in Fig. 10.

Figure 12A shows that by plotting relative intensities of light signals obtained from a first channel (ch1) and a second channel (ch2), a constellation of 16 clouds is obtained.

Figure 12B shows the alignment of R1 and R2 (minor and major reads respectively) with a known human and PhiX sequence.

Figure 13A illustrates base context effects for the base call A from an intensity signal based on a single polynucleotide sequence portion with respect to the context of AG, CG, or AA preceding bases.

Figure 13B illustrates base context effects for each of base calls A, G, C and T from an intensity signal based on a single polynucleotide sequence portion with respect to a base context of two preceding bases.

Figure 14 illustrates the context dependent signal modulation (CDSM) model which predicts per-sequencing cycle intensity values for an input polynucleotide sequence based on the base context of the sequence.

Figures 15A and 15B respectively show the signal intensity in the first channel for nucleobase C with base context AG before, and after, the transformation step of the model of Figure 14.

Figures 16A and 16B respectively show the signal intensity in the second channel for nucleobase T with base context GG before, and after, the transformation step of the model of Figure 14.

Figure 17A illustrates phasing and prephasing effects on a cluster.

Figure 17B illustrates the intensity output of base calls “C” every 15 cycles in a heterogeneous background. The enlarged window shows anticipatory signals (gray arrow) and memory signals (black arrows) due to the phasing and prephasing effect.

Figure 18 shows a block diagram schematically showing the modelling of emission stack effects.

Figure 19 illustrates the method used to learn the parameters for performing the base calling method of Figure 1.

Figure 20A illustrates the loss calculated over 1000 training iterations according to the training method of Figure 19, and Figure 20B illustrates the predicted intensity values for various cycles of a sequencing run from a tile of clusters.

Figures 21A and 21C illustrate predicted and observed per-sequencing cycle intensity values based on a single cluster with alpha (mixing fraction) equal to 0.68 and 0.8 respectively.

Figures 21B and 21D illustrate the predicted and observed per-sequencing cycle intensity values in the first and second channel based on a single cluster with alpha (mixing fraction) equal to 0.68 and 0.8 respectively.

Figure 22 illustrates a base calling pipeline which base calls a sequence of first and second polynucleotide sequence portions according to the base calling method of Figure 1.

Figures 23A and 23B are plots showing the quality of fit against alpha for a plurality of clusters base called using the method of Figures 1 and 22, with the data points shaded according to BLAST alignment score, for the dominant read 1 and the weaker read 2, respectively.

Figure 23C is a histogram of alpha for the plurality of clusters in Figures 23A and 23B.

Figures 24A and 24B are histograms of alignment lengths for a plurality of clusters base called using the method of Figures 1 and 22 for the stronger read and weaker read respectively.

Figures 24C and 24D are histograms of the percent of matching bases for a plurality of clusters base called using the method of Figures 1 and 22 for the stronger read and weaker read respectively.

Figure 25 is a histogram of fragment insert size for paired-end sequencing data obtained using the method of Figures 1 and 22 and intensity data from a paired-end sequencing run.

DETAILED DESCRIPTION

All patents, patent applications, and other publications, including all sequences disclosed within these references, referred to herein are expressly incorporated herein by reference, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference. All documents cited are, in relevant part, incorporated herein by reference in their entireties for the purposes indicated by the context of their citation herein. However, the citation of any document is not to be construed as an admission that it is prior art with respect to the present disclosure.

Introduction

Analysis has revealed that the intensity profiles of signals generated based on a respective polynucleotide sequence portion at a current sequencing cycle can be shifted based on their base context identified at prior and succeeding sequencing cycles. Base context effects may also be known as chemistry modulation effects or fully functional nucleotide (FFN) triphosphate modulation effects. Chemistry modulation effects result from differential incorporation of two (or more) FFN species for a given base. When prior base context includes one or more base A, the shift in the intensity distribution can be substantial.

Quenching is another effect by which base context causes variations in the intensity profiles based on a respective polynucleotide sequence portion. In the sequencing-by-synthesis (SBS) process, nucleotides incorporated into the template sequences contain fluorophores that specifically identify the types of the bases, and attached to the nucleotides is a cleavable linker. After the incorporated base is identified, the linker is cleaved, allowing the fluorophore to be removed and ready for the next base to be attached and identified. Nevertheless, the cleavage can leave a remaining “pendant arm” moiety located on each of the detected nucleotides, which impacts the intensity profiles of the following nucleotides incorporated into the template sequences. For example, the remaining “pendant arm” after the cleavage of the fluorophores attached to base G quenches/reduces/suppresses the intensity values of a subsequent fluorophore when the next nucleotide is incorporated. The quenching effect can be substantial when base calling dimer GA. The fluorophores attached to base A can be significantly quenched by the “pendant arm” of the fluorophores attached to prior base G. In a two-channel base calling system, the intensity values of base A at both color/intensity channels can be reduced, increasing the risk of miscalls. The intensity profiles of other bases (e.g., C, G and T) can be similarly impacted by the “pendant arm” of the fluorophores attached to base G (or some other nucleotide base). In some cases, however, a preceding G can lead to a high average intensity in certain FFN sets, while an A directly preceding an A can lead to relatively low intensity values. US Patent Application No. 63/476428, which is incorporated by reference herein, describes a context dependent method for base calling which considers adjustments to signal intensity based on the base context of an added nucleobase.

US Patent Application No. 63/439417, which is incorporated by reference herein, describes how simultaneous sequencing of first and second polynucleotide sequence portions can be achieved by selectively processing the polynucleotide sequence portions in a cluster such that an intensity of the signals obtained based upon the first polynucleotide sequence portion is greater than an intensity of the signals obtained based upon the second polynucleotide sequence portion. Simultaneous sequencing of first and second polynucleotide sequence portions is described below with reference to Figures 5A to 11. Since the signals obtained based upon the second polynucleotide sequence portion (“the weaker read”) have a lower intensity, they have a lower signal-to-noise ratio (“SNR”) than the signals obtained based upon the first polynucleotide sequence portion (“the dominant read”).

Figure 1 schematically illustrates processing according to the present disclosure to improve the quality of base calling, particularly base calling the weaker read. As shown in Figure 1, first base calling operation 101 receives intensity data 102 and generates a first nucleobase base call 103. The intensity data 102 is intensity data obtained for simultaneous sequencing of first and second polynucleotides and may be obtained in any convenient way, for example as described below with reference to Figures 5A to 11. The first nucleobase base call 103 corresponds to a base call for the dominant read associated with the intensity data 102.

Second base calling operation 104 receives intensity data 102, first nucleobase base call 103 and mappings, M_di 105 and generates a second nucleobase base call 106. The second nucleobase base call 106 corresponds to a base call for the weaker read associated with the intensity data 102. The mappings 105 provide a context-dependent signal modulation (CDSM) model which more accurately maps a current nucleobase to an intensity signal by taking into account the base context of the nucleobase (i.e. the at least one preceding and/or succeeding nucleobases surrounding the current nucleobase) and is described in detail below with reference to Figures 14 to 17B. The second base calling operation 104 may optionally additionally receive and process emission stack effects 107 and a mixing fraction 108, described in detail below.
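The two base calling operations of Figure 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the two-channel encoding in ENCODING, the indexing of the mappings by the single preceding base, the linear mixing of the two reads with a mixing fraction `mix` greater than 0.5, and the function names `adjusted`, `call_dominant` and `call_weaker`. Emission stack effects are omitted.

```python
import numpy as np

# Hypothetical two-channel ideal signals per base (an assumption for this sketch).
ENCODING = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}
PATTERN_TO_BASE = {pattern: base for base, pattern in ENCODING.items()}


def adjusted(prev_base, base, mappings):
    """Ideal signal of `base` after the context adjustment selected by `prev_base`.

    Contexts without an entry in `mappings` are left unadjusted (identity matrix).
    """
    matrix = mappings.get(prev_base, np.eye(2))
    return matrix @ np.array(ENCODING[base])


def call_dominant(intensity):
    """First operation: call the dominant read by thresholding each channel.

    Assumes the dominant read contributes more than half of the combined signal.
    """
    pattern = tuple(1.0 if value > 0.5 else 0.0 for value in intensity)
    return PATTERN_TO_BASE[pattern]


def call_weaker(intensity, first_call, prev1, prev2, mappings, mix):
    """Second operation: pick the weaker-read base whose predicted combined
    signal, given the dominant call and both base contexts, is closest to
    the observed intensity."""
    def distance(candidate):
        predicted = (mix * adjusted(prev1, first_call, mappings)
                     + (1.0 - mix) * adjusted(prev2, candidate, mappings))
        return float(np.linalg.norm(np.asarray(intensity) - predicted))

    return min("ACGT", key=distance)
```

In this sketch the dominant call is made first and then held fixed, so the second operation only has to choose among four candidates for the weaker read rather than among all sixteen base pairs.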

The improved base calling may, for example, be based upon a context-dependent signal modulation model for combined signals (referred to herein as a “CDSM-16QAM model”). In general terms, the CDSM-16QAM model can be defined as follows:

where y is the instrument signal intensity value(s) obtained from performing one or more sequencing cycles on first and second polynucleotide sequence portions, X₁ and X₂ are sequences associated with the one or more added nucleobases of the respective first and second polynucleotide sequence portions over the sequencing run, EM_e is emission stack effects (e.g., cluster and sequencer dependent effects), a is the mixing fraction, and M_e is a context dependent signal modulation (CDSM) model which adjusts signal intensities based on base context. As can be seen in equation (2), the CDSM model is applied to each of sequences X₁ and X₂ respectively and therefore the base context of both of the polynucleotide sequence portions of interest is considered when predicting the instrument signal.
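The roles of the symbols defined above can be illustrated with a short sketch. It does not reproduce equation (2). It assumes, purely for illustration, that the two context-adjusted signals are mixed linearly by the mixing fraction (called `mix` here), that emission stack effects reduce to a scalar gain and offset, that the base context is the single preceding base, and that the ideal signals follow the hypothetical two-channel ENCODING table.

```python
import itertools
import numpy as np

# Hypothetical two-channel ideal signals per base (an assumption for this sketch).
ENCODING = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}


def cdsm(sequence, cycle, mappings):
    """Context-adjusted signal of the base added at `cycle` (0-based).

    The preceding base selects a 2x2 matrix; contexts without an entry,
    and the first cycle, are left unadjusted.
    """
    context = sequence[cycle - 1] if cycle > 0 else ""
    matrix = mappings.get(context, np.eye(2))
    return matrix @ np.array(ENCODING[sequence[cycle]])


def predict_signal(x1, x2, cycle, mappings, mix, gain=1.0, offset=0.0):
    """Predicted combined intensity for one cycle under the stated assumptions."""
    combined = (mix * cdsm(x1, cycle, mappings)
                + (1.0 - mix) * cdsm(x2, cycle, mappings))
    return gain * combined + offset


def constellation(mix, mappings=None):
    """Predicted signal for every pair of current bases, each preceded by 'C'."""
    mappings = mappings or {}
    return {
        (b1, b2): predict_signal("C" + b1, "C" + b2, 1, mappings, mix)
        for b1, b2 in itertools.product("ACGT", repeat=2)
    }
```

Under these assumptions, `constellation(mix=0.7)` yields sixteen distinct two-channel points, one per pair of bases, whereas `constellation(mix=0.5)` collapses to nine points because the two reads can no longer be told apart by intensity.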

As described in further detail below, the parameters of the CDSM-16QAM model may be estimated in order to perform base calling. Some parameters of the model are cluster or sequencer specific, such as the mixing fraction a and the various parameters in EM_e, while others, such as k-mer specific transformation matrices associated with the CDSM model, are independent of any specific cluster or sequence. The model may therefore be trained to learn k-mer specific transformation matrices which will subsequently have their weights locked during base calling. In some embodiments, model training also involves learning k-mer specific phasing coefficients. Although the k-mer specific phasing coefficients may have some dependence on the cluster or the experimental conditions during sequencing, a single set of k-mer specific phasing coefficients can be learnt during training in order to avoid the computational burden of learning a set of context dependent phasing coefficients for each cluster. In order to prevent overfitting, the model may be trained on many sequences in turn, and each k-mer specific transformation matrix may be limited to a 2x2 matrix (i.e. four parameters per k-mer specific transformation matrix) for computational efficiency. In other embodiments, the k-mer specific transformation matrices can be 3x3 matrices, providing greater accuracy at a higher computational cost.
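As a minimal sketch of learning and then locking 2x2 transformation matrices, the following fits one matrix per context by ordinary least squares. The training procedure itself is not specified here, so this is an assumed stand-in: the function name `fit_context_matrices`, the indexing of matrices by a preceding-context string, and the use of least squares on single-read data are all hypothetical.

```python
import numpy as np


def fit_context_matrices(contexts, ideal, observed):
    """Fit one 2x2 matrix per context so that observed ≈ matrix @ ideal.

    contexts: sequence of context strings, one per observation.
    ideal, observed: arrays of shape (n_observations, 2).
    Returns a dict mapping each context to a read-only 2x2 array.
    """
    contexts = np.asarray(contexts)
    ideal = np.asarray(ideal, dtype=float)
    observed = np.asarray(observed, dtype=float)
    matrices = {}
    for kmer in np.unique(contexts):
        rows = contexts == kmer
        # Solve ideal[rows] @ X = observed[rows]; the transformation is X transposed.
        solution, *_ = np.linalg.lstsq(ideal[rows], observed[rows], rcond=None)
        matrix = solution.T
        matrix.setflags(write=False)  # "lock" the learnt weights for base calling
        matrices[str(kmer)] = matrix
    return matrices
```

Each matrix has only four free parameters, so a modest number of observations per context spanning both channels is enough to determine it, which is consistent with the motivation given above for keeping the matrices small.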

An improvement in the accuracy of base calling can be achieved by considering the variations in the intensity profiles of clusters caused by their base context. For each added nucleobase in a polynucleotide sequence portion at a given sequencing cycle, the corresponding base context varies. Advantageously, the systems and methods described herein consider the base context effects of both the stronger read and the weaker read when base calling the weaker read, which further improves the accuracy in base calling.

Systems and methods that provide improvements in base calling first and second polynucleotide sequence portions from combined intensity data as described above will now be described.

Example Sequencer

Referring to Fig. 2A, a diagrammatical representation of an example sequencing system 210 is illustrated as including a sequencer 212 designed to determine sequences of genetic material of a sample 214. The sequencer may function in a variety of manners, and based upon a variety of techniques, including sequencing by primer extension using labeled nucleotides, as in a presently contemplated example, as well as other sequencing techniques such as sequencing by ligation or pyrosequencing. In some examples, the sequencer 212 progressively moves samples through reaction cycles and imaging cycles to progressively build oligonucleotides by binding nucleotides to templates at individual sites on the sample. In some examples, the sample may be prepared by a sample preparation system 216. This process may include amplification of fragments of DNA or RNA on a support to create a multitude of sites of DNA or RNA fragments, the sequences of which are determined by the sequencing process. The sample preparation system 216 may dispose the sample, which may be in the form of an array of sites, in a sample container for processing and imaging.

In some examples, the sequencer 212 includes a fluidics control/delivery system 218 and a detection system 220. The fluidics control/delivery system 218 may receive a plurality of process fluids as indicated by reference numeral 222, for circulation through the sample containers of the samples in process, designated by reference numeral 224. As will be appreciated by those skilled in the art, the process fluids may vary depending upon the particular stage of sequencing. For example, in sequencing-by-synthesis (SBS) using labeled nucleotides, the process fluids introduced to the sample may include a polymerase and tagged nucleotides of the four common DNA types, each nucleotide having a unique fluorescent tag and a blocking agent linked to it. The fluorescent tag allows the detection system 220 to detect which nucleotides were last added to primers hybridized to template nucleic acids at individual sites in the array, and the blocking agent prevents addition of more than one nucleotide per cycle at each site.

At other phases of the sequencing cycles, the process fluids 222 may include other fluids and reagents, such as reagents for removing extension blocks from nucleotides or cleaving nucleotide linkers to release a newly extendable primer terminus. For example, once reactions have taken place at individual sites in the array of the samples, the initial process fluid containing the tagged nucleotides may be washed from the sample in one or more flushing operations. The sample may then undergo detection, such as by the optical imaging at the detection system 220. Subsequently, reagents may be added by the fluidics control/delivery system 218 to de-block the last added nucleotide and remove the fluorescent tag from each. The fluidics control/delivery system 218 may then again wash the sample, which is then prepared for a subsequent cycle of sequencing. Exemplary fluidic and detection configurations that can be used in the methods and devices set forth herein are described in WO 07/123744, which is incorporated herein by reference. In some examples, such sequencing may continue until the quality of data derived from sequencing degrades due to cumulative loss of yield or until a predetermined number of cycles have been completed.

In some examples, the quality of samples 224 in process, the quality of the data derived by the system, and the various parameters used for processing the samples are controlled by a quality/process control system 226. The quality/process control system 226 may include one or more programmed processors, or general purpose or application-specific computers which communicate with sensors and other processing systems within the fluidics control/delivery system 218 and the detection system 220. A number of process parameters may be used for sophisticated quality and process control, for example, as part of a feedback loop that can change instrument operation parameters during the course of a sequencing run.

In some examples, the sequencer 212 also communicates with a system control/operator interface 228 and ultimately with a post-processing system 230. The system control/operator interface 228 may include a general purpose or application-specific computer designed to monitor process parameters, acquired data, system settings, and so forth. The operator interface may be generated by a program executed locally or by programs executed within the sequencer 212. In some examples, these may provide visual indications of the health of the systems or subsystems of the sequencer, the quality of the data acquired, and so forth. The system control/operator interface 228 may also permit human operators to interface with the system to regulate operation, initiate and interrupt sequencing, and any other interactions that may be desired with the system hardware or software. For instance, the system control/operator interface 228 may automatically undertake and/or modify steps to be performed in a sequencing procedure, without input from a human operator. Alternatively or additionally, the system control/operator interface 228 may generate recommendations regarding steps to be performed in a sequencing procedure and display these recommendations to the human operator. This mode may allow for input from the human operator before undertaking and/or modifying steps in the sequencing procedure. In addition, the system control/operator interface 228 may provide an option to the human operator allowing the human operator to select certain steps in a sequencing procedure to be automatically performed by the sequencer 212 while requiring input from the human operator before undertaking and/or modifying other steps. In any event, allowing both automated and operator interactive modes may provide increased flexibility in performing the sequencing procedure. 
In addition, the combination of automation and human-controlled interaction may further allow for a system capable of creating and modifying new sequencing procedures and algorithms through adaptive machine learning based on the inputs gathered from human operators.

The post-processing system 230 may further include one or more programmed computers that receive detected information, which may be in the form of pixelated image data, and derive sequence data from the image data. The post-processing system 230 may include image recognition algorithms which distinguish between colors of dyes (e.g., fluorescent emission spectra of dyes) attached to nucleotides that bind at individual sites as sequencing progresses (e.g., by analysis of the image data encoding specific colors and/or intensities), and log the sequence of the nucleotides at the individual site locations. Progressively, then, the post-processing system 230 may build sequence lists for the individual sites of the sample array which can be further processed to establish genetic information for extended lengths of material by various bioinformatics algorithms.

The sequencing system 210 may be configured to handle individual samples or may be designed for higher throughput in a manner in which multiple stations are provided for the delivery of reagents and other fluids, and for detection of progressively building sequences of nucleotides. Further details can be found in U.S. Patent No. 9797012, which is incorporated herein by reference.

Optics System

Fig. 2B illustrates an exemplary optics system 238 which can detect nucleotides added at sites of an array and can be used in conjunction with the example sequencing system of Fig. 2A. A sample can be moved to two or more stations of the device that are located in physically different locations or alternatively one or more steps can be carried out on a sample that is in communication with the one or more stations without necessarily being moved to different locations. Accordingly, the description herein with regard to particular stations is understood to relate to stations in a variety of configurations whether or not the sample moves between stations, the stations move to the sample, or the stations and sample are static with respect to each other. In the example illustrated in Fig. 2B, one or more light sources 246 provide light beams that are directed to conditioning optics 248. The light sources 246 may include one or more lasers, with multiple lasers being used for detecting dyes that fluoresce at different corresponding wavelengths. The light sources may direct beams to the conditioning optics 248 for filtering and shaping of the beams in the conditioning optics. For example, in a presently contemplated example, the conditioning optics 248 combine beams from multiple lasers and generate a substantially linear beam of radiation that is conveyed to focusing optics 250. The laser modules can additionally include a measuring component that records the power of each laser. The measurement of power may be used as a feedback mechanism to control the length of time an image is recorded in order to obtain a uniform exposure energy, and therefore signal, for each image. If the measuring component detects a failure of the laser module, then the instrument can flush the sample with a “holding buffer” to preserve the sample until the error in the laser can be corrected.
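The power-based exposure feedback described above can be sketched as a simple calculation, assuming that exposure energy is the product of laser power and recording time. The function name `exposure_time_s`, the units and the numerical values are hypothetical and chosen only for illustration.

```python
def exposure_time_s(target_energy_mj, measured_power_mw):
    """Recording time (seconds) that delivers the target exposure energy.

    Assumes energy = power x time, so millijoules divided by milliwatts
    gives seconds. A non-positive power reading is treated as a laser fault.
    """
    if measured_power_mw <= 0:
        raise ValueError("no usable laser power; treat as a laser module failure")
    return target_energy_mj / measured_power_mw
```

For example, if the measured power drops from 100 mW to 80 mW, the recording time for a 50 mJ target rises from 0.5 s to 0.625 s, keeping the exposure energy per image uniform.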

The sample 224 is positioned on a sample positioning system 252 that may appropriately position the sample in three dimensions, and may displace the sample for progressive imaging of sites on the sample array. In a presently contemplated example, the focusing optics 250 confocally direct radiation to one or more surfaces of the array at which individual sites are located that are to be sequenced. Depending upon the wavelengths of light in the focused beam, a retrobeam of radiation is returned from the sample due to fluorescence of dyes bound to the nucleotides at each site.

The retrobeam is then returned through retrobeam optics 254 which may filter the beam, such as to separate different wavelengths in the beam, and direct these separated beams to one or more cameras 256. The cameras 256 may be based upon any suitable technology, such as including charge coupled devices that generate pixelated image data based upon photons impacting locations in the devices. In some examples, the cameras 256 may include CMOS sensors. In some examples, the cameras 256 may include one or more point-and-shoot cameras. In some examples, the cameras 256 may include one or more time delay and integration (TDI) cameras. The cameras generate image data that is then forwarded to image processing circuitry 258. In some examples, the image processing circuitry 258 may perform various operations, such as analog-to-digital conversion, scaling, filtering, and association of the data in multiple frames to appropriately and accurately image multiple sites at specific locations on the sample. The image processing circuitry 258 may store the image data, and may ultimately forward the image data to the post-processing system 230 where sequence data can be derived from the image data. Example detection devices that can be used at a detection station include, for example, those described in US 2007/0114362 (U.S. patent application Ser. No. 11/286,309) and WO 07/123744, each of which is incorporated herein by reference.

Flow cell

As illustrated in Fig. 3, the sample container 224 may be a flow cell 300 which allows the sample 214 to be partitioned across an array of sites. In one implementation, the flow cell 300 is partitioned into a plurality of chambers called lanes, such as lanes 352a, 352b, ... , 352p, i.e., p represents a number of lanes. The lanes are physically separated from each other and may contain different tagged sequencing input libraries, distinguishable without sample cross-contamination. Each individual lane 352 can further be partitioned into non-overlapping regions called “tiles” 362. For example, Fig. 3 illustrates a magnified view of section 358 of an example lane. Section 358 is illustrated to comprise a plurality of tiles 362. Hundreds of thousands to millions of clusters can be immobilized on the surface of each tile. At each sequencing cycle of a sequencing run, the cameras 256 of the sequencer take sequencing images of each tile at each color/intensity channel. The cameras 256 may be part of the flow cell 300. The intensity profiles of clusters being base called at each sequencing cycle are extracted from the sequencing images and analyzed for base calling.

Computer system

A computer system 406 as illustrated in Fig. 4 may be used to implement the system control/operator interface 228 and the post-processing system 230 of the example sequencing system 210 in Fig. 2A. As shown in Fig. 4, the computer system 406 can include functionalities for controlling optics/fluidics systems and determining nucleobase sequences of polynucleotides.

In one example, the computer system 406 includes a processor 402 that is in electrical communication with a memory 404, a storage 406, and a communication interface 408. The processor 402 can be configured to execute instructions that cause the fluidics system 218 to supply reagents to the flow cell 300 during sequencing reactions. The processor 402 can execute instructions that control the light source 246 of the optics system to generate light at around a predetermined wavelength. The processor 402 can execute instructions that control the one or more cameras 256 of the optics system and receive data from the one or more cameras 256. The processor 402 can execute instructions to process data, for example fluorescent images, received from the one or more cameras 256 and to determine the nucleotide sequences of polynucleotides based on the data received from the one or more cameras 256. The memory 404 can be configured to store instructions for configuring the processor 402 to perform the functions of the computer system 406 when the sequencing system 210 is powered on. When the sequencing system 210 is powered off, the storage 406 can store the instructions for configuring the processor 402 to perform the functions of the computer system 406. The communication interface 408 can be configured to facilitate the communications between the computer system 406, the optics system 238, and the fluidics system 218.

The computer system 406 can include a user interface 410 configured to communicate with a display device (not shown) for displaying the sequencing results of the sequencing system 210. The user interface 410 can be configured to receive inputs from users of the sequencing system 210. An optics system interface 412 and a fluidics system interface 414 of the computer system 406 can be configured to control the optics system 238 and the fluidics system 218 through communication links (not shown). For example, the optics system interface 412 can communicate with the computer interface of the optics system 238 through a communication link.

The computer system 406 can include a nucleic base determiner 416 configured to determine the nucleotide sequence of polynucleotides using the data received from the one or more cameras 256. The nucleic base determiner 416 can include one or more of: a template generator 418, a location registrator 420, an intensity extractor 422, an intensity corrector 424, a base caller 426, and a quality score determiner 428. The template generator 418 can be configured to generate a template of the locations of polynucleotide clusters in the flow cell 300 using the fluorescent images captured by the one or more cameras 256. The location registrator 420 can be configured to register the locations of polynucleotide clusters in the flow cell 300 in the fluorescent images captured by the one or more cameras 256 based on the location template generated by the template generator 418. The intensity extractor 422 can be configured to extract intensities of the fluorescent emissions from the fluorescent images to generate extracted intensities. For example, the peak intensity value found in a diffraction-limited spot of a DNA cluster may be extracted from the image and used to represent the signal of the DNA cluster. For another example, the total intensity included within a diffraction-limited spot of a DNA cluster may be extracted from the image and used to represent the signal of the DNA cluster. Alternatively, the intensity estimate can be made through the use of equalization and channel estimation. The intensity corrector 424 can be configured to reduce or eliminate noise or aberration inherent in the sequencing reaction or optical system. For example, intensity may be influenced by laser intensity fluctuation, DNA cluster shape/size variation, uneven illumination, optical distortions or aberrations, and/or phasing/pre-phasing that occur in the DNA clusters. In some examples, the intensity corrector 424 can phase correct or pre-phase correct extracted intensities. 
In some examples, the intensity corrector 424 can normalize extracted fluorescence intensities to reduce or eliminate the effect of DNA cluster size variation. For example, each DNA template may contain the same calibration oligonucleotide. Thus, the extracted fluorescence intensity of a cluster obtained from sequencing a known nucleotide in the calibration oligonucleotide can be used as a normalization factor for that cluster. The intensity corrector 424 can divide the extracted fluorescence intensities of that cluster obtained from sequencing nucleotides in other regions of the DNA template by the normalization factor to obtain the normalized extracted fluorescence intensities. The base caller 426 can be configured to determine the nucleobases of a polynucleotide from the corrected intensities. The bases of a polynucleotide determined by the base caller 426 can be associated with quality scores determined by the quality score determiner 428. Quality scoring refers to the process of assigning a quality score to each base call. To evaluate the quality of a base call from a sequencing read, example processes can include calculating a set of predictor values for the base call and using the predictor values to look up a quality score in a quality table. The quality score can be presented in any suitable format that allows a user to determine the probability of error of any given base call. In some examples, the quality score is presented as a numerical value. For example, the quality score can be quoted as QXX, where XX is the score, meaning that the particular call has a probability of error of 10^(-XX/10). Thus, as an example, Q30 equates to an error rate of 1 in 1000, or 0.1%, and Q40 equates to an error rate of 1 in 10,000, or 0.01%. The error rate can be calculated using a control nucleic acid. Additionally, some metrics displays can include the error rate on a per-cycle basis. 
In some examples, the quality table is generated using a calibration data set, the calibration set being representative of run and sequence variability. Further details of the computations that can be performed by the nucleic base determiner, calculation of error rate and quality score may be found in U.S. Patent Number 8392126, U.S. Patent Application Publication Numbers 2020/0080142 and 2012/0020537, each of which is incorporated by reference herein in its entirety. While nucleic base determiner 416 is shown as part of computer system 406 in Fig. 4, it will be appreciated that nucleic base determiner 416 may be a separate computing device from the other components shown in Fig. 4 such that nucleic base determiner 416 may receive and process image data in a computing device that is different to a computing device that provides optics and fluidics control.
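The per-cluster normalization and the relationship between a quality score QXX and its probability of error can be illustrated with a short sketch. The function names and example values are hypothetical; the quality-table lookup from predictor values is not shown.

```python
import math


def normalize_intensities(intensities, calibration_intensity):
    """Divide a cluster's extracted intensities by its normalization factor,
    i.e. the intensity measured for a known nucleotide in the calibration
    oligonucleotide of that cluster."""
    return [value / calibration_intensity for value in intensities]


def error_probability(quality_score):
    """Probability of error for a quality score: 10^(-Q/10)."""
    return 10 ** (-quality_score / 10)


def quality_score(error_prob):
    """Inverse relationship: Q = -10 * log10(probability of error)."""
    return -10 * math.log10(error_prob)
```

For instance, `error_probability(30)` is approximately 0.001 (1 in 1000) and `error_probability(40)` is approximately 0.0001 (1 in 10,000), matching the Q30 and Q40 figures given above.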

In some examples, the disclosed systems and methods may involve approaches for shifting or distributing certain sequence data analysis features and sequence data storage to a cloud computing environment or cloud-based network. User interaction with sequencing data, genome data, or other types of biological data may be mediated via a central hub that stores and controls access to various interactions with the data. In some examples, the cloud computing environment may also provide sharing of protocols, analysis methods, libraries, sequence data as well as distributed processing for sequencing, analysis, and reporting. In some examples, the cloud computing environment facilitates modification or annotation of sequence data by users. In some examples, the systems and methods may be implemented in a computer browser, on-demand or on-line.

In some examples, software written to perform the methods as described herein is stored in some form of computer readable medium, such as memory, CD-ROM, DVD-ROM, memory stick, flash drive, hard drive, SSD hard drive, server, mainframe storage system and the like.

In some examples, the methods may be written in any of various suitable programming languages, for example compiled languages such as C, C#, C++, Fortran, and Java. Other programming languages could be script languages, such as Perl, MATLAB, SAS, SPSS, Python, Ruby, Pascal, Delphi, R and PHP. In some examples, the methods are written in C, C#, C++, Fortran, Java, Perl, R or Python. In some examples, the method may be an independent application with data input and data display modules. Alternatively, the method may be a computer software product and may include classes wherein distributed objects comprise applications including computational methods as described herein.

In some examples, the methods may be incorporated into pre-existing data analysis software, such as that found on sequencing instruments. Software comprising computer implemented methods as described herein is either installed onto a computer system directly, or held indirectly on a computer readable medium and loaded as needed onto a computer system. Further, the methods may be located on computers that are remote to where the data is being produced, such as software found on servers and the like that are maintained in another location relative to where the data is being produced, such as that provided by a third party service provider.

An assay instrument, desktop computer, laptop computer, or server may contain a processor in operational communication with accessible memory comprising instructions for implementation of systems and methods. In some examples, a desktop computer or a laptop computer is in operational communication with one or more computer readable storage media or devices and/or outputting devices. An assay instrument, desktop computer and a laptop computer may operate under a number of different computer based operational languages, such as those utilized by Apple based computer systems or PC based computer systems. An assay instrument, desktop and/or laptop computers and/or server system may further provide a computer interface for creating or modifying experimental definitions and/or conditions, viewing data results and monitoring experimental progress. In some examples, an outputting device may be a graphic user interface such as a computer monitor or a computer screen, a printer, a hand-held device such as a personal digital assistant (e.g., PDA, Blackberry, iPhone), a tablet computer (e.g., iPad), a hard drive, a server, a memory stick, a flash drive and the like.

A computer readable storage device or medium may be any device such as a server, a mainframe, a supercomputer, a magnetic tape system and the like. In some examples, a storage device may be located onsite in a location proximate to the assay instrument, for example adjacent to or in close proximity to, an assay instrument. For example, a storage device may be located in the same room, in the same building, in an adjacent building, on the same floor in a building, on different floors in a building, etc. in relation to the assay instrument. In some examples, a storage device may be located off-site, or distal, to the assay instrument. For example, a storage device may be located in a different part of a city, in a different city, in a different state, in a different country, etc. relative to the assay instrument. In examples where a storage device is located distal to the assay instrument, communication between the assay instrument and one or more of a desktop, laptop, or server is typically via Internet connection, either wireless or by a network cable through an access point. In some examples, a storage device may be maintained and managed by the individual or entity directly associated with an assay instrument, whereas in other examples a storage device may be maintained and managed by a third party, typically at a distal location to the individual or entity associated with an assay instrument. In examples as described herein, an outputting device may be any device for visualizing data.

An assay instrument, desktop, laptop and/or server system may be used itself to store and/or retrieve computer implemented software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like. One or more of an assay instrument, desktop, laptop and/or server may comprise one or more computer readable storage media for storing and/or retrieving software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like. Computer readable storage media may include, but is not limited to, one or more of a hard drive, a SSD hard drive, a CD-ROM drive, a DVD-ROM drive, a floppy disk, a tape, a flash memory stick or card, and the like. Further, a network including the Internet may be the computer readable storage media. In some examples, computer readable storage media refers to computational resource storage accessible by a computer network via the Internet or a company network offered by a service provider rather than, for example, from a local desktop or laptop computer at a distal location to the assay instrument.

In some examples, computer readable storage media for storing and/or retrieving computer implemented software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like, is operated and maintained by a service provider in operational communication with an assay instrument, desktop, laptop and/or server system via an Internet connection or network connection.

In some examples, a hardware platform for providing a computational environment comprises a processor (i.e., CPU) wherein processor time and memory layout such as random access memory (i.e., RAM) are systems considerations. For example, smaller computer systems offer inexpensive, fast processors and large memory and storage capabilities. In some examples, graphics processing units (GPUs) can be used. In some examples, hardware platforms for performing computational methods as described herein comprise one or more computer systems with one or more processors. In some examples, smaller computers are clustered together to yield a supercomputer network.

In some examples, computational methods as described herein are carried out on a collection of inter- or intra-connected computer systems (i.e., grid technology) which may run a variety of operating systems in a coordinated manner. For example, the CONDOR framework (University of Wisconsin-Madison) and systems available through United Devices are exemplary of the coordination of multiple stand-alone computer systems for the purpose of dealing with large amounts of data. These systems may offer Perl interfaces to submit, monitor and manage large sequence analysis jobs on a cluster in serial or parallel configurations.

Samples

In some examples, the sample comprises or consists of a purified or isolated polynucleotide derived from a tissue sample, a biological fluid sample, a cell sample, and the like. Suitable biological fluid samples include, but are not limited to, blood, plasma, serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, trans-cervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts, amniotic fluid, and leukapheresis samples. In some examples, the sample is a sample that is easily obtainable by non-invasive procedures, e.g., blood, plasma, serum, sweat, tears, sputum, urine, ear flow, saliva or feces. In certain examples the sample is a peripheral blood sample, or the plasma and/or serum fractions of a peripheral blood sample. In other examples, the biological sample is a swab or smear, a biopsy specimen, or a cell culture. In another example, the sample is a mixture of two or more biological samples, e.g., a biological sample can comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample. As used herein, the terms “blood,” “plasma” and “serum” expressly encompass fractions or processed portions thereof. Similarly, where a sample is taken from a biopsy, swab, smear, etc., the “sample” expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.

In certain examples, samples can be obtained from sources, including, but not limited to, samples from different individuals, samples from different developmental stages of the same or different individuals, samples from different diseased individuals (e.g., individuals with cancer or suspected of having a genetic disorder), normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from an individual subjected to different treatments for a disease, samples from individuals subjected to different environmental factors, samples from individuals with predisposition to a pathology, samples from individuals with exposure to an infectious disease agent, and the like.

In one illustrative, but non-limiting example, the sample is a maternal sample that is obtained from a pregnant female, for example a pregnant woman. The maternal sample can be a tissue sample, a biological fluid sample, or a cell sample. In another illustrative, but non-limiting example, the maternal sample is a mixture of two or more biological samples, e.g., the biological sample can comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample.

In certain examples samples can also be obtained from in vitro cultured tissues, cells, or other polynucleotide-containing sources. The cultured samples can be taken from sources including, but not limited to, cultures (e.g., tissue or cells) maintained in different media and conditions (e.g., pH, pressure, or temperature), cultures (e.g., tissue or cells) maintained for different periods of length, cultures (e.g., tissue or cells) treated with different factors or reagents (e.g., a drug candidate, or a modulator), or cultures of different types of tissue and/or cells.

In some examples, the use of the disclosed sequencing technology does not involve the preparation of sequencing libraries. In other examples, the sequencing technology contemplated herein involves the preparation of sequencing libraries. In one illustrative approach, sequencing library preparation involves the production of a random collection of adapter-modified DNA fragments (e.g., polynucleotides) that are ready to be sequenced.

Sequencing libraries of polynucleotides can be prepared from DNA or RNA, including equivalents, analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template, by the action of reverse transcriptase. The polynucleotides may originate in double-stranded form (e.g., dsDNA such as genomic DNA fragments, cDNA, PCR amplification products, and the like) or, in certain examples, the polynucleotides may have originated in single-stranded form (e.g., ssDNA, RNA, etc.) and have been converted to dsDNA form. By way of illustration, in certain examples, single-stranded mRNA molecules may be copied into double-stranded cDNAs suitable for use in preparing a sequencing library. The precise sequence of the primary polynucleotide molecules is generally not material to the method of library preparation, and may be known or unknown. In one example, the polynucleotide molecules are DNA molecules. More particularly, in certain examples, the polynucleotide molecules represent the entire genetic complement of an organism or substantially the entire genetic complement of an organism, and are genomic DNA molecules (e.g., cellular DNA, cell free DNA (cfDNA), etc.), that typically include both intron sequence and exon sequence (coding sequence), as well as non-coding regulatory sequences such as promoter and enhancer sequences. In certain examples, the primary polynucleotide molecules comprise human genomic DNA molecules, e.g., cfDNA molecules present in peripheral blood of a pregnant subject.

Methods of isolating nucleic acids from biological sources may differ depending upon the nature of the source. One of skill in the art can readily isolate nucleic acids from a source as needed for the method described herein. In some instances, it can be advantageous to fragment large nucleic acid molecules (e.g. cellular genomic DNA) in the nucleic acid sample to obtain polynucleotides in the desired size range. Fragmentation can be random, or it can be specific, as achieved, for example, using restriction endonuclease digestion. Methods for random fragmentation may include, for example, limited DNase digestion, alkali treatment and physical shearing. Fragmentation can also be achieved by any of a number of methods known to those of skill in the art. For example, fragmentation can be achieved by mechanical means including, but not limited to nebulization, sonication and hydroshear.

In some examples, sample nucleic acids are obtained as cfDNA, which is not subjected to fragmentation. For example, cfDNA typically exists as fragments of less than about 300 base pairs and consequently, fragmentation is not typically necessary for generating a sequencing library using cfDNA samples. Typically, whether polynucleotides are forcibly fragmented (e.g., fragmented in vitro), or naturally exist as fragments, they are converted to blunt-ended DNA having 5’-phosphates and 3’-hydroxyl. Standard protocols, e.g., protocols for sequencing using, for example, the Illumina platform, instruct users to end-repair sample DNA, to purify the end-repaired products prior to dA-tailing, and to purify the dA-tailing products prior to the adaptor-ligating steps of the library preparation.

In various examples, verification of the integrity of the samples and sample tracking can be accomplished by sequencing mixtures of sample genomic nucleic acids, e.g., cfDNA, and accompanying marker nucleic acids that have been introduced into the samples, e.g., prior to processing.

Simultaneous sequencing methods

Simultaneous sequencing methods can dramatically shorten the total sequencing time and reduce the number of reagents used in next generation sequencing workflows. In addition, sequencing yield per flow cell area may be increased. In some examples, the method enables simultaneous sequencing of two or more polynucleotide sequence portions without the need for the signals generated from the different portions to be separately detectable given the configuration of the portions and the sequencing equipment. It is therefore possible to simultaneously sequence multiple polynucleotide sequence portions that are not possible to spatially resolve, for example sequencing two polynucleotide sequence portions from a signal detected at a single sensing region (for example a single pixel of an imaging sensor) and/or from a signal obtained from a single cluster (i.e. a single contiguous cluster containing both of the two or more sequence portions), thus increasing the efficiency of the sequencing workflow. Typically, sequencing data from clusters comprising more than one polynucleotide sequence portion of interest (“polyclonal clusters”) are filtered out and are excluded from the sequencing output. Therefore, the methods described may also allow for an increase in the number of usable clusters in a given area of a substrate.

In some examples, the primer for sequencing a first sequence portion and the primer for sequencing a second sequence portion are annealed/hybridized to the molecules in the same reaction step to reduce chemical reaction steps, thus saving time and increasing the efficiency of sequencing-by-synthesis (SBS) workflows. Then, both sequence portions may be read-out through SBS chemistry cycles in the same reaction run.

In some examples, in order to separate the signals received from the dye-labeled nucleobases hybridized to each sequence portion, the signal from one of the portions is diminished, e.g., by 50%, in comparison to the signal generated by the other portion. In one example, the difference in signal intensity may be achieved by blocking the addition of labeled nucleobases to some of the primers. For example, half of the primers which bind to a first portion may be blocked so that no fluorescent nucleotides can be added during the sequencing reactions. Thus, the overall intensity of the nucleobases added to the first portion will be 50% lower than the intensity of the nucleobases added to a second portion in this example. By reviewing not only the wavelength of light emitted from the dyes from each nucleic acid cluster on the flow cell, but also the intensity of that light, the labeled nucleobase hybridized to the first portion can be distinguished from the labeled nucleobase hybridized to the second portion. This will be discussed more completely in the sections below. In another example, the difference in signal intensity may be achieved by selectively controlling the number of copies of a first sequence portion relative to the number of copies of a second sequence portion. For example, the number of copies of the second sequence portion may be lower, e.g. by 50%, in comparison to the number of copies of the first sequence portion.
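The effect of the 50% blocking example can be sketched numerically. The following Python fragment is only an illustration under simplifying assumptions (a single detection channel, noiseless signals, intensity proportional to the number of unblocked copies); the function names `relative_scale` and `combined_levels` are hypothetical and do not appear in the disclosure.

```python
from itertools import product


def relative_scale(unblocked_fraction, copy_ratio=1.0):
    """Relative signal scale of a portion (assumed proportional to emitting copies).

    unblocked_fraction: fraction of primers for the portion that can be extended.
    copy_ratio: number of copies of the portion relative to a reference portion.
    """
    if not 0.0 <= unblocked_fraction <= 1.0:
        raise ValueError("unblocked_fraction must lie in [0, 1]")
    if copy_ratio < 0.0:
        raise ValueError("copy_ratio must be non-negative")
    return unblocked_fraction * copy_ratio


def combined_levels(first_scale, second_scale):
    """Sorted combined intensities for every on/off combination of the two portions."""
    return sorted(
        first_scale * first_on + second_scale * second_on
        for first_on, second_on in product((0, 1), repeat=2)
    )


# Half of the first-portion primers blocked, second portion fully unblocked.
levels = combined_levels(relative_scale(0.5), relative_scale(1.0))
# With equal scales, two of the four combinations give the same intensity.
ambiguous = combined_levels(relative_scale(1.0), relative_scale(1.0))
```

With unequal scales the four on/off combinations give four distinct combined intensities, whereas with equal scales the two "one portion on" cases coincide, which is why a difference in intensity is needed to attribute a signal to a portion.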

In some examples, generating the sequencing data comprises obtaining sequence information using Illumina’s sequencing-by-synthesis and reversible terminator-based sequencing chemistry with removable fluorescent dyes (e.g., as described in Bentley et al., Nature 6:53-59 [2009]). Short sequence reads of about tens to a few hundred base pairs may be aligned against a reference genome and unique mapping of the short sequence reads to the reference genome may be identified. Further details regarding the sequencing-by-synthesis and dye labeling methods which can be used by the disclosed technology are described in U.S. Patent Application Publication Numbers 2007/0166705, 2006/0188901, 2006/0240439, 2006/028111109, 2005/0100900, 2013/0079232, U.S. Patent Number 7,057,026, PCT Application Publication Numbers WO 2005/065814, WO 2006/064199, WO 2007/0110251, and WO 2018/165099, U.S. Patent Application Number 17/338590, U.S. Patent No. 7,601,499, U.S. Patent No. 9,267,173, and U.S. Patent Publication No. 2012/0053063, the disclosures of which are incorporated herein by reference in their entireties. Example sequencers are also disclosed in the above.

Clusters

Figs. 5A-5B each illustrate a respective plurality of polynucleotide molecules 500 comprising multiple copies of two polynucleotide sequence portions of interest 501a, 501b for base calling simultaneously based upon a single combined signal obtained from the two portions according to the present methods. For example, the plurality of polynucleotide molecules 500 illustrated in Figs. 5A and 5B may be configured on a substrate 510 such that light emissions from the plurality of polynucleotide molecules are detected by a single sensing portion (for example a single pixel of an imaging sensor 520). Additionally or alternatively, the plurality of polynucleotide molecules 500 may comprise a single cluster (i.e. a single contiguous cluster containing both of the two or more sequence portions 501a, 501b) such that light emissions from each of the two respective portions cannot be spatially resolved. The substrate 510 may be a flow cell, which may be patterned or unpatterned. In one example, the substrate 510 may be a patterned flow cell comprising a number of discrete nanowells 511, with each well containing polynucleotide molecules comprising two or more polynucleotide sequence portions for sequencing and each well having a single respective sensor associated with the well. Because a single sensor is associated with each well, signals from the two or more portions of interest cannot be resolved, irrespective of whether the different portions (or respective clusters) are spatially resolved within the well. Two or more polynucleotide sequence portions of interest contained within a single well in this way are sometimes referred to herein as a “cluster” irrespective of whether the different portions are spatially resolved in the well given that light emissions from such a well form a single combined signal.

In one example, as shown in Fig. 5A, the first and second sequence portions 501a, 501b are present in different polynucleotide molecules 500. In the example shown in Fig. 5B, the first and second sequence portions 501a, 501b are present as respective portions of the same molecules 500. As described in detail below, by diminishing the signal from one of the portions 501b relative to the other of the portions 501a it is possible to separate the signals received from the dye-labeled nucleobases hybridized to each portion. Figs. 5A and 5B each illustrate a respective way in which the signal intensity from the first portion 501a may be modified relative to the signal intensity from the second portion 501b (i.e. in which the signal intensity of one portion may be diminished relative to the signal intensity of the other portion). In the example illustrated in Fig. 5A, the number of first portions 501a relative to the number of second portions 501b is uneven (with one second portion for every two first portions). In the example illustrated in Fig. 5B, the number of first portions 501a and second portions 501b is the same; however, some of the primers 502b used to sequence the second portions are blocked such that during a sequencing run, an uneven number of first portions relative to second portions emit light. While blocking is illustrated with respect to Fig. 5B, in which first and second sequence portions 501a, 501b are present as respective portions of the same polynucleotide molecules 500, it will be appreciated that blocking can also be used to diminish the signal from one sequence portion relative to another sequence portion where first and second portions are present in different polynucleotide molecules.

Differential signal intensity

Both the first and second sequence portions 501a, 501b in the cluster can be sequenced simultaneously using first primers 502a specific to the first portion 501a, or to a region 503a adjacent to the first portion, and second primers 502b specific to the second portion 501b, or to a region 503b adjacent to the second portion, in the same reaction run. For example, the first and second sequence portions 501a, 501b may be flanked at one or both ends by respective primer binding sites 503a, 503b having a known sequence. Sequencing primers 502a, 502b specific to the different primer binding sites 503a, 503b can therefore be designed and used for simultaneous sequencing of the two sequence portions 501a, 501b.

As described above, a single combined signal may be obtained from the two polynucleotide sequence portions of interest 501a, 501b according to the present methods. For example, the plurality of polynucleotide molecules 500 may be configured on the flow cell 510 such that light emissions from the plurality of polynucleotide molecules are detected by a single sensing portion 520. Alternatively or additionally, the plurality of template polynucleotides may comprise a single cluster such that light emissions from each of the respective polynucleotide sequence portions cannot be spatially resolved. Since the fluorescent signal associated with the extended first portion sequencing primers 502a and the fluorescent signal associated with the extended second portion sequencing primers 502b are combined, the signals may not be optically resolved. Therefore, methods for determining whether a fluorescent signal is associated with the extended first portion sequencing primers 502a or the extended second portion sequencing primers 502b are needed, at least when the dye-labeled nucleotide analogs at the extended first portion sequencing primers are not the same as the dye-labeled nucleotide analogs at the extended second portion sequencing primers (e.g., when “A”s are added at the first sequence portion 501a and “C”s are added at the second sequence portion 501b), in order to correctly determine the nucleic acid sequences of both the first and second portions.

In some examples, whether a fluorescent signal is associated with the first sequence portion 501a or the second sequence portion 501b can be determined by using distinguishable levels of signal intensity. In particular, the polynucleotide molecules 500 may be selectively processed such that an intensity of the light emissions associated with respective nucleobases in each of the different sequence portions of interest is different.

By “selective processing” is meant here performing an action that changes relative properties of the first portion and the second portion in the at least one polynucleotide sequence comprising a first portion and a second portion (or the plurality of polynucleotide sequences each comprising a first portion and a second portion), or the at least one first polynucleotide sequence comprising a first portion and at least one second polynucleotide sequence comprising a second portion (or the plurality of first polynucleotide sequences each comprising a first portion and the plurality of second polynucleotide sequences each comprising a second portion), so that the intensity of the first signal is greater than the intensity of the second signal. The property may be, for example, a concentration of first portions capable of generating the first signal relative to a concentration of second portions capable of generating the second signal. The action may include, for example, conducting selective amplification, conducting selective sequencing, or preparing for selective sequencing. It will be appreciated that for dye labeling schemes which include an unlabeled or “dark” base (e.g., G), the signal intensity will be zero for both portions. Similarly, the signal associated with a nucleobase may be zero for an image captured in one particular channel of a base calling cycle. Accordingly, it will be appreciated that the polynucleotide molecules may be selectively processed such that, for signals of nonzero intensity, an intensity of the signals obtained based upon respective nucleobases of the different sequence portions is different.

Sequencing

As described herein, the template provides information (e.g. identification of the genetic sequence, identification of epigenetic modifications) on the original target polynucleotide sequence. For example, a sequencing process (e.g. a sequencing-by-synthesis or sequencing-by-ligation process) may reproduce information that was present in the original target polynucleotide sequence, by using complementary base pairing.

In one example, sequencing may be carried out using any suitable "sequencing-by-synthesis" technique, wherein nucleotides are added successively in cycles to the free 3' hydroxyl group, resulting in synthesis of a polynucleotide chain in the 5' to 3' direction. The nature of the nucleotide added may be determined after each addition. One particular sequencing method relies on the use of modified nucleotides that can act as reversible chain terminators. Such reversible chain terminators comprise removable 3' blocking groups. Once such a modified nucleotide has been incorporated into the growing polynucleotide chain complementary to the region of the template being sequenced there is no free 3'-OH group available to direct further sequence extension and therefore the polymerase cannot add further nucleotides. Once the nature of the base incorporated into the growing chain has been determined, the 3' block may be removed to allow addition of the next successive nucleotide. By ordering the products derived using these modified nucleotides it is possible to deduce the DNA sequence of the DNA template. Such reactions can be done in a single experiment if each of the modified nucleotides has attached thereto a different label, known to correspond to the particular base, to facilitate discrimination between the bases added at each incorporation step. Suitable labels are described in PCT application PCT/GB2007/001770, the contents of which are incorporated herein by reference in their entirety. Alternatively, a separate reaction may be carried out containing each of the modified nucleotides added individually.

The modified nucleotides may carry a label to facilitate their detection. Such a label may be configured to emit a signal, such as an electromagnetic signal, or a (visible) light signal.

In a particular example, the label is a fluorescent label (e.g. a dye). Thus, such a label may be configured to emit an electromagnetic signal, such as a (visible) light signal. One method for detecting the fluorescently labelled nucleotides comprises using laser light of a wavelength specific for the labelled nucleotides, or the use of other suitable sources of illumination. The fluorescence from the label on an incorporated nucleotide may be detected by a CCD camera or other suitable detection means. Suitable detection means are described in PCT/US2007/007991, the contents of which are incorporated herein by reference in their entirety.

However, the detectable label need not be a fluorescent label. Any label can be used which allows the detection of the incorporation of the nucleotide into the DNA sequence.

Each cycle may involve simultaneous delivery of four different nucleotide types to the array of template molecules. Alternatively, different nucleotide types can be added sequentially and an image of the array of template molecules can be obtained between each addition step.

In some examples, each nucleotide type may have a (spectrally) distinct label. In other words, four channels may be used to detect four nucleobases (also known as 4-channel chemistry) (Fig. 6 - left). For example, a first nucleotide type (e.g. A) may include a first label (e.g. configured to emit a first wavelength, such as red light), a second nucleotide type (e.g. G) may include a second label (e.g. configured to emit a second wavelength, such as blue light), a third nucleotide type (e.g. T) may include a third label (e.g. configured to emit a third wavelength, such as green light), and a fourth nucleotide type (e.g. C) may include a fourth label (e.g. configured to emit a fourth wavelength, such as yellow light). Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. For example, the first nucleotide type (e.g. A) may be detected in a first channel (e.g. configured to detect the first wavelength, such as red light), the second nucleotide type (e.g. G) may be detected in a second channel (e.g. configured to detect the second wavelength, such as blue light), the third nucleotide type (e.g. T) may be detected in a third channel (e.g. configured to detect the third wavelength, such as green light), and the fourth nucleotide type (e.g. C) may be detected in a fourth channel (e.g. configured to detect the fourth wavelength, such as yellow light). Although specific pairings of bases to signal types (e.g. wavelengths) are described above, different signal types (e.g. wavelengths) and/or permutations may also be used.

In some examples, detection of each nucleotide type may be conducted using fewer than four different labels. For example, sequencing-by-synthesis may be performed using methods and systems described in US 2013/0079232, which is incorporated herein by reference.

Thus, in some examples, two channels may be used to detect four nucleobases (also known as 2-channel chemistry) (Fig. 6 - middle). For example, a first nucleotide type (e.g. A) may include a first label (e.g. configured to emit a first wavelength, such as green light) and a second label (e.g. configured to emit a second wavelength, such as red light), a second nucleotide type (e.g. G) may not include the first label and may not include the second label, a third nucleotide type (e.g. T) may include the first label (e.g. configured to emit the first wavelength, such as green light) and may not include the second label, and a fourth nucleotide type (e.g. C) may not include the first label and may include the second label (e.g. configured to emit the second wavelength, such as red light). Two images can then be obtained, using detection channels for the first label and the second label. For example, the first nucleotide type (e.g. A) may be detected in both a first channel (e.g. configured to detect the first wavelength, such as green light) and a second channel (e.g. configured to detect the second wavelength, such as red light), the second nucleotide type (e.g. G) may not be detected in the first channel and may not be detected in the second channel, the third nucleotide type (e.g. T) may be detected in the first channel (e.g. configured to detect the first wavelength, such as green light) and may not be detected in the second channel, and the fourth nucleotide type (e.g. C) may not be detected in the first channel and may be detected in the second channel (e.g. configured to detect the second wavelength, such as red light). Although specific pairings of bases to signal types (e.g. wavelengths) and/or combinations of channels are described above, different signal types (e.g. wavelengths) and/or permutations may also be used.
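The example 2-channel pairing described above amounts to a small lookup table. The Python sketch below is illustrative only: it assumes intensities normalized to the range 0 to 1 and a fixed on/off threshold of 0.5, neither of which is specified in the disclosure, and the names `TWO_CHANNEL_CODE` and `call_base_two_channel` are hypothetical.

```python
# (detected in first channel, detected in second channel) -> nucleotide type,
# following the example pairing in the text: A both, G neither, T first only,
# C second only.
TWO_CHANNEL_CODE = {
    (1, 1): "A",
    (0, 0): "G",
    (1, 0): "T",
    (0, 1): "C",
}


def call_base_two_channel(first_channel, second_channel, threshold=0.5):
    """Call a single base from two normalized channel intensities.

    The threshold is an assumed value; a real base caller would typically
    estimate the on/off boundary from the data.
    """
    key = (int(first_channel > threshold), int(second_channel > threshold))
    return TWO_CHANNEL_CODE[key]


example_calls = [
    call_base_two_channel(0.90, 0.80),  # both channels on
    call_base_two_channel(0.10, 0.05),  # both channels off ("dark" base)
    call_base_two_channel(0.95, 0.10),  # first channel only
    call_base_two_channel(0.10, 0.90),  # second channel only
]
```

A different pairing of bases to channels would only change the entries of the table, not the lookup itself.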

In some examples, one channel may be used to detect four nucleobases (also known as 1-channel chemistry) (Fig. 6 - right). For example, a first nucleotide type (e.g. A) may include a cleavable label (e.g. configured to emit a wavelength, such as green light), a second nucleotide type (e.g. G) may not include a label, a third nucleotide type (e.g. T) may include a non-cleavable label (e.g. configured to emit the wavelength, such as green light), and a fourth nucleotide type (e.g. C) may include a label-accepting site which does not include the label. A first image can then be obtained, and a subsequent treatment carried out to cleave the label attached to the first nucleotide type, and to attach the label to the label-accepting site on the fourth nucleotide type. A second image may then be obtained. For example, the first nucleotide type (e.g. A) may be detected in a channel (e.g. configured to detect the wavelength, such as green light) in the first image and not detected in the channel in the second image, the second nucleotide type (e.g. G) may not be detected in the channel in the first image and may not be detected in the channel in the second image, the third nucleotide type (e.g. T) may be detected in the channel (e.g. configured to detect the wavelength, such as green light) in the first image and may be detected in the channel (e.g. configured to detect the wavelength, such as green light) in the second image, and the fourth nucleotide type (e.g. C) may not be detected in the channel in the first image and may be detected in the channel in the second image (e.g. configured to detect the wavelength, such as green light). Although specific pairings of bases to signal types (e.g. wavelengths) and/or combinations of images are described above, different signal types (e.g. wavelengths), images and/or permutations may also be used.
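For the example 1-channel pairing, the base is identified from how the signal changes between the two images rather than from two wavelengths. The following Python sketch illustrates this under assumed conditions (normalized intensities, a fixed threshold of 0.5); the names `call_base_one_channel` and `ONE_CHANNEL_TRANSITIONS` are hypothetical.

```python
# (on in first image, on in second image) -> nucleotide type, following the
# example in the text: A loses its cleavable label, T keeps its non-cleavable
# label, C gains a label at its label-accepting site, G is never labeled.
ONE_CHANNEL_TRANSITIONS = {
    (True, False): "A",
    (False, False): "G",
    (True, True): "T",
    (False, True): "C",
}


def call_base_one_channel(first_image, second_image, threshold=0.5):
    """Call a base from the same channel imaged before and after the treatment.

    The threshold is an assumed value chosen for illustration.
    """
    transition = (first_image > threshold, second_image > threshold)
    return ONE_CHANNEL_TRANSITIONS[transition]


# One simulated cluster per nucleotide type: (first image, second image).
simulated = {"A": (0.9, 0.1), "G": (0.05, 0.1), "T": (0.9, 0.85), "C": (0.1, 0.9)}
decoded = {base: call_base_one_channel(*pair) for base, pair in simulated.items()}
```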

In some examples, the fluorescent labels are selected from the group consisting of polymethine derivatives, coumarin derivatives, benzopyran derivatives, chromenoquinoline derivatives, and compounds containing bis-boron heterocycles such as BOPPY and BOPYPY. In some examples, the fluorescent label is attached to the nucleotide through a cleavable linker. In some further examples, the labeled nucleotide may have the fluorescent label attached to the C5 position of a pyrimidine base or the C7 position of a 7-deaza purine base, optionally through a cleavable linker moiety. For example, the nucleobase may be 7-deaza adenine and the dye is attached to the 7-deaza adenine at the C7 position, optionally through a cleavable linker. The nucleobase may be 7-deaza guanine and the dye is attached to the 7-deaza guanine at the C7 position, optionally through a cleavable linker. The nucleobase may be cytosine and the dye is attached to the cytosine at the C5 position, optionally through a cleavable linker. As another example, the nucleobase may be thymine or uracil and the dye is attached to the thymine or uracil at the C5 position, optionally through a cleavable linker. In some further examples, the cleavable linker may comprise similar or the same chemical moiety as the reversible terminator 3' hydroxy blocking group such that the 3' hydroxy blocking group and the cleavable linker may be removed under the same reaction condition or in a single chemical reaction. Non-limiting examples of the cleavable linker include the LN3 linker, the sPA linker, and the AOL linker.

In some examples, the nucleotides are selected from the group consisting of an analog of dGTP, an analog of dTTP, an analog of dUTP, an analog of dCTP, and an analog of dATP. In some examples, the first nucleotide is a first reversibly blocked nucleotide triphosphate (rbNTP), the second nucleotide is a second rbNTP, the third nucleotide is a third rbNTP, and the fourth nucleotide is a fourth rbNTP, wherein each of the first nucleotide, second nucleotide, third nucleotide and fourth nucleotide is a different type of nucleotide from the other. In some examples, the four rbNTPs are selected from the group consisting of rbATP, rbTTP, rbUTP, rbCTP, and rbGTP. In some examples, each of the four rbNTPs includes a modified base and a reversible terminator 3' blocking group. Non-limiting examples of the 3' blocking group include azidomethyl (*-CH2N3), substituted azidomethyl (e.g., *-CH(CHF2)N3 or *-CH(CH2F)N3) and *-CH2-O-CH2-CH=CH2, where the asterisk * indicates the point of attachment to the 3' oxygen of the ribose or deoxyribose ring of the nucleotide.

Further details about the dyes and the fully functionalized nucleotides can be found in U.S. Patent Application Publication Numbers 2018/0094140 and 2020/0277670, International Patent Application Publication Number 2017/051201, and U.S. Provisional Patent Application Numbers 63/057758 and 63/127061, the disclosures of which are incorporated herein by reference in their entireties.

Signal processing

Fig. 6B is a dye labelling scheme used to generate a scatter plot shown in Figure 7. The scatter plot of Figure 7 shows an example of sixteen distributions of signals from a nucleic acid cluster as illustrated in Figs. 5A-5B. As explained in connection with Figs. 5A and 5B, in one example, the fluorescent signal coming from the collection of extended first portion sequencing primers 502a will be brighter than the fluorescent signal coming from the collection of extended second portion sequencing primers 502b in the same cluster. The scatter plot of Fig. 7 shows sixteen distributions (or “bins”/“classifications”) of intensity values from the combination of a brighter signal and a dimmer signal; the two signals may be co-localized and may not be optically resolved as described above. The intensity values shown in Fig. 6B may be up to a scale or normalization factor; the units of the intensity values may be arbitrary or relative (i.e., representing the ratio of the actual intensity to a reference intensity). The sum of the brighter signal from the extended first portion primers 502a and the dimmer signal from the extended second portion primers 502b results in a combined signal. The combined signal may be captured by the first optical channel and the second optical channel (e.g., the “IMAGE 1” channel and the “IMAGE 2” channel in Fig. 6B). Since the brighter signal may be A, T, C or G, and the dimmer signal may be A, T, C or G, there are sixteen possibilities for the combined signal, corresponding to sixteen distinguishable patterns when optically captured according to the example shown in connection with Fig. 6B. That is, each of the sixteen possibilities corresponds to a bin shown in Fig. 7. The computer system can map the combined signal from a cluster into one of the sixteen bins, and thus determine the added nucleobase at the extended first portion primers 502a and the added nucleobase at the extended second portion primers 502b, respectively.

For example, when the combined signal is mapped to bin 712 for a base calling cycle, the computer processor base calls both the added nucleobase at the extended first portion primers 502a and the added nucleobase at the extended second portion primers 502b as C. When the combined signal is mapped to bin 714 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as C and the added nucleobase at the extended second portion primers 502b as T. When the combined signal is mapped to bin 716 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as C and the added nucleobase at the extended second portion primers 502b as G. When the combined signal is mapped to bin 718 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as C and the added nucleobase at the extended second portion primers 502b as A.

When the combined signal is mapped to bin 722 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as T and the added nucleobase at the extended second portion primers 502b as C. When the combined signal is mapped to bin 724 for the base calling cycle, the processor base calls both the added nucleobase at the extended first portion primers 502a and the added nucleobase at the extended second portion primers 502b as T. When the combined signal is mapped to bin 726 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as T and the added nucleobase at the extended second portion primers 502b as G. When the combined signal is mapped to bin 728 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as T and the added nucleobase at the extended second portion primers 502b as A.

When the combined signal is mapped to bin 732 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as G and the added nucleobase at the extended second portion primers 502b as C. When the combined signal is mapped to bin 734 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as G and the added nucleobase at the extended second portion primers 502b as T. When the combined signal is mapped to bin 736 for the base calling cycle, the processor base calls both the added nucleobase at the extended first portion primers 502a and the added nucleobase at the extended second portion primers 502b as G. When the combined signal is mapped to bin 738 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as G and the added nucleobase at the extended second portion primers 502b as A.

When the combined signal is mapped to bin 742 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as A and the added nucleobase at the extended second portion primers 502b as C. When the combined signal is mapped to bin 744 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as A and the added nucleobase at the extended second portion primers 502b as T. When the combined signal is mapped to bin 746 for the base calling cycle, the processor base calls the added nucleobase at the extended first portion primers 502a as A and the added nucleobase at the extended second portion primers 502b as G. When the combined signal is mapped to bin 748 for the base calling cycle, the processor base calls both the added nucleobase at the extended first portion primers 502a and the added nucleobase at the extended second portion primers 502b as A. Further details regarding performing base-calling based on a scatter plot having sixteen bins may be found in U.S. Patent Application Publication No. 2019/0212294, the disclosure of which is incorporated herein by reference.

Modelling differential signal intensity

As described above with reference to Figure 1, the second base calling operation 104 may optionally receive and process a mixing fraction 108, which is derived based on the ratio of signal intensities based upon the respective first and second polynucleotide sequence portions.

In the arrangement of Figures 5A-5B described above, the ratio of signal intensities based upon the respective first and second polynucleotide sequence portions of interest 501a, 501b is 2:1. In other embodiments, the ratio of signal intensities based upon the respective first and second polynucleotide sequence portions of interest 501a, 501b can be any other ratio, such as 3:1, 1.5:1, 1.25:1, 1:1, etc. The difference in signal intensities based upon the respective first and second polynucleotide sequence portions of interest 501a, 501b can be represented by the mixing fraction, a.

The combined corrected signal (i.e. the fully corrected intensity signal normalized in the 0-1 range in each channel) based on the respective added first nucleobase X₁ and second nucleobase X₂ in a cluster can be expressed as:

combined corrected signal = a X₁ + (1 − a) X₂     (1)

where X₁ and X₂ each encode the binarised 2-channel signal intensity according to the respective added nucleobase and correspond to one of [0,0], [1,0], [0,1] and [1,1] respectively depending on whether the respective added nucleobase is a G, C, T or A, and wherein a is the mixing fraction with a value in the open interval (0,1). The mixing fraction a controls the difference in signal intensities based upon the respective first and second polynucleotide sequence portions of interest 501a, 501b. When the ratio of signal intensities based upon the respective first and second polynucleotide sequence portions of interest 501a, 501b is 2:1, a = 0.6666; when the ratio of signal intensities is 1:1, a = 0.5. The term “signal intensity” may correspond to a voltage signal or other electromagnetic signal obtained from a sequencer based on a cluster, a photon emission and/or light emission from a cluster. Of further note, in an actual sequencing run, the value of a in a cluster may vary slightly from the value of a that was intended during the sample preparation steps and therefore it is a parameter that can be learned from the intensity data.

The distribution of signal intensities from a sequencing run performed on a cluster comprising multiple copies of first and second polynucleotide sequence portions of interest 501a, 501b is dependent on the value of a for that cluster.

Figure 8A is a scatter plot showing an example of sixteen distributions of signals from a nucleic acid cluster with a = 0.6666 (i.e. a 2:1 signal intensity ratio as illustrated in Figures 5A-5B) and implemented with a 2-channel dye labeling scheme. As explained in connection with Figure 7, there are sixteen distributions (or “bins”/“classifications”) of intensity values from the combination of a first signal relating to the first polynucleotide sequence portions that emit a relatively higher intensity signal (i.e. a brighter signal) and a second signal relating to the second polynucleotide sequence portions that emit a relatively lower intensity signal (i.e. a dimmer signal); the two signals may be co-localized and may not be optically resolved as described above.

Figure 8B is a scatter plot showing an example of nine distributions of signals from a nucleic acid cluster with a = 0.5 (i.e. a 1:1 signal intensity ratio) and implemented with a 2-channel dye labeling scheme, although with a different label arrangement compared to that shown in Figure 6B. In one embodiment, the fluorescent signal coming from the collection of extended first portion sequencing primers 502a has the same intensity as the fluorescent signal coming from the collection of extended second portion sequencing primers 502b in the same cluster. The scatter plot of Figure 8B shows nine distributions (or “bins”/“classifications”) of intensity values from the combination of signals; the two signals may be co-localized and may not be optically resolved as described above. The intensity values shown in Figure 8B may be up to a scale or normalization factor; the units of the intensity values may be arbitrary or relative (i.e., representing the ratio of the actual intensity to a reference intensity). The sum of the signal from the extended first portion primers 502a and the signal from the extended second portion primers 502b results in a combined signal. The combined signal may be captured by the first optical channel and the second optical channel. The first signal may be A, T, C or G and the second signal may be A, T, C or G, and since the respective signals cannot be spatially resolved, there are nine possibilities for the combined signal, corresponding to nine distinguishable patterns when optically captured according to the embodiment shown in connection with Figures 5A-5B (although with the intensity ratios as 1:1, rather than 2:1 as illustrated). It can be seen that since there are sixteen possible combinations of two respective nucleobases but only nine bins, there is not a unique mapping of combinations of two respective nucleobases to bins. For example, it can be seen that when the first signal is a T and the second signal is a C, or when the first signal is an A and the second signal is a G, or when the first signal is a G and the second signal is an A, or when the first signal is a C and the second signal is a T, the signal corresponds to the central bin in the scatter plot. Referring back to equation (1), it can be seen that for each of these signal combinations, the resultant combined corrected signal is [0.5, 0.5]. US Patent Application No. 63439443, which is incorporated by reference herein, describes in further detail the performance of base-calling based on a scatter plot having nine bins.
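The loss of a unique mapping at a = 0.5 can be checked with a short calculation. The following Python sketch is illustrative only: it assumes the combined corrected signal is the a-weighted sum a X₁ + (1 − a) X₂ of the binarised encodings given above (G, C, T, A as [0,0], [1,0], [0,1], [1,1]), and the helper name bins_for is hypothetical. Exact fractions are used so that coincident bin centres compare equal.

```python
from fractions import Fraction
from itertools import product

# Binarised 2-channel encoding stated in the text: G, C, T, A -> [0,0], [1,0], [0,1], [1,1].
ENCODING = {"G": (0, 0), "C": (1, 0), "T": (0, 1), "A": (1, 1)}


def bins_for(a):
    """Group the sixteen (first, second) base pairs by expected combined signal.

    Assumes combined signal = a*X1 + (1 - a)*X2, evaluated per channel.
    """
    bins = {}
    for first, second in product("ACGT", repeat=2):
        x1, x2 = ENCODING[first], ENCODING[second]
        centre = tuple(a * p + (1 - a) * q for p, q in zip(x1, x2))
        bins.setdefault(centre, []).append(first + second)
    return bins


equal = bins_for(Fraction(1, 2))     # 1:1 intensity ratio
unequal = bins_for(Fraction(2, 3))   # 2:1 intensity ratio

print(len(equal), len(unequal))
# Base pairs (first base, second base) that share the central bin at a = 0.5.
print(sorted(equal[(Fraction(1, 2), Fraction(1, 2))]))
```

Under these assumptions the 1:1 case yields nine distinct centres and the 2:1 case sixteen, and the central bin at a = 0.5 collects the four combinations listed above.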

Accordingly, it can be seen in connection to Figure 8A that a mixing fraction a = 0.666 allows for unique binning of each of the combinations based on the brighter signal corresponding to the first polynucleotide sequence portion 501a and dimmer signal corresponding to the second polynucleotide sequence portion 501b. Since the intensity signal based upon the dominant read is stronger than the intensity signal based upon the weaker read, the added nucleobase corresponding to the stronger read may be determined with greater accuracy in comparison to the determination of the added nucleobase corresponding to the weaker read. In other words, the signal-to-noise ratio associated with the dominant read is greater than the signal-to-noise ratio associated with the weaker read.

Signal mapping

In one example, the combined signals may be mapped to the distributions by using a Gaussian Mixture Model (GMM). A Gaussian mixture model comprises multiple Gaussians, each identified by n ∈ {1, ..., N}, where N is the number of clusters (i.e., groups of data points). Each Gaussian n in the mixture includes the following parameters: a mean value μ that defines its centroid and covariances Σ that define its width. In a multivariate scenario where, e.g., the intensity profiles for the clusters are extracted from the sequencing images acquired from two color/intensity channels, the covariances Σ define the dimension of an ellipsoid of the intensity distribution.
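As an illustration of how a fitted mixture may be used to map an intensity point to a distribution, the following Python sketch evaluates each two-dimensional Gaussian and selects the component with the highest posterior probability. It is a minimal sketch rather than the disclosed implementation: the mixture weights, means and covariances shown are made-up values, fitting of the mixture is not shown, and the function names are hypothetical.

```python
import math


def gaussian_pdf(point, mean, cov):
    """Density of a 2-D Gaussian with covariance ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form using the closed-form inverse of a 2 x 2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def responsibilities(point, weights, means, covs):
    """Posterior probability of each mixture component for one point."""
    joint = [w * gaussian_pdf(point, m, s) for w, m, s in zip(weights, means, covs)]
    total = sum(joint)  # may underflow to zero for points far from every centroid
    return [j / total for j in joint]


def assign(point, weights, means, covs):
    """Index of the most probable component."""
    post = responsibilities(point, weights, means, covs)
    return max(range(len(post)), key=post.__getitem__)


# Illustrative four-component mixture (values are assumptions, not fitted data).
WEIGHTS = [0.25, 0.25, 0.25, 0.25]
MEANS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
COVS = [((0.01, 0.0), (0.0, 0.01))] * 4

print(assign((0.9, 0.1), WEIGHTS, MEANS, COVS))
```

A production implementation would typically work with log-densities to avoid the underflow noted in the comment.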

In the present embodiment, the intensity data is two-channel intensity data and the ratio in signal intensities between the stronger read X₁ and the weaker read X₂ is approximately 2:1, i.e. a is approximately 0.666, and therefore the intensity distribution corresponds to Figure 8A.

Figure 9A is an overview of a method for fitting a GMM to intensity data with a distribution as shown in Figure 8A. As shown in Figure 9A, at 901, raw intensities obtained based upon a plurality of clusters are received. At 902, raw intensities for each cluster may first be separately normalized. At 903, a GMM with four sources may then be fitted. At 904, the fitted GMM may then be used to predict the brighter or “dominant” read. At 905, the intensities with the same dominant read may then be used to train a GMM with four sources for the dimmer or “weaker” read. At 906, the trained GMM may be used to predict the weaker read for each dominant read, resulting in 16 cluster centers. A GMM with 16 sources may then be initialized with these 16 cluster centers, and trained with all cluster intensities for each cycle. At 907, the two reads may then be predicted using GMM cluster assignment and correlation with the expected cloud constellation. At 908, the predicted identities of the base calls are output.

Alternatively, as now described in relation to Figure 9B, a simpler base calling method can be performed. A base call for a nucleobase of the stronger read X₁ can be determined by dividing the 2-channel intensity plot into quadrants, each quadrant corresponding to a respective nucleobase, and then determining in which of the quadrants the corresponding per-sequence cycle intensity signal lies. A base call for each of the nucleobases of the weaker read X₂ can be determined by dividing the quadrant in which the corresponding per-sequence cycle intensity signal lies into sub-quadrants and determining in which of the sub-quadrants the signal belongs. In another example, simple base calls can be made for X₁ and X₂ by dividing the 2-channel intensity plot into 16 portions, each portion corresponding to a respective combination of two nucleobases, and base calling X₁ and X₂ simultaneously.
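The quadrant and sub-quadrant approach can be sketched in a few lines of Python. This is a hedged illustration, not the disclosed implementation: it assumes intensities normalised to the 0-1 range in each channel, a mixing fraction of approximately 2/3, the binarised encoding given earlier (G, C, T, A as [0,0], [1,0], [0,1], [1,1]), and fixed thresholds at the geometric midpoints of the plot and of each quadrant. The function name call_pair is hypothetical.

```python
# Binarised encoding stated in the text: G, C, T, A -> [0,0], [1,0], [0,1], [1,1].
DECODE = {(0, 0): "G", (1, 0): "C", (0, 1): "T", (1, 1): "A"}


def call_pair(ch1, ch2):
    """Call (stronger read base, weaker read base) from one 2-channel intensity."""
    first_bits, second_bits = [], []
    for value in (ch1, ch2):
        high = value >= 0.5                  # half of the plot -> stronger read bit
        first_bits.append(int(high))
        midpoint = 0.75 if high else 0.25    # centre of that half -> weaker read bit
        second_bits.append(int(value >= midpoint))
    return DECODE[tuple(first_bits)], DECODE[tuple(second_bits)]


# Stronger read C ([1,0]) and weaker read T ([0,1]) at a = 2/3 give roughly (0.667, 0.333).
print(call_pair(0.667, 0.333))
```

In practice the thresholds would more likely be derived from the fitted distributions than fixed in advance.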

In other embodiments where the ratio in signal intensities between the stronger read X₁ and the weaker read X₂ does not approximate 2:1, coarse base calling can, for example, be performed by dividing the 2-channel intensity plot based on the expected intensity distribution given the estimated value of a.
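One way to divide the plot according to the expected intensity distribution is to assign each intensity to the nearest expected bin centre for the estimated value of a. The Python sketch below is illustrative only; it assumes the a-weighted sum a X₁ + (1 − a) X₂ with the binarised encoding given earlier, and the names expected_centres and coarse_call are hypothetical.

```python
from itertools import product

# Binarised encoding stated in the text: G, C, T, A -> [0,0], [1,0], [0,1], [1,1].
ENCODING = {"G": (0, 0), "C": (1, 0), "T": (0, 1), "A": (1, 1)}


def expected_centres(a):
    """Expected combined signal for every (stronger base, weaker base) pair."""
    centres = {}
    for first, second in product(ENCODING, repeat=2):
        x1, x2 = ENCODING[first], ENCODING[second]
        centres[(first, second)] = tuple(a * p + (1 - a) * q for p, q in zip(x1, x2))
    return centres


def coarse_call(ch1, ch2, a):
    """Return the base pair whose expected centre is closest to (ch1, ch2)."""
    centres = expected_centres(a)

    def squared_distance(pair):
        c1, c2 = centres[pair]
        return (ch1 - c1) ** 2 + (ch2 - c2) ** 2

    return min(centres, key=squared_distance)


# 3:1 intensity ratio, i.e. a = 0.75; stronger read T and weaker read A centre on (0.25, 1.0).
print(coarse_call(0.27, 0.96, 0.75))
```

When a is close to 0.5 several pairs share a centre, so the nearest-centre rule returns only one of the tied combinations and cannot by itself resolve the two reads.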

Simplified sequencing workflow

Fig. 10 is a flow diagram showing an example method 1000 of base calling. The described method allows for simultaneous sequencing of two or more sequence portions in a single sequencing run from a single combined signal obtained from the two or more portions, thus requiring less sequencing reagent consumption and faster generation of data from both portions. Further, the simplified method may reduce the number of workflow steps while producing the same yield as compared to existing next-generation sequencing methods. Thus, the simplified method may result in reduced sequencing runtime.

As shown in Fig. 10, the disclosed method 1000 may start from block 1001. The method may then move to block 1010.

At block 1010, intensity data is obtained. The intensity data includes first intensity data and second intensity data. The first intensity data comprises a combined intensity of a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion. Similarly, the second intensity data comprises a combined intensity of a third signal obtained based upon the respective first nucleobase of the at least one first polynucleotide sequence portion and a fourth signal obtained based upon the respective second nucleobase of the at least one second polynucleotide sequence portion. As described above, polynucleotide molecules comprising the at least one first polynucleotide sequence portion and the at least one second polynucleotide sequence portion may be arranged on the flow cell such that light emissions from the first and second portions are detected by a single sensing portion and/or may comprise a single cluster such that light emissions from each of the respective two polynucleotide sequence portions cannot be spatially resolved.

In one example, the signals may be generated according to the method shown in Fig. 11.

In one example, obtaining the intensity data comprises selecting intensity data that corresponds to two or more different sequence portions. In one example, intensity data is selected based upon a chastity score. A chastity score may be calculated as the ratio of the brightest base intensity divided by the sum of the brightest and second brightest base intensities. The desired chastity score may be different depending upon the expected intensity ratio of the light emissions associated with the different portions. As described above, it may be desired to produce clusters comprising two different sequence portions of interest, which give rise to signals in a ratio of 2:1. In one example, high-quality data corresponding to two sequence portions with an intensity ratio of 2:1 may have a chastity score of around 0.8 to 0.9. In one example, clusters may be identified as containing one or more than one polynucleotide sequence portion of interest (e.g. by chastity score) and processed accordingly. For example, clusters containing more than one sequence portion of interest may be base called according to the present methods, whereas clusters containing a single sequence portion of interest may be base called according to known methods.
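The chastity calculation described above can be written directly. The Python sketch below is illustrative: it assumes that one non-negative intensity value per base is already available for the cluster and cycle (how these are derived from the two channels is not shown), and the function name chastity is hypothetical.

```python
def chastity(base_intensities):
    """Brightest intensity divided by the sum of the brightest and second brightest."""
    top, second = sorted(base_intensities, reverse=True)[:2]
    total = top + second
    if total <= 0:
        raise ValueError("chastity is undefined when the two brightest intensities do not sum to a positive value")
    return top / total


# Illustrative per-base intensities (made-up values).
print(chastity([8.0, 2.0, 1.0, 0.0]))
```

A selection step could then keep clusters whose score falls in a window chosen for the expected intensity ratio, for example around 0.8 to 0.9 for the 2:1 case mentioned above.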

After the intensity data has been obtained, the method may proceed to block 1020. In this step, one of a plurality of classifications is selected based on the intensity data. Each classification represents a possible combination of respective first and second nucleobases. In one example, the plurality of classifications comprises sixteen classifications as shown in Figs. 8A and/or 8B, each representing a unique combination of first and second nucleobases. Where there are two polynucleotide sequence regions of interest, there are sixteen possible combinations of first and second nucleobases. Selecting the classification based on the first and second intensity data comprises selecting the classification based on the combined intensity of the first and second signals and the combined intensity of the third and fourth signals, for example using GMMs as described above.

The method may then proceed to block 1030, where the respective first and second nucleobases are base called based on the classification selected in block 1020. The light emissions generated during a cycle of a sequencing-by-synthesis method are indicative of the identity of the nucleobase(s) added to the sequencing primers undergoing extension. It will be appreciated that there is a direct correspondence between the identity of the nucleobases incorporated into the sequencing primers and the identity of the complementary base at the corresponding position of the sequence portion bound to the flow cell. Therefore, any references herein to the base calling of respective nucleobases of polynucleotide sequence portions encompasses the base calling of nucleobases hybridized to the polynucleotide sequence portions and, alternatively or additionally, the identification of the corresponding nucleobases of the portions. The method may then end at block 1040.

Fig. 11 is a flow diagram showing a method 1100 by which the signals discussed in relation to block 1010 of Fig. 10 may be generated. The method may start from block 1101.

The method may then move to block 1110, default oligo grafting, which may include the attachment of oligonucleotide anchors/graft sequences to a planar, optically transparent surface of the flow cell. The method may then move to block 1120, generating DNA libraries from a sample, where template polynucleotides in a sample may be end-repaired to generate 5’-phosphorylated blunt ends, and the polymerase activity of Klenow fragment may be used to add a single A base to the 3’ end of the blunt phosphorylated nucleic acid fragments. This addition prepares the nucleic acid fragments for ligation to oligonucleotide adapters, which have an overhang of a single T base at their 3’ end to increase ligation efficiency. The adapter oligonucleotides are complementary to the flow cell anchor oligos.

After DNA library generation, the method may then move to block 1130, denaturing the double stranded DNA libraries to generate single stranded template polynucleotides for seeding on the flow cell. The method may then move to block 1140, clustering from the single stranded template polynucleotides. Under limiting-dilution conditions, adapter-modified, single-stranded template polynucleotides are added to the flow cell and immobilized by hybridization to the anchor oligos. Attached nucleic acid fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with hundreds of millions of clusters, each containing about 1,000 copies of the same template. Details regarding enrichment of nucleic acids using cluster amplification may be found in Kozarewa et al., Nature Methods 6:291-295 (2009), which is incorporated herein by reference.

After cluster generation, the method may directly move to block 1150, hybridizing/annealing first and second primers 502a, 502b simultaneously to both the first and second polynucleotide sequence portions 501a, 501b on the flow cell 300. Next, the method may move to block 1160 of signal generation. Signal generation proceeds by simultaneously extending the hybridized primers 502a, 502b. With each cycle, fluorescently tagged nucleotides compete for addition to the growing chains of extended primers. Only one is incorporated at a primer location based on the sequence of the template strand. After the addition of nucleotides, the cluster is excited by a light source, and characteristic fluorescent signals are emitted. The emission spectra and the signal intensities uniquely determine the base call. Hundreds of millions of nucleic acid clusters, or thousands to tens of thousands of millions of clusters, may be sequenced in a massively parallel manner. After sequencing the polynucleotide sequence portions 501a, 501b on the flow cell 300, the method may end at block 1170.

Sequencing data example

Figs. 12A and 12B show the results of concurrent sequencing of different inserts (human and PhiX) according to the methods described above.

By plotting relative intensities of light signals obtained from a first channel (ch1) and a second channel (ch2), a constellation of 16 clouds is obtained. Each of these clouds allows sequence information to be identified on both the human insert and the PhiX insert, where the top left corner of four clouds corresponds with base calls corresponding to C, the top right corner of four clouds corresponds with base calls corresponding to T, the bottom left corner of four clouds corresponds with base calls corresponding to G, and the bottom right corner of four clouds corresponds with base calls corresponding to A. The basecall read out (R1 and R2) of both the human insert and the PhiX insert is also shown.

As shown in Fig. 12B, alignment of R1 and R2 (minor and major reads respectively) with the known human and PhiX sequence confirmed that the method accurately sequenced the inserts. In particular the sequence identity of R1 and R2 with the known sequences was 99% (150 out of 151 correct base calls for R1 and 148 out of 149 correct base calls for R2).

Modelling context dependent effects

As previously described with reference to Figure 1, the base calling method described herein improves base calling of simultaneous sequencing data by considering base context effects. The base context of a current nucleobase corresponds to the at least one preceding and/or succeeding nucleobase surrounding the current nucleobase. It has been found that the base context of a current nucleobase results in adjustments to an intensity signal obtained based upon the current nucleobase.

Base context effects are further illustrated in Figure 13A, which is a scatter plot of intensity signals from a plurality of clusters corresponding to an added nucleobase of a single polynucleotide sequence portion of interest (i.e. not a combined signal but rather a single respective signal based on either one of the first or second polynucleotide portions of interest 501a, 501b). As shown in Figure 13A the clusters that are called as base A at a given sequencing cycle have different base context, namely, AGA, CGA and AAA. The prior bases AG, CG and AA are identified at prior sequencing cycles. The chemistry modulation effects caused by different prior bases AG, CG and AA lead to substantial variation in the intensity profiles corresponding to the signal obtained based upon the incorporation of nucleobase into a cluster at both color/intensity channels. These base context-specific variations can cause miscalls, especially when the intensity profile of a target cluster to be called is close to a decision boundary, i.e., between two intensity distributions of different bases, for example, bases A and C, bases A and T. Figure 13A shows an exemplary decision boundary 1300 which delimits intensity profiles that may be base called as a C. It can be seen that some clusters with an added base A and base context “AA” or “CG” may be miscalled as C due to context dependent adjustments to the intensity profile emitted based upon the nucleobase A.

Figure 13B further illustrates the base context effects for all base calls A, C, G and T. It is noted that Figure 13B is based on a different 2-channel dye labeling scheme compared to Figure 13A. Figure 13B illustrates exemplary adjustments to intensity profiles corresponding to one of four bases A, G, C and T based on the identity of the two preceding bases. The intensity distributions represented by dashed circles 1310, 1320, 1330 and 1340 correspond to each of the four bases A, C, G and T respectively. Within each of the respective four intensity distributions 1310, 1320, 1330 and 1340 are 16 smaller distributions, each corresponding to a respective base context. Since the base context in this example comprises the two preceding bases, there are 4² = 16 base context possibilities. The position of each of the sixteen sub-distributions within each of the respective larger distributions 1310, 1320, 1330 and 1340 corresponds to the adjustments to signal intensity based upon the particular base context indicated by the respective sub-distribution. For example, sub-distribution 1311, which indicates a nucleobase A with base context “AC”, has a smaller first channel intensity and a larger second channel intensity, when compared to sub-distribution 1312 indicating a nucleobase A with base context “GT”.

Context dependent signal modulation model

Referring back to the base calling method of Figure 1, the mappings 105, which may be transformation matrices, provide adjustments to signal intensity based upon base context effects. The mappings 105 may be based on a context dependent signal modulation model (CDSM). Figure 14 schematically illustrates operations performed by CDSM 1400 on an input sequence of L binarised intensity values corresponding to base calls of respective nucleobases 1402 to generate adjusted per-sequence cycle intensity values 1442 according to the base context of the respective nucleobase. Since each nucleobase in a sequence has its own respective base context, the adjustments are specific to each of the one or more nucleobases in the polynucleotide sequence. The transformation of each of the L intensity values 1402 to result in an adjusted per-sequence cycle intensity value 1442 is achieved by first performing a context dependent correction and then a context dependent phasing correction, each of which will be described below in further detail. This complete transformation is equivalent to applying a mapping 105 to each of the respective intensity values 1402 to result in adjusted per-sequence cycle intensity values 1442.

As illustrated in Figure 14, base calls for a known polynucleotide sequence of length L can be represented as an L x 2 matrix of encoded base calls 1402, where each row corresponds to a respective nucleotide of the sequence and is represented by one of [0,0], [1,0], [0,1] and [1,1], each of which corresponds to one of the four nucleotides G, T, C, and A respectively. When sequencing is performed using a two-channel system, the entries in the L x 2 matrix can be considered to correspond to binarized intensity signals detected in the first channel and the second channel respectively.

The CDSM model 1400 can be configured to receive as input any number of preceding and/or succeeding nucleobases as the context of a nucleobase. In other words, the CDSM model 1400 can be configured to process any window of length k around the added nucleobase of the current sequencing cycle, wherein the window includes the added nucleobase. A sequence of k nucleobases comprising a nucleobase X to be base called and the one or more preceding and/or succeeding nucleobases is defined as a k-mer of length k. Since there are four possible nucleobases, there are 4^k different possible k-mers of length k. For example, if the base context of a nucleobase X to be base called comprises three preceding nucleobases and one succeeding nucleobase, the k-mer sequence is a 5-mer represented as KKKXK and there are 4⁵ different possible 5-mers.

At 1410, the CDSM model 1400 is configured, for a given length k, to generate a k-mer specific time series 1412 for each of the possible 4^k k-mers based on the sequence of encoded base calls 1402. Each k-mer specific time series indicates at which sequencing cycle a particular nucleobase is added with the specific base context denoted by the k-mer. Each of the 4^k timeseries 1412 can be stored as a matrix with dimension L x 2, where the rows correspond to sequencing cycles and the columns represent the signals detected in the first and second channel respectively. By summing all of the 4^k timeseries 1412, the sequence of encoded base calls 1402 would be recovered. It should be noted that if a specific k-mer does not appear over a sequencing run, there will not be any values registered in the corresponding k-mer specific time series 1412.
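The construction of the k-mer specific time series can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions: the encoding is the one given for the encoded base calls 1402 (G, T, C, A as [0,0], [1,0], [0,1], [1,1]); only k-mers that actually occur are stored, since the remaining series would be all zeros; and cycles at the start or end of the sequence that lack a complete context window are skipped, a choice the text does not specify. The function names are hypothetical.

```python
# Encoding given for the encoded base calls 1402: G, T, C, A -> [0,0], [1,0], [0,1], [1,1].
ENCODING = {"G": (0, 0), "T": (1, 0), "C": (0, 1), "A": (1, 1)}


def kmer_time_series(sequence, n_prev=2, n_next=0):
    """Map each observed k-mer to an L x 2 list that is zero except where it occurs."""
    length = len(sequence)
    series = {}
    for cycle in range(n_prev, length - n_next):
        kmer = sequence[cycle - n_prev: cycle + n_next + 1]
        rows = series.setdefault(kmer, [[0, 0] for _ in range(length)])
        rows[cycle] = list(ENCODING[sequence[cycle]])
    return series


def sum_series(series, length):
    """Element-wise sum of all k-mer specific time series."""
    total = [[0, 0] for _ in range(length)]
    for rows in series.values():
        for cycle, (c1, c2) in enumerate(rows):
            total[cycle][0] += c1
            total[cycle][1] += c2
    return total


sequence = "AGACGA"
series = kmer_time_series(sequence)          # 3-mers: two preceding bases plus the current base
print(sorted(series))
print(sum_series(series, len(sequence)))
```

For every cycle with a complete context window, the summed series reproduces the corresponding row of the encoded base calls, consistent with the recovery property described above.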

At 1420, the CDSM model 1400 is configured to transform each of the k-mer specific time series 1412 to adjust the signal intensity based on the corresponding base context, resulting in a transformed k-mer specific time series 1422, which can be stored as a matrix with dimension L x 2. Each k-mer specific time series 1412 is transformed based on a respective k-mer specific transformation matrix, and there are therefore 4^k k-mer specific transformation matrices in total. Each k-mer specific time series 1412 is transformed by multiplying each of the binarised intensity values in the k-mer specific time series 1412 by the corresponding k-mer specific transformation matrix. Each k-mer specific transformation matrix adjusts the binarised signal intensity values for a single nucleobase so that they are no longer integer values, thereby modelling the context dependent adjustments to signal intensity. Each of the k-mer specific transformation matrices takes a binarised signal intensity for a single nucleobase as input and outputs an adjusted signal intensity for that nucleobase. In the present embodiment, the binarised signal intensities for a single nucleobase can be stored as a 1x2 vector and each of the k-mer specific transformation matrices has a dimension of 2x2. The output adjusted signal intensity can be stored as a 1x2 vector. A binarized k-mer-specific identifier can be used as a lookup index to identify the corresponding k-mer-specific matrix in order to perform the transformation.

Consider as an example a tetramer context of ATGC (G being the base at a current sequencing cycle). The k-mer-specific 2 x 2 matrix M corresponding to tetramer context ATGC is identified using the binarized time series as a lookup index. The binarized intensities of base G at the current sequencing cycle are in a vector form b. Accordingly, the transformed time series with adjusted intensities i in a vector form is calculated using the following: i = M x b

All of the coefficients within matrix M that map binarized intensities b to adjusted intensities i are learnable through backpropagation, such that the gradient update can be applied to the entries of matrix M. When a k-mer-specific 2 x 2 matrix M is used, the CDSM model can perform a linear transformation. The CDSM model can also perform non-linear transformations. For example, the CDSM model can use a k-mer-specific 3 x 3 matrix and perform an affine transformation to generate predicted k-mer-specific centroids. In order to multiply each of the L x 2 dimensional k-mer specific time series matrices by a 3x3 matrix, the k-mer specific time series 1412 may be extended by one column by adding a 1 or c value to each row, depending on whether an inverse or the forward transform is to be used. The value c represents a parameter that is learnable through backpropagation. For an affine transformation, a vector [x,y,1] or [x,y,c] is multiplied by a 3 x 3 matrix. Figures 15A, 15B, 16A and 16B, described below, illustrate example k-mer specific time series and transformed k-mer specific time series after transformation.
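By way of non-limiting illustration, the relationship i = M x b and its affine counterpart may be sketched as follows. The matrix coefficients are hypothetical values chosen only for the example, not learnt parameters, and the column-vector convention is an assumption of the sketch. The example also shows that a purely linear 2 x 2 matrix leaves the binarized vector [0,0] of base G at the origin, whereas the offset column of a 3 x 3 affine matrix can move it.

```python
def apply_matrix(matrix, vector):
    """Return matrix @ vector for a square matrix given as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical coefficients for one k-mer context (illustrative values only).
M_LINEAR = [[0.85, 0.05],
            [0.02, 0.90]]
M_AFFINE = [[0.85, 0.05, 0.03],
            [0.02, 0.90, 0.04],
            [0.00, 0.00, 1.00]]

b_c = [0, 1]                                  # binarized intensities of base C
i_c = apply_matrix(M_LINEAR, b_c)             # linear case: i = M x b

b_g = [0, 0]                                  # binarized intensities of base G
i_g_linear = apply_matrix(M_LINEAR, b_g)      # remains at the origin
# Affine case: extend the vector to [x, y, 1] and keep the first two outputs.
i_g_affine = apply_matrix(M_AFFINE, b_g + [1])[:2]
```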

At 1430, the CDSM model 1400 can be configured to apply a k-mer specific phasing correction 1406 to each of the transformed k-mer specific time series 1422. In an ideal sequencing-by-synthesis (SBS) process, the lengths of all nascent strands, or polynucleotide sequence portions, within an analyte would be the same. Imperfections in the cyclic reversible termination (CRT) chemistry create stochastic failures that result in nascent strand length heterogeneity. In other words, a loss of synchrony occurs in which the incorporation of a nucleobase into a given strand falls out of phase with the incorporation of a nucleobase into the majority of the remaining strands within a cluster. One example is the phasing effect, where an oligonucleotide in a cluster does not incorporate a nucleotide in some of the sequencing cycles and therefore lags behind other oligonucleotides. Figures 17A and 17B, described below, illustrate an example of phasing and prephasing effects.

To correct for the phasing effect, the CDSM model 1400 can apply k-mer-specific phasing coefficients 1406 to the k-mer-specific time series and generate corrected k-mer-specific time series 1432. The k-mer-specific phasing coefficients are modelled as being k-mer-dependent. Each of the k-mer-specific time series has a corresponding k-mer-specific coefficient for phasing. Each of the resultant 4^k corrected k-mer specific time series 1432 has a dimension L x 2. A binarized k-mer-specific identifier can be used as a lookup index to identify the corresponding k-mer-specific phasing correction value 1406 in order to perform the phasing correction. At 1440, the CDSM model 1400 is configured to merge the 4^k corrected k-mer specific time series 1432 to generate predicted per-sequencing cycle intensity values 1442, which may be stored as an L x 2 matrix. The merge may be performed using the sum operator. The 4^k corrected k-mer specific time series 1432 are the output of the CDSM model 1400.

The transformation matrices and phasing coefficients are learnable parameters which can be learnt through backpropagation. In alternative embodiments, the CDSM model 1400 can be configured to directly transform the k-mer specific time series 1402 into the predicted per-sequencing cycle intensity values 1442 using convolution kernels. The coefficients of the convolutional kernels can be optimized through backpropagation.

Stages 1420 and 1430 of the CDSM model 1400 can be summarised as a single stage comprising a multiplication of a k-mer specific time series 1412 with the corresponding k-mer specific transformation matrix 1404 followed by the multiplication of the corresponding k-mer specific phasing coefficient 1406. The k-mer specific transformation matrices 1404, optionally multiplied by a corresponding k-mer specific phasing coefficient 1406, are referred to as mappings M_θ 105. In other words, the mappings 105 provide a context dependent signal correction and optionally a context-dependent phasing correction.
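By way of non-limiting illustration, the single-stage view described above may be sketched as follows for a context of two preceding nucleobases. Because each sequencing cycle is registered in exactly one k-mer specific time series, merging with the sum operator is equivalent to applying, at each cycle, the mapping of the k-mer observed at that cycle. The function name, the default of an identity matrix and unit coefficient for k-mers without an entry, and the handling of the first cycles (left unadjusted) are assumptions of this sketch.

```python
ENCODING = {"G": (0, 0), "T": (1, 0), "C": (0, 1), "A": (1, 1)}
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]


def cdsm_predict(sequence, matrices, phasing, preceding=2):
    """Return per-cycle intensities: phasing[kmer] * (matrices[kmer] x b)."""
    predicted = []
    for cycle, base in enumerate(sequence):
        b = ENCODING[base]
        if cycle < preceding:
            # Assumption: no full context yet, so the row is left unadjusted.
            predicted.append([float(b[0]), float(b[1])])
            continue
        kmer = sequence[cycle - preceding:cycle + 1]
        m = matrices.get(kmer, IDENTITY)   # k-mer specific transformation matrix
        c = phasing.get(kmer, 1.0)         # k-mer specific phasing coefficient
        predicted.append([c * (m[0][0] * b[0] + m[0][1] * b[1]),
                          c * (m[1][0] * b[0] + m[1][1] * b[1])])
    return predicted


# Hypothetical parameters for the 3-mer AGC only (illustrative values).
matrices = {"AGC": [[1.0, 0.0], [0.0, 0.85]]}
phasing = {"AGC": 0.9}
predicted = cdsm_predict("TAGC", matrices, phasing)
```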

Figure 15A illustrates a 3-mer specific time series over 151 sequencing cycles of a sequencing run corresponding to k-mer specific time series 1412 of Figure 14 for a sequence AGC. The white bars represent binarized intensities in the first color/intensity channel where incorporation of a nucleobase C with the context AG occurs and it can be seen that a C with context AG occurs at sequencing cycles 8, 22, 75 and 121 respectively. That is, 3-mer AGC is present at sequencing cycles 8, 22, 75 and 121 and is absent at remaining sequencing cycles. Figure 15B illustrates a transformed k-mer specific timeseries corresponding to 1422 of Figure 14 after the transformation 1420 of Figure 14. It can be seen that the intensity values on the first color/intensity channel have been transformed by the corresponding 2x2 transformation matrix to have a value of approximately 0.85.

Figure 16A illustrates a 3-mer specific time series for sequence GGT corresponding to the incorporation of a nucleobase T with the context GG over 151 sequencing cycles of a sequencing run. The black bars represent the binarized intensities in the second color/intensity channel and it can be seen that a T with the context GG occurs at sequencing cycles 38, 62, 103 and 148 respectively. That is, 3-mer GGT is present at sequencing cycles 38, 62, 103 and 148 and is absent at remaining sequencing cycles. Figure 16B illustrates the transformed k-mer specific time series 1422. It can be seen that the intensity values on the second color/intensity channel have been transformed by the corresponding 2x2 transformation matrix to have a value of approximately 0.75.

It can be appreciated by the skilled person that the binarized intensities as illustrated in Figures 15A, 15B, 16A and 16B are for illustrative purposes. The binarized intensities can represent any of the bases within its corresponding k-mer context. Moreover, the binarized intensities of base G in the k-mer contexts may be shown as zero at both color/intensity channels and the binarized intensities of base A in the k-mer contexts may be shown as one at both color/intensity channels.

Figures 17A and 17B illustrate an example of the phasing and prephasing effects. Figure 17A shows that some strands of an analyte lead (1700) while others lag behind (1701), leading to a mixed signal readout of the analyte, illustrating how individual polynucleotide sequence portions can either fall behind (phasing) or jump ahead (prephasing) of the majority of the polynucleotide sequence portions in a cluster. Figure 17B depicts the intensity output of the second intensity channel corresponding to the incorporation of a “C” into a cluster every 15 cycles in a heterogeneous background. With reference to the sub-panel 1710 within Figure 17B, which shows the signal intensity around cycle 15, notice the anticipatory signals (arrow 1712) and memory signals (arrows 1714) due to the phasing and prephasing effect. If there were no phasing and prephasing effects, there would not be any signal at cycles 14, 16, 17 and 18 since the nucleobase “C” would be incorporated into all of the strands of the cluster at the same cycle. However, since in some strands the “C” is incorporated either earlier or later than cycle 15, there is some signal present in the window around cycle 15. A fading effect can also be seen in Figure 17B. Fading, or signal decay, is an exponential decay in fluorescent signal intensity as a function of sequencing cycle number. Fading, or signal decay, will be described in further detail below in relation to emission stack effects.

Emission stack effects

The second base calling operation 104 of Figure 1 may perform an adjustment on the intensity data 102 in order to correct for cluster and machine dependent effects, also referred to as emission stack effects EM_θ. The emission stack effects, EM_θ, capture modifications due to cluster and sequencer dependent effects to the signal intensities emitted by the added nucleotides in a cluster for the respective first and second polynucleotide sequence portions 501a, 501b. Cluster dependent modifications correspond to chemistry dependent effects that may be specific to a respective cluster. Sequencer dependent effects can correspond to aspects of the imaging process, such as the settings of the imaging equipment and any heterogeneities in illumination or image collection. The emission stack effects can comprise a phasing/prephasing effect, background effects, decay effects, scale effects, further background effects, camera gain, and/or laser ramp. Examples of background correction, decay correction and phasing/prephasing correction are described in U.S. Patent No. 11,423,306 and U.S. Patent Publication No. US2020/0364565A1 which are herein incorporated by reference.

Figure 18 schematically shows processing to obtain an adjusted signal with emission stack effects removed from an input signal. EM_θ application 1801 takes as input a signal 1802 obtained based on the first and second polynucleotide portions and outputs an adjusted signal 1803, which comprises a version of the input signal modified to remove various effects.

The effects removed from the signal may comprise a phasing/prephasing effect that captures cluster dependent phasing/prephasing effects. Phasing/prephasing effects have been described previously in relation to Figures 17A and 17B. In contrast to the phasing/prephasing coefficients of the CDSM model 1400, the phasing/prephasing effect is base context independent and may be associated with the particular sequencing process. In some embodiments, the phasing/prephasing effect may be included in EM_θ if the k-mer specific phasing coefficients have been omitted from the CDSM model 1400. In other embodiments, both the k-mer specific phasing coefficients and the phasing/prephasing component of EM_θ may be included in the CDSM-16QAM model.

Background effects that model background variation may additionally or alternatively be removed by processing 1801. Background intensity of a particular sensor is relatively steady between cycles, but varies across the sensors. Positioning of the illumination source, which can vary by illumination color, creates a spatial pattern of background variation over a field of the sensors. Manufacturing differences among the sensors have been observed to produce different background intensity readouts, even between adjoining sensors. In a first approximation, idiosyncratic variation among sensors can be ignored. In a refinement, the idiosyncratic variation in background intensity among sensors can be taken into account. Background intensity can be a constant parameter to be fit, either overall or per pixel. Alternatively, different background intensities can be taken into account and corrected for accordingly.

A decay effect that models signal decay, for example, fading of the intensities of the fluorophores that are incorporated into the template sequences during the sequencing- by-synthesis process, may additionally or alternatively be removed from the signal by processing 1801. As sequencing proceeds, accurate base calling becomes increasingly difficult, because signal strength decreases and noise increases, resulting in a substantially decreased signal-to-noise ratio. It has been observed that later synthesis steps attach tags in a different position relative to the sensor than earlier synthesis steps. When the sensor is below a sequence that is being synthesized, signal decay results from attaching tags to strands further away from the sensor in later sequencing steps than in earlier steps. This causes signal decay with progression of sequencing cycles.

A scale effect that models the variations in the intensities of clusters may additionally or alternatively be removed by processing 1801. When clusters are immobilized on the surface of the flow cell, their size and shape may vary. A larger-sized cluster includes more template oligonucleotides than a small-sized cluster and thus, may show higher intensity values when more fluorophores are incorporated into the oligonucleotides. The scale effect models the difference in the scale of the intensities of clusters.

In some embodiments, at least one of the phasing/prephasing effect, background effects, decay effects, scale effects, further background effects, camera gain, and laser ramp can be iteratively learned by training the base calling system. Each of the components of EM_θ can involve learnable and cluster-dependent parameters, that is, each cluster or a batch of clusters can have a particular set of learnable parameters used to correct for inter-cluster intensity variations. During the training process of the base calling system where these parameters are iteratively optimized, the transformation parameters and context-dependent phasing parameters in the CDSM model 1400 can be locked. In other words, the base calling system does not learn the chemistry effects caused by base context but leverages the optimized transformation parameters and context-dependent phasing parameters. It will be appreciated that while various effects are described above, given that EM_θ may be learned (as described in further detail below), the effects that are adjusted for will typically be learned in combination based upon training data and additional or alternative effects may be modelled.

Modelling simultaneous sequencing with context dependent effects

In order to perform the second base calling operation 104 referred to in Figure 1, the parameters of the CDSM-16QAM model, comprising mappings 105, and optionally the emission stack effects 107 and mixing fraction 108, may be learnt.

In embodiments, the parameters can be initially estimated and then updated using learning techniques that are typically used for training a machine learning model. A “machine learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data. For example, a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine-learning models include various types of decision trees, logistic regressions, linear regressions, random forests, support vector machines, Bayesian networks, or neural networks. The learning technique can be a backpropagation method that is typically used to update the parameters of a neural network. A backpropagation method can operate as follows: given a training dataset, a forward pass sequentially computes the output and propagates the function signals forward through the model. In the final output layer, an objective loss function measures the error between the inferred outputs and the given labels. To minimize the training error, the backward pass uses the chain rule to backpropagate error signals and compute gradients with respect to all parameters throughout the model. Finally, the parameters are updated using optimization algorithms based on stochastic gradient descent. Whereas batch gradient descent performs parameter updates for each complete dataset, stochastic gradient descent provides stochastic approximations by performing the updates for each small set of data examples. Several optimization algorithms stem from stochastic gradient descent. For example, the Adagrad, Adam and Levenberg-Marquardt training algorithms perform stochastic gradient descent while adaptively modifying learning rates based on update frequency and moments of the gradients for each parameter, respectively.

In some embodiments, a computational graph is constructed based on the CDSM-16QAM model and the parameters of the computational graph, which correspond to the parameters of the CDSM-16QAM model, are updated by applying backpropagation techniques. The computational graph may be a minimal network and thus the parameters of the graph can be updated in a computationally efficient manner.

Training 1900 of the computational graph based on the CDSM-16QAM model is schematically illustrated in Figure 19. At 1910, sequence information is input to the computational graph. The inputs to the computational graph are sets of base called sequences X₁ and X₂ that have been sequenced during a simultaneous sequencing run. A training set of base called sequences X₁ and X₂ and associated ground truth may be used to train the model. The ground truth corresponds to the instrument signals obtained for a simultaneous sequencing run performed on the sequences X₁ and X₂. The number of sets of already base called sequences used as training samples can be 10-50, 50-200, 200-500, 500-1000, 1000-2000 and so on. For example, the training samples can include 512 or 1024 sets of sequences X₁ and X₂. Initial estimates are provided for all of the parameters of the CDSM-16QAM model. An estimate of the mixing fraction α for each cluster can be made based on the estimated ratio of first polynucleotide sequence portions 501a to second polynucleotide sequence portions 501b in that cluster. For example, in embodiments where sequencing was performed on samples which were prepared to have roughly a 2:1 ratio of first polynucleotide sequence portions to second polynucleotide sequence portions in a cluster, α can be initialized as 0.66666. The parameters of some of the components of EM_θ, such as the background and scale, can be initialized based on calculations performed on the instrument signals. The decay coefficient of EM_θ can be initialized to 0. The 4^k transformation matrices can be initialized with a 2x2 identity matrix and the 4^k k-mer specific phasing coefficients can be initialized to the same value.

At stage 1920, during the forward pass, a predicted instrument signal is generated for a set of sequences X₁ and X₂ 1910 based on the current estimates for the parameters. The predicted instrument signal is generated by first performing the splitting, transformation, (optional) phasing correction and merging steps of the CDSM model as described previously in relation to Figure 14 to result in context dependent per-sequencing cycle intensity values M_θ(X₁) and M_θ(X₂). Subsequently, the context dependent per-sequencing cycle intensity values are combined according to α * M_θ(X₁) + (1 - α) * M_θ(X₂) to result in a combined context dependent signal. The emission stack effect model EM_θ is then applied to the combined context dependent signal to result in a predicted per-sequencing cycle instrument signal y = EM_θ[α * M_θ(X₁) + (1 - α) * M_θ(X₂)] for the input sequences X₁ and X₂.
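By way of non-limiting illustration, the forward pass may be sketched as follows, taking the context dependent intensity values M_θ(X₁) and M_θ(X₂) as given L x 2 lists. The functional form chosen here for EM_θ (a scale, an exponential decay over cycles and a constant background) is an assumption made only for this sketch; the function and parameter names are hypothetical.

```python
import math


def emission_stack(signal, scale, background, decay):
    """Assumed EM form: scale * exp(-decay * cycle) * x + background."""
    return [[scale * math.exp(-decay * n) * x + background for x in row]
            for n, row in enumerate(signal)]


def predict_instrument_signal(m_x1, m_x2, alpha, scale, background, decay):
    """y = EM[alpha * M(X1) + (1 - alpha) * M(X2)], evaluated per cycle and channel."""
    combined = [[alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]
                for r1, r2 in zip(m_x1, m_x2)]
    return emission_stack(combined, scale, background, decay)


# Illustrative inputs: two cycles, two channels, decay initialized to 0.
m_x1 = [[1.0, 1.0], [0.0, 1.0]]
m_x2 = [[0.0, 0.0], [1.0, 0.0]]
y = predict_instrument_signal(m_x1, m_x2, alpha=0.75,
                              scale=2.0, background=0.1, decay=0.0)
```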

At stage 1930, the predicted per-sequencing cycle instrument signals y are compared to corresponding instrument signals associated with the sequences X₁ and X₂, and a loss is determined. At stage 1940, the loss is then backpropagated through the computational graph and the model parameters, including the parameters of EM_θ, α, the k-mer specific transformation matrices and the k-mer specific phasing coefficients, are adjusted accordingly using one of the optimization algorithms described previously.
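By way of non-limiting illustration, the loop of stages 1920 to 1940 may be sketched for a single parameter, the mixing fraction α. To keep the sketch short, EM_θ is omitted, the mappings are held fixed, and the gradient of the mean squared error is written out by hand rather than obtained from a computational graph; plain gradient descent stands in for the optimization algorithms mentioned above. All names and values are hypothetical.

```python
def fit_alpha(m_x1, m_x2, observed, alpha=0.5, learning_rate=0.1, iterations=200):
    """Fit the mixing fraction by gradient descent on a mean squared error loss."""
    n = sum(len(row) for row in observed)
    for _ in range(iterations):
        grad = 0.0
        for r1, r2, ro in zip(m_x1, m_x2, observed):
            for a, b, y in zip(r1, r2, ro):
                pred = alpha * a + (1 - alpha) * b       # forward pass
                grad += 2 * (pred - y) * (a - b) / n     # d(loss)/d(alpha)
        alpha -= learning_rate * grad                    # parameter update
    return alpha


# Synthetic "instrument signal" generated with a mixing fraction of 0.68.
m_x1 = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
m_x2 = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
observed = [[0.68 * a + 0.32 * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(m_x1, m_x2)]
estimate = fit_alpha(m_x1, m_x2, observed)
```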

Returning to stage 1920, a forward pass of the computational graph is performed based on the updated parameters to provide a new estimate of the predicted per-sequencing cycle instrument signals y for the input sequences X₁ and X₂. Stages 1930, 1940 are then performed. Iterations of stages 1920, 1930 and 1940 are performed. The model may then be further trained on additional sets of sequences X₁ and X₂ and associated instrument signal data according to stages 1910-1940. The model may be trained until the values of the k-mer specific transformation matrices and k-mer specific phasing coefficients have converged. Once this has occurred, the weights of the k-mer specific transformation matrices and the k-mer specific phasing coefficients are locked so that these parameters are no longer learnable. Since many training iterations may be performed until convergence, in some embodiments, training is performed offline.

During backpropagation through the computational graph, the gradients can flow backward through the network through the merge step of the CDSM model involving the sum operator, and all of the upstream parameters can be updated. An example of how gradients flow backwards through a sum operator is as follows. During backpropagation the backward pass computes the gradients with respect to the inputs of each node in the computational graph. The sum operation takes the gradient on its output and broadcasts it equally to all of its inputs, regardless of what the input values were during the forward pass. This follows from the fact that the local gradient for the sum operation is simply +1.0. As a result of applying the chain rule, the gradients on all inputs are equal to the gradient on the output multiplied by 1.0 and thus remain unchanged.
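By way of non-limiting illustration, the behaviour of the sum operator during the backward pass may be sketched as follows. A squared-error loss on the output of the sum is assumed purely for the example, and the function names are hypothetical.

```python
def loss_of_sum(inputs, target):
    """Forward pass: sum the inputs, then take a squared error against target."""
    return (sum(inputs) - target) ** 2


def backward_through_sum(inputs, target):
    """Backward pass: every input receives the upstream gradient times 1.0."""
    upstream = 2 * (sum(inputs) - target)     # d(loss)/d(output of the sum)
    return [upstream * 1.0 for _ in inputs]   # local gradient of the sum is +1.0


inputs = [0.2, 1.5, -0.7]
gradients = backward_through_sum(inputs, target=0.5)
```

All entries of `gradients` are identical, irrespective of the individual input values.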

Figure 20A shows a plot of the loss evaluated by the loss function, which is the mean squared error between the predicted signal y and the actual instrument signal for a plurality of clusters for 102 cycles of a sequencing run, over 1000 training iterations. As can be seen in Figure 20A, the loss converges after around 800 iterations, showing that after around 800 iterations the model is able to predict the instrument signal. A residual loss can still be seen at 1000 iterations, which can be attributed to the fact that the loss is an aggregate loss value over a plurality of clusters, some of which do not comprise any signal. In addition, some of the residual loss may be due to noise in the sequencing process or systematic effects that are not included in the model.

Figure 20B shows intensity profiles generated by training the model based on a plurality of known sequences X₁ and X₂ and the corresponding actual instrument signals, then generating a predicted signal based on the known sequences and learnt parameters. Each of the first 11 panels shows the predicted per-sequencing cycle intensity values for a tile of clusters at cycle number 1, 11, 21, 31, 41, 51, 61, 71, 81, 91 and 101 respectively. It can be seen that the model predicts the sixteen cloud distribution to converge over successive sequencing cycles. This predicted behavior is in accordance with what would typically be seen in an actual sequencing run. This shows that the parameters of the model have converged towards values which are representative of a real-world system.

Figure 21A shows “expected” (i.e. predicted by the model) and observed per-sequencing cycle intensity values for 102 cycles of a sequencing run based on a single cluster of first and second polynucleotide sequence portions of interest with α (the mixing fraction) equal to 0.68 after training the model. The x-axis defines the cycle number and the y-axis defines the signal intensity in the first and second channel.

Figure 21B shows a scatter plot of the observed against the expected signals for the first channel (left panel) and second channel (right panel) respectively based on the expected and observed per-sequencing cycle intensity values from Figure 21A. It can be seen from Figure 21B that the expected signals and the observed signals are highly correlated. This again shows that the parameters of the model have converged towards values which are representative of a real-world system. It can be seen that the data points in each of the scatter plots can be separated into four groups. This is due to the mixing fraction being close to 0.66 and thus the intensity data corresponds to a sixteen cloud distribution.

Figure 21C shows predicted and observed per-sequencing cycle intensity values for 102 cycles of a sequencing run based on a single cluster of first and second polynucleotide sequence portions of interest but with α (the mixing fraction) equal to 0.8.

Figure 21D shows a scatter plot of the predicted versus the observed per-sequencing cycle intensity values for the first channel (left panel) and second channel (right panel) respectively based on the expected and observed per-sequencing cycle intensity values from Figure 21C. It can again be seen that the predicted and observed signals are highly correlated, showing that the parameters of the model have converged towards values which are representative of a real-world system. It can be seen that the data points in each of the scatter plots can be separated into two groups. This is due to the mixing fraction being equal to 0.8. The signal contribution from the first polynucleotide sequence portion is dominant and therefore the intensity profiles tend towards a four cloud distribution, which is what would usually be seen when a cluster contains multiple copies of a single polynucleotide portion of interest. As can be seen from Figures 21A-21D, the mixing fraction component of the CDSM-16QAM model has a large impact on the per-sequencing cycle intensity values.

Base calling

Once the k-mer specific transformation matrices and k-mer specific phasing coefficients (i.e. the components of the mappings M_θ) have been learned, a cluster can be base called based upon the CDSM-16QAM model. Since base calling requires values for the cluster specific parameters of the model, base calling and parameter estimation may be performed iteratively using initial coarse estimates for the cluster specific parameters and base calls, and refining the estimates over subsequent iterations using a method similar to the training method described previously but with the coefficients for the k-mer specific transformation matrices and k-mer specific phasing coefficients locked.

The intensity data can comprise per-sequencing cycle instrument signals corresponding to one or more respective clusters, each of the respective clusters comprising a plurality of polynucleotide molecules comprising multiple copies of respective first and second polynucleotide sequence portions 501a, 501b.

A method of base calling 2200 is illustrated in Figure 22. At step 2210, instrument signal y and mappings M_θ are received. At step 2220, initial parameter estimates are made for the cluster dependent parameters of EM_θ and α for each of the respective clusters in the sequencing data. The mappings M_θ, which provide the k-mer specific transformation matrices optionally modified by the corresponding k-mer specific phasing coefficients, can be obtained based upon the training method 1900 described previously.

At step 2230, the per-sequencing cycle instrument signal for each respective cluster is first corrected by applying the inverse of EM_θ, and coarse base calls are then determined for each of the sequences X₁ and X₂ for each of the respective clusters. The base calling performed at this stage is “coarse” in the sense that a simple and computationally efficient base calling method can be used, such as the simple base calling method described in relation to Figure 9B.

At step 2240, a forward pass of the computational graph based on the CDSM-16QAM model is performed based on the mappings M_θ and the current estimates for EM_θ, α, X₁ and X₂, and a predicted signal for each of the clusters is determined. The loss function is evaluated by comparing the predicted per-sequencing cycle intensity values for each cluster with the actual instrument signal corresponding to that cluster, and the estimates for EM_θ and α are updated through backpropagation. The forward pass and backpropagation steps can be performed multiple times, e.g. 20, 30, 40, 50 times, and so on, to update the estimates for the parameters EM_θ and α.

At stage 2250, refined bases are determined for X₁ based on the updated parameter estimates. Each of the nucleobases of X₁ can be base called based on the respective per-sequence cycle intensity value and a context-independent base calling method such as the GMM model as described previously. Since X₁ is the dominant read, accurate base calls can be made without considering base context.

Refined base calls for nucleobases X₂ are obtained by considering base context. At 2250, a more accurate base call for X₂ is determined. In order to base call a nucleobase of X₂ from a sequencing cycle N, equation (1) is rearranged to result in:

M_θ(X₂) = [inverse(EM_θ)(y) - α * M_θ(X₁)] / (1 - α) (2)

The right-hand side of equation (3) is evaluated based on the actual per-sequencing cycle intensity values for sequencing cycle N, the current estimates for EM_θ and α, the base call for X₁ at sequencing cycle N, and the base context of the base call for X₁. The appropriate mapping for X₁ is selected from all possible 4^k mappings 608 based on its base context. The right-hand side of equation (2) can be considered to be an adjustment of intensity data y, where the adjustments are based on the base call of X₁, the corresponding mapping, inverse(EM_θ), and α, i.e.

M_θ(X₂) = adjusted(y) (3)

The left-hand side of equation (2) is compared to the adjusted intensity data in order to base call X₂. In the present embodiment, the intensity data is 2-channel intensity data represented as a 1x2 array and the adjusted intensity data is also a 1x2 array. It is not possible to simply apply the inverse of the CDSM model 1400 to base call X₂ because this requires knowledge of which of the 4^k mappings corresponds to X₂, which cannot be determined because the base call for X₂ is unknown. In order to base call X₂, a subset of the plurality of mappings is first selected based on the base context of X₂. The base context of X₂ is known from the coarse base calling step. An example of selecting a subset of the plurality of mappings is as follows: if the base context comprises the base calls of the two preceding cycles represented by 3-mer KKX and the two preceding bases are CG, the mappings corresponding to CGA, CGG, CGC and CGT are selected. For each of the four selected mappings M^bc, each corresponding to one of the four possible base calls for X₂, the base calling system computes a binarized base call for X₂, bc(X₂), by applying the inverse of the associated mapping M^bc to the adjusted intensity data: bc(X₂) = inverse(M^bc) * adjusted(y) (4)

The correct base call for X₂ is considered to be the one for which the corresponding binarized base call bc(X₂) is closest to its own rounded version, i.e. the one giving the lowest value of the following expression:

x(X₂) = norm(bc(X₂) - round(bc(X₂))) (5)
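Equations (4) and (5) can be illustrated with a short Python sketch. It assumes, purely for illustration, that each candidate mapping is an invertible 2x2 matrix acting on the 1x2 adjusted intensity array as a row vector, and that the norm is Euclidean. The function names and the example matrices are hypothetical and are not taken from the disclosure.

```python
from math import hypot

def invert_2x2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mapping is singular and cannot be inverted")
    return ((d / det, -b / det), (-c / det, a / det))

def apply_row(y, m):
    """Multiply a 1x2 row vector by a 2x2 matrix."""
    return (y[0] * m[0][0] + y[1] * m[1][0],
            y[0] * m[0][1] + y[1] * m[1][1])

def rounding_residual(bc):
    """Equation (5): Euclidean distance between bc and its rounded version.
    Python's round() sends exact halves to the nearest even integer."""
    return hypot(bc[0] - round(bc[0]), bc[1] - round(bc[1]))

def call_base(adjusted_y, candidate_mappings):
    """Evaluate equation (4) for each candidate k-mer mapping and return the
    base (last letter of the best k-mer) with the lowest residual."""
    if not candidate_mappings:
        raise ValueError("at least one candidate mapping is required")
    best_kmer, best_score = None, float("inf")
    for kmer, mapping in candidate_mappings.items():
        bc = apply_row(adjusted_y, invert_2x2(mapping))  # equation (4)
        score = rounding_residual(bc)                    # equation (5)
        if score < best_score:
            best_kmer, best_score = kmer, score
    return best_kmer[-1], best_score

# Hypothetical mappings for the four candidates sharing the context "CG".
example = {
    "CGA": ((1.0, 0.0), (0.0, 1.0)),
    "CGC": ((0.5, 0.25), (0.0, 0.5)),
    "CGG": ((2.0, 0.0), (0.0, 2.0)),
    "CGT": ((4.0, 0.0), (0.0, 4.0)),
}
base, score = call_base((0.0, 0.5), example)
```

In this toy example only the CGC mapping sends the adjusted intensity exactly onto a binary vector, so its residual is the smallest.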

Stage 2250 is evaluated for all of the nucleobases in sequence. It should be noted that the accuracy in base calling X₂ is improved not just by considering the base context of X₂ but also by considering the context-dependent effects of X₁, which further improves the signal-to-noise ratio when determining X₂.

In some instances, equations (4) and (5) can be evaluated for more than four of the plurality of mappings. This can be necessary, for example, when the base context comprises a number p of preceding bases and the current sequencing cycle is N <= p, i.e. the data does not exist to determine the base context. A similar situation arises when the base context comprises a number s of succeeding bases and the current sequencing cycle is N > T - s, where there are T sequencing cycles in total. Consider an example of base calling sequence X₂ of the target cluster at sequencing cycle 1 (N = 1) based on base context KKX. No prior base context of the target cluster can be determined for the first cycle. There are 64 mappings in total and the base calling system can evaluate the binarized base calls using each of the 64 mappings. The binarized base call bc(X₂) which results in the lowest value of x(X₂) is determined as the base call of X₂ for the target cluster. At sequencing cycle 2 (N = 2), the target cluster has a base context comprising a single known prior base call (e.g., base A) identified at sequencing cycle 1. Therefore, the base calling pipeline does not need to compare all of the sixty-four trimer-specific mappings with the current intensity data of the target cluster. Instead, the sixteen mappings with base A as the preceding base, i.e. GAG, GAT, GAC, GAA, CAG, CAT, CAC, CAA, AAG, AAT, AAC, AAA, TAG, TAT, TAC and TAA, can be selected. These mappings are used to call the target cluster at sequencing cycle number 2. For all of the subsequent sequencing cycles, there is sufficient base context information to select only four of the mappings. There may be other reasons for selecting more than four of the mappings, for example if there is some uncertainty in the base context of a current base call.
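The way the number of candidate mappings shrinks from 64 to 16 to 4 over the first three cycles can be reproduced with a small sketch. It assumes a context of p = 2 preceding bases and no succeeding bases, as in the KKX example. The function name `candidate_kmers` is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def candidate_kmers(prior_calls, p=2):
    """List the k-mers (k = p + 1) consistent with the base calls made so far.

    prior_calls holds the calls for cycles 1..N-1 of the same read. Context
    positions with no call yet are left free, so each one multiplies the
    number of candidates by four.
    """
    known = prior_calls[-p:] if p > 0 else ""
    unknown = p - len(known)
    return [
        "".join(prefix) + known + base
        for prefix in product(BASES, repeat=unknown)
        for base in BASES
    ]

print(len(candidate_kmers("")))    # cycle 1: 64
print(len(candidate_kmers("A")))   # cycle 2: 16
print(candidate_kmers("AC"))       # cycle 3: ['ACA', 'ACC', 'ACG', 'ACT']
```

The same counting applies at the end of a read when succeeding bases form part of the context. The sketch leaves that case out for brevity.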

At the end of stage 2250, the base calling system can return to step 2230 to refine the estimates for the parameters based on the updated base calls. After a number of iterations of the forward pass and backpropagation, the base calling system may proceed to stages 2240 and 2250 to update the base calls based on the updated parameters. Successive iterations of 2230, 2240 and 2250 may be repeated until convergence of the parameter estimates and base calls for each of the clusters.
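The alternation between parameter estimation and base call refinement can be sketched as a generic loop. Everything here is an assumption made for illustration: the two callables stand in for stages 2230 and 2240/2250, the parameters are taken to be a flat list of numbers, and the stopping rule (unchanged base calls plus parameter changes within a tolerance) is one plausible reading of "convergence", not a rule stated in the disclosure.

```python
def refine(initial_calls, estimate_params, update_calls, tol=1e-6, max_iters=50):
    """Alternate parameter estimation and base calling until both are stable.

    estimate_params(calls) -> list of floats       (stand-in for stage 2230)
    update_calls(params, calls) -> list of calls   (stand-in for 2240 and 2250)
    Returns the final calls, the final parameters and the iteration count.
    """
    calls = list(initial_calls)
    params = None
    for iteration in range(1, max_iters + 1):
        new_params = estimate_params(calls)
        new_calls = update_calls(new_params, calls)
        params_stable = params is not None and all(
            abs(a - b) <= tol for a, b in zip(new_params, params)
        )
        calls_stable = new_calls == calls
        params, calls = new_params, new_calls
        if params_stable and calls_stable:
            return calls, params, iteration
    return calls, params, max_iters
```

A cap on the number of iterations is included so that the loop terminates even if the estimates oscillate.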

Figures 23A and 23B are scatter plots showing the quality of fit (intensity space signal-to-noise) against the mixing fraction for a plurality of clusters base called using the base calling method 2200, with data points shaded according to BLAST alignment score for the cluster, for the dominant read X₁ and the weaker read X₂, respectively. Figure 23C is a histogram of the mixing fractions recovered for the same plurality of clusters. It can be seen from Figure 23B that a mixing fraction of around 0.66 results in a higher alignment score for the weaker read, which is expected since this corresponds to the sixteen cloud intensity distribution. Although this same mixing fraction of 0.66 results in a lower alignment score for the dominant read, it can be seen that the dominant read has a generally better alignment overall. As can be seen in Figure 23C, the mixing fraction can vary over clusters and therefore this parameter should be estimated for each cluster.

Figures 24A and 24B are histograms of alignment lengths for a plurality of clusters base called using the base calling method 2200 for the dominant read X₁ and the weaker read X₂, respectively. Figures 24C and 24D are histograms of the percent of matching bases for a plurality of clusters base called using the CDSM-16QAM model for the stronger read and weaker read, respectively. It can be seen from these figures that the outputs correspond to expected distributions.

Figure 25 is a histogram of fragment insert size for paired-end simultaneous sequencing data where the dominant read X₁ and the weaker read X₂ are paired reads which are base called using the base calling method 900. It can be seen that the fragment insert size clusters around a length of approximately 300 base pairs, which is to be expected for paired-end data.

Definitions

Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. See, e.g. Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, NY 1994); Sambrook et al., Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press (Cold Spring Harbor, NY 1989). For purposes of the present disclosure, the following terms are defined below.

As used herein, the term “cluster” or “clump” refers to a group of molecules, e.g., a group of DNA, or a group of signals. In some embodiments, the signals of a cluster are derived from different features. In some embodiments, a signal clump represents a physical region covered by one amplified oligonucleotide. In various examples, a physical region may be a tile, a sub-tile, a lane or a sub-lane on a flow cell, etc. Each signal clump could be ideally observed as several signals. Accordingly, duplicate signals could be detected from the same clump of signals. In some embodiments, a cluster or clump of signals can comprise one or more signals or spots that correspond to a particular feature. When used in connection with microarray devices or other molecular analytical devices, a cluster can comprise one or more signals that together occupy the physical region occupied by an amplified oligonucleotide (or other polynucleotide or polypeptide with a same or similar sequence). For example, where a feature is an amplified oligonucleotide, a cluster can be the physical region covered by one amplified oligonucleotide. In other embodiments, a cluster or clump of signals need not strictly correspond to a feature. For example, spurious noise signals may be included in a signal cluster but not necessarily be within the feature area. For example, a cluster of signals from four cycles of a sequencing reaction could comprise at least four signals.

As used herein, a “flow cell” can include a device having a lid extending over a reaction structure to form a flow channel therebetween that is in communication with a plurality of reaction sites of the reaction structure, and can include a detection device that is configured to detect designated reactions that occur at or proximate to the reaction sites. A flow cell may include a solid-state light detection or “imaging” device, such as a Charge-Coupled Device (CCD) or Complementary Metal-Oxide Semiconductor (CMOS) (light) detection device. As one specific example, a flow cell may be configured to fluidically and electrically couple to a cartridge (having an integrated pump), which may be configured to fluidically and/or electrically couple to a bioassay system. A cartridge and/or bioassay system may deliver a reaction solution to reaction sites of a flow cell according to a predetermined protocol (e.g., sequencing-by-synthesis), and perform a plurality of imaging events. For example, a cartridge and/or bioassay system may direct one or more reaction solutions through the flow channel of the flow cell, and thereby along the reaction sites. At least one of the reaction solutions may include four types of nucleotides having the same or different fluorescent labels. The nucleotides may bind to the reaction sites of the flow cell, such as to corresponding oligonucleotides at the reaction sites. The cartridge and/or bioassay system may then illuminate the reaction sites using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes (LEDs)). The excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths. The fluorescent labels excited by the incident excitation light may provide emission signals (e.g., light of a wavelength or wavelengths that differ from the excitation light and, potentially, each other) that may be detected by the light sensors of the flow cell.
Flow cells described herein may be configured to perform various biological or chemical processes. More specifically, the flow cells described herein may be used in various processes and systems where it is desired to detect an event, property, quality, or characteristic that is indicative of a designated reaction. For example, flow cells described herein may include or be integrated with light detection devices, biosensors, and their components, as well as bioassay systems that operate with biosensors. The flow cells may be configured to facilitate a plurality of designated reactions that may be detected individually or collectively. The flow cells may be configured to perform numerous cycles in which the plurality of designated reactions occurs in parallel. For example, the flow cells may be used to sequence a dense array of DNA features through iterative cycles of enzymatic manipulation and light or image detection/acquisition. As such, the flow cells may be in fluidic communication with one or more microfluidic channels that deliver reagents or other reaction components in a reaction solution to a reaction site of the flow cells. The reaction sites may be provided or spaced apart in a predetermined manner, such as in a uniform or repeating pattern. Alternatively, the reaction sites may be randomly distributed. Each of the reaction sites may be associated with one or more light guides and one or more light sensors that detect light from the associated reaction site. In one example, light guides include one or more filters for filtering certain wavelengths of light. The light guides may be, for example, an absorption filter (e.g., an organic absorption filter) such that the filter material absorbs a certain wavelength (or range of wavelengths) and allows at least one predetermined wavelength (or range of wavelengths) to pass therethrough. 
In some flow cells, the reaction sites may be located in reaction recesses or chambers, which may at least partially compartmentalize the designated reactions therein.

As used herein, the term “spot radius” or “cluster radius” refers to a defined radius which encompasses a diffraction-limited spot or a cluster of signals. Accordingly, by defining a cluster radius as larger or smaller, a greater number of signals can fall within the radius for subsequent ordering and selection. A cluster radius can be defined by any distance measure, such as pixels, meters, millimeters, or any other useful measure of distance.

As used herein, a “signal” refers to a detectable event such as an emission, such as light emission, for example, in an image. Thus, in some embodiments, a signal can represent any detectable light emission that is captured in an image (i.e., a “spot”). Thus, as used herein, “signal” can refer to an actual emission from a feature of the specimen, or can refer to a spurious emission that does not correlate to an actual feature. Thus, a signal could arise from noise and could be later discarded as not representative of an actual feature of a specimen.

As used herein, an “intensity” of an emitted light refers to the intensity of the light transferred per unit area, where the area is measured on the plane perpendicular to the direction of propagation of the light ray, and where the intensity is the amount of energy transferred per unit time. In some embodiments, signal “strength”, “amplitude”, “magnitude” or “level” may be used synonymously with signal intensity. In some embodiments, an image taken by a detector is approximately or proportional to an intensity map integrated over some amount of time. In some embodiments, the signal of a diffraction-limited spot of a DNA cluster is extracted from the image as the total intensity included in the spot, up to a factor of the integration time. For example, the signal of a DNA cluster may be defined as the intensity included within the spot radius of the DNA cluster, up to a factor of the integration time. In other embodiments, the peak intensity value found within the spot radius may be used to represent the signal of the DNA cluster, up to a factor of the integration time.

As used herein, the process of aligning the template of signal positions onto a given image is referred to as “registration”, and the process for determining an intensity value or an amplitude value for each signal in the template for a given image is referred to as “intensity extraction”. For registration, the methods and systems provided herein may take advantage of the random nature of signal clump positions by using image correlation to align the template to the image.

As used herein, a “nucleotide” includes a nitrogen-containing heterocyclic base, a sugar, and one or more phosphate groups. Nucleotides are monomeric units of a nucleic acid sequence. Examples of nucleotides include, for example, ribonucleotides or deoxyribonucleotides. In ribonucleotides (RNA), the sugar is a ribose, and in deoxyribonucleotides (DNA), the sugar is a deoxyribose, i.e., a sugar lacking a hydroxyl group that is present at the 2' position in ribose. The nitrogen-containing heterocyclic base can be a purine base or a pyrimidine base. Purine bases include adenine (A) and guanine (G), and modified derivatives or analogs thereof. Pyrimidine bases include cytosine (C), thymine (T), and uracil (U), and modified derivatives or analogs thereof. The C-1 atom of deoxyribose is bonded to N-1 of a pyrimidine or N-9 of a purine. The phosphate groups may be in the mono-, di-, or tri-phosphate form. These nucleotides may be natural nucleotides, but it is to be further understood that non-natural nucleotides, modified nucleotides or analogs of the aforementioned nucleotides can also be used.

As used herein, “nucleobase” is a heterocyclic base such as adenine, guanine, cytosine, thymine, uracil, inosine, xanthine, hypoxanthine, or a heterocyclic derivative, analog, or tautomer thereof. A nucleobase can be naturally occurring or synthetic. Nonlimiting examples of nucleobases are adenine, guanine, thymine, cytosine, uracil, xanthine, hypoxanthine, 8-azapurine, purines substituted at the 8 position with methyl or bromine, 9-oxo-N6-methyladenine, 2-aminoadenine, 7-deazaxanthine, 7-deazaguanine, 7-deaza-adenine, N4-ethanocytosine, 2,6-diaminopurine, N6-ethano-2,6-diaminopurine, 5-methylcytosine, 5-(C3-C6)-alkynylcytosine, 5-fluorouracil, 5-bromouracil, thiouracil, pseudoisocytosine, 2-hydroxy-5-methyl-4-triazolopyridine, isocytosine, isoguanine, inosine, 7,8-dimethylalloxazine, 6-dihydrothymine, 5,6-dihydrouracil, 4-methyl-indole, ethenoadenine and the non-naturally occurring nucleobases described in U.S. Pat. Nos. 5,432,272 and 6,150,510 and PCT applications WO 92/002258, WO 93/10820, WO 94/22892, and WO 94/24144, and Fasman ("Practical Handbook of Biochemistry and Molecular Biology", pp. 385-394, 1989, CRC Press, Boca Raton, FL), all herein incorporated by reference in their entireties.

The term “nucleic acid” or “polynucleotide” refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise limited, encompasses known analogs of natural nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides, such as peptide nucleic acids (PNAs) and phosphorothioate DNA. Unless otherwise indicated, a particular nucleic acid sequence includes the complementary sequence thereof. Nucleotides include, but are not limited to, ATP, dATP, CTP, dCTP, GTP, dGTP, UTP, TTP, dUTP, 5-methyl-CTP, 5-methyl-dCTP, ITP, dITP, 2-amino-adenosine-TP, 2-amino-deoxyadenosine-TP, 2-thiothymidine triphosphate, pyrrolo-pyrimidine triphosphate, and 2-thiocytidine, as well as the alphathiotriphosphates for all of the above, and 2'-O-methyl-ribonucleotide triphosphates for all the above bases. Modified bases include, but are not limited to, 5-Br-UTP, 5-Br-dUTP, 5-F-UTP, 5-F-dUTP, 5-propynyl dCTP, and 5-propynyl-dUTP.

The polymerase used is an enzyme generally for joining 3'-OH 5'-triphosphate nucleotides, oligomers, and their analogs. Polymerases include, but are not limited to, DNA-dependent DNA polymerases, DNA-dependent RNA polymerases, RNA-dependent DNA polymerases, RNA-dependent RNA polymerases, T7 DNA polymerase, T3 DNA polymerase, T4 DNA polymerase, T7 RNA polymerase, T3 RNA polymerase, SP6 RNA polymerase, DNA polymerase I, Klenow fragment, Thermophilus aquaticus DNA polymerase, Tth DNA polymerase, VentR® DNA polymerase (New England Biolabs), Deep VentR® DNA polymerase (New England Biolabs), Bst DNA Polymerase Large Fragment, Stoeffel Fragment, 9°N DNA Polymerase, Pfu DNA Polymerase, Tfl DNA Polymerase, Tth DNA Polymerase, RepliPHI Phi29 Polymerase, Tli DNA polymerase, eukaryotic DNA polymerase beta, telomerase, Therminator™ polymerase (New England Biolabs), KOD HiFi™ DNA polymerase (Novagen), KOD1 DNA polymerase, Q-beta replicase, terminal transferase, AMV reverse transcriptase, M-MLV reverse transcriptase, Phi6 reverse transcriptase, HIV-1 reverse transcriptase, novel polymerases discovered by bioprospecting, and polymerases cited in US 2007/0048748, US 6,329,178, US 6,602,695, and US 6,395,524 (incorporated by reference). These polymerases include wild-type, mutant isoforms, and genetically engineered variants. "Encode" or "parse" are verbs referring to transferring from one format to another, and refer to transferring the genetic information of a target template base sequence into an arrangement of reporters.

Nucleosides and nucleotides may be labeled at sites on the sugar or nucleobase. A dye may be attached to any position on the nucleotide base, for example, through a linker. In particular embodiments, Watson-Crick base pairing can still be carried out for the resulting analog. Particular nucleobase labeling sites include the C5 position of a pyrimidine base or the C7 position of a 7-deaza purine base. A linker group may be used to covalently attach a dye to the nucleoside or nucleotide. As used herein, the term “covalently attached” or “covalently bonded” refers to the forming of a chemical bonding that is characterized by the sharing of pairs of electrons between atoms. For example, a covalently attached polymer coating refers to a polymer coating that forms chemical bonds with a functionalized surface of a substrate, as compared to attachment to the surface via other means, for example, adhesion or electrostatic interaction. It will be appreciated that polymers that are attached covalently to a surface can also be bonded via means in addition to covalent attachment.

Various different types of linkers having different lengths and chemical properties can be used. The term “linker” encompasses any moiety that is useful to connect one or more molecules or compounds to each other, to other components of a reaction mixture, and/or to a reaction site. For example, a linker can attach a reporter molecule or “label” (e.g., a fluorescent dye) to a reaction component. In certain embodiments, the linker is a member selected from substituted or unsubstituted alkyl (e.g., a 2-5 carbon chain), substituted or unsubstituted heteroalkyl, substituted or unsubstituted aryl, substituted or unsubstituted heteroaryl, substituted or unsubstituted cycloalkyl, and substituted or unsubstituted heterocycloalkyl. In one example, the linker moiety is selected from straight- and branched carbon-chains, optionally including at least one heteroatom (e.g., at least one functional group, such as ether, thioether, amide, sulfonamide, carbonate, carbamate, urea and thiourea), and optionally including at least one aromatic, heteroaromatic or non-aromatic ring structure (e.g., cycloalkyl, phenyl). In certain embodiments, molecules that have trifunctional linkage capability are used, including, but not limited to, cyanuric chloride, melamine, diaminopropanoic acid, aspartic acid, cysteine, glutamic acid, pyroglutamic acid, S-acetylmercaptosuccinic anhydride, carbobenzoxylysine, histidine, lysine, serine, homoserine, tyrosine, piperidinyl-1,1-amino carboxylic acid, diaminobenzoic acid, etc. In certain specific embodiments, a hydrophilic PEG (polyethylene glycol) linker is used.

In certain embodiments, linkers are derived from molecules which comprise at least two reactive functional groups (e.g., one on each terminus), and these reactive functional groups can react with complementary reactive functional groups on the various reaction components or be used to immobilize one or more reaction components at the reaction site. “Reactive functional group,” as used herein, refers to groups including, but not limited to, olefins, acetylenes, alcohols, phenols, ethers, oxides, halides, aldehydes, ketones, carboxylic acids, esters, amides, cyanates, isocyanates, thiocyanates, isothiocyanates, amines, hydrazines, hydrazones, hydrazides, diazo, diazonium, nitro, nitriles, mercaptans, sulfides, disulfides, sulfoxides, sulfones, sulfonic acids, sulfinic acids, acetals, ketals, anhydrides, sulfates, sulfenic acids, isonitriles, amidines, imides, imidates, nitrones, hydroxylamines, oximes, hydroxamic acids, thiohydroxamic acids, allenes, ortho esters, sulfites, enamines, ynamines, ureas, pseudoureas, semicarbazides, carbodiimides, carbamates, imines, azides, azo compounds, azoxy compounds, and nitroso compounds. Reactive functional groups also include those used to prepare bioconjugates, e.g., N-hydroxysuccinimide esters, maleimides and the like.

Cleavable linkers may be, by way of non-limiting example, electrophilically cleavable linkers, nucleophilically cleavable linkers, photocleavable linkers, cleavable under reductive conditions (for example disulfide or azide containing linkers), oxidative conditions, cleavable via use of safety-catch linkers and cleavable by elimination mechanisms. The use of a cleavable linker to attach the dye compound to a substrate moiety ensures that the label can, if required, be removed after detection, avoiding any interfering signal in downstream steps.

In some embodiments, one or more dye or label molecules may attach to the nucleotide base by non-covalent interactions, or by a combination of covalent and non- covalent interactions via a plurality of intermediating molecules. In one example, a nucleotide or a nucleotide analog, being newly incorporated by the polymerase synthesizing from a target polynucleotide, is initially unlabeled. Then, one or more fluorescent labels may be introduced to the nucleotide or nucleotide analog by binding to labeled affinity reagents containing one or more fluorescent dyes. Uses of unlabeled nucleotides and affinity reagents in sequencing by synthesis have been disclosed in U.S. Publication No. 2013/0079232, which is incorporated herein by reference. For example, one, two, three or each of the four different types of nucleotides (e.g., dATP, dCTP, dGTP and dTTP or dUTP) in the reaction mix may be initially unlabeled. Each of the four types of nucleotides (e.g., dNTPs) may have a 3' hydroxy blocking group to ensure that only a single base can be added by a polymerase to the 3' end of a copy polynucleotide being synthesized from the target polynucleotide. After incorporation of an unlabeled nucleotide, an affinity reagent may be then introduced that specifically binds to the incorporated dNTP to provide a labeled extension product comprising the incorporated dNTP. The affinity reagent may be designed to specifically bind to the incorporated dNTP via antibody-antigen interaction or ligand-receptor interaction, for example. The dNTP may be modified to include a specific antigen, which will pair with a specific antibody included in the corresponding affinity reagent. Thus, one, two, three or each of the four different types of nucleotides may be specifically labeled via their corresponding affinity reagents. 
In some embodiments, the affinity reagents may include small molecules or protein tags that may bind to a hapten moiety of the nucleotide (such as streptavidin-biotin, anti-DIG and DIG, anti-DNP and DNP), antibody (including but not limited to binding fragments of antibodies, single chain antibodies, bispecific antibodies, and the like), aptamers, knottins, affimers, or any other known agent that binds an incorporated nucleotide with a suitable specificity and affinity. In some embodiments, the hapten moiety of the unlabeled nucleotide may be attached to the nucleobase through a cleavable linker, which may be cleaved under the same reaction condition as that for removing the 3’ blocking group. In some embodiments, one affinity reagent may be labeled with multiple copies of the same fluorescent dye, for example, 1, 2, 3, 4, 5, 6, 8, 10, 12, 15 copies of the same dye. In some embodiments, each affinity reagent may be labeled with a different number of copies of the same fluorescent dye. In some embodiments, a first affinity reagent may be labeled with a first number of a first fluorescent dye, a second affinity reagent may be labeled with a second number of a second fluorescent dye, a third affinity reagent may be labeled with a third number of a third fluorescent dye, and a fourth affinity reagent may be labeled with a fourth number of a fourth fluorescent dye. In some embodiments, each affinity reagent may be labeled with a distinct combination of one or more types of dye, where each type of dye has a certain copy number. In some embodiments, different affinity reagents may be labeled with different dyes that can be excited by the same light source, but each dye will have a distinguishable fluorescent intensity or a distinguishable emission spectrum. In some embodiments, different affinity reagents may be labeled with the same dye in different molar ratios to create measurable differences in their fluorescent intensities.

A nucleotide analog may be attached to or associated with one or more photo-detectable labels to provide a detectable signal. In some embodiments, a photo-detectable label may be a fluorescent compound, such as a small molecule fluorescent label. Fluorescent molecules (fluorophores) suitable as a fluorescent label include, but are not limited to: 1,5 IAEDANS; 1,8-ANS; 4-methylumbelliferone; 5-carboxy-2,7-dichlorofluorescein; 5-carboxyfluorescein (5-FAM); fluorescein amidite (FAM); 5-carboxynapthofluorescein; tetrachloro-6-carboxyfluorescein (TET); hexachloro-6-carboxyfluorescein (HEX); 2,7-dimethoxy-4,5-dichloro-6-carboxyfluorescein (JOE); VIC®; NED™; tetramethylrhodamine (TMR); 5-carboxytetramethylrhodamine (5-TAMRA); 5-HAT (Hydroxy Tryptamine); 5-hydroxy tryptamine (HAT); 5-ROX (carboxy-X-rhodamine); 6-carboxyrhodamine 6G; 6-JOE; Light Cycler® red 610; Light Cycler® red 640; Light Cycler® red 670; Light Cycler® red 705; 7-amino-4-methylcoumarin; 7-aminoactinomycin D (7-AAD); 7-hydroxy-4-methylcoumarin; 9-amino-6-chloro-2-methoxyacridine; 6-methoxy-N-(4-aminoalkyl)quinolinium bromide hydrochloride (ABQ); Acid Fuchsin; ACMA (9-amino-6-chloro-2-methoxyacridine); Acridine Orange; Acridine Red; Acridine Yellow; Acriflavin; Acriflavin Feulgen SITSA; AFPs- AutoFluorescent Protein-(Quantum Biotechnologies); Texas Red; Texas Red-X conjugate; Thiadicarbocyanine (DiSC3); Thiazine Red R; Thiazole Orange; Thioflavin 5; Thioflavin S; Thioflavin TCN; Thiolyte; Thiozole Orange; Tinopol CBS (Calcofluor White); TMR; TO-PRO-1; TO-PRO-3; TO-PRO-5; TOTO-1; TOTO-3; TriColor (PE-Cy5); TRITC (TetramethylRodamine-IsoThioCyanate); True Blue; TruRed; Ultralite; Uranine B; Uvitex SFC; WW 781; X-Rhodamine; X-Rhodamine-5-(and-6)-Isothiocyanate (5(6)-XRITC); Xylene Orange; Y66F; Y66H; Y66W; YO-PRO-1; YO-PRO-3; YOYO-1; interchelating dyes such as YOYO-3, Sybr Green, Thiazole orange; members of the Alexa Fluor® dye series (from Molecular Probes/Invitrogen) which cover a broad spectrum and match the principal output wavelengths of common excitation sources such as Alexa Fluor 350, Alexa Fluor 405, 430, 488, 500, 514, 532, 546, 555, 568, 594, 610, 633, 635, 647, 660, 680, 700, and 750; members of the Cy Dye fluorophore series (GE Healthcare), also covering a wide spectrum such as Cy3, Cy3B, Cy3.5, Cy5, Cy5.5, Cy7; members of the Oyster® dye fluorophores (Denovo Biolabels) such as Oyster-500, -550, -556, 645, 650, 656; members of the DY-Labels series (Dyomics), for example, with maxima of absorption that range from 418 nm (DY-415) to 844 nm (DY-831) such as DY-415, -495, -505, -547, -548, -549, -550, -554, -555, -556, -560, -590, -610, -615, -630, -631, -632, -633, -634, -635, -636, -647, -648, -649, -650, -651, -652, -675, -676, -677, -680, -681, -682, -700, -701, -730, -731, -732, -734, -750, -751, -752, -776, -780, -781, -782, -831, -480XL, -481XL, -485XL, -510XL, -520XL, -521XL; members of the ATTO series of fluorescent labels (ATTO-TEC GmbH) such as ATTO 390, 425, 465, 488, 495, 520, 532, 550, 565, 590, 594, 610, 611X, 620, 633, 635, 637, 647, 647N, 655, 680, 700, 725, 740; members of the CAL Fluor® series or Quasar® series of dyes (Biosearch Technologies) such as CAL Fluor® Gold 540, CAL Fluor® Orange 560, Quasar® 570, CAL Fluor® Red 590, CAL Fluor® Red 610, CAL Fluor® Red 635, Quasar® 570, and Quasar® 670. In some embodiments, a first photo-detectable label interacts with a second photo-detectable moiety to modify the detectable signal, e.g., via fluorescence resonance energy transfer (“FRET”; also known as Förster resonance energy transfer). The fluorescent labels utilized by the systems and methods disclosed herein can have different peak absorption wavelengths, for example, ranging from 400 nm to 800 nm.
In some embodiments, the peak absorption wavelengths of the fluorescent labels can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values. In some embodiments the peak absorption wavelengths of the fluorescent labels can be at least, or at most, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, or 800 nm.

The fluorescent labels can have different peak emission wavelengths, for example, ranging from 400 nm to 800 nm. In some embodiments, the peak emission wavelengths of the fluorescent labels can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values. In some embodiments the peak emission wavelengths of the fluorescent labels can be at least, or at most, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, or 800 nm.

The fluorescent labels can have different Stokes shifts, for example, ranging from 10 nm to 200 nm. In some embodiments, the Stokes shift can be, or be about, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200 nm, or a number or a range between any two of these values. In some embodiments, the Stokes shift can be at least, or at most, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nm.

In some embodiments, the distance between the peak emission wavelengths of any two fluorescent labels can vary, for example, ranging from 10 nm to 200 nm. In some embodiments, the distance between the peak emission wavelengths of any two fluorescent labels can be, or be about, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200 nm, or a number or a range between any two of these values. In some embodiments, the distance between the peak emission wavelengths of any two fluorescent labels can be at least, or at most, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nm.

A “light source” may be any device capable of emitting energy along the electromagnetic spectrum. A light source may be a source of visible light (VIS), ultraviolet light (UV) and/or infrared light (IR). “Visible light” (VIS) generally refers to the band of electromagnetic radiation with a wavelength from about 400 nm to about 750 nm. “Ultraviolet (UV) light” generally refers to electromagnetic radiation with a wavelength shorter than that of visible light, or from about 10 nm to about 400 nm. “Infrared light” or infrared radiation (IR) generally refers to electromagnetic radiation with a wavelength greater than the VIS range, or from about 750 nm to about 50,000 nm. A light source may also provide full spectrum light. Light sources may output light from a selected wavelength or a range of wavelengths. In some embodiments of the invention, the light source may be configured to provide light above or below a predetermined wavelength, or may provide light within a predetermined range. A light source may be used in combination with a filter, to selectively transmit or block light of a selected wavelength from the light source. A light source may be connected to a power source by one or more electrical connectors; an array of light sources may be connected to a power source in series or in parallel. A power source may be a battery, or a vehicle electrical system or a building electrical system. The light source may be connected to a power source via control electronics (control circuit); control electronics may comprise one or more switches. The one or more switches may be automated, or controlled by a sensor, timer or other input, or may be controlled by a user, or a combination thereof. For example, a user may operate a switch to turn on a UV light source; the light source may be applied on a constant basis until it is turned off, or it may be pulsed (repeated on/off cycles) until it is turned off. In some embodiments, the light source may be switched from a continuously-on state to a pulsed state, or vice versa. In some embodiments, the light source may be configured to brighten or darken over time.
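By way of a non-limiting sketch, the approximate UV, VIS and IR bands defined above can be expressed as a simple classification of a wavelength. The function name and the handling of the exact boundary values (400 nm and 750 nm are assigned to VIS here) are assumptions made for illustration, since the bands are only stated approximately.

```python
def classify_band(wavelength_nm: float) -> str:
    """Classify a wavelength into the approximate UV / VIS / IR bands.

    Boundary handling is an assumption: 400 nm and 750 nm are treated as VIS.
    """
    if 10.0 <= wavelength_nm < 400.0:
        return "UV"
    if 400.0 <= wavelength_nm <= 750.0:
        return "VIS"
    if 750.0 < wavelength_nm <= 50_000.0:
        return "IR"
    return "outside the bands defined above"


for wavelength in (365.0, 532.0, 850.0):
    print(wavelength, classify_band(wavelength))
```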

For operation, the light source may be connected to a power source capable of providing sufficient intensity to illuminate the sample. Control electronics may be used to switch the intensity on or off based on input from a user or some other input, and can also be used to modulate the intensity to a suitable level (e.g. to control brightness of the output light). Control electronics may be configured to turn the light source on and off as desired. Control electronics may include a switch for manual, automatic, or semiautomatic operation of the light sources. The one or more switches may be, for example, a transistor, a relay or an electromechanical switch. In some embodiments, the control circuit may further comprise an AC-DC and/or a DC-DC converter for converting the voltage from the voltage source to an appropriate voltage for the light source. The control circuit may comprise a DC-DC regulator for regulation of the voltage. The control circuit may further comprise a timer and/or other circuitry elements for applying electric voltage to the optical filter for a fixed period of time following the receipt of input. A switch may be activated manually or automatically in response to predetermined conditions, or with a timer. For example, control electronics may process information such as user input, stored instructions, or the like.

One or more of a plurality of light sources may be provided. In some embodiments, each of the plurality of light sources may be the same. Alternatively, one or more of the light sources may vary. The light characteristics of the light emitted by the light sources may be the same or may vary. A plurality of light sources may or may not be independently controllable. One or more characteristics of the light source may or may not be controlled, including but not limited to whether the light source is on or off, brightness of light source, wavelength of light, intensity of light, angle of illumination, position of light source, or any combination thereof.

In some embodiments, light output from a light source may be from about 350 to about 750 nm, or any amount or range therebetween, for example from about 350 nm to about 360, 370, 380, 390, 400, 410, 420, 430 or about 450 nm, or any amount or range therebetween. In other embodiments, light from a light source may be from about 550 to about 700 nm, or any amount or range therebetween, for example from about 550 to about 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690 or about 700 nm, or any amount or range therebetween. In some embodiments, the wavelength of the light generated by the light source can vary, for example, ranging from 400 nm to 800 nm. In some embodiments, the wavelength of the light generated by the light source can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values. In some embodiments, the wavelength of the light generated by the light source can be at least, or at most, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, or 800 nm. The light source may be capable of emitting electromagnetic waves in any spectrum. In some embodiments, the light source may have a wavelength falling between 10 nm and 100 µm. In some embodiments, the wavelength of light may fall between 100 nm and 5000 nm, 300 nm and 1000 nm, or 400 nm and 800 nm. In some embodiments, the wavelength of light may be less than, and/or equal to 10 nm, 100 nm, 200 nm, 300 nm, 400 nm, 500 nm, 600 nm, 700 nm, 800 nm, 900 nm, 1000 nm, 1100 nm, 1200 nm, 1300 nm, 1500 nm, 1750 nm, 2000 nm, 2500 nm, 3000 nm, 4000 nm, or 5000 nm.

In one example, a light source may be a light-emitting diode (LED) (e.g., gallium arsenide (GaAs) LED, aluminum gallium arsenide (AlGaAs) LED, gallium arsenide phosphide (GaAsP) LED, aluminum gallium indium phosphide (AlGaInP) LED, gallium(III) phosphide (GaP) LED, indium gallium nitride (InGaN)/gallium(III) nitride (GaN) LED, or aluminum gallium phosphide (AlGaP) LED). In another example, a light source can be a laser, for example a vertical cavity surface emitting laser (VCSEL) or other suitable light emitter such as an Indium-Gallium-Aluminum-Phosphide (InGaAlP) laser, a Gallium-Arsenic Phosphide/Gallium Phosphide (GaAsP/GaP) laser, or a Gallium-Aluminum-Arsenide/Gallium-Aluminum-Arsenide (GaAlAs/GaAs) laser. Other examples of light sources may include but are not limited to electron stimulated light sources (e.g., Cathodoluminescence, Electron Stimulated Luminescence (ESL light bulbs), Cathode ray tube (CRT monitor), Nixie tube), incandescent light sources (e.g., Carbon button lamp, Conventional incandescent light bulbs, Halogen lamps, Globar, Nernst lamp), electroluminescent (EL) light sources (e.g., Light-emitting diodes, Organic light-emitting diodes, Polymer light-emitting diodes, Solid-state lighting, LED lamp, Electroluminescent sheets, Electroluminescent wires), gas discharge light sources (e.g., Fluorescent lamps, Inductive lighting, Hollow cathode lamp, Neon and argon lamps, Plasma lamps, Xenon flash lamps), or high-intensity discharge light sources (e.g., Carbon arc lamps, Ceramic discharge metal halide lamps, Hydrargyrum medium-arc iodide lamps, Mercury-vapor lamps, Metal halide lamps, Sodium vapor lamps, Xenon arc lamps). Alternatively, a light source may be a bioluminescent, chemiluminescent, phosphorescent, or fluorescent light source.

As used herein, an “optical channel” is a predefined profile of optical frequencies (or equivalently, wavelengths). For example, a first optical channel may have wavelengths of 500 nm-600 nm. To take an image in the first optical channel, one may use a detector which is only responsive to 500 nm-600 nm light, or use a bandpass filter having a transmission window of 500 nm-600 nm to filter the incoming light onto a detector responsive to 300 nm-800 nm light. A second optical channel may have wavelengths of 300 nm-450 nm and 850 nm-900 nm. To take an image in the second optical channel, one may use a detector responsive to 300 nm-450 nm light and another detector responsive to 850 nm-900 nm light and then combine the detected signals of the two detectors. Alternatively, to take an image in the second optical channel, one may use a bandstop filter which rejects 451 nm-849 nm light in front of a detector responsive to 300 nm-900 nm light.
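The optical channel examples above can be sketched, under stated assumptions, as a set of wavelength bands. The `OpticalChannel` class and the `through_bandstop` helper below are hypothetical names introduced only for illustration. The sketch checks, at a few integer wavelengths, that the two-detector arrangement and the bandstop arrangement described for the second optical channel accept the same light.

```python
from dataclasses import dataclass
from typing import Tuple

Band = Tuple[float, float]  # (lower nm, upper nm), both inclusive


@dataclass(frozen=True)
class OpticalChannel:
    """A predefined profile of wavelengths, modelled as one or more bands."""
    bands: Tuple[Band, ...]

    def passes(self, wavelength_nm: float) -> bool:
        return any(lo <= wavelength_nm <= hi for lo, hi in self.bands)


def through_bandstop(wavelength_nm: float, detector: Band, stop: Band) -> bool:
    """True if light reaches a detector placed behind a bandstop filter."""
    in_detector = detector[0] <= wavelength_nm <= detector[1]
    rejected = stop[0] <= wavelength_nm <= stop[1]
    return in_detector and not rejected


first_channel = OpticalChannel(bands=((500.0, 600.0),))
second_channel = OpticalChannel(bands=((300.0, 450.0), (850.0, 900.0)))

# Compare the two ways of imaging the second channel at sample wavelengths.
samples = (320.0, 450.0, 600.0, 849.0, 850.0, 900.0, 950.0)
agree = all(
    second_channel.passes(w) == through_bandstop(w, (300.0, 900.0), (451.0, 849.0))
    for w in samples
)
print(agree)
```

The comparison is made at integer wavelengths because the text specifies the band edges in whole nanometres; a wavelength such as 450.5 nm falls between the stated edges and is not addressed by either description.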

Additional Notes

The embodiments described herein are exemplary. Modifications, rearrangements, substitute processes, etc. may be made to these embodiments and still be encompassed within the teachings set forth herein. One or more of the steps, processes, or methods described herein may be carried out by one or more processing and/or digital devices, suitably programmed.

The various illustrative imaging or data processing techniques described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.

The various illustrative detection systems described in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processor configured with specific instructions, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. For example, systems described herein may be implemented using a discrete memory chip, a portion of memory in a microprocessor, flash, EPROM, or other types of memory.

The elements of a method, process, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of computer-readable storage medium known in the art. An exemplary storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The processor and the storage medium can reside in an ASIC. A software module can comprise computer-executable instructions which cause a hardware processor to execute the computer-executable instructions.

Conditional language used herein, such as, among others, “can,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or states. Thus, such conditional language is not generally intended to imply that features, elements and/or states are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without author input or prompting, whether these features, elements and/or states are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” “involving,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.

Disjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y or Z, or any combination thereof (e.g., X, Y and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y or at least one of Z to each be present.

The terms “about” or “approximate” and the like are synonymous and are used to indicate that the value modified by the term has an understood range associated with it, where the range can be ±20%, ±15%, ±10%, ±5%, or ±1%. The term “substantially” is used to indicate that a result (e.g., measurement value) is close to a targeted value, where close can mean, for example, the result is within 80% of the value, within 90% of the value, within 95% of the value, or within 99% of the value. The term “partially” is used to indicate that an effect is only in part or to a limited extent.
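As a purely illustrative sketch of the “about” ranges above, a value can be tested against a target with a fractional tolerance. The function name and the default tolerance of ±10% are assumptions; the text lists ±20%, ±15%, ±10%, ±5% and ±1% as possible ranges without selecting one.

```python
def is_about(value: float, target: float, tolerance: float = 0.10) -> bool:
    """True if `value` lies within ±tolerance (as a fraction) of `target`."""
    return abs(value - target) <= tolerance * abs(target)


# 540 nm is "about" 500 nm at ±10% but not at ±5%.
print(is_about(540.0, 500.0, tolerance=0.10), is_about(540.0, 500.0, tolerance=0.05))
```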

Unless otherwise explicitly stated, articles such as “a” or “an” should generally be interpreted to include one or more described items. Accordingly, phrases such as “a device configured to” or “a device to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations. For example, “a processor to carry out recitations A, B and C” can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.

While the above detailed description has shown, described, and pointed out novel features as applied to illustrative embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the devices or algorithms illustrated can be made without departing from the spirit of the disclosure. As will be recognized, certain embodiments described herein can be embodied within a form that does not provide all of the features and benefits set forth herein, as some features can be used or practiced separately from others. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

It should be appreciated that all combinations of the foregoing concepts (provided such concepts are not mutually inconsistent) are contemplated as being part of the inventive subject matter disclosed herein. In particular, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the inventive subject matter disclosed herein.

Claims

1. A method of base calling nucleobases of first and second polynucleotide sequence portions, the method comprising:

(a) accessing intensity data for a current sequencing cycle of a sequencing run, wherein said intensity data is a combined intensity of a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion and a second signal obtained based upon a respective second nucleobase of at least one second polynucleotide sequence portion;

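(b) base calling the first nucleobase based on the intensity data;

(c) accessing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein said adjustments are dependent on at least one preceding and/or succeeding nucleobase in said polynucleotide sequence portion;

(d) base calling the second nucleobase based on the intensity data, the base call of the first nucleobase and the plurality of mappings.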

2. The method of claim 1, wherein the second nucleobase is base called by comparing the intensity data with the base call of the first nucleobase and the plurality of mappings.

3. The method of any preceding claim, wherein the second nucleobase is base called based on an adjusted intensity data, wherein the intensity data is adjusted based on the base call of the first nucleobase.

4. The method of claim 3, further comprising accessing base context data for the current sequencing cycle of the first polynucleotide sequence portion comprising base calls of nucleobases for at least one of a preceding sequencing cycle and/or a succeeding sequencing cycle; and selecting a mapping for the first polynucleotide sequence portion based on the base call of the first nucleobase and the base context data for the first polynucleotide sequence portion; wherein the intensity data is further adjusted based on the selected mapping for the first polynucleotide sequence portion.

5. The method according to any one of claims 3 or 4, wherein said polynucleotide sequence portions have been selectively processed such that an intensity of the signals obtained based upon the respective first nucleobase is greater than an intensity of the signals obtained based upon the respective second nucleobase, and wherein the intensity data is further adjusted based on the relative intensity of the signals obtained based upon the respective first nucleobase and the signals obtained based upon the respective second nucleobase.

6. The method according to claim 5, wherein the relative intensity is determined based on the intensity data.

7. The method according to any one of claims 3 to 6, wherein the intensity data is further adjusted to correct for cluster dependent effects.

8. The method according to claim 7, wherein the cluster dependent effects are determined based on the intensity data.

9. The method according to claim 7 or 8, wherein the cluster dependent effects comprise at least one of phasing/prephasing, background, decay, scale, background, camera gain and laser ramp.

10. The method according to any preceding claim, further comprising: accessing base context data for the current sequencing cycle of the second polynucleotide sequence portion comprising base calls of nucleobases for at least one of a preceding sequencing cycle and/or a succeeding sequencing cycle; and selecting a subset of the plurality of the mappings based on the base context data for the second polynucleotide sequence portion; wherein said subset of the plurality of mappings is used to base call the second nucleobase.

11. The method according to any preceding claim, wherein each of the plurality of mappings receives a k-mer sequence as input and generates an adjusted signal intensity as output.

12. The method of claim 11, wherein the k-mer sequence corresponds to a current nucleobase and at least one preceding and/or succeeding nucleobase of a polynucleotide sequence portion, and wherein the adjusted signal intensity corresponds to the signal obtained based upon the current nucleobase.

13. The method of any preceding claim, wherein each of the plurality of mappings is modified by a respective phasing coefficient.

14. A data processing device comprising means for carrying out a method according to any one preceding claim.

15. A data processing device according to claim 14, wherein the data processing device is a polynucleotide sequencer.

16. A computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method according to any one of claims 1 to 13.

17. A computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out a method according to any one of claims 1 to 13.

18. A computer-readable data carrier having stored thereon a computer program product according to claim 16.

19. A data carrier signal carrying a computer program product according to claim 16.

20. A method for determining base context effects on a current nucleobase of at least one polynucleotide sequence portion, the method comprising:

(c) initializing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein the adjustments are dependent on at least one preceding and/or succeeding nucleobase in said at least one polynucleotide sequence portion;

(d) iteratively updating said plurality of mappings based upon the intensity data and the sequence information of the first and second polynucleotide sequence portions.

21. The method according to claim 20, wherein the plurality of mappings are updated based upon the intensity data and a predicted intensity data, wherein the predicted intensity data is based on the sequence information of the first and second polynucleotide sequence portions and the plurality of mappings.

22. The method according to claim 21, wherein the plurality of mappings are updated based upon a comparison of the intensity data with the predicted intensity data.

23. The method according to claim 22, wherein each of the per-sequencing cycle intensity values of the intensity data are respectively compared with per-sequencing cycle intensity values of the predicted intensity data.

24. The method according to any one of claims 20 to 23, wherein said polynucleotide sequence portions have been selectively processed such that an intensity of the signals obtained based upon the respective first nucleobase is greater than an intensity of the signals obtained based upon the respective second nucleobase, wherein the predicted intensity data is further based upon the relative intensity of the signals obtained based upon the respective first nucleobase and the signals obtained based upon the respective second nucleobase.

25. The method according to claim 24, wherein a value for the relative intensity is initialized, and wherein the value for the relative intensity is iteratively updated with the plurality of mappings.

26. The method according to any one of claims 21 to 25, wherein values for one or more parameters representing cluster dependent effects are initialized, wherein the values for the one or more parameters are iteratively updated with the plurality of mappings, and wherein the predicted intensity data is further based upon the one or more parameters.

27. The method according to claim 26, wherein the one or more parameters representing cluster dependent effects comprise at least one of phasing/prephasing, background, decay, scale, background, camera gain and laser ramp.

28. The method according to any one of claims 20 to 27, wherein each of the plurality of mappings receives a k-mer sequence as input and generates an adjusted signal intensity as output.

29. The method according to claim 28, wherein the k-mer sequence corresponds to a current nucleobase and at least one preceding and/or succeeding nucleobase of a polynucleotide sequence portion, and wherein the adjusted signal intensity corresponds to the signal obtained based upon the current nucleobase.

30. The method of any one of claims 20 to 29, wherein a respective phasing coefficient to modify each of the plurality of mappings is initialized, wherein each of the phasing coefficients is iteratively updated with the plurality of mappings.

31. A system comprising: a memory storing a plurality of mappings representing adjustments to signal intensity obtained based upon a current nucleobase of at least one polynucleotide sequence portion, wherein the adjustments are dependent on at least one preceding and/or succeeding nucleobase in said at least one polynucleotide sequence portion; wherein the plurality of mappings are learned according to the method of any one of claims 20 to 30.

32. A data processing device comprising means for carrying out a method according to any one of claims 20 to 30.

33. A data processing device according to claim 32, wherein the data processing device is a polynucleotide sequencer.

34. The data processing device according to claim 32 or 33, further comprising a memory for storing the plurality of mappings.

35. A computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out a method according to any one of claims 20 to 30.

36. A computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out a method according to any one of claims 20 to 30.

37. A computer-readable data carrier having stored thereon a computer program product according to claim 35.

38. A data carrier signal carrying a computer program product according to claim 35.
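
By way of illustration only, the two-stage base calling recited in the claims (calling the first nucleobase from the combined intensity, then calling the second nucleobase using the first base call and k-mer mappings) might be sketched as follows. Everything specific in this sketch is an assumption and not part of the disclosure: the two-channel `SIGNATURE` table, the treatment of each mapping as a multiplicative gain looked up by k-mer, the nearest-centroid rule, the relative intensities of 1.0 and 0.4, and all function names.

```python
from typing import Dict, Tuple

Vec = Tuple[float, float]

# Hypothetical two-channel signature for each base (illustrative only).
SIGNATURE: Dict[str, Vec] = {
    "A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0),
}


def predicted_signal(kmer: str, mappings: Dict[str, float], scale: float) -> Vec:
    """Signal expected from the last base of `kmer`, adjusted for its context.

    `mappings` stands in for the k-mer mappings: k-mer -> gain (default 1.0).
    """
    gain = scale * mappings.get(kmer, 1.0)
    x, y = SIGNATURE[kmer[-1]]
    return (gain * x, gain * y)


def sq_dist(a: Vec, b: Vec) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def call_first_base(intensity: Vec, first_scale: float, second_scale: float) -> str:
    """Call the brighter first base by nearest centroid.

    Each centroid is the first base's nominal signal plus the average
    contribution of a not-yet-known second base.
    """
    offset = second_scale * 0.5  # mean of the four signatures is (0.5, 0.5)

    def centroid(base: str) -> Vec:
        x, y = SIGNATURE[base]
        return (first_scale * x + offset, first_scale * y + offset)

    return min("ACGT", key=lambda b: sq_dist(intensity, centroid(b)))


def call_second_base(intensity: Vec, first_call: str, first_context: str,
                     second_context: str, mappings: Dict[str, float],
                     first_scale: float, second_scale: float) -> str:
    """Subtract the context-adjusted first signal, then fit the second base."""
    fx, fy = predicted_signal(first_context + first_call, mappings, first_scale)
    residual = (intensity[0] - fx, intensity[1] - fy)
    return min(
        "ACGT",
        key=lambda b: sq_dist(
            residual, predicted_signal(second_context + b, mappings, second_scale)
        ),
    )


# Hypothetical data: preceding bases "A" (first portion) and "G" (second portion).
mappings = {"AC": 0.9, "GT": 1.2}
intensity = (0.9, 0.48)
first = call_first_base(intensity, first_scale=1.0, second_scale=0.4)
second = call_second_base(intensity, first, "A", "G", mappings, 1.0, 0.4)
print(first, second)
```

In this sketch the first signal is made brighter than the second so that the first base can be separated before any context adjustment is applied; the mappings are then used twice, once to remove the first base's contribution and once to score each candidate second base.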