WO2025193757A1 - Methods and systems for masking an array of biosensor elements

Info

Publication number: WO2025193757A1
Application number: PCT/US2025/019452
Authority: WO
Inventors: Michael LENANDER; Michael Anthony D'JAMOOS; Martin D. Smith; Ravi Saxena; Curtis S. Gehman; Mathew Koshy; Sriram Kumar RASIPUR SATISH KUMAR
Original assignee: Pacific Biosciences of California Inc
Current assignee: Pacific Biosciences of California Inc
Priority date: 2024-03-12
Filing date: 2025-03-11
Publication date: 2025-09-18
Anticipated expiration: 2026-09-12

Abstract

Provided is an analytical system comprising: a plurality of biosensor elements configured to generate sensor data, and a data selection module. The data selection module comprises: a data input array configured to receive sensor data from the plurality of biosensor elements; and a data output array connectable to a downstream data processing module for analyzing the sensor data. The data selection module is for selectively providing the sensor data from the input array to the output array and is configured to: receive sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data are sought; and selectively provide the identified sensor data from the identified biosensor elements to the data output array.

Description

METHODS AND SYSTEMS FOR MASKING AN ARRAY OF BIOSENSOR ELEMENTS

Field of the Disclosure

The present disclosure relates to an analytical system, to methods and systems for selectively providing data from a plurality of biosensor elements of the analytical system to a downstream processing component, and to methods of configuring systems thereof.

Background

DNA sequencing is the process of determining the nucleotide order of a given DNA fragment. Increasingly, massive parallel or massively parallel sequencing approaches have been adopted (also called next-generation or second-generation sequencing) which utilize miniaturized and parallelized platforms to sequence millions or billions of data reads per instrument run. Within these approaches, single-molecule sequencing (utilizing single-molecule templates) has become increasingly popular because it removes the need to amplify the DNA before sequencing (these amplification processes often being cumbersome to implement and potentially introducing sequencing errors).

In a nanowell based platform, a nanowell provides a zero-mode waveguide, or ZMW (also termed a nanowell optical confinement element). A single DNA polymerase enzyme is affixed to the bottom of the ZMW with a single molecule of DNA as a template. The ZMW structure is small enough that a single nucleotide incorporation can be detected optically, and the ZMWs can be incorporated into a single detection chip or cell (macroscale reservoir into which a sample is placed). In some versions, each of the four DNA bases is attached to one of four different fluorescent dyes such that sequencing can be made possible due to the characteristic frequency of light given off during incorporation. In other versions, the amplitude of the detected signal is characteristic of the respective DNA base and so this can be used for sequencing.

However, in each of these platforms the volume of data acquired is vast. Some nanowell based sequencers have millions of ZMWs, and the data from raw reads (before correction) can include up to 180 terabytes from a single cell (e.g., from a 25M cell over a 20-hour acquisition time).

Moreover, not all of the data acquired from a given nanowell may be usable for sequencing. In a nanowell based sequencing platform, the same may apply (in that some of the nanowells may be empty) but some nanowells may also contain more than one polymerase (so called multi-loading) whereby the signals detected cannot be used to reliably base call. Accordingly, the data generated from each nanowell of a nanowell based sequencing platform may not be continuously or perpetually useable for sequencing.

Moreover, the process of base calling using the raw signals from any of these platforms is computationally expensive (“base calling” referring to the inference of DNA sequences from physical signals).

The inventors have devised the present disclosure in light of these considerations.

Summary of the Disclosure

In a general sense, the present disclosure provides methods and systems for reducing the computational load of an analytical system (such as a DNA sequencer) by streaming a selected subset of sensor data from a plurality of biosensor elements of the analytical system to a downstream processing module (e.g., for base calling). The data from the biosensor elements is selected by applying a “data mask” which includes information about which sensor data should be provided to the downstream processing module.

To generate the data mask, the sensor data from the biosensor elements may be analyzed (in a process referred to as “mask inference”) to determine which biosensor elements are “productive” and which are “unproductive”. Therefore, the sensor data from the “productive” biosensor elements may be selectively provided to the downstream processing module and the sensor data from the remaining, “unproductive” biosensor elements may be discarded or prevented from passing to the downstream processing module. By filtering the sensor data in this way, the downstream processing module need only process some of the sensor data which reduces the computing time, expense, and resources needed to process the sensor information.

The amount of sensor data involved in such systems is vast, which means that selecting and streaming the sensor data in this way is not trivial. The present disclosure provides methods and systems for selecting and mapping the selected sensor data in efficient ways which minimize the disruption and computational load associated with selecting sensor data which is present in such large quantities.

Further aspects discussed herein provide techniques for continuously analyzing the sensor data and updating the data mask during an analytical run as biosensor elements become productive and/or nonproductive. The process of updating the data mask may be referred to as “dynamic masking”. The techniques described herein provide systems and methods for updating the data mask in a way which provides minimal disruption to the data streaming and operations of the downstream processing.

Aspects of the invention are set out in the appended set of claims.

In a first aspect there is provided an analytical system comprising: a plurality of biosensor elements configured to generate sensor data, and a data selection module comprising: a data input array configured to receive sensor data from the plurality of biosensor elements; and a data output array connectable to a downstream data processing module for analyzing the sensor data; wherein the data selection module is configured to: receive sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data are sought; and selectively provide the sensor data of the identified biosensor elements from the data input array to the data output array.

Advantageously, by selectively providing the sensor data from only the identified subset of biosensor elements to the output array, the downstream data processing module can be less computationally expensive and much faster than if all the sensor data were to be processed. The analytical system may be a single molecule analytical system. Accordingly, the identified subset of the plurality of biosensor elements may be biosensor elements which are identified as productive biosensor elements. In the context of single molecule analytical systems, a “productive” biosensor element may be an element that is actively detecting a single target molecule. Biosensor elements comprising no target molecules, or two or more target molecules that are providing sensor data indicating overlapping signals, may be referred to as “non-productive”.

The single molecule analytical system may be configured to detect the presence of a target, or sequence a target, such as a polynucleotide or peptide target. The system may be configured to detect optical signals from fluorescent labels or may use non-optical detection methods and may detect a series of events in real time. Such events may be binding events (e.g., polymerase binding to a fluorescent nucleotide or a target peptide being transiently bound by a labelled affinity reagent), movement of a molecule and/or enzymatic events such as incorporation of a nucleotide (e.g., polymerase incorporation of a phosphate labelled nucleotide resulting in an end of a pulse width as a fluorescent label is cleaved off of a nucleotide during incorporation).

For example, the analytical system may be a polymerase sequencing system wherein the sensor data is received during real-time DNA sequencing from single polymerase molecules. In this example, the biosensor elements may comprise zero-mode waveguide (ZMW) nanostructure arrays which are configured to provide sensor data from a large number of ZMWs (i.e., 1 million+ ZMWs) for post-processing. In this example, only some of the ZMWs may be providing productive data for DNA sequencing at any point in time. The present inventors have found that typically only 40% of the ZMW sensor data may be productive for post-processing. Therefore, by selectively streaming only selected samples of the sensor data to the downstream processing module, the overall computing expense of the downstream processing can be reduced dramatically. As was discussed previously, data from the identified biosensor elements is provided from the data input array to the data output array.

Subsequently, a downstream processor connected to the data output array (i.e., downstream from the data selection module, which is selecting the data to provide to the data output array) may perform base calling on the received data. Therefore, the derivation and application of the mask as referred to below may be performed before any base calling has occurred and/or may be performed independently of any base calling (e.g., even if base calling is performed in parallel).

The present techniques may also be applied to other types of analytical systems comprising a plurality of biosensor elements, such as optical systems for single molecule protein detection and/or sequencing, and/or systems for single molecule DNA sequencing.

In some examples discussed herein, the mask information for selecting the sensor data for providing to the downstream processing module may be updated on the fly (i.e., during an analytical run) based on changes in the sensor data provided by the biosensor elements. In this way, the data mask may be updated to selectively pass data from newly productive biosensor elements and to not pass data from newly non-productive biosensor elements. For example, when the analytical system is configured to perform real-time polynucleotide sequencing, a biosensor element may become productive or become unproductive during an operational run depending on a number of polynucleotides with actively extending polymerases. Non-productive biosensor elements may not provide any data, may provide poor data or may provide slower data (e.g., slower event rate, such as slow rate of nucleotide incorporation) as compared to other biosensor elements.

It is worth noting that the data selection module is not configured to control the operation of the biosensor elements themselves. That is, each biosensor element is active during a given run (even though not all are productive) and is not controlled on the basis of the derived mask or the metrics discussed below. Therefore, each biosensor element generates data, and that generated data leaves the circuitry associated with the biosensor element (e.g., the detection circuitry, any amplifiers or other electrical components associated with a single biosensor element). For example, where the biosensor elements are ZMWs (or unit cells comprising a ZMW and the associated detection circuitry dedicated to the reaction site of a single biosensor element), all of the ZMWs are active and producing data. No component of the system modifies the operation of the ZMWs or the unit cells on the basis of the mask or the metrics discussed below. Therefore, in some examples, the data is discarded by the data selection module (which is separate from the biosensor elements). That is, all of the data from all of the ZMWs or unit cells arrives at the data selection module, but only some of the data is passed through to the output.

In an additional aspect, there is provided a data selection module for an analytical system comprising: a data input array configured to receive sensor data from a plurality of biosensor elements of the analytical system; and a data output array connectable to a downstream data processing module for analyzing the sensor data. The data selection module is for selectively providing the sensor data from the input array to the output array. The data selection module is configured to: receive sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data is sought; and selectively provide the sensor data of the identified biosensor elements from the data input array to the data output array.

Advantageously, as discussed above, by selectively providing the sensor data from the identified subset of biosensor elements to the output array, the downstream data processing module may be less computationally expensive and require less time to process the sensor data.

Selectively providing the identified sensor data from the identified biosensor elements to the output array may comprise using the sensor mask information to configure (or reconfigure) the data selection module for providing the sensor data from the identified biosensor elements to the output array. Accordingly, using the sensor mask information to configure the data selection module may be referred to herein as “applying a data mask”. The sensor data from the biosensor elements which are not identified by the sensor mask information may not be provided to the data output array.

The individual biosensor elements may each comprise a binding site for loading of a target, such as a single molecule target. For example, the binding sites may be a pad or nanowell associated with a detection element. The biosensor elements may further comprise one or more detection elements, such as a photodiode and/or electrode. Accordingly, the analytical system may be a sequencing system, a single polymerase molecule sequencer, a peptide sequencer, or an optical analytical system.

The data selection module may be configured to stream the identified sensor data in real-time to the data output array based on the sensor mask information. The data selection module may therefore be referred to as a “data streaming module”.

The data input array may comprise a plurality of parallel input ports for receiving the sensor data from the biosensor elements. For example, the data input array may comprise input buffers for receiving and clocking in the sensor data at a predetermined frame rate, or at a variable frame rate. In certain aspects, the frame rate of the data input array and/or data output array may be controlled by a dedicated reference clock that is separate from a main sensor clock, or may be controlled by a phase-locked loop control system.

The data output array may be or comprise one or more output connections or output buffers for providing the selected sensor data to the downstream processing module. The downstream processing component may comprise any downstream components configured to receive the selected sensor data. For example, this may include a PCIe interface, a host computer's DRAM for volatile high bandwidth storage, and/or a CPU or GPU for analyzing the sensor data. For example, the data selection module may be configured to temporarily store (i.e., buffer) a plurality of frames of the sensor data (i.e., several samples over time from the selected biosensor elements) and then provide the sensor data from the temporary storage to the data output array in sequential chunks of the buffered sensor data. This enables the downstream processing module to process the sensor data in more manageable amounts than if all of the selected sensor data were provided to the downstream processing module at once.
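The buffering behavior described above can be sketched in software as follows. This is a minimal illustration only, assuming each frame is a simple list of samples and that the selected elements are given as a list of input offsets; the function name `stream_in_chunks` and the parameter `frames_per_chunk` are hypothetical and do not appear in the disclosure, which contemplates a hardware implementation.

```python
from typing import Iterable, Iterator, List, Sequence


def stream_in_chunks(
    frames: Iterable[Sequence[int]],
    offsets: Sequence[int],
    frames_per_chunk: int,
) -> Iterator[List[List[int]]]:
    """Buffer the selected samples of several frames, then emit them as one chunk."""
    buffer: List[List[int]] = []
    for frame in frames:
        # Keep only the samples from the selected biosensor elements.
        buffer.append([frame[i] for i in offsets])
        if len(buffer) == frames_per_chunk:
            yield buffer
            buffer = []
    if buffer:
        # Flush a final, partially filled chunk at the end of the run.
        yield buffer


# Three frames from three hypothetical elements; elements 0 and 2 are selected.
frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
chunks = list(stream_in_chunks(frames, offsets=[0, 2], frames_per_chunk=2))
```

Here the downstream consumer receives two frames at a time rather than the whole run at once; the chunk size is an arbitrary choice for the example.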

The mask information may include a list or array of locations or offsets in the data input array corresponding to the input locations of the identified sensor data. In some examples, the mask information may comprise an indicator (e.g., a bit) for each biosensor element which indicates whether that biosensor element is selected or not. For example, the identified subset of biosensor elements may be biosensor elements which are identified as being, e.g., active or productive, or as detecting a molecule of interest (such as a sequence of interest) in the analytical system. Accordingly, the sensor data from the identified subset of biosensor elements may be referred to as “selected sensor data” or “identified sensor data”. In certain embodiments, the number of biosensor elements within the subset may be a fixed or predetermined number. That is, there may be no input or adjustment of the number of biosensor elements within the subset. Whilst they may be subdivided into further subsets (as discussed below with relation to the sub-masks) it may be that the total number of biosensor elements is not adjusted or is not adjustable. In certain embodiments, the number of biosensor elements within the subset is not adjusted during a sequencing run.
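The two mask representations described above (one bit per biosensor element, or a list of offsets into the data input array) can be illustrated with a short sketch. The helper names `mask_bits_to_offsets` and `apply_mask` are hypothetical, and a frame is assumed to be a flat list with one sample per biosensor element.

```python
from typing import List, Sequence


def mask_bits_to_offsets(mask_bits: Sequence[int]) -> List[int]:
    """Convert a per-element bit mask into a list of input-array offsets."""
    return [i for i, bit in enumerate(mask_bits) if bit]


def apply_mask(frame: Sequence[int], offsets: Sequence[int]) -> List[int]:
    """Return only the samples of one frame located at the selected offsets."""
    return [frame[i] for i in offsets]


# One frame of sensor data from 8 hypothetical biosensor elements.
frame = [10, 11, 12, 13, 14, 15, 16, 17]
mask_bits = [0, 1, 1, 0, 0, 1, 0, 0]

offsets = mask_bits_to_offsets(mask_bits)  # elements 1, 2 and 5 are selected
selected = apply_mask(frame, offsets)      # their samples, in input order
```

Both forms carry the same information; the offset list is more compact when few elements are selected, while the bit form has a fixed size regardless of the selection.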

The data selection module may be further configured to perform a mapping operation in which a target location is assigned to the sensor data from each identified biosensor element. The mapping step may comprise assigning a target location or address to the sensor data from each selected biosensor element, the target location corresponding to a host memory location of the downstream data processing module. The mask information may include mapping information for the sensor data which indicates the target location in the data output array (or in downstream memory) for the outgoing sensor data.

In examples discussed herein, wherein the data selection module is configured to update the mask information to indicate an updated selection of biosensor elements, the data selection module may be configured to perform a re-mapping operation wherein updated target locations are assigned to the updated selection of sensor data. Importantly, the target locations of persisting sensor data, which is selected before and after the update may be preserved to enable efficient memory access operations of the downstream data processing.

The plurality of biosensor elements may be detection sites, or wells (e.g., for hosting a biological reaction) of a biological testing sensor system/chip of the analytical system. Each biosensor element may be configured to continuously stream sensor data to the data selection module during an operational run of the analytical system.

A data bandwidth of the data output array may be smaller than a data bandwidth of the data input array. Accordingly, by selectively providing only the selected sensor data to the downstream module the data bandwidth requirements of the data output array and the downstream module are reduced. For example, the analytical system may comprise N biosensor elements, and the output array may be connectable to the downstream data processing component via M parallel output connectors, wherein M and N are both integers greater than zero, and M is less than N. For example, M may be no more than 3N/4, no more than 2N/3, no more than N/2, or no more than 0.4N.
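As a worked illustration of these bounds, the sketch below evaluates each ratio for N = 25,000,000, a value chosen here only because a 25M cell is mentioned in the background; it is not stated to be the N of any particular embodiment. The function name `max_output_connectors` is hypothetical.

```python
def max_output_connectors(n_elements: int, numerator: int, denominator: int) -> int:
    """Largest integer M satisfying M <= (numerator / denominator) * n_elements."""
    return (n_elements * numerator) // denominator


N = 25_000_000  # illustrative element count (assumption)

# The ratios listed above: 3N/4, 2N/3, N/2 and 0.4N (written as 2N/5).
bounds = {
    "3N/4": max_output_connectors(N, 3, 4),
    "2N/3": max_output_connectors(N, 2, 3),
    "N/2": max_output_connectors(N, 1, 2),
    "0.4N": max_output_connectors(N, 2, 5),
}
```

Under this assumption the tightest bound, 0.4N, corresponds to at most 10,000,000 output connectors.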

The data selection module may be embodied in hardware. For example, the data selection module may be embodied in fixed function circuitry which is configured to operate concurrently. For example, the data selection module may be implemented in an integrated circuit (IC) such as a bespoke IC and/or a Field Programmable Gate Array (FPGA). When the data selection module is implemented on an IC, the IC may be located on a sensor stack or sensor chip of the analytical system which also comprises the plurality of biosensor elements. In other examples, the data selection module may be implemented, completely or in part, on one or more FPGAs (i.e., in the programmable logic of the one or more FPGAs).

As such, providing the sensor data to the data output array may comprise configuring switches in the fixed function circuitry or programmable logic to reroute data streaming pipelines. For example, sensor data may be clocked into input buffers and then some of the sensor data may be selectively clocked out to the data output array (or into temporary storage in the data output array) depending on if it is identified in the mask information or not.

The mask information may be received from a data evaluation module. The data evaluation module may be configured to determine the mask information based on the sensor data from the plurality of biosensor elements. For example, the data evaluation module may be configured to perform pre-processing on the sensor data to determine which biosensor elements are providing sensor data which meets a predetermined preprocessing threshold. Determining the mask information may be referred to as “mask inference” and is discussed in more detail below. The data evaluation module may be located on a CPU and/or a GPU. Alternatively, some or all of the data evaluation module may be implemented in digital circuitry, for example in programmable logic of an FPGA. Accordingly, the data selection module and the data evaluation module may be provided on a same component (e.g., an FPGA, where some, or all, of the data evaluation module is provided in the programmable logic of the FPGA).
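A minimal sketch of threshold-based mask inference is given below. The disclosure does not specify the pre-processing metric at this point, so the peak-to-peak range used here is purely an assumed placeholder, as are the function name `infer_mask` and the threshold value.

```python
from typing import List, Sequence


def infer_mask(frames: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return one bit per element: 1 if the element's metric meets the threshold."""
    n_elements = len(frames[0])
    mask: List[int] = []
    for element in range(n_elements):
        samples = [frame[element] for frame in frames]
        # Hypothetical metric: peak-to-peak range of the element's samples.
        metric = max(samples) - min(samples)
        mask.append(1 if metric >= threshold else 0)
    return mask


# Three frames from four hypothetical elements; only element 1 shows activity.
frames = [
    [0.0, 5.0, 1.0, 0.1],
    [0.1, 0.0, 1.1, 0.0],
    [0.0, 6.0, 0.9, 0.1],
]
mask = infer_mask(frames, threshold=2.0)
```

Any other per-element statistic could be substituted for the range without changing the structure of the sketch.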

The mask information may comprise a data array comprising locations of the identified sensor data in the data input array. In this way, the data selection module may be configured to stream sensor data from the identified locations to the data output array. For example, the mask information may comprise one or more lists of addresses or offsets corresponding to the selected sensor data. For example, the mask information may comprise a plurality of data arrays (or masking words) of addresses. Each data array (which may be referred to as a region sub-mask) may correspond to a region of the sensor data as discussed in detail below. As discussed above, the mask information may further comprise mapping information for the sensor data, the mapping information comprising a target address for the sensor data in host memory of the downstream processing module.

As mentioned above, the data selection module may be configured to dynamically adjust the selective provision of sensor data from the input array to the output array based on the mask information. Therefore, as the behaviors of the biosensor elements change over time during an experimental run of the analytical system, the data selection module may advantageously re-configure based on new mask information to adapt to changes in biosensor element behavior. For example, when one or more biosensor elements become “non-productive” or become “productive”, as discussed herein, the “data mask” may be updated to selectively pass data from newly productive biosensor elements and not pass data from newly non-productive biosensor elements.

Accordingly, the data selection module may be configured to: receive updated mask information indicating an updated selection of identified biosensor elements from which sensor data are sought; and reconfigure to selectively provide the sensor data from the updated selection of biosensor elements to the output array. “Reconfiguring” may be considered as updating a data streaming routing of the data selection module to provide (i.e., stream) the sensor data from updated locations in the data input array (as indicated by the updated mask information) to the data output array.

The updating of the mask information may comprise providing the sensor data from the plurality of biosensor elements to a data evaluation module, the data evaluation module being configured to assess the sensor data, determine the updated sensor mask information, and provide the sensor mask information to the data selection module. The data selection module may be configured to periodically update by receiving the updated mask information and reconfiguring to selectively provide the sensor data from the updated selection of biosensor elements. This process may be referred to herein as an “update event” or an “update routine”.

The reconfiguration of the data selection module may comprise updating a data routing configuration of the data selection module (i.e., the hardware/circuitry of the module). The reconfiguring of the data selection module may be performed in one time step of the data selection module. For example, the sensor data may be clocked into the data selection module (e.g., into an input buffer of the data selection module) at a predetermined frame rate separated by predetermined time steps so that the sensor data is sampled during each time step. The identified sensor data selected from the “clocked-in” sensor data may then be routed to the output array based on the mask information. Accordingly, the reconfiguring of the data selection module in one time step may comprise re-routing the newly selected sensor data of the updated selection of biosensor elements from the input array towards the output array from one time step to the next.
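One way to picture a reconfiguration that takes effect between one time step and the next is a routing table that is staged in advance and swapped in at a frame boundary. The sketch below is a software analogy only, under the assumption that routing is a list of input offsets; the class `DataSelector` and its methods are hypothetical names, not the circuitry of the disclosure.

```python
from typing import List, Optional, Sequence


class DataSelector:
    """Routes selected samples per frame; updates take effect at a frame boundary."""

    def __init__(self, offsets: Sequence[int]) -> None:
        self._active: List[int] = list(offsets)
        self._pending: Optional[List[int]] = None

    def load_update(self, offsets: Sequence[int]) -> None:
        """Stage a new routing without disturbing the routing currently in use."""
        self._pending = list(offsets)

    def process_frame(self, frame: Sequence[int]) -> List[int]:
        """Process one time step, switching to any staged routing first."""
        if self._pending is not None:
            self._active, self._pending = self._pending, None
        return [frame[i] for i in self._active]


selector = DataSelector([0, 1])
first = selector.process_frame([1, 2, 3])   # routed with the original selection
selector.load_update([0, 2])                # staged between time steps
second = selector.process_frame([4, 5, 6])  # routed with the updated selection
```

No frame is ever processed with a partially updated routing, which is the property the single-time-step reconfiguration is intended to illustrate.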

The reconfiguring may comprise remapping of the sensor data, wherein remapping comprises assigning an updated target location or address to the updated sensor data. The updated mask information may comprise remapping information, the remapping information comprising updated target locations for the updated sensor data. The updated target locations may correspond to memory locations of the data output array or the downstream data processing module.

Importantly, the sensor data from the biosensor elements not being updated may be provided to the same location in the output array or the downstream data processing module before and after the updating. That is, the reconfiguring of the data selection module may be performed in a manner which preserves the position in the output array of biosensor elements that persist across the update event. Accordingly, sensor data from biosensor elements which were selected before and after the update event may be provided to the same position in the output array. This preservation of data locations in the output array during the updating of the data selection module may be referred to as “dynamic remapping” of the data selection module.

Maintaining the data location in the output array of persisting biosensor elements may advantageously ensure that memory mapping and retrieval functions of the downstream data processing component can be simpler, and quicker than if the sensor data were re-ordered since direct memory access (DMA) calls may be used to access context data associated with each biosensor element sequentially without needing to remap the locations of the context data in memory after the updating of the data selection module.

For example, the downstream data processing component may comprise context data associated with each identified biosensor element, the context data being accessible by direct memory access (DMA). Accordingly, selectively providing the sensor data to the output array may include providing the sensor data from the input array to selected locations in the data output array or host memory of the downstream data processor, the selected locations corresponding to the memory locations of the respective context data.

For example, the data selection module may be configured to provide the sensor data from the identified selection of biosensor elements in the data input array to selected locations in the data output array, wherein the reconfiguring may then comprise: providing sensor data from the identified biosensor elements which were selected before and after the reconfiguring to the same locations in the data output array, and providing sensor data from newly identified biosensor elements in the updated selection of biosensor elements to remaining locations in the data output array. The “remaining locations” may be locations which previously received sensor data from biosensors which are no longer selected in the updated selection of biosensor elements. By remapping newly selected sensor data in this way, the location in the output array of the persisting sensor data, selected before and after the updating, may be preserved, thereby reducing the computing workload required to remap the memory locations of downstream context data associated with each biosensor element.
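The remapping rule described above (persisting elements keep their output locations, and newly selected elements take the locations vacated by deselected ones) can be sketched as follows. The mapping is assumed to be a dictionary from element index to output slot, and the function name `remap` is hypothetical; element indices in the new selection are assumed to be unique.

```python
from typing import Dict, Iterable


def remap(old_mapping: Dict[int, int], new_selection: Iterable[int]) -> Dict[int, int]:
    """Build an updated element-to-slot mapping that preserves persisting slots."""
    new_selection = list(new_selection)
    selected = set(new_selection)

    # Elements selected both before and after the update keep their slot.
    new_mapping = {e: s for e, s in old_mapping.items() if e in selected}

    # Slots vacated by elements that are no longer selected.
    free_slots = sorted(s for e, s in old_mapping.items() if e not in selected)

    # Newly selected elements fill the vacated ("remaining") slots.
    newcomers = [e for e in new_selection if e not in old_mapping]
    if len(newcomers) > len(free_slots):
        raise ValueError("not enough free output slots for newly selected elements")
    for element, slot in zip(newcomers, free_slots):
        new_mapping[element] = slot
    return new_mapping


old = {1: 0, 2: 1, 5: 2}       # elements 1, 2, 5 occupy slots 0, 1, 2
new = remap(old, [1, 5, 7])    # element 2 is dropped, element 7 is added
```

In this example elements 1 and 5 stay in slots 0 and 2, and element 7 takes slot 1, so any downstream context data indexed by slot only needs to be refreshed for slot 1.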

The data selection module may be configured to continue providing the sensor data from the identified biosensor elements which are not being updated in the updated selection of biosensor elements from the input array to the output array during the reconfiguring. Accordingly, the sensor data which is received from biosensor elements which are not being updated may be provided to the data output array without interruption or disruption. In this way, the data selection can be dynamically updated while ensuring there is minimal disruption to the sensor data being streamed.

As mentioned above, the mask information may comprise a plurality of sub-masks (i.e., a plurality of data sub-arrays). Each sub-mask may correspond to a group of the plurality of biosensor elements and comprise an indication of which biosensor elements in the respective group are identified biosensor elements from which sensor data is sought. Each sub-mask may be configured to enable or disable sensor data being provided from a certain quantity of biosensor elements in each group. In this way, each sub-mask may be configured to enable or disable sensor data being provided from only its respective group of biosensor elements. Importantly, by dividing the selection of the sensor data from the biosensor elements in this way, the data routing complexity of the data selection module can be reduced compared to if the mask information were configured to enable or disable the streaming of sensor data from any of the biosensor elements.

Each group of biosensor elements may be referred to as a region of the plurality of biosensor elements and the sub-mask may be referred to as a region sub-mask. The biosensor elements in each region may be located contiguously in the analytical system. That is, the biosensor elements in each region may be positioned adjacent to each other in the analytical system and/or the sensor data from the biosensor elements of each region may be received at the data selection module at adjacent positions in the data input array. By locating biosensor elements contiguously in regions, the routing of the sensor data from the input array to the output array may be less complex and more efficient, which is particularly useful when the data selection module is implemented in an IC. However, other arrangements of biosensor elements in the regions may also be possible.

By dividing and selecting the sensor data according to regions of sensor data, the routing complexity of the data selection module may be reduced since only local sensor data within each region may be enabled or disabled and provided to the data output array. Additionally, the regions may provide important boundaries for remapping of the sensor data from the data input array to the data output array, which usefully constrains the complexity of the implementation. In particular, dividing the mask information and the sensor data in this way enables the downstream processing module to more easily locate and access context data associated with the sensor data in each region, and the mask information may be updated more easily. The sensor data from the identified biosensor elements in each region, indicated by the region sub-masks, may be provided to respective region locations in the data output array. The reconfiguring of the data selection module during an update event may comprise providing the sensor data from the identified biosensor elements in each region to the same respective region locations in the data output array before and after the reconfiguring. Mapping the data locations of regions in this way advantageously enables the sensor data to be remapped to the output array in a way which preserves the relative location of sensor data in regions which are not updated during the reconfiguring.
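The region-based routing described above can be sketched under some simplifying assumptions: regions are contiguous blocks of equal size in the input array, each region sub-mask is a list of offsets local to its region, and each region owns a fixed slice of the output array. The names `apply_region_submasks`, `region_size` and `selection_quantity` are hypothetical.

```python
from typing import List, Optional, Sequence


def apply_region_submasks(
    frame: Sequence[int],
    region_size: int,
    submasks: Sequence[Sequence[int]],
    selection_quantity: int,
) -> List[Optional[int]]:
    """Route each region's selected samples into that region's own output slice."""
    out: List[Optional[int]] = [None] * (len(submasks) * selection_quantity)
    for region, local_offsets in enumerate(submasks):
        if len(local_offsets) > selection_quantity:
            raise ValueError("sub-mask selects more elements than the region allows")
        base_in = region * region_size           # start of the region in the input
        base_out = region * selection_quantity   # start of the region in the output
        for slot, offset in enumerate(local_offsets):
            out[base_out + slot] = frame[base_in + offset]
    return out


frame = list(range(100, 108))  # 8 elements split into 2 regions of 4
before = apply_region_submasks(frame, 4, [[0, 3], [1, 2]], selection_quantity=2)
# Updating only region 1's sub-mask leaves region 0's output slice untouched.
after = apply_region_submasks(frame, 4, [[0, 3], [0, 2]], selection_quantity=2)
```

Because each sub-mask only addresses elements inside its own region, an update to one region cannot move data belonging to another region in the output.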

Each region sub-mask may be configured to indicate a selection quantity of (i.e., an amount of, or a number of) biosensor elements in the respective group from which identified sensor data are sought. For example, each sub-mask may specify a predetermined number of selected biosensor elements from which data streaming is to be enabled.

The selection quantity may be configurable. Accordingly, the amount of sensor data acquired from each region may be adjusted. Adjusting the number of biosensor elements which are selected per region/sub-mask may be related to a “sub-mask size” wherein the sub-mask size is the number of biosensor elements in a region which are selectable using a respective sub-mask for that region (i.e., the number of input elements in the sub-mask region). The selection quantity may therefore be the quantity of biosensor elements in the sub-mask region which may be selected (i.e., defining the size of the sensor data appearing in the output array). The sub-mask size may also be configurable.

The selection quantity may be the same across each of the region sub-masks such that sensor data is provided to the output array from a constant number of biosensor elements of each group. Accordingly, the selection quantity may be a global selection quantity indicating an amount of biosensor elements selectable from each region. In this context, “selecting” or “enabling” a biosensor element may be considered to mean providing the sensor data from that biosensor element to the output array of the data selection module (as opposed to “disabling” or “masking” the sensor data from a biosensor element).

Alternatively, each region sub-mask (or subset of sub-masks) may have a respective selection quantity such that the amount of biosensor elements indicated by each sub-mask is adjustable between sub-masks. Accordingly, each sub-mask may be configured to select a variable number of biosensor elements from which data streaming is enabled. In this way, different amounts of sensor data may be extracted from different regions of biosensor elements. Advantageously, configuring the sub-mask information in this way enables the data selection module to be more adaptable because the respective selection quantities between each region may be adjusted to accommodate the current performance or conditions of different regions of the analytical system.

For example, if biosensor elements (e.g., ZMWs, reaction sites, nanowells) in a certain area of a sensor stack or chip are producing particularly productive or high performing sensor data, then the selection quantities of sub-mask region(s) corresponding to that area may be increased to extract more sensor data from that area. The selection quantities associated with other regions may be decreased to accommodate this. Each region of biosensor elements may comprise, for example, up to 256K (i.e., 256 x 1024) biosensor elements. In other examples, each region may comprise 32 to 8192, more preferably 64 to 4096, more preferably 512 to 2048, more preferably 1024 biosensor elements. The number of biosensor elements in each region is preferably a power of two to maximize the efficient use of addressable space and circuitry in the data selection module. When a very large region size is implemented (e.g., up to 256K biosensor elements), a correspondingly large local memory (e.g., 256K+ RAM) may be provided to initially store the sensor data from the biosensor elements. In these examples, the data mask information may then contain e.g., 15-bit mapping values (e.g., memory addresses) for selecting the sensor data from the local memory. In contrast, when smaller region sizes are used (e.g., 1024 biosensor elements) then smaller mapping values may be used (e.g., 8 bits).

The selection quantity of each region sub-mask may be between 4 and 1024, more preferably at least 8 biosensor elements. Accordingly, each sub-mask may be configured to indicate at least eight, or e.g., 16, 36, 48, etc., identified, biosensor elements in the respective group of biosensor elements from which sensor data is sought. For example, when the global region selection quantity is eight, each sub-mask may comprise eight addresses for eight input locations of sensor data in the data input array. Additionally, the sub-mask may comprise eight target mapping addresses for the selected sensor data, which corresponds to target locations in the host memory of the downstream processing module.
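By way of non-limiting illustration only, the application of region sub-masks to one frame of sensor data may be sketched as follows. The sketch assumes contiguous regions of equal size and sub-masks expressed as region-local input addresses; the function name `apply_region_sub_masks` and the example values are hypothetical and are not taken from the present disclosure.

```python
from typing import List, Sequence


def apply_region_sub_masks(
    frame: Sequence[int],
    region_size: int,
    sub_masks: Sequence[Sequence[int]],
) -> List[int]:
    """Copy the selected sensor data of each region into that region's output slots.

    Assumption: region r occupies input positions [r * region_size, (r + 1) * region_size)
    and each sub-mask lists addresses local to its own region.
    """
    output: List[int] = []
    for region_index, sub_mask in enumerate(sub_masks):
        base = region_index * region_size  # first input position of this region
        for local_address in sub_mask:
            if not 0 <= local_address < region_size:
                raise ValueError("sub-mask address lies outside its region")
            output.append(frame[base + local_address])
    return output


# Two regions of 16 elements each, with a global selection quantity of four.
frame = list(range(100, 132))
sub_masks = [[0, 3, 7, 15], [1, 2, 8, 9]]
selected = apply_region_sub_masks(frame, 16, sub_masks)
```

Because every sub-mask carries the same number of addresses in this sketch, the data of region r always begins at output position r multiplied by the selection quantity, which is one way of obtaining fixed region locations in the output array.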

In some examples, the quantity of biosensor elements in each region may be divided evenly across a plurality of tracks per region. Each region sub-mask may be configured to indicate a fixed sub-quantity of identified biosensor elements within each track from which sensor data is sought. The sub-quantity may be configured to evenly divide the selection quantity of each region between the plurality of tracks.

When the sensor data from each region of biosensor elements is divided into a plurality of tracks, the data selection module may be configured to provide the selected sensor data from each track to a respective track location in the data output array. Accordingly, the reconfiguring based on the updated mask information may comprise providing the sensor data from the updated selection of biosensor elements in each track to the same respective track location in the data output array before and after the reconfiguring. Therefore, the output array locations of the remaining sensor data in other regions and tracks may be preserved through the reconfiguring.

For example, each region may be divided into eight tracks. The sub-quantity, in this example, may be one biosensor element per track. Accordingly, the sub-mask may be configured to indicate at least one biosensor element from each track of the region. Dividing each region into eight tracks advantageously corresponds to the predominance of 64-bit data paths in compute platforms and accelerator cards which ultimately relates to how DRAM is accessed. In this way, eight track words (each word comprising 8 bits of sensor data from one biosensor element of the track) may be moved to/from memory (e.g., temporary storage in the data selection module or host memory of the downstream data processor) in a single operation.
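By way of non-limiting illustration only, the packing of eight 8-bit track words into a single 64-bit value may be sketched as follows. The byte ordering (track 0 in the least significant byte) and the function names are assumptions made for the example only.

```python
from typing import List, Sequence


def pack_track_words(track_samples: Sequence[int]) -> int:
    """Pack eight 8-bit samples (one per track) into one 64-bit integer.

    Assumption: track 0 occupies the least significant byte.
    """
    if len(track_samples) != 8:
        raise ValueError("expected exactly eight track samples")
    word = 0
    for track, sample in enumerate(track_samples):
        if not 0 <= sample <= 0xFF:
            raise ValueError("each track sample must fit in 8 bits")
        word |= sample << (8 * track)
    return word


def unpack_track_words(word: int) -> List[int]:
    """Recover the eight 8-bit track samples from a 64-bit integer."""
    return [(word >> (8 * track)) & 0xFF for track in range(8)]


packed = pack_track_words([1, 2, 3, 4, 5, 6, 7, 8])
```

In this sketch, one selected biosensor element per track yields exactly one 64-bit value per region per frame, matching the single-operation memory transfer described above.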

The updated mask information may correspond to a sector of the biosensor elements. The update sector may be one of a plurality of sectors of the biosensor elements. The sector of biosensor elements for which the mask information is updated may be referred to herein as an update sector. Accordingly, the updated selection of identified biosensor elements may be included in the update sector. Therefore, the reconfiguring may only be applied to the data routing pipelines corresponding to that sector of biosensor elements. The data selection module may be configured to continue streaming previously identified sensor data from the remaining portions without updating the data routing corresponding to those groups. Each sector may comprise one or more regions of the biosensor elements.

As discussed above, the data selection module may be configured to provide the identified biosensor data from the data input array to selected locations in the data output array. The reconfiguring may therefore comprise continuing to provide the sensor data from the identified biosensor elements which are not in the update sector to the same selected locations in the data output array (i.e., as before the reconfiguring). The sensor data from the updated selection of biosensor elements in the update sector may be provided to the remaining selected locations in the data output array which correspond to the current update sector. Accordingly, the remaining portion of the data selection module (i.e., the data routing configuration of the remaining sectors) may be maintained during the reconfiguring.

Additionally, as discussed above, the location of sensor data may also be preserved within each update sector (e.g., by using region sub-masks and/or tracks and/or a remapping operation) for the biosensor elements which were selected before and after updating, further preventing disruption of the downstream data processing. As discussed above, each region sub-mask may be configured to indicate a selection quantity of biosensor elements from its respective region. Accordingly, the splitting of the mask information by region sub-masks in this way enables the sensor data provided to the output array to be remapped and to accommodate newly identified sensor data whilst preserving the array location of the remaining sensor data (which has not been newly identified) even within the update sector.
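By way of non-limiting illustration only, one possible remapping operation that preserves output locations through an update may be sketched as follows. The sketch assumes that each region keeps a fixed number of output slots and that a biosensor element selected both before and after the update keeps its slot, with newly identified elements filling the freed slots. The names `remap_region` and `reconfigure_sector` are hypothetical.

```python
from typing import Dict, List, Sequence


def remap_region(old_slots: Sequence[int], new_selection: Sequence[int]) -> List[int]:
    """Assign a new selection to output slots, keeping retained addresses in place."""
    if len(set(old_slots)) != len(old_slots):
        raise ValueError("existing slots must hold unique addresses")
    if len(new_selection) != len(old_slots) or len(set(new_selection)) != len(new_selection):
        raise ValueError("new selection must hold one unique address per output slot")
    retained = set(old_slots) & set(new_selection)
    # Newly identified addresses, in the order given, fill the slots that were freed.
    newcomers = iter([a for a in new_selection if a not in retained])
    return [a if a in retained else next(newcomers) for a in old_slots]


def reconfigure_sector(
    sub_masks: Dict[int, List[int]],
    updated: Dict[int, List[int]],
) -> Dict[int, List[int]]:
    """Return new sub-masks; only regions present in `updated` (the update sector) change."""
    result = dict(sub_masks)
    for region, selection in updated.items():
        result[region] = remap_region(sub_masks[region], selection)
    return result


current = {0: [3, 7, 12, 20], 1: [1, 2, 8, 9]}
after = reconfigure_sector(current, {0: [7, 20, 5, 9]})
```

In this example, addresses 7 and 20 of region 0 remain in their second and fourth slots, addresses 5 and 9 take over the slots released by addresses 3 and 12, and region 1, which lies outside the update sector, is untouched.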

The data selection module may be configured to provide the sensor data from the plurality of biosensor elements in the update sector to the data evaluation module. The data selection module may then receive the updated mask information relating to that update sector in return. The updated mask information may also be passed to a downstream analytical system, for example, so that basecalls are assigned to the appropriate sequence (e.g., as coming from the correct biosensor element). For example, if a period of slow or poor sequencing data classifies a biosensor region as unproductive but that biosensor region starts producing better or faster data, basecalls from before and after the period of masking can be assigned to the same sequence.

The reconfiguring of the data selection module may comprise reconfiguring (only) a portion of the data selection module that corresponds to the selected update sector of biosensor elements. Accordingly, the portion of the data selection module which is configured to provide identified sensor data from the biosensor elements in the update sector may be rerouted to provide sensor data from the updated selection of biosensor elements in that sector.

The remaining sectors of the biosensor elements may not be updated when the update sector is updated. That is, the data selection module may be configured to continue streaming sensor data from identified biosensor elements in the remaining sectors based on previously received mask information. By updating the data selection module in sections in this way, the data selection module may continue streaming data during the update. Moreover, because less bandwidth is needed to receive the updated mask information, the mask information can be determined by the data evaluation module more quickly, and less computing power is needed to determine the updated mask information than if the mask information were updated for the entire plurality of biosensor elements.

Accordingly, when the data selection module is configured to periodically update, it may update sector-by-sector. That is, the data selection module may be configured to determine a subsequent update sector of the plurality of sectors and provide the selected sensor data from the plurality of biosensor elements in the update sector to the data evaluation module. The data selection module may then receive updated mask information comprising an updated selection of biosensor elements in that update sector and reconfigure to provide the selected sensor data from that sector to the output array. The data selection module may be configured to periodically update by performing the above steps.
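By way of non-limiting illustration only, a sector-by-sector update cycle may be sketched as follows. The round-robin ordering of sectors and the `evaluate` callback, which stands in for the data evaluation module and returns an updated sub-mask for a region, are assumptions made for the example only.

```python
from typing import Callable, Dict, List, Sequence


def periodic_update(
    sub_masks: Dict[int, List[int]],
    sectors: Sequence[Sequence[int]],
    evaluate: Callable[[int], List[int]],
    num_events: int,
) -> List[int]:
    """Run `num_events` update events, one sector per event, in round-robin order.

    Only the regions of the current update sector receive new sub-masks; all
    other regions keep streaming with their previously received sub-masks.
    Returns the sequence of sector indices that were updated.
    """
    visited: List[int] = []
    for event in range(num_events):
        sector = event % len(sectors)  # determine the subsequent update sector
        for region in sectors[sector]:
            sub_masks[region] = evaluate(region)
        visited.append(sector)
    return visited


sectors = [[0, 2], [1, 3]]  # hypothetical layout: two sectors of two regions each
masks = {region: [0, 1] for region in range(4)}
order = periodic_update(masks, sectors, lambda region: [region, region + 1], 1)
```

After the single update event in this example, only regions 0 and 2 carry new sub-masks, while regions 1 and 3 retain the sub-masks received earlier.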

As mentioned above, each sector of biosensor elements may comprise one or more regions of the biosensor elements corresponding to the region sub-masks. The regions of biosensor elements in each sector may be located non-contiguously in the analytical system. That is, the biosensor elements in each region may be located adjacent to each other in the analytical system and/or the sensor data may be received in adjacent positions in the data input array, and the regions of biosensor elements may be located in positions separate to each other in the analytical system and/or the sensor data from each region may be received in separated positions in the data input array. Accordingly, the regions of each sector may be separated from each other in the analytical system and/or in the data input array by at least one region of a different sector.

By updating the data selection module in sectors of non-contiguous regions, the routing intensity and bandwidth requirements of the data selection module for updating the mask information may be reduced. This has an effect of smoothing what would otherwise be a very uneven use of data bandwidth across the platform, between the source of the data and the data selection, and to the data evaluation module. Accordingly, the data selection module is able to accommodate more sensor data and is more adaptable and quicker to update than if the sectors for updating were located contiguously. Additionally, the demands on memory resources, including routing between subsystems like PCIe bandwidth to the FPGA card or GPU, can be spread out over time. This reduces the risk of bottlenecks in these resources while the remaining data streaming and processing is happening. This may also allow for more time between processing steps. Fewer compute resources per unit time can be allocated, e.g., fewer CPU/GPU/FPGA parallel threads may be required. Update events need to compete for resources with the remainder of the data processing and streaming, so being able to perform an update even with reduced priority reduces hardware requirements of the update event.
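By way of non-limiting illustration only, one simple way of forming sectors of non-contiguous regions is to interleave the regions across the sectors, as sketched below. The modulo assignment and the function names are assumptions made for the example; other non-contiguous arrangements may equally be used.

```python
from typing import List


def sector_of_region(region: int, num_sectors: int) -> int:
    """Return the sector to which a region belongs under interleaved assignment."""
    return region % num_sectors


def regions_in_sector(sector: int, num_sectors: int, num_regions: int) -> List[int]:
    """List the regions of one sector; neighbours in the list are num_sectors apart."""
    if not 0 <= sector < num_sectors:
        raise ValueError("sector index out of range")
    return list(range(sector, num_regions, num_sectors))


members = regions_in_sector(1, 4, 12)
```

With four sectors, any two regions of the same sector are separated in the data input array by three regions belonging to the other sectors.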

As mentioned above, the data selection module may be configured to provide the sensor data from the plurality of biosensor elements to a data evaluation module. The data evaluation module may be configured to assess the sensor data, determine the updated sensor mask information, and provide the sensor mask information to the data selection module. Accordingly, providing the sensor data to the data evaluation module may form part of the periodic updates of the data selection module.

The data selection module may comprise a side-channel output array connectable to the evaluation module for generating the updated mask information. The data selection module may be configured to periodically provide sensor data from a selected evaluation sector of the biosensor elements from the input array to the side-channel output array. The evaluation sector may be equivalent to an update sector wherein the update sector may be evaluated and used to update the mask information. However, other data pipelining arrangements may be possible wherein the update sector may be different to the evaluation sector. Therefore, the side-channel output array may have a data bandwidth corresponding to the data size of each evaluation sector of biosensor elements. Multiple sectors may be evaluated and then updated, e.g., in parallel or in serial or a combination thereof. Parallelization of multiple sector updates allows for a faster revolution of a dynamic full-frame mask update. Serialization of sector updates may reduce bandwidth and/or processing resources, which may allow for a decrease in hardware and/or free additional bandwidth and/or processing by a downstream analytical system.

In this way, even sensor data from biosensor elements which are not identified by the most recent mask information may be provided to the evaluation module. Therefore, new biosensor elements may be identified during each update, thereby enabling the data selection module to adapt to changes in performance of the biosensor elements.

The data evaluation module may be embodied fully or partially on a CPU, GPU, an FPGA, etc., which is configured to perform the methods and aspects discussed herein for configuring the data selection module.

The data selection module may be configured to perform an initialization sequence for initially configuring the data selection module. For example, the initialization sequence may include initially receiving mask information relating to all (or a selection) of the biosensor elements and configuring the data selection module to selectively provide the identified sensor data from each of the biosensor elements to the output array.

The data selection module may be configured to begin selectively providing the sensor data to the output array using the mask information upon completion of the initialization sequence. In this way, the data selection module can usefully configure each sector for data streaming before it begins streaming the sensor data and periodically updating the mask information sector-by-sector.

The initialization sequence may comprise providing initial sensor data from all (or a portion) of the biosensor elements to a data evaluation module for evaluation in order to receive the initial mask information in return (e.g., as discussed in further aspects of the present disclosure). The initial mask information may therefore relate to all of the biosensor elements (or all of the biosensor elements which are available or intended for use) for initially configuring the data selection module. In some examples, the data selection module may be configured to provide the initial sensor data from sequential sectors of the biosensor elements. Accordingly, less data bandwidth and memory resource may be required to communicate the initial sensor data and determine the initial mask information. However, in some examples, all of the initial sensor data may be provided to the data evaluation module and used to determine the initial masking operation. While this method may require more data bandwidth to communicate the sensor data, the initialization routine may be performed more quickly, thereby reducing the downtime of the analytical system during the initialization routine.

The data evaluation module may be provided (fully or in part) on a same shared resource as the downstream data processing module. For example, the mask information may be determined on a same GPU that is also used for the downstream data processing. In these examples, the data output array may be connectable to the shared resource for providing the initial sensor data from the data selection module to the downstream processing resource and for receiving the initial mask information in return. The data selection module may therefore be configured to provide the initial sensor data to the shared processing resource (and receive the initial mask information) via the data output array during the initialization sequence and then repurpose the output array for providing the selected (masked) sensor data to the downstream data processor after the initialization sequence is complete.

That is, the data selection module may comprise a data side channel for communicating sensor data and receiving mask information to/from the data evaluation module, and main data output channels for providing selected (masked) sensor data to the downstream data processing module. In the present examples, both the data side channels, and the main channels may be utilized to provide sensor data to the data evaluation module during the initialization sequence. In this way, the data evaluation module may use more processing resource to determine the initial mask information more quickly and reduce the initial downtime of the analytical system while the data selection module is configuring.

Notably, some analytical systems may benefit from providing and utilizing only the initial “data mask”. For example, analytical systems for detecting peptides may benefit from masking to only stream sensor data from “productive” biosensor elements that have a single target peptide, and not stream sensor data from “non-productive” biosensor elements that do not comprise a target peptide or that comprise more than one target peptide. Such peptide detection systems (e.g., peptide sequencing analytical devices) may not see non-productive biosensor elements become productive, and so the mask information may not need to be updated in these cases.

The plurality of biosensor elements may include at least 1 million biosensor elements. For example, the plurality of biosensor elements may include at least 10 million, at least 25 million, at least 100 million biosensor elements. Accordingly, the data selection module may usefully be configurable to selectively provide a large amount of sensor data to the downstream processing module, significantly reducing the computing resource needed to process the sensor data and providing more adaptable data selection functions for streaming data in real time than existing data selection methods which require large memories, data buffers, or hugely complex and large routing layouts. The input array may be configured to receive sensor data from the plurality of biosensor elements and provide the sensor data from the identified biosensor elements to the data output array at a predetermined frame rate, or a variable frame rate, wherein a frame includes a sensor datum from each of the plurality of biosensor elements. For example, the frame rate may be at or between 10 frames per second (FPS) and 1000 FPS, more preferably at or between 50 and 300 FPS, more preferably at or between 100 and 200 FPS, such as at 100 FPS. The frame rate may be 10 FPS or more, 50 FPS or more, 100 FPS or more, or 200 FPS or more.
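By way of non-limiting illustration only, the reduction in streamed data achieved by masking may be estimated with the following arithmetic sketch. It assumes one byte (8 bits) of sensor data per biosensor element per frame, regions of 1024 biosensor elements, and a global selection quantity of eight; this combination of values and the function names are chosen for the example only.

```python
def stream_rate(num_elements: int, frames_per_second: int, bytes_per_sample: int = 1) -> int:
    """Unmasked data rate in bytes per second: one sample per element per frame."""
    return num_elements * frames_per_second * bytes_per_sample


def masked_stream_rate(
    num_elements: int,
    frames_per_second: int,
    region_size: int,
    selection_quantity: int,
    bytes_per_sample: int = 1,
) -> int:
    """Masked data rate in bytes per second with a global selection quantity per region."""
    num_regions = num_elements // region_size
    return num_regions * selection_quantity * frames_per_second * bytes_per_sample


elements = 25 * 2**20  # 26,214,400 biosensor elements, i.e. 25,600 regions of 1024
full_rate = stream_rate(elements, 100)
masked_rate = masked_stream_rate(elements, 100, 1024, 8)
```

Under these assumptions the unmasked stream is 2,621,440,000 bytes per second at 100 FPS, whereas the masked stream is 20,480,000 bytes per second, a reduction by a factor of 128 (1024 divided by 8).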

The output data bandwidth may have fewer lanes than the input data bandwidth. In certain aspects, the output data bandwidth may be at a frame rate higher than could be supported in the absence of the mask, e.g., may have a number of lanes and a clock cycle that would not be able to pass data from all of the biosensor elements. In certain aspects, only a portion of the output data bandwidth is used when the mask is applied to the sensor data, for example if the available output data bandwidth is equal to the input data bandwidth. Reducing output data bandwidth use below what is available may reduce power consumption and/or overheating.

In certain aspects, the output data bandwidth may be variable. For example, the number of powered output lanes and/or frame rates may be selectable. Selectable frame rates may be enabled through use of a dedicated reference clock or a phase-locked loop control system. In such cases, the output bandwidth could be selected prior to a run or may be determined by the data selection module (e.g., at the start of a run and/or during a run). As such, the frame rate may be changed during a run and/or between runs.

As discussed above, the data selection module may be implemented on one or more FPGAs. Alternatively, the data selection module may be implemented on a bespoke portion of an integrated circuit (IC) comprising fixed function circuitry. For example, the IC may be included in a sensor chip or sensor stack comprising the plurality of biosensor elements.

Accordingly, in a further aspect, there is provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a data selection module according to the preceding aspect.

In a further aspect, there is provided a non-transitory computer-readable storage medium having stored thereon, a computer-readable description of an integrated circuit that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture a data selection module according to the previous aspect.

In a further aspect of the present disclosure there is provided a method of configuring a data selection module for an analytical system, the data selection module comprising: a data input array configured to receive sensor data from a plurality of biosensor elements of the analytical system; and a data output array connectable to a downstream data processing module for analyzing the sensor data. The data selection module is for selectively providing the sensor data from the input array to the output array. The method comprises: receiving sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data is sought; and configuring the data selection module to selectively provide the sensor data from the identified subset of biosensor elements to the output array.

In a further aspect, there is provided a method of providing sensor data from a plurality of biosensor elements to a downstream data processing module of an analytical system, the method comprising: receiving sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data is sought, and selectively providing the sensor data from the identified subset from an input array connectable to the plurality of biosensor elements to an output array connectable to a downstream data processing module for analyzing the sensor data. The method of the present aspect may also be referred to as a method of streaming sensor data from the plurality of biosensor elements to a downstream data processing module.

In a general sense, the present disclosure also provides methods and systems for identifying, in real-time and from a plurality of biosensor elements, a subset of the biosensor elements which are producing sensor data which is to be processed. By identifying this subset of the biosensor elements, those not producing data which is to be processed can be filtered earlier in the analytical workflow and so computing time, expense, power, and resource can be reduced.

In a further aspect, there is provided: a use of the method of providing sensor data from a plurality of biosensor elements to a downstream data processing module in performing single-molecule real-time sequencing.

In a further aspect there is provided a use of the data selection module discussed in aspects herein, in performing single-molecule real-time sequencing.

In a further aspect, there is provided a method of performing single-molecule real-time sequencing, including performing the methods of providing sensor data from a plurality of biosensor elements to a downstream data processing module discussed herein.

In a further aspect, the present disclosure provides a method of generating sensor mask information for a data selection module of an analytical system, the data selection module being for selectively providing sensor data from a plurality of biosensor elements to a downstream data processing module, the method comprising steps of:

(a) receiving sensor data from at least a portion of the plurality of biosensor elements;

(b) identifying, from the received sensor data, a subset of the biosensor elements which are producing sensor data to be processed; and

(c) generating sensor mask information indicating at least a portion of the identified subset of the plurality of biosensor elements.

The method may further comprise: (d) providing the sensor mask information to the data selection module. As will be appreciated, the sensor mask information can then be utilized by the data selection module in the manner discussed above so as to reduce the resource requirements for the downstream data processing module.

The sensor data to be processed may be sensor data meeting one or more criteria. For example, the sensor data to be processed may contain data of a type which has been flagged, may be produced from one or more biosensor elements which have been flagged (for example on the basis of their location on the chip, or the type of analysis/sensing being performed therein), may be producing data which is usable for base calling, may be producing data which is indicative of biological activity in the respective biosensor element, and/or may be producing data which is indicative of single-molecule sequencing.

Generating the sensor mask information may include a step of generating a data structure in which a label or index value for each of the subset of identified biosensor elements is present. For example, generating the sensor mask information may include generating an array the values within which are addresses of the identified biosensor elements as discussed above in relation to data mask application.
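By way of non-limiting illustration only, the generation of such a data structure may be sketched as follows, here as one list of region-local addresses per region. The contiguous, equally sized regions and the function name `build_region_sub_masks` are assumptions made for the example only.

```python
from typing import Iterable, List


def build_region_sub_masks(
    identified: Iterable[int],
    region_size: int,
    num_regions: int,
) -> List[List[int]]:
    """Group global indices of identified biosensor elements into region sub-masks.

    Assumption: element i belongs to region i // region_size and has the
    region-local address i % region_size.
    """
    sub_masks: List[List[int]] = [[] for _ in range(num_regions)]
    for index in sorted(set(identified)):
        region, local_address = divmod(index, region_size)
        if not 0 <= region < num_regions:
            raise ValueError("element index lies outside the sensor array")
        sub_masks[region].append(local_address)
    return sub_masks


mask = build_region_sub_masks([1, 17, 5, 30], region_size=16, num_regions=2)
```

In this example, elements 1 and 5 fall in the first region, and elements 17 and 30 fall in the second region at local addresses 1 and 14.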

The data selection module may be the data selection module discussed above with respect to the other aspects.

Identifying the subset of the biosensor elements may include determining whether metadata associated with each biosensor element indicates that the biosensor element is to be included in the subset. The metadata may be descriptive of, for example, a location of the biosensor element (e.g., on a sensor chip); a type of data being produced by the biosensor element; and/or a type of input provided to the biosensor element (e.g., a quality-control, QC, sample). The metadata may be received in a step performed before step (b), and either before, after, or in parallel with step (a).

Steps (a) - (d) may be repeated and may be repeated periodically or continuously (and so with no gap between repetitions). The steps (a) - (d) may be repeated in this fashion within at least the first 10%, 20%, 30%, 40%, 50%, 60%, 66%, 70%, 80%, or 90% of a total run of an analytical system in which the biosensor elements are located. The step of receiving the sensor data from the biosensor elements may include receiving sensor data from the biosensor elements which are not presently identified in the sensor mask information. The sensor data from the biosensor elements which are not presently identified in the sensor mask information may be received via a separate channel to a channel through which the data from the biosensor elements which are identified in the sensor mask information is received by the downstream data processing module. The biosensor elements which are identified in the sensor mask information may be productive biosensor elements and the biosensor elements which are not identified in the mask may be referred to as non-productive biosensor elements. As discussed above, some biosensor elements may transition from being productive to non-productive, or vice versa, for example, depending on if individual polymerases are active or inactive.

The method may include receiving sensor data from all of the biosensor elements in an initialization phase (e.g., a first loop of steps (a) - (d)), and subsequently data may be received from fewer than all of the biosensor elements in a production phase (e.g., in a second or subsequent loop of steps (a) - (d)). In the initialization phase, the sensor data may not be subsequently analyzed (for example, it may not be used for base calling). In the production phase, the sensor data may be used for subsequent analysis (for example, it may be used for base calling).

The sensor data used in the initialization phase may comprise between 512 frames of sensor data and 65,536 (64K) frames of sensor data. Typically, the sensor data used in the identifying step comprises 2ⁿ frames of data for convenience. The amount of sensor data used in the initialization phase may be chosen to ensure that an initial mask can be derived quickly enough with respect to the full length of the run (i.e., so as to have minimal impact on overall data collection).

Steps (a) - (c) may be performed in series or in parallel for a plurality of regions of the biosensor elements. In such examples, the entire biosensor element array (comprising all of the biosensor elements on a chip) may be notionally broken down into a series of regions. The steps (a) - (c) can then be performed for each of the regions. In some examples, the sensor mask information may be provided on a per-region basis (and so step (d) is included in each iteration for the respective region) or a total sensor mask information may be generated from sensor mask information generated for each region and this total sensor mask information provided in a single step. The initialization phase referred to above may be performed on a per-region basis or on all of the biosensor elements in a single pass. That is, the initial mask provided in the initialization phase may be built-up through a per-region application of steps (a) - (c) or may be derived in a single iteration of steps (a) - (c).

The identified subset of biosensor elements from which data is to be processed may include no non-productive biosensor elements (e.g., biosensor elements in which no biological activity is detected, such as when no target is loaded, and/or multi-loaded biosensor elements in which signal from a single target is not distinguishable), the subset may include only single-loaded biosensor elements, or the subset may include a mixture of single- and multi-loaded biosensor elements.

In certain aspects, the received sensor data may include less than the first 10 seconds, the first 30 seconds, the first minute, the first five minutes, or the first ten minutes of data received from the biosensor elements during a run.

Generating the sensor mask information may include selecting all of the identified subset of the plurality of biosensor elements. Generating the sensor mask information may include selecting less than all of the identified subset of the plurality of biosensor elements. The selected number of biosensor elements may be no more than 75%, no more than 70%, no more than 65%, no more than 60%, no more than 55%, no more than 50%, no more than 45%, or no more than 40% of the total number of biosensor elements from which sensor data was received.

Identifying the biosensor elements may include analyzing the received sensor data. This may include analyzing the received sensor data and applying a label to each biosensor element from which data is received based on the analysis. The sensor data may include pulsatile data, and the analysis may include identifying pulse features from the pulsatile data (e.g., pulse rate, stutter rate, intra-pulse variation, and/or pulse autocorrelation). Analyzing the received sensor data may include calculating, for each biosensor element, one or more metrics descriptive of the properties of the received sensor data from that biosensor element, and identifying the subset of biosensor elements on the basis of the calculated metrics. The one or more metrics may include one or more of: data indicative of amplitude, data indicative of amplitude clipping, data indicative of an autocorrelation between frames of sensor data of a respective biosensor element, data indicative of stuttering events, data indicative of a pulse variation (e.g., a mean pulse variation), and a number of pulses within a time period. The metrics may include any one or more of: data indicative of a first percentile and a second percentile of the number of pulses; data indicative of a first percentile and a second percentile of the (mean) pulse variance; and data indicative of a first percentile and a second percentile of stuttering events. Identifying may include comparing the or each calculated metric to a respective benchmark value or threshold. The one or more metrics may include at least one statistically based metric. By statistically based metric, it may be meant a metric which is calculated based on time-series data relating to a given property or parameter of the biosensor element or data therefrom. The at least one statistically based metric may be an autocorrelation calculated for each biosensor element.
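By way of a non-limiting sketch, two of the per-element metrics mentioned above (an autocorrelation between frames and a number of pulses within a time period) might be computed as follows. The lag-1 estimator, the fixed amplitude threshold used to count pulses, and all function names are assumptions made for illustration; the disclosure does not prescribe a particular estimator.

```python
import numpy as np


def lag1_autocorrelation(traces):
    """Lag-1 autocorrelation of each row (one row per biosensor element)."""
    traces = np.asarray(traces, dtype=float)
    centered = traces - traces.mean(axis=1, keepdims=True)
    # Sum of products of adjacent centered samples over the total sum of squares.
    num = (centered[:, :-1] * centered[:, 1:]).sum(axis=1)
    den = (centered ** 2).sum(axis=1)
    # Flat traces have zero variance; report 0.0 rather than dividing by zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def pulse_count(traces, threshold):
    """Number of pulses per element, counted as upward threshold crossings."""
    above = np.asarray(traces) > threshold
    rising = ~above[:, :-1] & above[:, 1:]
    # A trace that starts above the threshold contributes one extra pulse.
    return rising.sum(axis=1) + above[:, 0]


# Toy data: element 0 shows two pulses, element 1 is a flat baseline.
traces = np.array([
    [0, 0, 5, 5, 0, 0, 5, 5],
    [1, 1, 1, 1, 1, 1, 1, 1],
])
autocorr = lag1_autocorrelation(traces)
pulses = pulse_count(traces, threshold=2)
```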

Identifying the subset of biosensor elements may include providing the one or more calculated metrics to a scoring component, the scoring component deriving from the one or more calculated metrics a composite data quality score. Identifying the subset of biosensor elements may further include comparing each respective composite data quality score to a threshold score, and identifying the biosensor element as one for processing if the respective composite data quality score is greater than the threshold score. The scoring component may be a pre-trained machine-learning model configured to generate the composite data quality score from the or each calculated metric. The scoring component may be one of: a regression model, a linear regression model, a regression tree, a support vector machine, a support vector regression model, a neural network, or a K-nearest neighbors model.
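The scoring and thresholding described above can be sketched, for the linear regression case only, as a weighted sum of metrics followed by a comparison. The metric columns, weights, and threshold below are placeholder values chosen for illustration and are not trained or disclosed values.

```python
import numpy as np


def composite_scores(metrics, weights, bias=0.0):
    """Linear scoring component: one composite score per biosensor element."""
    metrics = np.asarray(metrics, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return metrics @ weights + bias


def select_for_processing(scores, threshold):
    """Boolean mask: True where the composite score is greater than the threshold."""
    return np.asarray(scores) > threshold


# Hypothetical metrics: columns are [pulse count, lag-1 autocorrelation].
metrics = np.array([[40.0, 0.2], [2.0, 0.0], [90.0, 0.8]])
weights = np.array([0.5, 10.0])  # placeholder values, not trained weights
scores = composite_scores(metrics, weights)
mask = select_for_processing(scores, threshold=20.0)
```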

The pre-trained machine-learning model may have been trained on historical biosensor element data. It may have been trained on a set of biosensor element data which comprises at least 10,000, 20,000, or 30,000 elements, each element being labelled. It may be trained to produce a label for a given biosensor element in accordance with labels used in training. It may be trained to label each biosensor element in accordance with the corresponding composite data quality score. The composite data quality score may be, for example, an expected number of callable bases from the respective biosensor element. Callable bases may be those where the data exceeds a predetermined accuracy. The label may be a combination (e.g., a multiplication) of the accuracy of the data from the biosensor element and the expected length of the bases produced.

Identifying the subset of biosensor elements may include a step of selecting the top n biosensor elements according to their respective composite data quality scores, where n is an integer smaller than the total number of biosensor elements. Selecting the top n biosensor elements may be performed within a plurality of regions of the biosensor elements, such that y sets of the top n biosensor elements are selected (where y is the number of regions of biosensor elements).

The method may include a pre-processing step, performed before identifying the subset of biosensor elements. The pre-processing step may be performed before the step of analyzing the received sensor data. The pre-processing step may include applying one or more filters. The pre-processing step may include identifying one or more transitions within the received sensor data. The filter may be a smoothing filter, a convolutional filter, a digital filter, a least-squares filter, or a Savitzky-Golay filter. The pre-processing step may include a dark background correction, baseline subtraction, and/or a crosstalk correction step.
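The per-region selection of the top n biosensor elements can be illustrated with the following minimal sketch, which assumes that each element carries an integer region identifier and a composite data quality score. The function name and the toy values are hypothetical.

```python
import numpy as np


def top_n_per_region(scores, region_ids, n):
    """Boolean mask selecting the n highest-scoring elements in each region."""
    scores = np.asarray(scores, dtype=float)
    region_ids = np.asarray(region_ids)
    mask = np.zeros(scores.shape, dtype=bool)
    for region in np.unique(region_ids):
        idx = np.flatnonzero(region_ids == region)
        # Sort this region's elements by descending score and keep the first n.
        best = idx[np.argsort(-scores[idx], kind="stable")[:n]]
        mask[best] = True
    return mask


# Toy example: y = 2 regions of three elements each, n = 2.
scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8])
region_ids = np.array([0, 0, 0, 1, 1, 1])
mask = top_n_per_region(scores, region_ids, n=2)
```

In this toy example, y sets of n elements are selected, so the mask enables y × n = 4 elements in total.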

The received sensor data may include temporally spaced frames of received sensor data. For example, each frame may describe an amplitude of a signal received by a detector in the respective biosensor element. The received sensor data may include a series of amplitude values against time. For example, the received sensor data may include a series of a single amplitude against time (e.g., a photocurrent amplitude from a single detector or a voltage from a single resistor). The received sensor data may include multiple series of amplitude values against time, indicative of different sensed events by the biosensor element. For example, the received sensor data may include a plurality of amplitudes against time (e.g., a photocurrent amplitude from a plurality of detectors, each associated with a different frequency of light).

The method may be performed on an FPGA, i.e., in fixed function circuitry, on an integrated circuit, on a CPU/GPU, on a softcore located on an FPGA, or in a combination of these.

In a further aspect, there is also provided a method of training a machine-learning model to generate composite data quality scores from one or more metrics, the steps including: generating a training set, the training set comprising historical biosensor element data, each element in the training set being labelled with a label corresponding to a composite data quality score; and training the machine-learning model on the training set, thereby providing a trained machine-learning model which can utilize the one or more metrics associated with a biosensor element to provide a composite data quality score for that biosensor element.
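As a hedged sketch of the training steps above, the following fits a linear regression scorer by ordinary least squares. The synthetic, noise-free data stands in for labelled historical biosensor element data, and the helper `make_labels` assumes the optional labelling in which accuracy is multiplied by expected read length. All names and values are illustrative assumptions; any of the other model types listed in this disclosure could be substituted.

```python
import numpy as np


def make_labels(accuracy, read_length):
    """Label per element: accuracy multiplied by expected read length."""
    return np.asarray(accuracy, dtype=float) * np.asarray(read_length, dtype=float)


def train_linear_scorer(metrics, labels):
    """Fit weights and a bias term by ordinary least squares."""
    metrics = np.asarray(metrics, dtype=float)
    design = np.column_stack([metrics, np.ones(len(metrics))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(labels, dtype=float), rcond=None)
    return coef[:-1], coef[-1]


def predict(metrics, weights, bias):
    """Composite data quality score for each row of metrics."""
    return np.asarray(metrics, dtype=float) @ weights + bias


# Synthetic stand-in for a labelled training set (not real instrument data).
rng = np.random.default_rng(0)
train_metrics = rng.uniform(0.0, 1.0, size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
train_labels = train_metrics @ true_weights + 3.0  # noise-free for clarity

weights, bias = train_linear_scorer(train_metrics, train_labels)
example_labels = make_labels([0.99, 0.5], [10_000, 2_000])
```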

The method of this aspect may include any one, or any combination insofar as they are compatible, of the optional features set out with reference to any other aspect herein. For example, the composite data quality score may be an expected number of callable bases from the respective biosensor element. Callable bases may be those where the data exceeds a predetermined accuracy. The label may be a combination (e.g., a multiplication) of the accuracy of the data from the biosensor element and the expected length of the bases produced.

The one or more metrics may be descriptive of the properties of the historical received sensor data. The one or more metrics may include one or more of: data indicative of amplitude, data indicative of amplitude clipping, data indicative of an autocorrelation between frames of sensor data of a respective biosensor element, data indicative of stuttering events, data indicative of a pulse variation (e.g., a mean pulse variation), and a number of pulses within a time period. The metrics may include any one or more of: data indicative of a first percentile and a second percentile of the number of pulses; data indicative of a first percentile and a second percentile of the (mean) pulse variance; and data indicative of a first percentile and a second percentile of stuttering events. The one or more metrics may include at least one statistically based metric. By statistically based metric, it may be meant a metric which is calculated based on time-series data relating to a given property or parameter of the biosensor element or data therefrom. The at least one statistically based metric may be an autocorrelation calculated for each biosensor element.

The machine-learning model may be a regression model, a linear regression model, a regression tree, a support vector machine, a support vector regression model, a neural network, or a K-nearest neighbors model.

The training set may comprise at least 10,000, 20,000, or 30,000 elements, each element being labelled. The machine-learning model may be trained to produce a label for a given biosensor element in accordance with labels used in training.

As discussed above, the method of the present aspect may be implemented fully or in part on a CPU or GPU of the analytical system. Accordingly, in a further aspect, there is also provided a non-transitory computer-readable storage medium comprising instructions that, when executed by a computing apparatus, cause the computing apparatus to carry out the methods discussed herein of providing sensor mask information to a data selection module.

In a further aspect, there is also provided a data evaluation module for generating sensor mask information for a data selection module of an analytical system, the data selection module being for selectively providing sensor data from a plurality of biosensor elements to a downstream data processing module, the data evaluation module being configured to:

(a) receive sensor data from at least a portion of the biosensor elements;

(b) identify, from the received sensor data, a subset of the biosensor elements which are producing sensor data to be processed; and

(c) generate sensor mask information indicating the identified subset of the plurality of biosensor elements.

The data evaluation module may be configured to perform any one, or any combination insofar as they are compatible, of the optional features of the method of providing sensor mask information to a data selection module set out above. The data evaluation module may be provided purely in software, purely in hardware, or a combination of software and hardware. The data evaluation module may be provided at least partially, or entirely, on an FPGA. The FPGA on which the data evaluation module is provided may be the same FPGA on which the data selection module is provided.

In further aspects, there is also provided: a use of the method of configuring a data selection module in performing single-molecule real-time sequencing; and a use of the method of providing sensor mask information in performing single-molecule real-time sequencing.

In a further aspect, there is provided a method of performing single-molecule real-time sequencing, including performing the method of providing sensor mask information to a data selection module, performing the method of configuring the data selection module, and performing the method of providing sensor data.

During sequencing, the method may include a step of introducing additional sample to a sensor chip including the plurality of biosensor elements. The method may then include a step, responsive to an indication that additional sample has been introduced to the sensor chip, of performing the method of providing sensor mask information to the data selection module. This can either be done as a repeat of the initialization phase discussed above, or as a repeat of the production phase discussed above.

Loading of a target into biosensor elements (e.g., into binding regions of biosensor elements) may include attachment of the target by biotin-streptavidin, click chemistry, hybridization, affinity binding by a protein such as an antibody, binding of polynucleotide by a helicase, or any other suitable means.

In a further aspect, there is provided an analytical system comprising: a plurality of biosensor elements configured to generate sensor data, and a data selection module according to previous aspects discussed herein. Advantageously, by selectively providing the sensor data from the identified subset of biosensor elements to the output array, the downstream data processing module of the analytical system can use less computing resource and time to process the sensor data. The analytical system is therefore more efficient and faster at generating results based on the sensor data.

In each of the above aspects, the analytical system (which may, for example, be referred to as a molecule sequencing system or a biosensor system or an instrument for performing biochemical analysis of a sample) may be any suitable system comprising a plurality of biosensor elements and configured to receive sensor data from each of the biosensor elements in real-time.

The analytical system may further comprise the downstream data processing module for analyzing the sensor data. For example, the downstream data processing module may be for performing DNA or peptide sequencing etc. The downstream data processing module may be considered to include any computing resource configured to receive and process the sensor data from the data selection module. This may include a GPU, a CPU, connection components such as PCIe buses, and memory banks.

The analytical system may further comprise a data evaluation module configured to perform the method of configuring the data selection module to selectively provide the sensor data from the biosensor elements to a downstream data processing module according to the aspects discussed herein. The data evaluation module may be implemented on one or more of: an FPGA, a CPU, and a GPU. For example, in some examples, some or all of the functionality of the data evaluation module may be implemented on a same FPGA as the data selection module. In some examples, some or all of the functionality of the data evaluation module may be implemented on a GPU or CPU. The same GPU or CPU may also be configured to implement the functionality of the downstream data processing module.

In examples where the functionality of the data evaluation module and the downstream data processing module are implemented on a shared resource (e.g., a GPU), an initial (large amount of the) data bandwidth of the shared resource may be configured to determine initial mask information for configuring the data selection module before the data selection module begins providing data to the downstream processing module. Accordingly, after the data selection module is configured with the initial mask information, a portion of the initial data bandwidth may be repurposed for receiving the sensor data from the data selection module and performing the functionality of the downstream data processing module. In examples where the data evaluation module is configured to generate updated mask information, a smaller portion of the initial data bandwidth of the shared resource may be maintained for generating updated mask information.

The analytical system may further comprise one or more pre-processing modules configured to perform initial processing on the sensor data. For example, the pre-processing modules may include a crosstalk filter and/or a background level subtractor.

The analytical system may be a single molecule detection system. For instance, the analytical system may be configured to detect single molecule events associated with each biosensor element.

A target molecule or a complex may be loaded into the analytical system according to a Poisson distribution such that some biosensor elements have no target molecules, some have a single target molecule, and some are multi-loaded. The biosensor elements having a single target molecule may be considered as productive biosensor elements whereas the biosensor elements having no or multiple target molecules may be considered as non-productive biosensor elements.
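To make the loading statistics concrete, the following sketch computes the expected fractions of empty, single-loaded, and multi-loaded biosensor elements under an idealized Poisson model with a mean of λ targets per element. The idealized model and the function name are assumptions; real loading behavior may deviate from it.

```python
import math


def loading_fractions(mean_targets_per_element):
    """Expected fractions of empty, single- and multi-loaded elements (Poisson)."""
    lam = mean_targets_per_element
    empty = math.exp(-lam)           # P(k = 0)
    single = lam * math.exp(-lam)    # P(k = 1)
    multi = 1.0 - empty - single     # P(k >= 2)
    return {"empty": empty, "single": single, "multi": multi}


fractions = loading_fractions(1.0)
```

Under this idealized model the single-loaded fraction is largest at λ = 1, where it equals 1/e (about 36.8%), which illustrates why a substantial share of biosensor elements may be non-productive and therefore candidates for masking.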

The analytical system may be a sequencing system. In particular, the analytical system may be a next generation sequencing system for determining a structural order of molecules. For example, the sequencing system may be configured to sequence DNA or RNA. For example, the analytical system may be configured to sequence single-molecule proteins such as peptides or polymerase molecules.

To detect single molecules, the analytical system may be configured to detect events in real-time. The events may be changes in a parameter associated with the biosensor elements such as an optical signal, a voltage between electrodes, an electrical current, etc.

Herein, a biosensor element may be understood as all of the components which: physically locate a biological entity, the activity of which is to be sensed; facilitate a biological event to be sensed; obtain data relating to the biological event; and provide the obtained data to further components. For example, the biosensor element may include a pad and/or a nanowell optical confinement element (e.g., a zero-mode waveguide). A pad may be any site functionalized to bind the target molecule, and may be on a planar surface, or may be raised or depressed relative to a planar interstitial surface. In general, pads as used herein are not optically confining. A pad may comprise one or more (e.g., two) electrodes for detecting the target molecule or a reaction as described herein. For example, electrodes may detect a change in current during a single molecule DNA sequencing reaction, such as from a redox reaction, as described in US 2013/0109577 A1, which is hereby incorporated by reference in its entirety. The biosensor element may also include the components to extract signals from the pad or nanowell optical confinement element. For example, where the element includes a pad, the biosensor element may include electrodes and read-out electronics to measure a voltage across the pad or a current through a portion of the pad (where these values change as a result of the biological activity occurring in the pad). In another example, where the element includes a nanowell, the biosensor element may include the optical components which illuminate the nanowell and/or receive optical signals from the nanowell (where the values of these change as a result of the biological activity occurring in the nanowell).

Further examples of analytical systems may comprise surface pads or spots as a biosensor element in place of a nanowell, for example, if optical confinement is not needed. Each pad or spot may be optically interrogated for providing optical sensor data or may have electrodes for direct electronic detection for providing electrical sensor data.

The target may be loaded as a cluster comprising multiple copies of, e.g., a polynucleotide sequence, or may be loaded as a single nucleotide sequence and amplified into a cluster such as by bridge amplification or rolling circle amplification.

Alternatively, a target may comprise a single molecule for detection using the analytical systems discussed herein. For example, a single molecule target may be a polynucleotide or a peptide, or a complex comprising a single polynucleotide or peptide to be detected. Targets may be loaded in a Poisson distribution. For example, some biosensor elements may have no targets loaded and some may be loaded with multiple targets. Biosensor elements loaded with single targets that are generating signal may be considered “productive”, although additional aspects such as speed of events and/or quality of events may be applied to determine which biosensor elements are productive for the purpose of masking.

Loading of targets into biosensor elements generally takes place before the mask steps described herein. However, after or during an initial sample run, additional targets may be loaded, and additional masking may be performed. The device having the biosensor elements may be moved (e.g., to a liquid handler region of the analytical system) and the data acquisition may stop while additional loading is performed, or loading may be performed while data is being acquired. For example, if biosensor elements become non-productive as a run progresses, then loading may be performed to make non-productive biosensor elements productive.

The analytical system may be an optical analytical system. In these examples, the sensor data may therefore comprise data derived from optical signals received from optical sensors in the plurality of biosensor elements. The optical analytical system may be configured to distinguish events by one or more of color, amplitude, fluorescence intensity, fluorescence lifetime, and binding kinetics of the optical signals. More specifically, the optical analytical system may distinguish events using, at least, amplitude of the optical signals.

When the analytical system is an optical analytical system, the analytical system may provide an illumination source, such as a laser, that illuminates binding sites through a waveguide (i.e., an optical confinement). In some examples, the binding sites are the bottom of nanowells that act as optical confinement regions for waveguide directed illumination, and as such are optical confinement nanowells. Accordingly, each biosensor element of the optical analytical system may comprise an optical confinement nanowell in which fluorescent molecules bind to a target molecule. These binding sites may be referred to as ZMWs in certain examples provided herein. Accordingly, in these examples, the ZMWs may be considered as the (or part of the) biosensor elements of the analytical system which provide the sensor data.

In some examples of optical analytical systems, integrated optical elements such as one or more illumination waveguides, filters, and/or lenses may be positioned between each binding site (e.g., a binding site at the bottom of a nanowell, such as a ZMW) and a detection element of the biosensor element. The detection element of the biosensor element in an optical analytical system may include a photodiode, such as a CMOS sensor or CCD sensor, for detecting an optical signal from the binding site. A fluorescent label coupled to a nucleotide or affinity reagent may be distinguished (and hence the events distinguished) based on one or more of: color, amplitude, fluorescence lifetime, and/or binding kinetics (e.g., ON/OFF rates across multiple binding events between a target and the molecule that the fluorescent label is bound to) of the detected optical signal.

For example, the optical system may be a polynucleotide sequencing system. In this example, when the analytical system is an optical analytical system, the data processing module may be configured to analyze the sensor data to detect events of fluorescent nucleotides binding to a respective polymerase complexed to a target polynucleotide. The fluorescently labelled nucleotides may bind to the polymerase complexed to a target polynucleotide in an optical confinement nanowell of a respective biosensor element. In other examples, the polynucleotide sequencing system may be a non-optical analytical system as discussed below.

For example, when the analytical system is a polymerase sequencing system, biosensor elements with two or more target polynucleotides with actively extending polymerases may have all but one polymerase become inactive or pause during a sequencing run, such that the signal from only one polynucleotide is detected and a biosensor element may become “productive”. Alternatively, biosensor elements which did not have active extension by a polymerase complexed with a polynucleotide in the biosensor element may have a polymerase become active or resume sequencing after pausing and therefore become “productive”. Conversely, “productive” biosensor elements with a single extending polymerase may have that polymerase become inactive or pause or may have another polymerase complexed to another polynucleotide in the biosensor element become active and start extending, such that the biosensor element becomes “non-productive”. Accordingly, updating the masking for a region of biosensor elements comprising polymerases, such as ZMWs comprising polymerases, advantageously enables the selective passing of sensor data from the newly productive biosensor elements while ceasing the provision of sensor data from newly non-productive biosensor elements.
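The effect of a mask update of this kind can be sketched by comparing the previous mask with the updated mask to find which biosensor elements start and stop being passed downstream. The boolean-array representation of the mask and the function name are assumptions made for illustration.

```python
import numpy as np


def mask_changes(old_mask, new_mask):
    """Return indices of elements newly enabled and newly disabled by an update."""
    old = np.asarray(old_mask, dtype=bool)
    new = np.asarray(new_mask, dtype=bool)
    newly_enabled = np.flatnonzero(~old & new)    # became productive
    newly_disabled = np.flatnonzero(old & ~new)   # became non-productive
    return newly_enabled, newly_disabled


# Toy update: element 1 stops being productive, element 2 becomes productive.
old_mask = [True, True, False, False]
new_mask = [True, False, True, False]
enabled, disabled = mask_changes(old_mask, new_mask)
```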

In further examples, the analytical system may be a peptide sequencing system. For example, the analytical system may be configured to detect events of fluorescently labelled affinity reagents binding to a target peptide. The analytical system may therefore be a next generation sequencing system for determining a sequence of amino acids of a protein or peptide thereof. The fluorescently labelled affinity reagents may bind to the target peptide in a nanowell optical confinement of a biosensor element. Accordingly, the data processing module may be configured to analyze the (optical) sensor data to detect events of fluorescently labelled affinity reagents binding to a target peptide.

Example target peptides may include short peptide fragments, proteins, and protein complexes. A peptide may be functionalized for binding to a binding site of a biosensor element, or may be bound by an affinity reagent that is bound to the binding site. Analytical systems for detecting target peptides may detect the presence or absence of a particular peptide through one or more fluorescently labelled affinity reagents. Multiple binding events may provide a signature that differentiates the peptide. In certain aspects, a peptide sequence (e.g., subsequence of the peptide) may be determined by a cyclic process comprising detection of a terminal amino acid of the peptide by binding as described above, alternating with removal of the terminal amino acid such that the next amino acid is detected in the next cycle.

Therefore, the peptide sequencing systems may benefit from the data selection module discussed herein, for selectively providing sensor data from the biosensor elements to the downstream data processing module, by selecting sensor data only from the productive biosensor elements in which the nanowells are filled with one target as opposed to empty nanowells or multi-loaded nanowells. By selectively streaming the sensor data in this way the overall computing cost of performing the peptide sequencing may be reduced.

In some examples, the sample may be reloaded into the biosensor elements of a peptide sequencing system during an experimental run. Therefore, such a system may benefit from the methods discussed herein of updating the data selection module to provide updated sensor information to the downstream data processing module so that the sensor data from the newly filled biosensor elements may be received.

The analytical system may be an electrode-based analytical system wherein each biosensor element comprises one or more electrodes proximal to a target molecule. For example, each biosensor element may comprise a sensor for detecting an electrical current signal as the target molecule is positioned between two electrodes such that the sensor data from each biosensor element comprises data derived from the electrical current signal.

The disclosure includes the combination of the aspects and preferred features described except where such a combination is clearly impermissible or expressly avoided.

Summary of the Figures

Embodiments and experiments illustrating the principles of the disclosure will now be discussed with reference to the accompanying figures in which:

Fig. 1 shows a functional block diagram of an analytical system according to aspects of the present invention;

Fig. 2 shows a block diagram of an embodiment of the analytical system;

Fig. 3 shows a block diagram of another embodiment of the analytical system;

Fig. 4 shows a block diagram of another embodiment of the analytical system;

Fig. 5 shows a block diagram of an analytical system wherein the data selection module is implemented on the sensor stack;

Fig. 6 shows a block diagram of an analytical system wherein the data selection module is partially implemented on the sensor stack;

Fig. 7 shows a method of configuring a data selection module for an analytical system according to aspects of the present invention;

Fig. 8 shows a diagram of sensor data being provided to a host based on sensor mask information;

Fig. 9 shows a diagram of a data mask being applied to sensor data;

Fig. 10 shows a diagram of a data mask being updated;

Fig. 11 shows an example acquisition signal from an amplitude-based single-molecule real-time DNA sequencing platform;

Figs. 12A - 12C show examples of acquired signals from a single loaded nanowell, a multi-loaded nanowell, and an unloaded nanowell respectively;

Fig. 13 shows an example of an acquired signal from a biosensor element and the respective smoothed signal;

Fig. 14 shows an example of a smoothed signal and the analysis performed thereon;

Fig. 15 shows an example of an acquired signal from a multi-loaded nanowell;

Fig. 16 is a plot of autocorrelation against loading;

Fig. 17 is a plot of Gbase HiFi calls for 15 experimental runs (runs 1 - 13 being performed with 14 kB inserts and runs 14 - 15 being performed with 18 kB inserts), each run being performed on the basis of all available biosensor elements, a selected subset of the biosensor elements, and a random selection of the same size;

Fig. 18 is a plot of Gbase HiFi calls for 20 experimental runs (runs 1 - 8 having > 75% loading and runs 9 - 20 having <60% loading), each run being performed on the basis of all available biosensor elements, a selected subset of the biosensor elements, and a random selection of the same size;

Fig. 19 is a plot of early loading ratio for each collection having varying degrees of loading;

Fig. 20 shows a method of configuring a data selection module; and

Fig. 21 shows a method of training a machine-learning model.

Detailed Description of the Disclosure

Figure 1 shows a simplified block diagram of an analytical system 1 comprising a plurality of biosensor elements.

The analytical system 1 may be one of the single molecule sequencing systems discussed above such as: a single molecule DNA sequencer comprising zero-mode waveguide (ZMW) nanowells as discussed in [1], a single-molecule protein sequencer comprising ZMWs as discussed in [2], or any other suitable biosensing system which generates sensor data from a plurality of biosensor elements. Further documents which discuss the concepts and principles of single molecule sequencing systems include: US 2010/0221716 A1, US 2009/0024331 A1, US 2019/012779 A1, US 2010/0065726 A1, and US 2013/0109577 A1, each of which is hereby incorporated by reference in its entirety.

Each of the biosensor elements (not shown) provides sensor data to a sensor interface 2 of a sensor stack of the analytical system 1.

The sensor interface 2 is connected to a data selection module 4 by a plurality of parallel data connections 12. The data selection module 4 is configured to selectively provide the sensor data, in real time, from the parallel data connections 12 to a downstream data processor 6 via a plurality of output connections 14 for post processing (for example, to perform base calling as part of DNA sequencing). Accordingly, the data selection module is configured to stream the selected sensor data from an input array of the data selection module to an output array of the data selection module.

The biosensor elements are configured to continuously provide sensor data in real time to the data selection module 4 during an experimental run of the analytical system 1. For example, frames, wherein each frame comprises a sample from each of the biosensor elements, may be provided to the data selection module 4 at 100 FPS (frames per second). A frame may typically comprise a sample from each biosensor element totaling 1 million+ samples, wherein each sample is an 8-bit word. In this way, a “frame” may be considered as one time step, or one sample point, of the sensor data. Each frame is provided as one or more data packets. Typically, several data packets may be received from the biosensor elements for each frame, wherein each data packet comprises one of a plurality of sub-frames.
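As a rough illustration of the data volume implied by these example figures, the raw input rate can be estimated as follows. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the values are the example values given above (1 million samples per frame, 8-bit samples, 100 FPS).

```python
def input_rate_bytes_per_second(samples_per_frame: int,
                                bits_per_sample: int,
                                fps: int) -> int:
    """Raw sensor data rate in bytes per second (illustrative helper only)."""
    bits_per_frame = samples_per_frame * bits_per_sample
    return bits_per_frame * fps // 8


rate = input_rate_bytes_per_second(1_000_000, 8, 100)
print(rate)  # 100000000 bytes per second, i.e. about 100 MB/s
```

Larger arrays scale this figure linearly, which is one way to see why the input bandwidth can exceed what the downstream connections carry.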

Importantly, the data bandwidth of the input connections 12 is larger than a data bandwidth of the output connections 14, and so the data selection module 4 is configured to selectively stream only a subset of the sensor data to the data processor 6. To do this, the data selection module 4 is configured to receive sensor mask information from a data evaluation module 8, via a side-channel 18, and selectively stream or filter the sensor data to the downstream data processor 6 based on the mask information.

The sensor mask information indicates an identified subset of the plurality of biosensor elements from which sensor data is sought. Or rather, the sensor mask information indicates which input connection 12 from the sensor interface 2 should be connected to the data output of the data selection module 4 for streaming data to the data processor 6 via the output connections 14. This process may be referred to herein as applying a “data mask” to the sensor data for selectively “masking” some of the sensor data so that it is not transferred to the data processor 6. The data selection module 4 typically comprises a data buffer for storing multiple frames of masked sensor data before it is provided to the data output array and the data processor 6. In this way, the data selection module 4 temporarily buffers and provides the sensor data to the downstream data processor 6 in chunks of multiple subframes. For example, chunks comprising 128 frames of sensor data from 64 biosensor elements may be provided to the downstream processor 6 at a time. The data processor 6 can then process (e.g., perform base calling on) each 128-frame chunk at a time before moving onto the next chunk. The data selection module 4 is configured to store (i.e., buffer) the remaining chunks until they may be communicated to the data processor 6.
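The masking and buffering behaviour described above can be sketched in software as follows. This is a minimal illustration only, since the disclosure describes a hardware implementation; the function name, the list-based frame representation and the decision not to emit an incomplete trailing chunk are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence


def masked_chunks(frames: Iterable[Sequence[int]],
                  selected: Sequence[int],
                  frames_per_chunk: int) -> Iterator[List[List[int]]]:
    """Yield chunks of masked frames; each masked frame keeps only selected inputs."""
    buffer: List[List[int]] = []
    for frame in frames:
        # Apply the data mask: keep only samples from the selected input positions.
        buffer.append([frame[i] for i in selected])
        if len(buffer) == frames_per_chunk:
            yield buffer
            buffer = []
    # Assumption: an incomplete trailing chunk is held back rather than emitted.


# Five frames from six inputs; inputs 1 and 4 are selected, chunks hold two frames.
frames = [[10 * t + i for i in range(6)] for t in range(5)]
chunks = list(masked_chunks(frames, selected=[1, 4], frames_per_chunk=2))
print(chunks)  # [[[1, 4], [11, 14]], [[21, 24], [31, 34]]]
```

In the example given in the text, the chunk size would be 128 frames and the selection would cover 64 biosensor elements.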

As part of the processing the data processor 6 utilizes context data associated with each biosensor element which is stored in memory (e.g., DRAM). For each chunk of sensor data, the data processor 6 must perform a context switching operation in which a new set of context data is loaded for the relevant biosensor elements in the next chunk. Therefore, by providing the sensor data to the output array in chunks of multiple frames, the data processor 6 can perform context switching less often than if the chunks only contained one frame thereby improving the speed and efficiency of the overall analysis.
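The benefit of multi-frame chunks can be illustrated with a simple count. The sketch below assumes, purely for illustration, that one context load is needed per chunk per group of biosensor elements; the function name and the group count of 1000 are hypothetical.

```python
import math


def context_switches(total_frames: int, frames_per_chunk: int, element_groups: int) -> int:
    """Context loads needed if every chunk of every element group needs one load."""
    return element_groups * math.ceil(total_frames / frames_per_chunk)


single_frame = context_switches(65_536, 1, 1000)
chunked = context_switches(65_536, 128, 1000)
print(single_frame, chunked)  # 65536000 512000
```

Under these assumptions, 128-frame chunks reduce the number of context loads by a factor of 128.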

The data evaluation module 8 is configured to determine the sensor mask information from the sensor data itself by evaluating sensor data received via the side channel 18, to determine which biosensor elements are “productive” - “productive” being a metric which is dependent on the specific context and type of analytical system 1 at hand.

The side channel 18 is configured to stream unmasked sensor data from the data selection module to the data evaluation module 8. In some examples described herein, the data evaluation module 8 is configured to periodically evaluate the sensor data and update the mask information, thereby dynamically adapting the data selection module 4 to stream sensor data from a different selection of the biosensor elements as their performance or productivity changes during an experimental run.

In some examples, the side channel 18 is configured to stream subsets (called sectors or update sectors) of the unmasked sensor data to the data evaluation module 8. The data selection module 4 is then updated sector-by-sector throughout a run. In other examples, the side channel 18 may be configured to stream all of the sensor data to the data evaluation module 8 for determining the data mask. Therefore, the data bandwidth of the side channel 18 is the same as or smaller than the data bandwidth of input connections 12, depending on the size of the update sectors.

The data selection module 4 is implemented in hardware circuitry, for example, in CMOS or the programmable logic of an FPGA. In this way, a large number of parallel data pipelines can be established between the sensor array 2 and the buffer memory of the data processor 6 for streaming the data in real time without needing to rely on large memory banks and memory calls.

The large amount of data involved in the present applications (i.e., DNA sequencing) e.g., 1 million+ biosensor elements, means that a very large amount of data routing resource and memory buffers would be required to selectively stream data from any of the positions on the input array to any of the positions on the output array. Therefore, to reduce the amount of resource required, and to enable practical implementation of the data selection module 4, the data selection module 4 is configured to divide and select the sensor data in regions of biosensor elements.

Additionally, the data selection module 4 is configured to provide the sensor data to the output array using techniques discussed herein which preserve, as far as possible, the existing routing and mapping of the sensor data when the mask information is updated. This advantageously enables the data selection to be updated with less disruption to the data streaming functions and enables post-processing by the downstream data processor 6 to be implemented more efficiently.

To implement this, the data selection module 4 is configured to perform a mapping step in which an output address is assigned to each datum of the sensor data so that it is provided to a specified location in host memory of the downstream processing module 6. This enables the data processor 6 to access the context data for each biosensor element sequentially, using direct memory access (DMA). The data mask information may comprise mapping information for the sensor data from each selected biosensor element. The mapping information may contain a target address for the sensor data, the target address corresponding to a downstream memory location in the data processor 6.

When the mask information is updated during an analytical run, as discussed in detail below, the data selection module 4 is configured to perform a remapping step in which sensor data in each chunk that is selected both before and after the update is assigned the same target location in the data output array after the update as before it. Accordingly, the data mask information may comprise mapping information for the sensor data from each selected biosensor element. The remapping step may be performed in conjunction with or separately to the configuration of the data selection module 4 (i.e., the application of the data mask) as described below in relation to Figs. 5 and 6.
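The remapping step can be sketched as follows. This is one possible software illustration and not the disclosed circuitry: the function name, the dictionary representation of the mapping, and the rule that newly selected elements fill the freed output slots in ascending order are assumptions made for the example.

```python
from typing import Dict, Iterable


def remap(old_slots: Dict[int, int], new_selection: Iterable[int]) -> Dict[int, int]:
    """Map element index to output slot, keeping slots of elements that stay selected."""
    new_set = set(new_selection)
    # Assumption: the number of selected elements is constant across updates.
    if len(new_set) != len(old_slots):
        raise ValueError("selection size must stay constant")
    # Elements selected before and after the update keep their output slot.
    new_slots = {elem: slot for elem, slot in old_slots.items() if elem in new_set}
    # Slots freed by deselected elements are handed to newly selected elements.
    freed = sorted(slot for elem, slot in old_slots.items() if elem not in new_set)
    added = sorted(new_set - set(old_slots))
    new_slots.update(zip(added, freed))
    return new_slots


before = {100: 0, 205: 1, 317: 2}
after = remap(before, [100, 317, 412])
print(after)  # {100: 0, 317: 2, 412: 1}
```

Elements 100 and 317 keep slots 0 and 2, so any context data held for those slots downstream remains valid; only slot 1 changes owner.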

The data evaluation module 8 may be implemented on a CPU or GPU, or on a same FPGA as the data selection module 4. In some examples, the functionality of the data evaluation module 8 may be divided between hardware circuitry (e.g., an IC or the programmable logic of an FPGA) and a CPU and/or GPU. The functions of the data evaluation module are described in more detail below in relation to Figures 11 to 21.

The data processor 6 may be implemented on one or more GPUs which are configured to handle large data payloads for performing single molecule sequencing on the sensor data.

Figure 2 shows a block diagram of an embodiment of the analytical system 100 according to aspects of the present invention.

A sensor device 102 comprises a plurality of biosensor elements 104. For example, the sensor device 102 may be a chip comprising a plurality of optical confinement nanowells and optical sensors. Sensor data (e.g., optical data from the optical sensors or current data from the electrical current sensors) is provided to an FPGA 106 via a series of parallel input connections 114. In this embodiment, the data selection module described above is implemented on the FPGA 106 which therefore clocks in the sensor data from the input connections 114 and selectively routes the sensor data to output connections 116. The output connections 116 (comprising e.g., PCIe buses) are connected to a CPU and/or GPU 108 which are configured to perform downstream data processing on the selected sensor data as discussed above.

A side-channel connection 118 is provided between the FPGA 106 and the CPU/GPU 108 for the data selection module to stream sensor data to and receive mask information from a data evaluation module for inferring the mask information as described above. Therefore, in this embodiment, the data evaluation module is implemented on the same CPU/GPU 108 as the downstream data processing. However, in other examples the data evaluation module may be implemented entirely, or in-part, on the FPGA 106.

An advantage of implementing the data evaluation module for determining the mask information on the same CPU/GPU 108 as the downstream data processing is that some of the resource for the downstream data processing may be initially repurposed for determining the mask information at the beginning of an analytical run. For example, as discussed below, when the data selection module and data evaluation module are performing an initialization routine to determine initial mask information for all of the sensor data, before the data selection module begins streaming the sensor data to the data processor, then some or all of the output connections 116 and the downstream data processing resource 108 may be used to determine the initial mask information. This can make the initialization routine much faster. After the initialization routine is complete then the output connection 116 and the data processing resource of the CPU/GPU may be used for the downstream data processing and the side channel 118 is used for receiving sensor data and updating the mask information.

In other examples (e.g., as discussed in relation to Figs. 4 to 5), the data evaluation module 8 may be implemented partly or entirely on the FPGA 106. An advantage of this is that real-time evaluation tasks of the mask inference can be performed without spending valuable bandwidth on transferring sensor data to the CPU/GPU 108. This same circuitry can be used for periodic updates while not burdening the CPU/GPU 108 data processor during a run. For example, real-time evaluation of the sensor data to extract metrics may be performed on the FPGA 106 and ML regression and mask inference may be performed in the CPU/GPU 108. This usefully reduces the bandwidth requirements of the side channel 118, because only the results of the real-time evaluation (e.g., extracted metrics) need to be passed to the CPU/GPU 108 (e.g., for ML regression) rather than a real-time sensor data stream. The real-time evaluation results may be equivalent to 1-4 frames of sensor data, while the full sensor data stream to produce that result entirely on the CPU/GPU 108 may include thousands of frames of sensor data.

Figure 3 shows a block diagram of a further embodiment of the analytical system 200 wherein the functionality of the data selection module is implemented in an IC (integrated circuit) 206 which is integrated with a sensor device 202 comprising the plurality of biosensor elements 204. The sensor device 202 may also be referred to as a “sensor stack” or a “sensor chip”.

Accordingly, in this example, the input connections 214 to the data selection module are internal connections of the sensor device 202. The data selection module 206 then selectively provides sensor data from the plurality of biosensor elements to the CPU/GPU 208, via the output connections 216, for data processing. As in Figure 2, a side-channel data connection 218 is provided for the data selection module to provide unmasked sensor data to a data evaluation module implemented on the CPU/GPU 208 and receive, in return, mask information for masking the sensor data.

Similarly to Figure 2, the data evaluation module may also be implemented entirely, or in part, on the IC, wherein the side-channel connections 218 may not be required.

Figure 4 shows a block diagram of a further embodiment of the analytical system 300 in which an IC 306 is provided on the sensor device 302 and an FPGA 310 is provided for performing some or all of the functionality of the data selection module.

In this example, the sensor device 302 comprising a plurality of biosensor elements 304 provides unmasked sensor data to an IC 306 located on the sensor stack 302 via a series of internal parallel input connections 114. The functionality of the data selection module is implemented totally or in part on the IC so that masked data is provided to the FPGA via connections 312 for buffering. The sensor data is then buffered on the FPGA 310 before it is provided to the CPU/GPU 308 for post processing in chunks via output data connections 316. In this example, the data evaluation module for inferring the mask information is implemented on the FPGA 310, wherein unmasked sensor data is received from the sensor device 304 and masking information is returned to the IC 306 via the side-channel 318. As discussed in more detail below in relation to Figures 5 and 6, a data remapping function of the data selection module may be performed on the IC 306 or on the FPGA 310.

Of course, the skilled person would understand that other configurations may be possible for implementing the techniques of the present invention. For example, the data selection module may be implemented on multiple FPGAs, or on a separate IC to the sensor chip. In some examples, the data evaluation may be provided on a separate computing platform, or even on a remote server, etc.

Fig. 5 shows a block diagram of an analytical system 400 wherein the data selection module is implemented on the sensor device 402 and the data evaluation module is implemented on an FPGA 406.

The sensor device 402 comprises a sensor array 404 comprising the plurality of biosensor elements for generating sensor data and a logic array 450 for performing preprocessing on the sensor data and applying the data mask of the data selection module.

In the example shown, the logic array 450 includes a crosstalk (“XT”) filter module 452 for preprocessing the sensor data. The crosstalk filter 452 is configured to remove spatial dependencies from the sensor data owing to the physical location of the biosensor elements relative to each other in the sensor stack. This step is preferably performed prior to the application of the data mask, since this would remove the spatial relationship of the sensor data (i.e., by selectively streaming only some of the sensor data).

The data selection module 404 is formed of a data masker 440 for filtering the sensor data from the biosensor elements based on mask information from a mask inference block 408, and a data re-mapper 442 for re-ordering the masked data into expected memory locations for host memory (i.e., DRAM) of the downstream processing. When the data masking is updated during an analytical run to select different biosensor elements, the re-mapper 442 re-orders the sensor data coming from the data masker 440 to preserve the memory location of the sensor data from biosensor elements that were not updated. A RAM array 454 is provided for storing the mask information from the mask inference block 408 on the sensor device 402. In this example, wherein the data remapping 442 is performed on the sensor device 402, the RAM array is configured to store a data word (i.e., 8 bits) corresponding to each biosensor element, each data word comprising a first bit to indicate if sensor data from a particular biosensor element should be selected or not (i.e., “on” or “off”) and 7 bits containing an output address corresponding to where the sensor data from each biosensor element should be mapped to in the output array for post-processing. Therefore, in the example shown in Figure 5, a large amount of RAM is required on the sensor device 402 to store the mask information. For example, for a 100M sensor device 402 with mask information comprising 8 bits per biosensor element, 1x 1Gb DRAM 454 may be required. Accordingly, if each sensor datum from each biosensor element contains 8 bits, the bandwidth of the RAM 454, in this example, is equivalent to the bandwidth of the sensor array 404. The remapped and masked sensor data output from the remapping module 442 is provided to the FPGA 406 via a QSFP fiber connection 444. An additional QSFP fiber connection is provided as a side-channel 446 to provide unmasked sensor data, sector-by-sector, to the FPGA 406 for mask inference 408.

The masked sensor data 444 and the unmasked side-channel data 446 are provided to a strided data writer 462 which writes the sensor data to SDRAM 460 for buffering. For example, the SDRAM 460 may buffer data chunks comprising 128 frames of masked sensor data at a time.
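The 8-bit mask word stored in the RAM array 454 (one select bit plus a 7-bit output address) can be illustrated with the following sketch. The bit layout is an assumption: the text only states that a first bit carries the on/off flag, so placing it in the most significant bit, like the helper names, is a choice made for this example.

```python
SELECT_BIT = 0x80    # assumed position of the on/off flag (most significant bit)
ADDRESS_MASK = 0x7F  # assumed position of the 7-bit output address (low bits)


def pack_mask_word(selected: bool, address: int) -> int:
    """Pack the select flag and a 7-bit output address into one 8-bit word."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("output address must fit in 7 bits")
    return (SELECT_BIT if selected else 0) | address


def unpack_mask_word(word: int):
    """Return (selected, address) from an 8-bit mask word."""
    return bool(word & SELECT_BIT), word & ADDRESS_MASK


word = pack_mask_word(True, 5)
print(word, unpack_mask_word(word))  # 133 (True, 5)

# Storage needed for one 8-bit word per element on a 100 million element device.
mask_bits = 100_000_000 * 8
print(mask_bits)  # 800000000 bits, which fits within a 1 Gb memory
```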

The buffered sensor data (i.e., including unmasked sensor data from the side-channel 446 and masked sensor data from the data masker 440) undergoes further pre-processing on the FPGA 406, which in this case includes dark frame subtraction “x-df” 464 wherein a background signal value is subtracted from the sensor data.
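Dark frame subtraction can be sketched as a per-element subtraction of a background value. The text does not specify how the background is represented or how negative results are handled, so the per-element dark frame and the clamping at zero below are assumptions, as is the function name.

```python
from typing import List, Sequence


def subtract_dark_frame(frame: Sequence[int], dark: Sequence[int]) -> List[int]:
    """Subtract a per-element background value from one frame of sensor data."""
    if len(frame) != len(dark):
        raise ValueError("frame and dark frame must have the same length")
    # Assumption: results are clamped at zero so samples stay unsigned.
    return [max(sample - background, 0) for sample, background in zip(frame, dark)]


print(subtract_dark_frame([12, 7, 200], [10, 9, 50]))  # [2, 0, 150]
```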

Next, the preprocessed and masked chunks of sensor data in the SDRAM 460 are sequentially copied to host memory of the downstream data processor via PCIe connections 466. The downstream data processor can then retrieve context data associated with each biosensor element in the present chunk and perform analysis (i.e., sequencing) of the received sensor data.

The unmasked side channel data is provided to the data evaluation module for mask inference 408, which in this example is entirely implemented on the FPGA 406. The data evaluation module analyses the sensor data sector-by-sector and determines updated masking information for each sector which is then provided to sensor device 402 via connection 448 for storage in the RAM array 454.

Fig. 6 shows a block diagram of an analytical system 500 wherein the functionality of the data selection module is divided between the sensor device 502 and the FPGA 506.

In this example, the mask application is performed on the sensor device 502 as described above for Fig. 5. However, in this example the remapping step 542 of the data selection module is performed on the FPGA 506. Therefore, in this example, the data masking information, at its most basic, need only contain 1 bit of data per biosensor element to indicate whether that biosensor element is selected or not. Therefore, a large RAM array is not provided with the sensor stack 502 in this example, since the memory requirements are lower, and the masking information may be stored in on-chip memory 554. This configuration is particularly suited to examples wherein the sensor chip is a consumable since less memory, which can be large and expensive, is needed on the sensor chip.

The data masker 540 provides the sensor data from the selected biosensor elements to the FPGA 506 via 544 where it is written into the SDRAM data buffer 560. The masked sensor data is temporarily stored in SDRAM 560 along with remapping data (i.e., output addresses for the sensor data of each chunk) for use by the re-mapper 542. The mask inference block 508 is configured to receive sectors of unmasked sensor data via the side-channel connection 546 and provide updated masking information to the sensor stack 502 as described above. However, in this example, the masked sensor data coming off the sensor stack 502 may be ordered differently to before the data mask information was updated.

Therefore, a remapping step 542 of the data selection module is performed on the FPGA 506. In particular, the remapping 542 is performed on chunks of the masked sensor data after it has been retrieved from the SDRAM buffer 560 and been preprocessed, in this example by a dark frame subtractor 564. The re-mapper 542 is configured to assign the masked sensor data to an output memory location which preserves the previous memory location of sensor data which was being output before the updating of the mask information. The remapped and masked sensor data is then provided to the host DRAM via PCIe 566 for downstream processing as discussed previously.

Preferably, the generation and application of the masking information by the masker 540 and the mask inference block 508 still complies with the rules of regions and tracks as discussed above, in order to limit the buffering and routing required in the logic array 550 for implementing the data masking 540. However, the placement of the sensor data within the regions and/or tracks may move or be re-ordered as a result of the masking information updating since different biosensors may be selected for each track.

Fig. 7 shows a method of configuring a data selection module for an analytical system according to aspects of the present invention.

First, an initialization routine A comprising steps S600 to S606 is performed to determine and apply initial mask information for selectively streaming the sensor data to the downstream processor. Then, the data selection module performs a repeating update loop B for periodically updating the mask information to accommodate changes in the performance of the biosensor elements and hence the sensor data that is sought.

First, in step S600, the data selection module provides sensor data from each of the plurality of biosensor elements to the data evaluation module to determine mask information indicating which biosensor elements are productive and therefore are generating sensor data which is worth forwarding to the downstream data processor. The sensor data is provided in sectors of sensor data, wherein the size of the sectors is dependent on the data bandwidth of the data evaluation module and the side-channel which connects to the data evaluation module.

At least 512 frames of sensor data are provided to determine the mask information. However, in some examples, ~11 minutes of sensor data at 100 FPS, totaling 65,536 (64K) frames may be provided to the data evaluation module to determine the mask information. In the context of single-molecule polymerase sequencing with ZMWs [1], this is a useful amount of sensor data as it is likely to comprise several incorporation events per biosensor element which can be used to determine if the biosensor element is productive or not.
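The relationship between the frame counts and acquisition time quoted above can be checked with a short calculation. The helper name is hypothetical; the frame counts and frame rate are those given in the text.

```python
def capture_seconds(frames: int, fps: int) -> float:
    """Acquisition time in seconds for a given number of frames."""
    return frames / fps


minimum = capture_seconds(512, 100)
extended_minutes = capture_seconds(65_536, 100) / 60
print(minimum)                     # 5.12 seconds for the 512-frame minimum
print(round(extended_minutes, 1))  # 10.9 minutes, i.e. roughly 11 minutes
```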

In step S602, the data selection module receives mask information from the data evaluation module. The mask information is received sector-by-sector until mask information relating to the sensor data from each biosensor element is received.

The mask information comprises a plurality of data arrays or lists of target addresses, wherein each target address corresponds to an address in the input array of the data selection module which corresponds to the identified subset of sensor data.

In step S604, the data selection module uses the mask information to configure parallel data pipelines between the target addresses in the input array and locations in the output array.

Next, in step S606, when all of the parallel data pipelines have been configured, the data selection module begins streaming sensor data from the target addresses in the input array to the output array and to the downstream data processing module.

In some applications, wherein the mask information does not need to be updated, then the process may stop here so that a static data mask is applied to the sensor data based on only the initial sensor data. However, in applications where the behavior of the biosensor elements may change (e.g., owing to changes in the underlying biological reactions in the biosensor elements) then the process proceeds to step S608.

In step S608, the data selection module provides a sector of the sensor data to the data evaluation module via the data side-channel. Importantly, step S608 is performed while the masked sensor data is being streamed to the downstream data processor. The sector of sensor data provided to the data evaluation module is unmasked sensor data including the sensor data from the biosensor elements which are not in the previously identified subset.

The biosensor elements may be divided into a plurality of sectors, e.g., between 4 and 32 sectors, wherein the mask information is updated sector-by-sector. For example, each update sector may comprise between 1/32 and 1/4, between 1/8 and 1/32, or 1/16 of a full frame of sensor data, therefore containing sensor data from between 750,000 and 6 million+ biosensor elements. For example, a sector may comprise 1.5 million biosensor elements.

Each update sector may comprise 1024 non-contiguous regions of biosensor elements, wherein each region may comprise, for example, 1024 contiguous biosensor elements. Each region of biosensor elements comprises groups of biosensor elements for which sub-masks of the mask information are determined in order to limit the routing complexity of the masking process as discussed herein.
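One way to picture sectors built from non-contiguous regions of contiguous elements is an interleaved assignment, sketched below. The interleaving rule (region r belongs to sector r modulo the number of sectors) is purely an assumption for illustration, as are the function names; the text does not state how regions are allocated to sectors.

```python
from typing import List


def regions_in_sector(sector: int, n_sectors: int, n_regions: int) -> List[int]:
    """Region indices of one sector under an assumed interleaved assignment."""
    return list(range(sector, n_regions, n_sectors))


def element_range(region: int, region_size: int = 1024) -> range:
    """Contiguous input-array positions covered by one region."""
    start = region * region_size
    return range(start, start + region_size)


print(regions_in_sector(1, n_sectors=4, n_regions=16))  # [1, 5, 9, 13]
span = element_range(5)
print(span.start, span.stop)  # 5120 6144
```

With this assignment, neighbouring regions always fall in different sectors, so each sector is spread across the whole input array.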

In step S610, the data selection module receives updated mask information from the data evaluation module relating to the update sector of biosensor elements. The updated mask information comprises an updated list of target addresses in the data input array for streaming sensor data from.

In step S612, the data selection module reconfigures the plurality of data routing pipelines based on the updated mask information. Only the parallel data pipelines corresponding to the update sector are updated. The updating is performed from one time-step of the fixed function circuitry to the next. For example, where sensor data is clocked into a buffer of the data selection module (e.g., local Block RAM of the FPGA) from the target addresses in the data input array, then on a subsequent clock cycle sensor data is clocked into the buffer from the updated target addresses thereby updating the parallel data pipelines.

Additionally, the data selection module performs a re-mapping operation on the selected data to preserve the output memory location of the sensor data which was selected before and after the update. As described above, this facilitates the downstream data processor accessing context data associated with each biosensor element by using direct memory access wherein the context data is accessed incrementally depending on the position of the corresponding sensor data in the data output array. By preserving the data locations of the corresponding sensor data, this functionality can be performed more easily and efficiently than if all of the context data had to be remapped with every update of the mask information.

In step S614, the next update sector of the sensor data for updating is determined e.g., by incrementing a sector index number, and the process returns to step S608 to update the mask information for the new update sector.
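The update loop B (steps S608 to S614) can be summarised with the following control-flow sketch. The callback names are hypothetical stand-ins for the data evaluation module and the pipeline reconfiguration, and wrapping the sector index back to zero after the last sector is an assumption; the text only states that a sector index number is incremented.

```python
from typing import Callable, Dict, List


def run_update_loop(n_sectors: int,
                    n_updates: int,
                    evaluate: Callable[[int], List[int]],
                    reconfigure: Callable[[int, List[int]], None]) -> List[int]:
    """Run n_updates iterations of the update loop and return the sectors visited."""
    visited: List[int] = []
    sector = 0
    for _ in range(n_updates):
        updated_mask = evaluate(sector)     # S608 and S610: send sector, get new mask
        reconfigure(sector, updated_mask)   # S612: update only this sector's pipelines
        visited.append(sector)
        sector = (sector + 1) % n_sectors   # S614: next sector (wrap-around assumed)
    return visited


masks: Dict[int, List[int]] = {}


def store_mask(sector: int, mask: List[int]) -> None:
    masks[sector] = mask


order = run_update_loop(4, 6, evaluate=lambda s: [s], reconfigure=store_mask)
print(order)  # [0, 1, 2, 3, 0, 1]
```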

In the above process, when the mask information is being determined, this is done in such a way that sensor data from some locations in the input array can only be routed to a limited number of locations in the output array determined by the regions and tracks of the mask information. This advantageously reduces the routing complexity required of the data selection module. Therefore, when the data selection module is updating the parallel data pipelines the movement of sensor data locations on the data output array is minimized.

To implement this, the mask information comprises one or more data arrays comprising locations of the selected sensor data in the input array and is divided into a plurality of sub-masks, wherein each sub-mask corresponds to a region of biosensor elements. A region preferably comprises between 512 and 2048 biosensor elements. The sensor data from the biosensor elements in each region are received in the data input array in locations which are contiguous. Each sub-mask is therefore configured to enable or disable sensor data to be provided from its respective region of biosensor elements to the output array of the data selection module.

In contrast, the sectors of biosensor elements which are provided to the data evaluation module for updating the mask information preferably comprise a plurality of regions of biosensor elements which are non-contiguous in the data input array. By updating the data selection module in sectors of non-contiguous regions in this way, the routing intensity and bandwidth requirements of the data selection module for updating the mask information may be reduced. Accordingly, each update sector of the updated mask information comprises a plurality of updated sub-masks. Each sub-mask is configured to indicate a selection quantity of biosensor elements in the respective region from which identified sensor data is sought. The selection quantity is the same across each of the sub-masks such that sensor data is provided to the output array from a constant number of biosensor elements of each region.

The quantity of biosensor elements per region is further divided evenly across a plurality of tracks (e.g., 8 tracks of 128 biosensor elements per region) such that a sub-quantity of biosensor elements is indicated per track. Dividing each region into eight tracks advantageously corresponds to the memory port size of local Block RAMs (for example as provided in most FPGAs), facilitating efficient buffering of the sensor data for selecting and streaming. In this way, when the mask information is updated the array locations of the sensor data which is not updated can be usefully preserved in the data output array. Additionally, the processing of sensor data for each track can be parallelized which improves throughput in the timing-critical real-time data path.
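The constraint that each track contributes the same sub-quantity of biosensor elements can be expressed as a simple validity check on a sub-mask. In this sketch the sub-mask is a list of offsets into a 1024-element region, and track t is assumed to occupy offsets t*128 to t*128+127; that contiguous track layout and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence


def per_track_counts(offsets: Sequence[int], n_tracks: int = 8,
                     track_size: int = 128) -> List[int]:
    """Number of selected offsets falling in each track of a region."""
    counts = Counter(offset // track_size for offset in offsets)
    return [counts.get(track, 0) for track in range(n_tracks)]


def is_valid_sub_mask(offsets: Sequence[int], per_track: int = 1,
                      n_tracks: int = 8, track_size: int = 128) -> bool:
    """True if offsets are unique, in range, and evenly divided across tracks."""
    region_size = n_tracks * track_size
    if any(not 0 <= offset < region_size for offset in offsets):
        return False
    if len(set(offsets)) != len(offsets):
        return False
    counts = per_track_counts(offsets, n_tracks, track_size)
    return all(count == per_track for count in counts)


good = [3, 130, 300, 400, 600, 700, 800, 1000]  # one offset in each of 8 tracks
bad = [3, 5, 300, 400, 600, 700, 800, 1000]     # two in track 0, none in track 1
print(is_valid_sub_mask(good), is_valid_sub_mask(bad))  # True False
```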

The following Figs. 8 to 10 show diagrams illustrating the application of the mask information in further detail.

Figure 8(A) shows a diagram of a traditional transfer of sensor data from the biosensor elements to host memory (from where the downstream data processing accesses the data). In this example, 64-byte cache lines of 8-bit words of sensor data from 8x8 biosensor elements are transferred to the host memory.

Sensor data from the 8x8=64 biosensor elements may be referred to as a track and the resulting 8x(8x8b) words correspond to one 64-byte cache line read or write request. In this example, data chunks comprising 128 frames of all of the sensor data from the 64 biosensor elements are transferred in parallel to the host memory.
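The sizes in this example can be worked through directly. The sketch below uses only the figures given in the text (64 biosensor elements, 8-bit samples, 128 frames per chunk); the function names are hypothetical.

```python
def bytes_per_frame(elements: int, bits_per_sample: int) -> int:
    """Bytes needed for one frame of samples from a group of elements."""
    return elements * bits_per_sample // 8


def bytes_per_chunk(elements: int, bits_per_sample: int, frames: int) -> int:
    """Bytes needed for a chunk of several frames from the same elements."""
    return bytes_per_frame(elements, bits_per_sample) * frames


line = bytes_per_frame(64, 8)
chunk = bytes_per_chunk(64, 8, 128)
print(line, chunk)  # 64 8192
```

One frame from the 64 elements therefore fills exactly one 64-byte cache line, and a 128-frame chunk occupies 8192 bytes.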

Figure 8(B) shows a diagram of the “data mask” being applied and the sensor data being transferred to the host memory according to aspects of the present invention. In this example, data chunks from 64 biosensor elements comprising 128 frames of sensor data (8b) from each biosensor element (i.e., so that a chunk size is 8x[128x(8x8b)]) are received alongside sub-mask information (i.e., the “region mask”) corresponding to this region of biosensor elements. In this example, the sub-mask information comprises 8 bits of data for each selected biosensor element in each track for 128 frames. The sub-mask is therefore applied to the sensor data as 8 separate tracks. Accordingly, sensor data from a selected one of each of the 64 biosensor elements in each track may be selected using the sub-mask. In other examples, the masking information may be configured to select more than one biosensor element for each track.

This is represented in more detail in Fig. 9 which shows an example of mask information being used to select data from one region of biosensor elements. Fig. 9 shows one frame of sensor data from one region, the region comprising sensor data from 8 tracks of 128 biosensor elements to make 1024 biosensor elements for a region. (In this example, “ZMW” is used to represent a biosensor element wherein ZMW is a zero-mode waveguide of an optical single molecule sequencing system.)

The mask information is divided into sub-masks wherein each sub-mask corresponds to one region of biosensor elements. The mask table of Fig. 9 therefore represents one sub-mask. Each sub-mask can only be used to select sensor data from its corresponding region. In this way, sensor data from each region which is provided to the data input array can only be provided to locations in the output array which correspond to that region, thereby reducing the required routing complexity of the data selection module and reducing the complexity of memory calls for context data by the downstream processing module. As shown in the table of Fig. 9 the sub-mask information comprises target locations (i.e., offsets) for the selected sensor data in the input array. Each of the corresponding target locations is highlighted in the diagram of the region of sensor data.

Each sub-mask of the mask information is configured to indicate a selection quantity of biosensor elements from its respective region. In the example of Fig. 9 the selection quantity is 8 so that the submask is configured to indicate one biosensor element per track in the region. However, other selection quantities may be possible such as two or more biosensor elements per track.
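The per-track selection described above can be sketched as follows. This is a minimal illustration only, assuming that one frame of a region is stored track-major (track 0 first, then track 1, and so on) and that the sub-mask is held as one list of offsets per track; the function name, the constants and the storage layout are hypothetical and are not taken from Fig. 9.

```python
TRACKS_PER_REGION = 8      # from the example: 8 tracks per region
ELEMENTS_PER_TRACK = 128   # from the example: 128 biosensor elements per track


def apply_region_sub_mask(region_frame, sub_mask):
    """Select sensor data for one frame of one region.

    region_frame: list of 1024 values, assumed track-major (hypothetical layout).
    sub_mask: list of 8 lists; sub_mask[t] holds the offsets selected in track t.
    Returns a list of 8 lists with the selected values, one list per track.
    """
    if len(region_frame) != TRACKS_PER_REGION * ELEMENTS_PER_TRACK:
        raise ValueError("unexpected region size")
    if len(sub_mask) != TRACKS_PER_REGION:
        raise ValueError("sub-mask must describe every track")
    selected = []
    for track, offsets in enumerate(sub_mask):
        base = track * ELEMENTS_PER_TRACK
        selected.append([region_frame[base + offset] for offset in offsets])
    return selected


# One frame of synthetic 8-bit data and a sub-mask selecting one element per track.
region_frame = [i % 256 for i in range(1024)]
sub_mask = [[5], [17], [0], [127], [64], [3], [99], [42]]
selected = apply_region_sub_mask(region_frame, sub_mask)
```

Because each track contributes its own list, a selection quantity of two or more elements per track only changes the length of the per-track offset lists.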

The data selection module is configured to provide the sensor data of each track to a respective track location in the data output array. Preferably, the sub-mask is configured to indicate the same number of biosensor elements from each track. Accordingly, when the mask information is updated to indicate different biosensor elements in one of the tracks, the locations of the sensor data from each track in the data output array may be preserved during the updating.

Additionally, the data selection module is configured to perform remapping of the sensor data which is output from the data masking step (i.e., the region output array) to preserve memory locations of the sensor data. For example, when the selection quantity of the region sub-mask for each track is more than one, then different biosensor elements for each track may be provided to the output array. The remapping step may be configured to preserve the offset location of persisting sensor data, for example by re-ordering the selected sensor data, and replacing “vacated locations” with newly selected sensor data. The remapping involves assigning an output address to the sensor data from each selected biosensor element, the output address corresponding to the remapped location in host memory of the downstream processing module.

This preservation of data locations in the output array during updating of the data selection module may be referred to as dynamic remapping of the data selection module. The concept of dynamic remapping is discussed further in relation to Figs. 10A to 10C.

Fig. 10A shows a diagram of a data array of sensor data provided to the data output array according to the prior art. In this example, all of the input sensor data (from the biosensor elements) is provided to the data output array for processing by the downstream data processor. Accordingly, none of the data array is shaded out in Fig. 10A.

Fig. 10B shows a data array of sensor data wherein the sensor data has been filtered based on mask information, wherein the shaded regions of the data array represent “masked regions” for which the input data has not been provided to the data output array. The circled line of the data array represents an end location of the sensor data in the data array. (NB: in reality sensor data would be selected in multiples of 8 bytes or 64 bytes so that each memory buffer in the data selection module is filled and used to its full potential.)

Fig. 10C shows the data array of Fig. 10B wherein the mask information has been updated. The black region shows sensor data which is no longer provided to the data output array (i.e., it has been masked). However, the striped regions represent newly selected sensor data which is provided to the output array. The data size of the newly selected sensor data is the same as that of the newly masked region, thereby preserving the memory location of the end location of the sensor data represented by the circled line. Accordingly, by swapping the no longer selected sensor data for newly selected data in this way, the memory locations of the remaining sensor data are preserved during the update event, reducing the complexity of subsequent processing.
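The swap described for Fig. 10C can be illustrated with a short sketch. It assumes, purely for illustration, that the output array is modelled as a list of slots each holding the identifier of one selected biosensor element, that identifiers are unique, and that the updated selection has the same size as the old one; the function name and the rule of filling vacated slots in ascending identifier order are hypothetical.

```python
def remap_output_slots(old_slots, new_selection):
    """Return the slot assignment after a mask update.

    old_slots[i] is the element currently written to output slot i.
    Elements selected both before and after the update keep their slot;
    vacated slots are filled with the newly selected elements.
    """
    new_set = set(new_selection)
    if len(new_set) != len(old_slots):
        raise ValueError("updated selection must keep the same size")
    persisting = set(old_slots) & new_set
    newcomers = iter(sorted(new_set - persisting))
    return [elem if elem in persisting else next(newcomers) for elem in old_slots]


# Elements 10 and 7 are dropped; elements 3 and 55 are newly selected.
updated = remap_output_slots([10, 42, 7, 99], {42, 99, 3, 55})
```

In this example elements 42 and 99 stay in slots 1 and 3, so any downstream context indexed by slot remains valid for them across the update.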

The following figures, Figs. 11 to 21, describe the functionality of the data evaluation module for analyzing the sensor data and determining the mask information for indicating the subset of biosensor elements from which sensor data is sought.

Fig. 11 shows an example acquisition signal from an amplitude-based single-molecule real-time DNA sequencing platform. Here, the amplitude is characteristic of the base being incorporated.

Figs. 12A to 12C show examples of acquired signals from a single-loaded nanowell, a multi-loaded nanowell, and an unloaded nanowell respectively.

Fig. 13 shows an example of an acquired signal from a biosensor element and the respective smoothed signal.

Fig. 14 shows an example of a smoothed signal and the analysis performed thereon. Transitions can be detected in the smoothed signal when the smoothed trace exceeds a transition threshold within a transition window. The transition window can be defined with respect to the second derivative of the smoothed signal, in this example being bounded by a point at which the second derivative falls below a first threshold value in the positive sense (indicating that the rate of change of the amplitude is increasing) and subsequently a point at which the second derivative exceeds the first threshold value in the negative sense (indicating that the rate of change of the amplitude is decreasing). Stutters can then be identified by determining that two consecutive transitions are in the same sense (up and up, or down and down).

Fig. 15 shows an example of an acquired signal from a multi-loaded nanowell. As can be seen, the acquired signal clips beyond a threshold. In this example, the acquired signal is an amplitude and amplitudes in excess of an int8 range are clipped to 255. The total number of clipped frames can be arrived at by counting the number of frames in which the value provided from the biosensor element possesses the maximum value possible (e.g., 255). The inventors have recognized that amplitude clipping is indicative that a particular biosensor element is multiloaded, and therefore this feature can be used to discriminate between multiloaded and single or unloaded biosensor elements.
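Counting clipped frames amounts to a single pass over the trace. A minimal sketch, assuming the trace is available as a sequence of integer amplitudes and that 255 is the clipping value as in the example above; the function name is hypothetical.

```python
CLIP_VALUE = 255  # maximum amplitude value in the example above


def count_clipped_frames(trace, clip_value=CLIP_VALUE):
    """Return the number of frames whose amplitude sits at the clipping value."""
    return sum(1 for amplitude in trace if amplitude >= clip_value)


clipped = count_clipped_frames([12, 255, 255, 80, 255, 3])
```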

Fig. 16 is a plot of autocorrelation against loading. As can be seen, empty or unloaded biosensor elements are significantly more likely to have an autocorrelation approaching 0 (e.g., have an autocorrelation less than 0.2), whereas single or multi-loaded biosensor elements are significantly more likely to have an autocorrelation of around 0.7 (e.g., have an autocorrelation greater than 0.4). The inventors have recognized then that autocorrelation can be used to discriminate between unloaded and single or multiloaded biosensor elements. In some examples, autocorrelation was calculated for each block of 512 frames using the equation:

Where N is the number of time points in the series, and X is the mean of the time series. The average can be taken of multiple autocorrelation estimates in the time series.
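The equation itself is not reproduced in the text above. As an illustration only, the sketch below assumes the standard lag-k sample autocorrelation estimator (the lagged covariance about the series mean divided by the total sum of squared deviations), evaluated per 512-frame block and then averaged; the 4-frame default lag is taken from the feature list given later in this description, and the function names are hypothetical.

```python
def autocorrelation(series, lag=4):
    """Standard lag-k sample autocorrelation of one block (assumed estimator)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    if denominator == 0:
        return 0.0  # a constant block carries no correlation information
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


def mean_block_autocorrelation(trace, block=512, lag=4):
    """Average the per-block estimates over all complete blocks of the trace."""
    blocks = [trace[i:i + block] for i in range(0, len(trace) - block + 1, block)]
    if not blocks:
        raise ValueError("trace is shorter than one block")
    return sum(autocorrelation(b, lag) for b in blocks) / len(blocks)
```

A value near 0 would then point towards an unloaded element and a value well above 0.4 towards a loaded one, in line with Fig. 16.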

Moreover, the inventors have recognized that stutter rates, pulse rates, and intra-feature variance can be utilized to discriminate between single-, multi-, and unloaded biosensor elements. Stutter in this context indicates a repetition in the direction of the amplitude between frames, for example an increase or decrease in amplitude followed in the next frame by a further increase or decrease in amplitude.

In an example, the above signals were analyzed to derive the following features: total transition stutters 75th percentile; number of amplitude clipped observations from the biosensor (i.e., the number of observations where the amplitude was 255); mean pulse variance 75th percentile; number of pulses 75th percentile; and an autocorrelation value. In an example, the percentiles were calculated for each set of 512 frames for the inference period. For a 512 frame inference period, there was only one estimate of transition stutters, pulse variances, etc.

The features were combined by a trained regression tree or support vector machine, trained in one example on 12,800 biosensor elements (ZMWs) and tested on 3,200 biosensor elements. In one example, the support vector machine was a standard Epsilon-Support Vector Regression model from the libsvm library, with a radial basis function (rbf) kernel and 3rd degree polynomials. Feature values were standardized prior to use based on statistics relating to the training set. Training occurred between the 5 real-valued features and target labels, which were the number of bases produced in that ZMW that align with the appropriate reference for the library. The total yields in the table below were extrapolated from the 3,200 test set:

Fig. 17 is a plot of Gbase HiFi calls for 15 experimental runs (runs 1 - 13 being performed with 14 kB inserts and runs 14 - 15 being performed with 18 kB inserts), each run being performed on the basis of all available biosensor elements, a selected subset of the biosensor elements, and a random selection of the same size. It can be seen then that irrespective of the size of the insert being used, a significant proportion of the data which would have been collected had all of the biosensor elements been used is still available when only data from a selected subset of the biosensor elements is used. Further, it shows that actively selecting the subset of the biosensor elements from which data is to be analyzed significantly outperforms randomly selecting a subset of biosensor elements.

Fig. 18 is a plot of Gbase HiFi calls for 20 experimental runs (runs 1 - 8 having > 75% loading and runs 9 - 20 having <60% loading), each run being performed on the basis of all available biosensor elements, a selected subset of the biosensor elements, and a random selection of the same size. It can be seen then that by actively selecting the subset of biosensor elements to utilize data from (in contrast to randomly selecting the same number of biosensor elements) performance is enhanced. It also shows that where a smaller number of biosensor elements are loaded, the number of calls achieved when using all the biosensor elements and when using only the actively selected biosensor elements are very similar (and so the amount of data lost is small).

Fig. 19 is a plot of the early loading ratio for each collection, and shows that, as expected, around 1/3 of the biosensor elements were single loaded (and so producing data which can be utilized for base calling). Each line represents a collection from Figure H, where each collection has a number of biosensor elements that were empty, single, or multiloaded in the first 10 minutes (the inference period or initialization phase). The sum of the values traversed by each line is therefore equal to 1. The ratio of the three categories is known to be modellable using a Poisson distribution with various lambdas. For example, the proportion of single-loaded biosensor elements is expected never to exceed 38%. In this case, the moderately loaded chips had total loadings (single + multi) close to the mask size, and so it can be seen that the masking approach described herein performs well.
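The three loading categories can be related to a single Poisson parameter with a few lines of arithmetic. The sketch below assumes the simplest model, in which the number of target molecules per biosensor element is Poisson distributed with mean lambda; the function name is hypothetical. Under that assumption the single-loaded fraction is lambda·exp(-lambda), which peaks at lambda = 1 with a value of 1/e (about 36.8%), consistent with the bound stated above.

```python
import math


def loading_fractions(lam):
    """Return (empty, single, multi) fractions for Poisson mean `lam`."""
    empty = math.exp(-lam)
    single = lam * math.exp(-lam)
    multi = 1.0 - empty - single
    return empty, single, multi


empty, single, multi = loading_fractions(1.0)
# Scan lambda from 0.01 to 4.99 to locate the largest single-loaded fraction.
peak_single = max(loading_fractions(step / 100)[1] for step in range(1, 500))
```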

Fig. 20 shows a method of configuring a data selection module.

In a first step, S900, sensor data is received from at least a portion of the biosensor elements of the analytical system. This can be received, for example, from the data selection module or a part thereof. In other examples, the data may be received from a component downstream of the data selection module such as a GPU or CPU.

Thereafter, in step S902, the subset of biosensor elements which are producing data to be processed is identified. As discussed above, this identification includes calculating, for each biosensor element from which data has been received, one or more metrics of the type discussed above (autocorrelation, amplitude clipping, stutter rates, etc.). The calculated metrics are then provided to a machine-learning model as input features of the machine-learning model. For example, the machine-learning model may be a regression tree trained in the manner discussed below. The machine-learning model is used to predict (from the calculated metrics) an expected number of callable bases from each biosensor element. The final selection may include other steps, such as those described above, of selecting the top n biosensor elements (which may be done per region, or for the entire sensor chip).
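The per-region "top n" selection mentioned above can be sketched as follows, assuming the model output is available as one predicted score (expected callable bases) per biosensor element and that regions are contiguous runs of element indices. Both the function name and the contiguous-region assumption are illustrative only.

```python
def select_top_n_per_region(scores, region_size, n):
    """Return the indices of the n highest-scoring elements in each region.

    scores[i] is the predicted number of callable bases for element i.
    Regions are assumed to be consecutive blocks of `region_size` elements.
    """
    selected = []
    for start in range(0, len(scores), region_size):
        region = range(start, min(start + region_size, len(scores)))
        best = sorted(region, key=lambda i: scores[i], reverse=True)[:n]
        selected.extend(sorted(best))
    return selected


chosen = select_top_n_per_region([5, 90, 10, 70, 0, 3, 60, 2], region_size=4, n=2)
```

Selecting per region rather than chip-wide keeps the selection quantity fixed for every region, which matches the fixed-size sub-masks described earlier.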

Step S902 may include a pre-processing step as discussed above. In one example, a Savitzky-Golay filter is applied. The observed amplitudes (i.e., the unfiltered data from the biosensor elements) can be approximately segmented into pulse events using the following method:

1) Calculate the smoothed data within a sliding window of size 7 using the low-order polynomial (3) fitting characteristic of Savitzky-Golay;

2) Calculate the second derivative of the smoothed data within said windows; and

3) Identify pulse transitions as the centers of 5-frame windows centered on zero-crossings of the second derivative of the smoothed signal with a maximum change in smoothed signal amplitude of at least 25% of the maximal signal within the window.

These steps can be performed for each chunk of 512 frames, for example during the initialization phase.
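The three steps above can be sketched in plain Python. The coefficients are the standard tabulated Savitzky-Golay convolution weights for a 7-point window and a cubic fit (smoothing and second derivative). Everything else is an assumption made for illustration: the function names, leaving the three edge frames unfiltered, reporting a zero-crossing at the frame just before the sign change, and reading the 25% criterion as "the peak-to-peak change of the smoothed signal in the 5-frame window is at least 25% of the largest smoothed value in that window".

```python
# Tabulated Savitzky-Golay weights, 7-frame window, cubic polynomial.
SMOOTH_COEFFS, SMOOTH_NORM = (-2, 3, 6, 7, 6, 3, -2), 21
D2_COEFFS, D2_NORM = (5, 0, -3, -4, -3, 0, 5), 42
HALF = 3  # half-width of the 7-frame window


def _filter(signal, coeffs, norm, edge_fill):
    """Apply the weights to interior frames; edge frames use `edge_fill`."""
    out = [edge_fill(v) for v in signal]
    for i in range(HALF, len(signal) - HALF):
        window = signal[i - HALF:i + HALF + 1]
        out[i] = sum(c * v for c, v in zip(coeffs, window)) / norm
    return out


def find_transitions(signal, min_fraction=0.25):
    """Return (frame, direction) pairs for the detected pulse transitions."""
    smoothed = _filter(signal, SMOOTH_COEFFS, SMOOTH_NORM, float)
    second = _filter(signal, D2_COEFFS, D2_NORM, lambda v: 0.0)
    transitions = []
    for i in range(HALF, len(signal) - HALF - 1):
        if second[i] * second[i + 1] < 0:        # strict sign change
            window = smoothed[i - 2:i + 3]       # 5-frame window around i
            change = max(window) - min(window)
            if max(window) > 0 and change >= min_fraction * max(window):
                direction = "up" if window[-1] > window[0] else "down"
                transitions.append((i, direction))
    return transitions


# A single idealized pulse: baseline, 20 frames at amplitude 100, baseline.
pulse = [0] * 20 + [100] * 20 + [0] * 20
found = find_transitions(pulse)
```

For a 512-frame chunk the same function would simply be called on each chunk in turn. With SciPy available, `scipy.signal.savgol_filter` with `window_length=7` and `polyorder=3` is an alternative way to obtain the smoothed signal and its second derivative.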

Once the subset of biosensor elements from which sensor data is sought has been identified, the method moves to step S904, wherein the sensor mask information is generated indicating the identified subset of biosensor elements. This step includes generating a data structure which is interpretable by the data selection module and indicates the subset of biosensor elements from which data is sought. For example, the data structure may be an array, the values of the array indicating an address or unique identifier or label for the respective selected biosensor elements.
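One possible shape for such a data structure is sketched below. It assumes, for illustration only, that biosensor elements carry a global integer index, that regions hold 1024 consecutive elements and tracks 128 consecutive elements within a region, and that the mask is organized as region, then track, then a list of offsets. The function name and this nesting are hypothetical and are not the format used by the data selection module.

```python
def build_mask(selected_ids, region_size=1024, track_size=128):
    """Group selected element indices into {region: {track: [offsets]}}."""
    mask = {}
    for element_id in sorted(selected_ids):
        region, within_region = divmod(element_id, region_size)
        track, offset = divmod(within_region, track_size)
        mask.setdefault(region, {}).setdefault(track, []).append(offset)
    return mask


mask = build_mask([5, 130, 1024 + 300])
```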

Once this sensor mask information is generated, it is provided to the data selection module in step S906.

These steps are performed by a data evaluation module 8. As was discussed with relation to Figures 2 and 3, the data evaluation module 8 may be located on a CPU or a GPU. Alternatively, some or all of the data evaluation module 8 may be implemented in digital circuitry such as in the programmable logic of an FPGA. In some examples, the data selection module 4 and the data evaluation module 8 are both provided on the same component (e.g., on the same FPGA).

Fig. 21 shows a method of training a machine-learning model for generating composite data quality scores from the above-described metrics.

In a first step, S930, a training set is generated. In one example, the training set included data from 12,800 ZMWs from a Revio HG002. The data comprised labelled traces from each ZMW, that is, a trace of amplitude against time and an associated label indicating the number of bases called from that trace. An SVM and a regression tree were then trained on the training set. The training included identifying and extracting N features from the trace (e.g., the first 60,000 frames from each ZMW). The features extracted for the full trace include: the autocorrelation (with a 4 frame lag); the 15th percentile of the amplitudes; and the total number of amplitude clipping frames. The features extracted for 512-frame chunks of the full trace include: the 75th and 90th percentiles of the number of pulses; the 75th and 90th percentiles of the mean pulse variance; and the 75th and 90th percentiles of the number of stutters. This resulted in 9 scalar features, on which the models were trained. During development, the method included a step of testing the trained machine-learning model on data received from a further 3,200 ZMWs. The number of pulses within a 512-frame chunk can be estimated to be half the number of observed transitions within that chunk, as each pulse has an “up” and a “down” transition associated with it. The 75th and/or 95th percentile among the 512-frame chunks can be used to reduce outliers (for example during the initialization phase).

Within each 512 frame chunk of the initialization phase, “pulses” can be identified as frames between an up and down transition. The sample variance of the amplitude values for these frames can be estimated and averaged over the pulses within a 512 frame chunk. The percentiles among 512 frame chunks can be used to reduce outliers.

A stutter is defined to be transitions of the same type (up or down) in a row. These are expected when two pulses are abutted or overlap. These can be detected using the directionality of the transitions identified in the Savitzky-Golay step for each 512-frame chunk, and the percentiles can be calculated in the usual fashion.
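The per-chunk pulse and stutter counts, and the percentile taken across chunks, can be sketched as below. The sketch assumes that each chunk has already been reduced to an ordered list of transition directions ("up" or "down"), and it uses a nearest-rank percentile as one reasonable choice since the text does not specify the percentile method; all function names are hypothetical.

```python
import math


def count_stutters(directions):
    """Count consecutive transition pairs that share the same direction."""
    return sum(1 for a, b in zip(directions, directions[1:]) if a == b)


def estimate_pulse_count(directions):
    """Estimate pulses as half the observed transitions (one up, one down each)."""
    return len(directions) // 2


def percentile(values, q):
    """Nearest-rank percentile (assumed method) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


chunk = ["up", "down", "up", "up", "down", "down"]
stutters = count_stutters(chunk)
pulses = estimate_pulse_count(chunk)
p75 = percentile([3, 1, 4, 2], 75)
```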

In one example, a regression tree and a linear regressor were trained. The regressors’ mean absolute error values were found to be, for the regression tree, 47,887; and, for the linear regressor, 35,696. It is useful in this context to note that the mean number of matched bases in a biosensor element during a collection is 66,441; however, the underlying distribution is bimodal. An empty biosensor element may generate 0 - 1000 pulses over a 24 hour period, whilst a very productive biosensor element may produce around 120,000 matches on average. Therefore, if instead the misclassification of low performing biosensor elements (<10k pulses) and high performing biosensor elements (>10k pulses) is compared, the error rate is 15.8% (10,193 misclassifications among the 64,512 test set). This is despite the model only being provided with data corresponding to the first 50 base calls on average.
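The quoted error rate follows directly from the two counts, and the low/high comparison can be expressed as a thresholded check on regressor outputs. In the sketch below the 10,000 threshold and the two counts come from the paragraph above; the function name and the small example inputs are made up for illustration.

```python
THRESHOLD = 10_000  # boundary between low and high performing elements


def misclassification_rate(predicted, actual, threshold=THRESHOLD):
    """Fraction of elements whose predicted and actual classes disagree."""
    errors = sum(
        (p > threshold) != (a > threshold) for p, a in zip(predicted, actual)
    )
    return errors / len(actual)


# Reproduce the quoted figure from the stated counts.
reported_rate = 10_193 / 64_512

# Hypothetical predictions and labels, for illustration only.
example_rate = misclassification_rate(
    [500, 120_000, 9_000, 30_000], [0, 110_000, 15_000, 800]
)
```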

The method in Fig. 21, as will be understood, can be performed by any computing device which has access to the training data and has sufficient computing power. This could be the CPU/GPU in Figures 2, 3 or 4, or could be an entirely different CPU/GPU. Once the model has been trained, and so becomes a production model, it forms a part of the data evaluation module.

References

The following references are hereby incorporated by reference in their entirety.

[1] Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B. and Bibillo, A., 2009. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910), pp.133-138.

[2] Reed, B.D., Meyer, M.J., Abramzon, V., Ad, O., Ad, O., Adcock, P., Ahmad, F.R., Alppay, G., Ball, J.A., Beach, J. and Belhachemi, D., 2022. Real-time dynamic single-molecule protein sequencing on an integrated semiconductor device. Science, 378(6616), pp.186-192.

Claims

What is claimed is:

1. An analytical system comprising: a plurality of biosensor elements configured to generate sensor data, and a data selection module comprising: a data input array configured to receive sensor data from the plurality of biosensor elements; and a data output array connectable to a downstream data processing module for analyzing the sensor data; wherein the data selection module is configured to: receive sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data are sought; and selectively provide the sensor data of the identified biosensor elements from the data input array to the data output array.

2. The analytical system of claim 1 further comprising a data evaluation module configured to: receive sensor data from at least a portion of the plurality of biosensor elements, identify, from the received sensor data, a subset of the biosensor elements which are producing sensor data to be processed, generate the sensor mask information indicating the identified subset of biosensor elements from which sensor data is sought, and provide the sensor mask information to the data selection module.

3. The analytical system of claim 1 or 2 wherein the analytical system is a single molecule detection system configured to detect single molecule events associated with a target molecule present in each biosensor element.

4. The analytical system of claim 3 further configured to receive target molecules or a complex into the analytical system according to a Poisson distribution such that one or more of the biosensor elements may comprise no target molecules, at least some of the biosensor elements each comprise a single target molecule, and one or more of the biosensor elements may be multiloaded with more than one target molecule.

5. The analytical system of any preceding claim wherein the analytical system is an optical analytical system, and the sensor data comprises data derived from optical signals received from optical sensors in the plurality of biosensor elements.

6. The analytical system of claim 5 wherein the data processing module of the optical analytical system is configured to distinguish events in the biosensor elements by monitoring one or more of: color, amplitude, fluorescence lifetime, and binding kinetics of the optical signals.

7. The analytical system of claim 6, wherein the data processing module is configured to distinguish the events at least by amplitude of the optical signals.

8. The analytical system of claim 7 wherein each biosensor element of the analytical system comprises an optical confinement nanowell in which fluorescent molecules bind to the target molecule.

9. The analytical system of claim 8 wherein the analytical system is a polynucleotide sequencing system.

10. The analytical system of claim 9 wherein the events for detection by the data processing module comprise fluorescent nucleotides binding to a respective polymerase complexed to a target polynucleotide.

11. The analytical system of any one of claims 5 to 8 wherein the analytical system is a peptide sequencing system.

12. The analytical system of claim 11 wherein the data processing module is configured to analyze the sensor data to detect events of fluorescently labelled affinity reagents binding to a target peptide.

13. The analytical system of claim 12 wherein the fluorescently labelled affinity reagents bind to the target peptide in an optical confinement nanowell of each biosensor element.

14. The analytical system of claims 3 or 4 wherein the analytical system is an electrode based analytical system wherein each biosensor element comprises one or more electrodes proximal to a target molecule.

15. The analytical system of claim 14 wherein each biosensor element comprises a sensor for detecting an electrical current signal as the target molecule is positioned between two electrodes such that the sensor data from each biosensor element comprises data derived from the electrical current signal.

16. The analytical system of claims 14 or 15 wherein the analytical system is a polynucleotide sequencing system.

17. The analytical system of any preceding claim, further comprising the downstream data processing module for analyzing the sensor data.

18. A data selection module for an analytical system comprising: a data input array configured to receive sensor data from a plurality of biosensor elements of the analytical system; and a data output array connectable to a downstream data processing module for analyzing the sensor data; wherein the data selection module is for selectively providing the sensor data from the input array to the output array; wherein the data selection module is configured to: receive sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data are sought; and selectively provide the identified sensor data from the identified biosensor elements to the output array.

19. The data selection module of claim 18 wherein the data selection module is embodied in hardware in an integrated circuit.

20. The data selection module of claims 18 or 19 wherein the mask information is received from a data evaluation module, the data evaluation module being configured to determine the mask information based on the sensor data from the plurality of biosensor elements.

21. The data selection module of any of claims 18 to 20 wherein the mask information comprises a data array comprising locations of the identified sensor data in the input array, wherein the data selection module is configured to stream sensor data from the identified locations to the output array.

22. The data selection module of any preceding claim wherein the data selection module is configured to update the data selection by: receiving updated mask information indicating an updated selection of identified biosensor elements from which sensor data are sought, and reconfiguring to selectively provide the sensor data from the updated selection of biosensor elements to the output array.

23. The data selection module of claim 22 wherein the data selection module is configured to periodically update the data selection, wherein each update further comprises: providing sensor data from at least some of the plurality of biosensor elements to a data evaluation module, the data evaluation module being configured to evaluate the sensor data and determine updated sensor mask information.

24. The data selection module of claims 22 or 23 wherein the reconfiguring comprises updating a data routing configuration from the input array to the output array, wherein reconfiguring is performed in one processing time step of the data selection module.

25. The data selection module of any one of claims 22 to 24 wherein the reconfiguring comprises remapping the sensor data from the updated selection of biosensor elements by: providing sensor data from the identified biosensor elements which are selected before and after the reconfiguring to same locations in the data output array, and providing sensor data from newly identified biosensor elements in the updated selection of biosensor elements to remaining locations in the data output array which previously received sensor data from biosensors which are no longer selected in the updated selection of biosensor elements.

26. The data selection module of claim 25 wherein the remapping comprises assigning a target address for the sensor data from each selected biosensor element, the target address corresponding to a memory location of the downstream data processing module.

27. The data selection module of any one of claims 22 to 26 wherein the mask information comprises a plurality of region sub-masks, and each sub-mask corresponds to a region of the plurality of biosensor elements and comprises an indication of which biosensor elements in the respective region are identified biosensor elements from which sensor data is sought.

28. The data selection module of claim 27 wherein the sensor data from each region of biosensor elements is provided to respective region locations in the data output array, and the reconfiguring comprises providing the sensor data from each region to the same respective locations in the output array before and after the reconfiguring.

29. The data selection module of claims 27 or 28 wherein each region sub-mask indicates a configurable selection quantity of biosensor elements in the respective region from which identified sensor data is sought.

30. The data selection module of claim 29 wherein the selection quantity is the same across each of the region sub-masks such that the sensor data provided to the output array is received from the same number of biosensor elements per region.

31. The data selection module of claim 30 wherein each region sub-mask has a respective selection quantity such that the amount of biosensor elements indicated by each region sub-mask is adjustable between regions.

32. The data selection module of any one of claims 27 to 31 wherein each region comprises between 64 and 4096 biosensor elements.

33. The data selection module of any one of claims 27 to 32 wherein each region sub-mask is configured to indicate at least eight identified biosensor elements from the respective region from which sensor data is sought.

34. The data selection module of any one of claims 27 to 33 wherein the data selection module is configured to divide the sensor data from each region into a plurality of tracks of sensor data per group, and each region sub-mask is configured to indicate a fixed sub-quantity of identified biosensor elements within each track from which sensor data is sought.

35. The data selection module of claim 34 wherein the sensor data from each track is provided to respective track locations in the data output array, and the reconfiguring comprises providing the sensor data from each track to the same respective locations in the output array before and after the reconfiguring.

36. The data selection module of claim 34 or claim 35 wherein each region is divided into eight tracks, and each region sub-mask is configured to indicate at least one biosensor element from each track.

37. The data selection module of any one of claims 22 to 36 wherein the updated mask information corresponds to a portion of the plurality of biosensor elements included in an update sector, the update sector being one of a plurality of sectors of the biosensor elements such that: the updated selection of identified biosensor elements is included in the update sector and the remaining sectors of the biosensor elements are not updated.

38. The data selection module of claim 37 wherein the data selection module is configured to provide the sensor data from the plurality of biosensor elements in the update sector to a data evaluation module in order to receive the updated mask information.

39. The data selection module of claim 38 wherein the data selection module is configured to periodically update by: determining a subsequent update sector of the plurality of sectors, providing sensor data from the plurality of biosensor elements in the update sector to the data evaluation module, receiving updated mask information corresponding to the update sector from the data evaluation module, and reconfiguring to provide the sensor data from the updated selection of biosensor elements in the update sector to the output array.

40. The data selection module of any one of claims 37 to 39 wherein the reconfiguring to selectively provide sensor data from the updated selection of biosensor elements comprises reconfiguring a portion of the data selection module that corresponds to the selected update sector of biosensor elements.

41. The data selection module of any one of claims 37 to 40 wherein the data selection module is configured to continue providing the sensor data from the identified biosensor elements which are not in the update sector of biosensor elements from the input array to the output array during the reconfiguring.

42. The data selection module of any one of claims 37 to 41, as dependent on any one of claims 27 to 36, wherein each sector of the biosensor elements comprises a plurality of regions of the biosensor elements, each region corresponding to a region sub-mask of the mask information, wherein the updated mask information comprises a plurality of updated sub-masks, and each updated sub-mask indicates the updated selection of identified biosensor elements in a respective region of the update sector.

43. The data selection module of claim 42 wherein the regions of biosensor elements in each sector are located non-contiguously in the analytical system.

44. The data selection module of any one of claims 18 to 43 wherein the data selection module is configured to perform an initialization sequence which includes: receiving initial mask information relating to each of the plurality of biosensor elements, and configuring the data selection module to selectively provide the identified sensor data to the output array.

45. The data selection module of claim 44 wherein the data selection module is configured to begin selectively providing the sensor data to the data output array upon completion of the initialization sequence.

46. The data selection module of claim 45 wherein the initialization sequence comprises providing initial sensor data from each of the plurality of biosensor elements to a data evaluation module for evaluation in order to receive the initial mask information from the data evaluation module.

47. The data selection module of claim 46 wherein the data selection module is configured to provide the initial sensor data in sequential sectors of the biosensor elements.

48. The data selection module of any one of claims 18 to 47 wherein the data evaluation module is provided on a same shared resource as the downstream data processing module, wherein the data output array is connectable to the shared resource for providing the sensor data from the data selection module to the downstream processing resource, wherein the data selection module is configured to provide the initial sensor data to the shared processing resource via the data output array during the initialization sequence.

49. The data selection module of any one of claims 18 to 48 wherein the plurality of biosensor elements includes at least 1 million biosensor elements.

50. The data selection module of any one of claims 18 to 49 wherein the input array is configured to receive sensor data from the plurality of biosensor elements and provide the sensor data from the identified biosensor elements to the data output array at or between 100 and 200 frames per second (FPS), wherein a frame includes sensor data from each of the plurality of biosensor elements.

51. The data selection module of any one of claims 18 to 50 wherein the data selection module is implemented on one or more FPGAs.

52. The data selection module of any one of claims 18 to 51 wherein the data selection module is implemented on a bespoke portion of an integrated circuit, IC.

53. The data selection module of any one of claims 18 to 52 wherein the IC is included in a sensor chip comprising the plurality of biosensor elements.

54. An integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a data selection module according to any one of claims 18 to 53.

55. A non-transitory computer-readable storage medium having stored thereon a computer-readable description of an integrated circuit that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture a data selection module according to any one of claims 18 to 54.

56. A method of configuring a data selection module for an analytical system, the data selection module comprising: a data input array configured to receive sensor data from a plurality of biosensor elements of the analytical system; and a data output array connectable to a downstream data processing module for analyzing the sensor data; wherein the data selection module is for selectively providing the sensor data from the input array to the output array; the method comprising: receiving sensor mask information, the sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data is sought; and configuring the data selection module to selectively provide the sensor data from the identified subset of biosensor elements to the data output array.

57. A method of providing sensor data from a plurality of biosensor elements to a downstream data processing module for an analytical system, the method comprising: receiving sensor mask information indicating an identified subset of the plurality of biosensor elements from which sensor data is sought, and selectively providing the sensor data from the identified subset from an input array connectable to the plurality of biosensor elements to an output array connectable to a downstream data processing module for analyzing the sensor data.
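As a non-limiting illustration of the selective provision recited above, the following minimal sketch applies mask information to one frame of sensor data. The function name `select_sensor_data`, the boolean-list form of the mask, and the choice to emit (element index, value) pairs are assumptions made for the example; the claims do not prescribe any of them.

```python
from typing import List, Sequence, Tuple


def select_sensor_data(
    frame: Sequence[int], mask: Sequence[bool]
) -> List[Tuple[int, int]]:
    """Return (element index, value) pairs for the masked-in elements only."""
    if len(frame) != len(mask):
        raise ValueError("frame and mask must cover the same elements")
    # Only data from elements flagged in the mask reaches the output.
    return [
        (index, value)
        for index, (value, keep) in enumerate(zip(frame, mask))
        if keep
    ]


# Hypothetical frame from four biosensor elements; elements 0 and 2 are sought.
frame = [12, 7, 30, 4]
mask = [True, False, True, False]
selected = select_sensor_data(frame, mask)
```

Keeping the element index alongside each value is one possible way for a downstream module to know which biosensor element produced each retained sample.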

58. A non-transitory computer-readable storage medium comprising instructions that, when executed by a computing apparatus, cause the computing apparatus to carry out the method of claim 56.

59. A method of generating sensor mask information for a data selection module of an analytical system, the data selection module being for selectively providing sensor data from a plurality of biosensor elements to a downstream data processing module, the method comprising steps of:

(a) receiving sensor data from at least a portion of the plurality of biosensor elements,

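(b) identifying, from the received sensor data, a subset of the biosensor elements which are producing sensor data to be processed, and

(c) generating sensor mask information indicating the identified subset of the plurality of biosensor elements.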

60. The method of claim 59, further comprising:

(d) providing the sensor mask information to the data selection module.

61. The method of claim 59 or claim 60, wherein identifying the subset of the biosensor elements includes analyzing the received sensor data.

62. The method of claim 61, wherein analyzing the received sensor data includes calculating, for each biosensor element, one or more metrics descriptive of the properties of the received sensor data from that biosensor element, and identifying the subset of biosensor elements on the basis of the calculated metrics.

63. The method of claim 62, wherein the one or more metrics includes one or more of: data indicative of amplitude, data indicative of amplitude clipping, data indicative of an autocorrelation between frames of sensor data of a respective biosensor element, data indicative of stuttering events, data indicative of (mean) pulse variance, and a number of pulses within a time period.

64. The method of claim 62 or claim 63, wherein the identifying includes comparing the or each calculated metric to a respective benchmark value or threshold.

65. The method of claim 63 or claim 64, wherein the one or more metrics include at least one statistically based metric.

66. The method of claim 65, wherein at least one statistically based metric is an autocorrelation calculated for each biosensor element.
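For illustration only, an autocorrelation between frames for a single biosensor element could be computed along the following lines. This is a sketch under stated assumptions: the function name `lag_autocorrelation`, the use of a single lag (defaulting to one frame), and the normalization by the total variance of the trace are choices made for the example and are not taken from the disclosure.

```python
from typing import Sequence


def lag_autocorrelation(trace: Sequence[float], lag: int = 1) -> float:
    """Normalized autocorrelation of one element's frame-by-frame values."""
    n = len(trace)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(trace) - 1")
    mean = sum(trace) / n
    # Total squared deviation; zero means the trace is constant.
    denominator = sum((x - mean) ** 2 for x in trace)
    if denominator == 0:
        return 0.0
    # Sum of products of deviations separated by `lag` frames.
    numerator = sum(
        (trace[t] - mean) * (trace[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical traces: a steadily rising signal and a frame-to-frame alternation.
rising = lag_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
alternating = lag_autocorrelation([1.0, -1.0, 1.0, -1.0])
```

A trace whose consecutive frames resemble one another yields a positive value, whereas one that flips between frames yields a negative value, so the result can serve as one per-element metric among those listed above.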

67. The method of any one of claims 62 to 66, wherein the identifying includes providing the one or more calculated metrics to a scoring component, the scoring component deriving from the one or more calculated metrics a composite data quality score.

68. The method of claim 67, wherein the identifying further includes comparing each respective composite data quality score to a threshold score, and identifying the biosensor element as one for processing if the respective data quality score is greater than the threshold score.

69. The method of claim 67 or claim 68, wherein the scoring component is a pre-trained machine-learning model configured to generate the composite data quality score from the or each calculated metric.

70. The method of claim 69, wherein the scoring component is one of a regression model, a linear regression model, a regression tree model, a support vector machine, a support vector regression model, a neural network, or a K-nearest neighbors model.

71. The method of any one of claims 67 to 70, wherein the identifying includes a step of selecting the top N biosensor elements according to their respective composite data quality scores, where N is an integer smaller than the total number of biosensor elements.
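The scoring and top-N selection recited above can be sketched, purely by way of example, as follows. A fixed weighted sum is used here as a stand-in for the scoring component; it is an assumption chosen for brevity and is not the pre-trained model of the preceding claims. The names `composite_score` and `top_n_mask`, the metric names, and the weights are likewise hypothetical.

```python
import heapq
from typing import Dict, List, Sequence


def composite_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-element metrics into one score (weighted sum, an assumption)."""
    return sum(weights[name] * metrics[name] for name in weights)


def top_n_mask(scores: Sequence[float], n: int) -> List[bool]:
    """Build a mask that keeps the n highest-scoring elements."""
    if not 0 <= n < len(scores):
        raise ValueError("n must be smaller than the number of elements")
    # Indices of the n largest scores.
    best = heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])
    mask = [False] * len(scores)
    for index in best:
        mask[index] = True
    return mask


# Hypothetical metrics for one element, then scores for four elements.
weights = {"pulse_count": 1.0, "clipping": -2.0}
score = composite_score({"pulse_count": 10.0, "clipping": 2.0}, weights)
mask = top_n_mask([0.2, 0.9, 0.5, 0.7], n=2)
```

Selecting a fixed N rather than applying a threshold bounds the amount of data passed downstream, which is one reason such a step might be preferred when downstream capacity is the limiting factor.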

72. The method of any one of claims 59 to 71, further including a pre-processing step, performed before identifying the subset of biosensor elements.

73. The method of claim 72 as dependent on any of claims 61 to 71, wherein the pre-processing step is performed before analyzing the received sensor data.

74. The method of claim 72 or claim 73, wherein the pre-processing step includes applying one or more filters.

75. The method of claim 74, wherein the filter includes one or more of: a smoothing filter, a convolution filter, a digital filter, a least-squares filter, or a Savitzky-Golay filter.
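As one possible illustration of such a pre-processing filter, the sketch below applies a 5-point quadratic Savitzky-Golay smoothing filter to a single element's trace, using the standard convolution coefficients (-3, 12, 17, 12, -3)/35. The window length, the polynomial order, the function name `savgol_5_point`, and the decision to leave the two samples at each edge unfiltered are assumptions made for the example.

```python
from typing import List, Sequence

# Standard 5-point quadratic Savitzky-Golay smoothing coefficients.
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORM = 35.0


def savgol_5_point(trace: Sequence[float]) -> List[float]:
    """Smooth interior samples; the two samples at each edge are copied as-is."""
    smoothed = [float(x) for x in trace]
    for t in range(2, len(trace) - 2):
        window = trace[t - 2 : t + 3]
        smoothed[t] = sum(c * x for c, x in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed


# Hypothetical traces: a quadratic (which this filter preserves) and a lone spike.
quadratic = [float(x * x) for x in range(-3, 4)]
preserved = savgol_5_point(quadratic)
spike = savgol_5_point([0.0, 0.0, 35.0, 0.0, 0.0])
```

Because the filter fits a local quadratic, smooth pulse shapes pass through largely unchanged while isolated single-frame spikes are attenuated before any metrics are calculated.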

76. The method of any one of claims 59 to 75, wherein the received sensor data includes temporally spaced frames of received sensor data.

77. The method of any one of claims 59 to 76, wherein the received sensor data includes a series of amplitude values against time.

78. A non-transitory computer-readable storage medium comprising instructions that, when executed by a computing apparatus, cause the computing apparatus to carry out the method of any one of claims 59 to 77.

79. A data evaluation module for providing sensor mask information to a data selection module of an analytical system, the data selection module being for selectively providing sensor data from a plurality of biosensor elements to a downstream data processing module, the data evaluation module being configured to:

(a) receive sensor data from at least a portion of the biosensor elements;

(b) identify, from the received sensor data, a subset of the biosensor elements which are producing sensor data to be processed by the downstream data processing module;

(c) generate sensor mask information indicating the identified subset of the plurality of biosensor elements; and

(d) provide the sensor mask information to the data selection module.

80. An analytical system comprising: a plurality of biosensor elements configured to generate sensor data, and a data selection module according to any one of claims 18 to 58 for selectively providing the sensor data from the biosensor elements to a downstream data processing module.

81. The analytical system of claim 80 further comprising a data evaluation module configured to perform a method according to any one of claims 59 to 78.