WO2020259847A1

WO2020259847A1 - A computer implemented method for privacy preserving storage of raw genome data

Info

Publication number: WO2020259847A1
Application number: PCT/EP2019/067336
Authority: WO
Inventors: Rastislav HEKEL; Jaroslav BUDIŠ; Marcel KUCHARÍK; Tomáš SZEMES
Original assignee: Geneton SRO
Current assignee: Geneton SRO
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-12-30
Anticipated expiration: 2021-12-28

Abstract

The invention relates to the management and processing of a digitalized copy of the human genome, which falls into the field of bioinformatics. The subject-matter of the invention is a set of mutually related computer implemented methods focused on solving privacy issues related to storage and sharing of the genome - the ultimate identifier of an individual. More specifically, the invention masks personal information contained in raw genomic data, while preserving valuable properties of these data for further research. The personal information is stored in an easily manageable confidential file generated by the invention. Moreover, the whole genome or its specific regions can be unmasked with this file for later examination of the unaltered genome for a specific biological trait, which is a standard procedure in personalized medicine. The confidential file is intended to be stored on a private device or in a trusted institution and to remain in the possession of the genome owner. Masked genomic data can be stored in a public place so that it remains available to all interested parties: the owner, medical units, researchers.

Description

A computer implemented method for privacy preserving storage of raw genome data

FIELD OF THE INVENTION

The invention generally pertains to methods for storing, managing and processing a digitalized copy of the human genome, which falls into the field of bioinformatics.

BACKGROUND OF THE INVENTION

Genomic sequence is physically stored in a double helix DNA molecule, which consists of two strands, each carrying a sequence of nucleotides (A, T, G, C) called bases. Both strands are interconnected by base pairs like a ladder. Only AT and GC base pairs exist, hence one DNA strand is complementary to the other. With respect to the properties of a DNA molecule, a genome can be simply expressed as a sequence of four different letters. A whole human genome is a sequence of roughly 3.2 billion DNA bases.

Modern technologies provide efficient tools for determining the precise order of nucleotides within a DNA molecule by a laboratory process called DNA sequencing. Genome analysis starts with the collection of a biological sample (e.g. blood, saliva, etc.) from which the DNA is extracted. This DNA is multiplied and fragmented many times, creating millions of random fragments. The pool of fragments is placed on the chip of the sequencing platform, where thousands of fragments are read in parallel, creating strings of DNA bases called reads. Usually, reads are a few hundred bases long and paired-end, which means that a single fragment is read simultaneously from both ends, resulting in a pair of reads with opposite directions.

The sequencing process is not perfect - not all fragments are read and paired-end reads usually do not overlap. Moreover, a wrong base can be called instead of the real one. To denote the probability of an incorrect base call, a quality score is assigned to each base read. Base quality tends to decrease exponentially with increasing length of the read due to the nature of the sequencing technology. To overcome these shortcomings, and to reliably determine the sequence of bases, each nucleotide has to be read multiple times. The average number of reads per nucleotide during the sequencing process is called base coverage.

The output of a sequencing platform is a vast amount of randomly ordered reads with unknown direction and unknown DNA strand of origin. These short reads need to be mapped to a reference genome to provide meaningful information. The human reference genome, an artificial genome composed by scientists, is the most common sequence of bases in human DNA. The genome of each individual differs from the reference genome by around 0.5% of bases, owing to genetic variations. Each read is mapped (aligned) to the most probable region of origin on the reference genome. This process is called sequence alignment mapping and its goal is to reconstruct the original DNA sequence. Reads without a sufficiently probable match are considered unmapped. Mapping algorithms are often complex and many mapping parameters can be configured. Various algorithms and their configurations lead to different trade-offs. It is worth mentioning that the mapping process requires significant computing power, hence running it in parallel on a computing server is reasonable. The set of aligned reads is de facto a digital copy of the DNA contained within the biological sample.

Aligned reads reveal differences between a sequenced genome and the reference genome, called genomic variants. The whole set, or even a specific selection, of genomic variants is unique to each individual, hence a genome is the ultimate person identifier. Moreover, some of these variants can have a significant impact on human health and reveal sensitive information such as predisposition to a particular disease or physical trait. The most common type of variation is the Single Nucleotide Variant (SNV), a variation in a single DNA base that occurs at a specific genomic position. When this variation is common in the population (e.g. > 1%) it can be called a Single Nucleotide Polymorphism (SNP). Another common type of variant is the INDEL - an insertion or deletion of one or more bases. For perspective, each human carries roughly 4 million variants according to the dbSNP database.

A typical clinical genetic test is a query on personal genomic data. A genetic test can be conducted with the goal of selecting a suitable treatment or evaluating the risk of a particular disease. Usually, the outcome of this test is based on the presence of specific variants associated with the tested trait. This provides a great incentive to study genomic variants, which are de facto the backbone of precision medicine. Linking variants to specific traits is the subject of variant association studies. The Genome-Wide Association Study (GWAS) is a common type of variant study in which genomes of many participants with varying phenotypes are compared for a particular trait or disease. If one variant is more common in the group with an observed trait, the variant is said to be associated with it.

The more we know about genomic variants, the more we need to ensure personal genome privacy. The same information obtained from genome analysis as part of a diagnosis can, in the wrong hands, be abused against the patient. At the same time, it is crucial to keep genomic data available for further research. This conflict of interests is a motivation for the development of novel methods for secure storage and processing of genomic data. Their goal is to preserve the privacy of patients while not restricting scientific research. A biological sample collected at the beginning of the analysis contains a genome in chemical form as a DNA molecule. Until sequencing, the privacy of the genome can be preserved by standard physical precautions. However, sequencing creates a digital copy of the genome, and from then on it must be secured digitally to prevent unwanted copying, modifying and sharing.

Most variant studies require limited access to short regions of specific genes only, such as the diagnosis of a specific disorder with a known set of causal genes. In addition, some analyses do not require information about genomic variants at all. For example, chromosome aneuploidy detection needs only the number of reads aligned to individual chromosomes, in other words coverage, which is not a major subject of abuse or person identification.

The standard file formats described below are used in the field of bioinformatics. Sequencing produces raw data in a platform specific format, which is soon converted to standard text FASTQ files consisting of reads together with corresponding base qualities. A mapping algorithm takes a FASTQ file and aligns reads to the reference genome, producing a text SAM (Sequence Alignment Map) file or its binary equivalent, a Binary Alignment Map (BAM) file. Whole genome analysis can produce a BAM file with a size of more than 100 GB. Genomic variants within aligned reads are detected using variant calling tools and stored in Variant Call Format (VCF) files.

BAM files are widely used to store aligned reads generated by bioinformatic analysis of sequenced DNA fragments. A single BAM file of a patient stores millions of these alignments and each of them is typically 100 to 400 bases long, depending on the type of sequencer. If the mapping algorithm is unable to assign the position of a read within the reference genome (e.g. it can be ambiguous), the read is stored as unmapped. A record with a mapped read, simply called an alignment, contains a CIGAR string denoting the mapping operation for each aligned base. Various other properties can be stored within the record, such as base quality or mapping quality.

There is a preference to store aligned reads (i.e. BAM files) along with called variants for the following reasons:

• Algorithms for variant calling are not mature; they can have various settings and trade-offs.

• Diseases such as cancer can cause specific variations in diseased cells. These can be misclassified as sequencing errors by only looking at variant calls, while examining raw reads can reveal the true cause.

• It is impossible to know which (novel) variants are going to be proved significant in the future.

DESCRIPTION OF THE PRIOR ART

In the literature, several solutions have been suggested for privacy and security of digital health records based on de-identification and aggregation methods. However, these solutions are not applicable to personal genomic data, as the genome itself is the ultimate identifier of an individual. Akgün et al. 2015 provide a broad overview of privacy-preserving processing of genomic data and state the following major problems which need to be solved:

1. Private read alignment on public cloud. Alice wants to make a sequence alignment for her whole genome on a public cloud controlled by Bob, without revealing the genome to Bob.

2. Query on private genomic data. Alice wants to test her genome for some biological trait. The test is provided by Bob, who must query Alice's genome with publicly known markers for that trait. Alice does not want to reveal her whole genome to Bob.

3. Query on private genomic database. Alice wants to test a hypothesis using a genomic database, while Bob (responsible for the database) wants to preserve the privacy of the data owners.

4. Privacy-preserving sharing of private statistical database. GWAS produces population statistics for associations between variants and specific traits. Alice wants to query statistics relevant for her study, while Bob (responsible for the GWAS) does not want to reveal whether some individual is part of the GWAS.

Work addressing these problems is described below, grouped by the applied cryptographic method.

Secure Multiparty Computation (SMC). Sequence alignment is likely to be outsourced to public clouds due to its high computation cost. A public cloud is considered an insecure environment where private data can be obtained by an adversary. Due to this concern, a secure computation scheme must be used instead of standard alignment algorithms. SMC is the basis of some proposed solutions [Erlich and Narayanan 2014]. This method enables outsourcing most of the computation intensive read mapping without disclosing genetic information. SMC allows two or more entities to jointly compute on private data without revealing the data to each other or to a third party. The work of Karthik et al. 2017 showed how SMC can be used to securely identify causative variants in individuals between multiple parties. More precisely, they focus on variants in Mendelian patients and use SMC methods based on Yao's protocol. The computation can be run on whole genomes provided by various parties (e.g. institutions, patients) to jointly discover the causative variants, while not revealing the genomes to each other.

Homomorphic Encryption. Variant association studies require unrestricted access to large databases of genotype and phenotype data to compute reliable statistics. These data, collected from volunteering patients, are at risk of privacy breach when stored in unencrypted form. Lauter et al. 2014 propose a solution to this problem: encrypting both genotype and phenotype data with a homomorphic encryption scheme. They take several statistical algorithms commonly used in GWAS and alter them so that they can be run on the encrypted data. Homomorphic encryption allows computation directly on the encrypted data without knowing a passphrase. The work of Sousa et al. 2017 enables a user to securely store and retrieve millions of genomic variants of all types for one or multiple individuals on the cloud. Variants are encrypted with a symmetric key and can be efficiently searched without revealing anything to the cloud provider. They use homomorphic encryption and private information retrieval techniques. A novel approach proposed by Shimizu et al. 2016 combines an efficient string data structure called the positional Burrows-Wheeler transform (PBWT) with two cryptographic techniques, additive homomorphic encryption and oblivious transfer.

Differential privacy. It is impossible to publish information from a private statistical database without revealing some amount of private information, and a small number of queries can reveal the presence of some record. For example, the presence of a specific genome (with known minor allele frequencies) in some statistical dataset can be inferred by comparing it against a reference population and the population in the dataset and then evaluating the difference (with a t-test) [Homer et al. 2008]. Differential privacy addresses this problem by maximizing the accuracy of queried statistics while minimizing the information about the presence of a specific record. The solution offers a trade-off between utility (accuracy) and privacy. Unencrypted data is available only through special queries which add noise to the result of each query. Differentially private genomic databases that differ from each other by only one individual’s data have indistinguishable statistical features.

Clinical genetic testing is becoming common in personalized medicine and each conducted test introduces a risk to the genomic privacy of a patient. Since most of these tests are based on the presence of specific SNPs, the typical solution of the prior art is to extract these variants from the underlying genome and store them in an encrypted and encoded form suitable for secure analysis [Ayday et al. 2013, Sousa et al. 2017, Lauter et al. 2014]. The raw aligned genome is not further considered; at best it is encrypted and stored separately so it can be reused in the future when new SNPs are discovered.

Description of the prior art for secure storage and retrieval of aligned reads

Only one of all the examined works on the subject of genomic privacy is focused on secure storage, retrieval and processing of aligned reads stored within a SAM file. Ayday et al. 2014 propose a scheme that stores encrypted SAM files, containing genomes of patients, in a public biobank. A medical unit can request a genomic region from a biobank without revealing the scope of the request, so the biobank cannot infer the nature of the genetic test behind the request. The biobank provides only reads that include at least one base from the requested range. Bases outside of this range are masked from the medical unit while no decryption is involved. Returned encrypted reads are decrypted at the medical unit. Encryption keys are stored separately at a masking and key manager, because not all patients are capable of protecting their keys on a private device. Furthermore, when a patient controls his key, he must be involved in all operations of the medical unit related to his genome, which is not practical when conducting research. Identities of medical units or patients are not revealed to the masking and key manager.

Conventional digital security methods can protect against unauthorized use of personal data, but do not allow data owners to use dynamic consent approaches. These give data owners the ability to retain control over their private data instead of giving a single consent to share the data at the time of sampling. This aspect is considered to be a core element of modern information privacy [Erlich et al. 2014].

Nevertheless, the data should also be protected against misuse by legitimate users or data recipients who gained access to those data in a legal way under data sharing processes. Although this point is relevant as well, it is not solvable by this invention. According to several opinions, the problem of protection from such kinds of misuse cannot be efficiently solved using current technological solutions. Instead, the solution of these problems seems to rest with legislation and the awareness of society [Savage 2016].

All of the above mentioned methods completely encrypt genomic data with aim to secure personal variants. Retrieval, decryption and interpretation of encrypted data is available only through special procedures by authorized parties. Besides, some sort of consent is required when requesting the data. As a result, access to whole genomic information produced by sequencing is constrained or made unavailable for further research by scientific community. Ideally, genomic studies would have unlimited access to genomic data where all known variants are masked. Method for masking these variants in a secure and reversible way would preserve whole remaining information about the sequenced genome. This information is not considered private or individual specific, therefore it can be utilized by genomic studies unrelated to variants.

International patent application WO2009156934A2 entitled “Anonymization of genetic information in electrical patient records” by Alphons A. M. L. Bruekers et al. is focused on anonymization of Short Tandem Repeats (STR). STR is a genomic region without known biological meaning in form of short nucleotide sequence repeated multiple times. The number of repetitions in particular STR varies largely over the population, making it ideal for person identification. STR is located within a sequence and is removed or replaced with modified or different sequence. Finding the location of STR is based on knowledge of its locus within the human genome and two associated primers - unique sequences before and after the STR. The said patent application is very vague about implementation of described methods and does not show any technical solution. The subject of the present invention described herein is anonymization of SNVs and INDELs, which unlike STRs can have impact on phenotype of the carrier.

Patent application US20160048690A1 by Shigeki Tanishima and Nori Matsuda proposed a genetic data search system comprising storage and management of genes in encrypted form. A target gene is encrypted by an encryption apparatus and stored in a data centre. Moreover, an encrypted tag is generated by the encryption apparatus, embedding differential information generated by comparison of the target gene with a reference gene. A search apparatus generates a search query encrypted by embedding the differential information as a search keyword and sends the query to the data centre. The encrypted tag is identified by the data centre using the differential information specified in the search query, and the related encrypted gene is extracted and returned to the search apparatus. The present invention described herein does not encrypt individual genes, but anonymizes information within a standard genomic file, hence there is no need for a search apparatus. Moreover, neither the anonymized genomic data nor the related encrypted part has to be stored in a centralized database. Furthermore, the credential needed to restore the anonymized information can be in the possession of the patient.

In the International patent application WO2018001761A1 entitled “Disease-oriented genomic anonymization”, Daniel Pletea et al. proposed a method for anonymization of genetic data from at least one individual with respect to a particular disease. Genetic data are separated into different layers, based on how closely related the genetic data are to the genes relevant to the disease to be studied. This relationship is established based on the genome’s pathways network. Different anonymization techniques are then used for anonymizing the layers of genetic data not directly related to the disease to be studied. An anonymization technique is chosen for each layer, based on its estimated relevance. The genetic data directly related to the disease to be studied is not anonymized and can be used for genetic analysis. In contrast, the scope of anonymization by the present invention described herein does not have to be related to a particular disease only. It depends on the contents of the supplied file containing population variants, which can be related to any trait.

Ethan Huang, in the patent application US20170308717A1 entitled “Methods and systems for anonymizing genome segments and sequences and associated information”, claims any method processing genome sequences and information associated with these sequences. The method comprises: segmenting a genome sequence for the purpose of anonymization; organizing associated information; anonymized linkage records between genome segments and associated information; and non-transitory storing of one or more of these aspects. The present invention described herein deals with genomic data, without any additional information. Moreover, it does not organize these data in any novel structure, except for the personal information extracted from these data.

DESCRIPTION OF THE INVENTION

The present invention provides a method for masking known population variants within a genome, while keeping all remaining information available (coverage, quality, etc.). Moreover, a user can use another method to unmask variants within a specific region of a genome with a private key. A third method combines the two mentioned to allow the user with a private key to share unmasked regions of a genome with another user. The invention can be used to give a patient full control of the personal information contained within a genome, while keeping the genome in a public, anonymized form, which can be disseminated among researchers.

An alternative irreversible masking method is also part of the invention. It is a one-way method, meaning that personal variants cannot be restored after they have been masked. The method simply replaces sequences of personal reads with the reference sequence whenever possible. It is described in a standalone section. Both the reversible and the irreversible method are based on the common concept of masking personal variants in the reads.

The technical and scientific terms used herein are from fields of molecular biology, cryptography and bioinformatics. Some of them are described in detail below for better understanding of the invention.

Definitions

• read

In context of DNA sequencing, a read is the inferred sequence of nucleotides from either side of DNA molecule fragment.

• alignment

Alignment is a read aligned or mapped to a reference genome. Terms alignment, aligned read and mapped read are used interchangeably.

• variant

The term variant herein refers to a single difference between a genome and the reference genome. It is defined by position (on the reference), reference allele and alternative allele. In other words, it is a replacement of reference allele by alternative allele at specific genomic position.

• allele

Depending on the presence of one or more variants, the same gene can have different forms, called alleles. However, in bioinformatics, the term allele refers to a particular sequence at the position of a variant. A variant is always described by one reference allele and at least one alternative allele. A reference allele is the sequence present in the reference genome and an alternative allele is a different sequence replacing the reference allele.

• zygosity

Since a human has two sets of homologous chromosomes, it is a diploid organism, which implies it has two copies of each allele, except for the sex chromosomes. If both alleles of a diploid organism are identical, the organism is said to be homozygous for that position. If they differ, the organism is said to be heterozygous for that position.

• genotype

Set of all alleles in the genome.

• heterozygosity

See zygosity.

• homozygosity

See zygosity.

• phenotype

Set of all observable traits in the organism, such as its morphology, biochemical or physiological properties. It is a product of genotype expression influenced by the environment.

• CIGAR string

String describing relation of aligned bases to the reference. Each relation or alignment operation is denoted by one letter with number of bases in the operation. For example CIGAR string 5M2I describes alignment 7 bases long with 5 bases matching the reference and 2 inserted bases.

• symmetric key encryption

Encryption method that uses same cryptographic key both for encryption of plaintext and decryption of ciphertext. Symmetric encryption can use either stream cipher or block cipher. AES is a commonly used block cipher encryption algorithm.

• asymmetric key encryption

This method uses a pair of cryptographic keys: a public key which everyone knows and a private key known only to the owner. This approach has two wide applications:

I. digital signature - the owner of the private key signs a message and anyone with the public key can verify his signature;

II. encryption - anyone can encrypt a message with the public key so that only the owner of the private key can decrypt it.

RSA is a widely known asymmetric cryptographic algorithm.

• hash function

Function that takes arbitrary data and produces virtually unique data of fixed size, called a hash. The same data always produce the same hash and there is no feasible way to compute the original data from the hash.

• genome

The complete set of DNA sequences within an organism.

• exome

Protein-coding subset of a genome, which constitutes about 1% of the human genome.
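The CIGAR string definition above can be illustrated with a minimal Python sketch. It is not part of the described methods; the function names are hypothetical, and the sets of operations that consume read bases or reference positions follow the usual SAM convention, which is an assumption beyond what the definition states.

```python
import re

# Assumption: SAM convention for which operations consume read bases
# and which consume reference positions.
READ_CONSUMING = set("MIS=X")
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5M2I' into (length, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings with leftover characters that the pattern skipped.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in READ_CONSUMING)


def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
```

For the example from the definition, `5M2I` describes 7 read bases, of which 5 are matched against the reference and 2 are inserted, so the alignment spans 5 reference positions.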

Summary of the methods according to the invention

The invention relating to reversible anonymization provides methods of anonymization, deanonymization, and dissemination. These methods process raw genomic data in the form of mapped reads. Specifically, the anonymization method processes personal mapped reads and population allele frequencies (i.e. allele frequencies of population variants) simultaneously into anonymized mapped reads and the associated masked alleles. Notably, the number of anonymized mapped reads stays equal to the number of personal mapped reads. Masked alleles represent all differences between the original mapped reads and the anonymized mapped reads. Therefore, masked alleles contain all of the data that the anonymized mapped reads are deprived of. All masked alleles are encrypted as a single file using an asymmetric encryption scheme, so only the owner of the private key can decrypt them.

The deanonymization method is a partial reversal of the anonymization method. The file with masked alleles is decrypted with the private key of the owner and is processed simultaneously with the anonymized mapped reads into personal mapped reads. The dissemination method re-encrypts the file with masked alleles in a custom range, making it available to a specific user. Firstly, the file with masked alleles is decrypted with the owner’s private key; secondly, a subrange of masked alleles is selected; and lastly, the selected masked alleles are encrypted as a new file with the public key of the specific user. A method for irreversible anonymization of raw reads is also part of the invention. It replaces the sequence of each read by the corresponding reference sequence obtained from the reference genome. The method is elaborated after the reversible anonymization method.
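The middle step of the dissemination method, selecting a subrange of masked alleles, can be sketched as follows. This is only an illustration under stated assumptions: the decryption and re-encryption steps are omitted, masked alleles are modelled as hypothetical `(position, payload)` tuples on a single chromosome, and they are assumed to be sorted by genomic position.

```python
from bisect import bisect_left, bisect_right


def select_masked_alleles(records, start, end):
    """Return the records whose position lies in the closed range [start, end].

    `records` is a list of (position, payload) tuples sorted by position
    (a hypothetical in-memory stand-in for decrypted masked alleles).
    """
    positions = [position for position, _ in records]
    lo = bisect_left(positions, start)   # first record at or after `start`
    hi = bisect_right(positions, end)    # one past the last record at or before `end`
    return records[lo:hi]
```

Only the returned slice would then be encrypted with the recipient's public key, so the recipient can unmask the shared region and nothing else.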

A personal allele has a certain chance of being masked at each position described by population allele frequencies. A variant is masked if a personal alternative allele is replaced by the reference allele. More specifically, the allele is replaced in all mapped reads covering the position of the variant. Conversely, a novel variant is introduced if a personal reference allele is replaced by an alternative allele. Notably, rare variants of an individual which are not described by population frequencies are preserved. Although these variants are presumably not associated with a particular trait, they can be relevant for studying diseases such as cancer. Moreover, the original alignment information is also preserved, including the distribution of alignments on the reference genome, the distribution of variants at a single position and the sequence quality.

The extent of anonymization depends on the number of genomic positions described by population allele frequencies, since only alleles at these positions are being masked. Particular allele frequency expresses the incidence of the associated allele in a sampled population and is the probability of replacing a personal allele with the allele in the anonymization method. In conclusion, the number of alleles masked with the anonymization method depends on the number of genomic positions described by population allele frequencies.
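The relation between allele counts, allele frequencies and the replacement probability can be sketched in Python. The function names and the example counts are illustrative assumptions, not values from the invention; the sketch only shows a draw in which each allele is chosen with probability equal to its population frequency.

```python
import random


def allele_frequencies(counts):
    """Convert allele counts into frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alleles observed at this position")
    return {allele: count / total for allele, count in counts.items()}


def sample_masking_allele(counts, rng):
    """Draw one allele with probability equal to its population frequency."""
    alleles = list(counts)
    weights = [counts[allele] for allele in alleles]
    return rng.choices(alleles, weights=weights, k=1)[0]


# Illustrative counts for a single position (hypothetical data).
example_counts = {"A": 0, "T": 89, "G": 1, "C": 10}
rng = random.Random(2019)  # fixed seed only to make the sketch repeatable
masking_allele = sample_masking_allele(example_counts, rng)
```

An allele that was never observed in the population (count 0) is never drawn, while the most frequent allele is drawn most often.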

The concept of masking is extended by the introduction of artificial variants at random genomic positions. This way a potential adversary cannot distinguish between a real and an artificial variant when its position is described by population allele frequencies.

Detailed description of the methods according to the invention

The invention introduces two file formats, VOF and BDIFF; VOF describes population allele frequencies and BDIFF is the format of masked alleles which is used to unmask anonymized mapped reads.

Variant Occurrence Format

VOF is a compact file format storing the numbers of SNV or INDEL alleles for specific genomic positions. These numbers represent the incidence of the alleles within a certain population. The format stores two similar types of records, one for SNV alleles and the other for INDEL alleles, sorted by genomic position. Each record has four distinct fields:

• genomic position

Position of the variant within the genome.

• type of record

SNV or INDEL.

• reference index

In case of an SNV it is the index of the reference allele in the A, T, G, C list. In case of an INDEL it is the index of the listed allele.

• allele counts

Number of observed alleles in a population. In case of an SNV record, the allele counts correspond to the list of A, T, G, C alleles respectively. INDEL alleles are listed explicitly and an occurrence count is assigned to each of them.

position    type    reference index    allele counts
11042       0       2                  0, 89, 1, 10
11191       1       1                  TC: 5, TCA: 95
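The two sample records above can be decoded with a short Python sketch. Several points are assumptions rather than part of the format description: the records are parsed from their textual rendering although VOF itself is described as a compact format, the type codes 0 (SNV) and 1 (INDEL) are inferred from the sample rows, and the names `VofRecord`, `parse_vof_row` and `reference_allele` are hypothetical.

```python
from dataclasses import dataclass

SNV_BASES = ("A", "T", "G", "C")  # allele order given in the field description
SNV, INDEL = 0, 1                 # assumed type codes, inferred from the sample rows


@dataclass
class VofRecord:
    position: int
    record_type: int
    reference_index: int
    counts: dict  # allele -> number of observations, in listed order


def parse_vof_row(position, record_type, reference_index, counts_field):
    """Build a record from the textual form of one sample row."""
    if record_type == SNV:
        values = [int(value) for value in counts_field.split(",")]
        if len(values) != len(SNV_BASES):
            raise ValueError("an SNV record needs exactly four counts")
        counts = dict(zip(SNV_BASES, values))
    else:
        counts = {}
        for item in counts_field.split(","):
            allele, count = item.split(":")
            counts[allele.strip()] = int(count)
    return VofRecord(position, record_type, reference_index, counts)


def reference_allele(record):
    """Look up the reference allele through the reference index."""
    return list(record.counts)[record.reference_index]
```

Under these assumptions, the first row describes an SNV whose reference allele is G and whose most frequent allele in the population is T, and the second row describes an INDEL whose reference allele is TCA.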

BDIFF format

All the SNV and INDEL alleles replaced in personal mapped reads are stored in a BDIFF format. The BDIFF file format provides a header, for storing metadata required in masking process, and a file index, enabling fast seeking in genomic positions. BDIFF records are sorted by genomic position. A single BDIFF record stores difference between original and masked variant in four fields:

• genomic position

Position of the variant within the genome.

• type of record

SNV or INDEL.

• reference index

In case of an SNV it is the index of the reference allele in the A, T, G, C list. In case of an INDEL it is the index of the listed allele.

• allele mapping

Listed SNV alleles correspond to A, T, G, C bases respectively. INDEL alleles are listed together with an index of a target allele within the list itself.

position    type    reference index    allele mapping
11032       0       2                  G, A, T, A
11038               0                  GCG: 1, G: 0
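How a four-entry SNV allele mapping could be applied to a read is sketched below. The text does not state here whether the stored list runs from personal allele to masking allele or the reverse, so the sketch simply applies whichever mapping it is given. The function names, the plain-string read and the zero-based offset within the read are all illustrative assumptions.

```python
SNV_BASES = ("A", "T", "G", "C")  # order the listed alleles correspond to


def parse_snv_mapping(field):
    """Turn a field such as 'G, A, T, A' into a base -> base dictionary."""
    targets = [target.strip() for target in field.split(",")]
    if len(targets) != len(SNV_BASES) or any(t not in SNV_BASES for t in targets):
        raise ValueError(f"invalid SNV allele mapping: {field!r}")
    return dict(zip(SNV_BASES, targets))


def apply_snv_mapping(read_sequence, offset, mapping):
    """Return a copy of the read with the base at `offset` substituted."""
    base = read_sequence[offset]
    return read_sequence[:offset] + mapping[base] + read_sequence[offset + 1:]
```

With the SNV row above, the field `G, A, T, A` maps A to G, T to A, G to T and C to A; the substitution would be applied to every alignment covering position 11032.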

When two different personal alleles are replaced by two identical masking alleles as part of the masking process described later (masking from a heterozygous to a homozygous position), the information necessary to reverse this operation is lost. It is impossible to infer which particular alignment with a masked allele was the carrier of which personal allele from the original pair. This problem is resolved by keeping one of the replaced personal alleles as part of the BDIFF record, together with the list of identifiers of the alignments associated with this allele. The other replaced personal allele is mapped to a masking allele as usual.

In addition, BDIFF file has to keep deleted base qualities associated with replaced alleles. Base qualities are deleted only in case of INDEL position if longer allele is replaced with shorter allele. Deleted quality sequences are stored in another field of INDEL record as a list sorted by genomic position of the corresponding alignment. The motivation behind storing deleted base qualities is explained hereinafter.

Anonymization of mapped reads

Anonymization of mapped reads involves masking of alleles at genomic positions of population allele frequencies. The user must provide mapped reads representing a personal human genome, a VOF file describing frequencies of population variants, and his public key for encryption of the BDIFF file storing masked alleles. Each time a genomic position of a VOF record intersects with one or more alignments, a pseudorandom pair of alleles is generated. This pair of alleles replaces a personal pair of alleles detected within the mapped reads, regardless of whether the personal pair contains an alternative allele or not. As a consequence, masking of personal alleles can result in masking of a variant but also in introducing a new variant in anonymized mapped reads. If an alignment does not overlap any population variant position, it remains unaltered.

Masking of alleles

A single genomic position is typically covered by multiple alignments, which can carry different alleles due to heterozygosity or sequencing and alignment errors. Both personal alleles are equally likely to be represented in alignments covering the variant, albeit their mutual ratio can substantially vary as a consequence of low coverage or sequencing and alignment errors. As a result, position of a variant is described by list of alleles, where the personal alleles are the most represented ones.

Personal pair of alleles is assigned for each position of a variant described by population allele frequencies. Allele is considered as personal if it constitutes at least 20% of alignments covering the variant. If only one such allele exists, position is evaluated as homozygous and two identical alleles are assigned to the position. If two alleles with sufficient representation exist, position is considered heterozygous and two different alleles are assigned to the position. If there are more than two sufficiently represented alleles, variant position is skipped by the method.
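The 20% rule above can be illustrated with a minimal Python sketch. The function name call_personal_pair, the input shape (one observed allele per covering alignment) and the handling of a position with no sufficiently represented allele are assumptions made for illustration only; they are not taken from the described implementation.

```python
from collections import Counter


def call_personal_pair(observed_alleles, min_fraction=0.2):
    """Return the personal allele pair, or None when the position is skipped.

    observed_alleles: one allele per alignment covering the variant position
    (hypothetical input shape).
    """
    if not observed_alleles:
        return None
    counts = Counter(observed_alleles)
    total = len(observed_alleles)
    # Alleles seen in at least min_fraction of the covering alignments,
    # ordered from most to least represented.
    supported = [a for a, n in counts.most_common() if n / total >= min_fraction]
    if len(supported) == 1:
        return (supported[0], supported[0])  # homozygous position
    if len(supported) == 2:
        return (supported[0], supported[1])  # heterozygous position
    # More than two candidates: skipped, as in the text. Zero candidates is
    # also skipped here, which is an assumption of this sketch.
    return None
```

For example, nine alignments carrying A and one carrying G yield the homozygous pair (A, A), because G stays below the threshold and is treated as an error.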

Population allele frequencies defined in VOF are multiplied with each other to produce a probability matrix of every possible pair of alleles at the given genomic position. A pair of masking alleles is drawn from this probability matrix as a replacement for the pair of personal alleles assigned previously. If both personal alternative alleles are replaced by the reference allele, the variant cannot be detected in anonymized mapped reads, therefore it is masked. Conversely, if either of the personal reference alleles is replaced by an alternative allele, a new variant can be called at this position in anonymized mapped reads, thus it is introduced. All personal alleles within the alignments covering the variant are replaced by masking alleles, although personal alleles may be replaced by the same pair of masking alleles, which is a common case. Remaining alleles found within the alignments are considered to be sequencing and alignment errors and are either left unchanged or replaced by alleles other than the masking alleles.
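The construction of the probability matrix and the draw of a masking pair can be sketched as follows. This is a minimal illustration, assuming allele counts are given in the same order as the allele list; the names pair_probabilities and draw_masking_pair are hypothetical, and Python's random module stands in for whatever pseudorandom source the method actually uses.

```python
import random

BASES = ("A", "T", "G", "C")  # SNV allele order used by VOF records


def pair_probabilities(allele_counts):
    """Outer product of the allele frequency vector with itself."""
    total = sum(allele_counts)
    freqs = [count / total for count in allele_counts]
    return [[p * q for q in freqs] for p in freqs]


def draw_masking_pair(alleles, allele_counts, rng):
    """Draw one pair of masking alleles with the computed probabilities."""
    matrix = pair_probabilities(allele_counts)
    size = len(alleles)
    pairs = [(alleles[i], alleles[j]) for i in range(size) for j in range(size)]
    weights = [matrix[i][j] for i in range(size) for j in range(size)]
    return rng.choices(pairs, weights=weights, k=1)[0]


# Counts taken from the SNV example record at position 11042.
pair = draw_masking_pair(BASES, [0, 89, 1, 10], random.Random(1))
```

With these counts the allele A has frequency zero, so no drawn pair can contain it, and the pair (T, T) is drawn with probability 0.89 x 0.89.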

Both personal and masking pair can be either homozygous or heterozygous, thus one of following cases occurs:

• homozygous to homozygous

Two identical personal alleles are replaced by two identical masking alleles. Most often, two reference alleles are replaced by the same two alleles, since reference allele is the most common one in both personal mapped reads and population allele frequencies. If pairs are identical no actual masking occurs.

outcomes: masked variant, introduced variant, replaced variant, none

• heterozygous to homozygous

Two different personal alleles are replaced by two identical masking alleles. Reference and alternative allele are often replaced by two reference alleles, which results in masking of a personal variant.

outcomes: masked variant, replaced variant, none

• homozygous to heterozygous

Two identical personal alleles are replaced by two different masking alleles. If either reference allele is replaced by an alternative allele a new variant is introduced.

outcomes: introduced variant, replaced variant, none

• heterozygous to heterozygous

Two different personal alleles are replaced by another two different masking alleles. Personal and masking pair of alleles are often identical so no actual masking occurs. If one personal allele is identical to one masking allele, only the other personal allele is masked with the remaining masking allele. In this case the variant cannot be masked, since an alternative allele can be replaced only by another alternative allele.

outcomes: replaced variant, none
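The four cases and their outcome lists can be reproduced with a short Python sketch. The outcome labels are interpreted here by comparing the sets of alternative alleles before and after masking; this reading is an assumption chosen because it yields exactly the outcome lists given above, and the function names are hypothetical.

```python
def zygosity(pair):
    """Classify an allele pair as homozygous or heterozygous."""
    return "homozygous" if pair[0] == pair[1] else "heterozygous"


def masking_case(personal, masking):
    """Name the case, e.g. 'heterozygous to homozygous'."""
    return f"{zygosity(personal)} to {zygosity(masking)}"


def outcome(personal, masking, reference):
    """Label the effect of masking on variant calling (assumed interpretation)."""
    alt_before = set(personal) - {reference}
    alt_after = set(masking) - {reference}
    if alt_before == alt_after:
        return "none"  # same alternative alleles remain callable
    if not alt_after:
        return "masked variant"  # every alternative allele was removed
    if not alt_before:
        return "introduced variant"  # a variant appears where none existed
    return "replaced variant"  # the set of alternative alleles changed
```

Under this reading a heterozygous personal pair can never produce an introduced variant, and a heterozygous masking pair can never produce a masked variant, which agrees with the lists above.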

Allele of single nucleotide variant (SNV) is always one of four available DNA bases, therefore the probability matrix of SNV allele pairs has always size 4x4. Unknown base represented by letter N is always mapped to itself. In a case of insertion or deletion (INDEL), the number of available alleles depends on the number of different alleles found within alignments at position of the variant. In order to find an actual INDEL allele within particular alignment, population alleles defined by VOF record are iterated from the longest to the shortest one. In each iteration CIGAR string of the current allele is inferred from the difference between the length of reference allele and the length of alternative allele. Length of the shorter allele from the pair is considered to be a number of CIGAR match operations. The difference between the two lengths is either positive or negative, denoting the number of CIGAR insertions or the number of CIGAR deletions respectively.
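The inference of a CIGAR string from the two allele lengths can be sketched in Python as below. The function name allele_cigar is hypothetical, and the sign convention (candidate allele length minus reference allele length) is an assumption; the text only states that a positive difference denotes insertions and a negative one deletions.

```python
def allele_cigar(reference_allele, allele):
    """Infer the CIGAR string of an INDEL allele from allele lengths only."""
    # The shorter of the two alleles gives the number of match operations.
    matches = min(len(reference_allele), len(allele))
    # Assumed sign convention: candidate length minus reference length.
    difference = len(allele) - len(reference_allele)
    cigar = f"{matches}M"
    if difference > 0:
        cigar += f"{difference}I"  # allele is longer: insertion
    elif difference < 0:
        cigar += f"{-difference}D"  # allele is shorter: deletion
    return cigar


# Candidate alleles are tried from the longest to the shortest one.
candidates = sorted(["TC", "TCA"], key=len, reverse=True)
cigars = [allele_cigar("TCA", allele) for allele in candidates]
```

For the INDEL example record with reference allele TCA, the candidates TCA and TC yield 3M and 2M1D respectively.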

The computed CIGAR string is compared with the corresponding portion of the CIGAR string describing the alignment. Likewise, nucleotide sequence of the allele is compared with the corresponding subsequence of the alignment. If both sequences and CIGAR strings match, the actual allele is found and iteration is stopped. However, it is impossible to find an actual allele in any of the following cases:

• End position of allele CIGAR string exceeds the length of alignment CIGAR string.

• INDEL operation exceeds the length of allele CIGAR string.

• INDEL operation occurs right after reference allele.

When all personal alleles are detected, probability matrix of available allele pairs is computed from population allele frequencies. To replace personal allele with masking allele, the personal allele is deleted and masking one is inserted in its place. The same procedure applies for CIGAR string. Alignments with undetected personal allele remain unchanged.

If masking process alters any of the alignments covering the position of a variant, the mapping from personal alleles to masking alleles is stored in BDIFF format. After all positions of variants on the alignment, described by population allele frequencies, and all preceding alignments are treated, the alignment is stored as anonymized mapped read. When all positions of variants are processed, remaining alignments, although unchanged, are stored as anonymized mapped reads.

Masking of unmapped reads

Unmapped reads are encrypted completely using stream cipher encryption which produces a cipher with the same size as the input. At first, a secret key is randomly generated for all unmapped reads. This key is stored within the BDIFF file header. When an unmapped read is found, its template name and the secret key are hashed by SHA algorithm producing 512 bits long hash. The hash is then used to encrypt the sequence of the read. Each 2 bits of the hash are used to encrypt one DNA base, also encoded by 2 bits, from the input sequence using a simple XOR operation. Consequently, the key size is enough to uniquely encrypt sequence of 256 bases. If the sequence is longer, the key is repeated. Unknown bases, represented by letter N, are skipped in the encryption.
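The XOR scheme for unmapped reads can be sketched in Python as follows. Several details are assumptions made for illustration: SHA-512 from hashlib is used as the 512-bit hash, the secret key and the template name are simply concatenated before hashing, bases are encoded in A, T, G, C order, and a skipped N still advances the position in the key stream. The function names are hypothetical.

```python
import hashlib

BITS_TO_BASE = "ATGC"  # assumed 2-bit encoding: A=0, T=1, G=2, C=3
BASE_TO_BITS = {base: bits for bits, base in enumerate(BITS_TO_BASE)}


def keystream(template_name, secret_key):
    """Split a 512-bit hash into 256 two-bit values."""
    digest = hashlib.sha512(secret_key + template_name.encode("utf-8")).digest()
    return [(byte >> shift) & 0b11 for byte in digest for shift in (6, 4, 2, 0)]


def xor_read(sequence, template_name, secret_key):
    """Encrypt or decrypt a read sequence; XOR makes the two operations identical."""
    stream = keystream(template_name, secret_key)
    result = []
    for index, base in enumerate(sequence):
        if base == "N":
            result.append("N")  # unknown bases are skipped
            continue
        # The 256-value key stream is repeated for longer sequences.
        key_bits = stream[index % len(stream)]
        result.append(BITS_TO_BASE[BASE_TO_BITS[base] ^ key_bits])
    return "".join(result)
```

Calling xor_read a second time with the same template name and secret key restores the original sequence, and the output always has the same length as the input. The sketch expects only the letters A, T, G, C and N in the input.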

BDIFF Encryption and Storage

Checksum of the mapped reads and checksum of the VOF file are stored along anonymized mapped reads for later verification. After anonymized mapped reads are complete, their checksum is added to the header of encrypted BDIFF file.

BDIFF file header contains the exact range that the anonymization covers - effective range. It is necessary because BDIFF file does not have to contain records with genomic positions exactly at the start and the end of a specific range. By default, the effective range covers the whole genome. An owner of a BDIFF file can specify a subrange of an effective range to produce a new smaller BDIFF. This process is called dissemination and is explained later. Effective range is always greater or equal to a range defined by first and last BDIFF record. Secret key for encryption of unmapped reads and checksum of anonymized mapped reads are also stored within the BDIFF header.

BDIFF contains all of the information necessary to unmask personal alleles within anonymized mapped reads, hence it is never stored as plain text. Content of the BDIFF file is encrypted using AES encryption with randomly generated key. The AES key itself is encrypted by RSA public key, provided by user, and stored as a part of encrypted BDIFF file. In this way, an access to the personal alleles is restricted to the owner of the private key paired with the public key used for encryption. Finally, the plain BDIFF file is signed with a provided private key. The signature is stored as a part of encrypted BDIFF file and verified at the start of a decryption process using public key paired with signing private key.

Random masking

The provided VOF file and the anonymized mapped reads are considered public, therefore everybody can tell which parts of the genome were not masked. As a consequence, rare variants not covered by the VOF file can still be abused by an adversary to infer personal data. This vulnerability is mitigated by the introduction of random artificial SNV alleles into anonymized mapped reads by generating additional random VOF records before the masking process. Each generated VOF record has a random genomic position and contains allele counts representing the approximate ratio of alleles in the human genome. Generated VOF records and records read from the VOF file are iterated together and processed in the same way. As a result, a generated VOF record has a chance to mask or introduce a novel variant in the same way as a population based record. The number of new variants should be high enough to disallow attacks in-between the variants from the VOF file. On the other hand, the size of the BDIFF file and the time cost of all operations increase linearly with the number of variants.

CIGAR string and sequencing quality

Mapped reads do not contain only nucleotide sequences, but also other sensitive data, that are processed by the invention. When making modification to an alignment, CIGAR string is modified accordingly, otherwise it would be easy to guess the nature of an original alignment.

Moreover, mapped alignment typically contains sequence qualities expressing confidence in each base. While masking SNV allele does not change the length of an alignment, masking INDEL allele often does, so it is necessary to adjust the length of a sequencing quality string to match the altered alignment. If masked alignment is longer than the original one, anonymization method provides artificial qualities filling the gap. On the other hand, if the masked alignment is shorter than the original one, sequencing qualities are deleted and stored within a BDIFF file to keep the anonymization method reversible.
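The adjustment of the quality string when an allele changes length can be sketched in Python as below. This is a simplified illustration: the function name replace_allele, the use of a zero-based offset inside the read, and the constant fill character for artificial qualities are assumptions, since the text does not specify how artificial qualities are produced.

```python
def replace_allele(sequence, qualities, offset, old, new, fill="I"):
    """Replace allele `old` by `new` at `offset` and keep qualities in step.

    Returns the new sequence, the new quality string and the qualities that
    were deleted (to be stored in the BDIFF record).
    """
    if sequence[offset:offset + len(old)] != old:
        raise ValueError("allele not found at the given offset")
    new_sequence = sequence[:offset] + new + sequence[offset + len(old):]
    kept = min(len(old), len(new))
    retained = qualities[offset:offset + kept]
    # Qualities of bases that disappear when a longer allele is shortened.
    deleted = qualities[offset + kept:offset + len(old)]
    # Artificial qualities for bases that appear when an allele is lengthened.
    padding = fill * (len(new) - kept)
    new_qualities = (
        qualities[:offset] + retained + padding + qualities[offset + len(old):]
    )
    return new_sequence, new_qualities, deleted
```

Shortening GCG to G inside the read AAGCGTT returns the two removed quality characters, while lengthening G back to GCG pads the quality string with the fill character; restoring the stored qualities in place of the padding is what keeps the method reversible.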

Deanonymization of mapped reads

All variants within anonymized mapped reads, or their specific subset, can be unmasked by the BDIFF file, which contains the replaced personal alleles and deleted qualities. This operation transforms anonymized mapped reads to personal mapped reads. The user has to provide anonymized mapped reads and an associated encrypted BDIFF file along with the RSA private key whose public counterpart was used in the BDIFF encryption. Decryption of unmapped reads is handled separately and the user can choose whether to decrypt them.

First step of the deanonymization method is decryption of encrypted BDIFF file. The algorithm reads the encrypted AES key and the file signature from start of the file. The AES key is decrypted with a provided private key and then used to decrypt an actual encrypted BDIFF file. The decrypted file is verified with a public key against its signature to prove its origin.

BDIFF file contains mapping from personal alleles to masking ones for each position of difference between personal mapped reads and their anonymized version. Reversed mappings, together with additional information mentioned in description of BDIFF, are used to unmask all masked alleles.

When an effective range of unmasking is supplied by user, only alignments that are within or intersecting this range are deanonymized. Moreover, the unmasking process covers exactly this range, so boundary alignments may be only partially restored. Unmapped alignments are optionally decrypted with symmetric stream cipher encryption using secret key contained in the BDIFF header.

Dissemination of variants

The owner of the private key paired with the public key that was used to encrypt a BDIFF file can disseminate variants described by the BDIFF file and associated anonymized mapped reads by re-encrypting the BDIFF file in a desired genomic range. The BDIFF file is first decrypted by the owner's private key and then encrypted by a public key of another user, who can later decrypt the file using his private key. If a subrange of the effective range is provided for re-encryption, only records inside or intersecting this range are considered and this range becomes the effective range of the new BDIFF file.

Decrypted BDIFF file gets verified with owner's public key and the new BDIFF file is signed with his private key during the process as proof of origin.

Checksum of anonymized mapped reads is compared to the checksum stored in the encrypted BDIFF file header. This ensures that the BDIFF file belongs to the anonymized mapped reads and that they were not modified. The re-encryption process can be repeated with different combinations of genomic ranges and public keys, producing separate access rights for individual users.
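The selection of a subrange before re-encryption can be sketched in Python as below. The record shape (a tuple starting with the genomic position), the dictionary layout and the function name select_subrange are assumptions for illustration; records are treated as single positions, so the "intersecting" case for multi-base INDEL records is not modelled, and the decryption, signing and re-encryption steps are left out.

```python
from bisect import bisect_left, bisect_right


def select_subrange(records, start, end):
    """Copy the BDIFF records whose position lies in [start, end].

    records: list of (position, payload) tuples sorted by position.
    The requested range becomes the effective range of the new subset.
    """
    positions = [record[0] for record in records]
    first = bisect_left(positions, start)
    last = bisect_right(positions, end)
    return {"effective_range": (start, end), "records": records[first:last]}


records = [(11032, "snv"), (11038, "indel"), (12000, "snv")]
subset = select_subrange(records, 11035, 11999)
```

Here the subset holds only the record at position 11038, while its effective range is the full requested range, which is why the effective range can be wider than the span between the first and the last record.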

Alternative Irreversible Anonymization Method

Along with the reversible masking method, we provide another method for irreversible anonymization of genomic data in form of raw reads. The method replaces read sequences with corresponding sequences from the reference genome determined by an arbitrary mapping algorithm. The mapping to the reference genome is performed twice - the first maps personal reads and the second maps anonymized reads. Two files of mapped reads, results of the two mappings, are compared to find consistently mapped reads. The output of the method are mapped reads of which consistently mapped reads are anonymized and the remaining reads are personal. Ensuring consistent mapping is important for several sequence analyses, such as non-invasive prenatal testing and copy number variation detection.

The first mapping creates a set of personal mapped reads. If a read is successfully mapped, its sequence is replaced with the corresponding reference sequence according to mapped position and its length, otherwise the original sequence is kept. The second mapping maps anonymized reads from the first mapping, which produces a set of anonymized mapped reads. The final set of anonymized reads is created by joining mapped reads both from the personal set and the anonymized set. Both sets are iterated simultaneously and in each iteration mapped position of the personal read and the mapped position of the anonymized read are compared. If their positions match, the reads are considered to be consistently mapped. In this case the anonymized read is written to the final set of anonymized reads. On the other hand, if the pair of reads is not consistently mapped, personal read is written to the final set of anonymized reads instead.
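The replacement of read sequences and the consistency check between the two mappings can be sketched in Python as below. The mapper itself is outside the sketch: mapped positions are supplied by the caller. Reads are plain dictionaries with hypothetical keys name, pos and seq, positions are zero-based offsets into a single reference string, and alignment gaps are ignored; all of these are simplifying assumptions.

```python
def anonymize_read(read, reference):
    """Replace the sequence of a mapped read with the reference sequence."""
    if read["pos"] is None:
        return dict(read)  # unmapped read keeps its original sequence
    start = read["pos"]
    return {**read, "seq": reference[start:start + len(read["seq"])]}


def join_consistent(personal_reads, remapped_reads):
    """Keep the anonymized read only where both mappings agree on position."""
    final = []
    for personal, anonymized in zip(personal_reads, remapped_reads):
        consistent = (
            personal["pos"] is not None and personal["pos"] == anonymized["pos"]
        )
        final.append(anonymized if consistent else personal)
    return final


reference = "ACGTACGTAC"
personal = [
    {"name": "r1", "pos": 0, "seq": "ACCT"},
    {"name": "r2", "pos": 4, "seq": "ACGA"},
    {"name": "r3", "pos": None, "seq": "TTTT"},
]
anonymized = [anonymize_read(read, reference) for read in personal]
# Positions of the second mapping are made up here: r2 lands elsewhere.
remapped = [{**anonymized[0], "pos": 0}, {**anonymized[1], "pos": 0}, anonymized[2]]
final = join_consistent(personal, remapped)
```

In this toy run r1 is written in its anonymized form, whereas r2 is inconsistently mapped and r3 is unmapped, so both keep their personal sequences.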

The subject-matter of the present invention is method or set of methods as described in details above and as claimed in the attached claims.

Another subject-matter of the present invention is also alternative method of irreversible anonymization as defined in the attached claims.

Computer system

The methods of the invention described above may be implemented in the form of modules and sub-modules into a computer system comprising computing device(s), server(s) and means for mutual data communication (e.g. LAN, internet) and for data communication with another computer system(s).

The computing devices and servers can each include a processor (central processing unit, CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. Program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.

The modules and sub-modules, configured to perform one or more steps of the method(s) of the invention, can be implemented as a computer program or procedure written as source code in a conventional programming language and are presented for execution by the CPU as object or byte code. Alternatively, the modules and sub-modules can also be implemented in hardware, either as integrated circuitry or burned into read-only memory components, and then each of the computing devices and servers can act as a specialized computer. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as hard disk drive (HDD), solid state drive (SSD), flash drive, random access memory (RAM), read-only memory (ROM) and similar storage mediums. Other types of modules and module functions are possible, as well as other physical hardware components, as it is known to the persons skilled in the art.

The computer system configured for reversible anonymization of digitalized personal genome comprises anonymization module configured to execute anonymization method as disclosed herein. The computer system may further comprise deanonymization module configured to execute deanonymization method as disclosed herein. Furthermore, the computer system may comprise dissemination module configured to execute dissemination method as disclosed herein. Consequently, the subject-matter of the present invention is also a computer system configured for reversible or irreversible anonymization of digitalized personal genome in form of mapped reads as claimed in the attached claims.

Still another subject-matter of the present invention is a computer program product, as claimed in the attached claims, comprising computer-readable instructions, which, when loaded into and executed on a computer system, cause the computer system to execute operations according to the method of invention.

For better understanding of the invention, the method of the invention is further explained using an advantageous embodiment of the method in the Example 1 and the functionality of the method is evaluated in the Example 2. The attached figures will be also helpful for this purpose.

DESCRIPTION OF THE DRAWINGS

Fig. 1 shows workflow of anonymization method. BAM and VOF file are processed into Anonymized BAM and BDIFF, which is later encrypted.

Fig. 2 shows workflow of deanonymization method. Encrypted BDIFF is decrypted and with anonymized BAM is processed into personal BAM file.

Fig. 3 shows workflow of dissemination method. Encrypted BDIFF is decrypted, subrange of it is selected and encrypted with a key of specific user.

Fig. 4 (4A + 4B) shows flowchart of anonymization and deanonymization algorithm. This algorithm is represented by “anonymize” component in Fig. 1 and “deanonymize” component in Fig. 2.

Fig. 5 shows flowchart of masking single variant position in covering alignments.

Fig. 6 shows conversion process from BDIFF to encrypted BDIFF and structure of encrypted BDIFF file stored on disk.

Fig. 7 shows workflow of alternative irreversible anonymization method.

Fig. 8 shows intersections between three sets of variant positions. These sets are population positions from VCF, personal VCF and masked VCF.

Fig. 9 shows distribution of allele frequency in population VCF, personal VCF and masked VCF.

Fig. 10 shows distribution of population allele frequency by category. Categories are defined by intersections of variant positions (Fig. 8).

Fig. 11 shows ratio of masked alleles to not masked alleles and its relation to allele frequency.

EXAMPLES OF THE INVENTION

Example 1

Example of the method of the invention

Anonymization Method 101

The anonymization method 101 is schematically depicted in Fig. 1. User provides BAM file B1 representing personal genome and VCF file V1 describing population allele frequencies of variants within this genome. VCF file V1 is converted to compact VOF file V2 containing only position and allele frequency of each variant. Anonymization algorithm 301 processes both BAM B1 and VOF V2 simultaneously, generating new anonymized BAM file B2 and in-memory BDIFF file D1 online. After process 301 is finished, checksums of personal BAM B1 and VOF file V2 are written to the header of anonymized BAM B2. Similarly, checksum of anonymized BAM B2 is written to the header of BDIFF D1. In-memory BDIFF D1 is signed with user’s private key 401, encrypted with his public key 402 and stored on a drive.

After the anonymization method 101 finishes, user can delete personal BAM B1 without loss of information. Anonymized BAM B2 is deprived of personal information, stored in encrypted BDIFF D2, meaning it can be shared and stored publicly. Only the user’s private key can restore personal BAM B1 in the deanonymization method (see Fig. 2).

BDIFF Encryption and Decryption process 201, 202, 203, 204

BDIFF in its plain form D1 exists only as in-memory file as it contains sensitive personal information. It is encrypted and stored on a drive in the anonymization method 201 after its completion by the anonymization algorithm 301 and in the dissemination method 204 after subset of BDIFF D3 is copied (for the dissemination method see Fig. 3). Before the encryption, as schematically depicted in Fig. 6, header F1 of BDIFF D2 or BDIFF subset D3 is populated by an effective range F3, checksum F4 of anonymized BAM file and the secret key F5 used for encryption of unmapped alignments. First, user’s private key K1 signs the BDIFF D1 creating its signature E2 and symmetric AES key K3 encrypts the BDIFF D1 creating in-memory BDIFF cipher E3. Next, AES key K3 itself is encrypted by user’s public key K2 and is written to the new encrypted BDIFF file D2 on a hard drive together with signature E2 of BDIFF. Finally, the in-memory BDIFF cipher E3 is appended to the file D2. Encrypted BDIFF D2 stored on a drive is decrypted as a part of the deanonymization method 202 and the dissemination method 203. First, encrypted AES key E1 is decrypted with user’s private key and the BDIFF cipher E3 is decrypted with it and stored in memory. Next, signature of BDIFF E2 is verified with a public key K2 provided by user. Finally, decrypted BDIFF stored in memory is ready for further processing.

Anonymization and Deanonymization process 301, 302

The anonymization and deanonymization version of the process 301, 302 is a crucial part of the anonymization 101 and the deanonymization 103 methods respectively. In case of the anonymization version 301, the input of the process is personal BAM file B1 and VOF file V2, and in case of the deanonymization version 302, it is anonymized BAM file and BDIFF file D1. From here onwards, record of BAM file is called alignment and record of either VOF or BDIFF is called variant, because both VOF and BDIFF records describe alleles at given genomic position. The output of the anonymization version of the process 301 is BDIFF file D1 and anonymized BAM B2, whereas output of its deanonymization version 302 is personal BAM B1.

At each position of a variant the process collects all alignments that are covering the position. These alignments are stored in a queue and if an alignment that is placed after current variant is read, the queue is processed by masking or unmasking algorithm 506 depending on the called method (see Figs. 4A, 4B). Afterwards, next variant is read 507 and all alignments from the queue that are placed before it can be written 511 to BAM file. If it is end of VOF or BDIFF file, all remaining alignments from the queue are written 508 to the output BAM. Subsequently, alignments from the input BAM are read 509 one by one till the end of file, where the process stops. Each alignment is written 514 to output BAM, except if an alignment is unmapped it is encrypted 513 beforehand.

If an alignment is unmapped it is encrypted 503 as stream cipher and if the alignment queue is empty it can be written 504 to the BAM file directly, otherwise it has to be appended 512 at the end of the alignment queue. Same applies for alignment placed before current variant. Alignment that is not placed before nor after current variant covers it and is appended 510 to the alignment queue. When an alignment is written to the output BAM, or appended to the queue, next alignment is read 505 from the input BAM. In case it is end of the input BAM, all remaining alignments from the queue are written 515 to the output BAM and the process stops.
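The queue-based traversal of alignments and variants can be sketched in Python as below. This is a deliberately simplified version of the flow in Figs. 4A and 4B, not a transcription of it: alignments are dictionaries with hypothetical start and end keys, both inputs are assumed to be sorted by position, unmapped reads are left out, and the callback process stands in for the masking or unmasking step 506.

```python
from collections import deque


def sweep(alignments, variant_positions, process):
    """Call process(position, covering_alignments) for each covered variant
    and return the alignments in their original order."""
    output, queue = [], deque()
    variants = iter(variant_positions)
    current = next(variants, None)

    def flush(limit):
        # Write queued alignments that end before the next variant position.
        while queue and (limit is None or queue[0]["end"] < limit):
            output.append(queue.popleft())

    def handle_current():
        nonlocal current
        covering = [a for a in queue if a["start"] <= current <= a["end"]]
        if covering:
            process(current, covering)  # masking or unmasking step
        current = next(variants, None)
        flush(current)

    for alignment in alignments:
        # An alignment placed after the current variant triggers processing.
        while current is not None and alignment["start"] > current:
            handle_current()
        queue.append(alignment)
        flush(current)
    while current is not None:
        handle_current()
    flush(None)
    return output
```

Alignments are only written from the front of the queue, so the output keeps the input order even when a long alignment is still waiting for a later variant.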

Masking process 701, 506

The masking process 701, schematically depicted in Fig. 5, is a step 506 of the anonymization process 301. When position described by VOF record is covered with one or more alignments Q1, and next alignment is found to be after this position, masking algorithm is called to mask alleles at this position within these alignments. VOF record describes frequencies of population alleles 801 found at the position. Since human has two alleles per each genomic position, probability matrix of available allele pairs is computed 802 simply by multiplying vector of probabilities 801 with itself. Next, one pair of masking alleles 803 is randomly selected from this matrix with computed probability.

An actual pair of personal alleles 804 is inferred from the alleles found in list of alignments Q1 at the position described by VOF record. A personal allele is recognized if it is represented in at least 20% of alignments from the covering list Q1. If there are more than two such alleles, variant position is not processed by the method. If only one such allele exists, position is considered to be homozygous, meaning that alleles from the pair are identical.

A mapping 805 from the pair of personal alleles 804 to the pair of masking alleles 803 is created. Personal alleles are mapped to masking alleles within the alignments Q2 at the position of a variant. Remaining alleles found within alignments are considered as sequencing and alignment errors and are mapped to themselves or to alleles other than masking alleles. Mapping is written as a new BDIFF record D2 and altered alignments are written in respective order to the anonymized BAM B2.

In the case of mapping alleles from heterozygous pair to homozygous pair, the information about heterozygosity is preserved in BDIFF to keep the method reversible. One allele from masked pair is written to BDIFF D1 together with identifiers of associated masked alignments 806 in addition to the allele mapping 805.

Unmasking process 702, 506

In step 506 of the deanonymization process 302, the masking process 701 is replaced by the unmasking process 702, which is similar (see Fig. 5). It is applied when a position described by a BDIFF record is covered with one or more alignments Q2 and the next alignment is found to be after this position. In contrast with the masking algorithm 701, it does not include the steps needed to create the allele mapping, since the mapping is already contained in the BDIFF record. First, this mapping is read from the BDIFF record and reversed to map masking alleles to personal ones. In case a heterozygous pair of alleles was masked by a homozygous pair, one masked allele with identifiers of alignments is also part of a BDIFF record 807. Next, masked alleles within the alignment queue Q2 are mapped to personal ones and then heterozygosity is restored if needed. As a result, the alignment queue now contains personal alleles and can be written to personal BAM B1.
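The reversal of an allele mapping, including the restoration of a heterozygous pair that was masked by a homozygous one, can be sketched in Python as below. The record layout is an assumption: the mapping is a dictionary from personal to masking alleles that is one-to-one once the separately kept allele is set aside, alignments are addressed by hypothetical string identifiers, and alleles absent from the mapping are left unchanged.

```python
def unmask_position(masked_alleles, mapping, kept_allele=None, kept_ids=()):
    """Restore personal alleles at one variant position.

    masked_alleles: dict of alignment identifier -> allele in the anonymized read.
    mapping: dict of personal allele -> masking allele stored in the BDIFF record.
    kept_allele, kept_ids: personal allele stored separately, with the
    identifiers of the alignments that carried it.
    """
    reverse = {masking: personal for personal, masking in mapping.items()}
    kept = set(kept_ids)
    restored = {}
    for alignment_id, allele in masked_alleles.items():
        if kept_allele is not None and alignment_id in kept:
            restored[alignment_id] = kept_allele  # second allele of the pair
        else:
            restored[alignment_id] = reverse.get(allele, allele)
    return restored


# Personal pair (G, A) was masked by the homozygous pair (T, T).
masked = {"aln1": "T", "aln2": "T", "aln3": "T", "aln4": "C"}
personal = unmask_position(masked, {"G": "T"}, kept_allele="A", kept_ids=["aln2"])
```

In this toy position aln2 gets back the separately kept allele A, aln1 and aln3 are mapped from T back to G, and the error allele C on aln4 is untouched.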

Deanonymization method 103

Deanonymization method 103 is schematically depicted in Fig. 2. The encrypted BDIFF D2 provided by a user is decrypted 403 with his private key. Origin of BDIFF D1 is verified 404 by comparison of the actual signature and the signature stored as a part of the encrypted BDIFF D2. The actual signature is computed on the decrypted BDIFF D1 with the public key that originally encrypted the file.

In-memory decrypted BDIFF D1 together with associated anonymized BAM B2, provided by user, act as input in deanonymization algorithm 302. This algorithm is substantially same as anonymization algorithm 301, differing only by using BDIFF D1 instead of VOF file V2 and reversed masking - unmasking. The output of the method is restored personal BAM file B1. User can choose to deanonymize only part of his personal BAM file B1 by supplying specific genomic range 406. Moreover he can opt to deanonymize either unmapped reads or mapped reads.

Dissemination Method 102

User wants to share part of his genome with another user - his confidant. He provides encrypted BDIFF D2 and anonymized BAM B2, both associated with his genome, together with arbitrary genomic range of BAM he wants to share. User can also opt to deanonymize either unmapped reads or mapped reads.

The method starts with decrypting 403 the BDIFF file with user’s private key and afterwards decrypted in-memory BDIFF is verified 404 by comparison of its signature generated by user’s public key against the signature stored as a part of the encrypted BDIFF D2. Next, the checksum of the anonymized BAM associated with the BDIFF D1 and stored in its header is compared 405 against the actual checksum of the supplied anonymized BAM file B2. This comparison verifies that the BDIFF file D1 belongs to the anonymized BAM file B2 and that the anonymized BAM file B2 was not modified. A subset of BDIFF given by desired range 406, supplied by user, is copied from the BDIFF D1 as a new in-memory file D3. The anonymized BAM B2 contains index of chromosome positions needed to translate the range in form of relative positions to chromosomes into absolute genomic positions used by the BDIFF file D1.

A BDIFF subset D3 is signed with user’s private key to keep proof of origin and then encrypted with the public key of target user. The output of the method is an encrypted BDIFF subset D4, which is accessible only with private key of target user. The encrypted BDIFF subset D4 can be used in deanonymization 103 and again in dissemination 102 methods. The dissemination method 102 can be used multiple times with different combinations of genomic ranges and public keys, producing separate access rights for individual users.

Alternative Irreversible Anonymization Method

The alternative irreversible anonymization method, as alternative to the reversible method, is schematically demonstrated in Fig. 7. The method maps paired-end reads stored in pair of FASTQ files R1 twice 901, 903, with aim to unambiguously replace personal sequences of the reads with corresponding reference sequences. Bowtie2 [Langmead et al. 2009] mapping algorithm is used, although it can be substituted by an arbitrary mapping algorithm that produces alignments compliant with the SAM file specification.

The first mapping 901 aligns reads, which are simultaneously written to the SAM file S1 and to the FASTQ file pair R2 as anonymized reads. In order to anonymize a mapped read, its sequence is replaced with the corresponding reference sequence according to its mapped position and length; otherwise the original sequence is kept 902. The second mapping 903 aligns reads from the FASTQ file pair R2 created by the first mapping 901, which produces the second SAM file S2 containing anonymized reads.

The final FASTQ file pair R3 is created by joining the two SAM files S1, S2 - they are iterated simultaneously and in each iteration positions of the same pair of mapped reads are compared between the files 904. If they match, the pair of reads is considered to be consistently mapped. In this case the pair of reads from the anonymized SAM S2 is written to the output FASTQ pair R3. However, if the pair of reads is not consistently mapped, reads from the personal SAM S1 are written to the output FASTQ pair R3 instead.
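The joining step 904 can be sketched as follows. The sketch assumes that both SAM files have already been parsed into records in the same order, one record per read pair. The `MappedPair` type, its fields and the function `join_mappings` are hypothetical names introduced only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class MappedPair(NamedTuple):
    """One read pair as mapped in a SAM file (simplified, hypothetical layout)."""
    name: str
    first: Tuple[str, int]      # (reference name, position) of the first mate
    second: Tuple[str, int]     # (reference name, position) of the second mate
    sequences: Tuple[str, str]  # nucleotide sequences of both mates


def join_mappings(personal: Iterable[MappedPair],
                  anonymized: Iterable[MappedPair]) -> Iterator[MappedPair]:
    """Yield the anonymized pair when both mappings agree, else the personal pair."""
    for p, a in zip(personal, anonymized):
        if p.name != a.name:
            raise ValueError(f"records out of sync: {p.name} vs {a.name}")
        consistent = (p.first, p.second) == (a.first, a.second)
        yield a if consistent else p


# r1 maps to the same positions in both files, r2 does not.
s1 = [MappedPair("r1", ("20", 100), ("20", 300), ("ACGT", "TTGA")),
      MappedPair("r2", ("20", 500), ("20", 700), ("GGCA", "CATG"))]
s2 = [MappedPair("r1", ("20", 100), ("20", 300), ("ACGA", "TTGA")),
      MappedPair("r2", ("20", 900), ("20", 700), ("GGCC", "CATG"))]
r3 = list(join_mappings(s1, s2))
```

Here the pair r1 is taken from the anonymized file, while r2 falls back to its personal sequences because its first mate moved after anonymization.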

Example 2

Evaluation of the method of the invention

Preparation of Data

In order to evaluate the performance of the method of the invention exemplified in Example 1, it was validated on real-world data from the third phase of the 1000 genomes project [12]. For practical purposes, the method was evaluated on chromosome 20, which makes up 2.1% of the whole genome. Furthermore, only the exome was used, since the sequenced genome has low coverage, which leads to uncertainty about the presence of real alleles. The following files were downloaded from the official site of the project:

• reference genome FASTA [13]

Modified GRCh37 reference genome used in phase 3 of 1000 genomes project. 3.0 GB

• personal BAM [14]

Exome aligned to chromosome 20 of a single individual containing 2 393 934 alignments. 224 MB

• population VCF [15]

Aggregated variants called on chromosome 20 of all individuals that participated in 1000 genomes project. Each record describes frequency of all alleles found at the site. 326 MB

In the first step, population VCF V1 was converted to VOF format V2 described and utilised by the invention. Personal BAM B1 was used to infer genomic positions of VCF records and reference genome FASTA file was used to verify reference alleles in the VCF. The VOF file V2 has 23 MB - it is considerably smaller than the original VCF V1.

Total of 1 744 566 (96.23%) SNVs and 64 103 (3.54%) INDELs were found by the method. Remaining VCF records were neither in form of SNV nor INDEL or had repeated genomic position.

Population VOF and personal BAM files were used as input to the anonymization method 101 of the invention. INDELs and random positions were not masked in the validation process in order to provide more concrete results, which are easier to interpret. The output of the masking method, BDIFF D1 and anonymized BAM B2, was used as input to the unmasking method. The resulting unmasked personal BAM B1 was compared to the original personal BAM B1 in order to verify the masking and unmasking process. DNA sequences and CIGAR strings were found to be equal, hence the invention succeeded in restoring the information in the original file.

Anonymization Method

Performance of the anonymization method 101 was evaluated by comparison between called variants on personal BAM B1, called variants on its anonymized version and population variants. Firstly, variant positions were compared and as a result subset of population variants with identical positions was found. Later, alleles of each of these variants were compared to find any differences.

Variants were called by tool Vardict [Lai et al. 2016] on exome of the chromosome 20 on both personal BAM B1 and anonymized BAM B2, producing two VCF files. Genomic positions from “personal VCF”, “masked VCF” and “population VCF” were extracted and stored as three sets deprived of duplicates. Intersections between these sets (Fig. 8) are described in detail below:

• not found

Vast majority of variant positions in “population VCF” is not found in “personal VCF”. This is expected as the “population VCF” is called on thousands of personal genomes, while “personal VCF” is called only on one of them.

• masked

Truly masked variant positions by the invention. Associated variants are found in personal BAM B1, but not in anonymized BAM B2. This occurs when alternative allele at homozygous position is masked by the reference one and homozygosity is preserved.

• not masked

Variant positions that the invention did not mask. However, alleles of variants associated with these positions can be still masked, therefore they are analysed separately.

• introduced

Alleles at every position of personal BAM B1 described by “population VCF” have chance to be replaced, including the reference ones. When reference allele at homozygous position is replaced by alternative one, a new variant is found by variant calling.

• not covered

Set of personal variant positions not covered by “population VCF”. It is expected to be empty, since in reality, variants of “personal VCF” are subset of “population VCF” (“personal VCF” is based on BAM file that is part of the study behind “population VCF”). Nevertheless, different variant positions were found due to different calling process.
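The five categories above can be expressed as set operations on the three position sets. The following sketch is one possible reading of the category definitions, not the evaluation code itself. The function name `categorize_positions` is hypothetical, and excluding introduced positions from the not found category is an assumption made here so that the categories do not overlap.

```python
def categorize_positions(personal_pos, masked_pos, population_pos):
    """Split variant positions into the categories described for Fig. 8.

    Each argument is a set of genomic positions extracted from the
    personal, masked and population VCF respectively.
    """
    return {
        # population positions seen in neither of the two called VCFs
        "not found": population_pos - personal_pos - masked_pos,
        # called on the personal BAM, no longer called on the anonymized BAM
        "masked": (personal_pos & population_pos) - masked_pos,
        # called on both the personal and the anonymized BAM
        "not masked": personal_pos & masked_pos & population_pos,
        # called only on the anonymized BAM
        "introduced": (masked_pos & population_pos) - personal_pos,
        # personal positions missing from the population VCF
        "not covered": personal_pos - population_pos,
    }


categories = categorize_positions(
    personal_pos={1, 2, 3, 7},
    masked_pos={2, 3, 4},
    population_pos={1, 2, 3, 4, 5, 6},
)
```

With these toy sets, position 1 is masked, positions 2 and 3 are not masked, position 4 is introduced, positions 5 and 6 are not found and position 7 is not covered.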

Distributions of alternative allele frequencies per VCF were compared to show their nature (Fig. 9). The population VCF contains a vast number of alleles with frequency below one percent. Nevertheless, they have little chance to be introduced by the masking process into the masked VCF, even though every variant covered by the personal BAM is considered. On the other hand, allele frequency in the personal VCF has an expected ratio of 0.5 for a heterozygote and 1.0 for a homozygote, but actual ratios can vary considerably due to possible low coverage and sequencing errors. Variants with frequency below 20% are not shown, because this value was set as the threshold to distinguish between an error and an actual allele - the same threshold as in the masking method. The masked VCF preserves the allele frequency distribution of the personal VCF to a considerable extent.
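The role of the 20% threshold can be illustrated with a short sketch that infers a pair of alleles from the bases observed at one position in the covering reads. The function `infer_allele_pair` and its return convention are hypothetical. Only the threshold value and the homozygous / heterozygous distinction come from the text.

```python
from collections import Counter


def infer_allele_pair(observed_alleles, threshold=0.2):
    """Infer a pair of alleles from alleles observed in covering reads.

    Alleles whose fraction is below the threshold are treated as
    sequencing errors. Returns None when the position is not covered or
    when the number of significant alleles is neither one nor two.
    """
    counts = Counter(observed_alleles)
    total = sum(counts.values())
    if total == 0:
        return None
    significant = [a for a, c in counts.items() if c / total >= threshold]
    if len(significant) == 1:
        return (significant[0], significant[0])  # homozygous
    if len(significant) == 2:
        return tuple(sorted(significant))        # heterozygous
    return None


homozygous = infer_allele_pair("AAAAAAAAAG")    # G at 10% is treated as an error
heterozygous = infer_allele_pair("AAAAAGGGGG")  # both alleles at 50%
```

The first call yields a homozygous pair of A alleles, because the single G stays below the threshold, while the second call yields a heterozygous A/G pair.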

Furthermore, distributions of alternative allele frequency in the population were compared per category of masked, introduced and not masked variant positions (Fig. 10). The number of variants with masked and introduced positions increases with decreasing frequency of an allele (Fig. 11). The rarer a variant is, or in other words the more specific it is for the individual, the higher the chance that it is masked or introduced by the method. Variants common in the population have a low chance to be masked or introduced; however, they are specific for the population and not for an individual.

Alleles of variants with positions in the not masked set were compared next. First, variants with positions in the not masked set were retrieved from both the personal and the masked VCF. The two obtained lists of variants were deprived of variants with duplicated positions and joined together based on the position. Alternative allele frequency, reference allele and alternative allele were compared between the two lists. Only the alternative allele frequency was found to differ - out of a total of 476 variants, 160 (33.61%) had a different frequency. The change of frequency of alternative alleles of these variants was caused by the masking method changing a homozygous pair of alleles to a heterozygous pair or vice versa.
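The comparison of the two variant lists can be sketched as a join on position. The record layout (dictionaries with `pos`, `ref`, `alt` and `af` keys) and the function names are hypothetical. The sketch also assumes that every variant sharing a position with another variant in the same list is dropped, which is one possible reading of the deduplication step.

```python
from collections import Counter

FIELDS = ("ref", "alt", "af")


def unique_by_position(variants):
    """Index variants by position, dropping all variants with a duplicated position."""
    counts = Counter(v["pos"] for v in variants)
    return {v["pos"]: v for v in variants if counts[v["pos"]] == 1}


def compare_variants(personal, masked):
    """Join two variant lists on position and list positions that differ per field."""
    p = unique_by_position(personal)
    m = unique_by_position(masked)
    shared = sorted(p.keys() & m.keys())
    differing = {f: [pos for pos in shared if p[pos][f] != m[pos][f]]
                 for f in FIELDS}
    return len(shared), differing


personal = [
    {"pos": 1, "ref": "A", "alt": "G", "af": 1.0},
    {"pos": 2, "ref": "C", "alt": "T", "af": 0.5},
    {"pos": 3, "ref": "G", "alt": "A", "af": 0.5},
    {"pos": 3, "ref": "G", "alt": "C", "af": 0.5},
]
masked = [
    {"pos": 1, "ref": "A", "alt": "G", "af": 0.5},
    {"pos": 2, "ref": "C", "alt": "T", "af": 0.5},
    {"pos": 3, "ref": "G", "alt": "A", "af": 0.5},
]
total, differing = compare_variants(personal, masked)
```

In this toy input, position 3 is dropped because it is duplicated in the personal list, and only position 1 differs, in allele frequency alone (1.0 against 0.5, a homozygous pair turned heterozygous).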

REFERENCES

Non-patent literature

1. Akgün, Mete, et al. "Privacy preserving processing of genomic data: A survey." Journal of biomedical informatics 56 (2015): 103-111.

2. Erlich, Yaniv, and Arvind Narayanan. "Routes for breaching and protecting genetic privacy." Nature Reviews Genetics 15.6 (2014): 409-421.

3. Jagadeesh, Karthik A., et al. "Deriving genomic diagnoses without revealing patient genomes." Science 357.6352 (2017): 692-695.

4. Lauter, Kristin, Adriana Lopez-Alt, and Michael Naehrig. "Private computation on encrypted genomic data." International Conference on Cryptology and Information Security in Latin America. Springer, Cham, 2014.

5. Sousa, Joao Sa, et al. "Efficient and secure outsourcing of genomic data storage." BMC medical genomics 10.2 (2017): 46.

6. Shimizu, Kana, Koji Nuida, and Gunnar Ratsch. "Efficient privacy-preserving string search and an application in genomics." Bioinformatics 32.11 (2016): 1652-1661.

7. Homer, Nils, et al. "Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays." PLoS genetics 4.8 (2008): e1000167.

8. Ayday, Erman, et al. "The chills and thrills of whole genome sequencing." Computer (2013).

9. Ayday, Erman, et al. "Privacy-preserving processing of raw genomic data." Data Privacy Management and Autonomous Spontaneous Security. Springer, Berlin, Heidelberg, 2014. 133-147.

10. Erlich, Yaniv, et al. "Redefining genomic privacy: trust and empowerment." PLoS biology 12.11 (2014): e1001983.

11. Savage, Neil. "The myth of anonymity." Nature 537 (2016): S70-S72.

12. 1000 Genomes Project Consortium. "An integrated map of genetic variation from 1,092 human genomes." Nature 491.7422 (2012): 56.

13. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

14. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00101/exome_alignment/HG00101.chrom20.ILLUMINA.bwa.GBR.exome.20121211.bam

15. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

16. Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome biology 10.3 (2009): R25.

17. Lai, Zhongwu, et al. "VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research." Nucleic acids research 44.11 (2016): e108-e108.

Patent literature

18. WO2009156934A2

19. US20160048690A1

20. WO2018001761A1

21. US20170308717A1

Claims

1. A computer implemented method for reversible anonymization (101) of digitalized genomic sequence in form of reads comprising the steps:

a) mapping genomic reads obtained by DNA sequencing to a reference genome producing file (B1) comprising mapped reads;

b) providing a file (V1) comprising population variants and converting it into file (V2) comprising position and mapping of alleles to frequencies for each variant;

c) anonymizing (301) the mapped reads (B1) by replacing alternative allele with reference allele using allele frequency file (V2), thus masking personal variant within the genomic sequence;

d) anonymizing (301) the mapped reads (B1) by replacing reference allele with alternative allele using allele frequency file (V2), thus masking reference allele by introducing novel variant to the genomic sequence;

e) producing anonymized file (B2) comprising mapped reads and file (D1) with masked alleles;

f) encrypting (402) file (D1) comprising masked alleles with public part of asymmetric key producing encrypted file (D2) comprising masked alleles.

2. The method according to claim 1 further comprising deanonymization (103) of the mapped reads, comprising the steps:

g) decrypting (403) encrypted file (D2) comprising masked alleles with private part of asymmetric key;

h) deanonymizing (302) the anonymized mapped reads (B2) by replacing each masked allele with the original one, using file (D1) with masked alleles, producing deanonymized file (B1) comprising mapped reads.

3. The method according to claim 1 further comprising a dissemination (102) of masked alleles allowing a user to re-encrypt subset of encrypted masked alleles for an arbitrary user, comprising the steps:

g) decrypting (403) file (D2) comprising masked alleles with private part of asymmetric key;

h) selecting all masked alleles or their subset (D3) within arbitrary genomic range (406);

i) encrypting (402) the selected masked alleles (D3) with public part of asymmetric key producing file (D4) comprising encrypted masked alleles.

4. The method according to claim 1 or 3, wherein said encrypting (402) comprises the steps:

a) encrypting file (D1 / D3) comprising masked alleles with symmetric key (K3);

b) encrypting symmetric key (K3) with public part of asymmetric key (K2);

c) storing encrypted asymmetric key (K2) in file (D2 /D4) comprising encrypted masked alleles.

5. The method according to claim 4, further comprising signing (401) file (D1 / D3) comprising masked alleles with private part of asymmetric key (K1) producing its signature (E2) and storing it in file (D2 / D4) comprising encrypted alleles.

6. The method according to claim 2 or 3, wherein said decrypting (403) comprises the steps:

a) decrypting symmetric key (E1), contained in file (D2 / D4) with encrypted masked alleles, with private part of asymmetric key (K1);

b) decrypting encrypted masked alleles from the file (D2 / D4) with the symmetric key (K3).

7. The method according to claim 6, further comprising verifying (404) masked alleles using associated cryptographic signature (E2) contained in file (D2 / D4) comprising encrypted alleles which is decrypted with public part of asymmetric key (K2).

8. The method according to claim 1, wherein said anonymization process (301) comprises the steps:

a) aggregating covering mapped reads, from the file (B1) comprising mapped reads, for each variant described by file (V2) comprising allele frequencies into a queue and processing (506) them by masking (701) each time next mapped read is placed after current variant position;

b) reading (507) next variant from the file (V2) comprising allele frequencies when all mapped reads covering current variant are processed;

c) writing (511) all mapped reads from the queue that precede current variant to the file (B2) comprising anonymized mapped reads;

d) writing (508, 514) all mapped reads from the queue together with remaining mapped reads from the file (B1) comprising mapped reads into the file (B2) comprising anonymized mapped reads when there is no next variant;

e) encrypting (503, 513) all unmapped reads by stream cipher before writing them to the file (B2) comprising anonymized mapped reads;

f) appending (510, 512) mapped read to the queue when it covers or precedes current variant and the queue is empty;

g) writing (504) mapped read to the file (B2) comprising anonymized mapped reads when it precedes current variant and the queue is empty;

h) reading (505, 509) next mapped read when the current mapped read is written (504, 508, 514) to file (B2) comprising anonymized mapped reads or is appended (510, 512) to the queue;

i) writing (515) all mapped reads in the queue to file (B2) comprising anonymized mapped reads when all mapped reads from file (B1) comprising mapped reads are processed.

9. The method according to claim 2, wherein said deanonymization (302) process comprises the steps:

a) aggregating covering mapped reads, from the file (B2) comprising anonymized mapped reads, for each variant from file (D1) comprising replaced alleles into a queue and processing (506) them by unmasking (702), each time next mapped read is placed after current variant position;

b) reading (507) next variant from the file (D1) comprising replaced alleles when all mapped reads covering current variant are processed;

c) writing (511) all mapped reads from the queue that precede current variant to the file (B1) comprising mapped reads;

d) writing (508, 514) all mapped reads from the queue together with remaining mapped reads from the file (B2) comprising anonymized mapped reads into the file (B1) comprising mapped reads when there is no next variant;

e) decrypting (503, 513) all unmapped reads by stream cipher before writing them to the file (B1) comprising mapped reads;

g) writing (504) mapped read to the file (B1) comprising mapped reads when it precedes current variant and the queue is empty;

h) reading (505, 509) next mapped read when current mapped read is written (504, 508, 514) to file (B1) comprising mapped reads or is appended (510, 512) to the queue;

i) writing (515) all mapped reads in the queue to file (B1) comprising mapped reads when all anonymized mapped reads in the file (B2) comprising anonymized mapped reads are processed.

10. The method according to claim 8 wherein said masking (701) comprising the steps:

a) receiving a variant from the file (V2) comprising allele frequencies, defined by the mapping of alleles to their frequencies at a given genomic position, together with a queue (Q1) of mapped reads, which is a complete list of mapped reads covering the position of the variant;

b) computing a probability matrix of all possible allele pairs (802) by multiplying the vector of public allele frequencies (801) with itself and randomly selecting a pair of masking alleles (803) with computed probability from the matrix;

c) inferring a pair of personal alleles (804) from the queue (Q1) of mapped reads as homozygous when only one allele is significantly represented in covering mapped reads or heterozygous if two such alleles exist;

d) replacing personal alleles (804) with masking alleles (803) within the queue (Q1) of mapped reads at the position of the variant using the mapping (805), creating a queue (Q2) of anonymized mapped reads for that position;

e) replacing remaining alleles within the queue (Q2) of anonymized mapped reads at the position of the variant with alleles other than masking alleles;

f) preserving a replaced allele from the personal pair (804) together with the identifiers (806) of mapped reads containing the allele if the personal pair of alleles (804) is heterozygous and the masking pair (803) is homozygous;

g) storing the mapping (805) from masking alleles to personal alleles at the given genomic position, possibly including identifiers of mapped reads associated with the masked allele (806) in file (D1) comprising replaced alleles;

h) writing mapped reads from the queue (Q2) to the file (B2) comprising mapped reads.

11. The method according to claim 9 wherein said unmasking (702) that is substantially reversed masking (701), comprising the steps:

a) receiving a variant from the file (D1) comprising replaced alleles describing by the mapping from masking alleles to personal alleles (807) at a given genomic position together with a queue (Q2) of mapped reads, which is a complete list of anonymized mapped reads covering the position of the variant;

b) receiving a list of identifiers of the mapped reads (806) associated with a single masked allele in case that heterozygous allele pair is masked by homozygous allele pair;

c) replacing masking alleles (803) with personal alleles (804) within the queue (Q2) of anonymized mapped reads at the position of the variant using the mapping (807) described by the variant in reverse;

d) restoring a replaced allele from the personal pair (804) within the queue (Q2) of anonymized mapped reads, using the identifiers (807) of mapped reads and the allele described by the variant;

e) writing mapped reads from the queue (Q1) to the file (B1) comprising mapped reads.

12. A computer implemented method for irreversible anonymization of digitalized genome sequence in form of reads, comprising the steps:

a) mapping (901) genomic reads (R1) obtained by DNA sequencing to a reference genome producing primary mapped reads (S1);

b) replacing nucleotide sequences of mapped reads with corresponding reference sequences and storing processed reads as primary anonymized reads (R2);

c) mapping (903) primary anonymized reads (R2) to the same reference genome producing secondary mapped reads (S2);

d) comparing mapped positions between each corresponding primary mapped read (S1) and secondary mapped read (S2) and selecting the secondary mapped read (S2) whenever their positions match and primary mapped read (S1) if they differ;

e) producing secondary anonymized reads (R3) from the selected reads.

13. A computer system comprising at least one computing device and / or server which comprises one or more processors and one or more modules configured to execute any one of the methods according to any one of the claims 1 to 12.

14. The computer system according to claim 13 comprising an anonymization module configured to perform anonymization (101) according to claim 1.

15. The computer system according to claim 13 comprising a deanonymization module configured to perform deanonymization (103) according to claim 2.

16. The computer system according to claim 13 comprising dissemination module configured to perform dissemination (102) according to claim 3.

17. A computer program product comprising computer-readable instructions which, when loaded into and executed on a computer system, causes the system to execute operations according to the method according to any one of the claims 1 to 12.