Genetic association of TMPRSS2 rs2070788 polymorphism with COVID-19 Case Fatality Rate among Indian populations


bioRxiv preprint doi: https://doi.org/10.1101/2021.10.04.463014; this version posted October 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Rudra Kumar Pandey1*, Anshika Srivastava1, Prajjval Pratap Singh1, and Gyaneshwer Chaubey1*

1 Cytogenetics Laboratory, Department of Zoology, Banaras Hindu University, Varanasi, India 221005

*Corresponding authors: E-mail address: gyaneshwer.chaubey@bhu.ac.in (Gyaneshwer Chaubey), rudrakumarpandey4@gmail.com (Rudra Kumar Pandey).

Abstract

SARS-CoV-2, the causative agent of COVID-19, an ongoing pandemic, engages the ACE2 receptor to enter the host cell through S protein priming by a serine protease, TMPRSS2. Variation in the TMPRSS2 gene may account for differences in disease susceptibility among populations. The haplotype-based genetic sharing and structure of TMPRSS2 among global populations have not been studied so far. Therefore, in the present work, we used this approach, with a focus on South Asia, to study the haplotypes and their sharing among various populations worldwide. We used next-generation sequencing data of 393 individuals and analysed the TMPRSS2 gene. Our analysis of genetic relatedness for this gene showed a closer affinity of South Asians with West Eurasian populations; therefore, host disease susceptibility and severity, particularly in the context of TMPRSS2, are expected to be more akin to those of West Eurasians than East Eurasians. This is in contrast to our prior study on the ACE2 gene, which showed that South Asian haplotypes have a strong affinity towards East Eurasians.
Thus, ACE2 and TMPRSS2 have an antagonistic genetic relatedness among South Asians. We have also tested the frequencies of SNPs in this gene among various Indian state populations with respect to the case fatality rate. Interestingly, we found a significant positive association between the rs2070788 SNP (G allele) and the case fatality rate in India. It has been shown that the GG genotype of rs2070788 tends to have a higher expression of TMPRSS2 in the lung compared to the AG and AA genotypes; thus, it might play a vital part in determining differential disease vulnerability. We trust that this information will be useful in underscoring the role of the TMPRSS2 variant in COVID-19 susceptibility, and that using it as a biomarker may help to predict populations at risk.

Keywords: COVID-19, TMPRSS2, India, rs2070788, haplotype, Linkage Disequilibrium

1. Introduction

COVID-19, an ongoing pandemic that has cost millions of lives worldwide, is caused by the SARS-CoV-2 virus of the Betacoronavirus genus. Along with ACE2 (Angiotensin-converting enzyme 2), which acts as a receptor, TMPRSS2 (Transmembrane protease, serine 2), a serine protease, is also involved in virus entry into the host cell through S protein priming (1,2). Along with SARS-CoV-2, the influenza virus and various human coronaviruses such as HCoV-229E, MERS-CoV, and SARS-CoV have been identified to utilize this protein for cell entry (3). Serine proteases have been linked to a variety of physiological and pathological processes. Androgenic hormones were shown to upregulate this gene in prostate cancer cells, while androgen-independent prostate cancer tissue was found to downregulate it (4). Northern blot analysis has revealed that in mice TMPRSS2 is mainly expressed in the kidney and prostate, whereas in humans TMPRSS2 is largely expressed in the prostate, salivary gland, stomach and colon (5).
TMPRSS2 is also expressed in the epithelia of the respiratory, urogenital and gastrointestinal tracts according to in-situ hybridization investigations performed on mouse embryos and adult tissues (5).

The impact of the COVID-19 crisis is not uniform across ethnic groups. Patients from different ethnic backgrounds suffer disproportionately (6). Discrepancies in infection as well as case fatality rates (CFR) could be due to multiple reasons, e.g., differences in quarantine and social distancing policies, access to medical care, reliability and coverage of epidemiological data, and population age structure, since mortality is greater among the elderly and those with comorbidities (7,8). However, many young and healthy people have also lost their lives due to rapid cytokine storms (9). It is important to note that these factors do not appear to account for all the disparities noticed among groups, and there are significant gaps that require the scientific community's attention to propose and test theories that will assist us in better understanding the disease etiology. This is even more important keeping in mind that the number of cases and deaths may be poorly reported in some populations. However, data from countries with strict standards for the collection and presentation of epidemiological data suggest that human variation in genetic makeup may account for differential susceptibility and severity in disease outcomes among different populations (10). There is evidence that supports the role of ACE2 gene variations in susceptibility to COVID-19 in Indian populations (11,12). However, little is known regarding the genetic structure of TMPRSS2 haplotypes among South Asian populations. A detailed analysis of the sequence data of the TMPRSS2 gene from world populations may unveil its haplotype sharing, which may help us understand the role of TMPRSS2 in disease susceptibility globally.
Given the relevance of the TMPRSS2 gene in the SARS-CoV-2 infection process, COVID-19 infection and severity patterns may be directly linked to elevated TMPRSS2 gene expression, resulting in varying disease susceptibility outcomes in various communities globally. However, the role of TMPRSS2 polymorphism in disease susceptibility in Indian populations is largely unexplored and needs to be examined. Therefore, in the current study, we analysed the haplotype structure of TMPRSS2, focusing on South Asia, and its genetic markers that could be responsible for changes in the gene's expression in lung tissue, and correlated them with epidemiological data on COVID-19 to test for any association among Indian populations.

2. Material and Methods

The TMPRSS2 gene haplotype analysis for various world populations was done using NGS data from (13). PLINK 1.9 was used to extract sequences from the dataset for different populations (14). After excluding samples from Sahul and Africa, as well as relatives up to the second degree, a total of 393 samples and 795 SNPs remained and were used for further study (Supplementary Tables 1 and 2). The PLINK file was converted to FASTA (ped to IUPAC) by a customized script (15). DnaSP was used for phasing, Fst calculation, population-wise genetic distance calculation, and generation of the Network and Arlequin input files (16). MEGA X was used to construct an Fst-based neighbour-joining tree (17). Nei's genetic distance and the average pairwise distance were calculated with Arlequin 3.5 and plotted with R v3.1 (18,19). Network v5 and Network Publisher were employed to draw the median-joining network, while the total and prevalent haplotypes in the TMPRSS2 gene for each population were calculated using the XML file generated through Arlequin 3.5 (18,20).
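The ped-to-IUPAC conversion step mentioned above can be illustrated with a minimal sketch. This is not the customized script of (15); the function names `genotype_to_iupac` and `ped_line_to_fasta` are hypothetical, and the sketch assumes the standard PLINK .ped layout (six leading columns, then two allele columns per SNP, with `0` for a missing allele). Heterozygous calls are collapsed into the standard IUPAC ambiguity codes.

```python
# Hypothetical sketch of a ped-to-IUPAC conversion; NOT the authors' script (15).
# Assumes the standard PLINK .ped layout: 6 leading columns, then 2 alleles per SNP.

VALID = {"A", "C", "G", "T"}

# Standard IUPAC nucleotide codes for homozygous and heterozygous genotypes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_iupac(allele1, allele2):
    """Return the IUPAC code for one diploid genotype, or 'N' if either allele is missing."""
    a1, a2 = allele1.upper(), allele2.upper()
    if a1 not in VALID or a2 not in VALID:
        return "N"
    return IUPAC[frozenset((a1, a2))]


def ped_line_to_fasta(line):
    """Convert one .ped line into a two-line FASTA record keyed by the individual ID."""
    fields = line.split()
    sample_id = fields[1]   # column 2 of a .ped file is the individual ID
    alleles = fields[6:]    # genotype columns start at column 7
    codes = [
        genotype_to_iupac(alleles[i], alleles[i + 1])
        for i in range(0, len(alleles) - 1, 2)
    ]
    return f">{sample_id}\n{''.join(codes)}"


if __name__ == "__main__":
    # Made-up example line with four SNPs (one missing genotype).
    print(ped_line_to_fasta("FAM1 IND1 0 0 1 -9 A G G G 0 0 C T"))
```

Encoding heterozygous sites with ambiguity codes keeps one sequence per individual, which is convenient as input for a downstream phasing step such as the one performed in DnaSP.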
For the association study, we searched the literature for TMPRSS2 variants reported in relation to COVID-19 susceptibility (4,21–41). A total of 5 of these SNPs (rs2070788, rs734056, rs12329760, rs2276205, and rs3787950) were observed in our data and studied subsequently in detail. Data from the Estonian Biocentre (42–45), data from phase 3 of the 1,000 Genomes Project (46), and our newly genotyped samples from several Indian states were used to calculate the frequency of each of these SNPs among various Indian populations using PLINK 1.9. State-wise frequency maps for rs2070788 and COVID-19 CFR among the Indian population were made with https://www.datawrapper.de/, and the worldwide spatial distribution of rs2070788 was generated from the PGG.SNV toolkit using 1000 Genomes samples (47). The regression plots for state-wise allele frequency vs. CFR were constructed using https://www.graphpad.com/quickcalcs/linear1/ and further validated by Microsoft Excel regression calculations. To verify our results, we also performed Pearson's correlation coefficient test (48) at a 95 percent confidence interval with 1,000 bootstrap samples (2,000,000 seeds) for a two-tailed significance test, using SPSS (ver 26). The LD map and the aggregate frequency of haplotypes carrying rs2070788 (G allele) were calculated for each of the populations with Haploview (49).

3. Result and Discussion

TMPRSS2 is a serine protease enzyme that is encoded in humans by the TMPRSS2 gene, located on chromosome 21q22.3 (50).
This protein aids the entry into host cells of viruses such as the influenza virus and human coronaviruses such as HCoV-229E, MERS-CoV, SARS-CoV, and SARS-CoV-2 by proteolytically cleaving and then activating the viral envelope glycoproteins (51), and this entry can thus be inhibited by a TMPRSS2 inhibitor (1). Genetic variation in this gene may account for differential vulnerability to COVID-19 among diverse populations. Therefore, in the present study, with our major focus on South Asia, we analyzed TMPRSS2 gene sequence data among world populations by a haplotype-based approach for comparison among the various groups. The Fst-based neighbour-joining (NJ) tree showed the clustering of South Asians with the West Eurasian populations (Caucasus, West Asia, Europe, and Central Asia) (Figure 1A). Similarly, the average pairwise differences analysis showed smaller diversity and genetic distance among populations within East Eurasia and within West Eurasia, while greater diversity and genetic distance were observed between East and West Eurasian populations. The lowest diversity was found within the West Asian and American populations (Figure 1B). A median-joining (MJ) network analysis of the TMPRSS2 gene revealed 499 haplotypes for this gene among the examined populations, with five prevalent haplotypes (Hap 34, Hap 48, Hap 75, Hap 98, and Hap 260), each found in ≥10 individuals. Haplotypes 48 and 75 were found to be more common in Europe, while haplotypes 98 and 260 were observed to be more common in Siberia.
Haplotype 34 was frequent in Southeast Asia, followed by Central Asia (Supplementary Table 3A and Supplementary Figure 1). Altogether, South Asian populations carry 47 haplotypes, of which 6 (Hap_34, Hap_48, Hap_78, Hap_112, Hap_219, and Hap_260) are shared with other continental populations, while the rest are unique to South Asia. Among the shared haplotypes, five are shared with West Eurasian populations, whereas only a single haplotype is shared with East Eurasian populations (Figure 1C and Supplementary Table 3B). The haplotype sharing as well as the Fst analysis are consistent with the West Eurasian affiliation of the majority of South Asian TMPRSS2 haplotypes (Figure 1C and Figure 1A). Therefore, host susceptibility to SARS-CoV-2 with respect to the TMPRSS2 gene among South Asians is most likely to be similar to that of West Eurasians rather than that of East Eurasians. In contrast, our previous study on the ACE2 gene showed a strong affinity of South Asian haplotypes with East Eurasians (11,12). Thus, for South Asians, ACE2 and TMPRSS2 have an antagonistic genetic relatedness. As a result, it is worth proposing that the South Asian population's susceptibility to SARS-CoV-2 will fall somewhere between that of West and East Eurasian people, which is most likely the cause of the moderate susceptibility.

There has not been any association study so far on TMPRSS2 variants in relation to COVID-19 among Indian populations. Therefore, we calculated group-wise allele frequencies in Indian populations for all 5 SNPs (rs2070788, rs734056, rs12329760, rs2276205, and rs3787950) observed in our data. Linear regression analysis was carried out for these SNPs, comparing their spatial frequency in India with the COVID-19 CFR among various Indian states (Supplementary Tables 4A, 4B and 5).
The regression analysis showed a significant positive correlation between allele frequency and case fatality rate for the rs2070788 SNP (G allele) (p < 0.05). Higher CFR was observed where the allele frequency is higher and vice versa (Figure 2A and B). The goodness of fit (R2) indicated that the model explained 33.82% of the variation (Figure 2C). Because this is an active pandemic with changing numbers of infected and dead patients, we confirmed our findings at different timelines (latest up to August 2021). The recent data back up the previous observation, with no substantial difference between the outcomes. To further validate our results, we performed the Pearson correlation coefficient test, which shows a significant positive correlation with r = 0.582, p = 0.029, thus supporting the previous observation of a strong positive association (Table 1).

TMPRSS2 expression in the lungs was reported to be higher in the rs2070788 GG genotype than in the AA and AG genotypes (52); thus, the G allele may contribute to severe consequences of SARS-CoV-2 infection in populations where its frequency is high. We found that the G allele frequency in India ranges from 20% to 50%, with a mean frequency of 39%, the lowest being in Arunachal Pradesh and the highest in Bihar. This is in accordance with the observed data, which show that Arunachal Pradesh is among the states with the lowest CFR, while Bihar and other states are among those with a higher CFR (Supplementary Table 4A and B). Thus, this may explain the disparity in the severity of the pandemic among various Indian states (Figure 2B).
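The kind of calculation behind these statistics can be sketched in a few lines. The example below is only an illustration under stated assumptions: the allele frequencies and CFR values are made-up placeholders, not the study's data (those are in Supplementary Table 4), the percentile bootstrap is a generic variant rather than the SPSS procedure used here, and the function names are hypothetical. It fits a simple least-squares line, computes Pearson's r, and resamples the state-level pairs to obtain an approximate 95% interval for r. For a simple linear regression with one predictor, R2 equals the square of r, which the sketch also shows.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_fit(x, y):
    """Ordinary least squares fit y = slope * x + intercept; also returns R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


def bootstrap_r_interval(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson's r (pairs resampled with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # r is undefined when a resample is constant
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Placeholder values for illustration only; NOT the study's data.
g_allele_freq = [0.20, 0.28, 0.33, 0.37, 0.40, 0.43, 0.46, 0.50]
cfr_percent = [0.4, 1.1, 0.9, 1.3, 1.0, 1.6, 1.2, 1.8]

slope, intercept, r_squared = linear_fit(g_allele_freq, cfr_percent)
r = pearson_r(g_allele_freq, cfr_percent)
lower, upper = bootstrap_r_interval(g_allele_freq, cfr_percent)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.4f}")
print(f"Pearson r = {r:.3f}, r^2 = {r * r:.4f}")
print(f"bootstrap 95% interval for r: ({lower:.3f}, {upper:.3f})")
```

Resampling whole (frequency, CFR) pairs keeps each state's two values together, which is what a bootstrap of a correlation requires; with only a handful of states the resulting interval is expected to be wide.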
Being an androgen-sensitive gene, TMPRSS2 is known to mediate sex-related effects, and the rs2070788 SNP seems to play an important role (53). Higher expression of TMPRSS2 in males might make them more prone to virus fusion and could explain the high COVID-19 mortality in males (54,55).

For linkage disequilibrium (LD) analysis, LD plots were made for each population, focussing on rs2070788 and nearby SNPs on that haplotype. LD blocks of various sizes were observed among Central Asians, Caucasians, Europeans, South Asians, Siberians, and West Asians. The highest LD level was found in Americans (Supplementary Figure 2). We also calculated, for each population, the aggregate frequency of haplotypes in LD that carry rs2070788 (G allele) (Supplementary Table 6). Considerable variation in haplotype frequency was observed among the populations. The highest haplotype frequency was observed in America (0.654), while the lowest was recorded in Southeast Asia Island (0.322). These findings are consistent with the epidemiological data available on COVID-19, which show that the American population has the highest number of cases and deaths, while Southeast Asians are much lower in the list.

... Asians could be due to adaptation at many genes that interact with coronaviruses, including SARS-CoV-2, an adaptation that began 25,000 years ago in response to a coronavirus or related virus outbreak in East Asia at that time (56).

4. Conclusion

In conclusion, we have shown for the first time a closer affinity of South Asians with West Eurasian populations for the TMPRSS2 gene. Hence, host disease susceptibility in the context of TMPRSS2 is more likely to be similar to that of West Eurasian populations.
This is in contrast to our prior study on the ACE2 gene, which showed a closer genetic affinity of South Asian haplotypes with East Eurasians. Thus, for South Asians, ACE2 and TMPRSS2 have an antagonistic genetic relationship. So, it is worth proposing that the susceptibility of the South Asian population to SARS-CoV-2 will fall somewhere between that of West and East Eurasian populations, which is most likely the source of the moderate susceptibility. We also found a genetic association between rs2070788 and CFR among various Indian populations. This information could be used as a genetic biomarker to predict susceptible populations, which may be very useful during the epidemic for policymaking and better resource allocation.

Author Contributions

GC and RKP conceived and designed this study. RKP, AS, and PPS analysed the data.

An ancient viral epidemic involving host coronavirus interacting genes more than 20,000 years ago in East Asia. Curr Biol. 2021 Aug 23;31(16):3504-3514.e9.

TABLE 1 | Outcome of tests conducted for statistical significance at different timelines of the pandemic in India.
rs2070788

Observation        Linear regression          Pearson's correlation
                   R square     p-value       r         p-value
June 2021_CFR      0.3382       0.0292        0.582     0.029
July 2021_CFR      0.3097       0.0387        0.557     0.039
August 2021_CFR    0.2888       0.0475        0.537     0.047
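As a quick internal consistency check on Table 1, the R square of a one-predictor linear regression should equal the square of Pearson's r for the same data, up to rounding of the reported values. The short sketch below uses only the numbers printed in the table; the helper name `max_discrepancy` is hypothetical.

```python
# Values transcribed from Table 1 (rs2070788): (R square, Pearson's r) per timeline.
TABLE1 = {
    "June 2021": (0.3382, 0.582),
    "July 2021": (0.3097, 0.557),
    "August 2021": (0.2888, 0.537),
}


def max_discrepancy(table):
    """Largest absolute difference between r squared and the reported R square."""
    return max(abs(r * r - r_squared) for r_squared, r in table.values())


for label, (r_squared, r) in TABLE1.items():
    print(f"{label}: reported R^2 = {r_squared:.4f}, r^2 = {r * r:.4f}")

print(f"largest difference: {max_discrepancy(TABLE1):.4f}")
```

The differences stay below 0.001, which is the size expected when r is reported to three decimal places, so the two sets of statistics in the table agree with each other.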