Page 1 of 10 Manuscripts submitted to Bioinformatics Advances
1 Bioinformatics Advances, 2021, pp. 1–21
2
3 doi: DOI HERE
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
Advance Access Publication Date: Day Month Year
4 Paper
5
6
7
8
9
10
11 PAPER
12
13
14
How sticky are our proteins? Quantifying
15 hydrophobicity of the human proteome
16
17 Juami Hermine Mariama van Gils,1,⇤ Dea Gogishvili,1 Jan van Eck,1
18
19
Robbin Bouwmeester,1 Erik van Dijk1 and Sanne Abeln1,⇤
20 1
Center for Integrative Bioinformatics (IBIVU), Computer Science Department, Vrije Universiteit Amsterdam, De Boelelaan 1111, 1081
21 HV, Noord-Holland, The Netherlands
⇤
Corresponding author. s.abeln@vu.nl, j.h.m.van.gils@vu.nl
22 FOR PUBLISHER ONLY Received on Date Month 2021; revised on Date Month Year; accepted on Date Month Year
23
24
25
26 Abstract
27 Proteins tend to bury hydrophobic residues inside their core during the folding process to provide stability to the
28 protein structure and to prevent aggregation. Nevertheless, proteins do expose some ’sticky’ hydrophobic residues to the
29 solvent. These residues can play an important functional role, for example in protein-protein and membrane interactions.
30 Here, we investigate how hydrophobic protein surfaces are by providing three measures for surface hydrophobicity: the
31 total hydrophobic surface area, the relative hydrophobic surface area, and - using our MolPatch method - the largest
32 hydrophobic patch. Secondly, we analyse how difficult it is to predict these measures from sequence: by adapting solvent
33 accessibility predictions from NetSurfP2.0, we obtain well-performing prediction methods for the THSA and RHSA,
34 while predicting LHP is more difficult. Finally, we analyse implications of exposed hydrophobic surfaces: we show
that hydrophobic proteins typically have low expression, suggesting cells avoid an overabundance of sticky proteins.
35
Availability: https://github.com/ibivu/hydrophobic_patches
36
37 Key words: Protein Structure, Hydrophobicity, Surface accessible area, Human tissues, Neurodegenerative disease
38
39
40
41 Introduction patches and residue contributions have been previously used
42 Hydrophobic residues tend to be buried inside the core of a
for identifying aggregation-prone regions (Sankar et al.,
2018). Abundant exposed hydrophobic residues can also a↵ect
43 protein to avoid contact with their hydrophilic surroundings
experimental outcomes: exposed hydrophobic residues may
44 (the hydrophobic e↵ect) (Dill, 1985, 1990). Hydrophobic
cause gel formation and prevent crystallisation for protein
45 residues that do occur on the protein surface often play
structure determination (Wright and Dyson, 1999); in liquid
a functional role, e.g. for protein-protein interactions and
46 chromatography surface hydrophobicity is used to separate
membrane binding (Gowder et al., 2014; Chothia and Janin,
47 1975; Young et al., 1994). Additionally, exposed hydrophobic
proteins for further experiments (Moruz and Käll, 2017).All
48 residues can play a role in the progression of diseases. For
these examples suggests that the more hydrophobic a protein
surface, the more “sticky” this protein is to its surrounding (see
49 example, it has become apparent that hydrophobicity may
also panel 1 in Figure 1).
50 play a major role in the formation and stabilisation of
The hydrophobic surface area can be defined in di↵erent
51 amyloid fibrils (Iadanza et al., 2018; Tuttle et al., 2016;
ways. Here, we use three di↵erent structure-based measures to
52 van Gils et al., 2020), which are linked to aggregation
describe surface hydrophobicity (see panel 1 in Figure 1):
diseases such as Alzheimer and Parkinson (Dobson, 2001;
53
Koo et al., 1999; Ross and Poirier, 2004; Chiti and Dobson, 1. The total hydrophobic surface area (THSA) is the sum of
54 2006). In fact, burying the hydrophobic residues inside the the exposed surface area of all the hydrophobic residues.
55 folded protein is also thought to prevent aggregation (Dobson,
56 2003; Abeln and Frenkel, 2008, 2011). Therefore, hydrophobic
57
58
© The Author(s) 2022. Published by Oxford University Press.
59
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
60 (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,
1
provided the original work is properly cited.
Manuscripts submitted to Bioinformatics Advances Page 2 of 10
2 Author Name et al.
1
2 2. The relative hydrophobic surface area (RHSA) is the predict whether a hydrophobic residue will be exposed to the
3 fraction of the protein surface that is hydrophobic, i.e. the surface is not a trivial task: the earlier methods tended to
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4 THSA divided by the total accessible surface area (TASA). predict the majority of hydrophobic residues to be fully buried
3. The largest hydrophobic patch (LHP) is the largest (see Figure S2), as may be expected since the hydrophobicity
5
connected hydrophobic area on the protein surface of residues is strongly associated with being buried inside the
6 (and is therefore always smaller than or equal to the protein (Kyte and Doolittle, 1982a). The current generation of
7 THSA). It has been shown that LHP size a↵ects protein residue-based surface accessibility predictors use deep neural
8 solubility (Lijnzaad and Argos, 1997; Bahadur et al., 2003; networks. For example, NetSurfP2.0 is a deep learning-based
9 Huang and Chandler, 2000) and function (Larsen et al., multitask predictor, which uses evolutionary profiles to make
10 1998; Dobson, 2004). sequence-based predictions of structural features (Klausen
11 Note that THSA, RHSA and LHP may not always correlate.
et al., 2019). It uses both convolutional and long short-
term memory neural layers in the deep learning architecture,
12 For example, a large THSA value can be due to the size of with the ability to predict both secondary structure and
13 the protein, and a protein with many scattered hydrophobic solvent accessibility (Klausen et al., 2019). Here we will
14 residues on its surface may have a small LHP but a large THSA show NetSurfP2.0 is able to make accurate enough surface
15 and RHSA. accessibility predictions for hydrophobic residues, which in
Experimentally, the exposed hydrophobic surface area can
16 turn can be used to predict the global hydrophobic surface
be estimated using di↵erential scanning calorimetry (DSC), for
17 which the heat capacity temperature relation for the folded
measures described above. Previously, hydrophobic patches
have been used to predict the aggregation propensity of protein
18 protein is directly related to the THSA (Gomez et al., 1995; regions using 3D structures as input (Sankar et al., 2018). This
19 Dijk et al., 2016). AggScore method, trained on a small (31) number of adnectin
20 In this work, our main goal is to investigate how proteins, is focused on amyloid-like proteins. In this work we
21 hydrophobic protein surfaces are within the human proteome. are interested in generic hydrophobicity of a proteome.
22 We also provide some insight how hydrophobicity is related Finally, we use the best-performing prediction methods to
to cellular expression levels, giving an idea of the overall
23 predict the THSA, RHSA and LHP of all proteins in the
hydrophobicity in the cellular environment. The question why
24 some of the human proteome is hydrophobic is not the main
human proteome, and correlate this to cellular expression levels,
providing e↵ectively an indication of proteome hydrophobicity
25 focus of our investigation, but is considered in some cases to per cell type. Subsequently, we use our predictions to provide a
26 interpret results. glance into the potential implications of a highly hydrophobic
27 We use 3D structural information from the PDB to proteome in terms of human disease.
28 determine the THSA, RHSA and LHP from structure. The
29 THSA and RHSA can be derived by summing over the exposed
surface area per residue calculated by DSSP (Kabsch and
30
Sander, 1983). To calculate the LHP, we introduce a novel
31 method named MolPatch, which is loosely based on a method
Methods and Materials
32 developed by Lijnzaad et al. (Lijnzaad et al., 1996; Lijnzaad Figure 1 indicates how our approach is split into three stages.
33 and Argos, 1997). Firstly, we created a database of filtered PDB structures
34 (Figure 2) using PISCES (Wang and Dunbrack Jr, 2003).
35 We used this culled set to define measures for surface
hydrophobicity: THSA, RHSA and LHP. For the latter, we
36
used a newly developed tool named MolPatch. Secondly, using
37 the same dataset, we investigated how well we can predict
38 these measures from sequence using the output generated by
39 NetSurfP2.0. Finally, we determined the biological impact of
40 the THSA, RHSA and LHP. To this end, we created a dataset of
41 Fig. 1. Outline of the study. 1) Structure-based definition represents human proteins from UniProt (Consortium, 2019). We used the
42 the three hydrophobic measures: red and yellow colours indicate the best prediction models to predict the THSA, RHSA and LHP
surface of hydrophobic residues, the blue colour indicates the surface for each of these proteins. The structure-based data includes
43 of hydrophilic residues. The THSA is calculated by summing the area
protein structures from all organisms, while the sequence-based
44 of all hydrophobic residues (red and yellow). The RHSA is calculated
data only includes human proteins. Subsequently, we correlated
45 by dividing the THSA by the total accessible surface area (TASA, red,
gene expression to the hydrophobicity in the human proteome
yellow, and blue). The LHP is the largest area of adjacent hydrophobic
46 residues (only red). 2) We train and benchmark sequence-based prediction for di↵erent cell types.
47 methods of the three hydrophobic measures. 3) THSA, RHSA and LHP
48 values for the human proteome were predicted by the best performing
methods and used to estimate the abundance of hydrophobic proteins in
49 various diseases and tissues.
Calculating the measures for hydrophobicity
50 To calculate the THSA we sum over the surface areas of
51 all hydrophobic residues in the protein. For proteins with
Since many protein structures have not yet been determined an available 3D structure in the PDB, this quantity can be
52
experimentally, we subsequently use the values we obtain determined by calculating the surface area of each residue
53 from the PDB structures to train/assess predictors for these using DSSP (we used the DSSP module in Biopython version
54 three hydrophobicity measures. There is a wide range of 1.76) (Cock et al., 2009). To calculate RHSA, THSA was
55 methods that can predict the surface accessibility for a single divided by the total surface area of all residues in the protein.
56 residue (Garg et al., 2005; Petersen et al., 2009; Joo et al., Residues, r, were considered hydrophobic in this work, if:
57 2012; Faraggi et al., 2012; Klausen et al., 2019). However, to r 2 {A, C, F, I, L, M, V, W, Y }
58
59 https://mc.manuscriptcentral.com/bioadv
60
Page 3 of 10 Manuscripts submitted to Bioinformatics Advances
How sticky are our proteins? 3
1
2
3
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4
5
6
7
8
9
10
11
12
13 Fig. 2. Data curation scheme representing the main steps used to generate datasets for this study. The boxes show the filtering steps and the arrows
indicate the number of entries (structures or sequences) passed through. The structure-based definitions dataset used the protein 3D structure information
14 and the human proteome dataset was constructed of protein sequences. The distribution of the measures for surface hydrophobicity within the datasets
15 are represented by the Figure 5 and are colour-coded.
16
17
18 MolPatch structures were filtered out, which resulted in a dataset of 5,110
19 In order to calculate the surface area of the LHP given unique monomeric protein structures. Multimers often interact
a protein structure, we need to find the largest connected through hydrophobic patches, and if strongly bound such
20
hydrophobic surface area on the protein’s surface. For this patches will not be exposed in the native environment. Finally,
21 purpose, we developed the tool MolPatch (also see Figure S1). transmembrane proteins have a relatively large hydrophobic
22 Given the PDB structure of a protein, MolPatch creates a surface area. TMHMM (Sonnhammer et al., 1998; Möller et al.,
23 point cloud on the solvent-excluded protein surface (SES) using 2001) was used to filter transmembrane proteins from the
24 MSMS (Sanner et al., 1996). In this work, the SES was dataset (>18 amino acids in transmembrane helices).
25 constructed using a probe of 1.5 Å and a density of 1.5
points per Å2 . Each point on the point surface was labelled
26 Machine Learning models
hydrophobic or hydrophilic based on the hydrophobicity The final dataset for training and testing of the models
27
classification of the closest residue. Initial edges between contained 4,917 monomers. For the THSA and RHSA, the
28 points were then created if the points existed within a range values calculated by DSSP (as described above) were used as
29 of 1.25Å of each other. This search was performed with training output labels. MolPatch was used to create training
30 the KDTree algorithm to speed up the process (Bentley, output labels for evaluating the LHP predictions. For all
31 1975). Finally, only the edges between node pairs labelled as models, a training-test split of 80% and 20% was used. Within
32 hydrophobic were retained. This created a network of isolated the training set, a three-fold cross-validation scheme was used
hydrophobic patches. The individual network components
33 to train the models. Predictions for the THSA, RHSA and LHP
were then extracted for accessible surface area estimation. were acquired with the following models:
34 MolPatch is available on GitHub, this version can also carry out
35 hydrophobic patch identification using atom-based definitions 1. The three feature model (TFM) uses the sequence
36 of hydrophobicity for each SES point rather than residue-based length, number of hydrophobic amino acids and number
37 definitions. In this work we only use the residue-based method. of hydrophilic amino acids as input. This model is trained
38 using a cubist regression in the CARET module (Kuhn,
2008).
39 Sequence-based predictions
2. The global feature model (GFM) uses 31 global features
40 Data curation (the total count of each of the 20 possible amino acids
41 To predict the THSA, RHSA and LHP, a dataset of PDB in the sequence, sequence length, entropy, hydrophobic
42 structures was generated using PISCES. PISCES is a public amino acid count, polar amino acid count, molecular
43 server for culling sets of protein sequences from the Protein weight, aromaticity, instability index, gravy score, buried,
44 Data Bank (PDB) by sequence identity and structural quality isoelectric point and molar extinction coefficient) as input.
45 criteria (Wang and Dunbrack Jr, 2003). This is important, This model is trained using an XGBoost regressor (Chen
because using structures with a high sequence identity can and Guestrin, 2016).
46
introduce bias in the dataset, and factors such as the resolution 3. (THSA and RSHA only) NetSurfP2.0 (Klausen et al.,
47 can a↵ect the accuracy of the results. The chosen parameters 2019) was used to predict the accessible surface area of
48 were as follows: sequence percentage identity lower or equal to all the amino acids in a protein. Subsequently, the THSA
49 25%, resolution lower or equal to 3.0Å, R-factor lower or equal was calculated by summing over the predicted surface areas
50 to 0.3, sequence length within the range of 40-10,000 amino of the hydrophobic residues in the protein sequence. The
51 acids, and non-X-ray entries and C↵-only entries were excluded RHSA was calculated by dividing the predicted THSA by
as we aimed at constructing a homogeneous training dataset the sum of the surface areas of all the residues in the protein
52
for interpretable results. The culled dataset consisted of 13,858 as predicted by NetSurfP2.0. Note that we did not perform
53 unique protein structures with a selection of 14,604 chains. Two any model training for this approach; NetSurfP2.0 was used
54 obsolete PDB chains were removed. as publicly available with default settings.
55 In order to avoid bias from hydrophobic patches that 4. (LHP only) The LHP cannot be calculated from
56 may never be exposed in a cellular environment, we filtered NetSurfP2.0 predictions directly, as the connections
57 multimeric and transmembrane proteins. Multimeric protein between the residues on the surface are not known from
58
59 https://mc.manuscriptcentral.com/bioadv
60
Manuscripts submitted to Bioinformatics Advances Page 4 of 10
4 Author Name et al.
1
2 these sequence-based predictions. A random forest model Human proteome mapping
3 was trained using the RHSA and THSA predicted by
Data curation
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4 NetSurfP2.0 (as described above) as input features. This
All reviewed protein sequences for the human genome were
model was called the NetSurfP-based model (NBM).
5 extracted from UniProt (Consortium, 2019) (accessed 1st
6 The data was randomly split into a training and test set Oct 2020). In total 20,384 sequences were analysed with
NetSurfP2.0 for predicting solvent accessibility and structural
7 of 80% and 20%, respectively. To assess the models a cross-
disorder among other characteristics. THSA and RHSA values
8 validation scheme was used. First two-thirds of the training
were calculated from NetSurfP2.0 predictions as described
data was used for a five-fold cross-validation scheme, in
9 above. The LHP for each protein has been predicted using
which the best performing parameters for each of the models
10 were optimised using a grid search method. Subsequently the the NBM. The following data curation steps have been
11 remaining one-third of data, i.e. the validation set, was used to administered to remove unreliable predictions: (1) highly
12 choose the best performing model. Note that the performance disordered proteins have been discarded: when more than a
13 of the models and model parameters was optimised on the half of the residues have been classified as disordered; (2)
large proteins (>800 AA residues) have been discarded in order
14 R2 . Finally, the 20% of test data was used to estimate the
to match the protein sizes in the structure-based definitions
15 performance of the best performing model and the NetSurfP2.0
dataset as the majority (99.2%) of proteins are in this range
based calculations (code available on GitHub).
16 for the training data. (3) duplicate gene IDs were filtered out
17 and the ones with the highest THSA value were retained.
Estimation of prediction errors
18 In order to evaluate the predictions, the structure-based
This quality filter resulted in a curated dataset of 14,533
19 definitions and sequence-based predictions can be compared,
proteins. Separate datasets were created with 4,913 proteins
20 by calculating the correlation coefficient R2 . Nevertheless, for
annotated as transmembrane and 6,825 - as multimeric by
UniProt (Figure 2).
21 difficult regression tasks this value will put a lot of weight
Additionally, the final curated dataset described above was
22 on the outliers, and will not produce results that are easy to
used to analyse the link between the expression levels and
23 interpret. In addition to the R2 measure, we also evaluated the
measures for surface hydrophobicity. RNA consensus tissue
performance of the prediction model by examining the relative
24 gene data was downloaded from Human Protein Atlas (Pontén
error threshold curve given a certain threshold, partially
25 inspired by the GDT TS score (Zemla et al., 2001). A major
et al., 2008; Uhlén et al., 2015) (accessed on https://www.
26 benefit of this method is that it is robust against extreme
proteinatlas.org/about/download 24 Dec 2020). In order to
27 outliers. For each prediction, the relative THSA error ( T HSAi ),
obtain a single expression value for each gene, the highest
expression value was selected among all the tissues each gene
28 RHSA error ( RHSAi ), and LHP error ( LHPi ) for each protein
is expressed in. Subsequently, the genes were divided in deciles
29 i are defined by the following formulas:
based on these values.
30
31 T HSAi =
|T HSApredi T HSADSSPi |
(1) Gene set enrichment analysis
T HSADSSPi
32 To identify tissue types enriched in proteins with large
33 hydrophobic surface area, Gene Set Enrichment Analysis
34 =
|RHSApredi RHSADSSPi |
(2)
(GSEA) was used. THSA, RHSA, and LHP values were
RHSAi centered (such that 0 fell between two parts of a bimodal
35 RHSADSSPi
distribution or between the main bulk and the tail of the
36 distribution, see Supporting Information and Figure S10) and
37 LHPi =
|LHPpredi LHPM olP atchi |
(3) scaled (Figure S1) prior to the pre-ranked GSEA analysis
38 LHPM olP atchi
(Subramanian et al., 2005; Mootha et al., 2003). Tissue-
39 where T HSApredi , RHSApredi , and LHPpredi are the enriched gene sets were downloaded from the Human Protein
40 predicted THSA, RHSA, and LHP of a protein. T HSADSSPi Atlas (accessed 10 Nov 2020). 375 Disease associated gene sets
41 and RHSADSSPi are the THSA and RHSA of a protein were extracted from the GSEA website (accessed on https:
//www.gsea-msigdb.org/gsea/msigdb/search.jsp 5 Nov 2020).
42 estimated using DSSP. LHPM olP atchi is the predicted LHP
GSEA was used with the following parameters: number of
43 of a protein, determined by MolPatch. The performance of
permutations = 1000; collapse; chip platform: human UniProt
the methods over the whole set of structures is evaluated by
44 IDs MSigDB.v7.2.chip”; enrichment statistic: weighted; max
plotting the percentage correctly predicted instances (protein
45 chains) versus a varying error threshold t. The threshold curve, size=1000, min size=15.
46 F (t), shows the percentage of correctly predicted THSA and
47 RHSA of proteins for a given relative error threshold,t: Tissue-specific average surface hydrophobicity
48 Tissue-specific average surface hydrophobicity (TASH) was
calculated across all the genes with the following formula with
49 F (t) =
|{i|i 2 chains ^ < t}|
· 100 (4) and without transmembrane proteins:
50 |{i|i 2 chains}|
P
51 g NXg,t · hg
The relative error for all chains in the chain dataset is calculated TASHt = P (5)
52 g NXg,t
to determine the fraction of correctly predicted chains for the
53 threshold, see also Figure S9. The is here interchangeably Where TASHt is the tissue-specific average surface hydrophobicity
54 used for T HSAi , RHSAi , or LHPi . Unlike in a ROC-curve, for tissue t, NXg,t is the normalised expression of gene g in
55 the amount of correctly predicted chains does not necessarily tissue t and h is the predicted hydrophobicity of gene g for one
56 have to be 100% when the threshold t = 1.0, since the size of of the three measures (THSA, RHSA or LHP). The results are
57 the relative error can be >100%. shown in Figure S7.
58
59 https://mc.manuscriptcentral.com/bioadv
60
Page 5 of 10 Manuscripts submitted to Bioinformatics Advances
How sticky are our proteins? 5
1
2 Results Sequence-based predictions - THSA and RHSA can be
3 Structure-based definitions - MolPatch
predicted with reasonable accuracy
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4 Since there are many more protein sequences available than
To quantify the exposed hydrophobic areas on the protein
5 surface, we defined three di↵erent structure-based measures
structures, it is highly valuable to be able to predict the
6 for surface hydrophobicity, the THSA, RHSA and LHP. Using
THSA, RHSA and LHP from sequence, which will allow us to
characterise much broader set of proteins. Thus, we aimed to
7 DSSP (Kabsch and Sander, 1983), we can calculate the THSA
determine how well we can currently predict the three measures,
8 and RHSA directly from the surface area per residue (Figure 3),
and identify which sequence features contribute most to the
9 see methods for further details.
accuracy of these predictions. We used our structure-based
10 To define the largest hydrophobic patch on a protein surface,
definition set to develop sequence-based predictors in a double
we developed a novel tool named MolPatch. This tool takes
11 cross-validation scheme (see Methods).
the 3D coordinates in PDB format and identifies networks of
12 adjacent hydrophobic residues to find hydrophobic patches on
13 the protein surface. Hydrophobic patches of 4,250 structures of
14 soluble proteins were analysed using MolPatch (see Methods).
15 Figure 3 highlights the importance of having three measures
16 by observing the LHP of two proteins with very di↵erent
surface areas. Although the di↵erence in RHSA between the
17
two proteins is only 6%, the THSA and LHP of Leishmanolysin
18 are approximately 1.5 and 3 times larger than the LHP of SabA.
19 Generally, we see that there is no trivial correlation between
20 THSA, RHSA and LHP (Figure S3).
21
22
23
24
25 THSA: 3691 Å
2
26 RHSA: 0.20
27 LHP: 877 Å
2
28
29
30
31 Fig. 4. Accuracy of the predictions of the total, relative and largest patch
32 hydrophobic surface area for NetSurfP2.0-based models, the NBM, TFM
33 2 and GFM. The fraction of correctly predicted proteins within a certain
THSA: 5046 Å error margin for each of the methods are shown as calculated over the
34
RHSA: 0.26 test set.
35 2
LHP: 2459 Å
36 To predict the THSA and RHSA, we used NetSurfP2.0,
37 a neural-network-based method that takes evolutionary
38 conservation profiles as input, and is currently one of the
best (non-ensemble) predictors for surface accessibility and
39
secondary structure (Klausen et al., 2019; Xu et al., 2020;
40 Fig. 3. THSA, RHSA and LHP, as identified by MolPatch for
Fereshteh et al., 2020). NetSurfP2.0 provides surface area
41 two di↵erent protein structures. Top: SabA, PDB=4O5J. Bottom:
predictions per residue. To obtain the THSA, we summed
Leishmanolysin, PDB=1LML. The surface of hydrophobic residues are
42 displayed in yellow and red. Those in the largest hydrophobic patch (LHP) over the predicted accessible surface areas of all hydrophobic
43 are displayed in red. The surfaces of the hydrophilic residues are displayed residues. To obtain the RHSA, we summed over the predicted
44 in blue. Note that Leishmanolysin is much larger (465 residues) and has accessible surface area of all residues and divided the THSA
a much larger THSA (5046 Å2 ) compared to SabA (370 residues, 3691
45 Å2 ), while the RHSA is quite similar between the two proteins, 26% vs
by this value. Previous results (see Supporting Information
and Figure S2) indicate that the sequence length and
46 20%. The di↵erence in the LHP is even larger, with 2459 Å2 vs. 877 Å2 ,
hydrophobicity are strong predictors for the THSA and RHSA,
47 respectively; a nearly three-fold di↵erence.
and even outperformed a previous version of NetSurfP ??
48 (Figure S5). Therefore, we trained two additional models,
49 one that incorporates the sequence length, the number of
50 To determine whether our structure-based largest patch hydrophobic residues and the number of hydrophilic residues
51 definition is reasonable in biological terms, we overlapped the (three-feature model, TFM), and one that includes a larger
residues in the 20 largest hydrophobic patches of each protein number of features derived from the sequence (global feature
52
in our database with those in the PiSITE protein interaction model (GFM) see Methods). Figure 4 and Table 1 show that
53 database (also see SI Methods). We would expect that large the NetSurfP2.0 based predictions are clearly superior.
54 hydrophobic patches, functionally may serve as a protein- The TFM, which only includes the features sequence length,
55 protein interaction interfaces. Indeed, we found that overall, number of hydrophobic and number of hydrophilic residues,
56 the three largest patches in a protein were significantly enriched also performs significantly better than random for both the
57 in protein interaction sites (Figure S4). THSA and RHSA, indicating that these features are of major
58
59 https://mc.manuscriptcentral.com/bioadv
60
Manuscripts submitted to Bioinformatics Advances Page 6 of 10
6 Author Name et al.
1
2 significance for predicting these two properties. The GFM,
3 which includes 31 features, performs only marginally better
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4 than the TFM, indicating that sequence length and sequence
hydrophobicity are some of the main determinants for the
5
hydrophobic surface area. Since it is difficult to obtain feature
6 importance from neural network models such as NetSurfP2.0,
7 we also analysed the feature importance measures from the
8 GFM. This analysis showed that the hydrophobicity of the
9 sequence is another major predictor for the THSA and
10 RHSA (Figure S5, gravy score (Kyte and Doolittle, 1982b),
11 aromaticity (Lobry and Gautier, 1994), hydr count).
To predict the LHP from sequence, the LHP determined by
12
MolPatch was used as a gold standard. The training procedure
13 for the TFM and GFM for predicting the LHP, was performed
14 in a similar fashion to the training for THSA and RHSA.
15 Since the NetSurfP2.0 predictions cannot readily be used to
Fig. 5. The distribution of the protein length, THSA, RHSA and LHP
values from the whole curated human proteome (red, predicted values),
16 predict the LHP, a model was trained that uses the THSA and annotated transmembrane (yellow, predicted values) and multimeric
17 RHSA predicted by NetSurfP2.0 as input features to predict the (grey, predicted values) proteins and the same values in the structure-
LHP (NetSurfP-based model, NBM). The results are shown in
18 based dataset (blue, structure-based values) for the comparison. The
structure-based dataset generally contains smaller proteins than the
Figure 4 and Table 1. One can see that the NBM outperforms
19 the other two methods. The sequence hydrophobicity again
human proteome datasets, as may be observed from the length
20 appears to have a major contribution to the prediction results
distribution. The numbers in brackets in the legend indicate the sizes of
21
the datasets analysed. The figure also shows that transmembrane proteins
(Figure S5). Nevertheless, each of these prediction models are predicted to have large hydrophobic surface areas, as can be observed
22 perform significantly worse than the models for the THSA and in the LHP plot: ⇠ 2000 Å2 ; ⇠ 6500 Å2 .
23 RHSA predictions (Table 1), suggesting LHP prediction is less
24 straightforward than THSA or RHSA predictions.
25 neither contains proteins with more than one chain in the
26 Table 1. R2 of each of the prediction models for the THSA, RHSA PDB structure nor transmembrane proteins; both groups of
27 and LHP for the four di↵erent prediction models as calculated over
the test set. For all three measures, the NBM/NetSurfP2.0 models
proteins maybe expected to have a very large hydrophobic
28 performed significantly better than the other models (P<0.001, t- patch. To investigate if this peak for the human proteome
29 test for correlations (Fisher, 1921)) may be due to transmembrane or multimeric proteins, we
selected those proteins annotated by UniProt (Consortium,
30 THSA RHSA LHP Features 2019) as ’transmembrane’ (yellow), or ‘part of the protein
31 complex’ (grey). Note that these sequence-based curated
32 NetSurfP2.0 0.92 0.77 - Evolutionary profiles
UniProt annotations are not identical to the structure based
33 NBM - - 0.43 THSA and RHSA
exclusion criteria for the training dataset (see Methods).
predictions by NetSurfP2.0
34 Indeed, when comparing transmembrane proteins from the
TFM 0.71 0.13 0.00 Sequence length, number
35 of hydrophobic residues,
human proteome dataset to the predictions for the entire human
36 number of hydrophilic
proteome, the composition of peak of the large hydrophobic
patches, as well as the shoulder in the RHSA distribution
37 residues
can be explained predominantly through the transmembrane
38 GFM 0.75 0.49 0.12 31 sequence-based features
annotated proteins (Figure 5). Note that the two distinct peaks
39 in the predicted LHP for the transmembrane proteins can be
40 explained by the size of the transmembrane domain: some
41 proteins contain a large transmembrane domain (LHP ⇠ 6500
42 Human Proteome Mapping Å2 ), while other protein are anchored to the membrane with
43 Transmembrane proteins - the most hydrophobic part of the a single transmembrane helix (LHP ⇠ 2000 Å2 ).THSA and
RHSA do not show two distinct peaks for the LHP, instead
44 human proteome
we see a shoulder at higher values - corresponding to large
45 For 14,533 proteins in the human proteome, we were able to
transmembrane domains.
predict THSA, RHSA and LHP values (see Methods). Figure 5
46 Multimeric proteins mostly follow the distribution of the
shows a comparison of the distributions of the definitions of
47 these values on the structural dataset and of the predicted whole human proteome and do not appear to be much more
48 values on the human proteome dataset. One can see from the hydrophobic in general. The results in Figure 5 also suggest that
49 figure that proteins in the structure-based dataset appear to our ML model (NBM) successfully predicted transmembrane
50 be smaller compared to those in the curated human proteome. proteins to have large hydrophobic patches, despite the lack of
transmembrane proteins in the training dataset.
51 In line with this, we see that the predicted THSA and LHP
distributions are strongly shifted towards the right-hand side
52
compared to the structure-based data, most likely due to the Cells avoid the over expression of proteins with a large
53 larger size of proteins in the human proteome. hydrophobic surface area
54 Moreover, the structure-based set (blue) does not show Since hydrophobic characteristics are associated with the
55 a peak of very large hydrophobic patches (LHP, ⇠ 6500 aggregation tendencies, we wanted to investigate whether
56 Å2 ) as observed for the human proteome dataset (red). proteins with large hydrophobic surface areas have di↵erent
57 Importantly, structure-based data analysed by MolPatch expression levels. We used the RNA consensus tissue gene
58
59 https://mc.manuscriptcentral.com/bioadv
60
Page 7 of 10 Manuscripts submitted to Bioinformatics Advances
How sticky are our proteins? 7
1
2
3
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4
5
6
7
8
9
10
11
12
13
14
15
16
17 Fig. 6. Relationship between normalised expression (NX) and THSA, RHSA and LHP values. For each gene the highest NX value was
18 selected across all tissues. The genes were grouped in ten bins based on their expression levels. Each violin represents one decile of genes grouped by
their normalised expression values (x-axis). Boxplots inside the violins show the median and quartiles for the surface hydrophobicity measures (y-axis).
19 As presented in the legend, dark red indicates the decile with the lowest NX values and black indicates the decile with the highest NX values. The
20 deciles with the lowest NX values show significantly higher THSA, RHSA and LHP values (indicated by red circles), compared to the deciles with the
21 highest NX values. Significance was calculated using Wilcoxon signed-rank test and shown between the lowest and the highest expression deciles. The
three asterisks indicate p-values<2.22e-16. Note that the intermediate groups in the case of RHSA and LHP are excluded from the plot, as they show
22 the similar trend to the THSA.
23
24
25 data from the Human Proteome Atlas to explore a link with of the kidney enriched proteome is annotated as transmembrane
26 expression levels. For this we relate normalised expression (NX) by UniProt (Consortium, 2019). Interestingly, liver tissue
27 data to measures for surface hydrophobicity. To obtain a single revealed no enrichment. The skin and blood tissue enriched
expression value for each gene we took the highest expression gene sets exhibited significant enrichment in the RHSA ranked
28
value among all the tissues each gene is expressed in, and list. Furthermore, both tissue groups were significantly depleted
29 subsequently divided the genes into deciles based on these in the THSA ranked list, indicating that they may contain the
30 values. Figure 6 shows the lowest and highest deciles for each smaller proteins in the human proteome.
31 of the hydrophobic measures. When comparing the lowest and
32 highest deciles, it is clear that only in the lowly expressed
Table 2. Pre-ranked GSEA enrichment statistics in di↵erent
33 proteins, a group of proteins with very hydrophobic surfaces
tissues. Various tissue-enriched gene sets were obtained from the
34 is present (red circles in Figure 6).
HPA (Pontén et al., 2008; Uhlén et al., 2015). THSA, RHSA
Since it is not trivial to define a measure of expression across
35 and LHP values were central-scaled prior to the GSEA analysis.
di↵erent tissue types, we also explored the a median NX value The enrichment score (ES) is the maximum deviation from zero
36 across all the tissues that a particular gene appears in. The showing the degree to which the gene set is over-represented at
37 resulting deciles show a similar trend in terms of hydrophobicity the top (positive ES score) or bottom (negative ES score) of the
38 of the surface (see Figure S6) compared to the data based on the entire ranked list of genes. The fraction of transmembrane (TM)
and multimeric proteins in the following gene sets is shown in
39 highest NX value. Interestingly, proteins that do not follow the
percentages. * P<0.05 ** P<0.001
40 general trend, i.e. those that are highly expressed while having
41 a large THSA, RHSA and LHP value, are typically protein sub- Gene set ES ES ES TM Multimeric
units assembling large multimeric complexes. In such complexes
42 (THSA) (RHSA) (LHP) (%) (%)
the proteins are likely to be stably bound, and are hence able
43 to shield the hydrophobic surfaces from the solvent. Brain (488) 0.33⇤⇤ 0.14 0.64⇤⇤ 47.0 47.0
44 Kidney (53) 0.62⇤⇤ 0.53⇤⇤ 0.78⇤⇤ 79.2 35.8
45 Skin (113) -0.46⇤ 0.30⇤⇤ 0.44 7.9 15.9
The brain- and kidney-specific proteomes are enriched with
46 Liver (242) -0.22 -0.16 0.45 26.0 59.9
hydrophobic proteins Blood (57) -0.41⇤ 0.40⇤⇤ 0.68⇤ 47.4 28.0
47 To investigate if genes that are enriched in specific tissues are
48 associated with the hydrophobic properties of the proteins,
49 we carried out Gene Set Enrichment Analysis (GSEA). We
50 downloaded 5 tissue enriched gene sets from the Human To investigate the overall tissue hydrophobicity, we
51 Protein Atlas (HPA) (Pontén et al., 2008; Uhlén et al., introduced TASH - tissue specific average surface hydrophobicity
2015). Table 2 shows that the brain tissue-enriched gene set for all proteins based on the expression levels in a specific
52
has a high enrichment in predicted THSA and LHP values. tissue (Eqn. 5 and Figure S7). The TASH-THSA value provides
53 Kidney-enriched genes show the highest enrichment in THSA, an indication of the total hydrophobic surface area present
54 RHSA and LHP of the ranked gene lists (p-values<0.001). A in a specific cell type. The tissues with the highest TASH-
55 possible explanation for this is the major role of kidney tissue THSA values occur in the brain, such as the cerebellum,
56 in maintaining homeostasis through various membrane-bound corpus callosum, thalamus, cerebral cortex, and basal ganglia
57 receptors and transporters (Lote and Lote, 1994). Indeed, 79% (Figure S7).
58
59 https://mc.manuscriptcentral.com/bioadv
60
Manuscripts submitted to Bioinformatics Advances Page 8 of 10
8 Author Name et al.
1
2 Increased relative hydrophobicity is associated with observed for proteins with a strong tendency to form amyloid
3 (aggregation) diseases fibrils (Tartaglia et al., 2009), suggesting an evolutionary
Downloaded from https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbac002/6515307 by Vrije Universiteit, Library user on 03 February 2022
4 To investigate the association of surface hydrophobicity with pressure to avoid proteins with high aggregation propensities
human diseases, a GSEA pre-ranked analysis of 375 various being present at high concentrations in the cell. Based on our
5
disease-associated gene sets was carried out, of which 44 data, if we assume that the high expression values correlate
6 gene sets show a significant (p-value<0.05) enrichment (<-0.2 with high protein abundance in the cell, it is conceivable that
7 (negative enrichment) and >0.2 (positive enrichment)) in at there is also an evolutionary pressure against proteins with a
8 least two hydrophobic measures (see Figure S8). Among the large hydrophobic surface area to be overly abundant in the
9 enriched gene sets we can observe several KEGG (Kanehisa cell.
10 and Goto, 2000) pathways that are associated with neurological Note that while the THSA and RHSA sequence-based
11 disorders. The RHSA showed a significant (p-value<0.05) predictions show a reasonable correlation with the structure-
enrichment in Parkinson’s (ES=0.43), Alzheimer’s (ES=0.24) based definitions, this does not necessarily mean that the
12
and Huntington’s disease (ES=0.23) gene sets. The analysis predicted amount of accessible hydrophobic surface area is
13 shows a significant (p-value<0.001) enrichment of sticky actually exposed to the cellular environment. For example, a
14 proteins (based on LHP) in the KEGG Parkinson’s disease map hydrophobic patch may be buried in a stable macro-molecular
15 (ES=0.66). In contrast to the GSEA analysis results on tissue- complex, or may be buried inside a membrane. Additionally,
16 specific proteome, the THSA shows a negative enrichment in a high hydrophobic surface area does not necessarily mean a
17 these sets, suggesting that the proteins involved in pathological protein will be insoluble; this will also be very much dependent
pathways have large hydrophobic surfaces and patches, but are on the amount of polar and charged residues that may surround
18
smaller in size (median length 171-180 residues). the hydrophobic residues or patches (Kramer et al., 2012), as
19 well as disordered regions (Abeln and Frenkel, 2008).
20 Despite the general tendency to avoid highly expressed
21 proteins with a large hydrophobic surface area, the brain
22 Discussion appears to be highly hydrophobic in its overall expression
23 In this work, we analysed the predictability of hydrophobic patterns (THSA in cerebellum, cerebral and cortex as shown
24 areas on protein surfaces, which until recently was a difficult in Figure S7) and in proteins enriched in the brain (THSA and
problem. We show that THSA and RHSA values can be LHP as shown in Table 2). This high expression of proteins
25
predicted with high accuracy (>75% within a 20% error with a large hydrophobic surface area may be rationalised
26 margin, Figure 4). The improved predictions of NetSurfP2.0, by functional requirements: genes enriched in brain tissue are
27 compared to the earlier secondary structure prediction methods involved in organising and maintaining synaptic signalling,
28 (Figure S2), make this possible by straightforward calculations requiring various cell adhesion proteins with large hydrophobic
29 of the THSA and RHSA using the predictions of the surface surface areas (Sytnyk et al., 2017); the cellular morphology of
30 accessibility per residue from NetSurfP2.0. Note that the neurons including the dendrite means that there is a relatively
31 problem of predicting the THSA, RHSA and LHP for a protein large transmembrane surface area per cell. Additionally,
sequence, means we are trying to predict a global feature of the structural integrity of neuronal axons is facilitated by
32
the protein, while NetSurfP2.0 (Klausen et al., 2019) makes myelin (Stadelmann et al., 2019), a fatty substance surrounding
33 residue-based predictions; residue-based models can be trained neurons, and by myelin-associated proteins, which are all very
34 on many more labels that are available from protein structures, hydrophobic.
35 and hence can reach a richer model representation. Furthermore, brain tissue has been associated with various
36 On the other hand, the LHP cannot be directly obtained aggregation diseases (Dobson, 2001; Koo et al., 1999; Ross
37 from NetSurfP2.0 (Klausen et al., 2019) and needs additional and Poirier, 2004; Chiti and Dobson, 2006). Based on our
38 model training. The major difficulty herein is that NetSurfP2.0 data, it may be hypothesised that the brain is specifically
cannot predict which hydrophobic residue form a continuous vulnerable to such diseases due to its high expression of proteins
39
patch in the protein 3D structure. Nevertheless, we believe that with a large hydrophobic surface. Hydrophobic patches play
40 recent advances in deep neural nets, contact map prediction a role in the folding and/or misfolding of proteins (Dobson,
41 and structure prediction (Zheng et al., 2019; Li et al., 2019; 2004; Ross and Poirier, 2004), and can possibly provide
42 Xu et al., 2020; Senior et al., 2020) should make it possible to nucleation sites for the formation of oligomers and amyloid
43 make these predictions more accurate in the near future, for fibrils. This hypothesis would be supported by the relatively
44 example by using structure or contact predictions to predict high hydrophobic surface area in molecular pathways associated
45 the hydrophobic patches, or by training a purpose specific deep with Parkinson’s, Huntington’s and Alzheimer’s disease (as
neural net. Recently, language models, especially bidirectional observed for the RHSA and LHP, see Figure S8).
46
encoder representations from transformers (BERT) have been
47 shown to be a promising model achieving novel state-of-the-art
48 performance (Shah and Ou, 2021). BERT adopted the concept
49 of contextualised word embedding to capture the semantics
Conclusion
50 and sequence information, opening a new avenue in biological In conclusion, in this work we defined three measures for
51 modelling. Such models trained on LHP data per residue might hydrophobicity: the THSA, RHSA and LHP. To determine
also lead to a better performance in predicting such a complex the LHP from structure, we developed a novel method named
52
characteristic. MolPatch. The THSA and RHSA can be accurately predicted
53 When investigating the link between tissue-based expression from sequence by adapting the output from NetSurfP2.0, while
54 levels and the measures for surface hydrophobicity, we clearly predicting the LHP from sequence remains challenging. We
55 observe that highly expressed proteins typically do not have have investigated the potential impact of the three measures by
56 a large hydrophobic surface area (THSA, RHSA and LHP investigating the relation between these measures and protein
57 as seen in Figure 6). A similar trend has previously been expression in the human proteome. Cells tend to avoid high
58
59 https://mc.manuscriptcentral.com/bioadv
60