[go: up one dir, main page]

0% found this document useful (0 votes)
28 views13 pages

HMST-Seq Analyzer

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 13

Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

journal homepage: www.elsevier.com/locate/csbj

HMST-Seq-Analyzer: A new python tool for differential methylation and


hydroxymethylation analysis in various DNA methylation sequencing
data
Amna Farooq a, Sindre Grønmyr a,b, Omer Ali a, Torbjørn Rognes b,d, Katja Scheffler f,g, Magnar Bjørås c,d,e,
Junbai Wang a,⁎
a
Department of Pathology, Oslo University Hospital – Norwegian Radium Hospital, Oslo, Norway
b
Department of Informatics, University of Oslo, Oslo, Norway
c
Institute for Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
d
Department of Microbiology, Oslo University Hospital, Oslo, Norway
e
Department of Microbiology, University of Oslo, Oslo, Norway
f
Department of Neuromedicine and Movement Science and Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
g
Department of Neurology and Department of Laboratory Medicine, St. Olavs Hospital, Trondheim, Norway

a r t i c l e i n f o a b s t r a c t

Article history: DNA methylation (5mC) and hydroxymethylation (5hmC) are chemical modifications of cytosine bases
Received 16 July 2020 which play a crucial role in epigenetic gene regulation. However, cost, data complexity and unavailability
Received in revised form 26 September of comprehensive analytical tools is one of the major challenges in exploring these epigenetic marks.
2020
Hydroxymethylation-and Methylation-Sensitive Tag sequencing (HMST-seq) is one of the most cost-
Accepted 27 September 2020
Available online 10 October 2020
effective techniques that enables simultaneous detection of 5mC and 5hmC at single base pair resolution.
We present HMST-Seq-Analyzer as a comprehensive and robust method for performing simultaneous dif-
ferential methylation analysis on 5mC and 5hmC data sets. HMST-Seq-Analyzer can detect Differentially
Keywords:
Methylation analysis
Methylated Regions (DMRs), annotate them, give a visual overview of methylation status and also per-
Hydroxy methylation form preliminary quality check on the data. In addition to HMST-Seq, our tool can be used on whole-
Differential methylation genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data sets
Hydroxymethylation-and methylation- as well. The tool is written in Python with capacity to process data in parallel and is available at
sensitive tag sequencing (https://hmst-seq.github.io/hmst/).
Whole genome bisulfite sequencing Ó 2020 The Author(s). Published by Elsevier B.V. on behalf of Research Network of Computational and
Structural Biotechnology. This is an open access article under the CC BY license (http://creativecommons.
org/licenses/by/4.0/).

1. Introduction have been observed in many human diseases and can be used in
clinical outcome predictions [4]. Hence, correct profiling of DNA
Epigenetic DNA methylation provides an additional layer for methylation in a genome is key to understand the contribution
controlling cellular processes. It is the most stable epigenetic mark of epigenetics in gene regulation.
that plays a significant role in gene regulation with impact on Nowadays, bisulfite treatment is considered as the most effec-
health and disease [1]. Nevertheless, DNA methylation varies in tive way of targeting DNA methylation [5]. Whole-genome bisul-
response to cell differentiation, disease, and environmental factors. fite sequencing (WGBS) is the most comprehensive protocol for
In vertebrates, 5-methylcytosine (5mC) is the most abundant measuring genome-wide single-base-pair resolution methylation,
epigenetic mark. It is usually found in CpG context, where hence making it the golden standard technique [6] for studying
5mC can be iteratively oxidized by TET proteins to genomic DNA methylation. Some studies in vertebrate genomes
5-hydroxymethylcytocine (5hmC), the second most abundant epi- (e.g., birds and fishes) recommend 15X of sequencing depth for
genetic mark in vertebrates [2]. The aforementioned two methyla- WGBS experiments [7,8]. Others such as the NIH Roadmap Epige-
tion marks have essential roles in the development and regulation nomics Projects advise coverage of 30X (combined coverage of 2
of cellular processes [3]. Especially, abnormal methylation patterns replicates). Similarly, the ENCODE project and the International
Human Epigenome Consortium (IHEC) recommend users to submit
⁎ Corresponding author.
experimental data with sequencing depth of 30X for WGBS [9,10].
E-mail address: junbai.wang@rr-research.no (J. Wang). To achieve 30X sequencing depth of human sized genome, ~180 GB

https://doi.org/10.1016/j.csbj.2020.09.038
2001-0370/Ó 2020 The Author(s). Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

of sequencing data is produced in WGBS experiments [11]. The lation analysis pipeline called HMST-Seq-Analyzer, which is a user-
data size will be reduced in case of organisms with a smaller gen- friendly command line Python package. Though the differential
ome size [8]. Alternatively, an enhanced protocol (HiSeq X Ten) methylation analysis is similar between 5mC and 5hmC at the
offers improved cost effectiveness and coverage [12] for WGBS genome-wide scale, the pre-processing of HMST-Seq data for
experiments. However, to achieve the claimed low cost, all the 5mC and 5hmC is quite different. For example, the methylation
HiSeq X Ten machines are required to be run on full capacity. In and hydroxymethylation levels are calculated by taking ratio of
addition to this fact, the total price of the system makes it suitable tag counts from three libraries. Such difference in the low-level
for larger institutes only [13] Nevertheless, WGBS remains an methylation analysis is taken care by HMST-Seq-Analyzer auto-
expensive experiment with huge data handling and processing matically. Especially, the methylation analysis of both 5mC and
requirements for downstream analysis. This highlights the need 5hmC is done in a single run in the package, which simplifies the
of enrichment-based techniques, which offer a fair trade between illustration and interpretation of results. The package is optimized
coverage and cost. to process huge data sets from either HMST-seq or WGBS by using
Reduced representation bisulfite sequencing (RRBS) is a cost- parallel computation, as well as data from RRBS. HMST-Seq-
effective enrichment-based sequencing method to target DNA Analyzer implements a methylated region (MR) search method
methylation. RRBS reduces sequencing requirements by targeting similar to one that has been previously published in [11] to detect
CpG rich genomic regions only [14]. RRBS produces much less data differentially MRs (DMRs) from HMST-seq data. Two slightly mod-
and reduces the experimental cost significantly. Unfortunately, ified search methods to define MRs are available to suite the nature
both WGBS and RRBS are unable to distinguish between 5mC of different data sets. For instance, for larger data sets coming from
and 5hmC on DNA. Therefore, hydroxymethylation and WGBS experiments, pipeline allows tiling window analysis to effi-
methylation-sensitive tag sequencing (HMST-seq) was proposed ciently deal will large number of methylated sites. Multiple statis-
to detect both 5mC and 5hmC on DNA sequences simultaneously tics test are also available to search DMRs to accommodate
[11]. HMST-seq takes advantage of sequence specific DNA restric- differences in nature of data generated by different methylation
tion endonucleases: for example, HpaII can cleave unmodified detection platforms. All detected DMRs are automatically anno-
cytosine only while MspI cleaves at both 5mC and 5hmC. More- tated to the reference genome based on the refFlat file from UCSC
over, b-glucosyltransferase (b-GT) can transfer glucose to 5hmC Genome Browser [25]. Moreover, the package provides a simple
which will block MspI digestion. By combining these enzymatic statistical summary of the distribution of methylation in various
reactions, HMST-seq generates three tag libraries, which can be genomic regions (e.g., transcription start site (TSS), transcription
used to determine methylation and hydroxymethylation abun- end site (TES), gene body, intergenic, 50 distance region, and other
dance in a sample. The first library contains information of unmod- regions like enhancers). Finally, it also provides lists of hyper- and
ified, methylated and hydroxymethylated cytosines, generated by hypo-DMRs annotated to different genomic regions.
MspI digestion only. The second library refers to unmodified and
methylated cytosines generated by glucosylation of 5hmC and sub- 2. Material and methods
sequent MspI digestion. The third library contains only unmodified
cytosines generated by HpaII digestion. HMST-seq not only targets HMST-Seq-Analyzer is a new Python package, that employs
both 5mC and 5hmC in Msp1 sites (50 -CCGG-30 ) at single base res- robust statistical methods for differential methylation analysis in
olution but also generates only ~5 GB data to achieve a 30X whole-genome DNA sequencing data such as HMST-Seq, WGBS,
sequencing depth. Since HMST-seq relies on specific restriction or RRBS. Although the pipeline is optimized for HMST-seq data, it
enzymes, it is limited to regions with CCGG sites, thus covering can be easily applied on either WGBS or RRBS data by simply
approximately 4–7% CpG dinucleotides distributed throughout adjusting the default MR search parameters and the DMR search
the vertebrate genome. However, it has recently been demon- method. HMST-Seq-Analyzer conducts differential methylation
strated that locations of CCGGs largely reflects those of all CpGs analysis using three steps. Generally, differential methylation
in the genome [15] and that epigenetic profiling by using detection can be done at multiple genomic resolutions like non-
methylation- and hydroxymethylation-sensitive restriction CpG (CHG/CHH), CpG, genome wide tiles or at annotated regions.
enzymes can successfully address fundamental biological ques- However, in case of single base pair resolution data coming from
tions [16–19]. In contrast to WGBS sequencing, HMST-seq is distin- techniques like HMST, WGBS and RRBS, differential methylation
guishing between 5mC and 5hmC providing additional information detection at base pair level becomes a computationally exhaustive
about the dynamics of oxidative demethylation. The cost, coverage task. For two reasons, the best approach is to perform annotated
and detection of both 5mC and 5hmC on DNA sequences make genome analysis; first, use of external genome annotation data
HMST-seq an attractive tool for biologists to perform whole- focuses the analysis on those methylated regions which can possi-
genome methylation studies with large sample sizes. bly act as epigenetic regulatory switches and second, it reduces the
With the advancement of high throughput sequencing tech- search space, which is otherwise extremely dense. Hence, HMST-
nologies, data generation rate has outpaced the Moore’s law [20] Seq-Analyzer tries to minimize the computational burden, by first
while data analysis still remains a challenge. There are, however, extracting methylation regions (MR) from predefined genomic
several tools available for differential methylation analysis (e.g., areas (e.g., TSS, TES, enhancer, gene body, 50 distance region and
MethylKit, MethylSig, and BSmooth [21–23]). While the majority intergenic regions) that may act as epigenetic regulatory switches.
of tools are tailored towards 5mC only a few of them focus on In the second step, it filters MRs in parallel, which further narrows
5hmC. Especially, none of the publicly available tools for differen- down the total search space and computational cost. Finally, differ-
tial methylation analysis focuses on both 5mC and 5hmC that are entially methylated regions (DMRs) in predefined genomic areas
generated by the HMST-seq. Though the methylation data analysis are predicted by robust statistical tests (e.g., Wilcoxon rank-sum
pipeline MINT integrates several tools to process both methylation test) between the control and the reference samples. Fig. 1 summa-
and hydroxymethylation data, it cannot be applied on HMST-seq rizes the workflow of HMST-Seq-Analyzer. It is divided into eight
library tag data [24]. As mentioned earlier, HMST-seq outputs tag discrete modules, where the default settings for the parameters
counts for three libraries separately and requires processing to are aimed for HMST-seq data but parameters are flexible to accom-
methylation and hydroxymethylation levels after wards, this facil- modate RRBS and WGBS data in the package. A more detailed
ity is not available in previous tools like MethylKit, BSmooth or description of HMST-Seq-Analyzer can be found in the following
MINT pipleine. Thus, we have developed a new differential methy- subsections.
2878
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Fig. 1. Detailed scheme of flow of HMST-Seq-Analyzer pipeline. HMST-Seq-Analyzer detects differential methylation in three major phases: 1) preprocessing of data where it
performs annotation and extracts methylated regions from predefined regions (e.g., TSS, TES, enhancer, gene body, 50 distance region and intergenic regions); 2) searches for
MRs with same locus in the case and control samples, finds DMRs (e.g. using Wilcoxon Ranksum, Kolmogorov–Smirnov test or T-test), and categorizes them into Hyper/Hypo-
methylated DMRs; 3) visualizes graphical summary of analysis results, exports the results in form of list of Hyper/Hypo DMRs and gene annotation.

2879
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

2.1. Quality control makes sense for HMST-seq data. Moreover, epigenetic profiling
by using methylation- and hydroxymethylation-sensitive restric-
Before running the pipeline, users can check input data quality tion enzymes successfully addresses fundamental biological ques-
by calculating the average of read count distribution among the tions [17–19].
samples. It is preferable to have majority of the reads distribution In HMST-Seq-Analyzer, an MR is defined as a cluster of 5mC or
above 15X coverage in a sample to ensure reliable prediction of 5hmC sites at a locus. The search for MRs is only carried out in cer-
DMR [7] as recommended by the ENCODE project [9]. This facility tain defined genomic regions (e.g., gene body, TSS, or enhancer
is provided as an additional script in the toolbox in the demo data etc). There are two major stringency parameters in the MR search
set. Apart from the average read count, the toolbox also offers pre- method: the number of consecutive sites and the distance of adja-
liminary cleaning and processing of input files. cent methylated sites. For HMST-Seq data, this is similar to the pre-
vious publication [11], where any clusters of methylated sites
2.2. Pre-processing would be classified as MR: (1) if there are at least N (the default
is 5 for gene body/intergenic/50 Distance and 3 for TSS/TES)
By default setting HMST-Seq-Analyzer extracts five types of 5mC/5hmC sites in the region, and (2) the distance between two
regions from a given reference genome refFlat file: TSS, gene body, consecutive methylated sites is not greater than a bp (the default
TES, intergenic, and 50 distance region of each gene. A detailed is 2000 bp). CpGs in an MR can have increasing or decreasing trend
illustration of these five regions is shown in SFig. 1, where Gene of methylation and this trend can be similar or mixed in the case
body is defined between TSS and TES. TSS and TES are ± 1 kb (de- and control samples. HMST-Seq-Analyzer allows restricting the
fault) from the original TSS and TES, respectively. The 50 distance is DMR search to same methylation trend by using an additional
from 10 kb to 100 kb upstream (default) of the TSS. The intergenic parameter in the DMR search function (isST = 1 or 0 will force
regions consider the whole genome except for TSS, TES and gene all CpG sites in the same MR shall have the same or mixed chang-
body. In this way, the pipeline includes whole genome in differen- ing trend, respectively) [11]. These parameters can be easily chan-
tial methylation analysis. The lengths of TSS, TES, 50 distance and ged by command line options. For WGBS and RRBS data, slightly
intergenic regions can be adjusted by the user in the pipeline. An modified parameters can be used to search MRs (e.g., N = 3 and
additional enhancer option can also be used by any other regions a = 2000 bp for WGBS; N = 2 and a = 200 bp for RRBS). Alterna-
of choice, provided that chromosomal coordinates are given in sim- tively, the package also provides an equal sized sliding window
ple bed format. The gene annotation task is required only once if bin approach which can be applied on the data to search for
the same genomic regions will be used across multiple experi- MRs. This option is suggested for large and dense data set such
ments. Predicting DMRs from aforementioned five types of geno- as WGBS.
mic regions may help us to identify putative functional DMRs in
downstream analysis. For HMST-Seq data, if input tag counts are 2.3.2. Identification of differentially methylated regions
not normalized across sample, then a quantile normalization The current package performs pair-wise or multiple samples
method can be applied. Subsequently, an abundance of 5mC/5hmC comparison for differential methylation analysis (e.g., one control
level from HMST-seq can be determined by the ratio between the versus one case, or one/multiple controls versus one/multiple
normalized tag counts of two libraries (e.g, between BGT (b–gluco cases). Therefore, MRs or methylation sites need to be available
syltransferase – to detect 5hmC) and HpaII for 5mC; between MspI in both conditions before performing a significant differential
and BGT for 5hmC). For RRBS and WGBS, a direct input of methy- methylation analysis. Unfortunately, due to DNA sequencing vari-
lation levels and the corresponding chromosome positions from a ation and many other uncontrollable reasons in methylation
bed formatted file is requested. The pipeline offers to switch experiments, some methylation signals may be missing in one of
between the two major types of data sets: methylation level based, the conditions, even if from technical replicates. This is a missing
i.e. WGBS/RRBS, and library tag count based, i.e. HMST-Seq. value problem in sequencing data analysis, which is the first chal-
lenge in differential methylation analysis. A missing value indi-
2.3. Differential methylation analysis cates that methylation level at a position is not recorded as an
experimental error or it was an unmethylated state and a methyla-
In HMST-Seq-Analyzer, there are two major steps to identify tion level was recorded as zero. There are three possible ways to
DMRs from genome-wide DNA methylation sequencing data: 1) deal with the missing value: ignoring, deletion, or imputation.
Searching for candidate methylation regions (MR); 2) Identifying The first two options cannot be good options because they can
significantly differentially methylated regions (DMRs) with robust result in loss of an important DMR. Therefore, all missing values
statistical tests. are imputed in the current pipeline before identifying DMRs. For
less computational cost, the missing values can simply be replaced
2.3.1. Searching for methylated regions by the median of methylations in an MR or by zeros (default,
Methylated and hydroxymethylated regions can vary in size because it resembles a possible real unmethylated state). Alterna-
from a single methylated base pair to an entire gene locus, depend- tively, a nearest neighbour data imputation method can be applied
ing on the biological question of interest. Although single methy- on an MR with missing values, but this method will have heavy
lated CpG is reported to affect gene expression regulation [26] computational burden. After finding MRs sharing the same locus
and disease risk [27], the majority of DMRs are reported to range in both conditions and imputing the missing values, the pipeline
from a few hundred bases to a few kilobases which may be biolog- applies the Wilcoxon Rank-sum test to evaluate the significance
ically more interesting [28]. Especially, DMRs are known to regu- of differentially methylated MRs as used by many similar studies
late cell-type specific transcriptional repression of genes [29], [11,18,30]. There are three versions of the Wilcoxon Rank-sum test
and this size range complements with the size of other gene regu- available in the pipeline: Pranksum, Mranksum, and Rranksum.
latory regions such as enhancers. Though HMST-seq data are lim- Pranksum is a Python implementation of the P-value computation
ited to HpaII restriction sites, it has recently been shown that for the Wilcoxon rank-sum test:
locations of CCGGs are evenly distributed comparable to those of
all CpGs in the genome, and methylation level determined in speci- 
exact if nx < 10
fic genomic regions (e.g. promoter) highly correlate with gene p¼
mannwhitneyu otherwise
expression [15]. Thus, searching for methylated regions (MR)
2880
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

where nx is the number of 5mC/5hmC sites in an MR, exact is a 2.5. Visualization and statistical summary of results
Python implementation of the exact enumeration of P-value [31]
when the sample size is small (this is a contribution of this research Three main figures in the results provide a statistical summary
work in the pipeline), and mannwhitneyu is a function from Python of the genome-wide methylation status: 1) relative density of
library scipy.stats that performs a two-sided Mann-Whitney U test 5mC/5hmC (e.g., methylation levels >1 and >0.5 for HMST-Seq
or Wilcoxon rank-sum test. The exact calculation of P-value is rec- [11] and WGBS/RRBS data, respectively) in defined genomic
ommended when there are less than 10 methylation sites in a MR. regions (e.g, TSS, TES, gene body, enhancer, 50 distance, and inter-
For a large sample size, an approximation of P-value can be genic regions) or genome-wide; 2) percentage of hyper-/
obtained from Python’s Mann-Whitney U test function. Mranksum hypomethylated DMRs in TSS, TES, gene body, 50 distance, and
is a Matlab version of rank-sum test, which is computed as intergenic regions; 3) a genomic average of 5mC/5hmC levels in
following: TSS-Gene-TES regions or enhancer regions. To plot the genomic
(   average of 5mC and 5hmC levels in TSS, TES, gene body or enhancer
exact if min nx ; ny < 10 and nx þ ny < 20 regions, the package maps MRs of these regions to a new uniform

approximate otherwise range that equalizes all genes’ length [31]. Here, one-dimensional
nearest-neighbour algorithm was used to interpolate methylation
where exact is the exact computation of the P-value when sample data when mapping the original data to a new range. Subsequently,
size is small, and approximate is the Matlab function of the two- the mean of methylation levels of all genes was calculated in the
sided Wilcoxon rank-sum test. Hence, the pipeline also offers new equalized range, where a centred moving average method is
Rranksum, which is the wilcox.test function of R. It computes P- used to smooth the data as follows:
values as follows:
 1 Xn=2
exact if nx < 50 and ny < 20 and no ties MAt ¼ A
p¼ n i¼n=2 i
approximate otherwise
Here, n is the window size, and Ai is the data point at the ith
However, the R version of the rank-sum test does not have tie position. The window size is the number of observations used for
correction and is much slower than both Matlab and Python imple- calculating the statistic. In order to reduce noise when plotting
mentation. MATLAB is a commercial license software that users the genome-wide average of methylation profiles, a one-
might not have access to. A performance comparison of the three dimensional Gaussian filter [35] is applied to further smooth the
methods for DMR detection is provided in SFigs. 2 and 3. One of mean data before plotting them in TSS, gene body and TES regions
these three methods should be chosen based on the resources in a single plot. The effect of the data smoothing is illustrated in
available. Alternative statistic test methods are also implemented SFig. 4. The genome-wide average methylation level of enhancers
in the HMST-Seq-Analyzer for identifying DMRs such as two sam- can be plotted in a similar way. The pipeline also exports the plot
pled T-test [22] and Kolmogorov-Smirnov test [32]. In the future, data for external visualization and analysis by users. For more
more advanced methods (e.g. the Beta-binomial method [23]) will information of implementation, package comparison and perfor-
be considered in the pipeline. Pranksum is the default setting in mance (e.g., CPU hours and memory consumption) please refer to
HMST-Seq-Analyzer. (http://urn.nb.no/URN:NBN:no-76419).
For each MR, the significance of differential methylation
between the two conditions is evaluated by a P-value. Correction
of the P-value by either without (default) or with Benjamini- 2.6. Methylation data
Hochberg false discovery rate [33] is implemented in the package.
MRs with P-values crossing the predefined threshold (e.g., p < 0.05 HMST-Seq-Analyzer can handle two types of data: methylation
by default) are considered as DMRs. Here, an identified DMR can and hydroxymethylation, from three different DNA sequencing
either exhibit an increase or a decrease in methylation levels approaches (e.g., HMST-Seq, WGBS, and RRBS). There are two types
between the two conditions, termed as hyper or hypomethylation, of methylation input data: 1) a bed formatted file containing
respectively. For that reason, a relative ratio (rratio) approach [34] methylation percentage per base, used for WGBS or RRBS data. 2)
is used to distinguish between the Hyper-DMRs and the Hypo- a TSV file containing normalized tag count per base from 3 libraries
DMRs. It is given by the following formula: of HMST-seq data. The pipeline has been tested successfully in
CpG, CHG, and CHH methylation from WGBS, RRBS, and HMST-
lKO  lWT
rratio ¼ seq. For WGBS, Human lymphoblastoid cell line (GM12878) and
ðlKO þ2lWT Þ human embryonic stem cell line (H1) from ENCODE data sets are
where mKO is the median of the original methylated levels in the used as test and control set, respectively [9]. For RRBS, TET knock-
case (KO) data, and mWT is the median of the original methylated out mice data was used [36]. HMST-seq data was acquired from a
levels in the control (WT) data. A DMR is considered as hyperme- study conducted on two hepato cellular carcinoma (HCC) cell lines
thylated or hypomethylated when the rratio is greater than or smal- (97L and LM6 cells), and a non-HCC sample [18].
ler than zero, respectively. As an option, users can also plot all DMRs
to investigate their quality or may run the DMR search function 2.7. Gene annotation
again by using new parameters (e.g., a new P-value threshold or a
new P-value correction method). As DMR detection will be spanned around annotated genomic
regions, gene annotation information is needed. The reference gen-
2.4. Annotation of DMRs ome file is a simple tab separated text file containing the chromo-
some number and the size of respective chromosome, which were
The pipeline also annotates all the hypo/hyper-DMRs to the prepared according to assembly and species of the input samples
genomic regions (TSS, TES, gene body, intergenic, 50 distance and (e.g., hg19 and mm10 chromosome size information [25]). If a
any given region like enhancers). As a result, lists of DMRs are cat- bed formatted enhancer position file is available, then the pipeline
egorically exported for each region with a corresponding P-value can map DMRs to enhancer regions. This enhancer option can be
for each DMR. Moreover, a list of gene names having DMRs associ- used to map DMRs across any other regions as well e.g. CpG
ated to their TSS, TES or gene body is also exported. islands.
2881
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Fig. 2. Distribution of differentially methylated (or hydromethylated) regions (DMRs/DhMRs) between the two HCC cell lines (97L and LM6) and a non-HCC sample
(NO45268) in five types of genomic regions. HMST-Seq-Analyzer was applied on publicly available HMST-Seq data of three samples (two liver cancer cell lines and one
normal liver sample) simultaneously. Hyper/Hypo DMRs/DhMRs between the two liver cancer samples and one normal sample were identified by the pipeline automatically.
The distribution of these DMRs/DhMRs were mapped to five types of genomic regions (gene body, 50 distance region, intergenic region, TSS, and TES) by the HMST-Seq-
Analyzer. Here, the 50 distance region is defined as the upstream of TSS from 10 Kb to 100 Kb, and the intergenic regions are genomic regions excluding gene body, TSS and TES
which with the minimum and maximum length 2 Kb and 100 Kb, respectively. There is the same changing trend in each MR before using Kolmogorov-Smirnov test to predict
the DMRs/DhMRs.

2.8. Computational efficiency of HMST-Seq-Analyzer cellular carcinoma (HCC) [37]. In addition, 5-hmC levels are
reported to be lower in HCC tissues in comparison to non-tumor
In order to make the pipeline suitable for large data sets, HMST- tissues [38]. Simultaneous inspection of 5mC and 5hmC levels in
Seq-Analyzer modules Find MRs, Preparation for DMR Search, and HCC can reveal important characteristics of epigenetic alterations
DMR Search are optimized for parallel computation specifically. in HCC. In a study by F. Gao et al., HMST-seq was performed on
The input argument -p is used to define the number of CPUs a user two HCC cell lines (97L and LM6 cells), and a non-HCC sample
wants to use. As a result, the corresponding task is automatically [18]. In this work, HMST-Seq-Analyzer with default parameters
parallelized either on a single multi-core machine, or on a high- on chromosome 1 of this dataset was run and tag counts were
performance computer. HMST-Seq-Analyzer was tested on both a aligned to the same human reference genome (hg19). All results
small and a large dataset to evaluate the efficiency. For small data and figures were generated by a single run of HMST-Seq-
set (~4.5 MB), chromosome 1 from a HMST-Seq mouse data was Analyzer on both 5mC and 5hmC data simultaneously. Fig. 2 shows
used with 70,201 and 71,646 tag counts for the KO and WT sam- that there is not a strong difference between hyper-DMRs and
ples, respectively. For a large data set (~66 MB), the whole hypo-DMRs distributions in different genomic regions. However,
HMST-Seq mouse (20 chromosome) data set was used with KO a lower number of hyper-DhMRs but higher number of hypo-
(1037346 tag counts) and WT (1054613 tag counts) condition sam- DhMRs in HCC cell lines as compared to the non-HCC cell line
ple, respectively. The test was run on 5 CPUs with 4G memory on (lower panel of Fig. 2), especially in TSS regions are clearly seen.
SAGA computer cluster at Norwegian University of Science and Fig. 3 suggests that the number of significantly modified 5mC
Technology. and 5hmC sites are lower and higher for the non-HCC sample in
all genomic regions (especially in TSS) than that of HCC samples,
3. Results and discussion respectively. In Fig. 4, the 5mC levels around TSS seem to be
slightly higher in the HCC samples than in the non-HCC sample.
3.1. Differential methylation analysis of HMST-Seq data In contrast, 5hmC levels are increased more than twofold around
TSS in non-HCC compared to HCC samples. Another study also
Numerous studies have reported a role of altered methylation found uneven distribution of 5mC and 5hmC in TSS regions of
of tumor suppressor genes in the pathogenesis of human hepato- tumor samples as compared to non-tumor samples [39]. They

2882
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Fig. 3. The relative density of significantly modified 5mC (or 5hmC) sites in the two HCC cell lines (97L and LM6) and one non-HCC sample, within the five types of genomic
regions. HMST-Seq-Analyzer was applied on public available HMST-Seq data of three samples (two liver cancer cell lines and one normal liver sample) simultaneously.
Significantly modified (methylation/hydromethylation levels > 1) 5mC/5hmC sites in the three samples were identified and their relative density in five types of genomic
regions (gene body, 50 distance region, intergenic region, TSS, and TES) was calculated by the pipeline automatically. The 50 distance region is defined as the upstream of TSS
from 10 Kb to 100 Kb. The intergenic regions are regions excluding gene body, TSS and TES, which with the minimum and maximum length 2 Kb and 100 Kb, respectively.

observed overall high 5mC levels and lower 5hmC levels in tumor mation along recovering almost all the DMCs reported by
tissues. These results are also in agreement with the results pre- MethylKit.
dicted by HMST-Seq-Analyzer as shown in Fig. 2, where ~70% of
MRs in the TSS region are DhMRs while only ~10% are DMRs. Thus,
3.2.2. A comparison of HMST-Seq-Analyzer and MethylKit in RRBS
our new pipeline reproduces the results of the original publication
In this comparison, RRBS data was obtained from experiments
in a single run, which simplifies the data interpretation. More
with TET knockout mice [36]. Both HMST-Seq-Analyzer and
interesting results can be found if the pipeline is run on the com-
MethylKit were applied on the same data, to detect differential
plete data set. For example, significantly differentially methylated
methylation events between TET1 and TET2 double knockout and
genes for hepatocellular carcinoma can be drawn out from the
wild type mice. In terms of the number of CpGs per kb, RRBS has
gene list exported by the proposed tool.
much sparse methylation sites than that in WGBS, hence we per-
formed whole-genome differential methylation detection in a sin-
3.2. A comparison of HMST-Seq-Analyzer and MethylKit in analysing gle batch. For MethylKit, the default parameters were used for
differentially methylated CpG sites detecting genome-wide DMCs. For example, a q-value < 0.05 with
minimum percent methylation difference cut-off of 25% was used
Currently, there is lack of publicly available tools for analysing to identify DMCs as used in the source publication [36]. For
HMST-Seq data. To evaluate the robustness of differential methyla- HMST-Seq-Analyzer, genome-wide DMCs were extracted from
tion analysis in HMST-Seq-Analyzer, results are compared to a DMRs reported by the package before comparing with the
popular tool - MethylKit, which uses Fisher’s Exact Test (FET) to MethylKit results, where the number of minimum consecutive
identify differentially methylated CpG sites (DMCs) [21] in WGBS sites is 2, adjacency is 200 bp, the same methylation changing
or RRBS data. trend, and T-test were used in DMR analysis. HMST-Seq-Analyzer
reported 4242 DMCs in its DMRs, while MethylKit reported
15,756 DMCs in total, in which ~48% of the DMCs obtained from
3.2.1. HMST-Seq Analyzer recovers MethylKit DMCs in WGBS HMST-Seq-Analyzer overlap with MethylKit DMCs (SFig. 5). If the
Here, a WGBS data set for human genome was gathered from same methylation changing trend is disabled in HMST-Seq-
the ENCODE project. Human lymphoblastoid cell line (GM12878) Analyzer, then the percentage of overlapping DMCs drops slightly
and human embryonic stem cell line (H1) are used as test and con- to 46% (SFig. 6). Though half of the predicted DMCs by HMST-
trol set, respectively [9]. Raw sequencing data set was mapped to Seq-Analyzer are recovered by MethylKit, HMST-Seq-Analyzer
hg38. Due to enormity of WGBS data, it was split chromosome gives much fewer DMCs than MethylKit. This is mainly caused by
wise for analysis on both tools. Results were later combined for the two packages using entirely different methods to predict
whole-genome level analysis. The default parameters of HMST- DMCs. HMST-Seq-Analyzer has a restriction of adjacency in nearby
Seq-Analyzer were used, except for shortening the 50 -Distance CpGs and of minimum number of CpGs in MR. Moreover, DMCs are
regions between 10 kbp to 50 kbp from TSS, in order to reduce extracted from the predicted DMRs by using T-test. On the con-
the computation time. For MethylKit, a q-value < 0.05 with mini- trary, MethylKit is designed to report DMCs only but does not con-
mum percent methylation difference cut-off of 25% are used to sider the spatial distribution of CpGs in a MR, and the significance
identify differentially methylated cytosines (DMCs). For HMST- of DMCs is tested on individual CpG directly based on Fishers’ exact
Seq-Analyzer, we extracted DMCs from our predicted DMRs before test.
comparing them to those provided by MethylKit. In total, HMST- In further analysis, a correlation between the percentage of
Seq-Analyzer and MethylKit reported 47 million and 11 million overlapping DMCs and the change of adjacency in HMST-Seq-
DMCs sites, respectively. HMST-Seq-Analyzer recovered 97.6% of Analyzer is studied. DMCs predicted by HMST-Seq-Analyzer with
the mCs reported by MethylKit (Fig. 5). While MethylKit reports different adjacencies are compared against to the prediction of
only individual DMCs, HMST-Seq-Analyzer gives much more infor- MethylKit with default parameters. The result in SFig. 7 suggests
2883
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Fig. 4. The distribution of genome-wide average of 5mC (or 5hmC) levels in TSS-Gene-TES regions for both HCC cell lines (97L and LM6) and non-HCC sample. HMST-Seq-
Analyzer was applied on publicly available HMST-Seq data of three samples (two liver cancer cell lines and one normal liver sample) simultaneously. Methylated regions
(MRs) of 5mC and 5hmC in three different samples are identified by the pipeline in TSS, TES, and gene body regions, respectively. A genomic average of 5mC/5hmC levels in all
TSS-Gene-TES regions are calculated, where a centred moving average method is applied to smooth the data. Subsequently, one-dimensional Gaussian filter is used to reduce
noise in smoothed mean data before plotting the genome-wide average of methylation levels in TSS-Gene-TES region.

that a smaller adjacency in HMST-Seq-Analyzer gives a higher per- and T-test for detecting DMRs. In case of HMST-seq, the default
centage of overlap between the two methods. For example, the parameters are adjacency 2 kb, a minimum of 3 consecutive sites
percentage of overlapping DMCs is increased from about 41% to (mc1 = 3, mc3 = 3) and 5 (mc2 = 5) for TSS/TES/enhancer and
51%, when the CpG adjacency is reduced from 3000 bp to gene body/50 distance/intergenic regions, respectively. More infor-
100 bp. Therefore, the discrepancy of predictions between the mation of optimal parameter selection for HMST-Seq-Analyzer is
two packages is expected, which is mainly due to the difference included in the package.
in prediction method such as the restriction of spatial distribution
for nearby CpGs in HMST-Seq-Analyzer. A list of optimal parame- 3.2.3. HMST-Seq Analyzer captures important information that is
ters for HMST-Seq-Analyzer is provided in the package. For RRBS missed by MethylKit
data analysis, it is recommended to use adjacency of 200 bp To further evaluate the results between the two tools, we focus
(a = 200), the same methylation changing trend (isST = 1), a on the promoter regions (±1 Kb to TSS) from the first round of RRBS
minimum number of 2 consecutive sites (mc1 = 2, mc2 = 2, analysis with default parameters. HMST-Seq-Analyzer reported
mc3 = 2) and T-test for DMR prediction. For WGBS data, we 291 genes having more than two DMCs in the promoter regions
suggest to use window bin size 200 bp (W = yes, a = 200), a (STable 1). For MethylKit, it reported a total of 3938 genes with
minimum of 3 consecutive sites (mc1 = 3, mc2 = 5, mc3 = 3) DMCs in the promoters but the majority (61%) of them had only
2884
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

squamous cell carcinoma [42]. Hence, HOXD12 must specifically


be reported as differentially methylated in the TET knockout data.
While MethylKit failed to identify HOXD12 as differentially methy-
lated, HMST-Seq-Analyzer not only reported it as differentially
methylated but also correctly categorized it as hypermethylated
as expected in absence of TET proteins (Fig. 6b). In Fig. 7, the dis-
tribution of mCs in this DMR was plotted by HMST-Seq-Analyzer,
where the methylation levels from TET knock out and wild type
are different. For the sites 2–4 in Fig. 7, the WT sample provided
no methylation information for the highly methylated sites in
KO. MethylKit relies on single base pair differential methylation
analysis that discards such sites with missing values, which failed
to capture the DMCs in promoter of HOXD12. It is reasonable to
assume that WT mCs at sites 2–4 experienced demethylation in
presence of TET proteins and consider them in analysis, instead
of assuming that the three consecutive sites experienced experi-
mental error and ignoring them. This argument is more plausible
especially when mCs sites 2–4 are not partially but fully methy-
lated in the KO sample.
Because relatively low coverage per site increases the sampling
variation [43], variation at single sites is usually greater than that
Fig. 5. Whole-genome DMC comparison for WGBS (HMST-Seq-Analyzer VS of a region. Translating variation on a single CpG site to a differen-
MethylKit). Number of genome-wide DMCs predicted by HMST-Seq and MethylKit tial methylation event can thus be misleading. By imputing the
are shown in figure. Unique DMCs predicted by HMST-Seq-Analyzer are repre- missing values and considering the overall methylation pattern
sented in Pink, by MethylKit in green and overlapping in brown, respectively. Both
of adjacent CpG sites, HMST-Seq-Analyzer discovers important dif-
HMST-Seq-Analyzer and MethylKit were run in default parameters. Here, ~98% of
the DMCs identified by MethylKit are overlapping with DMCs predicted by HMST-
ferentially methylated genes that are completely missed by
Seq-Analyzer. (For interpretation of the references to colour in this figure legend, MethylKit. Therefore, there are two main reasons to the large dif-
the reader is referred to the web version of this article.) ference in the number of genes with DMCs in promoters reported
by the two tools. MethylKit predicts individual DMCs and then
reports the nearest feature (a gene’s promoter), hence it reports a
one DMC (SFig. 8). There are 91 genes overlapping between the huge number of promoters where majority of them (~61%; SFig. 8)
MethylKit and HMST-Seq-Analyzer results (STable 2). For instance, have only one DMC. Labelling whole promoter to be differentially
Atf4 which is a transcription factor, was reported as having DMCs methylated on basis of one DMC can be misleading. On the other
occurring at the promoter region by both tools (Fig. 6a). Because of hand, HMST-Seq-Analyzer has stringency parameters (e.g., 2
the wide difference in results from the two tools, it is interesting to DMC at a promoter), which filter out multiple standalone methy-
perform functional gene annotation on the two gene lists. DAVID lated sites. This reduces the total number of promoters reported
tools was used for this analysis [40]. GO results (P-value < 0.05) at the end of analysis, but provides significant results as every
of the genes predicted exclusively by HMST-Seq-Analyzer (~200) DMR would have at least two DMCs. In summary, DMCs predicted
and MethylKit (e.g., the top 200) are presented in STables 3 and by HMST-Seq-Analyzer capture more significant biological events
4, respectively. The highest number of genes (20) predicted by than MethylKit.
HMST-Seq-Analyzer are involved in organismal development
(STable 3). This was also reported in the original study that dele- 3.3. A comparison of HMST-Seq-Analyzer with more packages for
tion of TET proteins impairs differentiation in embryonic stem analysing differentially methylated regions
cells, which is clearly linked to organismal development [36]. Since
there were too many MethylKit predicted genes (>3800) with Though no public standalone tool is available for HMST-Seq
DMCs in promoters to give a meaningful gene enrichment test, data analysis, there are several R packages (e.g., BSmooth [22],
the top 200 differentially methylated genes (on the basis of the MethylSig [23], and MethylKit [21]) that can be used to detect
P-value from MethylKit) were selected for functional gene annota- DMR in WGBS experiments. We used WGBS data for two human
tion (STable 4). Here, genes were reported to be involved in 15 dif- cell lines (H1 and IMR90; each cell line with replicate) that
ferent processes, with the highest number of them being linked to included in BSmooth package, to evaluate the performance of
transcription (GO:0006351, 23 genes) and negative regulation of DMR prediction in HMST-Seq-Analyzer and three R packages
transcription (GO:0000122, 14 genes). However, the original study (BSmooth, MethylSig, MethylKit). First, WGBS datasets of the two
reports [36] that while the TET protein deletion did alter the DNA human cells of the original publication [44] were downloaded
methylation at some promoters, it did not correlate with the corre- using R scripts in BSmooth. CpG sites with coverage less than 4
sponding gene expression changes. Moreover, percentage of genes are removed. Then, default settings of each program were used
with only 1 DMC in the promoter further increases to ~70% in the in computation. For example, smoothed methylation levels by
top 200 results of MethylKit (SFig. 9). A similar result (STable 5) the BSmooth function were used in both BSmooth and MethylSig,
was obtained when the same analysis was repeated with the top but raw methylation levels were inputted into MethylKit and
500 genes from the MethylKit prediction. HMST-Seq-Analyzer. In each program, missing value imputation
Literature evidences of a few genes predicted by HMST-Seq- and data interpretation were based on the default settings. All cal-
Analyzer but missed by MethylKit are provided in Table 1, where culations were performed on a computer node of the Sigma2 Saga
the gene Hoxd12 is particularly interesting because it is directly computer cluster at the Norwegian University of Science and
targeted by TET1 [41]. TET1 is involved in active demethylation Technology.
and decreased recruitment of TET1 to the HOXD12 gene results In selected four packages, the definition of MR is different. Both
in increased methylation of HOXD12. HOXD12 is also reported to MethylKit and MethylSig consider MR as either tiling windows
have hypermethylated CpG as a potential marker for stage 1 lung (25 bp or default 200 bp) or a set of predefined regions (e.g., pro-
2885
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Fig. 6. HMST-Seq-Analyzer predicted DMR at TSS of multiple genes. Panels a and b are for genes Atf4 and Hoxd12 respectively. The first track represents the methylation level
at mCs from wild type (WT) by orange bars, with the height of each vertical bar representing percentage of methylation. The second track represents the methylation level at
mCs from TET1, TET2 double knock out (DKO) in green. The third track represents DMRs detected by HMST-Seq-Analyzer in red. The fourth track presents DMCs predicted by
MethylKit in blue. The fifth track depicts the gene in dark blue. No DMCs were predicted by MethylKit for Hoxd12, hence, the track for methylkit is not included in pancel b.
(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

moters). BSmooth defines MR by a maximum gap (default is (e.g., H1 versus IMR90), MethylKit, MethylSig, BSmooth and
300 bp) between adjacent CpG sites and a minimum number (de- HMST-Seq-Analyzer apply logistic regression test, Beta-binomial
fault is 3) of CpG sites in a MR. This is similar to the MR search test, T-statistics and T-test on MRs with the same changing trend,
strategy in HMST-Seq-Analyzer (e.g., default maximum distance respectively. Since only HMST-Seq-Analyzer considers both DMR
between adjacent CpG sites is 2000 bp and minimum 3 CpG sites finding and gene annotation simultaneously when searching for
in a MR). To identify DMRs between the two human cell lines DMRs, the time usage for DMR finding and gene annotation was

2886
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Table 1
Literature evidence for genes predicted by HMST-Seq-Analyzer but missed by MethylKit in differentially methylated CpG sites analysis.

Gene Name Comments References


Mael 1) hypomethylated in colorectal cancer. [48,49]
2) Hypermethylation of MAEL promoter in infertile men.
3) Mael promoter methylation levels are all increased in tet1 tet2 mutants.
Hoxd12 1) recovered as differentially methylated by swDMR (another tool). [30,41,4230,42,50]
2) Hoxd12 is targeted by TET1 and decreased recruitment of Tet1 on HOXD12 gene results in increased methylation.
3) hypermethylated CpGs in HOXD12 gene reported as potential marker for stage 1 lung squamous cell carcinoma.
Dazl 1) promoter is usually methylated (it’s usually expressed in germ cells, so unmethylated in germ cells only) [51]
Shank2 1) hyper methylated in prostate cancer [52]
Rn45s 1) hypo methylated in mice with high maternal folic acid. [53]

To identify differentially methylated CpG sites (DMCs), HMST-Seq-Analyzer and MethhylKit were applied on the same RRBS data of TET knockout mice experiments. There are
291 and 3938 genes that contain DMCs in the promoters (+/1Kb) based on the prediction of HMST-Seq-Analyzer and MethylKit, respectively. Selected genes with literature
support are listed here.

needs further improvement by optimizing the gene annotation


step during the DMR finding. One advantage of HMST-Seq-
Analyzer over the other tools is that it saves time for post process-
ing. A user already knows that every DMR will contain a minimum
number of methylated sites and will be annotated to a genomic
feature.

4. Conclusion

We present a new Python package called HMST-Seq-Analyzer


for differential methylation and hydroxymethylation analysis, that
identifies and annotates genome-wide differentially methylated
regions (DMRs) by using DNA sequencing data. Though it is opti-
mized for HMST-Seq data, the tool is highly flexible and is able
to analyse other popular types of DNA methylation sequencing
data such as WGBS and RRBS. HMST-Seq-Analyzer takes as an
input either library tag counts of HMST-Seq or methylation per-
centage per base in the case of WGBS/RRBS for DMR detection.
Regardless of pre- and postprocessing steps, this pipeline can be
Fig. 7. Distribution of Methylated Cytosines in HOXD12 DMR predicted by HMST- used independently to detect DMRs. Gene annotation of discovered
Seq-Analyzer. The heading of the figure provides information about the DMR in
format ‘‘chromosome number:start position:end position:strand:gene identifier”
DMRs is performed automatically at the end of the pipeline, sum-
followed by the p-values. Circles represent imputed methylation values at each mary statistics of methylation distributions in various genomic
methylated site (blue for wild type/control and green for knockout/case). Diamonds regions (e.g., TSS, TES, gene body, and 50 distance region) are illus-
represent original methylation value (orange for wildtype/control and red for trated in graphs, and the average of methylation levels spanning
knockout/case). (For interpretation of the references to colour in this figure legend,
gene regions such as TSS-gene-TES or enhancer regions are also
the reader is referred to the web version of this article.)
provided. The package is able to deal with very huge data sets such
as genome-wide methylation profiles of CpG, CHG, or CHH methy-
lation from WGBS data, because of the parallel implementation of
separated in the evaluation. Results of the performance compar- the MR/DMR search algorithms. The final results exported by the
ison on human chromosome 22 are shown in Table 2, where the pipeline (e.g. gene annotated hyper/hypo DMRs) are ready for biol-
detected DMRs from each method were sorted and merged by ogists to be used for further detailed investigation.
BEDTools [45] before counting their overlap to the predictions HMST-Seq-Analyzer is written in Python and is publicly avail-
from HMST-Seq-Analyzer. In summary, ~80% of DMRs detected in able. It is a command line tool and tested on both macOS and Linux
each R package are recovered by HMST-Seq-Analyzer, regardless operating systems. For small or medium sized data sets (e.g., RRBS/
of the tiling window size (e.g., either 25 bp or 200 bp) that was HMST-Seq), it can be run on desktop PCs. However, for big data
used in MethylKit and MethylSig. BSmooth has the highest per- (e.g., genome-wide CpG/CHG/CHH from WGBS), it is preferable to
centage (~86%) of overlapping DMRs with HMST-Seq-Analyzer. A run the pipeline on high performance computers with parallel
median size of the predicted DMRs from HMST-Seq-Analyzer computation. The package is able to process both human and
(~10 Kb) is much longer than that of the other packages (e.g., mouse data, as well as to the other species if the corresponding ref-
~200 to 400 bp). Though HMST-Seq-Analyzer used more running erence genomes are available and with the same format as our pro-
time (~400 s) than most of R packages (e.g., ~60 to 533 s), around vided human/mouse genome. For convenience, human (hg 19,
75% of its wall time were used in the gene annotation. Similar hg38) and mouse (mm10) genome annotation and chromosome
results were also obtained from both a median size human chro- size files are also included in the package. Though the current pipe-
mosome (chr17; STable 6) and a large human chromosome line is robust in differential methylation analysis when compared
(chr1; STable 7), when the same evaluation was performed. Over- to three popular R packages (overlapping with ~80% BSmooth
all, the predicted DMRs from HMST-Seq-Analyzer is robust against and ~70% MethylKit/MethylSig predicted DMRs), it is slow in pro-
those from the three R packages in WGBS data analysis (e.g overlap cessing of gene annotation during the DMR finding. Especially,
to ~80% of BSmooth and ~70% of MethylSig/MethylKit predictions). when multiple groups, conditions or replicates are considered in
However, the computational efficiency of HMST-Seq-Analyzer DMR prediction. In future, we aim to overcome this limitation by
2887
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

Table 2
A comparison of HMST-Seq-Analyzer and three other packages in differentially methylated region analysis (chr22).

DMRs Overlapping Percentage (%) Length (bp) Gene annotation (second) DMR finding (second) Total (second)
HMST-Seq-Analyzer 1107 1107 100 9768 308 86 394
BSmooth 1105 953 86 379 NA 60 60
MethylSig 34,347 27,816 81 200 NA 168 168
MethylKit 12,460 9865 79 200 NA 87 87
MethylSig 50,597 41,182 81 25 NA 533 533
MethylKit 16,683 13,371 80 25 NA 115 115

HMST-Seq-Analyzer, BSmooth, MethylSig, and MethylKit were applied on the same WGBS data to identify differentially methylated regions (DMRs) between human H1 and
IMR90 cells, respectively. Here, each cell line has replicated experiments and only chromosome 22 is used in the evaluation for all programs. MethylKit and MethylSig are
tested in two different lengths of window size (25 bp and 200 bp). In the table, DMRs represents the number of DMRs detected by the package. Overlapping and Percentage
are the number of and the percentage of DMRs that are overlapping with the DMRs from HMST-Seq-Analyzer, respectively. Length is the median length of DMRs. DMR finding,
Gene annotation and Total are wall time (seconds) used in each step, respectively.

developing new methods suitable for fast gene annotation and fur- [6] Cokus SJ et al. Shotgun bisulphite sequencing of the Arabidopsis genome
reveals DNA methylation patterning. Nature 2008;452(7184):215–9.
ther optimize its speed for very large data such as genome-wide
[7] Ziller MJ et al. Coverage recommendations for methylation analysis by whole-
non-CpG methylation. We will also consider adapting the software genome bisulfite sequencing. Nat Methods 2015;12(3):230.
for integration into Galaxy [46], CyVerse Discovery Environment [8] Lee I et al. Whole genome DNA methylation sequencing of the chicken retina,
[47], or similar platforms, as well as enabling export of results cornea and brain. Sci Data 2017;4:170148.
[9] Davis CA et al. The Encyclopedia of DNA elements (ENCODE): data portal
for visualization with the UCSC Genome browser [25] or other update. Nucl Acids Res 2018;46(D1):D794–801.
tools. [10] Bujold D et al. The international human epigenome consortium data portal.
Cell Syst 2016;3(5):496–9. e2.
[11] Gao F et al. Integrated detection of both 5-mC and 5-hmC by high-throughput
CRediT authorship contribution statement tag sequencing technology highlights methylation reprogramming of bivalent
genes during cellular differentiation. Epigenetics 2013;8(4):421–30.
[12] Nair SS et al. Guidelines for whole genome bisulphite sequencing of intact and
Amna Farooq: Software, Validation, Formal analysis, Visualiza- FFPET DNA on the Illumina HiSeq X Ten. Epigenet Chromatin 2018;11(1):24.
tion, Writing - original draft. Sindre Grønmyr: Software, Visualiza- [13] Van Dijk EL et al. Ten years of next-generation sequencing technology. Trends
Genet 2014;30(9):418–26.
tion, Validation, Formal analysis. Omer Ali: Validation, Formal
[14] Meissner A et al. Reduced representation bisulfite sequencing for comparative
analysis, Writing - review & editing. Torbjørn Rognes: Validation, high-resolution DNA methylation analysis. Nucl Acids Res 2005;33
Writing - review & editing. Katja Scheffler: Validation, Writing - (18):5868–77.
[15] Ball MP et al. Targeted and genome-scale strategies reveal gene-body
review & editing. Magnar Bjørås: Validation, Writing - review &
methylation signatures in human cells. Nat Biotechnol 2009;27(4):361–8.
editing. Junbai Wang: Conceptualization, Methodology, Software, [16] Maunakea AK et al. Conserved role of intragenic DNA methylation in
Data curation, Writing - review & editing, Visualization, Investiga- regulating alternative promoters. Nature 2010;466(7303):253–7.
tion, Formal analysis, Supervision, Validation, Project administra- [17] Guo JU et al. Neuronal activity modifies the DNA methylation landscape in the
adult brain. Nat Neurosci 2011;14(10):1345–51.
tion, Funding acquisition. [18] Gao F et al. Integrated analyses of DNA methylation and hydroxymethylation
reveal tumor suppressive roles of ECM1, ATF5, and EOMESin human
hepatocellular carcinoma. Genome Biol 2014;15(12):533.
Declaration of Competing Interest [19] Olsen MB et al. NEIL3-dependent regulation of cardiac fibroblast proliferation
prevents myocardial rupture. Cell Rep 2017;18(1):82–92.
The authors declare that they have no known competing finan- [20] Metagenomics versus Moore’s law. Nature Methods, 2009. 6(9): p. 623-623.
[21] Akalin A et al. methylKit: a comprehensive R package for the analysis of
cial interests or personal relationships that could have appeared genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
to influence the work reported in this paper. [22] Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite
sequencing reads to differentially methylated regions. Genome Biol 2012;13
(10):R83.
Acknowledgments [23] Park Y et al. MethylSig: a whole genome DNA methylation analysis pipeline.
Bioinformatics 2014;30(17):2414–22.
[24] Cavalcante RG et al. Integrating DNA methylation and hydroxymethylation
This work was supported by the South-Eastern Norway Regional
data with the mint pipeline. Cancer Res 2017;77(21):e27–30.
Health Authority (HSØ 2017061 and HSØ 2018107), and the Nor- [25] Haeussler M et al. The UCSC genome browser database: 2019 update. Nucl
wegian Research Council NOTUR project (nn4605k). We thank Acids Res 2019;47(D1):D853–8.
[26] Xu J et al. Pioneer factor interactions and unmethylated CpG dinucleotides
the four reviewers for their comments and suggestions that helped
mark silent tissue-specific enhancers in embryonic stem cells. Proc Natl Acad
us to improve the manuscript. Sci 2007;104(30):12377–82.
[27] Raval A et al. Downregulation of death-associated protein kinase 1 (DAPK1) in
chronic lymphocytic leukemia. Cell 2007;129(5):879–90.
Appendix A. Supplementary data [28] Sparago A et al. Microdeletions in the human H19 DMR result in loss of IGF2
imprinting and Beckwith-Wiedemann syndrome. Nat Genet 2004;36
(9):958–60.
Supplementary data to this article can be found online at [29] Avrahami D, Kaestner KH. The dynamic methylome of islets in health and
https://doi.org/10.1016/j.csbj.2020.09.038. disease. Mol Metab 2019;27:S25–32.
[30] Wang Z et al. swDMR: a sliding window approach to identify differentially
methylated regions based on whole genome bisulfite sequencing. PLoS One
References 2015;10(7):e0132866.
[31] Günther C-C et al. Statistical hypothesis testing for categorical data using
[1] Greenberg MV, Bourc’his D. The diverse roles of DNA methylation in enumeration in the presence of nuisance parameters. Preprint Statistics
mammalian development and disease. Nature Rev Mol Cell Biol 2019:1–18. 2009;4.
[2] Shi D-Q et al. New insights into 5hmC DNA modification: generation, [32] Tang B et al. Integration of DNA methylation and gene transcription across
distribution and function. Front Genet 2017;8(100). nineteen cell types reveals cell type-specific and genomic region-dependent
[3] Luo C, Hajkova P, Ecker JR. Dynamic DNA methylation: In the right place at the regulatory patterns. Sci Rep 2017;7(1):3626.
right time. Science 2018;361(6409):1336–40. [33] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and
[4] Bullinger L et al. Quantitative DNA methylation predicts survival in adult acute powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol)
myeloid leukemia. Blood, J Am Soc Hematol 2010;115(3):636–42. 1995;57(1):289–300.
[5] Frommer M et al. A genomic sequencing protocol that yields a positive display [34] Wang J et al. Genome-wide analysis uncovers high frequency, strong
of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci differential chromosomal interactions and their associated epigenetic
1992;89(5):1827–31. patterns in E2-mediated gene regulation. BMC Genomics 2013;14:70.

2888
A. Farooq, S. Grønmyr, O. Ali et al. Computational and Structural Biotechnology Journal 18 (2020) 2877–2889

[35] Haddad RA, Akansu AN. A class of fast Gaussian binomial filters for speech and [44] Lister R et al. Human DNA methylomes at base resolution show widespread
image processing. IEEE Trans Signal Process 1991;39(3):723–7. epigenomic differences. Nature 2009;462(7271):315–22.
[36] Reimer M et al. Deletion of Tet proteins results in quantitative disparities [45] Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis.
during ESC differentiation partially attributable to alterations in gene Curr Protoc Bioinformatics, 2014. 47: p. 11 12 1-34.
expression. BMC Dev Biol 2019;19(1):16. [46] Giardine B et al. Galaxy: a platform for interactive large-scale genome analysis.
[37] Umer M et al. Promoter hypermethylation of Wnt pathway inhibitors in Genome Res 2005;15(10):1451–5.
hepatitis C virus-induced multistep hepatocarcinogenesis. Virol J 2014;11 [47] Koltes JE et al. Bioinformatics resources for animal genomics using CyVerse
(1):117. cyberinfrastructure. J Anim Sci 2016;94:33–4.
[38] Liu C et al. Decrease of 5-hydroxymethylcytosine is associated with [48] Cheng Y-S et al. MAEL promoter hypermethylation is associated with de-
progression of hepatocellular carcinoma through downregulation of TET1. repression of LINE-1 in human hypospermatogenesis. Hum Reprod 2017;32
PLoS One 2013;8(5). (12):2373–81.
[39] Zhu Y et al. Integrated analyses of multi-omics reveal global patterns of [49] Vincent JJ et al. Stage-specific roles for tet1 and tet2 in DNA demethylation in
methylation and hydroxymethylation and screen the tumor suppressive roles primordial germ cells. Cell Stem Cell 2013;12(4):470–8.
of HADHB in colorectal cancer. Clin Epigenet 2018;10(1):30. [50] Worm J, Guldberg P. DNA methylation: an epigenetic pathway to cancer and a
[40] Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths promising target for anticancer therapy. J Oral Pathol Med 2002;31(8):443–9.
toward the comprehensive functional analysis of large gene lists. Nucl Acids [51] Linher K et al. An epigenetic mechanism regulates germ cell-specific
Res 2009;37(1):1–13. expression of the porcine Deleted in Azoospermia-Like (DAZL) gene.
[41] Cartron P-F et al. Identification of TET1 partners that control its DNA- Differentiation 2009;77(4):335–49.
demethylating function. Genes Cancer 2013;4(5–6):235–41. [52] Devaney JM et al. Genome-wide differentially methylated genes in prostate
[42] Rauch TA et al. DNA methylation biomarkers for lung cancer. Tumor Biol cancer tissues from African-American and Caucasian men. Epigenetics
2012;33(2):287–96. 2015;10(4):319–28.
[43] Smith AK et al. Differential immune system DNA methylation and cytokine [53] Barua S et al. DNA methylation profiling at single-base resolution reveals
regulation in post-traumatic stress disorder. Am J Med Genet Part B: gestational folic acid supplementation influences the epigenome of mouse
Neuropsychiatric Genet 2011;156(6):700–8. offspring cerebellum. Front Neurosci 2016;10:168.

2889

You might also like