[go: up one dir, main page]

Next Article in Journal
Gas-Vapor Mixture Temperature in the Near-Surface Layer of a Rapidly-Evaporating Water Droplet
Previous Article in Journal
Hall and Ion-Slip Effect on CNTS Nanofluid over a Porous Extending Surface through Heat Generation and Absorption
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Clustering Approach for Motif Discovery in ChIP-Seq Dataset

1
College of Science, Northwest A&F University, Yangling 712100, China
2
School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
3
School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
4
College of Software, Nankai University, Tianjin 300071, China
5
Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA
*
Authors to whom correspondence should be addressed.
Entropy 2019, 21(8), 802; https://doi.org/10.3390/e21080802
Submission received: 6 June 2019 / Revised: 4 August 2019 / Accepted: 15 August 2019 / Published: 16 August 2019
(This article belongs to the Section Entropy and Biology)

Abstract

:
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to make an almost accurate prediction of TFBSs in a reasonable time. Also, the validity of the AP-ChIP algorithm is tested on a real ChIP-Seq data set.

1. Introduction

Transcription factor binding sites (TFBSs) [1] are short, degenerated nucleotide fragments (usually ≤30 bps) located in specific DNA regions. They play an important role in regulating gene expression. The Planted ( l , d ) Motif Search (PMS) problem [2] is a popular motif model for the identification of TFBSs (i.e., motif discovery) in bioinformatics, and is formally defined as follows:
Definition 1 (PMS).
Given a set of DNA sequences X = { X 1 , X 2 , , X t } with | X i | = n and three non-negative integers d , q , l , with 0 d < l < n , 0 < q t , the PMS problem is to find an l-mer M (a string of length l), such that each selected sequence X i has an l-mer M i with Hamming distance d H ( M , M i ) d , for i = 1 , 2 , , q . The l-mer M is called an ( l , d ) motif and the l-mer M i is called a motif instance.
According to the different values of q representing the distribution of motif instances, there are three different motif discovery sequence models [3]: (i) Exactly one motif occurrence per sequence (the OOPS model), (ii) zero or one motif occurrences per sequence (the ZOOPS model), or (iii) zero or more motif occurrences per sequence (the TCM model). For the OOPS model, q = t , for the ZOOPS model and TCM model, 0 < q < t .
Generally, there are two kinds of algorithms for solving the PMS problem: exact algorithms and approximate algorithms. Exact algorithms [4,5,6,7,8,9,10] always use consensus sequences [11] to represent motifs and can find all ( l , d ) motifs. Most exact algorithms are pattern-driven algorithms, which attempt to enumerate all possible 4 l l-mers (substring patterns of length l) to find the l-mer with the maximum number of approximate occurrences. Approximate algorithms [12,13,14,15,16,17,18] usually use a position weight matrix (PWM) [19] to describe the most likely occurring motifs and can report results in a short time, but do not always identify all ( l , d ) motifs. Most approximate algorithms use probabilistic analysis to maximize the score function which describes how likely it is for an l-mer pattern to be a motif instance.
Recently, chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has produced extremely valuable information for the genome-wide identification of transcription factor binding sites (TFBSs) and in the field of epigenetics, which mainly focus on DNA methylation, nucleosome localization, and histone modification. For transcription factors, ChIP-Seq is widely used to study the binding of transcription factors for the analysis of gene expression regulation on a genome-wide scale. For histones, ChIP-Seq performs high-throughput histone modification sequencing in the whole genome with sufficient sequencing depth and range, which not only improves the sensitivity and specificity of sequencing, but can also transform qualitative sequencing methods into quantitative detection.
In this paper, we focus on an algorithm to discover transcription factor binding sites in a ChIP-Seq data set. A ChIP-Seq data set is a set of peak regions containing TFBSs obtained through ChIP-Seq experiments, read mapping, and peak calling. It contains hundreds or more DNA sequences, which increases the difficulty of accurate and efficient identification of TFBSs.
Some algorithms have recently been proposed to discover TFBSs in ChIP-Seq data sets [20,21,22,23,24,25,26,27,28,29]. However, none of them has proven to be absolutely superior, compared to the rest. Some of these are tailored versions of previous motif discovery algorithms, specifically tailored towards ChIP-Seq data sets, such as MEME-ChIP [20] and HMS [21]. MEME-ChIP [20], which incorporates two complementary motif discovery algorithms, known as MEME and DREME [22], can identify motifs without restriction on the size or number of sequences, allowing very large ChIP-Seq data sets to be analyzed. HMS [21], which is an improved version of Gibbs Sampler, combines stochastic sampling and a deterministic greedy search step, which improves computation efficiency. DREME [22] is specifically designed to find short, core DNA-binding motifs of eukaryotic TFs, and is optimized to analyze very large ChIP-Seq data sets in just minutes. One may speed up the existing motif discovery algorithms by integrating some information, such as in the cases of STEME [23] and ChIP-Munk [24]. STEME [23] accelerates MEME by indexing sequences with suffix trees. ChIP-Munk [24] combines a greedy approach with an expectation-maximization (EM) algorithm to achieve a high efficiency. There are also exhaustive methods for determining exact motifs in ChIP-Seq data sets, such as FMotif [25] and Weeder [26]. FMotif [25] first constructs a mismatched suffix tree to scan and count all possible motif instances, and then implements a depth-first search to enumerate all possible motifs. However, the run time of FMotif becomes unrealistic with increasing values of l and d. Others use word enumeration methods to process full-size ChIP-Seq data sets, such as CisFinder [28] and MCES [29]. CisFinder [28] employs a word clustering method to group short l-mers ( l = 7 , 8, or 9), but struggles to find exact ( l , d ) motifs with larger values of l and d in ChIP-Seq data sets. MCES [29], a new planted ( l , d ) motif discovery algorithm, mines and combines substrings to rapidly identify exact motifs in full-size ChIP-Seq data sets.
In this paper, we propose a new motif discovery algorithm, named AP-ChIP, which is specially designed for better discovering TFBSs in ChIP-Seq data sets. The algorithm first constructs and then further filters cluster subsets using probabilistic analysis. Then, Affinity Propagation (AP) clustering [30] is applied to the candidate cluster subsets in order to discover optimal motifs. Experimental results show that the AP-ChIP Algorithm 1 can find TFBSs in a ChIP-Seq data set very efficiently and effectively.

2. Method

A ChIP-Seq data set has the following fundamental features: (i) Some of the sequences may contain no motifs at all; and (ii) thousands of sequences lead to huge amount of background l-mers. To cater to ChIP-Seq data sets, we design the AP-ChIP Algorithm 1 under the ZOOPS model and set some proper thresholds to filter redundant background l-mers. More specifically, the AP-ChIP Algorithm 1 consists of the following three steps:

2.1. Construct Cluster Subsets

We introduce the observation that any two motif instances x 1 and x 2 , each of which differs from the same motif x up to d positions, must have Hamming distance of no more than 2 d , denoted as d H ( x 1 , x 2 ) 2 d . Consequently, if an l-mer in one sequence is a motif instance, all other motif instances in the remaining sequences will be gathered in the corresponding cluster subset. Under the ZOOPS model, we choose h ( h = t q + 1 ) sequences as the reference sequences to ensure that at least one sequence among the h sequences contains a motif instance [31]. In general, we use the first h sequences { X 1 , X 2 , , X h } as the reference sequences. As the l-mer which is the motif instance is not known in advance, we consider all l-mers x i , j ( i = 1 , 2 , , h ; j = 1 , 2 , , n l + 1 ) in the first h sequences as the reference subsequences.
The ideal cluster subsets are expected to contain as few background l-mers as possible and, also, to include sufficient motif instances. Therefore, we set a threshold k ( d < k 2 d ) , so that, for each reference subsequence x i , j , all l-mers x i , j ( i = 1 , 2 , , t , i i ; j = 1 , 2 , , n l + 1 ) in the whole sequences, except the ith sequence X i such that d H ( x i , j , x i , j ) k , are selected to construct a cluster subset; that is
C ( x i , j , X ) = { x i , j } i = 1 i i t B ( x i , j , X i ) ,
where B ( x i , j , X i ) = { x i , j : x i , j l X i , d H ( x i , j , x i , j ) k } represents the set of the selected l-mers in the i th sequence X i and x i , j l X i if and only if x i , j is an l mer of the sequence X i .
To set a proper threshold k, two probabilistic expressions are employed. The first is the probability of the Hamming distance between two random l-mers x 1 and x 2 being no more than k [4]:
p k = p ( d H ( x 1 , x 2 ) k ) = i = 0 k l i 3 4 i 1 4 l i .
The other is the probability of the Hamming distance between two selected motif instances m 1 and m 2 being no more than k: [18].
p k m o t i f = p ( d H ( m 1 , m 2 ) k ) .
Now, we describe the method for calculating p k m o t i f . Given two motif instances, m 1 and m 2 , of the same motif m 0 with up to d mutations, the distances between m 1 , m 2 , and m 0 satisfy d H ( m 0 , m 1 ) = α and d H ( m 0 , m 2 ) = β , ( 0 α d , 0 β d ) . Thus, the probability p k m o t i f can be calculated as
p k m o t i f = p ( d H ( m 1 , m 2 ) k ) = p ( d H ( m 1 , m 2 ) k | d H ( m 0 , m 1 ) = α , d H ( m 0 , m 2 ) = β ) × p ( α , β ) ,
where p ( d H ( m 1 , m 2 ) k | d H ( m 0 , m 1 ) = α , d H ( m 0 , m 2 ) = β ) represents the conditional probability of p ( d H ( m 1 , m 2 ) k ) given d H ( m 0 , m 1 ) = α , d H ( m 0 , m 2 ) = β , such that
p ( d H ( m 1 , m 2 ) k | d H ( m 0 , m 1 ) = α , d H ( m 0 , m 2 ) = β ) = i = [ α + β k 2 ] + 1 m i n ( α , β ) α i × l α β i × 3 β l β × 3 β k < α + β 2 d , 1 0 < α + β k ,
and p ( α , β ) represents the probability of d H ( m 0 , m 1 ) = α and d H ( m 0 , m 2 ) = β ; that is
p ( α , β ) = p ( d H ( m 0 , m 1 ) = α , d H ( m 0 , m 2 ) = β ) .
As d H ( m 0 , m 1 ) = α and d H ( m 0 , m 2 ) = β are independent, we have
p ( α , β ) = p ( d H ( m 0 , m 1 ) = α ) × ( d H ( m 0 , m 2 ) = β ) ,
p ( d H ( m 0 , m 1 ) = α ) = d α 3 α 4 d ,
p ( d H ( m 0 , m 2 ) = α ) = d β 3 β 4 d .
Combining Equations (5)–(9), we have
p k m o t i f = p ( d H ( m 1 , m 2 ) k ) , = 0 α , β d i = [ α + β k 2 ] + 1 m i n ( α , β ) α i l α β i × 3 β l β × 3 β × 2 d α 3 α 4 2 d × 2 d β 3 β 4 2 d k < α + β 2 d , ( 0 α , β d 1 ) × 2 d α 3 α 4 2 d × 2 d β 3 β 4 2 d 0 < α + β k .
Having calculated the two probabilities p k and p k m o t i f , we now describe how to set the proper threshold k, in order to construct the cluster subsets which contain as few background l-mers as possible while including sufficient motif instances. A larger value of p k m o t i f indicates more motif instances belong to the cluster subset; however, a smaller value of p k suggests that fewer background l-mers appear in the same cluster subset. Therefore, the threshold k should be set in a way that ensures that the value of p k m o t i f is large enough, compared to the value of p k .
To demonstrate this issue, let us consider the (18, 5) problem instance as an example. The values of p k and p k m o t i f are shown in Table 1. When k = 7 , the value of p k m o t i f is 0.7414, which allows us to obtain sufficient motif instances, whereas the value of p k is 0.0012 which, in turn, allows us to reduce the scale of background l-mers in the same cluster subset. Therefore, the optimal value of k is 7.

2.2. Filter Cluster Subsets

As is known, the true motif instances must exist in one of these h × ( n l + 1 ) cluster subsets C ( x i , j , X ) , which are constructed with the reference subsequences x i , j ( i = 1 , 2 , , h ; j = 1 , 2 , , n l + 1 ) from the first h sequences. However, with a great number of total cluster subsets, the identification of the cluster subsets that contain the true motif instances is highly time-consuming as most of these cluster subsets have redundant background l-mers.
To filter the interference cluster subsets, a threshold p o c c f (i.e., an occurrence frequency) [29] is employed with the purpose of analyzing the probability of a random motif instance x occurring in a given sequence.
p o c c f = i = 0 d d i × p m u t i × ( 1 p m u t i ) d i × 1 l i × 3 i ,
where p m u t is the probability of a character mutation in one position. For each sequence X i ( i = 1 , 2 , , t , i i ) , if the number of the selected l-mers in B ( x i , j , X i ) , denoted as | B ( x i , j , X i ) | , is greater than or equal to p o c c f × ( n l + 1 ) ,
| B ( x i , j , X i ) | p o c c f × ( n l + 1 ) ,
then the sequence X i may contain a motif instance and is stored in a set N ( i , j ) . For each N ( i , j ) , if the number of the selected sequences in N ( i , j ) , denoted as | N ( i , j ) | , is not less than q,
| N ( i , j ) | q ,
then there are at least q possible motif instances in the corresponding cluster subset C ( x i , j , X ) , the reference subsequence x i , j is considered as a potential motif instance, and the corresponding cluster subset C ( x i , j , X ) is a candidate cluster subset C c a n d i d a t e ( x i , j , X ) .

2.3. Refine Cluster Subsets

Due to the occurrence frequency p o c c f and the relatively small Hamming distance k between the reference subsequence and the selected l-mers, only a limited number of cluster subsets, which contains a small number of the selected l-mers, are retained as candidate cluster subsets for further Affinity Propagation (AP) clustering. In order to quickly produce highly conserved cluster subsets, we apply AP clustering to each candidate cluster subset. Compared to other clustering approaches, AP clustering can cluster large-scale data sets efficiently by exchanging messages between data points.

2.3.1. Affinity Propagation (AP) Clustering

As demonstrated in our recent work [32], it is possible to speed up AP clustering and improve its accuracy with the adapted similarity s ( i , k ) , which is based on pair-wise constraints and a variable-similarity measure [33]. The adapted similarity, s ( i , k ) , between two l-mers x i , j and x k , j is defined as
s ( i , k ) = ρ × d H ( x i , j , x k , j ) × L ( x i , j , x k , j , X ) ,
where
ρ = R 1 if d H ( x i , j , x k , j ) ( 0 , k ] R 2 if d H ( x i , j , x k , j ) ( k , 2 k ] + if d H ( x i , j , x k , j ) ( 2 k , 4 k ]
L ( x i , j , x k , j , X ) = + if x i , j l X p , x k , j l X q , p = q 1 otherwise .
Note that R 1 ( 1 , + ) and R 2 ( 0 , 1 ] .
For each candidate cluster subset, based on the adapted similarity s ( i , k ) , AP clustering recursively calculates two types of messages. The first type is the responsibility r ( i , k ) , which reflects the suitability of point x k , j as the exemplar for point x i , j . The other type is the availability a ( i , k ) , which indicates how suitable it would be for a point x i , j to choose the point x k , j as its exemplar:
r ( i , k ) = s ( i , k ) max x k , j x k , j { a ( i , k ) + s ( i , k ) } ,
a ( i , k ) = m i n { 0 , r ( k , k ) + x i , j { x i , j , x k , j } m a x { 0 , r ( i , k ) } } if x i , j x k , j ,
a ( k , k ) = x i , j x k , j m a x { 0 , r ( i , k ) } } .
When the AP clustering converges, a set of l-mers in the produced cluster subset are selected as exemplars e ( i ) associated to the point x i , j :
e ( i ) = arg max x k , j { r ( i , k ) + a ( i , k ) } .

2.3.2. Cluster Subset Refinement

To select an adequate number of desired cluster subsets with more motif instances but less background l-mers, we set an interval [min size, max size] = [ t q , t ] to further refine the cluster subsets C A P ( x i , j , X ) , which is produced, by AP clustering, using a reference l-mer x i , j . Regarding the number of the l-mers in each cluster subset C A P ( x i , j , X ) , we conclude that there are three cases:
(i) In the case where the number is less than t q (for example, | C A P ( x i , j , X ) | < t q ) , such a small number of l-mers is not enough to create a cluster subset C A P ( x i , j , X ) which represents a true motif, so we consider it as an invalid cluster subset.
(ii) In the case where the number is between t q and t (for example t q < | C A P ( x i , j , X ) | t ) , we consider it as a valid cluster subset.
(iii) In the case where the number is more than t (for example | C A P ( x i , j , X ) | > t ), the size of the cluster subset is so large that it may include too many background l-mers and, so, we use a greedy strategy to select tl-mers from C A P ( x i , j , X ) to form C v a l i d ( x i , j , X ) . First, C v a l i d ( x i , j , X ) is initialized with the AP clustering exemplar e ( i ) . Then, an l-mer x r , j from C A P ( x i , j , X ) C v a l i d ( x i , j , X ) is repeatedly chosen, following Equations (21) and (22), and added to C v a l i d ( x i , j , X ) until | C v a l i d ( x i , j , X ) | = t ) :
x r , j = arg max x i , j C A P C v a l i d x k , j C v a l i d s i m ( x i , j , x k , j ) ,
s i m ( x i , j , x k , j ) = l e n ( x i , j , x k , j ) | x i , j | + | x k , j | l e n ( x i , j , x k , j ) ,
where l e n ( x i , j , x k , j ) is the length of the maximum intersection of x i , j and x k , j .
To appropriately sort the valid cluster subsets C v a l i d ( x i , j , X ) , we use the information content (IC) [34], and the cluster subset with the maximum value of IC is considered as the true motif model:
I C ( C v a l i d ( x i , j , X ) ) = m = 1 l w = 1 4 p w , m log p w , m p w , 0 ,
where p w , m is the probability of the character w A , T , C , G at the position m of the l-mer x i , j , and p w , 0 is the corresponding background probability.
Based on the above described three steps, the whole AP-ChIP Algorithm 1 is described as follows:
Algorithm 1:AP-ChIP algorithm
Entropy 21 00802 i052
Steps 2–17 describe the process of constructing the cluster subsets. Steps 18–21 describe the filtration of the cluster subsets. Steps 22–31 describe the refinement of the cluster subsets and the verification of the motif with the maximum IC score.

3. Results

3.1. Results on Simulated Data

Simulated data provide quantitative measures to test the performance of the AP-ChIP Algorithm 1. As in [29], we generate the simulated data as follows:
First, we generated t independent and identically distributed (i.i.d) sequences of length n and a motif m of length l. Second, we randomly generated q ( 0 < q t ) motif instances, each of which differed from the motif m at up to d positions. Third, the q motif instances were placed in a random position in a random selection of q sequences selected out of the t sequences. We, then, implemented the AP-ChIP Algorithm 1, using Matlab on a computer with a 2.6 GHZ processor and 4 Gbyte memory. The final experimental results consisted of the averages of five trials of simulated data experiments.
To evaluate the motif prediction accuracy, the nucleotide level performance coefficient ( n P C ) , as defined by Pevezner and Sze [2], was used:
n P C = | K P | | K P | ,
where K is the set of nucleotide positions in the true motif and P is the corresponding set of nucleotide positions in the predicted motif. The value of n P C is between 0 and 1; the larger the value of n P C , the higher the accuracy of the predicted motif. We used the 2 d neighborhood probability p 2 d [8] to select a group of ( l , d ) motif instances. The larger the value of p 2 d , the weaker the corresponding ( l , d ) problem instance becomes:
p 2 d = P ( d H ( x 1 , x 2 ) 2 d ) = i = 0 2 d l i 3 4 i 1 4 l i .
In what follows, according to different values of α = q t (i.e., the ratio of the sequences containing motif instances to all the sequences), we designed two groups of experiments, both of which consisted of two sub-experiments, to test the performance of the AP-ChIP Algorithm 1 on simulated data sets.
In the first group of experiments, we set α = 100 % for both sub-experiments. We compared the performance of the AP-ChIP Algorithm 1 with that of the widely used motif finding algorithms MEME [13], VINE [14], and Projection [16].
In the first sub-experiment, we set the number of sequences as t = 20 with sequence length n = 600 . The running time and the predicted accuracy of these algorithms are shown in Table 2. For the instances (12, 2) and (15, 3) with p 2 d < 0 . 05 , the AP-ChIP Algorithm 1 achieved near-optimal predicted accuracy within a relatively short time. For the instances (15, 4), (14, 4), (25, 8), and (21, 7) with p 2 d 0 . 05 , the predicted accuracy of the AP-ChIP Algorithm 1 was over 90 % and the running time of the AP-ChIP Algorithm 1 remained competitive, compared to the other algorithms.
In general, it is easy to find the true motif by increasing the sequence number and decreasing its length. Therefore, in the second sub-experiment, we set the sequence number as t = 1000 with sequence length n = 200 . Table 3 shows that, for all ( l , d ) problem instances with p 2 d 0 . 05 , the predicted accuracy of the AP-ChIP Algorithm 1 is over 90 % with the computational costs being satisfactory.
Next, in the second group of experiments, to simulate a real ChIP-Seq data set, we set α = 90 % for both sub-experiments. This is because in real ChIP-Seq data set, most but not all of the sequences contain motif instances.
In the first sub-experiment, we test the validity of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set for the identification of ( l , d ) motifs with t = 1000 and n = 200 . We choose p 2 d = 0 . 05 to select a group of ( l , d ) problem instances. The reason for this choice is that p 2 d = 0 . 05 is approximately the same as the p 2 d value of the (15, 4) problem instance, which is one of the most popular benchmarks for ( l , d ) problem instance. The running time and the predicted motif by the AP-ChIP Algorithm 1 are shown in Table 4. For each ( l , d ) problem instance, the AP-ChIP Algorithm 1 finds almost the same motif as the published one and also runs quite efficiently.
In the second sub-experiment, in order to further demonstrate the performance of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set, we compared the AP-ChIP Algorithm 1 against some established motif-finding algorithms in the following two aspects: (i) Different values of p 2 d , ranging from 0.05 to 0.5, with fixed α = 90 % , and (ii) different values of α floating from 0.7 to 1 with fixed ( l , d ) = ( 9 , 2 ) . Although a genome-wide ChIP-Seq data set typically has thousands to tens of thousands of sequences, using 20 % to 50 % of the ChIP-Seq data set is usually adequate for obtaining a good estimate of the true motifs. MEME-ChIP, a well-known algorithm for discovering motifs in ChIP-Seq data sets, is able to well identify ( l , d ) motifs with only 600 sequences. Thus, it was reasonable to set the sequence number as t = 600 and sequence length as n = 200 for motif discovery in our experiments.
First, we compared the running time and prediction accuracy of the AP-ChIP Algorithm 1 with those of the three compared algorithms, MEME-ChIP [20], ChIP-Munk [24], and FMotif [25], on different values of p 2 d and with fixed α = 90 % . As shown in Table 5, for each ( l , d ) problem instance, the AP-ChIP Algorithm 1 could solve it in a relatively short time, and its prediction accuracy was better than those of the three compared algorithms. Specifically, with increasing values of l and d, FMotif found it difficult to find exact motifs.
Next, we compared the prediction accuracy of AP-ChIP Algorithm 1 with those of the algorithms MEME [13], MEME-ChIP [20], DREME [22], and FMotif [25], on different values of α floating from 0.7 to 1 and fixed ( l , d ) = ( 9 , 2 ) . As the ratio of a motif instance α = q t has a strong effect on the prediction accuracy, it was necessary for us to test the prediction accuracy of AP-ChIP Algorithm 1 on different values of α . It is rather cumbersome to identify the true motif when the value of α is small. FMotif is a powerful, exhaustive algorithm for finding exact short ( l , d ) motifs ( l 10 , d 2 ) contained in ChIP-Seq data sets. As shown in Figure 1, the prediction accuracy of AP-ChIP Algorithm 1 was nearly the same as that of FMotif, and was higher than that of MEME-ChIP, MEME, and DREME.

3.2. Results on Real Data

First, we tested the validity of the AP-ChIP Algorithm 1 for the identification of real motifs using a diverse set of real ChIP-Seq data sets; specifically, on 12 differently sized mESC data sets (Nanog, Oct4, Sox2, Esrrb, Zfx, Klf4, c-Myc, n-Myc, Tcfcp21l, Smad1, STAT3, and CTCF) [35]. We compared the motifs detected by the AP-ChIP Algorithm 1 with the motifs published by Chen et al. [35], and presented them in the form of sequence logos [36], which graphically represent the degree of motif conservation, as measured by relative entropy. Table 6 shows the running times and the predicted motifs of the AP-ChIP Algorithm 1. For each data set, the AP-ChIP Algorithm 1 was capable of finding motifs highly similar to the published ones within a reasonable time.
Moreover, to better show the results, we compared AP-ChIP 1 with MEME-ChIP on nine differently sized ENCODE data sets (Nfyb, Hnf4, Elf1, Ets, Egr1, Yy1, Six5, Srf, and Tal1) [37], where the TFBSs were referenced in the JASPAR database [38]. Table 7 shows the published motifs and the motifs predicted by the two algorithms. The motifs are also shown in the form of sequence logos. AP-ChIP 1 could successfully find a motif similar to the published motif for each data set, while, for some data sets MEME-ChIP failed to accurately predict the motif (e.g., in the Elf1 data set), or lost information on individual bases (e.g., in the Tal dataset).

4. Concluding Remarks

In this paper, our goal was to find a method providing balance between time performance and prediction accuracy for TFBS discovery in ChIP-Seq data sets. Consequently, we aimed to obtain these results with high prediction accuracy in a relatively short time. To do so, we proposed a novel clustering-based algorithm named AP-ChIP 1. Firstly, to achieve high prediction accuracy, we set a threshold k to restrict the number of the selected l-mers in the candidate cluster subsets. Next, to obtain good time performance, we set the threshold p o c c f in terms of probabilistic analysis, in order to filter the interferential candidate cluster subsets. Furthermore, a powerful data clustering method, AP clustering, was used to obtain the almost accurate motifs. Experimental results on both simulated and real ChIP-Seq datasets showed that the AP-ChIP Algorithm 1 not only discovers the motifs as consistently as the published ones, but also does so quite efficiently. This demonstrates that the AP-ChIP Algorithm 1 is a powerful new approach for ChIP-Seq data set analysis which provides a good trade-off between time performance and prediction accuracy.

Author Contributions

C.-x.S. and Y.Y. contribute for supervision, project administration, and formal analysis. C.-x.S., Y.Y., H.W., and W.-h.W. contribute for methodology and writing original draft preparation. The final draft we written by C.-x.S. and H.W.

Funding

This work is supported by the National Natural Science Foundation of China (grant nos. 61702291, 61772102, 61472058,11531001), China Postdoctoral Science Foundation (grant no. 2018M632095), Program for Science & Technology Innovation Talents in Universities of Henan Province (grant no. 19HASTIT029), the Key Research Project in Universities of Henan Province (grant nos. 19B110011, 19B630015), the Simons Foundation (grant no. 245307), and the Scientific Research Starting Foundation for High-level Talents of Pingdingshan University (grant no. PXY-BSQD2017006).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stormo, G.D. DNA binding sites: Representation and discovery. Bioinformatics 2000, 16, 16–23. [Google Scholar] [CrossRef] [PubMed]
  2. Pevzner, P.A.; Sze, S.H. Combinatorial approaches to finding subtle signals in DNA sequences. In ISMB; American Association for Artificial Intelligence: Menlo Park, CA, USA, 2000; Volume 8, pp. 269–278. [Google Scholar]
  3. Bailey, T.; Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 1994, 2, 28–36. [Google Scholar]
  4. Yu, Q.; Huo, H.; Zhang, Y.; Guo, H. PairMotif: A new pattern-driven algorithm for planted (l,d) DNA motif search. PLoS ONE 2012, 7, e48442. [Google Scholar] [CrossRef] [PubMed]
  5. Chin, F.Y.; Leung, H.C. Voting algorithms for discovering long motifs. In Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, Singapore, 17–21 January 2005; pp. 261–271. [Google Scholar]
  6. Davila, J.; Balla, S.; Rajasekaran, S. Fast and practical algorithms for planted (l,d) motif search. IEEE/ACM Trans. Comput. Biol. Bioinform. 2007, 4, 544–552. [Google Scholar] [CrossRef] [PubMed]
  7. Dinh, H.; Rajasekaran, S.; Kundeti, V.K. PMS5: An efficient exact algorithm for the (l,d)-motif finding problem. BMC Bioinform. 2011, 12, 410. [Google Scholar] [CrossRef] [PubMed]
  8. Ho, E.S.; Jakubowski, C.D.; Gunderson, S.I. iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms Mol. Biol. 2009, 4, 14. [Google Scholar] [CrossRef] [PubMed]
  9. Dinh, H.; Rajasekaran, S.; Davila, J. qPMS7: A fast algorithm for finding (l,d)-motifs in DNA and protein sequences. PLoS ONE 2012, 7, e41425. [Google Scholar] [CrossRef] [PubMed]
  10. Nicolae, M.; Rajasekaran, S. Efficient sequential and parallel algorithms for planted motif search. BMC Bioinform. 2014, 15, 34. [Google Scholar] [CrossRef] [PubMed]
  11. Schneider, T.D. Consensus Sequence Zen. Appl. Bioinform. 2002, 1, 111–119. [Google Scholar]
  12. Quang, D.; Xie, X. EXTREME: An online EM algorithm for motif discovery. Bioinformatics 2014, 30, 1667–1673. [Google Scholar] [CrossRef] [PubMed]
  13. Bailey, T.L.; Williams, N.; Misleh, C.; Li, W.W. MEME: Discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006, 34, W369–W373. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, C.W.; Lee, W.S.; Hsieh, S.Y. An improved heuristic algorithm for finding motif signals in DNA sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 2010, 8, 959–975. [Google Scholar] [CrossRef] [PubMed]
  15. Lawrence, C.E.; Altschul, S.F.; Boguski, M.S.; Liu, J.S.; Neuwald, A.F.; Wootton, J.C. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 1993, 262, 208–214. [Google Scholar] [CrossRef] [PubMed]
  16. Buhler, J.; Tompa, M. Finding motifs using random projections. J. Comput. Biol. 2002, 9, 225–242. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, N.K.; Li, X.; Wang, D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf. Sci. 2018, 466, 25–43. [Google Scholar] [CrossRef]
  18. Wong, K.C. MotifHyades: Expectation maximization for de novo DNA motif pair discovery on paired sequences. Bioinformatics 2017, 33, 3028–3035. [Google Scholar] [CrossRef]
  19. Thompson, J.D.; Higgins, D.G.; Gibson, T.J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22, 4673–4680. [Google Scholar] [CrossRef] [PubMed]
  20. Machanick, P.; Bailey, T.L. MEME-ChIP: Motif analysis of large DNA datasets. Bioinformatics 2011, 27, 1696–1697. [Google Scholar] [CrossRef]
  21. Hu, M.; Yu, J.; Taylor, J.M.; Chinnaiyan, A.M.; Qin, Z.S. On the detection and refinement of transcription factor binding sites using ChIP-Seq data. Nucleic Acids Res. 2010, 38, 2154–2167. [Google Scholar] [CrossRef] [Green Version]
  22. Bailey, T.L. DREME: Motif discovery in transcription factor ChIP-seq data. Bioinformatics 2011, 27, 1653–1659. [Google Scholar] [CrossRef]
  23. Reid, J.E.; Wernisch, L. STEME: Efficient EM to find motifs in large data sets. Nucleic Acids Res. 2011, 39, e126. [Google Scholar] [CrossRef] [PubMed]
  24. Kulakovskiy, I.; Boeva, V.; Favorov, A.; Makeev, V. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 2010, 26, 2622–2623. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Jia, C.; Carson, M.B.; Wang, Y.; Lin, Y.; Lu, H. A new exhaustive method and strategy for finding motifs in ChIP-enriched regions. PLoS ONE 2014, 9, e86044. [Google Scholar] [CrossRef] [PubMed]
  26. Zambelli, F.; Pavesi, G. A faster algorithm for motif finding in sequences from ChIP-Seq data. In International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics; Springer: Berlin/Heidelberg, Germany, 2011; pp. 201–212. [Google Scholar]
  27. Yu, Q.; Wei, D.; Huo, H. SamSelect: A sample sequence selection algorithm for quorum planted motif search on large DNA datasets. BMC Bioinform. 2018, 19, 228. [Google Scholar] [CrossRef] [PubMed]
  28. Sharov, A.A.; Ko, M.S. Exhaustive search for over-represented DNA sequence motifs with CisFinder. DNA Res. 2009, 16, 261–273. [Google Scholar] [CrossRef] [PubMed]
  29. Yu, Q.; Huo, H.; Chen, X.; Guo, H.; Vitter, J.S.; Huan, J. An efficient motif finding algorithm for large DNA data sets. In Proceedings of the 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Belfast, UK, 2–5 November 2014; pp. 397–402. [Google Scholar]
  30. Frey, B.J.; Dueck, D. Clustering by passing messages between data points. Science 2007, 315, 972–976. [Google Scholar] [CrossRef]
  31. Yu, Q.; Huo, H.; Zhao, R.; Feng, D.; Vitter, J.S.; Huan, J. Reference sequence selection for motif searches. In Proceedings of the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA, 9–12 November 2015; pp. 569–574. [Google Scholar]
  32. Sun, C.; Huo, H.; Yu, Q.; Guo, H.; Sun, Z. An affinity propagation-based DNA motif discovery algorithm. BioMed Res. Int. 2015, 2015, 853461. [Google Scholar] [CrossRef]
  33. Leone, M.; Weigt, M. Clustering by soft-constraint affinity propagation: Applications to gene-expression data. Bioinformatics 2007, 23, 2708–2715. [Google Scholar] [CrossRef]
  34. Wang, D.; Lee, N.K. Computational discovery of motifs using hierarchical clustering techniques. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 1073–1078. [Google Scholar]
  35. Chen, X.; Xu, H.; Yuan, P.; Fang, F.; Huss, M.; Vega, V.B.; Wong, E.; Orlov, Y.L.; Zhang, W.; Jiang, J.; et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008, 133, 1106–1117. [Google Scholar] [CrossRef]
  36. Crooks, G.E.; Hon, G.; Chandonia, J.M.; Brenner, S.E. WebLogo: A sequence logo generator. Genome Res. 2004, 14, 1188–1190. [Google Scholar] [CrossRef]
  37. Qu, H.; Fang, X. A Brief Review on the Human Encyclopedia of DNA Elements (ENCODE) Project. Genom. Proteom. Bioinform. 2013, 11, 135–141. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  38. Khan, A.; Fornes, O.; Stigliani, A.; Gheorghe, M.; Castro-Mondragon, J.A.; Van, d.L.R.; Bessy, A.; ChèNeby, J.; Kulkarni, S.R.; Tan, G. JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2017, 77, e43. [Google Scholar]
Figure 1. Prediction accuracy for different values of α .
Figure 1. Prediction accuracy for different values of α .
Entropy 21 00802 g001
Table 1. Values of p k and p k m o t i f under different values of k for (18, 5) problem instance.
Table 1. Values of p k and p k m o t i f under different values of k for (18, 5) problem instance.
k5678910
p k 3 . 42 × 10 5 2 . 31 × 10 4 0.00120.00540.01930.0569
p k m o t i f 0.23030.47410.74140.92420.99151
Table 2. Comparisons on ( l , d ) problem instances with t = 20 , n = 600 , and  α = 100 % .
Table 2. Comparisons on ( l , d ) problem instances with t = 20 , n = 600 , and  α = 100 % .
( l , d ) p 2 d MEMEVINEProjectionAP-ChIP
(12, 2)0.00280.68 (4 s)1.00 (8 s)0.86 (10 s)0.98 (18 s)
(15, 3)0.00420.73 (7 s)1.00 (9 s)0.82 (1.3 m)1.00 (23 s)
(15, 4)0.05660.87 (8 s)0.96 (5.6 m)0.89 (4.2 m)0.97 (36 s)
(14, 4)0.11170.84 (10 s)0.95 (8.3 m)0.80 (27.4 m)0.96 (47 s)
(25, 8)0.14940.91 (12 s)0.93 (9.8 m)0.78 (32.6 m)0.94 (1.1 m)
(21, 7)0.25640.87 (28 s)0.92 (11.2 m)0.76 (48.7 m)0.91 (58 s)
Table 3. Comparisons on ( l , d ) problem instances with t = 1000 , n = 200 , α = 100 % .
Table 3. Comparisons on ( l , d ) problem instances with t = 1000 , n = 200 , α = 100 % .
( l , d ) p 2 d MEMEVINEProjectionAP-ChIP
(12, 3)0.05400.94 (8 s)0.96 (2.4 m)0.91 (3.1 m)0.97 (21 s)
(11, 3)0.11460.86 (10 s)0.95 (5.2 m)0.84 (4.3 m)0.96 (33 s)
(13, 4)0.20600.83 (10 s)0.93 (8.1 m)0.78 (36.4 m)0.95 (36 s)
(15, 5)0.31350.78 (11 s)0.84 (9.6 m)0.74 (46.7 m)0.93 (34 s)
(17, 6)0.42610.70 (13 s)0.83 (18.6 m)0.72 (53.6 m)0.92 (38 s)
(19, 7)0.53460.68 (17 s)0.75 (24.5 m)0.70 (1.2 h)0.90 (40 s)
Table 4. The results on ( l , d ) problem instances with p 2 d = 0 . 05 , α = 90 % .
Table 4. The results on ( l , d ) problem instances with p 2 d = 0 . 05 , α = 90 % .
( l , d ) TimePredicted MotifPublished Motif
(9, 2)43 sTTATCCCTCTTATCCCTC
(12, 3)34 sTTTCCCGTCTGCCTTTCCCGTCTG
(15, 4)42 sGGTTGRAGCTTAGGGGGTTGGAGCTTAGGG
(18, 5)38 sCTTTGCCATATCCATAGGTTTGCCATATCCATAGGC
(21, 6)36 sCAGGTAAACCATATTAAATTAAGGTAAACCATATTAAATTAC
R: A,G.
Table 5. Comparison of ( l , d ) problem instances with t = 600 , n = 200 , and  α = 90 % .
Table 5. Comparison of ( l , d ) problem instances with t = 600 , n = 200 , and  α = 90 % .
( l , d ) p 2 d MEME-ChIPChIP-MunkFMotifAP-ChIP
(9, 2)0.0490.96 (12 s)0.96 (1.8 m)1.00 (47 s)1.00 (43 s)
(11, 3)0.1140.94 (24 s)0.92 (2.0 m)0.99 (7.9 m)0.98 (46 s)
(13, 4)0.2050.90 (38 s)0.83 (2.4 m)0.98 (1.45 h)0.93 (58 s)
(15, 5)0.3190.85 (42 s)0.80 (8.2 m)0.92 (1.1 m)
(17, 6)0.4260.80 (45 s)0.78 (9.6 m)0.89 (1.3 m)
(19, 7)0.5340.78 (48 s)0.76 (10.7 m)0.87 (1.6 m)
Table 6. Results on the mESC data set.
Table 6. Results on the mESC data set.
Data Set (Seq #)TimePredicted MotifPublished Motif
c-Myc (3422)125 s Entropy 21 00802 i001 Entropy 21 00802 i002
CTCF (39609)19 s Entropy 21 00802 i003 Entropy 21 00802 i004
Esrrb (21647)10 s Entropy 21 00802 i005 Entropy 21 00802 i006
Klf4 (10875)138 s Entropy 21 00802 i007 Entropy 21 00802 i008
Nanog (10343)12 s Entropy 21 00802 i009 Entropy 21 00802 i010
n-Myc (7182)36 s Entropy 21 00802 i011 Entropy 21 00802 i012
Oct4 (3761)48 s Entropy 21 00802 i013 Entropy 21 00802 i014
Smad1 (1126)12 s Entropy 21 00802 i015 Entropy 21 00802 i016
Sox2 (4525)15 s Entropy 21 00802 i017 Entropy 21 00802 i018
STAT3 (2546)27 s Entropy 21 00802 i019 Entropy 21 00802 i020
Tcfcp211 (26910)11 s Entropy 21 00802 i021 Entropy 21 00802 i022
Zfx (10338)68 s Entropy 21 00802 i023 Entropy 21 00802 i024
Table 7. Results on the ENCODE dataset.
Table 7. Results on the ENCODE dataset.
Data Set (Seq #)AP-ChIP Predicted MotifMEME-ChIP Predicted MotifPublished Motif
Nfyb (10096) Entropy 21 00802 i025 Entropy 21 00802 i026 Entropy 21 00802 i027
Hnf4 (11045) Entropy 21 00802 i028 Entropy 21 00802 i029 Entropy 21 00802 i030
Elf1 (8611) Entropy 21 00802 i031 Entropy 21 00802 i032 Entropy 21 00802 i033
Ets  (5525) Entropy 21 00802 i034 Entropy 21 00802 i035 Entropy 21 00802 i036
Egr1  (15400) Entropy 21 00802 i037 Entropy 21 00802 i038 Entropy 21 00802 i039
Yy1  (2077) Entropy 21 00802 i040 Entropy 21 00802 i041 Entropy 21 00802 i042
Six5  (4664) Entropy 21 00802 i043 Entropy 21 00802 i044 Entropy 21 00802 i045
Srf  (4903) Entropy 21 00802 i046 Entropy 21 00802 i047 Entropy 21 00802 i048
Tal1  (25507) Entropy 21 00802 i049 Entropy 21 00802 i050 Entropy 21 00802 i051

Share and Cite

MDPI and ACS Style

Sun, C.-x.; Yang, Y.; Wang, H.; Wang, W.-h. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy 2019, 21, 802. https://doi.org/10.3390/e21080802

AMA Style

Sun C-x, Yang Y, Wang H, Wang W-h. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy. 2019; 21(8):802. https://doi.org/10.3390/e21080802

Chicago/Turabian Style

Sun, Chun-xiao, Yu Yang, Hua Wang, and Wen-hu Wang. 2019. "A Clustering Approach for Motif Discovery in ChIP-Seq Dataset" Entropy 21, no. 8: 802. https://doi.org/10.3390/e21080802

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop