3.1. Results on Simulated Data
Simulated data provide quantitative measures to test the performance of the AP-ChIP Algorithm 1. As in [
29], we generate the simulated data as follows:
First, we generated t independent and identically distributed (i.i.d) sequences of length n and a motif m of length l. Second, we randomly generated q motif instances, each of which differed from the motif m at up to d positions. Third, the q motif instances were placed in a random position in a random selection of q sequences selected out of the t sequences. We, then, implemented the AP-ChIP Algorithm 1, using Matlab on a computer with a 2.6 GHZ processor and 4 Gbyte memory. The final experimental results consisted of the averages of five trials of simulated data experiments.
To evaluate the motif prediction accuracy, the nucleotide level performance coefficient
, as defined by Pevezner and Sze [
2], was used:
where
K is the set of nucleotide positions in the true motif and
P is the corresponding set of nucleotide positions in the predicted motif. The value of
is between 0 and 1; the larger the value of
, the higher the accuracy of the predicted motif. We used the
neighborhood probability
[
8] to select a group of
motif instances. The larger the value of
, the weaker the corresponding
problem instance becomes:
In what follows, according to different values of (i.e., the ratio of the sequences containing motif instances to all the sequences), we designed two groups of experiments, both of which consisted of two sub-experiments, to test the performance of the AP-ChIP Algorithm 1 on simulated data sets.
In the first group of experiments, we set
for both sub-experiments. We compared the performance of the AP-ChIP Algorithm 1 with that of the widely used motif finding algorithms MEME [
13], VINE [
14], and Projection [
16].
In the first sub-experiment, we set the number of sequences as
with sequence length
. The running time and the predicted accuracy of these algorithms are shown in
Table 2. For the instances (12, 2) and (15, 3) with
, the AP-ChIP Algorithm 1 achieved near-optimal predicted accuracy within a relatively short time. For the instances (15, 4), (14, 4), (25, 8), and (21, 7) with
, the predicted accuracy of the AP-ChIP Algorithm 1 was over
and the running time of the AP-ChIP Algorithm 1 remained competitive, compared to the other algorithms.
In general, it is easy to find the true motif by increasing the sequence number and decreasing its length. Therefore, in the second sub-experiment, we set the sequence number as
with sequence length
.
Table 3 shows that, for all
problem instances with
, the predicted accuracy of the AP-ChIP Algorithm 1 is over
with the computational costs being satisfactory.
Next, in the second group of experiments, to simulate a real ChIP-Seq data set, we set for both sub-experiments. This is because in real ChIP-Seq data set, most but not all of the sequences contain motif instances.
In the first sub-experiment, we test the validity of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set for the identification of
motifs with
and
. We choose
to select a group of
problem instances. The reason for this choice is that
is approximately the same as the
value of the (15, 4) problem instance, which is one of the most popular benchmarks for
problem instance. The running time and the predicted motif by the AP-ChIP Algorithm 1 are shown in
Table 4. For each
problem instance, the AP-ChIP Algorithm 1 finds almost the same motif as the published one and also runs quite efficiently.
In the second sub-experiment, in order to further demonstrate the performance of the AP-ChIP Algorithm 1 on the simulated ChIP-Seq data set, we compared the AP-ChIP Algorithm 1 against some established motif-finding algorithms in the following two aspects: (i) Different values of , ranging from 0.05 to 0.5, with fixed , and (ii) different values of floating from 0.7 to 1 with fixed . Although a genome-wide ChIP-Seq data set typically has thousands to tens of thousands of sequences, using to of the ChIP-Seq data set is usually adequate for obtaining a good estimate of the true motifs. MEME-ChIP, a well-known algorithm for discovering motifs in ChIP-Seq data sets, is able to well identify motifs with only 600 sequences. Thus, it was reasonable to set the sequence number as and sequence length as for motif discovery in our experiments.
First, we compared the running time and prediction accuracy of the AP-ChIP Algorithm 1 with those of the three compared algorithms, MEME-ChIP [
20], ChIP-Munk [
24], and FMotif [
25], on different values of
and with fixed
. As shown in
Table 5, for each
problem instance, the AP-ChIP Algorithm 1 could solve it in a relatively short time, and its prediction accuracy was better than those of the three compared algorithms. Specifically, with increasing values of
l and
d, FMotif found it difficult to find exact motifs.
Next, we compared the prediction accuracy of AP-ChIP Algorithm 1 with those of the algorithms MEME [
13], MEME-ChIP [
20], DREME [
22], and FMotif [
25], on different values of
floating from 0.7 to 1 and fixed
. As the ratio of a motif instance
has a strong effect on the prediction accuracy, it was necessary for us to test the prediction accuracy of AP-ChIP Algorithm 1 on different values of
. It is rather cumbersome to identify the true motif when the value of
is small. FMotif is a powerful, exhaustive algorithm for finding exact short
motifs
contained in ChIP-Seq data sets. As shown in
Figure 1, the prediction accuracy of AP-ChIP Algorithm 1 was nearly the same as that of FMotif, and was higher than that of MEME-ChIP, MEME, and DREME.
3.2. Results on Real Data
First, we tested the validity of the AP-ChIP Algorithm 1 for the identification of real motifs using a diverse set of real ChIP-Seq data sets; specifically, on 12 differently sized mESC data sets (Nanog, Oct4, Sox2, Esrrb, Zfx, Klf4, c-Myc, n-Myc, Tcfcp21l, Smad1, STAT3, and CTCF) [
35]. We compared the motifs detected by the AP-ChIP Algorithm 1 with the motifs published by Chen et al. [
35], and presented them in the form of sequence logos [
36], which graphically represent the degree of motif conservation, as measured by relative entropy.
Table 6 shows the running times and the predicted motifs of the AP-ChIP Algorithm 1. For each data set, the AP-ChIP Algorithm 1 was capable of finding motifs highly similar to the published ones within a reasonable time.
Moreover, to better show the results, we compared AP-ChIP 1 with MEME-ChIP on nine differently sized ENCODE data sets (Nfyb, Hnf4, Elf1, Ets, Egr1, Yy1, Six5, Srf, and Tal1) [
37], where the TFBSs were referenced in the JASPAR database [
38].
Table 7 shows the published motifs and the motifs predicted by the two algorithms. The motifs are also shown in the form of sequence logos. AP-ChIP 1 could successfully find a motif similar to the published motif for each data set, while, for some data sets MEME-ChIP failed to accurately predict the motif (e.g., in the Elf1 data set), or lost information on individual bases (e.g., in the Tal dataset).