
matrix-clustering_stand-alone

This is a stand-alone version of RSAT matrix-clustering. This version is faster and simplified compared to the original one but the graphical output is still under development.

RSAT matrix-clustering is a tool to cluster and align transcription factor (TF) binding motifs. Here is a brief description of the method:

  • Motif comparison: the motifs are compared to each other using two metrics: the Pearson correlation coefficient (cor) and a version corrected for alignment width, the normalized Pearson correlation (Ncor).
  • Hierarchical clustering: the motifs are hierarchically clustered based on the values of one comparison metric (default = Ncor).
  • Tree partition: the hierarchical tree is partitioned by calculating the average cor and Ncor values at each node. Whenever a node fails to satisfy either threshold (one value for cor and another for Ncor), it is split into two clusters.
  • Motif alignment: for each cluster, the motifs are progressively aligned following the linkage order of the hierarchical tree. This ensures that each motif is aligned relative to its closest motif in the cluster.
  • Radial alignment (optional): all the motifs are forced to be aligned, and the aligned logos are displayed in a radial (circular) tree. This option is useful to visualize entire motif collections. See the example on the JASPAR website.
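
To make the two comparison metrics concrete, the sketch below computes cor and Ncor for a single ungapped alignment of two toy matrices. It assumes the definition used by RSAT compare-matrices, Ncor = cor * w / W, where w is the number of aligned columns and W is the total alignment width (w1 + w2 - w). The function name compare_at_offset and the matrices are illustrative and are not part of this repository; the scan over offsets and strands is left out.

```r
# Toy position frequency matrix (rows = A, C, G, T; one column per position).
# matrix() fills column by column, so each line below is one motif position.
m1 <- matrix(c(0.7, 0.1, 0.1, 0.1,
               0.1, 0.1, 0.1, 0.7,
               0.1, 0.1, 0.7, 0.1,
               0.1, 0.7, 0.1, 0.1),
             nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL))

# Second motif: the same core flanked by one uninformative column on each side.
flank <- rep(0.25, 4)
m2 <- cbind(flank, m1, flank)

# Hypothetical helper: compare `a` against `b` when column 1 of `a`
# is placed on column (offset + 1) of `b`.
compare_at_offset <- function(a, b, offset) {
  cols_a <- seq_len(ncol(a))
  cols_b <- cols_a + offset
  keep <- cols_b >= 1 & cols_b <= ncol(b)   # columns of `a` that overlap `b`
  w <- sum(keep)                            # number of aligned columns
  stopifnot(w > 0)
  W <- ncol(a) + ncol(b) - w                # total alignment width
  cor_value <- cor(as.vector(a[, cols_a[keep]]),
                   as.vector(b[, cols_b[keep]]))
  c(cor = cor_value, Ncor = cor_value * w / W)
}

result <- compare_at_offset(m1, m2, offset = 1)
print(round(result, 3))
```

In this toy case the aligned columns are identical, so cor is 1, while Ncor drops to 4/6 because two of the six alignment columns are unaligned flanks. This is the width penalty that makes Ncor the stricter metric for motifs of different lengths.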

As in the original version of RSAT matrix-clustering, there is no limit on the number of input motif files (so far we have tried with up to 900 input files). When users provide two or more input files, some intersection statistics (e.g., overlap among input collections) are calculated and visualized as heatmaps.

RSAT matrix-clustering is part of the RSAT suite for motif analysis. We decided to create a portable stand-alone version that can be run without installing the whole RSAT environment and that can be easily integrated within pipelines.

   

Before starting

If you want to run the original version with all the graphical output, you can do it through the RSAT website or, alternatively, by installing RSAT locally and running the command-line version of matrix-clustering.

⚠️ This repository is under active development, so you can expect many changes as long as you see this line.

 

🔧 Changes relative to the original version

  • We added a function to create motif alignments in a radial (circular) way (see Example 2). This representation makes it possible to visualize entire motif collections and highlight categories (TF classes, TF families, motif collection, etc.).

  • We added a new functionality to calculate how similar the resulting clusters are to a user-provided annotation (see Example 3 for more details). This functionality can be used to select the parameters (cor and Ncor thresholds) that maximize the similarity to the annotation.

  • The default thresholds are different: cor = 0.75 and Ncor = 0.55. To decide whether a node in the hierarchical tree is merged or split, we compute the average cor and Ncor over all the pairwise comparisons among the motifs in that node. We realized that the original version did not consider all the pairwise comparisons. After we corrected this problem, the original default thresholds became too permissive, so we updated them.

  • We implemented a motif trimming algorithm that is robust to IC spikes (see this reference). Motifs can be trimmed before clustering; more on this in the Extra section.
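
The node evaluation described in the list above (average cor and Ncor over all motif pairs in a node, compared against the two thresholds) can be sketched as follows. This is an illustration, not the code used by matrix-clustering.R: the function name node_passes and the toy similarity values are made up, and it assumes that a node is kept as one cluster only when both averages reach their thresholds.

```r
ids <- c("m1", "m2", "m3")

# Toy symmetric similarity matrices (values invented for the example).
cor_mat <- matrix(c(1.0, 0.9, 0.6,
                    0.9, 1.0, 0.7,
                    0.6, 0.7, 1.0),
                  nrow = 3, dimnames = list(ids, ids))
ncor_mat <- matrix(c(1.0, 0.8, 0.4,
                     0.8, 1.0, 0.5,
                     0.4, 0.5, 1.0),
                   nrow = 3, dimnames = list(ids, ids))

# Hypothetical helper: TRUE when the motifs in `members` can stay together.
node_passes <- function(members, cor_mat, ncor_mat,
                        cor_th = 0.75, ncor_th = 0.55) {
  if (length(members) < 2) return(TRUE)
  pairs <- t(combn(members, 2))   # one row per motif pair
  # A two-column character matrix indexes the cells by row and column name.
  mean(cor_mat[pairs]) >= cor_th && mean(ncor_mat[pairs]) >= ncor_th
}

pair_ok   <- node_passes(c("m1", "m2"), cor_mat, ncor_mat)
triple_ok <- node_passes(c("m1", "m2", "m3"), cor_mat, ncor_mat)
print(c(pair = pair_ok, triple = triple_ok))
```

With these values the node joining m1 and m2 passes, but adding m3 lowers the average cor to about 0.73, below the 0.75 default, so that node would be split even though its average Ncor is still above 0.55.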

   

💻 Install required software

Download this repository

git clone https://github.com/jaimicore/matrix-clustering_stand-alone.git
cd matrix-clustering_stand-alone

 

R libraries

The following R/Bioconductor packages are required to run RSAT matrix-clustering. You can install them from within R using the following commands:

# --------------------------- #
# List of required R packages #
# --------------------------- #
required.packages = c("dplyr",          # Data manipulation
                      "data.table",     # Read long matrices in a quick way
                      "furrr",          # Run functions in parallel
                      "optparse",       # Read command-line arguments
                      "purrr",          # Iterations
                      "rcartocolor",    # Cluster colors
                      "reshape2",       # Dataframe manipulation
                      "this.path",      # Create relative paths
                      "tidyr",          # Data manipulation
                      "dendsort",       # To draw heatmap
                      "ggplot2",        # Plots
                      "ggseqlogo",      # Draw logos
                      "RColorBrewer",   # Heatmap cell colors
                      "ape",            # Export hclust tree in newick format
                      "RJSONIO",        # Export hclust tree in JSON format
                      "circlize",       # Required to draw heatmaps
                      "flexclust",      # Calculate adjusted rand index
                      "htmlwidgets",    # Save plotly output as html
                      "plotly",         # Interactive plots
                      "svglite",        # Easy export of ggplot content as svg
                      "jsonlite")       # To create the JSON file from the hclust outputs


for (lib in required.packages) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib)
    suppressPackageStartupMessages(library(lib, character.only = TRUE))
  }
}


############################################
## Install required Bioconductor packages ##
############################################
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("universalmotif")
BiocManager::install("ComplexHeatmap")

 

Compile C dependencies

The motif comparison step is run by compare-matrices-quick, a fast version of RSAT compare-matrices implemented in C (with limited options but much faster).

This repository contains the C source code, which needs to be compiled to generate the executable that is called inside matrix-clustering.

Assuming you are in the main directory, after cloning this repository:

cd compare-matrices-quick
make

The makefile contains the commands to compile compare-matrices-quick.c. After running make, check that the executable was created by running the following command:

./compare-matrices-quick

This should print the help for compare-matrices-quick with an explanation of its parameters. Don't worry, you don't have to read it: the executable is called from within the R scripts.

   

▶️ Quickstart

Assuming you are in the root of the repository folder, you can run the following examples. The input files are part of the repository and are found in the folder data.

Example 1

Clustering of 66 motifs from three motif collections (files). An Oct4 ChIP-seq dataset was analyzed with three different motif discovery tools (RSAT peak-motifs, MEME-ChIP, and HOMER), and the resulting motifs are used as input. We detected a cluster of Oct4 motifs that includes the canonical motif and other binding variants, including homodimers and heterodimers. See Fig 2 of the RSAT matrix-clustering paper for a detailed explanation.

⏳ Running time: ~1 minute

Rscript matrix-clustering.R                           \
  -i data/OCT4_datasets/OCT4_motif_table.txt          \
  -o results/OCT4_motifs_clusters/OCT4_motif_analysis \
  -w 8                              

 

Example 2

In this example we reproduce the clustering of JASPAR nematodes (43 motifs corresponding to 12 TF classes). We use the option --radial_tree TRUE to force the alignment of all motifs (as if they were in a single cluster). This alignment is displayed in a radial (circular) visualization.

Users can provide motif metadata with the option -a.

⏳ Running time: ~2 minutes

Rscript matrix-clustering.R                           \
  -i data/JASPAR_2022/Jaspar_nematodes_motifs_tab.txt \
  -o results/JASPAR_nematodes_radial/JASPAR_nematodes \
  -a data/JASPAR_2022/JASPAR_nematodes_metadata.txt   \
  --radial_tree TRUE                                  \
  --title JASPAR_CORE_nematodes                       \
  -w 8

   

Example 3

We cluster the JASPAR 2022 plants motif collection (656 motifs) and compare the clusters detected by RSAT matrix-clustering against a user-provided reference annotation (in this case, the transcription factor classes). We calculate the Adjusted Rand Index (ARI), a single-value metric (ranging from -1 to +1), corrected for chance agreement, that reflects the proportion of consistent pairs between two classifications. In this example, the ARI measures the proportion of motif pairs that are consistently classified between the RSAT matrix-clustering results and the reference TF classes. We consider that a motif pair is consistently classified when the two motifs either belong to the same class and are co-clustered, or belong to different classes and are not co-clustered.

The calculation of the ARI is a new functionality of RSAT matrix-clustering. In this example, using the default parameters, we obtained ARI = 0.38; changing the parameters may increase or decrease the resulting ARI.

⏳ Running time: ~5 minutes

Rscript matrix-clustering.R                         \
  -i data/JASPAR_2022/Jaspar_plants_motifs_tab.txt  \
  -o results/Jaspar_plants/Jaspar_plants            \
  -w 8                                              \
  --ARI TRUE                                        \
  -a data/JASPAR_2022/Jaspar_2022_plants_TF_fam.tab 
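
For readers who want to see what the ARI measures, here is a minimal base R sketch of the standard formula computed from the contingency table between cluster labels and reference classes. The package list above mentions flexclust for this calculation; the function adjusted_rand_index and the toy labels below are illustrative only and do not reproduce the script's own code.

```r
# Hypothetical helper implementing the standard ARI formula.
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  tab <- unclass(table(x, y))                  # contingency table as a plain matrix
  sum_cells <- sum(choose(tab, 2))             # pairs placed together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))    # pairs placed together in x
  sum_cols  <- sum(choose(colSums(tab), 2))    # pairs placed together in y
  expected  <- sum_rows * sum_cols / choose(length(x), 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Toy example: six motifs, two clusters, three reference classes.
clusters <- c("cluster_01", "cluster_01", "cluster_01",
              "cluster_02", "cluster_02", "cluster_02")
classes  <- c("CAMTA", "CAMTA", "BBR/BPC",
              "BBR/BPC", "AP2/EREBP", "AP2/EREBP")

ari_toy       <- adjusted_rand_index(clusters, classes)
ari_identical <- adjusted_rand_index(clusters, clusters)
print(round(c(toy = ari_toy, identical = ari_identical), 3))
```

Two identical partitions give an ARI of 1, and partitions that agree no better than chance give values near 0, which is why the ARI is convenient for comparing runs with different thresholds.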

   

📜 Input files

Motifs (Mandatory)

This version of RSAT matrix-clustering relies on the R/Bioconductor package universalmotif for the motif manipulation steps. This is the list of supported TF motif formats:

  • cluster-buster
  • cisbp
  • homer
  • jaspar
  • meme
  • transfac
  • uniprobe

In case your motif format is not in this list, please contact me to add it.

 

Matrix file table (Mandatory)

To avoid long commands when the input consists of many motif collections, we opted for a simple file format. The input file (-i) must be a tab-delimited file providing the following information (in the following order; no header):

  1. Motif file path
  2. Collection name: an alias given to the motif file
  3. Motif format: see above for the list of supported formats.

Each line in this table should correspond to a different motif file.

If a file path is duplicated it will be considered only once.

If a collection name is duplicated, the program will stop, because collection names are needed to create unique motif IDs.

Input motifs may be in different formats.

Example:

data/OCT4_datasets/HOMER_OCT4_motifs.homer    HOMER_motifs	  homer
data/OCT4_datasets/MEME_OCT4_motifs.meme      MEME_motifs	  meme
data/OCT4_datasets/RSAT_OCT4_motifs.tf        RSAT_motifs	  tf
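
The rules above (three tab-separated columns, no header, duplicated paths counted once, duplicated collection names rejected) can be checked before launching a run. The sketch below writes a small table to a temporary file and applies those checks in base R; the column names path, collection and format are labels chosen for this example, and the checks only mirror the rules as described here rather than the script's own validation.

```r
# Write a small matrix file table (tab-delimited, no header) to a temporary file.
table_file <- tempfile(fileext = ".txt")
writeLines(c(
  "data/OCT4_datasets/HOMER_OCT4_motifs.homer\tHOMER_motifs\thomer",
  "data/OCT4_datasets/MEME_OCT4_motifs.meme\tMEME_motifs\tmeme",
  "data/OCT4_datasets/RSAT_OCT4_motifs.tf\tRSAT_motifs\ttf",
  "data/OCT4_datasets/RSAT_OCT4_motifs.tf\tRSAT_motifs\ttf"   # duplicated path
), table_file)

motif_table <- read.delim(table_file, header = FALSE,
                          col.names = c("path", "collection", "format"),
                          stringsAsFactors = FALSE)

# Every line must provide the three fields.
stopifnot(!anyNA(motif_table))

# A duplicated file path is considered only once.
motif_table <- motif_table[!duplicated(motif_table$path), ]

# A duplicated collection name is an error.
repeated <- unique(motif_table$collection[duplicated(motif_table$collection)])
if (length(repeated) > 0) {
  stop("Duplicated collection names: ", paste(repeated, collapse = ", "))
}

print(motif_table)
```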

 

Motif annotation table (Optional)

Users can provide a reference table that may be used for two purposes:

  1. Compare the resulting clusters against a user-defined annotation (i.e., how well the resulting clusters resemble the reference classes in the annotation file). When this file is provided together with --ARI TRUE, the adjusted Rand index (ARI) will be calculated.
  2. Add some color features (class numbers and background) in a radial tree. Note that these features are only used with the parameter --radial_tree TRUE.

The reference table (-a) must be a tab-delimited file providing at least the following columns (extra columns are ignored):

  1. motif_id
  2. class
  3. collection
  4. url (this may be an empty column but the column name is expected)

The motif IDs in this reference table must be the same IDs as in the motif files; otherwise the program will stop.

The collection names in the reference table must be the same as those in the matrix file table; otherwise the program will stop.

Example:

motif_id  class       collection     url
MA1404.1	BBR/BPC     JASPAR_plants  https://jaspar.uio.no/matrix/MA1404.1
MA1403.1	BBR/BPC     JASPAR_plants  https://jaspar.uio.no/matrix/MA1403.1
MA1402.1	BBR/BPC     JASPAR_plants  https://jaspar.uio.no/matrix/MA1402.1
MA1197.1	CAMTA       JASPAR_plants  https://jaspar.uio.no/matrix/MA1197.1
MA0969.1	CAMTA       JASPAR_plants  https://jaspar.uio.no/matrix/MA0969.1
MA0970.1	CAMTA       JASPAR_plants  https://jaspar.uio.no/matrix/MA0970.1
MA0975.1	AP2/EREBP   JASPAR_plants  https://jaspar.uio.no/matrix/MA0975.1
MA0976.2	AP2/EREBP   JASPAR_plants  https://jaspar.uio.no/matrix/MA0976.2
MA1376.1	AP2/EREBP   JASPAR_plants  https://jaspar.uio.no/matrix/MA1376.1
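
Since the program stops on any mismatch, it can save time to check the annotation table beforehand. The following sketch assumes the motif IDs and collection names are already available as character vectors; check_annotation is a hypothetical helper, not a function of this repository, and it only tests that the annotation does not mention unknown IDs or collections.

```r
annotation <- read.delim(text = paste(
  "motif_id\tclass\tcollection\turl",
  "MA1404.1\tBBR/BPC\tJASPAR_plants\thttps://jaspar.uio.no/matrix/MA1404.1",
  "MA1197.1\tCAMTA\tJASPAR_plants\thttps://jaspar.uio.no/matrix/MA1197.1",
  sep = "\n"), stringsAsFactors = FALSE)

# Hypothetical helper: stops with a message on the first problem found.
check_annotation <- function(annotation, motif_ids, collections) {
  required <- c("motif_id", "class", "collection", "url")
  missing_cols <- setdiff(required, colnames(annotation))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  unknown_ids <- setdiff(annotation$motif_id, motif_ids)
  if (length(unknown_ids) > 0) {
    stop("Motif IDs not found in the motif files: ",
         paste(unknown_ids, collapse = ", "))
  }
  unknown_collections <- setdiff(annotation$collection, collections)
  if (length(unknown_collections) > 0) {
    stop("Collections not found in the matrix file table: ",
         paste(unknown_collections, collapse = ", "))
  }
  invisible(TRUE)
}

# Passes: IDs and collection match.
ok <- check_annotation(annotation,
                       motif_ids = c("MA1404.1", "MA1197.1"),
                       collections = "JASPAR_plants")

# Fails: the collection name differs from the one in the matrix file table.
bad <- tryCatch(
  check_annotation(annotation,
                   motif_ids = c("MA1404.1", "MA1197.1"),
                   collections = "JASPAR_2022"),
  error = function(e) conditionMessage(e))
print(bad)
```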

   

🔮 Example output:

This is the folder structure after running this software:

results
├── *_motifs
│   ├── individual_motifs_with_gaps
│   │   └── Two files per motif (direct and reverse orientation) in transfac format, these motifs are already aligned (may contain gaps).
│   │
│   ├── motifs_sep_by_cluster   (each folder contains the motifs belonging to a cluster)
│   │   └── Cluster_01
│   │   └── Cluster_02
│   │   └── ...
│   │   └── Cluster_N
│   │
│   └── root_motifs
│       └── Root_motifs.tf  (also referred as archetype motifs)
│
├── *_plots
│   ├── Clusters_vs_reference_contingency_table.pdf
│   └── Heatmap_clusters.pdf
│   
├── *_tables
│   ├── alignment_table.tab
│   ├── clusters.tab
│   ├── distance_table.tab
│   ├── pairwise_motif_comparison.tab
│   └── summary_table.tab
│
└── *_trees
    ├── tree.json
    ├── tree.newick
    └── tree.RData

 

Example 1

The main output of this analysis is an html file containing the logo forest (multiple hierarchical trees, each representing a cluster with aligned motifs).

 

The analysis produces the file named alignment_table.tab which contains one line per motif with its corresponding cluster name, orientation in the alignment, the number of upstream/downstream gaps, the aligned consensus, and the alignment width.

cluster	    id	                        name                consensus         rc_consensus                  strand  offset_up offset_down aligned_consensus    alignment_width

cluster_01  RSAT_positions_7nt_m1_n9    positions_7nt_m1    NNATTTGCATATGCAAATNN    NNATTTGCATATGCAAATNN    R       4         0   ----NNATTTGCATATGCAAATNN	24
cluster_01  MEME_MEME_ChIP_1_n1         MEME_ChIP_1         ATGYWAA                 TTWRCAT                 R       7         10  -------TTWRCAT----------	24
cluster_01  MEME_MEME_ChIP_15_n15       MEME_ChIP_15        TATGCAAAT               ATTTGCATA               R       6         9   ------ATTTGCATA---------	24
cluster_01  RSAT_local_words_7nt_m3_n7  local_words_7nt_m3  NNATATGCAAATNN          NNATTTGCATATNN          R       4         6   ----NNATTTGCATATNN------	24
cluster_01  RSAT_oligos_7nt_mkv5_m1_n1  oligos_7nt_mkv5_m1  NNATGCAAATNN            NNATTTGCATNN            R       4         8   ----NNATTTGCATNN--------	24
cluster_01  RSAT_local_words_7nt_m2_n6  local_words_7nt_m2  NHATTTGCATAACAAWNN      NNWTTGTTATGCAAATDN      D       4         2   ----NHATTTGCATAACAAWNN--	24
cluster_01  HOMER_homer_1_n1            homer_1             YWTTNWNATGCAAA          TTTGCATNWNAAWR          R       7         3   -------TTTGCATNWNAAWR---	24
cluster_01  RSAT_local_words_7nt_m4_n8  local_words_7nt_m4  NNATTGTTATGCATAACAATNN  NNATTGTTATGCATAACAATNN  D       0         2   NNATTGTTATGCATAACAATNN--	24
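
The columns of alignment_table.tab are related to each other, which gives a simple way to sanity-check the file after a run. The sketch below rebuilds the aligned consensus from the offsets for three of the rows shown above. It assumes the file is tab-delimited with the header shown, and that strand R means the reverse-complement consensus is the one displayed (this is inferred from the example rows, not from the script).

```r
alignment <- read.delim(text = paste(
  "cluster\tid\tname\tconsensus\trc_consensus\tstrand\toffset_up\toffset_down\taligned_consensus\talignment_width",
  "cluster_01\tRSAT_positions_7nt_m1_n9\tpositions_7nt_m1\tNNATTTGCATATGCAAATNN\tNNATTTGCATATGCAAATNN\tR\t4\t0\t----NNATTTGCATATGCAAATNN\t24",
  "cluster_01\tMEME_MEME_ChIP_1_n1\tMEME_ChIP_1\tATGYWAA\tTTWRCAT\tR\t7\t10\t-------TTWRCAT----------\t24",
  "cluster_01\tRSAT_local_words_7nt_m2_n6\tlocal_words_7nt_m2\tNHATTTGCATAACAAWNN\tNNWTTGTTATGCAAATDN\tD\t4\t2\t----NHATTTGCATAACAAWNN--\t24",
  sep = "\n"), stringsAsFactors = FALSE)

# Consensus shown in the alignment: reverse complement for strand R, direct for D.
shown <- ifelse(alignment$strand == "R",
                alignment$rc_consensus, alignment$consensus)

# Pad with gaps on both sides according to the offsets.
rebuilt <- paste0(strrep("-", alignment$offset_up),
                  shown,
                  strrep("-", alignment$offset_down))

checks <- data.frame(
  id = alignment$id,
  same_as_file = rebuilt == alignment$aligned_consensus,
  width_ok = nchar(rebuilt) == alignment$alignment_width)
print(checks)
```

To run the same check on real output, replace the text argument with the path to the alignment_table.tab file in the *_tables folder.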

The file clusters.tab contains one line per cluster with the motif IDs and names.

cluster     id                                                              name
cluster_01  MEME_MEME_ChIP_3_n3,RSAT_oligos_7nt_mkv5_m4_n4,HOMER_homer_9_n9 MEME_ChIP_3,oligos_7nt_mkv5_m4,homer_9
cluster_02  HOMER_homer_6_n6,HOMER_homer_16_n16,HOMER_homer_18_n18          homer_6,homer_16,homer_18
cluster_03  MEME_MEME_ChIP_4_n4,MEME_MEME_ChIP_7_n7                         MEME_ChIP_4,MEME_ChIP_7
cluster_04  RSAT_oligos_7nt_mkv5_m3_n3                                      oligos_7nt_mkv5_m3
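
Because clusters.tab stores the members of each cluster as comma-separated lists, downstream analyses often need one row per motif instead. A minimal base R sketch, assuming the file is tab-delimited with the header shown above (the object names are illustrative):

```r
clusters <- read.delim(text = paste(
  "cluster\tid\tname",
  "cluster_03\tMEME_MEME_ChIP_4_n4,MEME_MEME_ChIP_7_n7\tMEME_ChIP_4,MEME_ChIP_7",
  "cluster_04\tRSAT_oligos_7nt_mkv5_m3_n3\toligos_7nt_mkv5_m3",
  sep = "\n"), stringsAsFactors = FALSE)

# Split the comma-separated fields; IDs and names are listed in the same order.
ids   <- strsplit(clusters$id, ",", fixed = TRUE)
names_per_cluster <- strsplit(clusters$name, ",", fixed = TRUE)
stopifnot(identical(lengths(ids), lengths(names_per_cluster)))

# One row per motif.
cluster_members <- data.frame(
  cluster = rep(clusters$cluster, lengths(ids)),
  id      = unlist(ids),
  name    = unlist(names_per_cluster),
  stringsAsFactors = FALSE)

cluster_sizes <- table(cluster_members$cluster)
print(cluster_members)
print(cluster_sizes)
```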

If the option --export_heatmap TRUE is set, the file Heatmap_clusters.pdf will be generated. This is an N x N heatmap, where N is the number of motifs and each cell represents the similarity between two motifs. The color annotation bar corresponds to the clusters.

 

Example 2

When the users activate the option --radial_tree TRUE, all the motifs are forced to be aligned in a single cluster. The output is an html document containing code in D3 (a JavaScript library). Open this document in a web browser to visualize the results.

You can zoom in/out using the mouse and change the motif orientation by clicking on the red buttons on the page top. Click on the Hide/Show legend button to ease the readability.

   

   

   

⚠️ This html file may not be displayed properly by all browsers; we recommend Firefox.

⚠️ This html file may have to be opened from a web server, which may require having Apache installed and running. More information in the Extra section below.

   

Example 3

When the users provide a reference annotation table (argument -a or --annotation_table), the script produces a contingency table comparing the resulting clusters with the reference groups. This table is visualized as a heatmap in the file Clusters_vs_reference_contingency_table.pdf.

   

🔧 Arguments and Options

Mandatory arguments

  • -i or --matrix_file_table : A tab-delimited file where each line contains the following fields/columns. It does not expect a header, but it expects these columns in the indicated order.

    1. Motif file path
    2. Motif collection name
    3. Motif format.
  • -o or --output_folder : Folder to save the results.

Comparison + Clustering arguments

  • -m or --comparison_metric : Comparison metric used to build the hierarchical tree. Default: Ncor. [Options: cor, Ncor].
  • -l or --linkage_method : Linkage/agglomeration method to build the hierarchical tree. Default: average. [Options: average, complete, single].
  • -c or --cor_th : Pearson correlation lower threshold. Default: 0.75. [Options: any value between -1 and +1].
  • -n or --Ncor_th : Normalized Pearson correlation lower threshold. Default: 0.55. [Options: any value between -1 and +1].

Output files

  • -M or --minimal_output : Only returns the aligned motifs and the alignment, clusters and motif description tables. Comparison results, plots and trees are not exported. Default : FALSE. [Options: TRUE, FALSE].
  • --export_newick : Export hierarchical tree in Newick format. Default : FALSE. [Options: TRUE, FALSE].
  • --export_heatmap : Export heatmap with clusters in PDF. Default : FALSE. [Options: TRUE, FALSE].

Annotation table

  • -a or --annotation_table: motif annotation table, one line per motif. The provided color will be used as background in the radial tree, and the class name will be shown as an annotation layer (ring) in the radial tree. It must be a tab-delimited file with the following columns (additional columns are ignored):

    1. motif_id
    2. class
    3. collection
    4. url (may be an empty column but the header is expected)

Radial trees

  • --radial_tree: When this option is activated all the motifs are forced to be aligned in a single cluster. Note : this option is under active development.

Others

  • -w or --number_of_workers : Number of cores to run in parallel. Default: 2. [Options: depends on your machine].
  • --heatmap_color_palette : Cell colors in clusters heatmap. Default: RdGy. [Options: any colorBrewer palette, see colorbrewer2.org for details].
  • --color_palette_classes : Number of classes to create color palette in clusters heatmap. Default: 11. [Options: depends on the selected colorBrewer palette, see colorbrewer2.org for details].
  • --ARI : when this option is TRUE and an annotation table is provided (--annotation_table), the program will calculate the ARI (partition similarity) between the resulting clusters and the provided annotation classes.
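
The threshold options (-c, -n) and --ARI can be combined in a manual parameter sweep, for instance to compare the ARI obtained with different thresholds. The sketch below only builds the command lines, using options documented above and the input files of Example 3. The threshold values and the results/threshold_sweep output layout are arbitrary choices made for this illustration.

```r
# Thresholds to explore (arbitrary values around the defaults).
grid <- expand.grid(cor_th  = c(0.65, 0.75, 0.85),
                    Ncor_th = c(0.45, 0.55, 0.65))

# One command per combination; each run gets its own output folder.
commands <- sprintf(
  paste("Rscript matrix-clustering.R",
        "-i data/JASPAR_2022/Jaspar_plants_motifs_tab.txt",
        "-a data/JASPAR_2022/Jaspar_2022_plants_TF_fam.tab",
        "-o results/threshold_sweep/cor%.2f_Ncor%.2f/Jaspar_plants",
        "--ARI TRUE -c %.2f -n %.2f -w 8"),
  grid$cor_th, grid$Ncor_th, grid$cor_th, grid$Ncor_th)

writeLines(commands)

# To launch the runs from R (each one may take several minutes):
# for (cmd in commands) system(cmd)
```

Printing the commands first makes it easy to review them, or to paste them into a job scheduler, before spending the computing time.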

   

💥 Contact + Contributors + Report issues

Contributors

   

This repository is maintained by Jaime A Castro-Mondragon.

📧 j.a.castro.mondragon@gmail.com 📧 jacmondragon@nykode.com

Twitter: @jaimicore

Use this space to report issues related to this repository.

   

📌 To Do/Wishlist

  • When calculating the ARI, implement an option to find an optimal threshold through a grid search approach.
  • Generate the interactive html output motif trees.
  • Implement the option to annotate clusters.
  • Trim root motifs
  • Detect the central motif within each cluster.
  • Export motif collection intersection stats.

   

‼️ Extra

This repository also contains the script convert-matrix which is a simplified version of the RSAT convert-matrix tool, for motif manipulation and format conversion.

For the moment this script has three main functions:

  1. Motif format conversion, see above for the supported formats.
  2. Export reverse-complement of the input motifs
  3. Trim motifs (remove columns with low information content).

Simple motif conversion from transfac to jaspar format.

Rscript convert-matrix.R                    \
  -i data/OCT4_datasets/RSAT_OCT4_motifs.tf \
  --from tf --to jaspar                     \
  --output_file results/convert-matrix_examples/RSAT_OCT4_motifs.jaspar

Simple motif conversion from transfac to jaspar format with reverse-complement motifs. The file with the reverse-complement motifs has the suffix _rc in its name. In this example:

  • Output: results/convert-matrix_examples/RSAT_OCT4_motifs.jaspar
  • Output (RC): results/convert-matrix_examples/RSAT_OCT4_motifs_rc.jaspar

Rscript convert-matrix.R                    \
  -i data/OCT4_datasets/RSAT_OCT4_motifs.tf \
  --from tf --to jaspar                     \
  --rc TRUE                                 \
  --output_file results/convert-matrix_examples/RSAT_OCT4_motifs.jaspar

Simple motif conversion from transfac to jaspar format after motif trimming.

Rscript convert-matrix.R                    \
  -i data/OCT4_datasets/RSAT_OCT4_motifs.tf \
  --from tf --to jaspar                     \
  --trim TRUE                               \
  --IC_threshold 0.25                       \
  --spike_IC_threshold 0.25                 \
  --trim_values_output results/convert-matrix_examples/RSAT_OCT4_motifs_trim_values.txt \
  --output_file results/convert-matrix_examples/RSAT_OCT4_motifs_trimmed.jaspar

In this figure we show the advantages of using a window-based approach to trim the motifs instead of using a single value, we use as example the IRF7 motif from JASPAR.
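
To illustrate why a window helps, the sketch below compares trimming with a single per-column IC threshold against a simple window rule on a toy IC profile with a spike in the left flank. This is not the algorithm implemented in convert-matrix.R, and it does not reproduce the exact meaning of --IC_threshold and --spike_IC_threshold. The functions column_ic, trim_single and trim_window, the window size of 3, and the rule "all columns in the window must reach the threshold" are assumptions made for this example.

```r
# Information content per column (bits), assuming a uniform background.
column_ic <- function(pfm) {
  apply(pfm, 2, function(p) {
    p <- p[p > 0]                 # treat 0 * log2(0) as 0
    2 + sum(p * log2(p))
  })
}

# Single-value rule: keep everything between the first and last column above th.
trim_single <- function(ic, th) {
  keep <- which(ic >= th)
  if (length(keep) == 0) return(integer(0))
  seq(min(keep), max(keep))
}

# Window rule: keep from the first to the last window whose columns all reach th.
trim_window <- function(ic, th, window = 3) {
  n <- length(ic)
  if (n < window) return(integer(0))
  starts <- seq_len(n - window + 1)
  passes <- vapply(starts,
                   function(i) all(ic[i:(i + window - 1)] >= th),
                   logical(1))
  ok <- starts[passes]
  if (length(ok) == 0) return(integer(0))
  seq(min(ok), max(ok) + window - 1)
}

# Two reference columns: a fully conserved one and a uniform one.
ic_reference <- column_ic(cbind(c(1, 0, 0, 0), rep(0.25, 4)))

# Toy IC profile: isolated spike at position 1, informative core at 5-7.
ic <- c(1.5, 0.0, 0.1, 0.0, 1.8, 1.9, 1.7, 0.1, 0.0)

kept_single <- trim_single(ic, th = 0.25)
kept_window <- trim_window(ic, th = 0.25)
print(list(single = kept_single, window = kept_window))
```

With the single-value rule the spike at position 1 prevents the left flank from being trimmed, whereas the window rule keeps only the core columns.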

 

   

Install apache

Unfortunately, to visualize the content of the html file containing the radial tree, apache2 must be installed and the html file must be opened through localhost. If you don't do this, you will not see the html content.

To install apache you can follow these instructions.

Once apache is installed in your computer:

  1. Remove the folder /var/www/html: sudo rm -rf /var/www/html
  2. Copy the result folder (including all directories) to /var/www/ . You will need sudo permissions.
  3. Open your browser and type localhost. Now you can browse the files in /var/www/
  4. Search and open the file *.html

🎉 Acknowledgements

We thank the JASPAR curation team for their input to improve RSAT matrix-clustering; the RSAT developer team for their constant support across many years of collaboration; and the users for their advice, suggestions and bug reports 🪲.

Special thanks to my colleagues Ieva Rauluseviciute (and her gentle reminders 😒 that pushed me to write this stand-alone version), Vipin Kumar and Katalin Ferenc (from Anthony Mathelier's lab) for testing this software, the discussions, ideas, and their suggestions of R libraries that make this script faster than the original version.

   

📃 How to cite this software?

If you use this software, please cite its own publication and/or the latest RSAT publication.
