The growing volume of genomic data has created a need for functional annotators that are less demanding and more user-friendly. At present, it is hard to find a single pipeline that annotates several kinds of functional data at once, such as enzyme commission (E.C.) numbers, orthology identifiers (KEGG and eggNOG), protein names, gene names, alternative names, and descriptions. Here, we provide a new solution that combines different algorithms (BLAST, DIAMOND, HMMER3) and databases (UniProtKB, KOfam, NCBIFAMs, TIGRFAMs, and Pfam), and also overcomes data download challenges. The Arche analysis pipeline runs these tools in a unique order, which gives it several advantages over other commonly used annotators.
Before you download Arche (13 GB), make sure GeneMarkS-2 (GMS2) is working properly on your computer. Because GMS2 requires a (free) licence, you must download it manually:
Download GeneMarkS-2 and key from http://exon.gatech.edu/GeneMark/license_download.cgi
tar xvfz gms2_linux_[version].tar.gz
Move the directory to the desired location and add the binaries to your PATH (e.g. add export PATH=$PATH:</path/to/gms2_linux_[version]> to your ~/.bashrc file)
Configure the key you've downloaded
gunzip gm_key.gz
cp gm_key ~/.gmhmmp2_key
or
cp gm_key ~/.gm_key
Test the software
gms2.pl --seq YOUR_GENOME
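Before running the test above, it can help to confirm that the two prerequisites from the previous steps are in place. The following is a minimal sketch, not part of Arche or GeneMarkS-2: `check_gms2` is a hypothetical helper name, and it only checks that `gms2.pl` is on the PATH and that one of the two key file names mentioned above exists in your home directory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Arche or GMS2): report whether
# GeneMarkS-2 looks ready to use on this machine.
check_gms2() {
    rc=0
    # 1. The gms2.pl launcher must be reachable through PATH.
    if command -v gms2.pl >/dev/null 2>&1; then
        echo "ok: gms2.pl found at $(command -v gms2.pl)"
    else
        echo "missing: gms2.pl is not on PATH"
        rc=1
    fi
    # 2. The licence key may be stored under either of these names.
    if [ -f "$HOME/.gmhmmp2_key" ] || [ -f "$HOME/.gm_key" ]; then
        echo "ok: licence key found in $HOME"
    else
        echo "missing: neither ~/.gmhmmp2_key nor ~/.gm_key exists"
        rc=1
    fi
    return "$rc"
}
```

Calling `check_gms2` prints one line per check and returns a non-zero status if either one fails. It does not verify that the key is still valid; only the `gms2.pl --seq YOUR_GENOME` test does that.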
To install the other dependencies, you will need the Anaconda distribution. Download and install it from https://www.anaconda.com/download/success
conda create -n arche_annotator diamond=2.0.14 bedtools=2.27.0 p7zip=16.02 barrnap=0.9 hmmer=3.3.2 prodigal=2.6.3 blast=2.12.0 fasta3=36.3.8i ucsc-fasomerecords=455 trnascan-se=2.0.9 gdown -c bioconda -c conda-forge
This command will create a conda environment for future Arche runs, installing specific package versions from the bioconda and conda-forge channels.
The program, together with the pre-formatted databases and mapping files, can be downloaded (13 GB) from the command line using gdown:
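Once the environment exists, a quick way to see whether its tools are visible is to look each one up on the PATH. The sketch below assumes a hypothetical helper named `missing_tools`; the executable names you pass to it are your own choice and are not defined by this README (for example, `hmmscan` for the hmmer package or `ssearch36` for fasta3).

```shell
#!/bin/sh
# Hypothetical helper: print every command name given as an argument
# that cannot be found on PATH. Returns non-zero if any is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

After `conda activate arche_annotator`, a call such as `missing_tools diamond bedtools barrnap hmmscan prodigal blastp ssearch36 gdown` prints nothing when every listed executable is available, and prints the missing names otherwise.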
conda activate arche_annotator
gdown --fuzzy "https://drive.google.com/file/d/1x9caXGPpYXCHUoodOdnuJI0tCDe9qtGG/view?usp=sharing"
Once the download is finished:
tar -xvf arche_[version].tar (move the output directory to the desired place)
cd arche_[version]/bin/
chmod +x arche.sh
./arche.sh --install
You should add the bin directory to your PATH (e.g. add export PATH=$PATH:</path/to/arche_[version]/bin> to your ~/.bashrc file)
If the installation or a run fails:
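Editing ~/.bashrc by hand works, but repeated installs tend to leave duplicate lines behind. As an illustration only, the sketch below appends the export line once and skips it on later runs. `add_path_line` is a hypothetical name, and the second argument (the rc file) defaults to ~/.bashrc.

```shell
#!/bin/sh
# Hypothetical helper: append an "export PATH" line for a directory
# to an rc file, but only if that exact line is not already there.
add_path_line() {
    dir=$1
    rc_file=${2:-$HOME/.bashrc}
    line="export PATH=\$PATH:$dir"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if grep -qxF "$line" "$rc_file" 2>/dev/null; then
        echo "already present in $rc_file"
    else
        printf '%s\n' "$line" >> "$rc_file"
        echo "added to $rc_file"
    fi
}
```

The same helper applies to the GeneMarkS-2 directory from the earlier step. Open a new terminal, or source the rc file, for the change to take effect.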
- Check you are working within the conda environment you've created ("conda activate arche_annotator")
- Check you have properly installed GeneMarkS-2
- If you have already run the command ./arche.sh --install, open the arche.sh script using a text editor and in the section "Main directory" (first lines) replace the string after DIR= with the full path of the working directory, e.g. /home/YOUR_USER/arche_1.0.1
- Delete the Arche directory, extract the tar file again, and reinstall
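The manual edit of the DIR= line described above can also be scripted. The sketch below is an assumption-laden illustration: it assumes the line starts exactly with `DIR=` at the beginning of a line, which has not been checked against the actual arche.sh, and `set_arche_dir` is a hypothetical name. It keeps a backup copy and rewrites only the first matching line.

```shell
#!/bin/sh
# Hypothetical helper: replace the first line starting with "DIR=" in a
# script with DIR="<new path>". A copy is kept as <script>.bak.
# ASSUMPTION: arche.sh defines its main directory on a line beginning
# with DIR= ; inspect the file first if unsure.
set_arche_dir() {
    script=$1
    new_dir=$2
    cp "$script" "$script.bak" || return 1
    awk -v d="$new_dir" '
        !seen && /^DIR=/ { print "DIR=\"" d "\""; seen = 1; next }
        { print }
    ' "$script.bak" > "$script"
}
```

For example, `set_arche_dir arche.sh /home/YOUR_USER/arche_1.0.1` would perform the replacement shown in the list above. If the result looks wrong, restore the original from `arche.sh.bak`.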
arche.sh -n ecoli -t 20 -r 40 e_coli.fna
arche.sh -n halorubrum -a ssearch -k arch halorubrum_sp_DM2.fa
arche.sh -n seawater_metagenome -k meta seawater_metagenome.fna
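The examples above annotate one genome per call. To process a whole directory, the calls can be wrapped in a loop. The following sketch uses only the `-n` and `-t` options documented in this README. The function names, the `.fna` extension filter, and the `DRY_RUN` and `THREADS` variables are illustrative assumptions.

```shell
#!/bin/sh
# Derive a run name from a FASTA path: /data/e_coli.fna -> e_coli
run_name() {
    base=$(basename "$1")
    echo "${base%.*}"
}

# Hypothetical wrapper: annotate every .fna file in a directory, one
# Arche run per genome. Set DRY_RUN=1 to print the commands instead
# of executing them; THREADS sets -t (defaults to 1).
annotate_dir() {
    for genome in "$1"/*.fna; do
        [ -e "$genome" ] || continue   # no match: the glob stays literal
        name=$(run_name "$genome")
        if [ -n "${DRY_RUN:-}" ]; then
            echo "arche.sh -n $name -t ${THREADS:-1} $genome"
        else
            arche.sh -n "$name" -t "${THREADS:-1}" "$genome"
        fi
    done
}
```

Running `DRY_RUN=1 annotate_dir ./genomes` first is a cheap way to confirm the run names before committing to long jobs.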
Here you can download a sample which includes the annotation of Escherichia coli K12 with several tools including Arche:
The final table with all the annotations comes in two flavours:
[...]_omic_table.tbl which can be examined in the Linux console with the command:
column -ts "|" [example]_omic_table.tbl | less -S
[...]_omic_table.tsv which can be opened using spreadsheet editors like Microsoft Excel, LibreOffice Calc, etc.
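The .tsv flavour is also convenient for quick command-line summaries. The sketch below makes no assumption about Arche's column names. It does assume that the first row of the table is a header, which should be verified on a real output file, and both function names are hypothetical.

```shell
#!/bin/sh
# List the columns of a tab-separated table as "index: name",
# taking the names from the first row (assumed to be a header).
list_columns() {
    awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Count data rows (header excluded) whose given column is non-empty.
count_filled() {
    awk -F'\t' -v c="$2" 'NR > 1 && $c != "" { n++ } END { print n + 0 }' "$1"
}
```

A typical use is to run `list_columns` on the table to find the index of the field of interest, then pass that index to `count_filled` to see how many features received an annotation in that field.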
File(s) | Description |
---|---|
rRNA.tsv | GFF v3 file containing rRNA annotations. |
rRNA.fna | FASTA file of all rRNA features. |
tRNA.tsv | Table with tRNA details (coordinates, isotype, anticodon, scores, etc). |
[...]_struc_annot.fna | FASTA file of all genomic features (nucleotide). |
[...]_struc_annot.faa | FASTA file of translated coding genes (amino acid). |
heuristic[...]_out | Output matches of the search instance(s) performed with BLASTp, DIAMOND or SSEARCH36. |
heuristic[...]_non_match.faa | FASTA file with the remaining non-matched sequences after the search instance(s) performed with BLASTp, DIAMOND or SSEARCH36. |
hmmscan_[...]_out | HMMER3 output table of the search instance(s) performed against a specific HMMDB. |
[HMMDB]_non_match.faa | FASTA file with the remaining non-matched sequences after the search instance performed against a specific HMMDB. |
[...]_omic_table.tbl | Feature table with fields separated by vertical bars. |
[...]_omic_table.tsv | Feature table with tab-separated fields. |
arche_report | File which includes the parameters of the run and results. |
-h, --help This help.
-i, --install Set up the executable location, and install databases.
-n, --name-files Name of the files to be created in the output directory, including the directory itself (default 'arche').
-o, --output Provide the full path to the directory where the output directory will be created. E.g. /home/user/ (default current).
-k, --kingdom Source of the contigs. Use 'arch' for archaeal genomes or 'meta' for metagenomes (default is for bacterial genomes).
-m, --mode Gives priority to Orthology (KO, eggNOG) or Enzyme Commission designed databases during the annotation. Use 'kegg' for KO-->eggNOG-->E.C., 'eggnog' for eggNOG-->KO-->E.C., or 'ec' for E.C.-->KO-->eggNOG (default will use a shorter swiss-prot KO·eggNOG·E.C. designed database with no priority).
-a, --alignment Select the algorithm to use during the protein alignment step: 'diamond' (accelerated blastp) or 'ssearch' (Smith-Waterman) (default 'blastp').
-t, --threads Number of threads to use (default '1').
-r, --memory Amount of RAM to use in GB (default '2').
-e, --evalue Similarity e-value cut-off (default '1e-08').
-q, --query-cov Minimum coverage on query protein (default '70').
-b, --bypass Use 'yes' to bypass the RNA gene prediction.
-v, --verbose Use 'yes' to turn on the verbose mode.
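Because several options accept only a small set of keywords, a wrapper script can reject typos before a long run starts. The sketch below checks values against the keywords listed above and nothing else. The function names are hypothetical, and whether Arche itself accepts further values (such as an explicit 'blastp') is not stated here, so those are treated as invalid.

```shell
#!/bin/sh
# Succeed only for the --mode keywords listed in the options above.
is_valid_mode() {
    case "$1" in
        kegg|eggnog|ec) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed only for the --kingdom keywords listed in the options above.
is_valid_kingdom() {
    case "$1" in
        arch|meta) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed only for the --alignment keywords listed in the options above.
is_valid_alignment() {
    case "$1" in
        diamond|ssearch) return 0 ;;
        *) return 1 ;;
    esac
}
```

In a wrapper, a guard such as `is_valid_kingdom "$k" || { echo "unknown kingdom: $k" >&2; exit 1; }` stops the script before arche.sh is called with a misspelled keyword.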
- Daniel Alonso
- email: gundizalvus16@hotmail.com