Tutorial
1:
An
Introduc1on
to
Linux
for
Next-‐Gen
DNA
Sequencing
Data
Analysis:
A
Hands-‐on
Tutorial
Dr
Stratos
Efstathiadis
efstra1os.efstathiadis@nyumc.org
Technical
Director
High
Performance
Compu1ng
Facility
Center
for
Health
Informa1cs
and
Bioinforma1cs
NYULMC
This is the First Part in a Series of tutorials in High
Performance Computing and Sequencing Informatics
• This tutorial will cover:
– Running basic Linux commands
• Logging-in, Files and Directories, File Editing, Running command line
tools.
– How to map millions of Next-Gen DNA reads onto a Genome
– A complete Next-Gen Seq. Read Alignment Example
– Using the Integrative Genome Viewer (IGV)
• This tutorial will Not cover:
– Sequencing technologies or alignment algorithms in detail.
– Advanced Linux concepts.
• File and Directory permissions, Environment variables, Shell scripts,
batch job submission, etc. will be covered in another tutorial.
Bioinformatics Requires Powerful
Computers
• One definition of bioinformatics is:
“The use of computers to analyze biological problems.
• As biological data sets have grown larger and
biological problems have become more complex, the
requirements for computing power have also grown.
• Computers that can provide this power generally use
the Unix operating system - so you must learn Unix
Will
computers
crash
Genomics?
SCIENCE,
11
February
2011,
pg
666
Remote Login using Secure Shell (ssh)
• Secure Shell: A set of tools that allow secure interaction with
remote servers (ssh, sshd, ssh-add, ssh-keygen, ssh-agent,
etc.) using two-factor authentication, based on
– What you know (a pass phrase)
– What you have (a key)
– ssh is the de-facto remote login mechanism in Linux/Unix.
rlogin, rsh, telnet, etc. use insecure protocols (transmit clear
text passwords) and are not used (actually, blocked).
How to login remotely on a Linux server
using Secure Shell (ssh)
$ ssh username@ec2-aa-ddd-cc-nn.compute-1.amazonaws.com
Username:
your
NYULMC
Kerberos
id
Hostname:
ec2-‐aa-‐ddd-‐cc-‐nn.compute-‐1.amazonaws.com
• Troubleshoo1ng:
ssh
–vvvv
username@hostname
• The
ssh-‐agent
stores
the
key
in
memory
(of
your
laptop/desktop).
First Linux Commands
-bash-3.2$ date
Wed Jan 2 13:28:49 EST 2013
-bash-3.2$ pwd (Present/current Working Directory)
/home/username
Home
directory
-bash-3.2$ ls (list files in the pwd)
-bash-3.2$ touch myfile (creates an empty file)
Linux
file
names
and
commands
are
Case
Sensi?ve:
myfile
is
different
than
MyFile
-bash-3.2$ LS
-‐bash:
LS:
command
not
found
Working
with
Directories
• Directories
are
a
means
of
organizing
your
files
on
a
Unix
computer.
– They
are
equivalent
to
folders
on
Windows
and
Macintosh
computers
• Directories
contain
files,
executable
programs,
and
sub-‐directories
• Understanding
how
to
use
directories
is
crucial
to
manipula1ng
your
files
on
a
Unix
system.
The
Tree
structure
of
the
file
system
Your
Home
Directory
• When
you
login
to
the
server,
you
always
start
in
your
Home
directory.
• Create
sub-‐directories
to
store
specific
projects
or
groups
of
informa1on,
just
as
you
would
place
folders
in
a
filing
cabinet.
• Do
not
accumulate
thousands
of
files
with
cryp1c
names
in
your
Home
directory
Shortcuts
• There
are
some
important
shortcuts
in
Unix
for
specifying
directories
•
.
(dot)
means
"the
current
directory"
•
..
means
"the
parent
directory"
-‐
the
directory
one
level
above
the
current
directory,
so
cd
..
will
move
you
up
one
level
•
~
(1lde)
means
your
Home
directory,
so
cd
~
will
move
you
back
to
your
Home.
– Just
typing
a
plain
cd
will
also
bring
you
back
to
your
home
directory
File
&
Directory
Commands
• This
is
a
minimal
list
of
Unix
commands
that
you
must
know
for
file
management:
ls (list) mkdir (make directory)
cd (change directory) rmdir (remove directory)
cp (copy) pwd (present working directory)
mv (move) more (view by page)
rm (remove) cat (view entire file on screen)
• All
of
these
commands
can
be
modified
with
many
op1ons.
Learn
to
use
Unix
man
pages
for
more
informa1on.
Copy
&
Move
• cp
lets
you
copy
a
file
from
any
directory
to
any
other
directory,
or
create
a
copy
of
a
file
with
a
new
name
in
one
directory
• cp filename.ext newfilename.ext
• cp filename.ext subdir/newname.ext
• cp /u/jdoe01/filename.ext ./subdir/newfilename.ext
• mv
allows
you
to
move
files
to
other
directories,
but
it
is
also
used
to
rename
files.
– Filename
and
directory
syntax
for
mv
is
exactly
the
same
as
for
the
cp
command.
• mv filename.ext subdir/newfilename.ext
– NOTE:
When
you
use
mv
to
move
a
file
into
another
directory,
the
current
file
is
deleted.
Delete
• Use
the
command
rm
(remove)
to
delete
files
• There
is
no
way
to
undo
this
command!!!
– We
have
set
the
server
to
ask
if
you
really
want
to
remove
each
file
before
it
is
deleted.
– You
must
answer
Y
or
else
the
file
is
not
deleted.
> ls
af151074.gb_pr5 test.seq
> rm test.seq
rm: remove test.seq? y
> ls
af151074.gb_pr5
Exercise:
Working
with
Files
and
Directories
-bash-3.2$ mkdir project1 (Make Directory)
-bash-3.2$ cd project1 (Change Directory)
-bash-3.2$ pwd
/home/efstae01/project1
-bash-3.2$ cp /data/tutorial/ChIPseq_chr19.fastq .
-bash-3.2$ ls –la
-bash-3.2$ head ChIPseq_chr19.fastq
@HWUSI-EAS610_0001:3:1:4:1405#0/1
GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
+
B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
@HWUSI-EAS610_0001:3:1:5:1490#0/1
GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG
+
B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A?
@HWUSI-EAS610_0001:3:1:6:388#0/1
CAGAGTTCATGAAATAGGCCTCTAGTCTTCCTAGAC
FASTQ
Files
36
bp
read
@HWUSI-EAS610_0001:3:1:4:1405#0/1
GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
+
B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
36
Quality
scores
• The de-facto file format for sharing sequence read data
• Sequence and a per-base quality score
• FASTQ Files are Text files
• 4 Lines per read
-bash-3.2$ wc –l ChIPseq_chr19.fastq
1082524 ChIPseq_chr19.fastq
Star?ng
emacs
• To
start
Emacs,
at
the
>
command
prompt,
just
type:
emacs
• To
use
Emacs
to
edit
a
file,
type:
emacs
filename
(where
filename
is
the
name
of
your
file)
• When
Emacs
is
launched,
it
opens
either
a
blank
text
window
or
a
window
containing
the
text
of
an
exis1ng
file.
Save
&
Exit
• To
save
a
file
as
you
are
working
on
it,
type:
Ctrl-‐x
»
Ctrl-‐s
• To
exit
emacs
and
return
to
the
Unix
shell,
type:
Ctrl
-‐x
»
Ctrl
-‐c
If
you
have
made
any
changes
to
the
file,
Emacs
will
ask
you
if
you
want
to
save:
Save file /u/browns02/nrdc.msf? (y,n,!,.,q,C-r or C-h)
• Type
“y”
to
save
your
changes
and
exit
• If
you
type
“n”,
then
it
will
ask
again:
Modified buffers exist; exit anyway? (yes or no)
• If
you
answer
“no”,
then
it
will
return
you
to
the
file,
you
must
answer
“yes”
to
exit
without
saving
changes
Copying
data
using
Secure
Copy
(scp)
Exercise:
Login
on
the
tutorial
Amazon
Instance
-‐bash-‐3.2$
cd
project1
-‐bash-‐3.2$
fastqc
ChIPseq_chr19.fastq
-‐bash-‐3.2$
ls
–l
Copy
the
file
ChIPseq_chr19.fastq.zip
to
your
laptop
using
scp:
Laptop>
scp
user@ec2-‐54-‐242-‐89-‐16.compute-‐1.amazonaws.com:~/
project1/ChIPseq_chr19_fastqc.zip
.
Unzip
it:
Laptop>
unzip
ChIPseq_chr19_fastqc.zip
Open
with
your
browser
fastqc-‐report.html
Short
Read
Alignment
• Given
a
reference
and
a
set
of
reads,
report
at
least
one
“good”
local
alignment
for
each
read
if
one
exists
– Approximate
answer
to:
where
in
genome
did
the
read
originate?
• What is “good”? For now, we concentrate on:
– Fewer mismatches is better …TGATCATA… beqer
than
…TGATCATA…
– Failing to align a low-quality GATCAA GAGAAT
base is better than failing to …TGATATTA… beqer
than
…TGATcaTA…
GATcaT GTACAT
align a high-quality base
Short
Read
Applica1ons
GGTATAC…
…CCATAG
TATGCGCCC
CGGAAATTT
CGGTATAC
…CCAT
CTATATGCG
TCGGAAATT
CGGTATAC
…CCAT
GGCTATATG
CTATCGGAAA
GCGGTATA
…CCA
AGGCTATAT
CCTATCGGA
TTGCGGTA
C…
Finding
the
…CCA
AGGCTATAT
GCCCTATCG
TTTGCGGT
C…
…CC
AGGCTATAT
GCCCTATCG
AAATTTGC
ATAC…
…CC
TAGGCTATA
GCGCCCTA
AAATTTGC
GTATAC…
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…
alignments
is
GAAATTTGC
GGAAATTTG
typically
the
performance
CGGAAATTT
CGGAAATTT
TCGGAAATT
boqleneck
CTATCGGAAA
CCTATCGGA
TTTGCGGT
GCCCTATCG
AAATTTGC
…CC
GCCCTATCG
AAATTTGC
ATAC…
…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…
Indexing
• Genomes
and
reads
are
too
large
for
direct
approaches
like
dynamic
programming
• Indexing
is
required
Suffix
tree
Suffix
array
Seed
hash
tables
Many
variants,
incl.
spaced
seeds
• Choice
of
index
is
key
to
performance
NGS
Read
Alignment
Burrows
Wheeler
Transforma1on
(BWT)
• Invented
by
David
Wheeler
in
1983
(Bell
Labs).
Published
in
1994.
“A
Block
Sor@ng
Lossless
Data
Compression
Algorithm”
Systems
Research
Center
Technical
Report
No
124.
Palo
Alto,
CA:
Digital
Equipment
Corpora@on,
Burrows
M,
Wheeler
DJ.
1994
• Originally
developed
for
compressing
large
files
(bzip2,
etc.)
• Lossless,
Fully
Reversible
• Alignment
Tools
based
on
BWT:
bow@e,
BWA,
SOAP2,
etc.
• Approach:
– Align
reads
on
the
transformed
reference
genome,
using
an
efficient
index
(FM
index)
– Solve
the
simple
problem
first
(align
one
character)
and
then
build
on
that
solu1on
to
solve
a
slightly
harder
problem
(two
characters)
etc.
• Results
in
great
speed
and
efficiency
gains
(a
few
GigaByte
of
RAM
for
the
en1re
H.
Genome).
Other
approaches
require
tens
of
GigaBytes
of
memory
and
are
much
slower.
Generating the SAM file
-‐bash-‐3.2$
cd
project1
-‐bash-‐3.2$
bwa
aln
/data/tutorial/mm9.fasta
ChIPseq_chr19.fastq
>
ChIPseq_chr19.sai
bwa_aln]
17bp
reads:
max_diff
=
2
[bwa_aln]
38bp
reads:
max_diff
=
3
[bwa_aln]
64bp
reads:
max_diff
=
4
[bwa_aln]
93bp
reads:
max_diff
=
5
[bwa_aln]
124bp
reads:
max_diff
=
6
[bwa_aln]
157bp
reads:
max_diff
=
7
[bwa_aln]
190bp
reads:
max_diff
=
8
[bwa_aln]
225bp
reads:
max_diff
=
9
[bwa_aln_core]
calculate
SA
coordinate...
25.94
sec
[bwa_aln_core]
write
to
the
disk...
0.03
sec
[bwa_aln_core]
262144
sequences
have
been
processed.
[bwa_aln_core]
calculate
SA
coordinate...
0.86
sec
[bwa_aln_core]
write
to
the
disk...
0.00
sec
[bwa_aln_core]
270631
sequences
have
been
processed.
[main]
Version:
0.6.2-‐r126
[main]
CMD:
bwa
aln
/data/tutorial/mm9.fasta
ChIPseq_chr19.fastq
[main]
Real
1me:
28.230
sec;
CPU:
28.229
sec
-‐bash-‐3.2$
bwa
samse
/data/tutorial/mm9.fasta
ChIPseq_chr19.sai
ChIPseq_chr19.fastq
>
ChIPseq_chr19.sam
[bwa_aln_core]
convert
to
sequence
coordinate...
3.39
sec
[bwa_aln_core]
refine
gapped
alignments...
0.43
sec
[bwa_aln_core]
print
alignments...
0.50
sec
[bwa_aln_core]
262144
sequences
have
been
processed.
[bwa_aln_core]
convert
to
sequence
coordinate...
1.75
sec
[bwa_aln_core]
refine
gapped
alignments...
0.31
sec
[bwa_aln_core]
print
alignments...
0.01
sec
[bwa_aln_core]
270631
sequences
have
been
processed.
[main]
Version:
0.6.2-‐r126
[main]
CMD:
bwa
samse
/data/tutorial/mm9.fasta
ChIPseq_chr19.sai
ChIPseq_chr19.fastq
[main]
Real
1me:
6.710
sec;
CPU:
6.711
sec
more
• Use
the
command
more
to
view
at
the
contents
of
a
file
one
screen
at
a
1me:
> -bash-3.2$ more ChIPseq_chr19.sam
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@SQ SN:chr12 LN:121257530
@SQ SN:chr13 LN:120284312
@SQ SN:chr14 LN:125194864
@SQ SN:chr15 LN:103494974
@SQ SN:chr16 LN:98319150
@SQ SN:chr17 LN:95272651
@SQ SN:chr18 LN:90772031
@SQ SN:chr19 LN:61342430
@SQ SN:chr1 LN:197195432
@SQ SN:chr2 LN:181748087
@SQ SN:chr3 LN:159599783
@SQ SN:chr4 LN:155630120
@SQ SN:chr5 LN:152537259
@SQ SN:chr6 LN:149517037
@SQ SN:chr7 LN:152524553
@SQ SN:chr8 LN:131738871
@SQ SN:chr9 LN:124076172
@SQ SN:chrM LN:16299
@SQ SN:chrX LN:166650296
@SQ SN:chrY LN:15902555
HWUSI-EAS610_0001:3:1:4:1405#0 16 chr19 60874227 37 36M * 0 0
CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1
?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9T26
HWUSI-EAS610_0001:3:1:5:1490#0 0 chr19 32960373 37 36M * 0 0
GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>A
A3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:6:388#0 16 chr19 18177553 37 36M * 0 0
GTCTAGGAAGACTAGAGGCCTATTTCATGAACTCTG @AABAAB@@B@?BBBABA
BA;?<CBBBBBA>C;BBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:7:1045#0 16 chr19 60698168 37 36M * 0 0
ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9A
B87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
Hit
the
spacebar
to
page
down
through
the
file
– Ctrl-‐U
moves
back
up
a
page
– At
the
boqom
of
the
screen,
more
shows
how
much
of
the
file
has
been
displayed
hqp://samtools.sourceforge.net/SAM1.pdf
SAM/BAM
format
Header
sec1on
@HD
@SQ
VN:1.0
SN:chr20 AS:HG18 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
posi1on
of
alignment
Alignment
sec1on
Query
Name
Ref
sequence
query
sequence
(same
strand
as
ref)
query
quality
V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aaX`[\`Y^Y^]ZX``\EV_BBBBBBBBBBBBBBBB NM:i:1
V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\OWaI_\WL\aa`Xa^]\ZUaa[XWT\^XR NM:i:1
V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT __YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2
V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaQaa__``]\\_^``^a^`a`_^^^_XQ[ZS\XX NM:i:1
Converting the SAM file to a BAM file
BAM
file
B:
Binary
format
vs
Text
format
(SAM),
resul1ng
in
more
efficient
data
storage.
Sequencing
plazorm
independent.
-‐bash-3.2$ samtools view -bt /data/tutorial/mm9.fasta -o
ChIPseq_chr19.bam ChIPseq_chr19.sam
[samopen] SAM header is present: 22 sequences.
-bash-3.2$ samtools sort ChIPseq_chr19.bam
ChIPseq_chr19.sorted
-bash-3.2$ samtools index ChIPseq_chr19.sorted.bam
-bash-3.2$ samtools view in.bam chr2:20,100,000-20,200,000 >
out.bam
-bash-3.2$ samtools view lib15103.sorted.rmdup.bam "gi|
121635883|ref|NC_008769.1|:2,100,000-2,200,000"
BAM Files: Using the bitwise FLAG
-f : bit set
-F: bit is not set
Counting Alignments:
-bash-3.2$ samtools view -c ChIPseq_chr19.sorted.bam
270631
Count unmapped reads –f 4 (the bitwise flag 0x004 is set)
-bash-3.2$ samtools view -c -f 4 ChIPseq_chr19.sorted.bam
1839
Count the reads that are not unmapped (hence, count mapped
alignments) –F 4 (the bitwise flag 0x0004 is Not set)
-bash-3.2$ samtools view -c -F 4 ChIPseq_chr19.sorted.bam
268792
-bash-3.2$ samtools flagstat ChIPseq_chr19.sorted.bam
Putting it all together in a shell script (use emacs)
/data/tutorial/first
REFERENCE=/data/tutorial/mm9.fasta
FASTQ=ChIPseq_chr19.fastq
SAI=ChIPseq_chr19.sai
SAM=ChIPseq_chr19.sam
BAM=ChIPseq_chr19.bam
SORTED=ChIPseq_chr19.sorted
echo
$REFERENCE;
echo
$FASTQ;
echo
$SAI
bwa
aln
$REFERENCE
$FASTQ
>
$SAI
bwa
samse
$REFERENCE
$SAI
$FASTQ
>
$SAM
samtools
view
-‐bt
$REFERENCE
-‐o
$BAM
$SAM
samtools
sort
$BAM
$SORTED
The
screen
command
• Run
“screen”
before
you
run
your
command
• Detach
by
pressing:
Control-‐a
d
• Logout
• Login
again
• Re-‐aOach
by
running
the
command:
“screen
–r”
• You
may
have
several
screen
sessions
acRve
Using IGV to view alignments
The
Integra1ve
Genome
Viewer
can
be
installed
on
your
laptop
wget
hqp://www.broadins1tute.org/igv/projects/downloads/IGV_2.1.22.zip
unzip
IGV_2.1.22.zip
cd
IGV_2.1.22
java
-‐Xmx2g
-‐jar
igv.jar
You
will
need
to
transfer
to
your
laptop:
• Indexed
Reference
FASTA
file
• Reference
Genome
Feature
File
(GFF)
• Sorted
and
indexed
mapped
read
BAM
files
Using IGV to view alignments