Introduction to Perl and BioPerl

Institut Pasteur Tunis


22 March 2007

Heikki Lehväslaiho, SANBI


This work is licensed under the Creative Commons Attribution-ShareAlike 2.0
South Africa License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/2.0/za/
or send a letter to
Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Introduction to Perl and Bioperl


What is Perl
Perl is a programming language
Born from a combination of C & shell scripting for system administration
Larry Wall’s background in linguistics led to Perl borrowing ideas from
natural language.

“There is more than one way to do it”


The glue that holds the internet together.
Oldest scripting language
No separate compilation step needed
The line noise of the programming languages.
/^[^#]+\s*(?:\d+\w+\s*)[2,3]$/;



Why use Perl
Easy to learn
Cross platform
Very strong community support
CPAN, perlmonks, Perl User Groups
Provides API to things that do not have API
Excellent documentation
see man perl



The Camel Book



Beginning Perl
open source book
by Simon Cozens

Downloadable at
http://www.perl.org/books/beginning-perl/
and locally



Perl Documentation
perldoc perltoc
perldoc CGI
perldoc Bio::PrimarySeq
perldoc -f open
http://perldoc.perl.org/
http://www.cpan.org/
http://qa.perl.org/phalanx/100/
http://perlmonks.org/



Programming Perl
Best Practices
Aimed at Perl 5.8.x
Shortcuts
Code Re-Use
Maintainable Development
Shortest Path between two points



Perl program structure
shebang #!
directives (use)
keywords
functions
statements ;
escape sequences: "\t\n"
white space
comments

    #!/usr/bin/perl
    # hello.pl
    use warnings;

    # print a message
    print "Hello world!\n";

    > chmod 755 hello.pl
    > hello.pl
    Hello world!
    >



Variable types
Scalars - Start with a $
Strings, Integers, Floating Point Numbers, References to other variables
Arrays - Start with a @
Zero based index
Contain an ordered list of Scalars
Hashes - Start with %
Associative Arrays without order
Key => Value



Scalars
Any single value
Automatic type casting
string interpolation
only in double quoted strings
In Perl, context is everything!

    #!/usr/bin/perl
    # print_sum.pl
    use warnings;
    use strict;

    print "Give a number ";
    my $num = <STDIN>;
    my $num2 = '0.5';
    my $float = $num + 0.5;
    my $res = 'Sum';

    # print the sum
    print "$res = $float\n";



Pragmas
‘use strict;’
Forces variable declaration
Needed for maintainable code
Scoping
Garbage collection
‘use warnings;’
Warns about use of uninitialized variables
Warns on deprecated syntax
Useful for sanity checking
in desperate situations: 'no warnings;'



undef
Q: What is the value of a variable, if the value has not been assigned?
A: undef
not defined, void
use warnings will warn if you try to access undefined variables

    #!/usr/bin/perl
    # print_sum.pl
    use warnings;
    use strict;
    my $num;
    # print
    print "$num\n";



Operators
Function             String            Numeric
Assignment           =                 =
Equality             eq, ne            ==, !=
Comparison           lt, le, gt, ge    <, <=, >, >=
Concatenation        .                 N/A
Repetition           x                 N/A
Basic Math           N/A               +, -, *, /
Modulus, Exponent    N/A               %, **
Special Sorting      cmp               <=>
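
A minimal sketch of why the string and numeric columns matter, using made-up values: the same data compares and sorts differently depending on which operator family is used.

```perl
use strict;
use warnings;

my @ids = (10, 9, 100);

# String comparison looks at characters, numeric comparison at values.
my $as_string = ('10' lt '9') ? 'true' : 'false';   # '1' sorts before '9'
my $as_number = (10 < 9)      ? 'true' : 'false';

my @by_string = sort { $a cmp $b } @ids;   # 10 100 9
my @by_number = sort { $a <=> $b } @ids;   # 9 10 100

my $line   = '-' x 5;       # repetition
my $joined = 'AT' . 'GC';   # concatenation
my $power  = 2 ** 10;       # exponent
my $rest   = 10 % 3;        # modulus

print "string: $as_string, number: $as_number\n";
print "cmp: @by_string\n<=>: @by_number\n";
print "$line $joined $power $rest\n";
```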



Operators
normal mathematical precedence
operators force the context on variables!
More:
boolean operators ( and, &&, or, || )
operating and assigning at once ($a += $b;)
autoincrement and autodecrement ($count++, ++$c;)



Arrays
Implement stacks, lists, queues
Creation
@a = (); # literal empty list
@b = qw(a t c g); # whitespace-delimited list
functions: e.g. push @b, 'u'; $first = shift @b;

Diagram, array with indices 0 1 2 3 4: shift() and unshift() work on the start of the array (index 0), push() and pop() work on the end.
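
A short sketch of the four functions on the slide's own list. The variable names other than @b are only for illustration.

```perl
use strict;
use warnings;

my @b = qw(a t c g);

push @b, 'u';                 # add to the end:        a t c g u
my $last  = pop @b;           # remove from the end:   returns 'u'
unshift @b, 'n';              # add to the start:      n a t c g
my $first = shift @b;         # remove from the start: returns 'n'

# A queue takes from the front, a stack takes from the back.
my $queue_item = shift @b;    # 'a'
my $stack_item = pop @b;      # 'g'

print "left: @b\n";
```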



Working with arrays
Special variable $#alph
index of last element
Special variable $_
split() and join(), foreach()
Enclosure
Scalar context gives array length
Access array elements as scalars
note: @ -> $

    #!/usr/bin/perl
    # counting.pl
    use warnings;
    use strict;

    my $alph = 'atgc';
    print length($alph), "\n";
    my @alph = split '', $alph;
    print "$#alph\n";
    print scalar(@alph), "\n";
    my $c = 0;
    foreach (@alph) {
        print "$c: ", $alph[$c], "$_\n";
        $c++;
        my $alph = 'augc';
    }
    print "$alph: $c\n";



Variable Scope
Lexical Scope
Declared with my()
Limits scope to containing block
Widest scope: the file in which it's declared
Package Scope
Default scope
Declared with our()
Permanent scope



Working with arrays
Ranges, an easy way to generate lists:
(1 .. 6), ( 8 .. -2 ), ('a' .. 'z')
Can be used as slices
@three = reverse sort @months[ -1..1 ];
Months with 31 days:
@months[0, 2, 4, 6..7, 9, 11]
Swapping values without intermediate variables:
($a, $b) = ($b, $a);
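
The slice and swap idioms above can be tried with a sketch like the following. The contents of @months are an assumption (three-letter English names).

```perl
use strict;
use warnings;

my @months = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Slice with a list of indices; 6..7 is a range inside the index list.
my @long = @months[0, 2, 4, 6..7, 9, 11];

# Negative indices count from the end: -1 is the last element.
my @around_new_year = @months[-1 .. 1];

# Swap two values with list assignment.
my ($x, $y) = ('first', 'second');
($x, $y) = ($y, $x);

print "@long\n";
print "@around_new_year\n";
print "$x $y\n";
```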



Hashes
Special Initialization
my %hash = ( 'key1' => 'value1' );
could be written ( 'key1', 'value1', 'key2', 'value2' )
Hash keys are unique!
Access scalar elements inside Hashes like this:
my $value = $hash{key};
Hashes auto-vivify!
$hash{test1} = 'value'; # creates an entry with key test1;
When you use hashes all the time, you have mastered perl!
hash references are even better, but we'll talk about them later



Hash functions
my $is_there = exists $hash{key};
returns true (1) if the key exists, false (an empty string) if not.
does not auto-vivify.
my $has_value = defined $hash{key};
returns true (1) if the key has a defined value, false (an empty string) if not
my @list = keys %hash;
returns a list of the keys in the hash
my @list = values %hash;
returns a list of the values in the hash
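
A small sketch of the difference between exists and defined, using a made-up codon table in which one key is stored with an undefined value.

```perl
use strict;
use warnings;

my %codon = (ATG => 'M', TGG => 'W', TAA => undef);

my $has_stop    = exists  $codon{TAA} ? 1 : 0;   # key is there
my $stop_value  = defined $codon{TAA} ? 1 : 0;   # but its value is undef
my $has_missing = exists  $codon{CCC} ? 1 : 0;   # key was never stored

# Testing a top-level key with exists() did not create it.
my $count = keys %codon;    # keys in scalar context: number of keys

for my $key (sort keys %codon) {
    my $value = defined $codon{$key} ? $codon{$key} : 'undef';
    print "$key => $value\n";
}
```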



Default variables
$_ - the “default scalar”;
for example, chomp() and print() work on default scalar if no argument is
given
@_ & @ARGV - the “default arrays”;
Subroutines use @_ as default
Outside of a subroutine, @ARGV is the default array, only used for
command line input



Control structures
Loops and decisions
for, foreach
if, elsif, else
while
"if not" equals "unless"
transposition helps readability

    if (<some test>) {
        # do
    } elsif (<other test>) {
        # do
    } else {
        # do
    }

    $a = 5;
    while ($a>0) {
        # do
        $a--;
    }

    unless ($valid) {
        check($value)
    }
    check($value) unless $valid;



Loop modifiers
next
last
redo
continue
LABEL:
name a loop to know which one is being jumped out of

    while (<EXPR>) {
        # redo always comes here
        do_something;
    } continue {
        # next always comes here
    }
    # last always comes here

    OUTER: foreach (<EXPR>) {
        INNER: foreach (<EXPR>) {
            last OUTER;
        }
    }
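
A runnable sketch of labelled loops, with arbitrary number ranges chosen only to show where next and last land.

```perl
use strict;
use warnings;

my @pairs;

OUTER: foreach my $i (1 .. 3) {
    INNER: foreach my $j (1 .. 3) {
        next OUTER if $j > $i;        # abandon this $i, go on to the next one
        last OUTER if $i * $j > 4;    # leave both loops at once
        push @pairs, "$i$j";
    }
}

print "@pairs\n";
```

Without the labels, next and last would only affect the innermost loop.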



What is boolean in Perl
Anything can be tested.
An empty string is false
Number 0 and string “0” are false
An empty list () is false
Undefined value, undef, is false
everything else is true
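
The rules above can be checked with a sketch like this one. The last three test values are additions chosen to show that "everything else" includes strings that merely look like zero.

```perl
use strict;
use warnings;

my @empty = ();
my %value = (
    'empty string' => '',
    'number 0'     => 0,
    'string "0"'   => '0',
    'undef'        => undef,
    'string "0.0"' => '0.0',   # not exactly "0", so true
    'string "00"'  => '00',    # true
    'single space' => ' ',     # true
);

my %truth;
for my $name (keys %value) {
    $truth{$name} = $value{$name} ? 'true' : 'false';
}
$truth{'empty array'} = @empty ? 'true' : 'false';

printf "%-14s %s\n", $_, $truth{$_} for sort keys %truth;
```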



Pseudocode
Near-English (or any natural language) explanation of what the code does, written before writing the code
Keep elaborating and adding programme-code-like elements until it is easy to implement.
e.g. how to count from 10 to zero in even numbers:

    start from 10,
    remove 2,
    keep repeating until 0

    start from 10,
    keep repeating until 0
        print value
        remove 2,

    $x = 10;
    until ($x < 0) {
        print $x;
        $x -= 2;
    }



Subroutines
create your own verbs
prototypes and predeclarations of subroutines can be used
lexical scoping
shift works on @_
last statement is returned
Note: you can not pass two arrays, they are flattened into one!

    sub version;
    print version, "\n";

    sub add1 {
        my $one = shift;
        my $two = shift;
        my $sum = $one + $two;
        return $sum;
    }
    sub add ($$) {
        shift() + shift();
    }

    my $sum = add1(2,3);
    $sum = add 2, 3;
    sub version {'1.0'};
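
A sketch of the flattening note, with two hypothetical helper subroutines: arrays passed directly merge into one list in @_, while array references (covered under References) keep them apart.

```perl
use strict;
use warnings;

# Two arrays passed directly arrive as one flat list in @_.
sub count_args { return scalar @_; }

# Passing references keeps the arrays separate.
sub lengths {
    my ($first_ref, $second_ref) = @_;
    return (scalar @$first_ref, scalar @$second_ref);
}

my @dna = qw(a t g c);
my @rna = qw(a u g);

my $flat          = count_args(@dna, @rna);
my ($len1, $len2) = lengths(\@dna, \@rna);

print "$flat arguments; lengths $len1 and $len2\n";
```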



Long arguments for subroutines
if you have more than two arguments often, you might want to use hashes to pass arguments to subroutines

    sub add2 {
        my %args = @_;
        my $one = $args{one} || 0;
        my $two = $args{two} || 0;
        my $sum = $one + $two;
        return $sum;
    }

    sub add ($$) {
        shift() + shift();
    }

    my $sum2 = add2(one => 2,
                    two => 3);
    my $sum = add(2,3);



References
Reference is a scalar variable pointer to some other, often more complex, structure.
It does not have to be a named structure
references make it possible to create complex structures: hashes of hashes, hashes of arrays, ...
ref() tells what is the referenced structure

    @lower = ('a' .. 'z');
    $myletters = \@lower;
    push @$myletters, '-';
    $upper = ['A' .. 'Z'];    # reference to an anonymous array
    ${$all}{'upper'} = $upper;
    $all->{'lower'} = \@lower;

    $matrix[0][5] = 3;

    # using ref()
    ref \$a; # returns SCALAR
    ref \@a; # returns ARRAY
    ref \%a; # returns HASH
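
A minimal hash-of-arrays sketch built from references to a named and an anonymous array. The hash %all and its keys are illustrative.

```perl
use strict;
use warnings;

my @lower = ('a' .. 'c');
my %all = (
    lower => \@lower,           # reference to a named array
    upper => ['A' .. 'C'],      # reference to an anonymous array
);

push @{ $all{lower} }, 'd';     # changes @lower too: it is the same data
my $second_upper = $all{upper}[1];
my $kind         = ref $all{upper};
my $n_lower      = scalar @lower;

for my $case (sort keys %all) {
    print "$case: @{ $all{$case} }\n";
}
```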





Subroutines revisited
passing more complex arguments as references
? : operator

    sub first_is_longer {
        my ($lref1, $lref2) = @_;
        my $first = @$lref1; # length
        my $sec = @$lref2;   # length
        ($first > $sec) ? 1 : 0;
    }



Reading and Writing a file
The easy way:
use while (<>){} construct
redirect the output at command line into a file

    # the most useful perl construct
    while (<>) {
        # do something
    }

    # same as:
    > perl -ne '#do something'

    # redirection
    > perl -ne '#do something' > file



Filehandles
Default filehandle is STDOUT
$! special variable holds error messages
perldoc -f -x
perldoc -f open
$/ 'input record separator'
defaults to "\n"
The three argument form is preferred
lexical scope to filehandles

    print "Hello\n";
    print STDOUT "Hello\n"; # identical

    my $file = 'seq.embl';
    die "Not exist"
        unless -e $file;
    die "Not readable"
        unless -r $file;

    open FH, $file or die $!;
    while (<FH>) { chomp; print;}
    close FH;

    {
        open my $F, '<', $file
            or die $!;
        while (<$F>) { chomp; ... }
    }
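
A sketch of the three-argument open with lexical filehandles, writing and then reading back a scratch file. File::Temp (a core module) and the two-line sample content are assumptions made so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file stands in for real data.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

{
    open my $out, '>', $file or die "Cannot write $file: $!";
    print $out "ID   X1\n", "ID   X2\n";
    close $out or die "Cannot close $file: $!";
}

my @ids;
{
    open my $in, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @ids, (split ' ', $line)[1];   # second whitespace-separated field
    }
}   # $in goes out of scope here and the file is closed

print "@ids\n";
```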



Reading and Writing a file
Permanent record of program execution
read file one EMBL seq entry at a time
modify $/ in an enclosing block or subroutine
the only use for local you'll see!

    die "Not writable"
        unless -w $file;
    open my $LOG, '>>', $file
        or die $!;
    print STDERR "log: $params\n";
    print $LOG "$params\n";

    {
        local $/ = "\/\/\n";
        open my $SEQ, '<', shift
            or die $!;
        while (<$SEQ>) {
            my $seq = $_;
            my ($ac) =
                $seq =~ /AC +(\w+)/;
            print "$ac\n"
                if $seq =~ /FT +CDS/;
        }
    }
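
A self-contained sketch of record-at-a-time reading with a localized $/. The two toy entries and the in-memory filehandle (open on a scalar reference) are assumptions that replace a real EMBL file.

```perl
use strict;
use warnings;

my $data = "ID   A1\nAC   X0001;\n//\nID   B2\nAC   X0002;\n//\n";

sub accessions {
    my ($text) = @_;
    my @found;
    local $/ = "//\n";                      # restored when the sub returns
    open my $fh, '<', \$text or die $!;     # in-memory filehandle
    while (my $entry = <$fh>) {             # one whole entry per iteration
        my ($ac) = $entry =~ /^AC +(\w+)/m;
        push @found, $ac if defined $ac;
    }
    return @found;
}

my @acs = accessions($data);
my $separator_restored = ($/ eq "\n") ? 1 : 0;

print "@acs\n";
```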



Regular expressions
used for finding patterns in free text, semi-structured text (database parsing), sequences (e.g. prosite)
consists of
literals
metacharacters

    /even/;    # literal
    /eve+n/;   # + means one or more
    /eve*n/;   # * means zero or more
    /eve?n/;   # ? means zero or one
    /e(ve)+n/  # group
    /0|1|2|3|4|5|6|7|8|9/  # alternation
    /[0123456789]/         # character class
    /[0-9]/                # range, in ASCII
    /\d/                   # character class



Regex shorthands
Always use the shortest form for clarity
what does /p*/ match?
it always matches
Exact number of repetitions

    /[a-zA-Z0-9_]/;   # word character
    /\w/;             # word character
    /[^a-zA-Z0-9_]/;  # non-word char
    /\W/;             # non-word char
    /\D/;             # non-digit
    /[ \t\n\r\f]/     # white space
    /\s/              # white space
    /\S/              # non-white space
    /./               # any character (except newline)

    /\w{4}/           # four letter word
    /\w{4,6}/         # 4-6 letters
    /\w{4,}/          # at least four letters



Regex anchors and operators
Anchoring the match to a border
regex works on $_
regexp operators tell regexps to bind to other strings
=~
!~

    /^ \w+.+/    # ^ forces line start
    /\d$/        # $ forces line end
    /\bword\b/   # word boundary

    if (/\w/) {  # word char
        my $line = $_;
        # found the first digit
        print "digit\n"
            if $line =~ /\d/;
        # should have ID
        print "error: $line"
            if $line !~ /ID/;
    }



String manipulations with regexs
contents of parentheses are remembered
fancier version of split()
any delimiter can be used when declaring a regexp with 'm'
regexp operators
match m//
substitution s///
translate tr///
returns number of translations
useful for counting

    /^ (\w+)(.+)/;
    my $first_word = $1;
    my $rest = $2;
    # or
    my ($first_word, $rest) =
        /^ (\w+)(.+)/;

    # two words limited by '\'
    /\w+\\\w+/;
    m|\w+\\\w+|;

    s/[Uu]/t/;
    s/(\w+)/"$1"/;  # add quotes around
                    # the first word
    $count = tr/AT/N/;  # tr takes plain character lists, not [] classes
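
A sketch of captures and of tr used for counting, bound to named variables with =~ instead of $_. The sample strings are made up.

```perl
use strict;
use warnings;

my $line = 'BOSS_DROME Bride of sevenless';

# A match in list context returns the remembered groups.
my ($first_word, $rest) = $line =~ /^(\w+)\s+(.+)/;

# tr with an empty replacement list counts without changing the string.
my $seq      = 'ATGCATTA';
my $at_count = ($seq =~ tr/AT//);

# tr replaces every listed character: RNA to DNA.
(my $dna = 'AUGGCU') =~ tr/U/T/;

print "$first_word | $rest\n";
print "A+T in $seq: $at_count\n";
print "$dna\n";
```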



Regex modifiers and greediness
modifiers
g - global
Greedy by default
"always match all you can"
lazy (non-greedy) matching by adding ? to repetition

    s/(\w+)/"$1"/g;  # quotes around
                     # every word
    my $count = tr/AT/N/;

    /.*(\w+)/;   # last word character
    /.*?(\w+)/;  # first whole word
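
A sketch comparing greedy and lazy repetition on one made-up string, plus the g modifier in list context.

```perl
use strict;
use warnings;

my $text = 'gene=boss; gene=sev;';

my ($greedy) = $text =~ /gene=(.+);/;     # runs on to the last ';'
my ($lazy)   = $text =~ /gene=(.+?);/;    # stops at the first ';'
my @all      = $text =~ /gene=(\w+)/g;    # g: every match, not just the first

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    @all\n";
```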



Catching errors
eval
traps run time errors
error message stored in special variable $@
semicolon at the end of the eval block is required

    $a = 0;
    eval {
        $b = 5/$a;
    };
    print $@ if $@;
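
A sketch that wraps the same division in a hypothetical safe_divide subroutine, so the caller gets undef instead of a dead program.

```perl
use strict;
use warnings;

my $last_error = '';

sub safe_divide {
    my ($num, $den) = @_;
    my $result = eval { $num / $den };   # dies on division by zero
    if ($@) {
        $last_error = $@;                # keep the message for the caller
        return;
    }
    return $result;
}

my $ok  = safe_divide(10, 4);
my $bad = safe_divide(5, 0);

print "ok: $ok\n";
print "error was: $last_error" unless defined $bad;
```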



Calling external programs
system("ls");

# to catch the output use backticks
$files = `ls -1`;



Running perl
man perlrun
man perldebug
Chapter 9 in Beginning Perl
command line perl
you should have learned it by now by example!



Modules
logical organisation of code
code reuse
@INC – paths where Perl looks for modules
(do) - call subroutines from another file
require – runtime include of a file or module
allows testing and graceful failure
use
compile time include
'use'ing a perl module makes its object oriented interface available and usually exports common functions



Getopt::Long
a standard library
used to set short or long options from command line
$0, name of the calling programme

    use Getopt::Long;
    use constant PROGRAMME_NAME => 'testing.pl';
    use constant VERSION => '0.1';

    our $DEBUG = '';
    our $DIR = '.';
    our $WINDOW = 7;

    GetOptions
        ('v|version' =>
            sub{print PROGRAMME_NAME, ", version ", VERSION, "\n";
                exit 1; },
         'd|directory:s'=> \$DIR,
         'g|debug' => \$DEBUG,
         'h|help|?' =>
            sub{
                exec('perldoc',$0); exit 0}
        );
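
A sketch of the same option specifications parsed from a plain array, so the effect can be seen without a real command line. GetOptionsFromArray is part of Getopt::Long; the argument list is made up.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-in for @ARGV.
my @args = ('--directory', '/tmp', '-g', 'input.fa');

my ($dir, $debug) = ('.', 0);
GetOptionsFromArray(
    \@args,
    'd|directory:s' => \$dir,
    'g|debug'       => \$debug,
) or die "bad options\n";

# Recognised options are removed; everything else stays in @args.
print "dir=$dir debug=$debug rest=@args\n";
```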



Plain Old Documentation
POD: embedded structured comments in code
Empty lines separate commands
Three types of text:
1. ordinary paragraphs
formatting codes
2. verbatim paragraphs
indented
3. command paragraphs
see code

    =pod
    =head1 Heading Text
    Text in B<bold> I<italic>
    =head2 Heading Text
    =head3 Heading Text
    =head4 Heading Text
    =over indentlevel
    =item stuff
    =back
    =begin format
    =end format
    =for format text...
    =encoding type
    =cut



POD tools
pod2html pod2latex pod2man pod2text pod2usage,
podchecker
use POD to create selfdocumenting scripts
exec('perldoc',$0); exit;
Headers for a program:
NAME, SYNOPSIS, DESCRIPTION (INSTALLING, RUNNING,
OPTIONS), VERSION, TODO, BUGS, AUTHOR, CONTRIBUTORS,
LICENSE, (SUBROUTINES)
Use inline documentation when you can



Code reuse
Try not to reinvent wheels
CPAN Authors usually QA their code
The community reviews CPAN Modules
Always look for a module FIRST
Chances are, it’s been done faster and more secure than you
could do it by yourself
It saves time
You might be able to do it better, but is it worth it?



Some Modules (I)
Getopt::Long for command line parsing
Carp provides more intelligent designs for error/warning
messages
Data::Dumper for debugging
CGI & CGI::Pretty provide an interface to the CGI Environment
DBI provides a unified interface to relational databases
DateTime for date interfaces, also
DateTime::Format::DateManip



Some Modules (II)
WWW::Mechanize for web screen scraping
HTML::TreeBuilder for HTML parsing
MIME::Lite for constructing email message with or without
attachments
Spreadsheet::ParseExcel to read in Excel Spreadsheets
Spreadsheet::WriteExcel to create spreadsheets in perl
XML::Twig for XML data
PDL, Perl Data Language, to work with matrices and math



Perl Resources
Perl Phalanx
http://qa.perl.org/phalanx/100/
Comprehensive Perl Archive Network
http://www.cpan.org/
http://search.cpan.org/



Installing from CPAN
use your distro's package manager to install most – and
especially complex – modules.
e.g. sudo apt-get install GD – graphics library
first run configures cpan
o conf init at cpan prompt reconfigures
sets closest mirrors and finds helper programs

$ sudo cpan
cpan> install YAML
...



BioPerl
BioPerl is in CPAN
... but you will not want to use it from there!
sequence databases change so often that official releases are often
outdated
http://bioperl.org/



Installing BioPerl via CVS (I)
http://www.bioperl.org/wiki/Using_CVS
You need cvs client on your local machine
Create a directory for BioPerl
$ mkdir ~/src;
$ mkdir ~/src/bioperl
$ cd ~/src/bioperl

Login to CVS (password is "cvs"):


$ cvs -d :pserver:cvs@code.open-bio.org:\
/home/repository/bioperl login



Installing BioPerl via CVS (II)
Checkout the BioPerl core module, only
$ cvs -d :pserver:cvs@code.open-bio.org:\
/home/repository/bioperl checkout bioperl-live

Tell perl where to find BioPerl (set this in your .bash_profile,


.profile, or .cshrc):
bash: $ export PERL5LIB="$HOME/src/bioperl"
tcsh: $ setenv PERL5LIB "$HOME/src/bioperl"

Test
perl -MBio::Perl -le 'print Bio::Perl->VERSION;'



What is Bioperl
A collection of Perl modules for processing data for the life
sciences
A project made up of biologists, bioinformaticians, computer
scientists
An open source toolkit of building blocks for life sciences
applications
Supported by Open Bioinformatics Foundation (O|B|F),
http://www.open-bio.org/
Collaborative online community



Simple example
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'genbank',
                        -file   => 'AB077698.gb');
while ( my $seq = $in->next_seq ) {
    print "Sequence length is ", $seq->length(), "\n";
    my $sequence = $seq->seq();
    print "1st ATG is at ", index($sequence,'ATG')+1, "\n";
    print "features are: \n";
    foreach my $f ( $seq->top_SeqFeatures ) {
        printf(" %s %s(%s..%s)\n",
               $f->primary_tag,
               $f->strand < 0 ? 'complement' : '',
               $f->start,
               $f->end);
    }
}



Simple example, output
% perl ex1.pl
Sequence length is 2701
1st ATG is at 80
features are:
source (1..2701)
gene (1..2701)
5'UTR (1..79)
CDS (80..1144)
misc_feature (137..196)
misc_feature (239..292)
misc_feature (617..676)
misc_feature (725..778)
3'UTR (1145..2659)
polyA_site (1606..1606)
polyA_site (2660..2660)



Gotchas
Sequences start with 1 in Bioperl (historical reasons). In perl
strings, arrays, etc start with 0.
When using a module, CaseMatTers.
methods are usually lower case with underscores (_).
Make sure you know what you're getting back - if you get back an
array, don't assign it to a scalar in haste.
my ($val) = $obj->get_array(); # 1st item
my @vals = $obj->get_array(); # whole list
my $val = $obj->get_array(); # array length
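
The three assignments can be tried without BioPerl using a plain subroutine. get_array here is a hypothetical stand-in that returns a real array; a sub that returns a literal list behaves differently in scalar context (it gives the last element), which is one more reason to check what a method hands back.

```perl
use strict;
use warnings;

my @stored = qw(ATG TAA TGA);
sub get_array { return @stored; }    # returns an array, not a literal list

my ($val) = get_array();    # list context, first item only
my @vals  = get_array();    # list context, whole list
my $n     = get_array();    # scalar context, number of elements

print "first: $val, all: @vals, count: $n\n";
```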



Where to go for help
http://docs.bioperl.org/
http://bioperl.org/
FAQ, HOWTOs, Tutorial
modules/ directory (for class diagrams)
perldoc Module::Name::Here
Publication - Stajich et al. Genome Res 2002
Bioperl mailing list: bioperl-l@bioperl.org
Bug reports: http://bugzilla.bioperl.org/



Brief Object Oriented overview
Break problem into
components
Each component has
data (state) and methods
Only interact with
component through methods
Interface versus implementations



Objects in Perl
An object is simply a reference that happens to know which class
it belongs to.
A class is simply a package that happens to provide methods to
deal with object references.
A method is simply a subroutine that expects an object reference
(or a package name, for class methods) as the first argument.
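
The three definitions fit in a few lines of plain Perl. Local::Seq is a hypothetical toy class, not part of BioPerl; it only borrows the -seq and -display_id argument style.

```perl
use strict;
use warnings;

package Local::Seq;    # a class is a package

sub new {              # class method: first argument is the package name
    my ($class, %args) = @_;
    my $self = { seq => $args{-seq}, id => $args{-display_id} };
    return bless $self, $class;    # the reference now knows its class
}

sub seq {              # object method: first argument is the object
    my $self = shift;
    $self->{seq} = shift if @_;
    return $self->{seq};
}

sub len {
    my $self = shift;
    return length $self->{seq};
}

package main;

my $obj   = Local::Seq->new(-seq => 'ATGGGACCAAGTA', -display_id => 'example1');
my $class = ref $obj;
my $len   = $obj->len;

print "$class object, length $len\n";
```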



Inheritance
Objects inherit methods
from their parent
They inherit state
(data members);
not explicitly in Perl.
Methods can be
overridden by children



Interfaces
Interfaces can be thought of
as an agreement
Object will at least look
a certain way
It is independent of what
goes on under the hood



Interfaces and Inheritance in Bioperl
What you need to know:
Interfaces are declared with trailing 'I' (Bio::PrimarySeqI)
Can be assured that at least these methods will be implemented by
subclasses
Can treat all inheriting objects as if they were the same, i.e.
Bio::PrimarySeq, Bio::Seq, Bio::Seq::RichSeq all have basic
Bio::PrimarySeqI methods.
In Perl, good OO requires good manners.
Methods which start with an underscore are considered 'private'
Watch out. Perl programmers can cheat.



Modular programming (I)

From Stein et al. Genome Research 2002



Modular programming (II)



Bioperl components



Sequence components I
Sequences
Bio::PrimarySeq - Basic sequence operations (aa and nt)
Bio::Seq - Supports attached features
Bio::Seq::RichSeq - GenBank,EMBL,SwissProt fields
Bio::LocatableSeq - subsequences
Bio::Seq::Meta - residue annotation



Sequence components II
Features
Bio::SeqFeature::Generic - Basic Sequence features
Bio::SeqFeature::Similarity - Represent similarity info
Bio::SeqFeature::FeaturePair - Paired features (HSPs)
Sequence Input: Bio::SeqIO
Annotation: Bio::Annotation::XX objects



Class diagram (subset)

From Stajich et al. Genome Research 2002



Build a sequence and translate it
#!/usr/bin/perl -w
use strict;
use Bio::PrimarySeq;
my $seq = new Bio::PrimarySeq(-seq => 'ATGGGACCAAGTA',
-display_id => 'example1');
print "seq length is ", $seq->length, "\n";
print "translation is ", $seq->translate()->seq(), "\n";

% perl ex2.pl
seq length is 13
translation is MGPS



Bio::PrimarySeq I
Initialization
-seq - sequence string
-display_id - sequence ID (i.e. >ID DESCRIPTION)
-desc - description
-accession_number - accession number
-alphabet - alphabet (dna,rna,protein)
-is_circular - is a circular sequence (boolean)
-primary_id - primary ID (like GI number)



Bio::PrimarySeq II
Essential methods
length - return the length of the sequence
seq - get/set the sequence string
desc - get/set the description string
display_id - get/set the display id string
alphabet - get/set the sequence alphabet
subseq - get a sub-sequence as a string
trunc - get a sub-sequence as an object



Bio::PrimarySeq III
Methods only for nucleotide sequences
translate - get the protein translation
revcom - get the reverse complement



Bio::Seq
Initialization
annotation - Bio::AnnotationCollectionI object
features - array ref of Bio::SeqFeatureI objects
species - Bio::Species object



Bio::Seq
Essential methods
species - get/set the Bio::Species object
annotation - get/set the Bio::AnnotationCollectionI object
add_SeqFeature - attach a Bio::SeqFeatureI object to Seq
flush_SeqFeatures - remove all features
top_SeqFeatures - Get all the toplevel features
all_SeqFeatures - Get all features flattening those which contain sub-
features (rare now).
feature_count - Get the number of features attached



Parse a sequence from file
# ex3.pl
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'swiss',
                        -file   => 'BOSS_DROME.sp');
my $seq = $in->next_seq();
my $species = $seq->species;
print "Organism name: ", $species->common_name, " ",
      "(", $species->genus, " ", $species->species, ")\n";
my ($ref1) = $seq->annotation->get_Annotations('reference');
print $ref1->authors,"\n";
foreach my $feature ( $seq->top_SeqFeatures ) {
    print $feature->start, " ",$feature->end, " ",
          $feature->primary_tag, "\n";
}



Parse a sequence from file, output
% perl ex3.pl
Organism name: Fruit fly (Drosophila melanogaster)
Hart A.C., Kraemer H., van Vactor D.L. Jr., Paidhungat M., Zipursky
1 31 SIGNAL
32 896 CHAIN
32 530 DOMAIN
531 554 TRANSMEM
570 588 TRANSMEM
615 637 TRANSMEM
655 676 TRANSMEM
693 712 TRANSMEM
728 748 TRANSMEM
759 781 TRANSMEM
782 896 DOMAIN
...



Bio::SeqIO
Can read sequence from a file or a filehandle
special trick to read from a string: use IO::String
Initialize
-file - filename for input (prepend > for output files)
-fh - filehandle for reading or writing
-format - format for reading or writing
Some supported formats:
genbank, embl, swiss, fasta, raw, gcg, scf, bsml, game, tab



Read in sequence and write out in
different format
# ex4.pl
use Bio::SeqIO;
my $in = new Bio::SeqIO(-format => 'genbank',
                        -file   => 'in.gb');
my $out = new Bio::SeqIO(-format => 'fasta',
                         -file   => '>out.fa');
while ( my $seq = $in->next_seq ) {
    next unless $seq->desc =~ /hypothetical/i;
    $out->write_seq($seq);
}



Sequence Features:
Bio::SeqFeatureI
Basic sequence features - have a location in sequence
primary_tag, source_tag, score, frame
additional tag/value pairs
Subclassed by numerous objects - power of the interface!



Sequence Features:
Bio::SeqFeature::Generic
Initialize
-start, -end, -strand
-frame - frame
-score - score
-tag - hash reference of tag/values
-primary - primary tag name
-source - source of the feature (e.g. program)
Essential methods
primary_tag, source_tag, start,end,strand, frame
add_tag_value, get_tag_values, remove_tag, has_tag



Locations quandary
How to manage features that span more than just start/end
Solution: An interface Bio::LocationI, and implementations in
Bio::Location
Bio::Location::Simple - default: 234, 39^40
Bio::Location::Split - multiple locations (join,order)
Bio::Location::Fuzzy - (<1..30, 80..>900)
Each sequence feature has a location() method to get access to
this object.



Create a sequence and a feature
#ex5.pl
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::SeqIO;
my $seq = Bio::Seq->new
(-seq => 'STTDDEVVATGLTAAILGLIATLAILVFIVV',
-display_id => 'BOSSfragment',
-desc => 'pep frag');
my $f = Bio::SeqFeature::Generic->new
(-seq_id => 'BOSSfragment',
-start => 7, -end => 22,
-primary => 'TRANSMEMBRANE',
-source => 'hand_curated',
-tag => {'note' => 'putative transmembrane'});
$seq->add_SeqFeature($f);
my $out = new Bio::SeqIO(-format => 'genbank');
$out->write_seq($seq);



Create a sequence and a feature,
output
% perl ex5.pl
LOCUS BOSSfragment 34 aa linear UNK
DEFINITION pep frag
ACCESSION unknown
FEATURES Location/Qualifiers
TRANSMEMBRANE 10..25
/note="putative transmembrane"
ORIGIN
1 tvasttddev vatgltaail gliatlailv fivv
//



Sequence Databases
Remote databases
GenBank, GenPept, EMBL, SwissProt - Bio::DB::XX
Local databases
local Fasta - Bio::Index::Fasta, Bio::DB::Fasta
local Genbank,EMBL,SwissProt - Bio::Index::XX
local alignments - Bio::Index::Blast, Bio::Index::SwissPfam
SQL dbs
Bio::DB::GFF
Bio::DB::BioSeqDatabases (through bioperl-db pkg)



Retrieve sequences from a
database
# ex6.pl
use Bio::DB::GenBank;
use Bio::DB::SwissProt;
use Bio::DB::GenPept;
use Bio::DB::EMBL;
use Bio::SeqIO;
my $out = new Bio::SeqIO(-file => ">remote_seqs.embl",
-format => 'embl');
my $db = new Bio::DB::SwissProt();
my $seq = $db->get_Seq_by_acc('7LES_DROME');
$out->write_seq($seq);
$db = new Bio::DB::GenBank();
$seq = $db->get_Seq_by_acc('AF012924');
$out->write_seq($seq);
$db = new Bio::DB::GenPept();
$seq = $db->get_Seq_by_acc('CAD35755');
$out->write_seq($seq);



The Open Biological Database
Access (OBDA) System
cross-platform, database independent
implemented in Bioperl, Biopython, Biojava, Bioruby
database access controlled by registry file(s)
global or user's own
the default registry retrieved over the web
Database types implemented:
flat - Bio::Index
biosql
biofetch - Bio::DB
more: http://www.bioperl.org/HOWTOs/html/OBDA_Access.html



Retrieve sequences using OBDA
# ex7.pl
use Bio::DB::Registry 1.2;# needs bioperl release 1.2.2 or later
my $registry = Bio::DB::Registry->new;
# $registry->services
my $db = $registry->get_database('embl');
# get_Seq_by_{id|acc|version}
my $seq = $db->get_Seq_by_acc("J02231");
print $seq->seq,"\n";



Alignments



Alignment Components
Pairwise Alignments
Bio::SearchIO - Parser
Bio::Search::XX - Data Objects
Bio::SeqFeature::SimilarityPair
Multiple Seq Alignments
Bio::AlignIO - Parser
Bio::SimpleAlign - Data Object



Multiple Sequence Alignments
# ex.pl
# usage: convert_aln.pl < in.aln > out.phy
use Bio::AlignIO;
my $in = new Bio::AlignIO(-format => 'clustalw');
my $out = new Bio::AlignIO(-format => 'phylip');
while( my $aln = $in->next_aln ) {
$out->write_aln($aln);
}



BLAST/FASTA/HMMER Parsing
Can be split into 3 components
Result - one per query, associated db stats and run parameters
Hit - Sequence which matches query
HSP - High Scoring Segment Pairs. Components of the Hit which match
the query.
Corresponding object types in the Bio::Search namespace
Implemented for BLAST, FASTA, HMMER



Parse a BLAST & FASTA report
# ex8.pl
use Bio::SearchIO;
use Math::BigFloat;
my $cutoff = Math::BigFloat->new('0.001');
my %files = ( 'blast' => 'BOSS_Ce.BLASTP',
              'fasta' => 'BOSS_Ce.FASTA');
while( my ($format,$file) = each %files ) {
    my $in = new Bio::SearchIO(-format => $format,
                               -file   => $file);
    while( my $r = $in->next_result ) {
        print "Query is: ", $r->query_name, " ",
              $r->query_description," ",$r->query_length," aa\n";
        print " Matrix was ", $r->get_parameter('matrix'), "\n";
        while( my $h = $r->next_hit ) {
            last unless Math::BigFloat->new($h->significance) < $cutoff;
            print "Hit is ", $h->name, "\n";
            while( my $hsp = $h->next_hsp ) {
                print " HSP Len is ", $hsp->length('total'), " ",
                      " E-value is ", $hsp->evalue, " Bit score ", $hsp->score, " \n",
                      " Query loc: ",$hsp->query->start, " ", $hsp->query->end," ",
                      " Sbject loc: ",$hsp->hit->start, " ", $hsp->hit->end,"\n";
            }
        }
        print "--\n";
    }
}

Parse a BLAST & FASTA report, output
% perl ex8.pl
Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
Matrix was BL50
Hit is F35H10.10
HSP Len is 728 E-value is 6.8e-05 Bit score 197.9
Query loc: 207 847 Sbject loc: 640 1330
--
Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
Matrix was BLOSUM62
Hit is F35H10.10
HSP Len is 315 E-value is 4.9e-11 Bit score 182
Query loc: 511 813 Sbject loc: 1006 1298
HSP Len is 28 E-value is 1.4e-09 Bit score 39
Query loc: 508 535 Sbject loc: 427 454
--
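The cutoff test in the script compares E-values as Math::BigFloat objects instead of native numbers. A small core-Perl sketch of that comparison follows, under two assumptions stated here: hits arrive sorted from most to least significant (which is why the loop can stop with `last`), and a report may print an E-value as "e-120" with no leading digit. evalue_to_bigfloat and hits_below_cutoff are hypothetical helpers and the hit list is invented.

```perl
use strict;
use warnings;
use Math::BigFloat;

# Hypothetical helper: convert an E-value string to a Math::BigFloat.
# A bare exponent such as "e-120" is read as "1e-120".
sub evalue_to_bigfloat {
    my ($text) = @_;
    $text = "1$text" if $text =~ /^e/i;
    return Math::BigFloat->new($text);
}

# Hypothetical helper: return the names of the leading hits whose E-value
# does not exceed the cutoff. Like `last` in the script, it stops at the
# first hit above the cutoff and never looks at later ones.
sub hits_below_cutoff {
    my ($cutoff, @hits) = @_;
    my $limit = Math::BigFloat->new($cutoff);
    my @kept;
    for my $hit (@hits) {
        last if evalue_to_bigfloat($hit->{evalue}) > $limit;
        push @kept, $hit->{name};
    }
    return @kept;
}

# Invented hits; hitD is deliberately out of order to show the effect of `last`.
my @hits = (
    { name => 'hitA', evalue => 'e-120'   },
    { name => 'hitB', evalue => '6.8e-05' },
    { name => 'hitC', evalue => '0.3'     },
    { name => 'hitD', evalue => '1e-400'  },
);

print join(' ', hits_below_cutoff('0.001', @hits)), "\n";
```

Using `last` is cheaper than testing every hit, but it silently drops significant hits if the input is not sorted; use `next` when the order is not guaranteed. A value such as 1e-400 underflows to zero as a native double but stays nonzero as a Math::BigFloat.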



Create an HTML version of a report
#!/usr/bin/perl -w
# ex9.pl
use strict;
use Bio::SearchIO;
use Bio::SearchIO::Writer::HTMLResultWriter;
use Math::BigFloat;
my $cutoff = Math::BigFloat->new('0.2');
my $in = new Bio::SearchIO(-format => 'blast',
                           -file   => 'BOSS_Ce.BLASTP');
my $writer = new Bio::SearchIO::Writer::HTMLResultWriter;
my $out = new Bio::SearchIO(-writer => $writer,
                            -file   => '>BOSS_Ce.BLASTP.html');



Create an HTML version of a report
while( my $result = $in->next_result ) {
    my @keephits;
    my $newresult = new Bio::Search::Result::GenericResult
        (-query_name        => $result->query_name,
         -query_accession   => $result->query_accession,
         -query_description => $result->query_description,
         -query_length      => $result->query_length,
         -database_name     => $result->database_name,
         -database_letters  => $result->database_letters,
         -database_entries  => $result->database_entries,
         -algorithm         => $result->algorithm,
         -algorithm_version => $result->algorithm_version,
        );
    foreach my $param ( $result->available_parameters ) {
        $newresult->add_parameter($param,
                                  $result->get_parameter($param));
    }
    foreach my $stat ( $result->available_statistics ) {
        $newresult->add_statistic($stat,
                                  $result->get_statistic($stat));
    }
    while( my $hit = $result->next_hit ) {
        last if Math::BigFloat->new($hit->significance) > $cutoff;
        $newresult->add_hit($hit);
    }
    $out->write_result($newresult);
}
Other things covered by Bioperl



Parse outputs from various programs
Bio::Tools::Results::Sim4
Bio::Tools::GFF
Bio::Tools::Genscan, MZEF, GRAIL
Bio::Tools::Phylo::PAML, Bio::Tools::Phylo::Molphy
Bio::Tools::EPCR
(recent) Genewise, Genscan, Est2Genome, RepeatMasker



Things I'm skipping (here)
In detail: Bio::Annotation objects
Bio::Biblio - Bibliographic objects
Bio::Tools::CodonTable - represent codon tables
Bio::Tools::SeqStats - base-pair freq, dicodon freq, etc
Bio::Tools::SeqWords - count n-mer words in a sequence
Bio::SeqUtils - mixed helper functions
Bio::Restriction - find restriction enzyme sites and cut sequence
Bio::Variation - represent mutations, SNPs, any small variations of sequence



More useful things
Bio::Structure - parse/represent protein structure (PDB) data
Bio::Tools::Alignment::Consed - process Consed data
Bio::TreeIO, Bio::Tree - Phylogenetic Trees
Bio::MapIO, Bio::Map - genetic, linkage maps (rudiments)
Bio::Coordinate - transformations between coordinate systems
Bio::Tools::Analysis - web scraping



Bioperl can help you run things too
Namespace is Bio::Tools::Run
In separate CVS module bioperl-run since v1.2
EMBOSS, BLAST, TCoffee, Clustalw
SoapLab, PISE
Remote Blast searches at NCBI (Bio::Tools::Run::RemoteBlast)
Phylogenetic tools (PAML, Molphy, PHYLIP)
More utilities added on a regular basis for the BioPipe pipeline project, http://www.biopipe.org/



Other project off-shoots and integrations
Microarray data and objects (Allen Day)
BioSQL - relational db for sequence data (Hilmar Lapp, Chris Mungall, GNF)
Biopipe - generic pipeline setup (Elia Stupka, Shawn Hoon, Fugu-Sg)
GBrowse - genome browser (Lincoln Stein)



Acknowledgements
LOTS of people have made the toolkit what it is today.
The Bioperl AUTHORS list in the distro is a starting point.
Some people who really got the project started and kept it going:
Jason Stajich, Sendu Bala, Chris Field, Brian Osborne, Steven
Brenner, Ewan Birney, Lincoln Stein, Steve Chervitz, Ian Korf,
Chris Dagdigian, Hilmar Lapp, Heikki Lehväslaiho, Georg Fuellen
& Elia Stupka
