Bioinformatics Tutorial 1 - Answers
1. Transcribe the following DNA to RNA, then use the genetic code to translate it to a sequence of amino acids.

    TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

To transcribe the DNA, first substitute each DNA base for its counterpart (i.e., G for C, C for G, T for A and A for T):

    AGTATTATGCAAAACATAAGCGGTCGCGAAGCCACA

Next, remember that in RNA each thymine (T) base becomes a uracil (U). Hence our sequence becomes:

    AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA

Using the genetic code is also easy: just split the RNA sequence into triplets:

    AGU AUU AUG CAA AAC AUA AGC GGU CGC GAA GCC ACA

then look each triplet (codon) up in the genetic code table. So AGU becomes serine, which we can write as Ser, or just S. AUU becomes isoleucine (Ile), which we write as I. Carrying on in this way, we get:

    SIMQNISGREAT

Remove the first letter from the RNA sequence, and start again. Use this example to explain why mutations (including deletions and insertions) are usually deleterious to an organism.

Removing the first letter and splitting into codons again gives us:

    GUA UUA UGC AAA ACA UAA GCG GUC GCG AAG CCA CA

GUA translates to Val (V), UUA translates to Leu (L), UGC translates to Cys (C), AAA translates to Lys (K), ACA translates to Thr (T), and UAA translates to STOP. This gives us the sequence:

    VLCKT STOP

Continuing with the translation past the stop codon, we get:

    AVAKP

The final CA is not a complete codon, so it is not translated.

So, if the DNA sequence from which the RNA was transcribed were actually a gene, the length of its product would have been roughly halved (5 residues before the STOP instead of 12), in addition to every amino acid in the residue sequence changing. Given that a protein's function is largely dictated by its shape, and its shape is largely dictated by its residue sequence, it is not surprising that a random mutation such as a deletion can cause harm, or even death, to an organism.
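The transcription and translation steps above can be checked with a short Python sketch. The names `transcribe` and `translate` are illustrative choices rather than anything defined in the tutorial. The codon table is the standard genetic code, and a stop codon is written as `*` so that translation can carry on past it, as in the frameshift example.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position; '*' is STOP.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Complement each DNA base and write U instead of T, in a single lookup.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(dna):
    """Return the RNA obtained by complementing the DNA and replacing T with U."""
    return "".join(DNA_TO_RNA[base] for base in dna)


def translate(rna):
    """Translate complete codons only; a trailing one or two bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


rna = transcribe("TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT")
print(rna)                 # AGUAUUAUGCAAAACAUAAGCGGUCGCGAAGCCACA
print(translate(rna))      # SIMQNISGREAT
print(translate(rna[1:]))  # VLCKT*AVAKP (first base deleted, so the frame shifts)
```

Slicing with `rna[1:]` is all it takes to model the deletion: every codon boundary moves by one base, which is why every residue changes.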
2. What is the Hamming distance between these two strings? (Ignore the overlapping end.)

    BIOINFORMATICS_IS_THE
    BEST_FOR_STRUCTURE_PREDICTION

To calculate the Hamming distance, just count the number of pairs of letters in the alignment which are not the same. The first letter in both sequences is B, so we don't count that. However, the second letter is I in the first sequence but E in the second, so we must count this. If we put a star between the two sentences wherever they mismatch, we get:

    BIOINFORMATICS_IS_THE
     ****   ** **********
    BEST_FOR_STRUCTURE_PREDICTION

There are 16 stars, so the Hamming distance must be 16. Note that the underscores are counted as normal letters in the sequence.
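The counting above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it relies on `zip` stopping at the end of the shorter string, which is how the overhanging end of the second string gets ignored.

```python
def hamming_distance(a, b):
    """Count mismatched positions over the overlapping part of a and b."""
    return sum(x != y for x, y in zip(a, b))  # zip stops at the shorter string


def mismatch_markers(a, b):
    """Return a line with '*' under each mismatch and ' ' under each match."""
    return "".join("*" if x != y else " " for x, y in zip(a, b))


first = "BIOINFORMATICS_IS_THE"
second = "BEST_FOR_STRUCTURE_PREDICTION"

print(first)
print(mismatch_markers(first, second))
print(second)
print(hamming_distance(first, second))  # 16
```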
3. Using the BLOSUM62 substitution matrix, what is the best alignment of these two sequences? (Slide one over the other, and score -1 for end gaps, i.e., letters hanging over either end.)

    FYGNYK
    DGSFNW

To work out the best alignment, we have to write down all the ways to overlap these sequences and work out the BLOSUM score for each alignment, remembering to take off 1 for every gap (-). It's possible to use a heuristic: have a look to see if there are any obviously good overlaps. If we score these first, then it may become obvious that none of the others will give us a good score. All possible overlaps are given in the two boxes below with their scores. The best overlap is therefore the only one scoring a positive number (5):

    FYGNYK-
    -DGSFNW

Box 1 (second sequence slid to the right):

    first sequence    second sequence    score
    FYGNYK-----       -----DGSFNW        -11
    FYGNYK----        ----DGSFNW         -13
    FYGNYK---         ---DGSFNW           -8
    FYGNYK--          --DGSFNW           -10
    FYGNYK-           -DGSFNW              5
    FYGNYK            DGSFNW             -14

Box 2 (first sequence slid to the right):

    first sequence    second sequence    score
    -FYGNYK           DGSFNW-             -2
    --FYGNYK          DGSFNW--            -7
    ---FYGNYK         DGSFNW---           -4
    ----FYGNYK        DGSFNW----          -9
    -----FYGNYK       DGSFNW-----         -9
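The eleven overlaps and their scores can be regenerated with a small Python sketch. The matrix below is the standard BLOSUM62 table typed in by hand, so treat it as an assumption to be checked against your own copy. `slide_scores` and its `end_gap` parameter are illustrative names. A shift of `s` means the first letter of the second sequence sits under position `s` of the first sequence, with negative shifts sliding the first sequence to the right instead.

```python
BLOSUM62_TEXT = """
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""

# Parse the text table into a dictionary keyed by (row letter, column letter).
rows = BLOSUM62_TEXT.strip().splitlines()
columns = rows[0].split()
BLOSUM62 = {}
for row in rows[1:]:
    letter, *values = row.split()
    for column, value in zip(columns, values):
        BLOSUM62[letter, column] = int(value)


def slide_scores(a, b, end_gap=-1):
    """Score every ungapped overlap of a and b, keyed by shift.

    b[0] sits under a[shift]; each letter left without a partner costs end_gap.
    """
    scores = {}
    for shift in range(-(len(b) - 1), len(a)):
        pairs = [(a[i], b[i - shift]) for i in range(len(a)) if 0 <= i - shift < len(b)]
        overhang = len(a) + len(b) - 2 * len(pairs)  # letters hanging over the ends
        scores[shift] = sum(BLOSUM62[pair] for pair in pairs) + end_gap * overhang
    return scores


scores = slide_scores("FYGNYK", "DGSFNW")
best_shift = max(scores, key=scores.get)
print(best_shift, scores[best_shift])  # 1 5
```

Scoring all overlaps exhaustively is cheap here because there are only eleven of them; the heuristic of scoring the promising overlaps first matters more when working by hand.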
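4. What is the compositional complexity of these residue sequences?

    KKKKTRAITERMMMM and TRAITER

Remember that the formula we use for compositional complexity is the following:

    complexity = 1/L (logN(L!/(n1! * n2! * ... * nN!)))

where N is the number of letters in the alphabet.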
Note that L is the sequence length and the ni are the numbers of occurrences of each of the letters of the alphabet that can occur in the sequence. As our sequence is a residue sequence, there can only be twenty different letters in the sequence, which is why we take logs to the base 20. We'll work out the complexity of the longer sequence first.

To calculate the compositional complexity using this formula, we need to work out the values we will be putting into it. Firstly, we need the length, L, of the sequence, which is 15. Next, we need the number of occurrences of each letter in the sequence. For the letters that do not appear, this is obviously zero. For the rest: there are 4 Ks, 2 Ts, 2 Rs, 1 A, 1 I, 1 E and 4 Ms. So we can write:

    nK = 4, nT = 2, nR = 2, nA = 1, nI = 1, nE = 1 and nM = 4

Now we need to multiply together all the factorials of these numbers. 0! = 1, so we don't need to worry about the letters which aren't there, as we will just be multiplying by 1. Hence, we need to calculate:

    4! * 2! * 2! * 1! * 1! * 1! * 4! = 24 * 2 * 2 * 1 * 1 * 1 * 24 = 2304

We now divide L! by this number: 15!/2304 = 567567000, and take the log to the base 20 of this big number: log20(567567000) = 6.729. To do this calculation with your calculator, you may need to remember that logx(y) = ln(y)/ln(x), where ln(y) is the natural log of y, which your calculator should handle. We finish by dividing our value by the length of the sequence, 15. So finally, our answer is: 6.729/15 = 0.449.

The same calculation for TRAITER yields:

    1/7 (log20(7!/(2!*2!))) = 1/7 (log20(7!/4)) = 1/7 (log20(1260)) = 1/7 (2.383) = 0.340

Hence we see that the second sequence is less complex than the first.

Which of these base sequences do you think is more complex?

    AAAGTGTGTAAC and CCCCAGATAGGATT

What is the compositional complexity of these base sequences?
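The worked residue calculation can be checked with a short Python sketch. The function name `compositional_complexity` and its `alphabet_size` parameter are illustrative. Passing 20 matches the residue sequences above, and the same function should work for base sequences with an alphabet size of 4.

```python
from collections import Counter
from math import factorial, log


def compositional_complexity(sequence, alphabet_size):
    """Return (1/L) * log_N(L! / product of n_i!) for the given sequence."""
    length = len(sequence)
    arrangements = factorial(length)
    for count in Counter(sequence).values():
        # Letters that never occur contribute 0! = 1, so they can be skipped.
        # Each division is exact because multinomial coefficients are integers.
        arrangements //= factorial(count)
    return log(arrangements, alphabet_size) / length


for residues in ("KKKKTRAITERMMMM", "TRAITER"):
    print(residues, round(compositional_complexity(residues, 20), 3))
```

Integer division keeps the count of arrangements exact until the logarithm is taken, which avoids rounding error for longer sequences.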
It seems likely that the second sequence is more complex than the first, but we'll see what the calculations say. We can use the same formula as in the first part of the question, but we must remember to change the 20s to 4s, because we are now working with DNA sequences, which contain only four letters. The calculations come to:

    1/12 (log4(12!/(5!*1!*3!*3!))) = 1/12 (log4(12!/4320)) = 1/12 (log4(110880)) = 1/12 (8.379) = 0.698

for the first sequence, and:

    1/14 (log4(14!/(4!*4!*3!*3!))) = 1/14 (log4(14!/20736)) = 1/14 (log4(4204200)) = 1/14 (11.002) = 0.786

for the second sequence. So we see that our guess was correct: the second sequence is indeed more complex than the first.

5. Extend this matching sequence (the triple DYW) into a HSP as in the BLAST algorithm:

      CPAGNDYWMIHRLV
    WWCTGANDYWVMREH

What is the final BLOSUM62 score for the HSP?
To expand this triple into a HSP, we can first extend it to the left. Remember that we can continue expanding it only until the BLOSUM score for the whole HSP decreases. The current BLOSUM score of DYW:DYW is 24. In the BLOSUM matrix, N:N scores 6, so the HSP NDYW:NDYW scores 30, which is fine. G:A scores 0, so the score stays the same, which is still OK. A:G also scores 0, so we are still OK. However, P:T scores -1, hence the score for PAGNDYW:TGANDYW is 29, which is less than 30. Hence, we cut off our extension to the left before the P:T, giving us AGNDYW:GANDYW.

Working to the right now, we can extend past M:V, because this scores 1, taking the total to 31. We can go past I:M, because this also scores 1, taking the total to 32. H:R scores 0, so this is OK. R:E scores 0, so this is OK. Finally, L:H scores -3, so we cannot extend the HSP any further, and our final HSP looks like this:

    AGNDYWMIHR
    GANDYWVMRE

And it scores 32.
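The extension can be traced with a small Python sketch of the simplified rule used in this answer: keep extending while the next pair does not lower the running score. This is a teaching simplification, not the drop-off threshold that BLAST itself uses. The names `extend_seed` and `PAIR_SCORES` are illustrative, and the dictionary holds only the BLOSUM62 entries this example looks up, so any other pair would raise a `KeyError`.

```python
# BLOSUM62 entries needed for this example only (keys are alphabetically sorted pairs).
PAIR_SCORES = {
    ("D", "D"): 6, ("Y", "Y"): 7, ("W", "W"): 11, ("N", "N"): 6,
    ("A", "G"): 0, ("P", "T"): -1, ("M", "V"): 1, ("I", "M"): 1,
    ("H", "R"): 0, ("E", "R"): 0, ("H", "L"): -3,
}


def pair_score(a, b):
    """Look up a pair in either order, since the matrix is symmetric."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def extend_seed(query, subject, q_start, s_start, seed_length):
    """Extend a seed left, then right, stopping before the first pair that lowers the score."""
    q_lo, s_lo = q_start, s_start
    q_hi, s_hi = q_start + seed_length, s_start + seed_length  # exclusive ends
    total = sum(pair_score(a, b) for a, b in zip(query[q_lo:q_hi], subject[s_lo:s_hi]))

    # Extend to the left while the next pair does not reduce the total.
    while q_lo > 0 and s_lo > 0 and pair_score(query[q_lo - 1], subject[s_lo - 1]) >= 0:
        q_lo -= 1
        s_lo -= 1
        total += pair_score(query[q_lo], subject[s_lo])

    # Extend to the right under the same rule.
    while q_hi < len(query) and s_hi < len(subject) and pair_score(query[q_hi], subject[s_hi]) >= 0:
        total += pair_score(query[q_hi], subject[s_hi])
        q_hi += 1
        s_hi += 1

    return query[q_lo:q_hi], subject[s_lo:s_hi], total


query = "CPAGNDYWMIHRLV"
subject = "WWCTGANDYWVMREH"
seed = "DYW"
hsp = extend_seed(query, subject, query.find(seed), subject.find(seed), len(seed))
print(hsp)  # ('AGNDYWMIHR', 'GANDYWVMRE', 32)
```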