Lab 7

Overview

In Lab 5, we extracted all sequences in a Trie within a distance to a target word. In this lab, however, you are doing the reverse, where you are given two words and interested in the [minimum] distance between them, which is also equivalent to the [maximum] similarity between them. There will be no homework or coding associated with this lab.

The story of molecular evolutionary clock

Hemoglobin is an important protein that carries oxygen in our bodies as well as in other living beings. In the 60s, two scientists by the names of Pauling and Zuckerkandl devised a method, then called comparative fingerprinting, to show that human hemoglobin is much more different than that of horse, but more similar to that of gorilla, providing information to help speculate about evolutionary history of these species, humans, horses, and gorillas. See the whole story here.

Fingerprinting

There are a total of 20 amino acids found in nature and all proteins are made up of stream of amino acids. Here is an exhaustive list of the twenty amino acids: {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}. The fingerprinting method the two scientists used in identifying the correct similarity measures involved accounting for insertions, deletions, and substitutions of amino acids between the protein sequences. The hypothesis was that the more dissimilar two protein sequences were, the earlier in the long past their corresponding host organisms had deviated from each other. The fingerprinting was done manually. This was a very tedious task. To find the correct similarity measure between any two sequences, one would have to find minimum number of changes that it takes from one protein to transit to the next. This is an enormous number of possibilities. Today, the fingerprinting method can be automated and performed much faster, thanks to dynamic programming. Below shows the amino-acid differences between hemoglobin protein sequences of different species. Image taken from Zaldivar et al (2017).

Notice how there are dashes between some sequences. If we consider only two sequences, one can interpret a dash in the first sequence (i.e., the original sequence) as an insertion of an amino acid that apears in the second sequence at that corresponding location. A dash in the second sequence (but not in the first sequence) is indicative of a deletion action of the amino acid in the first sequence, which doesn’t exist in the second sequence. Substitution is simply when we have different amino acids in an identical position. Hence, exact matches, insertions, deletions, and substitutions can all be shown via sequences and dashes and vice versa. i.e., for each series of actions, there is a unique letter-dash representation and for each representation, there is a unique interpretation of what is deleted and what is inserted, etc… If we had taken the second sequence as the original one, we would have to toggle our interpretations of deletions and insertions but essentially would be describing the same events in both cases, making the representation symmetric on choice of the original sequence. By the way, if we consider two sequence only, we can never have a gap in the same position on both sequences, which is an ambiguous event! (the only reason the above image constains such cases is because some other sequence in the image has an amino acid in that position).

Goal

Our goal in this lab is to find the maximum similarity (minimum distance) between two given sequences. Unlike the above image, which considers all sequences at the same time, we are only concerned with finding the maximum similarity betwen TWO sequences at a time and NOT ALL simultaneously. Given sequences p1 and p2, we should consider all possible transitions (insetion, deletion, substitution, exact matches) that p1 undergoes to transit to p2. We are then interested in the transition scenario, i.e., those sequences of actions, that results in minimum distance.

Why not just use a Trie?

You might have noticed by now the resemblence of this question to the Trie question. In fact, a Trie can be used to solve this question as well. The solution, however, would not be efficient. The solution of using a Trie would have been to choose p1 as the target word and insert p2 into the Trie (or vice versa) and run the algorithm you had already developed to traverse the Trie. The tolerance parameter would have to be set to a sufficiently high value not to miss the actual minimum distance, such as m+n, lengths of p1 and p2, respectively. The function suggestions would then add (p2,dist) to the words HashMap, settling for the minium distance traveral of that only word in the Trie. We would also have to make a minor change to the function to return number of deletions (d), insertions (i), and substitutions (s), separately. For instance, dist=[d,i,s] should be a vector. Subsequently, similarity would be lengh(p1) - (d+s) (or equivalently, lengh(p2) - (i+s)) would be the maximum similarity. In fact, the suggestions function basically considers all combinations of deletions, instertions, and substitutions possible, and finds the lowest distance, dist=d+i+s corresponding to a specific transition scenario, i.e., a specific transition of p1 to p2. The poroblem with this approach is that as sequences get longer, so does the corresponding tolerance, and as a result the worst-case time complexity would reach order of factorial of sequence lengths, which is unfeasible! In this lab, however, we will approach this problem from a different angle.

Step 0: Creating a markdown file for writeups

One person in your team should visit https://demo.hedgedoc.org/new to create an empty document
Share the URL of the document created with the others in your team (e.g., email them the link in the address bar). When someone else visits the URL, they are able to collaboratively edit the document.
Each person should copy and paste the URL of the document into the Moodle submission box for the “Lab 7” assignment. Do this right away, even before writing in the document, so that your course account has a link to your work.

Step 1: All possible transitions

You are given two protein sequences, p1 and p2.

p1: AARADCE
p2: AVAAD

There are many possible transitions from p1 to p2. Here are some examples of transitions:

p1: AARADCE
p2: AVAAD--     Similarity = 3 (distance = 4). First position of p1 is unchanged, second position is altered from A to V. Third position is altered from R to A, Forth and fifth positions of p1 are unchanged. Sixth and seventh positions of p1 are deleted in p2. Similarity is the number of locations that are unchanged.

p1: A-ARADCE
p2: AVA-A-D-    Similarity = 3 (distance = 5). First position of p1 is unchanged. Between first and second amino acids of p1, a V is inserted in p2. Second position of p1 is unchanged (appears as third position of p2). Third position of p1 is deleted. Forth position, unchanged. Fifth position is deleted. Sixth, substitued for D, seventh, deleted.

p1: AAR-ADCE
p2: AVAA---D    Similarity = 1 (distance = 7)

p1: AARADCE-----
p2: -------AVAAD     Similarity = 0 (distance = 12)

p1: AARADCE----
p2: -A-----VAAD     Similarity = 1 (distance = 10)

p1: AARADC-E
p2: AVAAD---     This is invalid (not counted as a transition), since there two gaps in the same position! 

Notice how that similarity is just the number of matches on the corresponding positions on the letter-dash representation! Ideally, to find maximum similarity, i.e., minimum number of insertions, deletions, and substitutions, one would have to exhaustively consider all possibilities of transitions. There are a total of (m+n)!/(m!*n!) possible ways for transitions, where m and n are sequences lengths. In the above example, plugging numbers in, gives 792 unique different ways to write the transitions between above p1 and p2! We provided 5 examples of transitions. Can you come up with five other examples, preferably with different similarity values, and their corresponding similarity measure and check answer with your TA? Don’t worry, there are a total of 792 - 5 = 787 possibilities to choose from!

Step 2: Recurrence relation

Write down a recurrence relation that encompasses all the possible ways transitions occur and returns the maximum similarity score. This is NOT the recurrence relation of runtime T(n), but the recurrence relation of the algorithm.

Let function D(i,j) be the similarity score of the best overall transition found between the first i letters of p1 and the first j letters of p2. Write your recurrence based on some otherD(...,...) terms. Of course, D(i=7,j=5) should return the desired value.

Where needed, call the sim(p1[i],p2[j]) function that returns value 1 if characters p1[i]==p2[j] and 0 otherwise. Consider sequences as zero indexed. i.e., p1[0] = 'A'.

Your recurrence relation should be something like this D(i,j) = max {few sub-problems that call the D function on smaller subsequences plus their own last steps}. D(7,5) for instance will return the maximum similarity between p1 and p2 by considering all the 792 possibilities but with the help of D(i',j') values representing solutions to smaller sequences! Hint: In writing the recurrence relation Think about how a transition scenario is constructed. Consider all possible last steps (transition or not) that could have occurred at the end of the two transition scenarios (take inspiration from Step 1, here). Remember that you do not need to report the actual transition scenario, but only its score.

Step 3: Dynamic Programming

The above recurrence relation, if implemented via recursion would have the high time complexity we mentioned, i.e., O(n!). The reason is that there is a lot of redundancy in the recursive implementation.

The following is a Dynamic Programming (DP) implementation of the above maximum similarity problem:

D = int[m+1,n+1]
D[0:m,0] = 0
D[0,0:n] = 0
for i = 1 to m 
    for j = 1 to n     
        D[i,j] = max {D[i-1,j], D[i,j-1], D[i-1,j-1] + sim(p1[i-1],p1[j-1])}	 
return D[m,n]

Tracing through the above code, complete filling the two-dimensional grid D, where size of D is m+1 by n+1. Then, return the value at the last location D[m,n], shown by ? as the final answer. The amino acids of p1 and p2 are mentioned as the column and row headers to give more hints about what the DP algorithm.

D:

Fill the above table using your DP pseudocode and include it in your writeup.
Report the time and space complexity of the above DP implementation.

Step 4: Retrieving the transition scenario

So far, you could compute the highest similarity between two sequences p1, and p2. However, the actual alignment that led to the high score is not reported. Think of a scheme, where after obtaining highest similarity, you can construct the actual transition scenario corresponding to highest similarity.

Hint: Construct the two transitions p1_al, and p2_al from the end to the beginning. For instance, if highest similarity was let say 3, then scenario p1_al = A-ARADCE and p2_al = AVA-A-D- could have been returned. Explain you strategy in a few lines. No need to provide pseudocode but do explain about any additional data structures you used.

Step 5: Submit the writeup

Each group member should submit the same markdown file to Moodle. In the submission comment add a link to the hedgedoc file and also indicate who you worked with.