Hamming Distance

Problem:

Calculate the number of positions at which two homologous DNA strands have different bases.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance

Sample Dataset:

GAGCCT

CATCGT

Sample Output:

3

Solution:

The principle behind the Hamming distance is really simple: you compare two strings of equal length and count the positions at which they differ. For example:

Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTC

In this case the Hamming distance is 2 because, comparing the two lines of text position by position, only two positions differ. Simple, isn't it?

Evolution is driven in part by the accumulation of small mutations over time. The Hamming distance between two homologous DNA strands from species with a recent common ancestor gives the minimum number of point mutations (single-base substitutions) that could separate the two strands. This is a theoretical estimate: it ignores insertions, deletions, and positions that mutated more than once, so it may not match what really happened, but it is a start!

It's really easy to calculate the Hamming distance by hand for short sequences, like the example we are going to do, but for longer ones we need code that can find it for us automatically.

Our file looks like this:

GAGCCT
CATCGT

So let's print our file:

file = open("rosalind-hamm.txt", "r")
print(file.readlines())
file.close()

The output we get is:

['GAGCCT\n', 'CATCGT']

So each line becomes a list item. The \n you see denotes a newline character. It is annoying to have it there, so we will use the strip() method to get rid of it.

So let's write a function that breaks down our Rosalind file line by line into a list, with all the annoying \n characters removed:
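As a quick check of what strip() actually removes, here is a small sketch. The padded string is made up for illustration; the point is that strip() removes all leading and trailing whitespace (spaces and tabs as well as newlines), while rstrip("\n") removes only trailing newlines.

```python
line = "  GAGCCT \n"

stripped = line.strip()         # removes whitespace from both ends
no_newline = line.rstrip("\n")  # removes only the trailing newline

print(repr(stripped))    # 'GAGCCT'
print(repr(no_newline))  # '  GAGCCT '
```

For Rosalind files either would do, since the lines have no extra spaces around them.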

def read_file(file_path):
    # Read the file and return its lines with surrounding whitespace removed
    with open(file_path, "r") as file:
        return [line.strip() for line in file.readlines()]

strings = read_file("rosalind-hamm.txt")
print(strings)

Most of the code above should be familiar from last time, when we calculated GC content. Your output should now look like this:

['GAGCCT', 'CATCGT']

There, so much prettier! Now let's split our two list items into two separate sequences, since we are only comparing two strings:

def read_file(file_path):
    with open(file_path, "r") as file:
        return [line.strip() for line in file.readlines()]

strings = read_file("rosalind-hamm.txt")

# The first line is one sequence, the second line is the other
s1 = strings[0]
s2 = strings[1]

Now we want to compare these two strings. There are many ways of doing this, but I'll go with the built-in zip() function. zip() takes two iterables and pairs up their corresponding items:

a = ["red", "yellow", "green"]
b = ["apple", "banana", "grapes"]
x = zip(a, b)
print(tuple(x))

The output is:

(('red', 'apple'), ('yellow', 'banana'), ('green', 'grapes'))
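zip() works on any iterable, and a string is iterable one character at a time, so the same idea applies to DNA strings. A small sketch using the sample strings from this problem (the name pairs is just for illustration):

```python
s1 = "GAGCCT"
s2 = "CATCGT"

# Pair up the characters that sit at the same position in each string.
pairs = list(zip(s1, s2))
print(pairs)
# [('G', 'C'), ('A', 'A'), ('G', 'T'), ('C', 'C'), ('C', 'G'), ('T', 'T')]
```

Keep in mind that zip() stops at the end of the shorter input. That is fine here because the problem guarantees two strings of equal length.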

Pairing items this way makes it easy to compare corresponding items from two sequences, which is exactly what we will do with our two DNA strings:

def read_file(file_path):
    with open(file_path, "r") as file:
        return [line.strip() for line in file.readlines()]

strings = read_file("rosalind-hamm.txt")
s1 = strings[0]
s2 = strings[1]

# Count the positions where the two strings differ
count = 0
for (a, b) in zip(s1, s2):
    if a != b:
        count += 1

The code above compares corresponding letters of the two strings and adds 1 to the count variable each time they differ. So if the first base of s1 was "A" and the first base of s2 was also "A", the pair (a, b) would match and count would stay at zero. The variable is called count rather than sum because sum is also the name of a Python built-in function, which the one-line version below relies on. The same loop can be written in one line with a generator expression:

print(sum(a != b for (a, b) in zip(s1, s2)))
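The one-liner works because each comparison a != b is either True or False, and Python counts True as 1 and False as 0 when adding. A minimal sketch that makes the intermediate step visible, using the sample strings (the name mismatches is just for illustration):

```python
s1 = "GAGCCT"
s2 = "CATCGT"

# One True/False value per position: True where the letters differ.
mismatches = [a != b for a, b in zip(s1, s2)]

print(mismatches)       # [True, False, True, False, True, False]
print(sum(mismatches))  # 3
```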

The hard part is over and now we can simply reformat it into a pretty function:

def hamming_distance(file_path):
    def read_file(path):
        with open(path, "r") as file:
            return [line.strip() for line in file.readlines()]

    # The first two lines of the file are the two sequences
    strings = read_file(file_path)
    s1 = strings[0]
    s2 = strings[1]

    # Count the positions where the two sequences differ
    print(sum(a != b for (a, b) in zip(s1, s2)))

hamming_distance("rosalind-hamm.txt")

And that's it:

Output:

3

Run your code on the dataset Rosalind gives you, paste the output into the Rosalind answer box, and submit. Annnd get yourself a cookie!
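If you want to reuse the calculation outside of this exercise, one possible variation is to separate the counting from the file reading and return the number instead of printing it. The sketch below is only an illustration: the function name hamming and the length check are assumptions added here, not part of the Rosalind problem, which already promises two strings of equal length.

```python
def hamming(s, t):
    """Return the number of positions at which s and t differ."""
    # zip() stops at the shorter input, so check the lengths first
    # rather than silently ignoring the extra characters.
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


print(hamming("GAGCCT", "CATCGT"))  # 3
```

Returning the value makes the function easy to test and to call from other code, and printing becomes the caller's decision.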