Posts
15
31
0
Sunday, August 10, 2014
De Bruijn Graphs

Remember doing word ladder puzzles? You have to change one word into another by changing a single letter at a time, always creating a new word. As an example from Wikipedia has it:

• COLD
• CORD
• CARD
• WARD
• WARM

A De Bruijn graph is sort of like that. But instead of changing a single letter at a time, you drop the first letter from word, and add a new letter at the end:

• DRAC
• RACH
• ACHI
• CHIN
• HINK

Okay, I cheated a little here. These aren’t actually words, just parts of words. But that’s really what makes De Bruijn graphs interesting; it isn’t so much that you can turn one word into another, it’s that you can attach words together based on their parts. Say we take a list of words and split them into their four-letter sequences (which we can call a 4-mer, mer being LatinGreek for “parts”, or a k-mer if we don’t care about the exact length.) If one of the words is “abscond” then we split it into the following 4-mers:

• absc
• bsco
• scon
• cond

Here’s a Python function to do that:

def kmers_in_word(w):
return [w[i:i + 4] for i in range(len(w) - 4)]

I got a list of English words from Mieliestronk and pulled all those with at least four letters into a Python list.

with open("corncob_lowercase.txt") as f:
s = (line.strip() for line in f)
word_list = [w for w in s if len(w) > 3]

Then I created a dictionary to list all of the kmers and the words they were in:

kmers = defaultdict(list)
for word in word_list:
for k in kmers_in_word(word):
kmers[k].append(word)

Now, we can represent the De Bruijn graph as a dictionary, with the keys as kmers and the values as a list of kmers that are connected by words in the dictionary.

debruijn = defaultdict(set)
for kmer, words in kmers.iteritems():
for word in words:
for k in kmers_in_word(word):
if kmer[1:4] == k[0:3]: