Posts
15
Comments
31
Trackbacks
0
De Bruijn Graphs

Ben Fulton

 

Remember doing word ladder puzzles? You have to change one word into another by changing a single letter at a time, always creating a new word. As an example from Wikipedia has it:

  • COLD
  • CORD
  • CARD
  • WARD
  • WARM

 A De Bruijn graph is sort of like that. But instead of changing a single letter at a time, you drop the first letter from word, and add a new letter at the end:

  • DRAC
  • RACH
  • ACHI
  • CHIN
  • HINK
abscond

Okay, I cheated a little here. These aren’t actually words, just parts of words. But that’s really what makes De Bruijn graphs interesting; it isn’t so much that you can turn one word into another, it’s that you can attach words together based on their parts. Say we take a list of words and split them into their four-letter sequences (which we can call a 4-mer, mer being LatinGreek for “parts”, or a k-mer if we don’t care about the exact length.) If one of the words is “abscond” then we split it into the following 4-mers:

  • absc
  • bsco
  • scon 
  • cond

Here’s a Python function to do that:

def kmers_in_word(w):
    return [w[i:i + 4] for i in range(len(w) - 4)]

I got a list of English words from Mieliestronk and pulled all those with at least four letters into a Python list.

with open("corncob_lowercase.txt") as f:
    s = (line.strip() for line in f)
    word_list = [w for w in s if len(w) > 3]

Then I created a dictionary to list all of the kmers and the words they were in:

kmers = defaultdict(list)
for word in word_list:
    for k in kmers_in_word(word):
        kmers[k].append(word)

Now, we can represent the De Bruijn graph as a dictionary, with the keys as kmers and the values as a list of kmers that are connected by words in the dictionary.

debruijn = defaultdict(set)
for kmer, words in kmers.iteritems():
    for word in words:
        for k in kmers_in_word(word):
            if kmer[1:4] == k[0:3]:
                debruijn[kmer].add(k)

 

debriujnHere’s a partial picture of the resulting graph. The top is the kmer “rati”, which occurs in the words “ingratiate”, “gratification”, and “aristocratic”, among others. The connecting kmers, then, are “atia”, “atif”, and “atic”. Now “atia” is in the words “palatial”, “insatiable”, and “ingratiate”, and those three words in turn lead us to the kmers “tial”, “tiab”, and “tiat”. You can see that the graph will get big, fast!

I created this graph using Gephi. Code, and a way to export the graph into Gephi, are available on Github.

posted on Sunday, August 10, 2014 11:07 PM Print
test