Posts
15
31
0
August 2014 Entries
De Bruijn Graphs

Remember doing word ladder puzzles? You have to change one word into another by changing a single letter at a time, always creating a new word. As an example from Wikipedia has it:

• COLD
• CORD
• CARD
• WARD
• WARM

A De Bruijn graph is sort of like that. But instead of changing a single letter at a time, you drop the first letter from word, and add a new letter at the end:

• DRAC
• RACH
• ACHI
• CHIN
• HINK

Okay, I cheated a little here. These aren’t actually words, just parts of words. But that’s really what makes De Bruijn graphs interesting; it isn’t so much that you can turn one word into another, it’s that you can attach words together based on their parts. Say we take a list of words and split them into their four-letter sequences (which we can call a 4-mer, mer being LatinGreek for “parts”, or a k-mer if we don’t care about the exact length.) If one of the words is “abscond” then we split it into the following 4-mers:

• absc
• bsco
• scon
• cond

Here’s a Python function to do that:

def kmers_in_word(w):
return [w[i:i + 4] for i in range(len(w) - 4)]

I got a list of English words from Mieliestronk and pulled all those with at least four letters into a Python list.

with open("corncob_lowercase.txt") as f:
s = (line.strip() for line in f)
word_list = [w for w in s if len(w) > 3]

Then I created a dictionary to list all of the kmers and the words they were in:

kmers = defaultdict(list)
for word in word_list:
for k in kmers_in_word(word):
kmers[k].append(word)

Now, we can represent the De Bruijn graph as a dictionary, with the keys as kmers and the values as a list of kmers that are connected by words in the dictionary.

debruijn = defaultdict(set)
for kmer, words in kmers.iteritems():
for word in words:
for k in kmers_in_word(word):
if kmer[1:4] == k[0:3]:

Here’s a partial picture of the resulting graph. The top is the kmer “rati”, which occurs in the words “ingratiate”, “gratification”, and “aristocratic”, among others. The connecting kmers, then, are “atia”, “atif”, and “atic”. Now “atia” is in the words “palatial”, “insatiable”, and “ingratiate”, and those three words in turn lead us to the kmers “tial”, “tiab”, and “tiat”. You can see that the graph will get big, fast!

I created this graph using Gephi. Code, and a way to export the graph into Gephi, are available on Github.

Author: Ben Fulton
posted @ Sunday, August 10, 2014 11:07 PM | Feedback (0)
Graph Diameter

Look at this graph. How wide is it?

It looks like S and V, or perhaps Y, are the nodes that are the farthest apart. If the numbers on the edges are accurate representations, the distance from S to Y and S to V should be the same, but if you add them up, you see that they aren’t. S to U to V sums up to 11, while S to X to Y is 7.

So our intuitive answer to the question of how wide the graph is is not quite right. We have to find the two nodes that are farthest apart. But if the graph has a loop, or cycle, you could just go around the loop forever and have two nodes that are infinitely far apart!

Let’s say instead that the distance between two nodes is always the shortest possible distance. Then the width, or diameter, of the graph is the longest, shortest distance between any two nodes.

We’ll represent our graph as a Python dictionary. The keys will be the names of the nodes, the values will be lists of 2-tuples, where the first value is the connecting node and the second value is the distance between the nodes. The graph in the picture will look like this:

graph = dict()
graph['S'] = [('U', 10), ('X', 5)]
graph['U'] = [('X', -2), ('V', 1)]
graph['V'] = [('Y', -5)]
graph['X'] = [('U', 3), ('Y', 2)]
graph['Y'] = [('V', 6), ('S', 7)]

The method for calculating this has been known for more than fifty years and is known as the Floyd-Warshall Algorithm. Here’s an example of implementing it in Python.

First, let’s create a dictionary to track the distances between any two nodes. Since we don’t know the distances between the nodes yet, we’ll set all the values to one million.

dist = defaultdict(lambda: defaultdict(lambda: 1000000))

Now let’s fill out the values we do know. We know that the distance from a node to itself is zero.

for v in graph.keys():
dist[v][v] = 0

We know the distances between the nodes that have direct paths. It could be that the direct path isn’t the shortest path, but it’s a place to start:

for u, values in graph.iteritems():
for v in values:
dist[u][v[0]] = v[1]

Note that it’s possible to define a graph where a distance from a node to itself is something other than zero. If it is, that will be picked up in the previous lines of code.

Now, we’ll go through and try to improve the distance between every pair of nodes. Say we look at nodes i and j. We’ll look at every node k, and figure out whether the distance from i to k to j is less than the distance from i to j that we’re already aware of.

for k in graph.keys():
for i in graph.keys():
for j in graph.keys():
if dist[i][j] > dist[i][k] + dist[k][j]:
dist[i][j] = dist[i][k] + dist[k][j]

After this triple loop, our dist dictionary will hold the shortest distances between any two nodes.

So what is the diameter of our graph?

print max(product(dist.keys(), dist.keys()), key=lambda x: dist[x[0]][x[1]])

('Y', 'U')

The shortest distance from Y to U is to go from Y to S to X to U. The edge lengths are 7, 5, and 3 respectively. So the diameter of our graph is 15.

Author: Ben Fulton
posted @ Monday, August 4, 2014 9:31 PM | Feedback (0)