Reconstructing words from fragments

Ben Fulton


Suppose I give you a set of kmers. (For a definition of kmers, see my article on De Bruijn graphs.) How would you determine the words that they came from?

Actually, let’s make it a little easier and say that the kmers came from a known list of words, and we have to determine the frequencies of the words. Let’s start with a list of words I’ve used before: ['ingratiate', 'gratification', 'aristocratic', 'aromaticity', 'palatial', 'beatific', 'beatify', 'latices', 'insatiable', 'artifice', 'beatified', 'certifiable']

data (scrabble)

Now suppose our list of kmers consists of ‘tify’ and ‘fiab’. We can say without question that the first one came from ‘beatify’ and the second one was from ‘certifiable’, since those are the only two instances of those kmers in our whole list. But suppose ‘able’ was in the list – there’s no way of knowing whether it came from ‘insatiable’ or ‘certifiable’.

This is an interesting problem because it’s an approximation of a problem in biology called RNA Sequencing. At any given time a cell has a bunch of strands of RNA floating around inside it, and exactly which strands are there and how many of them there are has a lot to do with the behavior of the cell. But our technology hasn’t advanced to the point where we can actually tell what RNA strands are in the cell – the best we can do is break them up and get parts of them, rather like our kmers. So we try to reconstruct the strands from the short reads that we have available to us.


Now suppose we have a hundred kmers in our list, so it looks like['ific', 'arom', 'tocr', 'tice', 'sati', 'arom',…], etc. How do we determine the original set of words in the list?

The easiest thing to do is simply to guess. For each kmer, we’ll look to see which words it occurs in, and then choose one at random. Here’s that routine:

def assign_reads(kmers):
    d = defaultdict(list)
    for read in kmers:
        words = [word for word in all_words if read in word]
    return d

Now since we’ve assigned each kmer to a word, we can just summarize the expected probabilities of the original word list we’re trying to reconstruct.

def get_frequencies(assigned_reads):
    total = sum(len(reads) for reads in assigned_reads.values())
    rmc = dict()
    for word, reads in assigned_reads.iteritems():
        rmc[word] = len(reads) / total
    return rmc

Here are some of the results you might get out of a routine like this:

  • palatial: 0.03
  • ingratiate: 0.15
  • aristocratic: 0.2

From this we’ll guess that the original source data consisted of 20% “aristocratic”, 15% “ingratiate”, and 3% “palatial”. This is pretty close to correct, as I created the source by using one instance of each word in the original list, except for replacing both “certifiable” and “gratification” with “aristocratic”. So simply by assigning each kmer at random to a word that contains it, we can create a rough approximation of the original frequencies of the words in our list. Code is available on Github.



posted on Wednesday, December 31, 2014 11:10 PM Print