Ben Fulton
A suffix array of a string is a list of suffixes of that string. For example, if the string is “ABCD”, the suffixes are “ABCD”, “BCD”, “CD”, and “D”. Not very interesting on its own, but what is useful is that you can track them by simply keeping track of the index of their starting character – 0,1,2, and 3. And that means you can fiddle with their order. You can sort them. In “ABCD” they are already sorted, but consider a string like “mississippi”. In sorting those suffixes, the final “i” is sorted to the front, while the first “s” is at the back. The order is
10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2
(For some idea of the usefulness of suffix arrays, see my suffix arrays blog post on my personal blog.)
Here’s an implementation of a suffix array function in Python:
def get_suffix_array(str):
return sorted(range(len(str)), key=lambda i: str[i:])
Simple enough, for short strings. But there are going to be problems as the string length increases. That key function is probably making a copy of the string slice to pass to the sorted function, and that is going to take a huge amount of space for a long string. I experimented a little by generating a random, 100,000 character string and calling this method on it; it used more than three gigabytes of memory and ran for almost four minutes before returning the suffix array. (Just for fun, I loaded up the million characters of Moby Dick and ran it on that. The whole operating system froze up. Eventually I got bored and hard-rebooted the machine.)
So for long strings, space will get to be a consideration. Python also supports a cmp argument for sorted, which takes two values and returns an integer indicating which value should be sorted first, but I understand this is being deprecated. This is unfortunate as it would make this problem much more tractable; but we can still do better, both in terms of total memory usage and speed. Python’s generic sorted function is probably built to handle any kind of data and to run in (n log n) time, but since we’re only handling one kind of data we can improve on that.
Here’s an idea that was suggested by Udi Manber and Gene Myers back in the 1990’s. It’s a variant on bucket sort, which is a sorting algorithm that operates by first looking only at the first character of each string, and putting those that share the same initial character into a bucket. Then it looks at each bucket for the second character in the string, and so on. This is a handy paradigm for us, since we don’t even have different strings to work on; we only have indices into our one string.
Here’s the idea: we’ll put each suffix into a bucket using its first character, so all the suffixes that start with ‘A’ will be in one bucket, and all the suffixes that start with ‘B’ will be in another, and so on. Obviously all the suffixes in the ‘A’ bucket will have a sort order before those in the ‘B’ bucket or ‘C’ bucket. It looks something like this:
def sort_bucket(str, bucket):
d = defaultdict(list)
for i in bucket:
key = str[i:i+1]
d[key].append(i)
def suffix_array_ManberMyers(str):
return sort_bucket(str, (i for i in range(len(str))))
Sort_bucket takes our string and a list of indices to examine. Our initial call to it will just pass in every index in the string – all of the suffixes as one big bucket.
D, then, is a dictionary of initial characters to suffix indices. Now what we need to do is call sort_bucket recursively to go through each bucket and sort the strings by second position, third position, etc. To make the whole thing a little more efficient, rather than index by one additional character each time, we’ll double the number of characters. The first recursive call will sort on strings of length two, the second call will sort on four, etc. We need to add an Order parameter to indicate the number of characters we want to sort on. Also, add code to return the sorted array and this is our final function:
def sort_bucket(str, bucket, order):
d = defaultdict(list)
for i in bucket:
key = str[i:i+order]
d[key].append(i)
result = []
for k,v in sorted(d.iteritems()):
if len(v) > 1:
result += sort_bucket(str, v, order*2)
else:
result.append(v[0])
return result
def suffix_array_ManberMyers(str):
return sort_bucket(str, (i for i in range(len(str))), 1)
Notice that sort_bucket now returns the list of sorted suffixes. If a bucket contains a single element, that is appended to the result; otherwise the method is called recursively and the sorted bucket is added to the result.
So how does it work? It turns out, quite nicely. If you want to do suffix arrays for “mississipi”, well, you’re better off using the initial implementation – it’s simpler and runs fairly quickly. But for longer strings…on my computer the tipping point appeared to be around 5000 characters. For a 10,000 character string, Manber-Myers was about three times as fast. And at 100,000 characters – the run that took more than four minutes – Manber-Myers still takes less than a second. And for Moby Dick? Manber-Myers was still under six seconds to run, and used about 50 megabytes worth of memory.
So it’s a pretty successful little algorithm. You can download this Suffix Array code on my GitHub page.
Now, 50M on a modern computer isn’t much. But that still represents 50 times the size of the book. If we want to scale up again, from the size of the book (1,200,000 bytes) to the size of a human genome (3,000,000,000 bytes) are we going to need a machine with 150 gigabytes of memory?
I sure hope not. I’ll try out some more ideas another time.