Posts
15
31
0
Wednesday, December 31, 2014
Reconstructing words from fragments

Suppose I give you a set of kmers. (For a definition of kmers, see my article on De Bruijn graphs.) How would you determine the words that they came from?

Actually, let’s make it a little easier and say that the kmers came from a known list of words, and we have to determine the frequencies of the words. Let’s start with a list of words I’ve used before: ['ingratiate', 'gratification', 'aristocratic', 'aromaticity', 'palatial', 'beatific', 'beatify', 'latices', 'insatiable', 'artifice', 'beatified', 'certifiable'] Now suppose our list of kmers consists of ‘tify’ and ‘fiab’. We can say without question that the first one came from ‘beatify’ and the second one was from ‘certifiable’, since those are the only two instances of those kmers in our whole list. But suppose ‘able’ was in the list – there’s no way of knowing whether it came from ‘insatiable’ or ‘certifiable’.

This is an interesting problem because it’s an approximation of a problem in biology called RNA Sequencing. At any given time a cell has a bunch of strands of RNA floating around inside it, and exactly which strands are there and how many of them there are has a lot to do with the behavior of the cell. But our technology hasn’t advanced to the point where we can actually tell what RNA strands are in the cell – the best we can do is break them up and get parts of them, rather like our kmers. So we try to reconstruct the strands from the short reads that we have available to us.

## Estimation

Now suppose we have a hundred kmers in our list, so it looks like['ific', 'arom', 'tocr', 'tice', 'sati', 'arom',…], etc. How do we determine the original set of words in the list?

The easiest thing to do is simply to guess. For each kmer, we’ll look to see which words it occurs in, and then choose one at random. Here’s that routine:

d = defaultdict(list)
words = [word for word in all_words if read in word]
return d

Now since we’ve assigned each kmer to a word, we can just summarize the expected probabilities of the original word list we’re trying to reconstruct.

rmc = dict()
return rmc

Here are some of the results you might get out of a routine like this:

• palatial: 0.03
• ingratiate: 0.15
• aristocratic: 0.2

From this we’ll guess that the original source data consisted of 20% “aristocratic”, 15% “ingratiate”, and 3% “palatial”. This is pretty close to correct, as I created the source by using one instance of each word in the original list, except for replacing both “certifiable” and “gratification” with “aristocratic”. So simply by assigning each kmer at random to a word that contains it, we can create a rough approximation of the original frequencies of the words in our list. Code is available on Github.

posted @ Wednesday, December 31, 2014 11:10 PM | Feedback (1)
Sunday, August 10, 2014
De Bruijn Graphs

Remember doing word ladder puzzles? You have to change one word into another by changing a single letter at a time, always creating a new word. As an example from Wikipedia has it:

• COLD
• CORD
• CARD
• WARD
• WARM

A De Bruijn graph is sort of like that. But instead of changing a single letter at a time, you drop the first letter from word, and add a new letter at the end:

• DRAC
• RACH
• ACHI
• CHIN
• HINK Okay, I cheated a little here. These aren’t actually words, just parts of words. But that’s really what makes De Bruijn graphs interesting; it isn’t so much that you can turn one word into another, it’s that you can attach words together based on their parts. Say we take a list of words and split them into their four-letter sequences (which we can call a 4-mer, mer being LatinGreek for “parts”, or a k-mer if we don’t care about the exact length.) If one of the words is “abscond” then we split it into the following 4-mers:

• absc
• bsco
• scon
• cond

Here’s a Python function to do that:

def kmers_in_word(w):
return [w[i:i + 4] for i in range(len(w) - 4)]

I got a list of English words from Mieliestronk and pulled all those with at least four letters into a Python list.

with open("corncob_lowercase.txt") as f:
s = (line.strip() for line in f)
word_list = [w for w in s if len(w) > 3]

Then I created a dictionary to list all of the kmers and the words they were in:

kmers = defaultdict(list)
for word in word_list:
for k in kmers_in_word(word):
kmers[k].append(word)

Now, we can represent the De Bruijn graph as a dictionary, with the keys as kmers and the values as a list of kmers that are connected by words in the dictionary.

debruijn = defaultdict(set)
for kmer, words in kmers.iteritems():
for word in words:
for k in kmers_in_word(word):
if kmer[1:4] == k[0:3]: Here’s a partial picture of the resulting graph. The top is the kmer “rati”, which occurs in the words “ingratiate”, “gratification”, and “aristocratic”, among others. The connecting kmers, then, are “atia”, “atif”, and “atic”. Now “atia” is in the words “palatial”, “insatiable”, and “ingratiate”, and those three words in turn lead us to the kmers “tial”, “tiab”, and “tiat”. You can see that the graph will get big, fast!

I created this graph using Gephi. Code, and a way to export the graph into Gephi, are available on Github.

posted @ Sunday, August 10, 2014 11:07 PM | Feedback (0)
Monday, August 4, 2014
Graph Diameter

Look at this graph. How wide is it? It looks like S and V, or perhaps Y, are the nodes that are the farthest apart. If the numbers on the edges are accurate representations, the distance from S to Y and S to V should be the same, but if you add them up, you see that they aren’t. S to U to V sums up to 11, while S to X to Y is 7.

So our intuitive answer to the question of how wide the graph is is not quite right. We have to find the two nodes that are farthest apart. But if the graph has a loop, or cycle, you could just go around the loop forever and have two nodes that are infinitely far apart!

Let’s say instead that the distance between two nodes is always the shortest possible distance. Then the width, or diameter, of the graph is the longest, shortest distance between any two nodes.

We’ll represent our graph as a Python dictionary. The keys will be the names of the nodes, the values will be lists of 2-tuples, where the first value is the connecting node and the second value is the distance between the nodes. The graph in the picture will look like this:

graph = dict()
graph['S'] = [('U', 10), ('X', 5)]
graph['U'] = [('X', -2), ('V', 1)]
graph['V'] = [('Y', -5)]
graph['X'] = [('U', 3), ('Y', 2)]
graph['Y'] = [('V', 6), ('S', 7)]

The method for calculating this has been known for more than fifty years and is known as the Floyd-Warshall Algorithm. Here’s an example of implementing it in Python. First, let’s create a dictionary to track the distances between any two nodes. Since we don’t know the distances between the nodes yet, we’ll set all the values to one million.

dist = defaultdict(lambda: defaultdict(lambda: 1000000))

Now let’s fill out the values we do know. We know that the distance from a node to itself is zero.

for v in graph.keys():
dist[v][v] = 0

We know the distances between the nodes that have direct paths. It could be that the direct path isn’t the shortest path, but it’s a place to start:

for u, values in graph.iteritems():
for v in values:
dist[u][v] = v

Note that it’s possible to define a graph where a distance from a node to itself is something other than zero. If it is, that will be picked up in the previous lines of code.

Now, we’ll go through and try to improve the distance between every pair of nodes. Say we look at nodes i and j. We’ll look at every node k, and figure out whether the distance from i to k to j is less than the distance from i to j that we’re already aware of.

for k in graph.keys():
for i in graph.keys():
for j in graph.keys():
if dist[i][j] > dist[i][k] + dist[k][j]:
dist[i][j] = dist[i][k] + dist[k][j]

After this triple loop, our dist dictionary will hold the shortest distances between any two nodes.

So what is the diameter of our graph?

print max(product(dist.keys(), dist.keys()), key=lambda x: dist[x][x])

('Y', 'U')

The shortest distance from Y to U is to go from Y to S to X to U. The edge lengths are 7, 5, and 3 respectively. So the diameter of our graph is 15.

posted @ Monday, August 4, 2014 9:31 PM | Feedback (0)
Sunday, June 30, 2013
Suffix Arrays

A suffix array of a string is a list of suffixes of that string. For example, if the string is “ABCD”, the suffixes are “ABCD”, “BCD”, “CD”, and “D”. Not very interesting on its own, but what is useful is that you can track them by simply keeping track of the index of their starting character – 0,1,2, and 3. And that means you can fiddle with their order. You can sort them. In “ABCD” they are already sorted, but consider a string like “mississippi”. In sorting those suffixes, the final “i” is sorted to the front, while the first “s” is at the back. The order is

10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2

(For some idea of the usefulness of suffix arrays, see my suffix arrays blog post on my personal blog.)

Here’s an implementation of a suffix array function in Python:

`    def get_suffix_array(str):      return sorted(range(len(str)), key=lambda i: str[i:]) `

Simple enough, for short strings. But there are going to be problems as the string length increases. That key function is probably making a copy of the string slice to pass to the sorted function, and that is going to take a huge amount of space for a long string. I experimented a little by generating a random, 100,000 character string and calling this method on it; it used more than three gigabytes of memory and ran for almost four minutes before returning the suffix array. (Just for fun, I loaded up the million characters of Moby Dick and ran it on that. The whole operating system froze up. Eventually I got bored and hard-rebooted the machine.)

So for long strings, space will get to be a consideration. Python also supports a cmp argument for sorted, which takes two values and returns an integer indicating which value should be sorted first, but I understand this is being deprecated. This is unfortunate as it would make this problem much more tractable; but we can still do better, both in terms of total memory usage and speed. Python’s generic sorted function is probably built to handle any kind of data and to run in (n log n) time, but since we’re only handling one kind of data we can improve on that.

Here’s an idea that was suggested by Udi Manber and Gene Myers back in the 1990’s. It’s a variant on bucket sort, which is a sorting algorithm that operates by first looking only at the first character of each string, and putting those that share the same initial character into a bucket. Then it looks at each bucket for the second character in the string, and so on. This is a handy paradigm for us, since we don’t even have different strings to work on; we only have indices into our one string.

Here’s the idea: we’ll put each suffix into a bucket using its first character, so all the suffixes that start with ‘A’ will be in one bucket, and all the suffixes that start with ‘B’ will be in another, and so on. Obviously all the suffixes in the ‘A’ bucket will have a sort order before those in the ‘B’ bucket or ‘C’ bucket. It looks something like this:

```    def sort_bucket(str, bucket):
d = defaultdict(list)
for i in bucket:
key = str[i:i+1]
d[key].append(i)
def suffix_array_ManberMyers(str):    return sort_bucket(str, (i for i in range(len(str)))) ```

Sort_bucket takes our string and a list of indices to examine. Our initial call to it will just pass in every index in the string – all of the suffixes as one big bucket.

D, then, is a dictionary of initial characters to suffix indices. Now what we need to do is call sort_bucket recursively to go through each bucket and sort the strings by second position, third position, etc. To make the whole thing a little more efficient, rather than index by one additional character each time, we’ll double the number of characters. The first recursive call will sort on strings of length two, the second call will sort on four, etc. We need to add an Order parameter to indicate the number of characters we want to sort on. Also, add code to return the sorted array and this is our final function:

```    def sort_bucket(str, bucket, order):
d = defaultdict(list)
for i in bucket:
key = str[i:i+order]
d[key].append(i)
result = []
for k,v in sorted(d.iteritems()):
if len(v) > 1:
result += sort_bucket(str, v, order*2)
else:
result.append(v)
return result

def suffix_array_ManberMyers(str):
return sort_bucket(str, (i for i in range(len(str))), 1) ```

Notice that sort_bucket now returns the list of sorted suffixes. If a bucket contains a single element, that is appended to the result; otherwise the method is called recursively and the sorted bucket is added to the result.

So how does it work? It turns out, quite nicely. If you want to do suffix arrays for “mississipi”, well, you’re better off using the initial implementation – it’s simpler and runs fairly quickly. But for longer strings…on my computer the tipping point appeared to be around 5000 characters. For a 10,000 character string, Manber-Myers was about three times as fast. And at 100,000 characters – the run that took more than four minutes – Manber-Myers still takes less than a second. And for Moby Dick? Manber-Myers was still under six seconds to run, and used about 50 megabytes worth of memory.

So it’s a pretty successful little algorithm.  You can download this Suffix Array code on my GitHub page.

Now, 50M on a modern computer isn’t much. But that still represents 50 times the size of the book. If we want to scale up again, from the size of the book (1,200,000 bytes) to the size of a human genome (3,000,000,000 bytes) are we going to need a machine with 150 gigabytes of memory?

I sure hope not. I’ll try out some more ideas another time.

posted @ Sunday, June 30, 2013 7:59 PM | Feedback (2)
Friday, March 29, 2013
The Expectation Maximization Algorithm

Conceptually, the expectation maximization algorithm is simple: you take a guess at the values you want to find, then run your calculations as though they were correct. Then, use those results to improve your guess.

What is the expectation maximization algorithm?

That article does a good job of explaining the algorithm in more detail, but I couldn’t make sense of the example. So I wrote some code to help me figure out.

The example is like this: You have two coins. They’re not exactly balanced, and you want to know how heavily weighted they are. The only information you have, though, are five sequences of ten flips each. And, you don’t even know which coin is responsible for which sequence!

Okay, let’s write some code. I’ll be working in Python for this example.

rolls = [ "HTTTHHTHTH",
"HHHHTHHHHH",
"HTHHHHHTHH",
"HTHTTTHHTT",
"THHHTHHHTH" ]

Our coins are A and B. Let’s make our initial guess that coin A comes up heads 60% of the time, and that coin B is fair. The tradition is that the unknown variables are called Theta, so:

theta = { "A":.6, "B":.5 }

We need a method to calculate the likelihood of a sequence given a weight. That’s straightforward binomial probability:

def likelihoodOfRoll( roll, prob ):
n = len(roll)
return pow( prob, numHeads ) * pow( 1-prob, n - numHeads )

Now what? One easy way to a solution would be to figure out which sequences were most likely to come from which coins, then credit all the heads and tails from those sequences to those coins. Using that method, rolls 1,2, and 4 are assigned to coin A, and rolls 0 and 3 are assigned to B. 24 of the 30 flips assigned to coin A are heads, and 9 of 20 flips for coin B. So with that strategy, our new value for theta becomes { “A”:.8, “B”:.45}.

But we can do better. Rather than assign an entire sequence to a single coin, we’ll use the calculated probabilities to weight the sequence. So – and this is a tricky bit – we have to compare the calculated probabilities to each other rather than looking at them as absolutes. So if one coin has a probability of 20% and the other has a probability of 30%, we’ll assign them weights of .4 and .6, respectively. Here’s how:

for roll in rolls:
s = sum( likelihoodOfRoll( roll, k) for k in theta.values())
for hidden in theta.keys():
a = likelihoodOfRoll( roll, theta[hidden])
aChance = a / s

I didn’t figure out that part until I came across a StackExchange question about it. How does expectation maximization work?

The variable aChance holds the weighted probability of the sequence coming from the coin. Next, we’ll assign that percentage of the heads in the sequence to that coin (We’ll keep a couple of dictionaries to hold the information):

headSum[hidden] += roll.count( "H" ) * aChance
tailSum[hidden] += roll.count( "T" ) * aChance

This is all we need to calculate new values for theta.

for hidden in theta.keys():

Iterate the whole thing a few times, or a whole lot of times, and eventually it will converge to values. Not necessarily the best values – it may end up in a local maximum.  But some values.

That’s all there is to it. The code is a lot simpler to understand than the math, I think. You can download the code from GitHub.

posted @ Friday, March 29, 2013 12:35 AM | Feedback (0)
Monday, November 1, 2010
Reduce in parallel

In Part I I put up a couple of simple algorithms to extract words from a string.  In Part II I showed some classes that remove the requirement that the characters be processed in order.

The final step is to reduce – to decide what order, in fact, we’re going to process those characters in.

The obvious solution, I would have thought, would have been the old divide-and-conquer algorithm:

public IWordState Reduce(IEnumerable<IWordState> set, int len)

{

if (len == 0)

return new Chunk("");

if (len == 1)

return set.First();

else

{

int pivot = len / 2;

var firstHalf = Reduce(set.Take(pivot).ToList(), pivot);

var secondHalf = Reduce(set.Skip(pivot).ToList(), pivot + len % 2);

IWordState result = firstHalf.Combine(secondHalf);

return result;

}

}

This isn't too bad.  On my machine it takes about five seconds to run this on my standard data set, the text of Moby Dick.  In my implementation the intent is very clear, too – but I suspect I’m costing myself quite a bit in the usage of LINQ.  Notice, for example, that firstHalf and secondHalf have ToList() calls at the end.  Without these, the function takes much longer, but I’m sure there’s a balance to strike in there somewhere.

Here’s another idea:

private const int INT_PARALLEL_CHUNKS = 100;

public override IEnumerable<string> Split(string s)

{

var subsets = Reduce(Map(s)).ToList();

while (subsets.Count > INT_PARALLEL_CHUNKS)

{

subsets = Reduce(subsets).ToList();

}

if (subsets.Any())

{

var final = subsets.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor));

return final.Finalize();

}

return new List<string>();

}

public IEnumerable<IWordState> Reduce(IEnumerable<IWordState> set)

{

var groups = set.Select((ws, index) => new { ws, index })

.GroupBy(g => g.index / INT_PARALLEL_CHUNKS, i => i.ws);

return groups.AsParallel().AsOrdered()

.Select(batch => batch.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor)));

}

}

In this implementation, we split the list into groups of 100, and run an aggregate method on each group, and repeat until our list is under length of 100.  Then we aggregate the remaining WordStates together.

This is a little better, perhaps in the neighborhood of 1500 milliseconds.  It still makes a lot of ToList() calls.  And the fact is, on my machine, these are both way slower than a simple serial processing algorithm, which takes around 300 milliseconds.  Still, the addition of state on each character should make this algorithm scale a lot better.  It wouldn’t be hard to farm out the processing to multiple machines, for example.

I hope you’ve picked up some tips on combining items in a list, without requiring a single state object to hold the aggregation.  There are a lot af applications of this technique, and will be even more once .Net 4 comes into common usage.  Let me know if you find applications in your own code for it!

posted @ Monday, November 1, 2010 7:37 PM | Feedback (0)
Thursday, October 21, 2010
Extracting words from a string

I still plan on writing more on scheduling algorithms, but I was at Strange Loop last week and listened to Guy Steele talking about parallel algorithms.  He’s working in a language he called Fortress, but certainly the algorithms could be implemented in C# as well.

The problem Steele was working on to demonstrate a parallel algorithm is splitting a string into words.  For simplicity, we’ll assume that words are only separated by spaces.  Here are some sample tests so you can see the pattern.

void Check(Splitter splitter)

{

Assert.Equal(new List<string> { "This", "is", "a", "sample" }, splitter.Split("This is a sample").ToList());

Assert.Equal(new List<string> { "Here", "is", "another", "sample" }, splitter.Split(" Here is another sample ").ToList());

Assert.Equal(new List<string> { "JustOneWord" }, splitter.Split("JustOneWord").ToList());

Assert.Empty(splitter.Split(" "));

Assert.Empty(splitter.Split(""));

Assert.Equal(new List<string> { "Here", "is", "a", "sesquipedalian", "string", "of", "words" }, splitter.Split("Here is a sesquipedalian string of words").ToList());

}

In a typical implementation, it’s not too complicated to do this in a sequential way.  Start with an empty list.  Iterate over each character, building a word along the way, and when you hit a space, add the word to the list and start over.  Maybe something like this:

public abstract class Splitter

{

public abstract IEnumerable<string> Split(string stringToSplit);

}

public class NormalSplitter : Splitter

{

public override IEnumerable<string> Split(string ss)

{

var result = new List<string>();

string word = "";

for (int i = 0; i < ss.Length; ++i)

{

if (ss[i] == ' ')

{

if (!string.IsNullOrEmpty(word))

{

word = "";

}

}

else

word += ss[i];

}

if (!string.IsNullOrEmpty(word))

return result;

}

}

Of course, I’m not big on For loops.  Linq’s Aggregate function is more or less the same as the Reduce side of the MapReduce algorithm, or some of the Folding methods you see in other languages.  How would we do this using Aggregate?

Instead of the For loop, we give the Aggregate algorithm the seed, or result it should give back for an empty string, and a lambda method that tells it how to combine a list of words, with an additional character, like this:

string word = "";

var words = ss.AsEnumerable().Aggregate(new List<string>(), (List<string> result, char c) =>

The lambda method more or less is the same as the inside of the normal splitter, above:

if (c == ' ')

{

if (!string.IsNullOrEmpty(word))

{

word = "";

}

}

else

word += c;

return result;

How do they compare?  They’re pretty close to each other.  I handed each one a string consisting of the text of Moby Dick and each one was able to process it in about 300 milliseconds.  Sometimes one is faster, sometimes the other.

Now, how would you break this problem down into chunks that multiple processors can work on?  It’s tricky.  We have a list of strings that we’re holding on to, we have a word that we’re holding on to, and as long as we have those two things, we’re stuck iterating over the characters one by one so we can update our stuff in the correct order.  If we try to process two characters simultaneously, the ordering of our words, or the words themselves, can be jumbled.

What do we do?  I’ll try to answer that question next time.

posted @ Thursday, October 21, 2010 7:33 PM | Feedback (22)
Thursday, September 9, 2010
Schedule constraints

In Part 1 we set up a very simple scheduling model.

In a more realistic schedule, of course, some jobs have to wait until other jobs are finished.  The guy putting the doors on the car can’t get started on his job until the person putting in the engine is finished.   Let’s set up a class modelling a constraint that says a job can’t start until another job is finished.

public class UniformConstraint

{

Sometimes also there’s a lag between the first job finishing and the second job starting.  That can be expressed as an amount of time, but it can also be expressed in iterations – say job 2 can’t start until job 1 has run twice.  We’ll keep track of both of those.

We’ll add a list of constraints to our Job class.  Then we’ll need to determine, for a given time and iteration, is a given constraint blocking a job from running?

public bool Blocker(int iteration, int time)

{

int triggerIteration = iteration + Iterations;

if (triggerIteration >= BlockingJob.CompletedIterations())

return true;

return BlockingJob.StartTime(triggerIteration) + BlockingJob.processingTime + Latency > time;

}

(There’s a constructor too, but that pretty much rounds out the Constraint class.  Now over to our Job class.  For a job and a specific time, we’ll need to know if a job is blocked from running or not:

public bool Blocked(int time)

{

return Blockers.Any(constraint => constraint.Blocker(startingTimes.Count, time));

}

Given all this, we’ll have to rewrite our processor.  We’ll just tick the clock, and at each step, we’ll start any jobs that are free to run.  (There are certainly more efficient ways to do this, but that’s ok.)

public void Process(int numIterations)

{

int time = 0;

while (!IterationComplete(numIterations))

{

foreach (var job in Unblocked(time).Where(job => job.IsIdle(time)))

{

job.Start(time);

}

time++;

}

}

That’s more or less it.  Here’s a simple test.  It shows that if our second job is blocked by our first job, the second job is blocked at time 2, but it’s free to run again at time 7 when job 1 is complete.

[Fact]

public void Constraint_blocks_job2_until_job1_completes()

{

_jobs.Constrain( new UniformConstraint( _jobs, 0, 0 ));

_jobs.Start(0);

Assert.True(_jobs.Blocked(2));

Assert.False(_jobs.Blocked(7));

}

Here’s a more complex example:  A part is manufactured successively in two shops.  The first shop can handle at most two parts at the same time, while the second can handle three parts.  The first shop has three machines with processing times of 1,2, and 2.  The second shop has two machines with processing times 2 and 3.  Finally, a part takes a time of 6 to be sent from the first shop to the second.

Here’s a test for that.

[Fact]

public void Complex_example()

{

_jobs = new List<Job>

{

new Job{ processingTime = 1 },

new Job{ processingTime = 2 },

new Job{ processingTime = 2 },

new Job{ processingTime = 2 },

new Job{ processingTime = 3 }

};

_jobs.Constrain(new UniformConstraint(_jobs, 0, 2));

_jobs.Constrain(new UniformConstraint(_jobs, 0, 3));

_jobs.Constrain(new UniformConstraint(_jobs, 0, 0));

_jobs.Constrain(new UniformConstraint(_jobs, 0, 0));

_jobs.Constrain(new UniformConstraint(_jobs, 6, 0));

_jobs.Constrain(new UniformConstraint(_jobs, 0, 0));

var sched = new CyclicSchedule(_jobs);

sched.Process(3);

Assert.Equal(22m, sched.IterationCompletionTime(0));

Assert.Equal(24m, sched.IterationCompletionTime(1));

Assert.Equal(26m, sched.IterationCompletionTime(2));

}

Interesting that this line has a startup time of 20, but after that it pumps out parts every 2 ticks.  Pretty efficient!

One other thing I’d like to note – it’s perfectly valid to have a negative value for a latency.  This would indicate that, for example, job 3 can start no later than 10 ticks after job 1.  This isn’t relevant to our processing model but leads to some interesting restrictions on the schedules.  We’ll look at that more next time.   This code is available on GitHub.

posted @ Thursday, September 9, 2010 10:03 PM | Feedback (1)
Sunday, August 22, 2010
A Cyclic Scheduling Model

Let’s consider a scheduling problem that you might run across in a factory.  Basically, you have a widget that has to be processed in several stations, in order.  So it goes from the first station, to the second station, and on out until it’s complete.  But of course, once the first station is done working on the first widget, it can get started on its second widget.  Once the second station is done with the first widget, if the first station is done with the second widget, it can work on that one – and in the meantime the first station is free to start working on its third widget.

We make the assumption that there are an infinite, or at least extremely large, number of widgets ready to come through the factory, so we can always think about when we can start the next widget, and not have to try to time it so the last widget is done efficiently.

This isn’t just a factory floor problem, either – consider a CPU that can compute several instructions in parallel, with each instruction consisting of a 7-stage pipeline.  There’s certainly no shortage of instructions coming through the line, so schedulers have to try to determine how to minimize the average time through the pipeline, and certainly there are always more instructions to be processed.

So how will we model this?  Let’s start by calling our stations, or stages, jobs, and for now we’ll just assume that the processing time is constant each time the job is run:

public class Job

{

public int processingTime;

}

Our scheduler, fundamentally, is just a list of jobs.  For each job, we’ll keep track of the starting time on each iteration.

public class CyclicSchedule

{

List<Job> _jobs;

Dictionary<Job, List<int>> _startingTimes;

public CyclicSchedule(List<Job> jobs)

{

_jobs = jobs;

_startingTimes = _jobs.ToDictionary(j => j, j => new List<int>());

}

}

We’ll use another simplifying assumption to get us started: we’ll assume that all the jobs can bang on our widget at the same time.  So for example, if we have three jobs that each take time 3 to process, we’ll see that each iteration completes in time 3.  Simple, right?

[Fact]

public void Three_jobs_of_time_three_process_evenly()

{

_jobs = new List<Job> {

new Job { processingTime = 3 },

new Job { processingTime = 3 },

new Job { processingTime = 3 }

};

var sched = new CyclicSchedule(_jobs);

sched.Process(3);

Assert.Equal(3m, sched.IterationCompletionTime(0));

Assert.Equal(3m, sched.IterationCompletionTime(1));

}

Again for now, each of our jobs simply runs flat out – when it finishes the first widget, it immediately starts on the next one – it’s 100% efficient.  This makes the task of creating a simulator very simple: for each iteration, the start time for a job is just the previous start time, plus the time it takes to process:

public void Process(int numIterations)

{

// each job starts at 0

for (int i = 1; i < numIterations; ++i)

{

}

}

Now we can measure some interesting things about our schedule.  For example, the IterationCompletionTime.  This does not refer to the time that the iteration completed, but the total time it took to complete.  As we see in the test above, both iteration 0 and iteration 1 took time of 3 to complete, as each of the three jobs takes time of 3.

But note – if our jobs took different amounts of time – say 7, 9, and 2, the iteration completion time is not just the average time for those three jobs.  What we’re trying to simulate is the amount of time that our raw materials – say a hunk of steel – take to be processed.  The raw materials come in as soon as an iteration starts.

So given this example, job 3 starts processing the second hunk of steel at time 2.  But job 2 can’t start looking at the second hunk until time 9, and it won’t be done until time 18.  So the IterationCompletionTime for iteration 2 = 18 – 2 = 16.  And we can see that iteration 3 will start at 4 and end at 27, and pretty soon you’ve got big piles of half-processed steel sitting around the factory.  Formally, the limit of the IterationCompletionTime as k approaches infinity, is infinity.  That’s known as an “unstable schedule”, not surprisingly.

Here’s one way to calculate that:

public decimal IterationCompletionTime(int i)

{

var values = _startingTimes.Select(x => new { job = x.Key, starttime = x.Value[i], endtime = x.Value[i] + x.Key.processingTime });

return values.Max(val => val.endtime) - values.Min(val => val.starttime);

}

One more statistic that might be of interest: the Average Cycle Time of a job.  The Cycle Time of a job is the time difference between starting one iteration and starting the next, so obviously, the average cycle time is the average of all those across all the iterations.  Since for our model so far, every job can start on the next iteraton as soon as it’s done with the current one, the average cycle time is equal to the processing time.  But here’s how you would calculate it anyway:

public decimal CycleTime(int jobNum)

{

var starts = _startingTimes[_jobs[jobNum]];

return (decimal)Enumerable.Range(1, starts.Count - 1)

.Select(i => starts[i] - starts[i - 1])

.Average();

}

Next time, we’ll model some more complex scenarios.  Code is available on GitHub.

posted @ Sunday, August 22, 2010 4:35 PM | Feedback (0)
Monday, May 31, 2010
An approximate solution for the Subset Sum problem

We described Greedy in Part 1.  It’s very fast and often quite accurate, but not always accurate enough.  So how can we improve on it?

Remember, we can get perfect accuracy by enumerating all of the possible subsets of the numbers, but that number gets out of hand fast.  But, look at this list:

{ 1, 3, 4, 6, 9 }

While this list has 32 distinct subsets, it does not have 32 distinct subset sums.  The subset { 1, 6 } sums to 7, and the subset { 3, 4 } also adds up to 7.  How can we take advantage of this?

### Computing all subsets

The LINQ algorithm I posted to calculate subsets is nice, but we need a more iterative way to generate the subsets so we can pole at them in between iterations.  Here’s another solution:

public override List<int> Find(List<int> list, int sum)

{

var subsets = new List<IEnumerable<int>> { new List<int>() };

for (int i = 0; i < list.Count; i++)

{

var T = subsets.Select(y => y.Concat(new[] { list[i] }));

var result = subsets.Find(s => s.Sum() == sum);

if (result != null)

return result.ToList();

}

return null;

}

The idea is to start with the subset with no items, then on each iteration, we take our current list of subsets and add the next number in the list to each subset, to create an additional set of subsets.  So if you had the list { 1, 2, 3 }, after the first iteration you’d just have the subset { 1 }.  On the next iteration, you’d have {2}, and {1,2} as well as the original { 1 }.  On the third iteration, you’d add 3 to all of those sets, so the new sets would be {3}, {1,3}, {2,3}, and {1,2,3}.

### Getting rid of duplicates

Let’s take the theory even further and instead of just eliminating subsets that sum to the same amount, we’ll get rid of them even if they’re just close to each other.

We’ll start by getting the subsets ordered by sum:

public override List<int> Find(List<int> list, int sum)

{

var subsets = new List<IEnumerable<int>> { new List<int>() };

for (int i = 0; i < list.Count; i++)

{

var T = subsets.Select(y => y.Concat(new[] { list[i] }));

var U = subsets.Union(T).OrderBy(k => k.Sum()).ToList();

Next, we’ll eliminate all subsets that sum to within a range of a delta value.  We’ll also eliminate any subsets that go over the target sum.

subsets = RemoveCloseSums(U, _delta).Where( s => s.Sum() <= sum).ToList();

Since we’re providing a delta anyway, we’ll go ahead and retrun any subset that gets us delta close to our target.  Note that this is what Greedy is doing as well, although in the case of Greedy it just stops looking when it’s out of ideas.

var result = subsets.Find(s => s.Sum() <= sum && s.Sum() > (1 - _delta) * sum);

if (result != null)

return result.ToList();

}

return null;

}

}

And that’s it.

### Is it any good?

Well, yes and no.  Greedy does a pretty bang-up job in most cases that I’ve tried.  If you have a lot of little jobs that you can poke in near the end after the big ones are finished, Greedy will find those and put them in when appropriate.  But you can fool it.  One of the test lists I used was 100 jobs, with times ranging from 200 to 5000, and Greedy had a hard time being too precise with its numbers, while the approximating algorithm was able to come up with a set just as quickly.  But paring off sets that were too large helped a lot with that as well.  Greedy didn’t much care if my target value was 5000 or 50000, it chose its solution and returned.  Approximating took a lot longer to compute the 50000 set.

Still, I didn’t put a lot of effort into optimizing the C# code.  Look at all those Sum() calls up there – I’m guessing the compiler isn’t managing to optimize those away, and if not, they’re really going to hurt performance.  The next step in speeding up the algorithm would probably be to eliminate those, and then you might be able to compete with Greedy again.

But that is left as an exercise for the reader.  Once again, the code is available on GitHub.  Happy coding!

posted @ Monday, May 31, 2010 9:04 AM | Feedback (0)