Sunday, June 30, 2013
Suffix Arrays

A suffix array of a string is a list of suffixes of that string. For example, if the string is “ABCD”, the suffixes are “ABCD”, “BCD”, “CD”, and “D”. Not very interesting on its own, but what is useful is that you can track them by simply keeping track of the index of their starting character – 0,1,2, and 3. And that means you can fiddle with their order. You can sort them. In “ABCD” they are already sorted, but consider a string like “mississippi”. In sorting those suffixes, the final “i” is sorted to the front, while the first “s” is at the back. The order is

10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2

(For some idea of the usefulness of suffix arrays, see my suffix arrays blog post on my personal blog.)

Here’s an implementation of a suffix array function in Python:

```
def get_suffix_array(str):
    return sorted(range(len(str)), key=lambda i: str[i:])
```

Simple enough, for short strings. But there are going to be problems as the string length increases. That key function is probably making a copy of the string slice to pass to the sorted function, and that is going to take a huge amount of space for a long string. I experimented a little by generating a random, 100,000 character string and calling this method on it; it used more than three gigabytes of memory and ran for almost four minutes before returning the suffix array. (Just for fun, I loaded up the million characters of Moby Dick and ran it on that. The whole operating system froze up. Eventually I got bored and hard-rebooted the machine.)

So for long strings, space becomes a real consideration. Python also supports a cmp argument for sorted, which takes two values and returns an integer indicating which should sort first, but that argument is deprecated (and gone in Python 3). That’s unfortunate, since it would make this problem more tractable; still, we can do better, both in total memory usage and in speed. Python’s generic sorted function has to handle any kind of data in O(n log n) comparisons, but since we’re only handling one kind of data, we can improve on that.

Here’s an idea that was suggested by Udi Manber and Gene Myers back in the 1990s. It’s a variant on bucket sort, a sorting algorithm that operates by first looking only at the first character of each string, and putting those that share the same initial character into a bucket. Then it looks within each bucket at the second character of each string, and so on. This is a handy paradigm for us, since we don’t even have different strings to work on; we only have indices into our one string.

Here’s the idea: we’ll put each suffix into a bucket using its first character, so all the suffixes that start with ‘A’ will be in one bucket, and all the suffixes that start with ‘B’ will be in another, and so on. Obviously all the suffixes in the ‘A’ bucket will have a sort order before those in the ‘B’ bucket or ‘C’ bucket. It looks something like this:

```
from collections import defaultdict

def sort_bucket(str, bucket):
    d = defaultdict(list)
    for i in bucket:
        key = str[i:i+1]
        d[key].append(i)

def suffix_array_ManberMyers(str):
    return sort_bucket(str, (i for i in range(len(str))))
```

sort_bucket takes our string and a list of indices to examine. Our initial call just passes in every index in the string – all of the suffixes as one big bucket.

The dictionary d, then, maps initial characters to suffix indices. Now what we need to do is call sort_bucket recursively to go through each bucket and sort the strings by second position, third position, etc. To make the whole thing a little more efficient, rather than indexing by one additional character each time, we’ll double the number of characters: the first recursive call will sort on strings of length two, the second on length four, etc. We need to add an order parameter to indicate the number of characters we want to sort on. Add code to return the sorted array, and this is our final function:

```
def sort_bucket(str, bucket, order):
    d = defaultdict(list)
    for i in bucket:
        key = str[i:i+order]
        d[key].append(i)
    result = []
    for k, v in sorted(d.iteritems()):  # d.items() on Python 3
        if len(v) > 1:
            result += sort_bucket(str, v, order*2)
        else:
            result.append(v[0])
    return result

def suffix_array_ManberMyers(str):
    return sort_bucket(str, (i for i in range(len(str))), 1)
```

Notice that sort_bucket now returns the list of sorted suffixes. If a bucket contains a single element, that is appended to the result; otherwise the method is called recursively and the sorted bucket is added to the result.
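As a sanity check, here is the final function restated as a self-contained Python 3 snippet (str renamed to s to avoid shadowing the builtin, and iteritems swapped for items); it reproduces the “mississippi” ordering from the top of the post:

```python
from collections import defaultdict

def sort_bucket(s, bucket, order):
    # Group this bucket's suffixes by their next `order` characters.
    d = defaultdict(list)
    for i in bucket:
        d[s[i:i + order]].append(i)
    result = []
    for k, v in sorted(d.items()):
        if len(v) > 1:
            # Ties: recurse with double the prefix length.
            result += sort_bucket(s, v, order * 2)
        else:
            result.append(v[0])
    return result

def suffix_array_ManberMyers(s):
    return sort_bucket(s, range(len(s)), 1)

print(suffix_array_ManberMyers("mississippi"))
# [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```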

So how does it work? Quite nicely, it turns out. If you want a suffix array for “mississippi”, you’re better off with the initial implementation – it’s simpler and runs quickly. But for longer strings… on my computer the tipping point appeared to be around 5000 characters. For a 10,000 character string, Manber-Myers was about three times as fast. And at 100,000 characters – the run that took almost four minutes – Manber-Myers takes less than a second. And for Moby Dick? Manber-Myers was still under six seconds to run, and used about 50 megabytes worth of memory.

So it’s a pretty successful little algorithm.  You can download this Suffix Array code on my GitHub page.

Now, 50M on a modern computer isn’t much. But that still represents roughly 40 times the size of the book. If we want to scale up again, from the size of the book (1,200,000 bytes) to the size of a human genome (3,000,000,000 bytes), are we going to need a machine with well over 100 gigabytes of memory?

I sure hope not. I’ll try out some more ideas another time.

posted @ Sunday, June 30, 2013 7:59 PM | Feedback (0)
Friday, March 29, 2013
The Expectation Maximization Algorithm

Conceptually, the expectation maximization algorithm is simple: you take a guess at the values you want to find, then run your calculations as though they were correct. Then, use those results to improve your guess.

What is the expectation maximization algorithm?

That article does a good job of explaining the algorithm in more detail, but I couldn’t make sense of the example. So I wrote some code to help me figure it out.

The example goes like this: you have two coins. They’re not exactly balanced, and you want to know how heavily weighted each one is. The only information you have, though, is five sequences of ten flips each. And you don’t even know which coin is responsible for which sequence!

Okay, let’s write some code. I’ll be working in Python for this example.

```
rolls = [ "HTTTHHTHTH",
          "HHHHTHHHHH",
          "HTHHHHHTHH",
          "HTHTTTHHTT",
          "THHHTHHHTH" ]
```

Our coins are A and B. Let’s make our initial guess that coin A comes up heads 60% of the time, and that coin B is fair. The tradition is that the unknown variables are called Theta, so:

```
theta = { "A":.6, "B":.5 }
```

We need a method to calculate the likelihood of a sequence given a weight. That’s straightforward binomial probability:

```
def likelihoodOfRoll( roll, prob ):
    numHeads = roll.count( "H" )
    n = len(roll)
    return pow( prob, numHeads ) * pow( 1-prob, n - numHeads )
```

Now what? One easy way to a solution would be to figure out which sequences were most likely to come from which coins, then credit all the heads and tails from those sequences to those coins. Using that method, rolls 1,2, and 4 are assigned to coin A, and rolls 0 and 3 are assigned to B. 24 of the 30 flips assigned to coin A are heads, and 9 of 20 flips for coin B. So with that strategy, our new value for theta becomes { “A”:.8, “B”:.45}.

But we can do better. Rather than assign an entire sequence to a single coin, we’ll use the calculated probabilities to weight the sequence. So – and this is a tricky bit – we have to compare the calculated probabilities to each other rather than looking at them as absolutes. So if one coin has a probability of 20% and the other has a probability of 30%, we’ll assign them weights of .4 and .6, respectively. Here’s how:

```
for roll in rolls:
    s = sum( likelihoodOfRoll( roll, k ) for k in theta.values() )
    for hidden in theta.keys():
        a = likelihoodOfRoll( roll, theta[hidden] )
        aChance = a / s
```

I didn’t figure out that part until I came across a StackExchange question about it. How does expectation maximization work?

The variable aChance holds the weighted probability of the sequence coming from the coin. Next, we’ll assign that percentage of the heads in the sequence to that coin (We’ll keep a couple of dictionaries to hold the information):

```
        headSum[hidden] += roll.count( "H" ) * aChance
        tailSum[hidden] += roll.count( "T" ) * aChance
```

This is all we need to calculate new values for theta.

```
for hidden in theta.keys():
    theta[hidden] = headSum[hidden] / (headSum[hidden] + tailSum[hidden])
```

Iterate the whole thing a few times, or a whole lot of times, and eventually it will converge to values. Not necessarily the best values – it may end up in a local maximum.  But some values.
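Assembled into one runnable sketch (with headSum and tailSum as defaultdicts, and the likelihood function from above), twenty iterations from the same initial guess settle near 0.80 for coin A and 0.52 for coin B, in line with the worked example in the article linked above:

```python
from collections import defaultdict

rolls = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH",
         "HTHTTTHHTT", "THHHTHHHTH"]

def likelihoodOfRoll(roll, prob):
    numHeads = roll.count("H")
    n = len(roll)
    return pow(prob, numHeads) * pow(1 - prob, n - numHeads)

theta = {"A": 0.6, "B": 0.5}
for _ in range(20):
    headSum = defaultdict(float)
    tailSum = defaultdict(float)
    for roll in rolls:
        # E-step: weight each sequence by how likely each coin produced it.
        s = sum(likelihoodOfRoll(roll, k) for k in theta.values())
        for hidden in theta:
            aChance = likelihoodOfRoll(roll, theta[hidden]) / s
            headSum[hidden] += roll.count("H") * aChance
            tailSum[hidden] += roll.count("T") * aChance
    # M-step: re-estimate each coin's heads probability.
    theta = {h: headSum[h] / (headSum[h] + tailSum[h]) for h in theta}

print(round(theta["A"], 2), round(theta["B"], 2))
```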

That’s all there is to it. The code is a lot simpler to understand than the math, I think. You can download the code from GitHub.

posted @ Friday, March 29, 2013 12:35 AM | Feedback (0)
Monday, November 1, 2010
Reduce in parallel

In Part I I put up a couple of simple algorithms to extract words from a string.  In Part II I showed some classes that remove the requirement that the characters be processed in order.

The final step is to reduce – to decide what order, in fact, we’re going to process those characters in.

The obvious solution, I would have thought, would have been the old divide-and-conquer algorithm:

```
public IWordState Reduce(IEnumerable<IWordState> set, int len)
{
    if (len == 0)
        return new Chunk("");
    if (len == 1)
        return set.First();
    else
    {
        int pivot = len / 2;
        var firstHalf = Reduce(set.Take(pivot).ToList(), pivot);
        var secondHalf = Reduce(set.Skip(pivot).ToList(), pivot + len % 2);
        IWordState result = firstHalf.Combine(secondHalf);
        return result;
    }
}
```

This isn't too bad.  On my machine it takes about five seconds to run this on my standard data set, the text of Moby Dick.  In my implementation the intent is very clear, too – but I suspect I’m costing myself quite a bit in the usage of LINQ.  Notice, for example, that firstHalf and secondHalf have ToList() calls at the end.  Without these, the function takes much longer, but I’m sure there’s a balance to strike in there somewhere.

Here’s another idea:

```
private const int INT_PARALLEL_CHUNKS = 100;

public override IEnumerable<string> Split(string s)
{
    var subsets = Reduce(Map(s)).ToList();
    while (subsets.Count > INT_PARALLEL_CHUNKS)
    {
        subsets = Reduce(subsets).ToList();
    }
    if (subsets.Any())
    {
        var final = subsets.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor));
        return final.Finalize();
    }
    return new List<string>();
}

public IEnumerable<IWordState> Reduce(IEnumerable<IWordState> set)
{
    var groups = set.Select((ws, index) => new { ws, index })
                    .GroupBy(g => g.index / INT_PARALLEL_CHUNKS, i => i.ws);
    return groups.AsParallel().AsOrdered()
                 .Select(batch => batch.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor)));
}
```

In this implementation, we split the list into groups of 100, and run an aggregate method on each group, and repeat until our list is under length of 100.  Then we aggregate the remaining WordStates together.
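The chunked reduction is language-independent; here is a minimal Python sketch of the same idea (my illustration, not the post’s C#), with addition standing in for Combine:

```python
from functools import reduce

CHUNK = 100

def reduce_in_chunks(items, combine):
    # Combine runs of CHUNK items per pass; the batches within a pass are
    # independent of each other, so they could be handed to separate workers.
    while len(items) > CHUNK:
        items = [reduce(combine, items[i:i + CHUNK])
                 for i in range(0, len(items), CHUNK)]
    # Fewer than CHUNK partial results left: fold them directly.
    return reduce(combine, items)

print(reduce_in_chunks(list(range(1000)), lambda a, b: a + b))
# 499500
```

Because each pass only combines adjacent runs, the result is identical to a straight left fold whenever the combine operation is associative, which is exactly the property the IWordState classes were built to provide.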

This is a little better, perhaps in the neighborhood of 1500 milliseconds.  It still makes a lot of ToList() calls.  And the fact is, on my machine, these are both way slower than a simple serial processing algorithm, which takes around 300 milliseconds.  Still, the addition of state on each character should make this algorithm scale a lot better.  It wouldn’t be hard to farm out the processing to multiple machines, for example.

I hope you’ve picked up some tips on combining items in a list without requiring a single state object to hold the aggregation.  There are a lot of applications of this technique, and there will be even more once .Net 4 comes into common usage.  Let me know if you find applications in your own code for it!

posted @ Monday, November 1, 2010 7:37 PM | Feedback (1)
Thursday, October 21, 2010
Extracting words from a string

I still plan on writing more on scheduling algorithms, but I was at Strange Loop last week and listened to Guy Steele talking about parallel algorithms.  He’s working in a language he called Fortress, but certainly the algorithms could be implemented in C# as well.

The problem Steele was working on to demonstrate a parallel algorithm is splitting a string into words.  For simplicity, we’ll assume that words are only separated by spaces.  Here are some sample tests so you can see the pattern.

```
void Check(Splitter splitter)
{
    Assert.Equal(new List<string> { "This", "is", "a", "sample" }, splitter.Split("This is a sample").ToList());
    Assert.Equal(new List<string> { "Here", "is", "another", "sample" }, splitter.Split(" Here is another sample ").ToList());
    Assert.Equal(new List<string> { "JustOneWord" }, splitter.Split("JustOneWord").ToList());
    Assert.Empty(splitter.Split(" "));
    Assert.Empty(splitter.Split(""));
    Assert.Equal(new List<string> { "Here", "is", "a", "sesquipedalian", "string", "of", "words" }, splitter.Split("Here is a sesquipedalian string of words").ToList());
}
```

In a typical implementation, it’s not too complicated to do this in a sequential way.  Start with an empty list.  Iterate over each character, building a word along the way, and when you hit a space, add the word to the list and start over.  Maybe something like this:

```
public abstract class Splitter
{
    public abstract IEnumerable<string> Split(string stringToSplit);
}

public class NormalSplitter : Splitter
{
    public override IEnumerable<string> Split(string ss)
    {
        var result = new List<string>();
        string word = "";
        for (int i = 0; i < ss.Length; ++i)
        {
            if (ss[i] == ' ')
            {
                if (!string.IsNullOrEmpty(word))
                {
                    result.Add(word);
                    word = "";
                }
            }
            else
                word += ss[i];
        }
        if (!string.IsNullOrEmpty(word))
            result.Add(word);
        return result;
    }
}
```

Of course, I’m not big on For loops.  Linq’s Aggregate function is more or less the same as the Reduce side of the MapReduce algorithm, or some of the Folding methods you see in other languages.  How would we do this using Aggregate?

Instead of the For loop, we give the Aggregate algorithm the seed, or result it should give back for an empty string, and a lambda method that tells it how to combine a list of words, with an additional character, like this:

```
string word = "";

var words = ss.AsEnumerable().Aggregate(new List<string>(), (List<string> result, char c) =>
```

The lambda method more or less is the same as the inside of the normal splitter, above:

```
    if (c == ' ')
    {
        if (!string.IsNullOrEmpty(word))
        {
            result.Add(word);
            word = "";
        }
    }
    else
        word += c;

    return result;
```

How do they compare?  They’re pretty close to each other.  I handed each one a string consisting of the text of Moby Dick and each one was able to process it in about 300 milliseconds.  Sometimes one is faster, sometimes the other.
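For what it’s worth, the fold formulation translates almost directly to Python’s functools.reduce; here is a sketch of mine (not code from the post) that threads the word-in-progress through the accumulator instead of a closed-over variable:

```python
from functools import reduce

def split_words(s):
    # Accumulator state: (completed words, word under construction).
    def step(state, c):
        words, word = state
        if c == ' ':
            # A space finishes the current word, if there is one.
            return (words + [word], "") if word else (words, word)
        return (words, word + c)
    words, word = reduce(step, s, ([], ""))
    # Don't forget a trailing word with no space after it.
    return words + [word] if word else words

print(split_words(" Here is another sample "))
# ['Here', 'is', 'another', 'sample']
```

Carrying the partial word inside the accumulator, rather than in an outer variable, keeps the step function self-contained, which matters once you start thinking about running pieces of the fold independently.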

Now, how would you break this problem down into chunks that multiple processors can work on?  It’s tricky.  We have a list of strings that we’re holding on to, we have a word that we’re holding on to, and as long as we have those two things, we’re stuck iterating over the characters one by one so we can update our stuff in the correct order.  If we try to process two characters simultaneously, the ordering of our words, or the words themselves, can be jumbled.

What do we do?  I’ll try to answer that question next time.

posted @ Thursday, October 21, 2010 7:33 PM | Feedback (4)
Thursday, September 9, 2010
Schedule constraints

In Part 1 we set up a very simple scheduling model.

In a more realistic schedule, of course, some jobs have to wait until other jobs are finished.  The guy putting the doors on the car can’t get started on his job until the person putting in the engine is finished.   Let’s set up a class modelling a constraint that says a job can’t start until another job is finished.

```
public class UniformConstraint
{
    public Job BlockingJob;
    public int Latency;      // extra time after the blocking job finishes
    public int Iterations;   // iteration lag before the blocked job may start
```

Sometimes also there’s a lag between the first job finishing and the second job starting.  That can be expressed as an amount of time, but it can also be expressed in iterations – say job 2 can’t start until job 1 has run twice.  We’ll keep track of both of those.

We’ll add a list of constraints to our Job class.  Then we’ll need to determine, for a given time and iteration, is a given constraint blocking a job from running?

```
public bool Blocker(int iteration, int time)
{
    int triggerIteration = iteration + Iterations;
    if (triggerIteration >= BlockingJob.CompletedIterations())
        return true;
    return BlockingJob.StartTime(triggerIteration) + BlockingJob.processingTime + Latency > time;
}
```

(There’s a constructor too, but that pretty much rounds out the Constraint class.)  Now over to our Job class.  For a job and a specific time, we’ll need to know if the job is blocked from running or not:

```
public bool Blocked(int time)
{
    return Blockers.Any(constraint => constraint.Blocker(startingTimes.Count, time));
}
```

Given all this, we’ll have to rewrite our processor.  We’ll just tick the clock, and at each step, we’ll start any jobs that are free to run.  (There are certainly more efficient ways to do this, but that’s ok.)

```
public void Process(int numIterations)
{
    int time = 0;
    while (!IterationComplete(numIterations))
    {
        foreach (var job in Unblocked(time).Where(job => job.IsIdle(time)))
        {
            job.Start(time);
        }
        time++;
    }
}
```
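The tick-the-clock loop can be sketched compactly in Python. This simplified version is my own (plain same-iteration precedence only, with no latency or iteration lag, and a hypothetical simulate helper), but it shows the shape of the simulation:

```python
def simulate(proc_times, blockers, num_iterations):
    # proc_times: job -> processing time
    # blockers: job -> jobs whose same-numbered iteration must finish first
    # Returns, for each job, its start time on every iteration.
    starts = {j: [] for j in proc_times}
    time = 0
    while any(len(starts[j]) < num_iterations for j in proc_times):
        for j in proc_times:
            k = len(starts[j])  # next iteration for this job
            if k >= num_iterations:
                continue
            idle = k == 0 or time >= starts[j][-1] + proc_times[j]
            unblocked = all(len(starts[b]) > k and
                            starts[b][k] + proc_times[b] <= time
                            for b in blockers.get(j, []))
            if idle and unblocked:
                starts[j].append(time)
        time += 1
    return starts

# Job 1 (time 3) waits for job 0 (time 5) each iteration.
print(simulate({0: 5, 1: 3}, {1: [0]}, 2))
# {0: [0, 5], 1: [5, 10]}
```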

That’s more or less it.  Here’s a simple test.  It shows that if our second job is blocked by our first job, the second job is blocked at time 2, but it’s free to run again at time 7 when job 1 is complete.

```
[Fact]
public void Constraint_blocks_job2_until_job1_completes()
{
    _jobs[1].Constrain( new UniformConstraint( _jobs[0], 0, 0 ));
    _jobs[0].Start(0);

    Assert.True(_jobs[1].Blocked(2));
    Assert.False(_jobs[1].Blocked(7));
}
```

Here’s a more complex example:  A part is manufactured successively in two shops.  The first shop can handle at most two parts at the same time, while the second can handle three parts.  The first shop has three machines with processing times of 1,2, and 2.  The second shop has two machines with processing times 2 and 3.  Finally, a part takes a time of 6 to be sent from the first shop to the second.

Here’s a test for that.

```
[Fact]
public void Complex_example()
{
    _jobs = new List<Job>
    {
        new Job { processingTime = 1 },
        new Job { processingTime = 2 },
        new Job { processingTime = 2 },
        new Job { processingTime = 2 },
        new Job { processingTime = 3 }
    };

    _jobs[2].Constrain(new UniformConstraint(_jobs[0], 0, 2));
    _jobs[4].Constrain(new UniformConstraint(_jobs[3], 0, 3));
    _jobs[1].Constrain(new UniformConstraint(_jobs[0], 0, 0));
    _jobs[2].Constrain(new UniformConstraint(_jobs[1], 0, 0));
    _jobs[3].Constrain(new UniformConstraint(_jobs[2], 6, 0));
    _jobs[4].Constrain(new UniformConstraint(_jobs[3], 0, 0));

    var sched = new CyclicSchedule(_jobs);
    sched.Process(3);

    Assert.Equal(22m, sched.IterationCompletionTime(0));
    Assert.Equal(24m, sched.IterationCompletionTime(1));
    Assert.Equal(26m, sched.IterationCompletionTime(2));
}
```

Interesting that this line has a startup time of 20, but after that it pumps out parts every 2 ticks.  Pretty efficient!

One other thing I’d like to note – it’s perfectly valid to have a negative value for a latency.  This would indicate that, for example, job 3 can start no later than 10 ticks after job 1.  This isn’t relevant to our processing model but leads to some interesting restrictions on the schedules.  We’ll look at that more next time.   This code is available on GitHub.

posted @ Thursday, September 9, 2010 10:03 PM | Feedback (1)
Sunday, August 22, 2010
A Cyclic Scheduling Model

Let’s consider a scheduling problem that you might run across in a factory.  Basically, you have a widget that has to be processed in several stations, in order.  So it goes from the first station, to the second station, and on out until it’s complete.  But of course, once the first station is done working on the first widget, it can get started on its second widget.  Once the second station is done with the first widget, if the first station is done with the second widget, it can work on that one – and in the meantime the first station is free to start working on its third widget.

We make the assumption that there are an infinite, or at least extremely large, number of widgets ready to come through the factory, so we can always think about when we can start the next widget, and not have to try to time it so the last widget is done efficiently.

This isn’t just a factory floor problem, either – consider a CPU that can compute several instructions in parallel, with each instruction consisting of a 7-stage pipeline.  There’s certainly no shortage of instructions coming through the line, so schedulers have to try to determine how to minimize the average time through the pipeline, and certainly there are always more instructions to be processed.

So how will we model this?  Let’s start by calling our stations, or stages, jobs, and for now we’ll just assume that the processing time is constant each time the job is run:

```
public class Job
{
    public int processingTime;
}
```

Our scheduler, fundamentally, is just a list of jobs.  For each job, we’ll keep track of the starting time on each iteration.

```
public class CyclicSchedule
{
    List<Job> _jobs;
    Dictionary<Job, List<int>> _startingTimes;

    public CyclicSchedule(List<Job> jobs)
    {
        _jobs = jobs;
        _startingTimes = _jobs.ToDictionary(j => j, j => new List<int>());
    }
}
```

We’ll use another simplifying assumption to get us started: we’ll assume that all the jobs can bang on our widget at the same time.  So for example, if we have three jobs that each take time 3 to process, we’ll see that each iteration completes in time 3.  Simple, right?

```
[Fact]
public void Three_jobs_of_time_three_process_evenly()
{
    _jobs = new List<Job> {
        new Job { processingTime = 3 },
        new Job { processingTime = 3 },
        new Job { processingTime = 3 }
    };

    var sched = new CyclicSchedule(_jobs);
    sched.Process(3);

    Assert.Equal(3m, sched.IterationCompletionTime(0));
    Assert.Equal(3m, sched.IterationCompletionTime(1));
}
```

Again for now, each of our jobs simply runs flat out – when it finishes the first widget, it immediately starts on the next one – it’s 100% efficient.  This makes the task of creating a simulator very simple: for each iteration, the start time for a job is just the previous start time, plus the time it takes to process:

```
public void Process(int numIterations)
{
    // each job starts at 0
    foreach (var job in _jobs)
        _startingTimes[job].Add(0);

    for (int i = 1; i < numIterations; ++i)
    {
        foreach (var job in _jobs)
            _startingTimes[job].Add(_startingTimes[job][i - 1] + job.processingTime);
    }
}
```

Now we can measure some interesting things about our schedule.  For example, the IterationCompletionTime.  This does not refer to the time that the iteration completed, but the total time it took to complete.  As we see in the test above, both iteration 0 and iteration 1 took time of 3 to complete, as each of the three jobs takes time of 3.

But note – if our jobs took different amounts of time – say 7, 9, and 2, the iteration completion time is not just the average time for those three jobs.  What we’re trying to simulate is the amount of time that our raw materials – say a hunk of steel – take to be processed.  The raw materials come in as soon as an iteration starts.

So given this example, job 3 starts processing the second hunk of steel at time 2.  But job 2 can’t start looking at the second hunk until time 9, and it won’t be done until time 18.  So the IterationCompletionTime for iteration 2 = 18 – 2 = 16.  And we can see that iteration 3 will start at 4 and end at 27, and pretty soon you’ve got big piles of half-processed steel sitting around the factory.  Formally, the limit of the IterationCompletionTime as k approaches infinity, is infinity.  That’s known as an “unstable schedule”, not surprisingly.

Here’s one way to calculate that:

```
public decimal IterationCompletionTime(int i)
{
    var values = _startingTimes.Select(x => new { job = x.Key, starttime = x.Value[i], endtime = x.Value[i] + x.Key.processingTime });
    return values.Max(val => val.endtime) - values.Min(val => val.starttime);
}
```
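As a quick check of the arithmetic in the 7, 9, 2 example above, here is the same computation in Python (my sketch; in the flat-out model, job j starts iteration k at k times its processing time):

```python
def iteration_completion_time(times, k):
    # Job j starts iteration k at k * times[j] and ends one processing time later.
    starts = [k * t for t in times]
    ends = [s + t for s, t in zip(starts, times)]
    return max(ends) - min(starts)

print([iteration_completion_time([7, 9, 2], k) for k in range(4)])
# [9, 16, 23, 30]
```

Each iteration takes 7 longer than the last (the gap between the slowest job, 9, and the fastest, 2), which is the unstable schedule described above.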

One more statistic that might be of interest: the Average Cycle Time of a job.  The Cycle Time of a job is the time difference between starting one iteration and starting the next, so the average cycle time is just the average of those differences across all the iterations.  Since in our model so far every job can start on the next iteration as soon as it’s done with the current one, the average cycle time is equal to the processing time.  But here’s how you would calculate it anyway:

```
public decimal CycleTime(int jobNum)
{
    var starts = _startingTimes[_jobs[jobNum]];
    return (decimal)Enumerable.Range(1, starts.Count - 1)
        .Select(i => starts[i] - starts[i - 1])
        .Average();
}
```

Next time, we’ll model some more complex scenarios.  Code is available on GitHub.

posted @ Sunday, August 22, 2010 4:35 PM | Feedback (1)
Monday, May 31, 2010
An approximate solution for the Subset Sum problem

We described Greedy in Part 1.  It’s very fast and often quite accurate, but not always accurate enough.  So how can we improve on it?

Remember, we can get perfect accuracy by enumerating all of the possible subsets of the numbers, but that number gets out of hand fast.  But, look at this list:

{ 1, 3, 4, 6, 9 }

While this list has 32 distinct subsets, it does not have 32 distinct subset sums.  The subset { 1, 6 } sums to 7, and the subset { 3, 4 } also adds up to 7.  How can we take advantage of this?

### Computing all subsets

The LINQ algorithm I posted to calculate subsets is nice, but we need a more iterative way to generate the subsets so we can poke at them in between iterations.  Here’s another solution:

```
public override List<int> Find(List<int> list, int sum)
{
    var subsets = new List<IEnumerable<int>> { new List<int>() };
    for (int i = 0; i < list.Count; i++)
    {
        // Materialize now, so the loop variable i isn't captured lazily.
        var T = subsets.Select(y => y.Concat(new[] { list[i] }).ToList()).ToList();
        subsets.AddRange(T);

        var result = subsets.Find(s => s.Sum() == sum);
        if (result != null)
            return result.ToList();
    }
    return null;
}
```

The idea is to start with the subset with no items, then on each iteration, we take our current list of subsets and add the next number in the list to each subset, to create an additional set of subsets.  So if you had the list { 1, 2, 3 }, after the first iteration you’d just have the subset { 1 }.  On the next iteration, you’d have {2}, and {1,2} as well as the original { 1 }.  On the third iteration, you’d add 3 to all of those sets, so the new sets would be {3}, {1,3}, {2,3}, and {1,2,3}.

### Getting rid of duplicates

Let’s take the theory even further: instead of only eliminating subsets that sum to the same amount, we’ll get rid of them even if their sums are merely close to each other.

We’ll start by getting the subsets ordered by sum:

```
public override List<int> Find(List<int> list, int sum)
{
    var subsets = new List<IEnumerable<int>> { new List<int>() };
    for (int i = 0; i < list.Count; i++)
    {
        var T = subsets.Select(y => y.Concat(new[] { list[i] }));
        var U = subsets.Union(T).OrderBy(k => k.Sum()).ToList();
```

Next, we’ll eliminate all subsets that sum to within a range of a delta value.  We’ll also eliminate any subsets that go over the target sum.

```
        subsets = RemoveCloseSums(U, _delta).Where( s => s.Sum() <= sum ).ToList();
```

Since we’re providing a delta anyway, we’ll go ahead and return any subset that gets us within delta of our target.  Note that this is what Greedy does as well, although in the case of Greedy it just stops looking when it’s out of ideas.

```
        var result = subsets.Find(s => s.Sum() <= sum && s.Sum() > (1 - _delta) * sum);
        if (result != null)
            return result.ToList();
    }

    return null;
}
```

And that’s it.
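The same trimming idea fits in a few lines of Python. This sketch is my own (it tracks only the best achievable sum, not the subset itself): keep one representative out of every cluster of sums within a factor of (1 + delta) of each other, and drop anything over the target.

```python
def approx_subset_sum(nums, target, delta=0.001):
    kept = [0]  # achievable sums we are still tracking
    for x in nums:
        # Extend every kept sum by x, discarding sums over the target.
        merged = sorted(set(kept) | {s + x for s in kept if s + x <= target})
        # Trim: keep a sum only if it exceeds the last kept sum by > delta.
        trimmed = []
        last = -1.0
        for s in merged:
            if s > last * (1 + delta):
                trimmed.append(s)
                last = s
        kept = trimmed
    return max(kept)

nums = [771, 121, 281, 854, 885, 734, 486, 1003, 83, 62]
print(approx_subset_sum(nums, 2640, 0))
# 2640
```

With delta = 0 nothing is trimmed and this degenerates to exact enumeration; with a small positive delta the kept list stays short, at the cost of possibly landing slightly under the target.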

### Is it any good?

Well, yes and no.  Greedy does a pretty bang-up job in most cases that I’ve tried.  If you have a lot of little jobs that you can poke in near the end after the big ones are finished, Greedy will find those and put them in when appropriate.  But you can fool it.  One of the test lists I used was 100 jobs, with times ranging from 200 to 5000, and Greedy had a hard time being too precise with its numbers, while the approximating algorithm was able to come up with a set just as quickly.  But paring off sets that were too large helped a lot with that as well.  Greedy didn’t much care if my target value was 5000 or 50000, it chose its solution and returned.  Approximating took a lot longer to compute the 50000 set.

Still, I didn’t put a lot of effort into optimizing the C# code.  Look at all those Sum() calls up there – I’m guessing the compiler isn’t managing to optimize those away, and if not, they’re really going to hurt performance.  The next step in speeding up the algorithm would probably be to eliminate those, and then you might be able to compete with Greedy again.

But that is left as an exercise for the reader.  Once again, the code is available on GitHub.  Happy coding!

posted @ Monday, May 31, 2010 9:04 AM | Feedback (1)
Sunday, May 2, 2010
The subset sum problem

### Job Scheduling

Suppose you’ve got two processors, and ten jobs to run on them.  You know the amount of time each job needs to complete, and your goal is to minimize the makespan of the jobs (that is, the time it takes to get all the jobs finished).  How would you do it?

It doesn’t take much thought before you hit on an algorithm called greedy.  The Greedy algorithm sorts the jobs by size, then assigns the largest to the first processor, the second largest to the second processor, the third back to the first processor, and so on.  And you know what?  That’s a fine algorithm.  In most circumstances it’ll get the job done for you.

On the other hand, if you’re in the habit of working with an infinite number of really fast processors, you might have said, “I know!  I’ll enumerate every possible way of splitting the jobs into two groups, and then find one where the sums are the same!”

That might have been a little wack.  There are only 10 jobs, so there are only 1024 subsets to sift through; it’s not unthinkable.  And, it has the additional advantage of working.  Perfectly.  Every time.  Greedy doesn’t, especially when you get a set of numbers with a really wide range, for example

771 121 281 854 885 734 486 1003 83 62

(Taken from American Scientist, The Easiest Hard Problem)

The largest number there is about 16 times the smallest, and that’ll give Greedy fits when trying to come up with a perfect solution.  (It still comes fairly close.  Maybe close enough, depending on the project.)

The enumeration algorithm will find the partition for these, but of course tomorrow the boss is going to come in with twelve jobs to be scheduled, and in accordance with the laws of NP-completeness, each additional job doubles the number of subsets that have to be examined, so now you’re up to 4096 subsets, and we haven’t even thought about working with numbers from ten to a million, say.

### The Subset Sum Problem

Now, this scheduling problem is really only a specialization of the Subset Sum problem.  The Subset Sum problem is simply this: Given a set of numbers and another number N, is there a subset of those numbers that sums to N?

Once again, this problem is NP-complete, so there’s no 100% way of figuring out the answer except by brute-forcing, looking at all the subsets to see if one fits.  I found a lovely LINQ example of enumerating all the subsets on Igor Ostrovsky’s blog:

```
var subsets = from m in Enumerable.Range(0, 1 << list.Count)
              select
                  from i in Enumerable.Range(0, list.Count)
                  where (m & (1 << i)) != 0
                  select list[i];
```

Nice!  It blows up pretty quickly, though.  On my machine, I couldn’t get it to enumerate subsets for a list of 25 items without running out of memory, and “1 << list.Count” isn’t going to get past the number of bits in an int anyway, without some tricks.  Still, if this runs, finding a subset that sums to a given number is just

```
return subsets.First(set => set.Sum() == sum);
```
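The same bitmask enumeration reads naturally in Python, too (my sketch of the idea, not code from the post):

```python
def find_subset_with_sum(nums, target):
    # Brute force: bitmask m selects nums[i] whenever bit i of m is set.
    for m in range(1 << len(nums)):
        subset = [nums[i] for i in range(len(nums)) if m & (1 << i)]
        if sum(subset) == target:
            return subset
    return None

# The American Scientist numbers do have a perfect half-half partition.
nums = [771, 121, 281, 854, 885, 734, 486, 1003, 83, 62]
print(sum(find_subset_with_sum(nums, 2640)))  # 2640
```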

If you don’t need the exact number, but are happy just to get close, Greedy will get the job done and you’re not limited by the length of the list, either:

var result = new List<int>();

foreach (int i in list.OrderByDescending(k => k).SkipWhile(k => k > sum))
{
    if (result.Sum() + i > sum)
        return result;
    else
        result.Add(i);
}

return result;

I ran Greedy against a thousand-item list and it ran in a couple of seconds, but I had to use a fault-tolerance of 20% to get the result I wanted.  Check out the results on GitHub.
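Here’s the same greedy idea sketched in Python, in a slightly more forgiving variant that skips any number that would overshoot instead of stopping at the first one (the names and the variant are mine):

```python
def greedy_subset_sum(nums, target):
    """Take numbers largest-first, skipping any that would overshoot target."""
    result = []
    running = 0
    for n in sorted(nums, reverse=True):
        if running + n <= target:
            result.append(n)
            running += n
    return result

picked = greedy_subset_sum([771, 121, 281, 854, 885, 734, 486, 1003, 83, 62], 2000)
print(sum(picked))  # 1971 -- within 1.5% of the target, never over it
```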

Is Greedy the best we can do?  I think we can do better.  I’ll post some ideas next time.

posted @ Sunday, May 2, 2010 10:38 PM | Feedback (2)
Wednesday, February 17, 2010
A hash-based solution for the closest-pair problem

Here’s an interesting solution to the closest-pair problem.  (For background, read parts one and two.)  (I’m going to drop to one-dimensional geometry initially to make the calculations a little simpler.)  We’ll be trying to find the two closest points on a line, like this:

Suppose we can determine a way to partition the space into equal areas in such a way that at most one point falls within any given area, and that there are at least two adjacent nonempty intervals:

Then, determining the closest pair becomes an easy task.  We re-partition the space into areas that are double the size, and that guarantees that the closest points are either in the same area or in adjacent areas.
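To make that concrete, here’s a one-dimensional sketch in Python (helper names are mine): given an interval size no bigger than the true closest gap, double it, drop the points into buckets, and compare only points in the same or adjacent buckets:

```python
from collections import defaultdict
from itertools import combinations

def closest_gap_with_buckets(points, interval):
    """Assumes interval <= the true closest gap, so after doubling it
    the closest pair must share a bucket or sit in adjacent buckets."""
    width = 2 * interval
    buckets = defaultdict(list)
    for p in points:
        buckets[int(p // width)].append(p)
    best = float('inf')
    for b in list(buckets):
        candidates = buckets[b - 1] + buckets[b] + buckets[b + 1]
        for p, q in combinations(candidates, 2):
            best = min(best, abs(p - q))
    return best

points = [0.1, 4.0, 1.3, 9.9, 1.9, 7.2]
gap = closest_gap_with_buckets(points, 0.6)
print(round(gap, 3))  # 0.6 -- the gap between 1.3 and 1.9
```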

But, how do we determine the correct size for the partition?  Ah, there’s the rub, and the majority of the difficulty in the algorithm.  We’ll start with a simple multidictionary implementation that will let us place points in intervals:

public class Buckets
{
    readonly float _IntervalSize;
    Dictionary<int, List<float>> buckets = new Dictionary<int, List<float>>();

    public Buckets(float intervalSize)
    {
        _IntervalSize = intervalSize;
    }

    public int Bucket(float r)
    {
        return (int)(r / _IntervalSize);
    }

    public void Add(int bucket, float r)
    {
        if (!buckets.ContainsKey(bucket))
        {
            buckets[bucket] = new List<float>();
        }
        buckets[bucket].Add(r);
    }

    public IEnumerable<float> Get(int id)
    {
        if (buckets.ContainsKey(id))
            return buckets[id];
        else
            return new List<float>();
    }
}

A reasonable first guess at an interval size is to divide the length of the line by the total number of points:

public float FindIntervalSize(List<float> S)
{
    float intervalsize = (S.Max() - S.Min()) / S.Count;
    var buckets = new Buckets(intervalsize);

Now we’ll start throwing our points into buckets.  If one of the buckets gets too large, we’ll break out and try something else.

    var T = new List<float>(S);

    while (T.Count > 0)
    {
        while (T.Count > 0 && !buckets.HasBucketLargerThan(Math.Sqrt(S.Count)))
        {
            float r = T[T.Count - 1];
            T.RemoveAt(T.Count - 1);
            buckets.Add(buckets.Bucket(r), r);
        }

So at this point, we’ve either taken care of all our points, or we have some large buckets to deal with.  What do we do with the buckets?  We’ll recur, of course!  We’ll run FindIntervalSize again on each of our large buckets, and take the smallest interval we’ve found so far.  That’ll get us a little farther.

        intervalsize = buckets.BucketsWithMoreThanOneItem()
            .Select(items => FindIntervalSize(items))
            .Union(new[] { intervalsize })
            .Min();
    }

(I added a couple of helper functions to Buckets to make this code clearer.  The rather obvious implementations will be in the source code on GitHub.)  So now we’ve got a tentative interval size based on our recursion, but it’s not impossible that a pair slipped through the cracks somewhere.  We’re going to have to run the entire set of points again to verify that the interval size we’ve chosen is correct.  If we still have buckets with more than one point in them, we’ll try again to recur into them to get our final interval size.

    Buckets morebuckets = new Buckets(intervalsize);
    foreach (float r in S)
        morebuckets.Add(morebuckets.Bucket(r), r);

    intervalsize = morebuckets.BucketsWithMoreThanOneItem()
        .Select(items => FindIntervalSize(items))
        .Union(new[] { intervalsize })
        .Min();

    return intervalsize;
}

Okay!  So now we’ve chosen our interval size.  To select our final answer, first we’ll put all points in buckets using twice the interval size.  Then, since we know that the closest points are either in the same bucket or neighboring buckets, we just have to check those three buckets for each given point.  I’ll add another pair of little helper functions to Buckets to find that list of points:

public IEnumerable<float> PointsNearBucket(int id)
{
    return Enumerable.Range(id - 1, 3).SelectMany(i => Get(i));
}

public IEnumerable<Pair> CandidatesInBucket(int id)
{
    var list = PointsNearBucket(id).ToList();
    return Enumerable.Range(0, list.Count)
        .SelectMany(i => Enumerable.Range(i + 1, list.Count - (i + 1))
            .Select(j => new Pair(list[i], list[j])));
}

And here’s the method that puts the whole thing together.

PointF ClosestFloats(List<float> floats)
{
    float intervalsize = FindIntervalSize(floats);
    var buckets = new Buckets(2.0f * intervalsize);
    foreach (float f in floats)
        buckets.Add(buckets.Bucket(f), f);

    var ordered = buckets.GetBuckets();
    var pair = ordered.SelectMany(i => buckets.CandidatesInBucket(i))
        .OrderBy(p => Math.Abs(p.first - p.second))
        .First();

    return new PointF(pair.first, pair.second);
}

An interesting strategy!  Is it faster?  I’m not entirely sure.  The algorithm comes from a paper by Steve Fortune and John Hopcroft all the way back in 1978, and they say the algorithm should run in O(n log log n) time.  But they don’t provide a two-dimensional implementation, so I can’t make an honest comparison to my previous implementations.  I might be able to put something together.  Source code is available on Github.

Finally, this code will fail with a stack overflow if two of the points happen to be identical.  That gives us a bucket size of 0, and it’s really hard to put points in a bucket that small!  I’m not sure how to solve that problem elegantly.  Suggestions welcome!
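One straightforward guard (my own suggestion, sketched in Python): scan for duplicates before computing any interval size, and short-circuit with a distance of zero if any exist, so the bucketing machinery never sees a zero-width bucket:

```python
def duplicate_point(points):
    """Return a duplicated value if one exists, else None.
    A duplicate means the closest-pair distance is exactly 0,
    so we can skip the interval-size search entirely."""
    seen = set()
    for p in points:
        if p in seen:
            return p
        seen.add(p)
    return None

print(duplicate_point([3.0, 1.5, 3.0]))  # 3.0 -- distance is 0, skip bucketing
print(duplicate_point([3.0, 1.5, 2.0]))  # None -- safe to bucket
```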

posted @ Wednesday, February 17, 2010 10:50 PM | Feedback (1)
Sunday, January 31, 2010
A faster method for solving the closest-pair problem

As is often the case, we’ll speed up the solver by using the divide-and-conquer method.  We’ll split the graph down the middle, solve the problem for the left side, solve the problem for the right side, and we’ll be done!

How do we solve the problem for the left side?  Easy!  We’ll split the graph down the middle, solve the problem for the left side, solve the problem for the right side, and we’ll be done!

This technique is called recursion.  The key thing to understand about recursion in general is that there has to be a stopping point.  For example, a function to calculate a factorial might look something like this:

int Factorial(int i)
{
    return Factorial(i - 1) * i;
}

But this function won’t work, because it will continually call itself over and over until the machine can’t handle it any more.  So, we’ll just add a stopping point to say that the factorial value of 1 is, yes, 1.

int Factorial(int i)
{
    if (i == 1) return 1;
    return Factorial(i - 1) * i;
}

When will we stop our algorithm?  It's a pretty fair bet that if there are three or fewer points left to be analyzed, the brute-force method will work just as well as any other, so we'll use that as a limit.  So, here's what we have so far:

Segment Closest_Recursive(List<PointF> points)
{
    if (points.Count() < 4) return Closest_BruteForce(points);

    int split = points.Count() / 2;
    var ordered = points.OrderBy(point => point.X);     // sort the points by their x value
    var pointsOnLeft = ordered.Take(split).ToList();    // split into two groups, those to the
    var pointsOnRight = ordered.Skip(split).ToList();   // left and those to the right

    var leftMin = Closest_Recursive(pointsOnLeft);
    var rightMin = Closest_Recursive(pointsOnRight);

    if (leftMin.Length() < rightMin.Length())
        return leftMin;
    else
        return rightMin;
}
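The Closest_BruteForce call in the base case isn't shown here; it's just "compare every pair and keep the closest."  A quick sketch in Python for points in the plane (the function name is mine):

```python
from itertools import combinations
from math import dist  # Python 3.8+

def closest_brute_force(points):
    """Check all n*(n-1)/2 pairs; fine for the tiny base case."""
    return min(combinations(points, 2), key=lambda pair: dist(*pair))

pts = [(0, 0), (5, 5), (1, 1), (9, 0)]
print(closest_brute_force(pts))  # ((0, 0), (1, 1))
```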

But there’s a snag.  Do you see it?

Suppose as we recurse along we end up drawing one of our lines like this:

Now our closest pair is neither on the left nor the right.  It crosses between the two.  So it seems like we still have to examine each point to see if it’s closer to a point on the other side than it is to any points on its own side.

But, we have some additional information to work with now.  We know the distance between the two closest points that don’t cross the line.  That means that if there are any crossing pairs that are closer, they will have to be at least that close to the line.  Let’s find all the points on the right-hand side that are that close:

float minDist = Math.Min(leftMin.Length(), rightMin.Length());
var xDivider = pointsOnLeft.Last().X;
var closeY = pointsOnRight.TakeWhile(point => point.X - xDivider < minDist)
    .OrderBy(point => point.Y);

It’s handy that the points are sorted by X value – it allows us to stop our search once we get too far away from the line.  And note that closeY is now sorted by Y value, which allows us to analyze only a small number of points on the right for each point on the left.  To finish off, we’ll take all the points on the left that are close to the line, and for each one, only examine the points on the right that are within minDist in the Y direction.

var crossingPairs = pointsOnLeft.SkipWhile(point => xDivider - point.X > minDist)
    .SelectMany(p1 => closeY.SkipWhile(i => i.Y < p1.Y - minDist)
        .TakeWhile(i => i.Y < p1.Y + minDist)
        .Select(p2 => new Segment(p1, p2)));

And take the shortest distance out of everything.

return crossingPairs.Union(new[] { leftMin, rightMin })
    .OrderBy(segment => segment.Length())
    .First();
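For reference, the whole recursion fits in a short Python sketch (my own translation of the C# above, with the same split-and-strip logic; names are mine):

```python
from itertools import combinations
from math import dist

def closest_recursive(points):
    """Divide-and-conquer closest pair; expects a list of (x, y) tuples."""
    return _closest(sorted(points))              # sort by x once, up front

def _closest(pts):
    if len(pts) < 4:                             # base case: brute force
        return min(combinations(pts, 2), key=lambda p: dist(*p))
    split = len(pts) // 2
    left, right = pts[:split], pts[split:]
    best = min(_closest(left), _closest(right), key=lambda p: dist(*p))
    min_dist = dist(*best)
    x_divider = left[-1][0]
    # only points within min_dist of the dividing line can beat `best`
    strip = [p for p in pts if abs(p[0] - x_divider) < min_dist]
    strip.sort(key=lambda p: p[1])               # sort the strip by y
    for i, p in enumerate(strip):
        for q in strip[i + 1:]:
            if q[1] - p[1] >= min_dist:          # too far apart in y; stop
                break
            if dist(p, q) < min_dist:
                best, min_dist = (p, q), dist(p, q)
    return best

pts = [(0.0, 0.0), (3.0, 4.0), (2.9, 4.1), (7.0, 1.0), (5.0, 5.0)]
print(closest_recursive(pts))  # ((3.0, 4.0), (2.9, 4.1))
```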

Is it faster?  Well, I posted some speed comparisons on RosettaCode.  You can see there that for 10,000 points on the graph, this method ran in slightly more than a second.  The brute-force method?  Nearly two and a half minutes.  Choosing the right algorithm can make all the difference in the world :)

Source code available here: http://github.com/benfulton/Algorithmic-Alley

posted @ Sunday, January 31, 2010 4:31 PM | Feedback (1)