An approximate solution for the Subset Sum problem

We described Greedy in Part 1.  It’s very fast and often quite accurate, but not always accurate enough.  So how can we improve on it?

Remember, we can get perfect accuracy by enumerating all of the possible subsets of the numbers, but the number of subsets gets out of hand fast.  Still, look at this list:

{ 1, 3, 4, 6, 9 }

While this list has 32 distinct subsets, it does not have 32 distinct subset sums.  The subset { 1, 6 } sums to 7, and the subset { 3, 4 } also adds up to 7.  How can we take advantage of this?

Computing all subsets

The LINQ algorithm I posted to calculate subsets is nice, but we need a more iterative way to generate the subsets so we can poke at them in between iterations.  Here’s another solution:

        public override List<int> Find(List<int> list, int sum)
        {
            var subsets = new List<IEnumerable<int>> { new List<int>() };

            for (int i = 0; i < list.Count; i++)
            {
                var T = subsets.Select(y => y.Concat(new[] { list[i] }));

                subsets.AddRange(T.ToList());

                var result = subsets.Find(s => s.Sum() == sum);

                if (result != null)
                    return result.ToList();
            }

            return null;
        }

The idea is to start with the subset with no items, then on each iteration take our current list of subsets and add the next number in the list to each of them, creating an additional batch of subsets.  So if you had the list { 1, 2, 3 }, after the first iteration you’d have the empty set plus the subset { 1 }.  On the next iteration, you’d add { 2 } and { 1, 2 } alongside those.  On the third iteration, you’d add 3 to all of those sets, so the new sets would be { 3 }, { 1, 3 }, { 2, 3 }, and { 1, 2, 3 }.
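To make the doubling concrete, here’s a tiny sketch of my own (not code from the repository) that runs the same loop over { 1, 2, 3 } and prints the subset list after each pass.  It assumes the usual System, System.Collections.Generic, and System.Linq usings, plus .NET 4’s string.Join overloads:

            var list = new List<int> { 1, 2, 3 };
            var subsets = new List<IEnumerable<int>> { new List<int>() };

            for (int i = 0; i < list.Count; i++)
            {
                // Extend every existing subset with the next number, keeping the originals too.
                var T = subsets.Select(y => y.Concat(new[] { list[i] }));
                subsets.AddRange(T.ToList());

                Console.WriteLine("After {0}: {1}", list[i],
                    string.Join("  ", subsets.Select(s => "{" + string.Join(",", s) + "}")));
            }

After the first pass it prints “{}  {1}”, after the second “{}  {1}  {2}  {1,2}”, and so on.  Each pass doubles the size of the list, which is exactly why the trimming below matters.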

Getting rid of duplicates

Let’s take the idea even further: instead of only eliminating subsets that sum to the same amount, we’ll eliminate a subset even when its sum is merely close to another’s.

We’ll start by getting the subsets ordered by sum:

        public override List<int> Find(List<int> list, int sum)
        {
            var subsets = new List<IEnumerable<int>> { new List<int>() };

            for (int i = 0; i < list.Count; i++)
            {
                var T = subsets.Select(y => y.Concat(new[] { list[i] }));
                var U = subsets.Union(T).OrderBy(k => k.Sum()).ToList();

Next, we’ll eliminate all but one of any subsets whose sums fall within a delta of each other.  We’ll also eliminate any subsets that go over the target sum.

                subsets = RemoveCloseSums(U, _delta).Where(s => s.Sum() <= sum).ToList();

Since we’re providing a delta anyway, we’ll go ahead and return any subset that gets us within delta of our target.  Note that this is what Greedy is doing as well, although in Greedy’s case it just stops looking when it’s out of ideas.

                var result = subsets.Find(s => s.Sum() <= sum && s.Sum() > (1 - _delta) * sum);

                if (result != null)
                    return result.ToList();
            }

            return null;
        }

And that’s it.
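The loop above leans on a RemoveCloseSums helper and a _delta field that I haven’t shown here; they live in the GitHub repository.  As a rough idea of what that kind of trimming step can look like, here’s a hedged sketch of my own (not necessarily the repo’s code): walk the sum-ordered subsets and keep only one representative whenever several sums land within delta of each other.

        private static List<IEnumerable<int>> RemoveCloseSums(List<IEnumerable<int>> orderedSubsets, double delta)
        {
            // orderedSubsets is assumed to already be sorted by sum, as it is in Find above.
            var kept = new List<IEnumerable<int>>();
            double lastKeptSum = 0;

            foreach (var subset in orderedSubsets)
            {
                int s = subset.Sum();

                // Keep this subset only if its sum is more than delta beyond the last one we kept.
                if (kept.Count == 0 || s > lastKeptSum * (1 + delta))
                {
                    kept.Add(subset);
                    lastKeptSum = s;
                }
            }

            return kept;
        }

This is essentially the “trim” step from the textbook approximation scheme for subset sum: subsets whose sums are nearly identical are interchangeable for our purposes, so keeping one of them is enough.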

Is it any good?

Well, yes and no.  Greedy does a pretty bang-up job in most cases that I’ve tried.  If you have a lot of little jobs that can be poked in near the end after the big ones are finished, Greedy will find them and slot them in when appropriate.  But you can fool it.  One of the test lists I used was 100 jobs, with times ranging from 200 to 5000, and Greedy had a hard time hitting the target precisely, while the approximating algorithm came up with a qualifying set just as quickly.  Paring off subsets that went over the target helped a lot with that as well.  On the other hand, Greedy didn’t much care whether my target value was 5000 or 50000; it chose its solution and returned.  Approximating took a lot longer to compute the 50000 set.

Still, I didn’t put a lot of effort into optimizing the C# code.  Look at all those Sum() calls up there – I’m guessing the compiler isn’t managing to optimize those away, and if not, they’re really going to hurt performance.  The next step in speeding up the algorithm would probably be to eliminate those, and then you might be able to compete with Greedy again.
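For instance, one low-effort rework (an untested sketch of my own; the class and method names are illustrative, not from the repository) is to carry each subset around with its running total, so a sum costs a single addition instead of re-enumerating the list every time it’s compared:

        class SummedSubset
        {
            public List<int> Items;
            public int Sum;
        }

        public List<int> FindWithCachedSums(List<int> list, int target, double delta)
        {
            var subsets = new List<SummedSubset> { new SummedSubset { Items = new List<int>(), Sum = 0 } };

            for (int i = 0; i < list.Count; i++)
            {
                // Each new subset's sum is one addition, not a full re-enumeration.
                var extended = subsets.Select(s => new SummedSubset
                {
                    Items = new List<int>(s.Items) { list[i] },
                    Sum = s.Sum + list[i]
                }).Where(s => s.Sum <= target).ToList();

                subsets.AddRange(extended);

                // A trimming pass like RemoveCloseSums would slot in here, comparing the
                // cached Sum fields instead of calling Sum() on each IEnumerable.

                var hit = subsets.Find(s => s.Sum > (1 - delta) * target);
                if (hit != null)
                    return hit.Items;
            }

            return null;
        }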

But that is left as an exercise for the reader.  Once again, the code is available on GitHub.  Happy coding!

Author: Ben Fulton
posted @ Monday, May 31, 2010 9:04 AM
The subset sum problem

Job Scheduling

Suppose you’ve got two processors, and ten jobs to run on them.  You know the amount of time each job needs to complete, and your goal is to minimize the makespan of the jobs (that is, the time it takes to get all the jobs finished).  How would you do it?

It doesn’t take much thought before you hit on an algorithm called Greedy.  The Greedy algorithm sorts the jobs by size, largest first, then hands each job in turn to whichever processor currently has less total work on it.  And you know what?  That’s a fine algorithm.  In most circumstances it’ll get the job done for you.
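In code, that greedy scheduler is only a few lines.  Here’s a quick sketch of my own (not code from this post’s repository) for exactly two processors, assuming the usual System.Collections.Generic and System.Linq usings:

        // Largest jobs first; each job goes to whichever processor currently has less total work.
        static void GreedySchedule(IEnumerable<int> jobTimes, out List<int> processor1, out List<int> processor2)
        {
            processor1 = new List<int>();
            processor2 = new List<int>();
            int load1 = 0, load2 = 0;

            foreach (int job in jobTimes.OrderByDescending(t => t))
            {
                if (load1 <= load2)
                {
                    processor1.Add(job);
                    load1 += job;
                }
                else
                {
                    processor2.Add(job);
                    load2 += job;
                }
            }
        }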

On the other hand, if you’re in the habit of working with an infinite number of really fast processors, you might have said, “I know!  I’ll enumerate every possible way of splitting the jobs into two groups, and then find one where the sums are the same!”

That might have been a little wack.  But there are only 10 jobs, so there are only 1024 subsets to sift through; it’s not unthinkable.  And it has the additional advantage of working.  Perfectly.  Every time.  Greedy doesn’t, especially when you get a set of numbers with a really wide range, for example

771 121 281 854 885 734 486 1003 83 62

(Taken from American Scientist, The Easiest Hard Problem)

The largest number there is more than 15 times the smallest number, and that’ll give Greedy fits when trying to come up with a perfect solution.  (It still comes fairly close.  Maybe close enough, depending on the project.)
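(For the record, those ten numbers total 5,280, so a perfect partition needs 2,640 on each side, and one does exist: 1003 + 885 + 486 + 121 + 83 + 62 = 2640 for one processor, and 854 + 771 + 734 + 281 = 2640 for the other.)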

The enumeration algorithm will find the partition for these, but of course tomorrow the boss is going to come in with twelve jobs to be scheduled, and, true to its exponential nature, each additional job doubles the number of subsets that have to be examined, so now you’re up to over 4,000 subsets, and we haven’t even thought about working with numbers from ten to a million, say.

The Subset Sum Problem

Now, this scheduling problem is really only a specialization of the Subset Sum problem.  The Subset Sum problem is simply this: given a set of numbers and another number N, is there a subset of those numbers that sums to N?  (For the two-processor case, N is just half the total job time.)

Once again, this problem is NP-complete, so there’s no known way to guarantee an exact answer efficiently; the straightforward approach is brute force, looking at all the subsets to see if one fits.  I found a lovely LINQ example of enumerating all the subsets on Igor Ostrovsky’s blog:

             var subsets = from m in Enumerable.Range(0, 1 << list.Count)
                           select
                               from i in Enumerable.Range(0, list.Count)
                               where (m & (1 << i)) != 0
                               select list[i];

Nice!  It blows up pretty quickly, though.  On my machine, I couldn’t get it to enumerate subsets for a list of 25 items without running out of memory, and “1 << list.Count” isn’t going to get past the number of bits in an int anyway, without some tricks.  Still, if this runs, returning a subset that sums to a given target is just

return subsets.First(set => set.Sum() == sum)
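Put together, a complete brute-force method might look something like this (my own wrapper around the snippets above, using FirstOrDefault so a miss returns null instead of throwing):

        // Brute-force exact subset sum using the bitmask enumeration above.
        // Only practical for small lists: the subset count doubles with every item,
        // and 1 << list.Count breaks down as list.Count approaches the bits in an int.
        static List<int> FindExact(List<int> list, int sum)
        {
            var subsets = from m in Enumerable.Range(0, 1 << list.Count)
                          select
                              from i in Enumerable.Range(0, list.Count)
                              where (m & (1 << i)) != 0
                              select list[i];

            var match = subsets.FirstOrDefault(set => set.Sum() == sum);
            return match == null ? null : match.ToList();
        }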

If you don’t need the exact number, but are happy just to get close, Greedy will get the job done and you’re not limited by the length of the list, either:

            var result = new List<int>();

            foreach (int i in list.OrderByDescending(k => k).SkipWhile(k => k > sum))
            {
                if (result.Sum() + i > sum)
                    return result;
                else
                    result.Add(i);
            }

            return result;

I ran Greedy against a thousand-item list and it ran in a couple of seconds, but I had to use a fault-tolerance of 20% to get the result I wanted.  Check out the results on GitHub.
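The greedy loop itself knows nothing about a tolerance, so one way to express that 20% fault-tolerance is to check the result it builds against the target afterwards.  Something along these lines (illustrative only, not necessarily how the repo’s tests phrase it, where result is the list built by the loop above):

            // Hypothetical acceptance check: treat the greedy result as good enough
            // if it reaches at least 80% of the target (a 20% fault-tolerance).
            double tolerance = 0.20;
            bool closeEnough = result.Sum() <= sum && result.Sum() >= (1 - tolerance) * sum;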

Is Greedy the best we can do?  I think we can do better.  I’ll post some ideas next time.

Author: Ben Fulton
posted @ Sunday, May 2, 2010 10:38 PM