The subset sum problem

Posts: 15
Comments: 31
Trackbacks: 0

<< A hash-based solution for the closest-pair problem | Home | An approximate solution for the Subset Sum problem >>

Job Scheduling

Suppose you’ve got two processors, and ten jobs to run on them. You know the amount of time each job needs to complete, and your goal is to minimize the makespan of the jobs (that is, the time it takes to get all the jobs finished). How would you do it?

It doesn’t take much thought before you hit on an algorithm called greedy. The Greedy algorithm sorts the jobs by size, then assigns the largest to the first processor, the second largest to the second processor, the third back to the first processor, and so on. And you know what? That’s a fine algorithm. In most circumstances it’ll get the job done for you.

On the other hand, if you’re in the habit of working with an infinite number of really fast processors, you might have said, “I know! I’ll enumerate every possible way of splitting the jobs into two groups, and then find one where the sums are the same!”

That might have been a little wack. There’re only 10 jobs, so there are only 1024 subsets to sift through, so it’s not unthinkable. And, it has the additional advantage of working. Perfectly. Every time. Greedy doesn’t, especially when you get a set of numbers with a really wide range, for example

771 121 281 854 885 734 486 1003 83 62

(Taken from American Scientist, The Easiest Hard Problem)

The largest number there is 15 times the smallest number, and that’ll give Greedy fits when trying to come up with a perfect solution. (It still comes fairly close. Maybe close enough, depending on the project.)

The enumeration algorithm will find the partition for these, but of course tomorrow the boss is going to come in with twelve jobs to be scheduled, and in accordance with the laws of NP-completeness, each additional job doubles the number of subsets that have to be examined, so now you’re up to 4000 subsets, and we haven’t even thought about working with numbers from ten to a million, say.

The Subset Sum Problem

Now, this scheduling problem is really only a specialization of the Subset Sum problem. The Subset Sum problem is simply this: Given a set of numbers and another number N, is there a subset of those numbers that sums to N?

Once again, this problem is NP-complete, so there’s no 100% way of figuring out the answer except by brute-forcing, looking at all the subsets to see if one fits. I found a lovely LINQ example of enumerating all the subsets on Igor Ostrovsky’s blog:

             var subsets = from m in Enumerable.Range(0, 1 << list.Count)
              select
                  from i in Enumerable.Range(0, list.Count)
                  where (m & (1 << i)) != 0
                  select list[i];

Nice! It blows up pretty quickly, though. On my machine, I couldn’t get it to enumerate subsets for a list of 25 items without running out of memory, and “1 << list.Count” isn’t going to get past the number of bits in an int anyway, without some tricks. Still, if this runs, returning a subset that equals a given number sum is just

return subsets.First(set => set.Sum() == sum)

If you don’t need the exact number, but are happy just to get close, Greedy will get the job done and you’re not limited by the length of the list, either:

            var result = new List<int>();
            foreach (int i in list.OrderByDescending(k => k).SkipWhile( k => k > sum))
            {
                if (result.Sum() + i > sum)
                    return result;
                else
                    result.Add(i);
            }

I ran Greedy against a thousand-item list and it ran in a couple of seconds, but I had to use a fault-tolerance of 20% to get the result I wanted. Check out the results on GitHub.

Is Greedy the best we can do? I think we can do better. I’ll post some ideas next time.

Algorithmic Alley

Job Scheduling

The Subset Sum Problem