Posts
15
Comments
31
Trackbacks
0
A hash-based solution for the closest-pair problem

Ben Fulton

Here’s an interesting solution to the closest-pair problem.  (For background read parts one and two).  (I’m going to drop to one-dimensional geometry initally to make the calculations a little simpler.)  We’ll be trying to find the two closest points on a line, like this:

line1

Suppose we can determine a way to partition the space into equal areas in such a way that at most one point falls within any given area, and that there are at least two adjacent nonempty intervals:

line1

 

Then, determining the closest pair becomes an easy task.  We re-partition the space into areas that are double the size, and that guarantees that the closest points are either in the same area or in adjacent areas.

But, how do we determine the correct size for the partition?  Ah, there’s the rub, and the majority of the difficulty in the algorithm.  We’ll start with a simple multidictionary implementation that will let us place points in intervals:

    public class Buckets

    {

        private readonly float _IntervalSize;

        Dictionary<int, List<float>> buckets = new Dictionary<int, List<float>>();

 

        public Buckets(float intervalSize)

        {

            _IntervalSize = intervalSize;

        }

 

        public int Bucket(float r)

        {

            return (int)(r / _IntervalSize);

        }

        public void Add(int bucket, float r)

        {

            if (!buckets.ContainsKey(bucket))

            {

                buckets[bucket] = new List<float>();

            }

            buckets[bucket].Add(r);

        }

 

        IEnumerable<float> Get(int id)

        {

            if (buckets.ContainsKey(id))

                return buckets[id];

            else

                return new List<float>();

        }

    }

 

A reasonable first guess at an interval size is to divide the length of the line by the total number of points:

        public float FindIntervalSize(List<float> S)

        {

            float intervalsize = (S.Max() - S.Min()) / S.Count;

Now we’ll start throwing our points into buckets.  If one of the buckets gets too large, we’ll break out and try something else.

            var T = new List<float>(S);

            while (T.Count > 0)

            {

                while (T.Count > 0 && !buckets.HasBucketLargerThan(Math.Sqrt(S.Count)))

                {

                    float r = T[T.Count - 1];

                    T.RemoveAt(T.Count - 1);

                    int bucket = buckets.Bucket(r);

                    buckets.Add(bucket, r);

                }

So at this point, we’ve either taken care of all our points, or we have some large buckets to deal with.  What do we do withthe buckets?  We’ll recur, of course!  We’ll run FindIntervalSize again on each one of our buckets, and take the smallest interval we’ve found so far.  That’ll get us a little farther.

                intervalsize = buckets.BucketsWithMoreThanOneItem()

                    .Select(items => FindIntervalSize(items))

                    .Union(new[] { intervalsize })

                    .Min();

            }

(I added a couple of helper functions to Buckets to make this code clearer.  The rather obvious implementations will be in the source code on GitHub.)  So now we’ve got a tentative interval size based on our recursion, but it’s not impossible that a pair slipped through the cracks somewhere.  We’re going to have to run the entire set of points again to verify that the interval size we’ve chosen is correct.  If we still have buckets with more than one point in them, we’ll try again to recur into them to get our final interval size.

            Buckets morebuckets = new Buckets(intervalsize);

            S.ForEach(r => morebuckets.Add(morebuckets.Bucket(r), r));

            intervalsize = morebuckets.BucketsWithMoreThanOneItem()

                .Select(items => FindIntervalSize(items))

                .Union(new[] { intervalsize })

                .Min();

 

            return intervalsize;

        }

Okay!  So now we’ve chosen our interval size.  To select our final answer, first we’ll put all points in buckets using twice the interval size.  Then, since we know that the closest points are either in the same bucket or neighboring buckets, we just have to check those three buckets. for each given point.  I’ll add another pair of little helper functions to Buckets to find that list of points:

            public IEnumerable<float> PointsNearBucket(int id)

            {

                return Enumerable.Range(id - 1, 3).SelectMany(i => Get(i));

            }

 

            public IEnumerable<Pair> CandidatesInBucket(int id)

            {

                var list = PointsNearBucket(id).ToList();

                return Enumerable.Range(0, list.Count())

                    .SelectMany(i => Enumerable.Range(i + 1, list.Count() - (i + 1))

                        .Select(j => new Pair(list[i], list[j])));

            }

And here’s the method that puts the whole thing together.

        PointF ClosestFloats(List<float> floats)

        {

            float intervalsize = FindIntervalSize(floats);

            var buckets = new Buckets(2.0f * intervalsize);

            floats.ForEach(r => buckets.Add(buckets.Bucket(r), r));

            var ordered = buckets.GetBuckets();

            var pair = ordered.SelectMany(i => buckets.CandidatesInBucket(i))

                .OrderBy(p => Math.Abs(p.first - p.second))

                .First();

 

            return new PointF(pair.first, pair.second);

        }

 

An interesting strategy!  Is it faster?  I’m not entirely sure.  The algorithm comes from a paper by Steve Fortune and John Hopcroft all the way back in 1978, and they say the algorithm should run in O(n log log n) time.  But they don’t provide a two-dimensional implementation, so I can’t make an honest comparison to my previous implementations.  I might be able to put something together.  Source code is available on Github.

Finally, this code will fail with a stack overflow if two of the points happen to be identical.  That gives us a bucket size of 0, and it’s really hard to put points in a bucket that small!  I’m not sure how to solve that problem elegantly.  Suggestions welcome!

posted on Wednesday, February 17, 2010 10:50 PM Print
test