In Part I, I put up a couple of simple algorithms to extract words from a string. In Part II, I showed some classes that remove the requirement that the characters be processed in order.
The final step is to reduce – to decide what order we're actually going to process those characters in.
The obvious solution, I would have thought, would have been the old divide-and-conquer algorithm:
public IWordState Reduce(IEnumerable<IWordState> set, int len)
{
    // Base cases: an empty set reduces to an empty chunk,
    // and a single element is already fully reduced.
    if (len == 0)
        return new Chunk("");
    if (len == 1)
        return set.First();

    // Recursive case: split the set in two, reduce each half,
    // and combine the results. Since Combine is associative,
    // the choice of split point doesn't affect the answer.
    int pivot = len / 2;
    var firstHalf = Reduce(set.Take(pivot).ToList(), pivot);
    var secondHalf = Reduce(set.Skip(pivot).ToList(), pivot + len % 2);
    return firstHalf.Combine(secondHalf);
}
This isn't too bad. On my machine it takes about five seconds to run on my standard data set, the text of Moby Dick. The intent of the implementation is very clear, too – but I suspect the LINQ usage is costing me quite a bit. Notice, for example, the ToList() calls on the sequences passed into the recursive calls that produce firstHalf and secondHalf. Without them the function takes much longer, but I'm sure there's a balance to strike in there somewhere.
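One way to cut down on the copying might be to index into an IList instead of slicing the sequence at each level of recursion. This is just a rough, untimed sketch along those lines – it assumes the same IWordState and Chunk types from the earlier parts:
// Rough sketch: the same divide-and-conquer reduce, but over an IList with
// explicit bounds, so no intermediate lists are allocated at each level.
// IWordState, Chunk, and Combine are the types from Parts I and II.
public IWordState Reduce(IList<IWordState> set, int start, int len)
{
    if (len == 0)
        return new Chunk("");
    if (len == 1)
        return set[start];

    int pivot = len / 2;
    var firstHalf = Reduce(set, start, pivot);
    var secondHalf = Reduce(set, start + pivot, len - pivot);
    return firstHalf.Combine(secondHalf);
}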
Here’s another idea:
private const int INT_PARALLEL_CHUNKS = 100;

public override IEnumerable<string> Split(string s)
{
    // Map each character to a word state, then repeatedly reduce the list
    // in parallel until fewer than INT_PARALLEL_CHUNKS states remain.
    var subsets = Reduce(Map(s)).ToList();
    while (subsets.Count > INT_PARALLEL_CHUNKS)
    {
        subsets = Reduce(subsets).ToList();
    }

    if (subsets.Any())
    {
        // Fold the remaining states together serially and emit the words.
        var final = subsets.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor));
        return final.Finalize();
    }
    return new List<string>();
}

public IEnumerable<IWordState> Reduce(IEnumerable<IWordState> set)
{
    // Batch the states into groups of INT_PARALLEL_CHUNKS by index, then
    // aggregate each batch in parallel, preserving the original order.
    var groups = set.Select((ws, index) => new { ws, index })
                    .GroupBy(g => g.index / INT_PARALLEL_CHUNKS, i => i.ws);

    return groups.AsParallel().AsOrdered()
                 .Select(batch => batch.Aggregate((runningProduct, nextFactor) => runningProduct.Combine(nextFactor)));
}
In this implementation, we split the list into groups of 100, run an aggregation over each group in parallel, and repeat until the list is shorter than 100 items. Then we aggregate the remaining word states together serially.
This is a little better, perhaps in the neighborhood of 1500 milliseconds. It still makes a lot of ToList() calls. And the fact is, on my machine, both of these are far slower than a simple serial algorithm, which takes around 300 milliseconds. Still, because each character carries its own state, this approach should scale a lot better – it wouldn't be hard to farm the processing out to multiple machines, for example.
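To see the batching idiom in isolation, here's a small standalone sketch that uses plain string concatenation as a stand-in for IWordState.Combine. It isn't the word-splitting code itself – just a demonstration that an associative combine lets you reduce ordered batches in parallel and still get the same answer as a straight left-to-right fold:
// Standalone demo of the batch-then-aggregate idea. String concatenation
// plays the role of Combine here; since it is associative, reducing the
// batches in parallel (but in order) matches a simple left-to-right fold.
using System;
using System.Linq;

class BatchReduceDemo
{
    // The real code uses batches of 100; 3 keeps this demo readable.
    const int ChunkSize = 3;

    static void Main()
    {
        var pieces = new[] { "Call", " me", " Ishmael", ".", " Some", " years", " ago" };

        var batched = pieces.Select((s, index) => new { s, index })
                            .GroupBy(g => g.index / ChunkSize, i => i.s)
                            .AsParallel().AsOrdered()
                            .Select(batch => batch.Aggregate((acc, next) => acc + next))
                            .ToList();

        // batched now holds "Call me Ishmael", ". Some years", " ago"
        Console.WriteLine(batched.Aggregate((acc, next) => acc + next));
        // prints: Call me Ishmael. Some years ago
    }
}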
I hope you’ve picked up some tips on combining items in a list without requiring a single state object to hold the aggregation. There are a lot of applications for this technique, and there will be even more once .NET 4 comes into common usage. Let me know if you find applications for it in your own code!
Author: Ben Fulton