Posts
15
Comments
31
Trackbacks
0
October 2010 Entries
Extracting words from a string

I still plan on writing more on scheduling algorithms, but I was at Strange Loop last week and listened to Guy Steele talking about parallel algorithms.  He’s working in a language he called Fortress, but certainly the algorithms could be implemented in C# as well.

The problem Steele was working on to demonstrate a parallel algorithm is splitting a string into words.  For simplicity, we’ll assume that words are only separated by spaces.  Here are some sample tests so you can see the pattern.

 

        void Check(Splitter splitter)

        {

            Assert.Equal(new List<string> { "This", "is", "a", "sample" }, splitter.Split("This is a sample").ToList());

            Assert.Equal(new List<string> { "Here", "is", "another", "sample" }, splitter.Split(" Here is another sample ").ToList());

            Assert.Equal(new List<string> { "JustOneWord" }, splitter.Split("JustOneWord").ToList());

            Assert.Empty(splitter.Split(" "));

            Assert.Empty(splitter.Split(""));

 

            Assert.Equal(new List<string> { "Here", "is", "a", "sesquipedalian", "string", "of", "words" }, splitter.Split("Here is a sesquipedalian string of words").ToList());

        }

 

In a typical implementation, it’s not too complicated to do this in a sequential way.  Start with an empty list.  Iterate over each character, building a word along the way, and when you hit a space, add the word to the list and start over.  Maybe something like this:

 

 

    public abstract class Splitter

    {

        public abstract IEnumerable<string> Split(string stringToSplit);

    }

 

    public class NormalSplitter : Splitter

    {

        public override IEnumerable<string> Split(string ss)

        {

            var result = new List<string>();

            string word = "";

            for (int i = 0; i < ss.Length; ++i)

            {

                if (ss[i] == ' ')

                {

                    if (!string.IsNullOrEmpty(word))

                    {

                        result.Add(word);

                        word = "";

                    }

                }

                else

                    word += ss[i];

            }

            if (!string.IsNullOrEmpty(word))

                result.Add(word);

 

            return result;

        }

    }

Of course, I’m not big on For loops.  Linq’s Aggregate function is more or less the same as the Reduce side of the MapReduce algorithm, or some of the Folding methods you see in other languages.  How would we do this using Aggregate?

Instead of the For loop, we give the Aggregate algorithm the seed, or result it should give back for an empty string, and a lambda method that tells it how to combine a list of words, with an additional character, like this:

            string word = "";

            var words = ss.AsEnumerable().Aggregate(new List<string>(), (List<string> result, char c) =>

The lambda method more or less is the same as the inside of the normal splitter, above:

                if (c == ' ')

                {

                    if (!string.IsNullOrEmpty(word))

                    {

                        result.Add(word);

                        word = "";

                    }

                }

                else

                    word += c;

 

                return result;

How do they compare?  They’re pretty close to each other.  I handed each one a string consisting of the text of Moby Dick and each one was able to process it in about 300 milliseconds.  Sometimes one is faster, sometimes the other.

Now, how would you break this problem down into chunks that multiple processors can work on?  It’s tricky.  We have a list of strings that we’re holding on to, we have a word that we’re holding on to, and as long as we have those two things, we’re stuck iterating over the characters one by one so we can update our stuff in the correct order.  If we try to process two characters simultaneously, the ordering of our words, or the words themselves, can be jumbled.

What do we do?  I’ll try to answer that question next time.

Author: Ben Fulton
posted @ Thursday, October 21, 2010 7:33 PM | Feedback (22)
Follow me on Twitter!