Algorithmic Alley http://algorithmicalley.com/Default.aspx Algorithms you don't see very often en-US Ben Fulton Subtext Version 2.1.2.2 Algorithmic Alley http://algorithmicalley.com/images/RSS2Image.gif http://algorithmicalley.com/Default.aspx 77 60 Reconstructing words from fragments http://algorithmicalley.com/archive/2014/12/31/reconstructing-words-from-fragments.aspx <p> </p> <p>Suppose I give you a set of kmers. (For a definition of kmers, see my article on <a href="http://algorithmicalley.com/archive/2014/08/10/de-bruijn-graphs.aspx">De Bruijn graphs</a>.) How would you determine the words that they came from?</p> <p>Actually, let’s make it a little easier and say that the kmers came from a known list of words, and we have to determine the frequencies of the words. Let’s start with a list of words I’ve used before: <font face="Courier New">['ingratiate', 'gratification', 'aristocratic', 'aromaticity', 'palatial', 'beatific', 'beatify', 'latices', 'insatiable', 'artifice', 'beatified', 'certifiable']</font></p> <a title="data (scrabble) by Justin Grimes, on Flickr" style="float: left; padding-bottom: 5px; padding-top: 5px; padding-left: 5px; padding-right: 5px" href="https://www.flickr.com/photos/notbrucelee/8016192302"><img alt="data (scrabble)" src="https://farm9.staticflickr.com/8170/8016192302_0e9c4b7170.jpg" width="500" height="374" /></a> <p>Now suppose our list of kmers consists of ‘tify’ and ‘fiab’. We can say without question that the first one came from ‘beatify’ and the second one was from ‘certifiable’, since those are the only two instances of those kmers in our whole list. But suppose ‘able’ was in the list – there’s no way of knowing whether it came from ‘insatiable’ or ‘certifiable’.</p> <p>This is an interesting problem because it’s an approximation of a problem in biology called <a href="http://en.wikipedia.org/wiki/RNA-Seq">RNA Sequencing</a>. 
At any given time a cell has a bunch of strands of RNA floating around inside it, and exactly which strands are there and how many of them there are has a lot to do with the behavior of the cell. But our technology hasn’t advanced to the point where we can actually tell what RNA strands are in the cell – the best we can do is break them up and get parts of them, rather like our kmers. So we try to reconstruct the strands from the short reads that we have available to us.</p> <h2>Estimation</h2> <p>Now suppose we have a hundred kmers in our list, so it looks like <font face="Courier New">['ific', 'arom', 'tocr', 'tice', 'sati', 'arom',…]</font>, etc. How do we determine the original set of words in the list?</p> <p>The easiest thing to do is simply to guess. For each kmer, we’ll look to see which words it occurs in, and then choose one at random. Here’s that routine:</p> <p><font face="Courier New">def assign_reads(kmers): <br />    d = defaultdict(list) <br />    for read in kmers: <br />        words = [word for word in all_words if read in word] <br />        d[random.choice(words)].append(read) <br />    return d <br /></font></p> <p>Now that we’ve assigned each kmer to a word, we can estimate the frequencies of the words in the original list we’re trying to reconstruct.</p> <p><font face="Courier New">def get_frequencies(assigned_reads): <br />    total = sum(len(reads) for reads in assigned_reads.values()) <br />    rmc = dict() <br />    for word, reads in assigned_reads.iteritems(): <br />        rmc[word] = len(reads) / float(total) <br />    return rmc <br /></font></p> <p>(Note the <font face="Courier New">float</font> call – in Python 2, dividing two integers would silently round every frequency down to zero.)</p> <p>Here are some of the results you might get out of a routine like this:</p> <ul> <li>palatial: 0.03 </li> <li>ingratiate: 0.15 </li> <li>aristocratic: 0.2 </li> </ul> <p>From this we’ll guess that the original source data consisted of 20% “aristocratic”, 15% “ingratiate”, and 3% “palatial”. 
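To see the whole pipeline run end to end, here is a self-contained Python 3 sketch of the same idea. The assign_reads and get_frequencies routines mirror the ones above (ported to Python 3); the simulated source data, the fixed random seed, and the read generator around them are my own additions for illustration:

```python
import random
from collections import defaultdict

all_words = ['ingratiate', 'gratification', 'aristocratic', 'aromaticity',
             'palatial', 'beatific', 'beatify', 'latices', 'insatiable',
             'artifice', 'beatified', 'certifiable']

def kmers_in_word(w):
    # every 4-character substring of w
    return [w[i:i + 4] for i in range(len(w) - 3)]

def assign_reads(kmers):
    # assign each kmer to a random word that contains it
    d = defaultdict(list)
    for read in kmers:
        words = [word for word in all_words if read in word]
        d[random.choice(words)].append(read)
    return d

def get_frequencies(assigned_reads):
    total = sum(len(reads) for reads in assigned_reads.values())
    return {word: len(reads) / total
            for word, reads in assigned_reads.items()}

# Hypothetical source data: one of each word, plus two extra "aristocratic"
source = all_words + ['aristocratic', 'aristocratic']
random.seed(1)
reads = [random.choice(kmers_in_word(w)) for w in source for _ in range(10)]
freqs = get_frequencies(assign_reads(reads))
print(sorted(freqs.items(), key=lambda kv: -kv[1])[:3])
```

With three copies of “aristocratic” in the source, its estimated frequency comes out well above a word like “palatial” that appears only once.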
This is pretty close to correct, as I created the source by using one instance of each word in the original list, except for replacing both “certifiable” and “gratification” with “aristocratic”. So simply by assigning each kmer at random to a word that contains it, we can create a rough approximation of the original frequencies of the words in our list. Code is available on <a href="https://github.com/benfulton/Algorithmic-Alley/blob/master/AlgorithmicAlley/mRNA_Abundance.py">Github</a>.</p><img src="http://algorithmicalley.com/aggbug/18.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2014/12/31/reconstructing-words-from-fragments.aspx Thu, 01 Jan 2015 05:10:27 GMT http://algorithmicalley.com/archive/2014/12/31/reconstructing-words-from-fragments.aspx#feedback 1 http://algorithmicalley.com/comments/commentRss/18.aspx De Bruijn Graphs http://algorithmicalley.com/archive/2014/08/10/de-bruijn-graphs.aspx <p>Remember doing word ladder puzzles? You have to change one word into another by changing a single letter at a time, always creating a new word. As an example from Wikipedia has it:</p> <ul> <li>COLD </li> <li>CO<b>R</b>D </li> <li>C<b>A</b>RD </li> <li><b>W</b>ARD </li> <li>WAR<b>M</b> </li> </ul> <p>A De Bruijn graph is sort of like that. But instead of changing a single letter at a time, you drop the first letter from the word, and add a new letter at the end:</p> <ul> <li>DRAC </li> <li>RAC<strong>H</strong> </li> <li>ACH<strong>I</strong> </li> <li>CHI<strong>N</strong> </li> <li>HIN<strong>K</strong> </li> </ul> <a title="abscond by Bob May, on Flickr" href="https://www.flickr.com/photos/alternative_illustrations/6847773344"><img alt="abscond" src="https://farm8.staticflickr.com/7176/6847773344_cbf0a670cd.jpg" width="500" align="right" height="245" /></a> <p>Okay, I cheated a little here. 
These aren’t actually <em>words</em>, just parts of words. But that’s really what makes De Bruijn graphs interesting; it isn’t so much that you can turn one word into another, it’s that you can attach words together based on their parts. Say we take a list of words and split them into their four-letter sequences (which we can call a 4-mer, mer being <strike>Latin</strike>Greek for “parts”, or a k-mer if we don’t care about the exact length). If one of the words is “abscond” then we split it into the following 4-mers:</p> <ul> <li>absc </li> <li>bsco </li> <li>scon </li> <li>cond </li> </ul> <p>Here’s a Python function to do that:</p> <p><font face="Courier New">def kmers_in_word(w): <br />    return [w[i:i + 4] for i in range(len(w) - 3)] <br /></font></p> <p>(Note the <font face="Courier New">len(w) - 3</font>: a word of length n has n - 3 four-letter substrings, so subtracting 4 would silently drop the final kmer, “cond”.)</p> <p>I got a list of English words from <a href="http://www.mieliestronk.com/wordlist.html">Mieliestronk</a> and pulled all those with at least four letters into a Python list.</p> <p><font face="Courier New">with open("corncob_lowercase.txt") as f: <br />    s = (line.strip() for line in f) <br />    word_list = [w for w in s if len(w) &gt; 3]</font></p> <p>Then I created a dictionary to list all of the kmers and the words they were in:</p> <p><font face="Courier New">kmers = defaultdict(list) <br />for word in word_list: <br />    for k in kmers_in_word(word): <br />        kmers[k].append(word)</font></p> <p>Now, we can represent the De Bruijn graph as a dictionary, with the keys as kmers and the values as the set of kmers that are connected to them by words in the dictionary.</p> <p><font face="Courier New">debruijn = defaultdict(set) <br />for kmer, words in kmers.iteritems(): <br />    for word in words: <br />        for k in kmers_in_word(word): <br />            if kmer[1:4] == k[0:3]: <br />                debruijn[kmer].add(k)</font></p> <p> </p> <a title="debriujn by Ben Fulton, on Flickr" href="https://www.flickr.com/photos/benfulton/14695255748"><img alt="debriujn"
src="https://farm4.staticflickr.com/3882/14695255748_568e84241e.jpg" width="500" align="right" height="271" /></a>Here’s a partial picture of the resulting graph. The top is the kmer “rati”, which occurs in the words “ingratiate”, “gratification”, and “aristocratic”, among others. The connecting kmers, then, are “atia”, “atif”, and “atic”. Now “atia” is in the words “palatial”, “insatiable”, and “ingratiate”, and those three words in turn lead us to the kmers “tial”, “tiab”, and “tiat”. You can see that the graph will get big, fast! <p>I created this graph using <a href="https://gephi.org/">Gephi</a>. Code, and a way to export the graph into Gephi, are available on <a href="https://github.com/benfulton/Algorithmic-Alley/blob/master/AlgorithmicAlley/debruijn.py">Github</a>.</p><img src="http://algorithmicalley.com/aggbug/17.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2014/08/10/de-bruijn-graphs.aspx Mon, 11 Aug 2014 04:07:06 GMT http://algorithmicalley.com/archive/2014/08/10/de-bruijn-graphs.aspx#feedback http://algorithmicalley.com/comments/commentRss/17.aspx Graph Diameter http://algorithmicalley.com/archive/2014/08/04/graph-diameter.aspx <p> </p> <p>Look at this graph. How wide is it?</p> <a title="By Mhombach (Own work) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3AFloyd_Warshall.JPG"><img alt="Floyd Warshall" src="//upload.wikimedia.org/wikipedia/commons/8/8b/Floyd_Warshall.JPG" width="625" height="449" /></a> <p>It looks like S and V, or perhaps Y, are the nodes that are the farthest apart. If the numbers on the edges are accurate representations, the distance from S to Y and S to V should be the same, but if you add them up, you see that they aren’t. S to U to V sums up to 11, while S to X to Y is 7.</p> <p>So our intuitive answer to the question of how wide the graph is is not quite right. 
We have to find the two nodes that are farthest apart. But if the graph has a loop, or cycle, you could just go around the loop forever and have two nodes that are infinitely far apart!</p> <p>Let’s say instead that the distance between two nodes is always the shortest possible distance. Then the width, or <a href="http://mathworld.wolfram.com/GraphDiameter.html">diameter</a>, of the graph is the longest, <em>shortest</em> distance between any two nodes.</p> <p>We’ll represent our graph as a Python dictionary. The keys will be the names of the nodes, the values will be lists of 2-tuples, where the first value is the connecting node and the second value is the distance between the nodes. The graph in the picture will look like this:</p> <p><font face="Courier New">graph = dict() <br />graph['S'] = [('U', 10), ('X', 5)] <br />graph['U'] = [('X', -2), ('V', 1)] <br />graph['V'] = [('Y', -5)] <br />graph['X'] = [('U', 3), ('Y', 2)] <br />graph['Y'] = [('V', 6), ('S', 7)]</font></p> <p>The method for calculating this has been known for more than fifty years and is known as the <a href="https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm">Floyd-Warshall Algorithm</a>. Here’s an example of implementing it in Python.</p> <a title="Octeract Petrie polygon ternary Boolean connectives Hasse diagram graph by Cuito Cuanavale, on Flickr" href="https://www.flickr.com/photos/hexadecimal_time/3238546967"><img alt="Octeract Petrie polygon ternary Boolean connectives Hasse diagram graph" src="https://farm4.staticflickr.com/3480/3238546967_104858abee_n.jpg" width="320" align="right" height="320" /></a> <p>First, let’s create a dictionary to track the distances between any two nodes. Since we don’t know the distances between the nodes yet, we’ll set all the values to one million.</p> <p><font face="Courier New">dist = defaultdict(lambda: defaultdict(lambda: 1000000))</font></p> <p>Now let’s fill out the values we do know. 
We know that the distance from a node to itself is zero.</p> <p><font face="Courier New">for v in graph.keys(): <br />    dist[v][v] = 0</font></p> <p>We know the distances between the nodes that have direct paths. It could be that the direct path isn’t the shortest path, but it’s a place to start:</p> <p><font face="Courier New">for u, values in graph.iteritems(): <br />    for v in values: <br />        dist[u][v[0]] = v[1]</font></p> <p>Note that it’s possible to define a graph where the distance from a node to itself is something other than zero. If it is, that will be picked up in the previous lines of code.</p> <p>Now, we’ll go through and try to improve the distance between every pair of nodes. Say we look at nodes i and j. We’ll look at every node k, and figure out whether the distance from i to k to j is less than the distance from i to j that we’re already aware of.</p> <p><font face="Courier New">for k in graph.keys(): <br />    for i in graph.keys(): <br />        for j in graph.keys(): <br />            if dist[i][j] &gt; dist[i][k] + dist[k][j]: <br />                dist[i][j] = dist[i][k] + dist[k][j] <br /></font></p> <p>After this triple loop, our dist dictionary will hold the shortest distances between any two nodes.</p> <p>So what is the diameter of our graph?</p> <p><font face="Courier New">from itertools import product <br />print max(product(dist.keys(), dist.keys()), key=lambda x: dist[x[0]][x[1]])</font></p> <p>('Y', 'U') <br /></p> <p>The shortest distance from Y to U is to go from Y to S to X to U. The edge lengths are 7, 5, and 3 respectively. 
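Putting all of the pieces above into one self-contained script (a sketch only; it restates the same graph and the same steps, in Python 3 syntax):

```python
from collections import defaultdict
from itertools import product

graph = {'S': [('U', 10), ('X', 5)],
         'U': [('X', -2), ('V', 1)],
         'V': [('Y', -5)],
         'X': [('U', 3), ('Y', 2)],
         'Y': [('V', 6), ('S', 7)]}

# Unknown distances start at a large sentinel value
dist = defaultdict(lambda: defaultdict(lambda: 1000000))
for v in graph:
    dist[v][v] = 0
for u, edges in graph.items():
    for v, weight in edges:
        dist[u][v] = weight

# Floyd-Warshall: relax every pair through every intermediate node
for k in graph:
    for i in graph:
        for j in graph:
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

# The diameter is the longest of the shortest distances
endpoints = max(product(graph, graph), key=lambda p: dist[p[0]][p[1]])
print(endpoints, dist[endpoints[0]][endpoints[1]])
```

Running it reports the pair ('Y', 'U'), matching the result above.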
So the diameter of our graph is 15.</p><img src="http://algorithmicalley.com/aggbug/16.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2014/08/04/graph-diameter.aspx Tue, 05 Aug 2014 02:31:02 GMT http://algorithmicalley.com/archive/2014/08/04/graph-diameter.aspx#feedback http://algorithmicalley.com/comments/commentRss/16.aspx Suffix Arrays http://algorithmicalley.com/archive/2013/06/30/suffix-arrays.aspx <style type="text/css"><![CDATA[ .ln { color: rgb(0,0,0); font-weight: normal; font-style: normal; } .s0 { color: rgb(0,0,128); font-weight: bold; } .s1 { } .s2 { color: rgb(0,0,255); } .s3 { color: rgb(0,128,0); font-weight: bold; }]]></style> <p>A suffix array of a string is a list of suffixes of that string. For example, if the string is “ABCD”, the suffixes are “ABCD”, “BCD”, “CD”, and “D”. Not very interesting on its own, but what is useful is that you can track them by simply keeping track of the index of their starting character – 0,1,2, and 3. And that means you can fiddle with their order. You can sort them. In “ABCD” they are already sorted, but consider a string like “mississippi”. In sorting those suffixes, the final “i” is sorted to the front, while the first “s” is at the back. The order is</p> <p><font face="Courier New">10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2</font></p> <p>(For some idea of the usefulness of suffix arrays, see my <a href="http://blog.benfulton.net/2013/06/suffix-arrays.html">suffix arrays</a> blog post on my personal blog.)</p> <p>Here’s an implementation of a suffix array function in Python:</p> <pre><span class="s0"> def </span><span class="s1">get_suffix_array(str): </span><br />     <span class="s0">return </span><span class="s1">sorted(range(len(str)), key=</span><span class="s0">lambda </span><span class="s1">i: str[i:]) </span></pre> <p>Simple enough, for short strings. But there are going to be problems as the string length increases. 
That key function is probably making a copy of the string slice to pass to the <font face="Courier New">sorted</font> function, and that is going to take a huge amount of space for a long string. I experimented a little by generating a random 100,000-character string and calling this method on it; it used more than three gigabytes of memory and ran for almost four minutes before returning the suffix array. (Just for fun, I loaded up the million characters of <em>Moby Dick </em>and ran it on that. The whole operating system froze up. Eventually I got bored and hard-rebooted the machine.)</p> <p>So for long strings, space will get to be a consideration. Python also supports a <em>cmp </em>argument for sorted, which takes two values and returns an integer indicating which value should be sorted first, but I understand this is being deprecated. This is unfortunate, as it would make this problem much more tractable; but we can still do better, both in terms of total memory usage and speed. Python’s generic <em>sorted</em> function is built to handle any kind of data and to run in O(n log n) time, but since we’re only handling one kind of data we can improve on that.</p> <p>Here’s an idea that was suggested by Udi Manber and Gene Myers back in the 1990s. It’s a variant on <a href="https://en.wikipedia.org/wiki/Bucket_sort">bucket sort</a>, a sorting algorithm that operates by first looking only at the first character of each string, and putting those that share the same initial character into a bucket. Then it looks within each bucket at the second character of each string, and so on. This is a handy paradigm for us, since we don’t even have different strings to work on; we only have indices into our one string.</p> <p>Here’s the idea: we’ll put each suffix into a bucket using its first character, so all the suffixes that start with ‘A’ will be in one bucket, and all the suffixes that start with ‘B’ will be in another, and so on. 
Obviously all the suffixes in the ‘A’ bucket will have a sort order before those in the ‘B’ bucket or ‘C’ bucket. It looks something like this:</p> <pre><span class="s0">def </span><span class="s1">sort_bucket(str, bucket):
    d = defaultdict(list)
    </span><span class="s0">for </span><span class="s1">i </span><span class="s0">in </span><span class="s1">bucket:
        key = str[i:i+1]
        d[key].append(i)</span>

<span class="s0">def </span><span class="s1">suffix_array_ManberMyers(str):
    </span><span class="s0">return </span><span class="s1">sort_bucket(str, (i </span><span class="s0">for </span><span class="s1">i </span><span class="s0">in </span><span class="s1">range(len(str))))</span></pre> <p>The sort_bucket function takes our string and a list of indices to examine. Our initial call to it will just pass in every index in the string – all of the suffixes as one big bucket. </p> <p>The dictionary d, then, maps initial characters to suffix indices. Now what we need to do is call sort_bucket recursively to go through each bucket and sort the suffixes by second position, third position, etc. To make the whole thing a little more efficient, rather than index by one additional character each time, we’ll double the number of characters. The first recursive call will sort on strings of length two, the second call will sort on four, etc. We need to add an order parameter to indicate the number of characters we want to sort on. 
Also, add code to return the sorted array and this is our final function:</p> <pre><span class="s0">def </span><span class="s1">sort_bucket(str, bucket, order):
    d = defaultdict(list)
    </span><span class="s0">for </span><span class="s1">i </span><span class="s0">in </span><span class="s1">bucket:
        key = str[i:i+order]
        d[key].append(i)
    result = []
    </span><span class="s0">for </span><span class="s1">k, v </span><span class="s0">in </span><span class="s1">sorted(d.iteritems()):
        </span><span class="s0">if </span><span class="s1">len(v) &gt; </span><span class="s2">1</span><span class="s1">:
            result += sort_bucket(str, v, order*</span><span class="s2">2</span><span class="s1">)
        </span><span class="s0">else</span><span class="s1">:
            result.append(v[</span><span class="s2">0</span><span class="s1">])
    </span><span class="s0">return </span><span class="s1">result</span>

<span class="s0">def </span><span class="s1">suffix_array_ManberMyers(str):
    </span><span class="s0">return </span><span class="s1">sort_bucket(str, (i </span><span class="s0">for </span><span class="s1">i </span><span class="s0">in </span><span class="s1">range(len(str))), </span><span class="s2">1</span><span class="s1">)</span></pre> <p>Notice that sort_bucket now returns the list of sorted suffixes. If a bucket contains a single element, that is appended to the result; otherwise the method is called recursively and the sorted bucket is added to the result.</p> <p>So how does it work? It turns out, quite nicely. If you want to do suffix arrays for “mississippi”, well, you’re better off using the initial implementation – it’s simpler and runs fairly quickly. But for longer strings…on my computer the tipping point appeared to be around 5000 characters. For a 10,000 character string, Manber-Myers was about three times as fast. And at 100,000 characters – the run that took more than four minutes – Manber-Myers still takes less than a second. And for <em>Moby Dick</em>? 
Manber-Myers was still under six seconds to run, and used about 50 megabytes worth of memory.</p> <p>So it’s a pretty successful little algorithm.  You can download this <a href="https://github.com/benfulton/Algorithmic-Alley/tree/master/AlgorithmicAlley/SuffixArrays">Suffix Array</a> code on my GitHub page.</p> <p>Now, 50M on a modern computer isn’t much. But that still represents some forty times the size of the book. If we want to scale up again, from the size of the book (1,200,000 bytes) to the size of a human genome (3,000,000,000 bytes), are we going to need a machine with 150 gigabytes of memory?</p> <p>I sure hope not. I’ll try out some more ideas another time.</p><img src="http://algorithmicalley.com/aggbug/15.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2013/06/30/suffix-arrays.aspx Mon, 01 Jul 2013 00:59:46 GMT http://algorithmicalley.com/archive/2013/06/30/suffix-arrays.aspx#feedback 2 http://algorithmicalley.com/comments/commentRss/15.aspx The Expectation Maximization Algorithm http://algorithmicalley.com/archive/2013/03/29/the-expectation-maximization-algorithm.aspx <p>Conceptually, the expectation maximization algorithm is simple: you take a guess at the values you want to find, then run your calculations as though they were correct. Then, use those results to improve your guess. </p> <p><a href="http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html">What is the expectation maximization algorithm?</a></p> <p>That article does a good job of explaining the algorithm in more detail, but I couldn’t make sense of the example. So I wrote some code to help me figure it out.</p> <p>The example is like this: You have two coins. They’re not exactly balanced, and you want to know how heavily weighted they are. The only information you have, though, is five sequences of ten flips each. And, you don’t even know which coin is responsible for which sequence!</p> <p>Okay, let’s write some code. 
I’ll be working in Python for this example.</p> <p><font face="Courier New">rolls = [ "HTTTHHTHTH", <br />          "HHHHTHHHHH", <br />          "HTHHHHHTHH", <br />          "HTHTTTHHTT", <br />          "THHHTHHHTH" ]</font></p> <p>Our coins are A and B. Let’s make our initial guess that coin A comes up heads 60% of the time, and that coin B is fair. The tradition is that the unknown variables are called theta, so:</p> <p><font face="Courier New">theta = { "A":.6, "B":.5 }</font></p> <p>We need a method to calculate the likelihood of a sequence given a weight. That’s straightforward binomial probability:</p> <p><font face="Courier New">def likelihoodOfRoll( roll, prob ): <br />    numHeads = roll.count( "H" ) <br />    n = len(roll) <br />    return pow( prob, numHeads ) * pow( 1-prob, n - numHeads )</font></p> <p>Now what? One easy way to a solution would be to figure out which sequences were most likely to come from which coins, then credit all the heads and tails from those sequences to those coins. Using that method, rolls 1, 2, and 4 are assigned to coin A, and rolls 0 and 3 are assigned to B. 24 of the 30 flips assigned to coin A are heads, as are 9 of the 20 flips for coin B. So with that strategy, our new value for theta becomes { "A":.8, "B":.45 }.</p> <p>But we can do better. Rather than assign an entire sequence to a single coin, we’ll use the calculated probabilities to weight the sequence. So – and this is a tricky bit – we have to compare the calculated probabilities <em>to each other </em>rather than looking at them as absolutes. So if one coin has a probability of 20% and the other has a probability of 30%, we’ll assign them weights of .4 and .6, respectively. 
Here’s how:</p> <p><font face="Courier New">for roll in rolls: <br />    s = sum( likelihoodOfRoll( roll, k ) for k in theta.values() ) <br />    for hidden in theta.keys(): <br />        a = likelihoodOfRoll( roll, theta[hidden] ) <br />        aChance = a / s </font></p> <p>I didn’t figure out that part until I came across a StackExchange question about it: <a href="http://math.stackexchange.com/questions/25111/how-does-expectation-maximization-work">How does expectation maximization work?</a> </p> <p>The variable aChance holds the weighted probability of the sequence coming from the coin. Next, we’ll assign that percentage of the heads in the sequence to that coin. We’ll keep a couple of dictionaries to hold the information – initialize them as <font face="Courier New">defaultdict(float)</font> before the loop, and put these two lines inside the inner loop above:</p> <p><font face="Courier New">        headSum[hidden] += roll.count( "H" ) * aChance <br />        tailSum[hidden] += roll.count( "T" ) * aChance </font></p> <p>This is all we need to calculate new values for theta.</p> <p><font face="Courier New">for hidden in theta.keys(): <br />    theta[hidden] = headSum[hidden] / (headSum[hidden] + tailSum[hidden]) </font></p> <p>Iterate the whole thing a few times, or a whole lot of times, and eventually it will converge to values. Not necessarily the <em>best</em> values – it may end up in a local maximum.  But some values.</p> <p>That’s all there is to it. The code is a lot simpler to understand than the math, I think. 
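Assembled into a single runnable script, the whole thing looks like this (a Python 3 sketch of the steps above; the choice of 20 iterations and the snake_case names are mine):

```python
from collections import defaultdict

rolls = ["HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]
theta = {"A": 0.6, "B": 0.5}

def likelihood_of_roll(roll, prob):
    # probability of this exact head/tail sequence for a coin with P(H) = prob
    num_heads = roll.count("H")
    return prob ** num_heads * (1 - prob) ** (len(roll) - num_heads)

for _ in range(20):
    head_sum = defaultdict(float)
    tail_sum = defaultdict(float)
    for roll in rolls:
        s = sum(likelihood_of_roll(roll, p) for p in theta.values())
        for hidden in theta:
            # E-step: this roll's weight for this coin, relative to the other coin
            chance = likelihood_of_roll(roll, theta[hidden]) / s
            head_sum[hidden] += roll.count("H") * chance
            tail_sum[hidden] += roll.count("T") * chance
    # M-step: new bias estimate = weighted heads / weighted flips
    for hidden in theta:
        theta[hidden] = head_sum[hidden] / (head_sum[hidden] + tail_sum[hidden])

print(theta)
```

After a handful of iterations the estimates settle near 0.8 for coin A and just above 0.5 for coin B.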
You can download the code from <a href="https://github.com/benfulton/Algorithmic-Alley">GitHub</a>.</p><img src="http://algorithmicalley.com/aggbug/14.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2013/03/29/the-expectation-maximization-algorithm.aspx Fri, 29 Mar 2013 05:35:49 GMT http://algorithmicalley.com/archive/2013/03/29/the-expectation-maximization-algorithm.aspx#feedback http://algorithmicalley.com/comments/commentRss/14.aspx Reduce in parallel http://algorithmicalley.com/archive/2010/11/01/reduce-in-parallel.aspx <p>In <a href="http://algorithmicalley.com/archive/2010/10/21/extracting-words-from-a-string.aspx">Part I</a> I put up a couple of simple algorithms to extract words from a string.  In <a href="http://algorithmicalley.com/archive/2010/10/29/state-carrying-character-classes.aspx">Part II</a> I showed some classes that remove the requirement that the characters be processed in order.  </p> <p>The final step is to <em>reduce</em> – to decide what order, in fact, we’re going to process those characters in.</p> <p>The obvious solution, I would have thought, would have been the old divide-and-conquer algorithm:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> IWordState Reduce(<span style="color: #2b91af">IEnumerable</span>&lt;IWordState&gt; set, <span style="color: blue">int</span> len)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">if</span> (len == 0)</p> <p style="margin: 0px">                <span style="color: blue">return</span> <span style="color: blue">new</span> Chunk(<span style="color: #a31515">""</span>);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">if</span> (len == 1)</p> <p style="margin: 0px">                <span style="color: blue">return</span> set.First();</p> <p style="margin:
0px">            <span style="color: blue">else</span></p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">int</span> pivot = len / 2;</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">var</span> firstHalf = Reduce(set.Take(pivot).ToList(), pivot);</p> <p style="margin: 0px">                <span style="color: blue">var</span> secondHalf = Reduce(set.Skip(pivot).ToList(), pivot + len % 2);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                IWordState result = firstHalf.Combine(secondHalf);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">return</span> result;</p> <p style="margin: 0px">            }</p> <p style="margin: 0px">        }</p> <p style="margin: 0px"> </p> </div> <p>This isn't <em>too</em> bad.  On my machine it takes about five seconds to run this on my standard data set, the text of <em>Moby Dick</em>.  In my implementation the intent is very clear, too – but I suspect I’m costing myself quite a bit in the usage of LINQ.  Notice, for example, that firstHalf and secondHalf have ToList() calls at the end.  
Without these, the function takes much longer, but I’m sure there’s a balance to strike in there somewhere.</p> <p>Here’s another idea:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">private</span> <span style="color: blue">const</span> <span style="color: blue">int</span> INT_PARALLEL_CHUNKS = 100;</p> <p style="margin: 0px"> </p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">override</span> <span style="color: #2b91af">IEnumerable</span>&lt;<span style="color: blue">string</span>&gt; Split(<span style="color: blue">string</span> s)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">var</span> subsets = Reduce(Map(s)).ToList();</p> <p style="margin: 0px">            <span style="color: blue">while</span> (subsets.Count &gt; INT_PARALLEL_CHUNKS)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                subsets = Reduce(subsets).ToList();</p> <p style="margin: 0px">            }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">if</span> (subsets.Any())</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">var</span> final = subsets.Aggregate((runningProduct, nextFactor) =&gt; runningProduct.Combine(nextFactor));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">return</span> final.Finalize();</p> <p style="margin: 0px">            }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt;();</p> <p style="margin: 0px">        }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">        <span style="color: 
blue">public</span> <span style="color: #2b91af">IEnumerable</span>&lt;IWordState&gt; Reduce(<span style="color: #2b91af">IEnumerable</span>&lt;IWordState&gt; set)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">var</span> groups = set.Select((ws, index) =&gt; <span style="color: blue">new</span> { ws, index })</p> <p style="margin: 0px">                .GroupBy(g =&gt; g.index / INT_PARALLEL_CHUNKS, i =&gt; i.ws);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> groups.AsParallel().AsOrdered()</p> <p style="margin: 0px">                .Select(batch =&gt; batch.Aggregate((runningProduct, nextFactor) =&gt; runningProduct.Combine(nextFactor)));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">        }</p> <p style="margin: 0px">    }</p> </div> <p>In this implementation, we split the list into groups of 100, run an aggregate method on each group, and repeat until our list has fewer than 100 elements.  Then we aggregate the remaining WordStates together.</p> <p>This is a little better, perhaps in the neighborhood of 1500 milliseconds.  It still makes a lot of ToList() calls.  And the fact is, on my machine, these are both <em>way</em> slower than a simple serial processing algorithm, which takes around 300 milliseconds.  Still, the addition of state on each character should make this algorithm scale a lot better.  It wouldn’t be hard to farm out the processing to multiple machines, for example.</p> <p>I hope you’ve picked up some tips on combining items in a list, without requiring a single state object to hold the aggregation.  There are a lot of applications of this technique, and there will be even more once .Net 4 comes into common usage.  
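For comparison, the same chunk-and-combine pattern is easy to sketch outside of C#. Here is a hypothetical Python 3 version, with a thread pool standing in for PLINQ and a dictionary merge standing in for IWordState.Combine; only the shape of the algorithm carries over, not the word-splitting state itself:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

CHUNK = 100  # plays the role of INT_PARALLEL_CHUNKS

def combine(a, b):
    # any associative operation works; here, merging partial word counts
    merged = dict(a)
    for k, v in b.items():
        merged[k] = merged.get(k, 0) + v
    return merged

def reduce_in_batches(items):
    # split into groups of CHUNK and aggregate each group concurrently
    batches = [items[i:i + CHUNK] for i in range(0, len(items), CHUNK)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: reduce(combine, b), batches))

def parallel_reduce(items):
    # repeat until the list of partial results is small, then finish serially
    while len(items) > CHUNK:
        items = reduce_in_batches(items)
    return reduce(combine, items)

# toy data: one singleton count-dict per word occurrence
words = ["the", "whale", "the", "sea"] * 500
counts = parallel_reduce([{w: 1} for w in words])
print(counts["the"])
```

Because combine is associative, the batches can be aggregated in any grouping and the final merge still gives the same totals.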
Let me know if you find applications in your own code for it!</p><img src="http://algorithmicalley.com/aggbug/13.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2010/11/01/reduce-in-parallel.aspx Tue, 02 Nov 2010 00:37:03 GMT http://algorithmicalley.com/archive/2010/11/01/reduce-in-parallel.aspx#feedback http://algorithmicalley.com/comments/commentRss/13.aspx Extracting words from a string http://algorithmicalley.com/archive/2010/10/21/extracting-words-from-a-string.aspx <p>I still plan on writing more on scheduling algorithms, but I was at <a href="http://strangeloop2010.com/">Strange Loop</a> last week and listened to <a href="http://en.wikipedia.org/wiki/Guy_Steele">Guy Steele</a> talking about parallel algorithms.  He’s working in a language he called Fortress, but certainly the algorithms could be implemented in C# as well.</p> <p>The problem Steele was working on to demonstrate a parallel algorithm is splitting a string into words.  For simplicity, we’ll assume that words are only separated by spaces.  
Here are some sample tests so you can see the pattern.</p> <p> </p> <p>        <span style="color: blue">void</span> Check(Splitter splitter)</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        {</p> <p style="margin: 0px">            Assert.Equal(<span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt; { <span style="color: #a31515">"This"</span>, <span style="color: #a31515">"is"</span>, <span style="color: #a31515">"a"</span>, <span style="color: #a31515">"sample"</span> }, splitter.Split(<span style="color: #a31515">"This is a sample"</span>).ToList());</p> <p style="margin: 0px">            Assert.Equal(<span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt; { <span style="color: #a31515">"Here"</span>, <span style="color: #a31515">"is"</span>, <span style="color: #a31515">"another"</span>, <span style="color: #a31515">"sample"</span> }, splitter.Split(<span style="color: #a31515">" Here is another sample "</span>).ToList());</p> <p style="margin: 0px">            Assert.Equal(<span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt; { <span style="color: #a31515">"JustOneWord"</span> }, splitter.Split(<span style="color: #a31515">"JustOneWord"</span>).ToList());</p> <p style="margin: 0px">            Assert.Empty(splitter.Split(<span style="color: #a31515">" "</span>));</p> <p style="margin: 0px">            Assert.Empty(splitter.Split(<span style="color: #a31515">""</span>));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            Assert.Equal(<span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt; { <span style="color: #a31515">"Here"</span>, <span style="color: #a31515">"is"</span>, <span style="color: 
#a31515">"a"</span>, <span style="color: #a31515">"sesquipedalian"</span>, <span style="color: #a31515">"string"</span>, <span style="color: #a31515">"of"</span>, <span style="color: #a31515">"words"</span> }, splitter.Split(<span style="color: #a31515">"Here is a sesquipedalian string of words"</span>).ToList());</p> <p style="margin: 0px">        }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">In a typical implementation, it’s not too complicated to do this in a sequential way.  Start with an empty list.  Iterate over each character, building a word along the way, and when you hit a space, add the word to the list and start over.  Maybe something like this:</p> <p style="margin: 0px"> </p> <p style="margin: 0px"> </p> </div> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">    <span style="color: blue">public</span> <span style="color: blue">abstract</span> <span style="color: blue">class</span> <span style="color: #2b91af">Splitter</span></p> <p style="margin: 0px">    {</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">abstract</span> <span style="color: #2b91af">IEnumerable</span>&lt;<span style="color: blue">string</span>&gt; Split(<span style="color: blue">string</span> stringToSplit);</p> <p style="margin: 0px">    }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">    <span style="color: blue">public</span> <span style="color: blue">class</span> <span style="color: #2b91af">NormalSplitter</span> : <span style="color: #2b91af">Splitter</span></p> <p style="margin: 0px">    {</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">override</span> <span style="color: #2b91af">IEnumerable</span>&lt;<span style="color: blue">string</span>&gt; Split(<span style="color: blue">string</span> ss)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: 
blue">var</span> result = <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt;();</p> <p style="margin: 0px">            <span style="color: blue">string</span> word = <span style="color: #a31515">""</span>;</p> <p style="margin: 0px">            <span style="color: blue">for</span> (<span style="color: blue">int</span> i = 0; i &lt; ss.Length; ++i)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">if</span> (ss[i] == <span style="color: #a31515">' '</span>)</p> <p style="margin: 0px">                {</p> <p style="margin: 0px">                    <span style="color: blue">if</span> (!<span style="color: blue">string</span>.IsNullOrEmpty(word))</p> <p style="margin: 0px">                    {</p> <p style="margin: 0px">                        result.Add(word);</p> <p style="margin: 0px">                        word = <span style="color: #a31515">""</span>;</p> <p style="margin: 0px">                    }</p> <p style="margin: 0px">                }</p> <p style="margin: 0px">                <span style="color: blue">else</span></p> <p style="margin: 0px">                    word += ss[i];</p> <p style="margin: 0px">            }</p> <p style="margin: 0px">            <span style="color: blue">if</span> (!<span style="color: blue">string</span>.IsNullOrEmpty(word))</p> <p style="margin: 0px">                result.Add(word);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> result;</p> <p style="margin: 0px">        }</p> <p style="margin: 0px">    }</p> </div> <p>Of course, I’m not big on For loops.  Linq’s Aggregate function is more or less the same as the Reduce side of the MapReduce algorithm, or some of the Folding methods you see in other languages.  
How would we do this using Aggregate?</p> <p>Instead of the For loop, we give the Aggregate algorithm a seed – the result it should give back for an empty string – and a lambda method that tells it how to combine a list of words with an additional character, like this:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">            <span style="color: blue">string</span> word = <span style="color: #a31515">""</span>;</p> <p style="margin: 0px">            <span style="color: blue">var</span> words = ss.AsEnumerable().Aggregate(<span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt;(), (<span style="color: #2b91af">List</span>&lt;<span style="color: blue">string</span>&gt; result, <span style="color: blue">char</span> c) =&gt;</p> </div> <p>The lambda method is more or less the same as the inside of the normal splitter, above:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">                <span style="color: blue">if</span> (c == <span style="color: #a31515">' '</span>)</p> <p style="margin: 0px">                {</p> <p style="margin: 0px">                    <span style="color: blue">if</span> (!<span style="color: blue">string</span>.IsNullOrEmpty(word))</p> <p style="margin: 0px">                    {</p> <p style="margin: 0px">                        result.Add(word);</p> <p style="margin: 0px">                        word = <span style="color: #a31515">""</span>;</p> <p style="margin: 0px">                    }</p> <p style="margin: 0px">                }</p> <p style="margin: 0px">                <span style="color: blue">else</span></p> <p style="margin: 0px">                    word += c;</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">return</span> result;</p> </div> <p>How do they compare?  
They’re pretty close to each other.  I handed each one a string consisting of the text of <em>Moby Dick</em> and each one was able to process it in about 300 milliseconds.  Sometimes one is faster, sometimes the other.</p> <p>Now, how would you break this problem down into chunks that multiple processors can work on?  It’s tricky.  We have a list of strings that we’re holding on to, we have a word that we’re holding on to, and as long as we have those two things, we’re stuck iterating over the characters one by one so we can update our stuff in the correct order.  If we try to process two characters simultaneously, the ordering of our words, or the words themselves, can be jumbled.</p> <p>What do we do?  I’ll try to answer that question next time.</p><img src="http://algorithmicalley.com/aggbug/10.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2010/10/21/extracting-words-from-a-string.aspx Fri, 22 Oct 2010 00:33:40 GMT http://algorithmicalley.com/archive/2010/10/21/extracting-words-from-a-string.aspx#feedback 22 http://algorithmicalley.com/comments/commentRss/10.aspx Schedule constraints http://algorithmicalley.com/archive/2010/09/09/schedule-constraints.aspx <p>In Part 1 we set up a very simple scheduling model.</p> <p>In a more realistic schedule, of course, some jobs have to wait until other jobs are finished.  The guy putting the doors on the car can’t get started on his job until the person putting in the engine is finished.   
Let’s set up a class modelling a constraint that says a job can’t start until another job is finished.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">    <span style="color: blue">public</span> <span style="color: blue">class</span> <span style="color: #2b91af">UniformConstraint</span></p> <p style="margin: 0px">    {</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">readonly</span> Job BlockingJob;</p> </div> <p>Sometimes also there’s a lag between the first job finishing and the second job starting.  That can be expressed as an amount of time, but it can also be expressed in iterations – say job 2 can’t start until job 1 has run twice.  We’ll keep track of both of those.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">readonly</span> <span style="color: blue">int</span> Latency;</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">readonly</span> <span style="color: blue">int</span> Iterations;</p> </div> <p>We’ll add a list of constraints to our Job class.  
Then we’ll need to determine, for a given time and iteration, whether a given constraint is blocking a job from running:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">bool</span> Blocker(<span style="color: blue">int</span> iteration, <span style="color: blue">int</span> time)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">int</span> triggerIteration = iteration + Iterations;</p> <p style="margin: 0px">            <span style="color: blue">if</span> (triggerIteration &gt;= BlockingJob.CompletedIterations())</p> <p style="margin: 0px">                <span style="color: blue">return</span> <span style="color: blue">true</span>;</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> BlockingJob.StartTime(triggerIteration) + BlockingJob.processingTime + Latency &gt; time;</p> <p style="margin: 0px">        }</p> <p style="margin: 0px"> </p> </div> <p>(There’s a constructor too, but that pretty much rounds out the Constraint class.)  Now over to our Job class.  For a job and a specific time, we’ll need to know if a job is blocked from running or not:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">bool</span> Blocked(<span style="color: blue">int</span> time)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">return</span> Blockers.Any(constraint =&gt; constraint.Blocker(startingTimes.Count, time));</p> <p style="margin: 0px">        }</p> </div> <p> </p> <p>Given all this, we’ll have to rewrite our processor.  We’ll just tick the clock, and at each step, we’ll start any jobs that are free to run.  
(There are certainly more efficient ways to do this, but that’s ok.)</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">void</span> Process(<span style="color: blue">int</span> numIterations)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">int</span> time = 0;</p> <p style="margin: 0px">            <span style="color: blue">while</span> (!IterationComplete(numIterations))</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">foreach</span> (<span style="color: blue">var</span> job <span style="color: blue">in</span> Unblocked(time).Where(job =&gt; job.IsIdle(time)))</p> <p style="margin: 0px">                {</p> <p style="margin: 0px">                    job.Start(time);</p> <p style="margin: 0px">                }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                time++;</p> <p style="margin: 0px">            }</p> <p style="margin: 0px">        }</p> </div> <p>That’s more or less it.  Here’s a simple test.  
It shows that if our second job is blocked by our first job, the second job is blocked at time 2, but it’s free to run again at time 7 when job 1 is complete.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        [Fact]</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">void</span> Constraint_blocks_job2_until_job1_completes()</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            _jobs[1].Constrain( <span style="color: blue">new</span> UniformConstraint( _jobs[0], 0, 0 ));</p> <p style="margin: 0px">            _jobs[0].Start(0);</p> <p style="margin: 0px">            Assert.True(_jobs[1].Blocked(2));</p> <p style="margin: 0px">            Assert.False(_jobs[1].Blocked(7));</p> <p style="margin: 0px">        }</p> </div> <p>Here’s a more complex example:  A part is manufactured successively in two shops.  The first shop can handle at most two parts at the same time, while the second can handle three parts.  The first shop has three machines with processing times of 1,2, and 2.  The second shop has two machines with processing times 2 and 3.  
Finally, a part takes a time of 6 to be sent from the first shop to the second.</p> <p>Here’s a test for that.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        [Fact]</p> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">void</span> Complex_example()</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            _jobs = <span style="color: blue">new</span> List&lt;Job&gt;</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">new</span> Job{ processingTime = 1 },</p> <p style="margin: 0px">                <span style="color: blue">new</span> Job{ processingTime = 2 },</p> <p style="margin: 0px">                <span style="color: blue">new</span> Job{ processingTime = 2 },</p> <p style="margin: 0px">                <span style="color: blue">new</span> Job{ processingTime = 2 },</p> <p style="margin: 0px">                <span style="color: blue">new</span> Job{ processingTime = 3 }</p> <p style="margin: 0px">            };</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            _jobs[2].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[0], 0, 2));</p> <p style="margin: 0px">            _jobs[4].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[3], 0, 3));</p> <p style="margin: 0px">            _jobs[1].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[0], 0, 0));</p> <p style="margin: 0px">            _jobs[2].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[1], 0, 0));</p> <p style="margin: 0px">            _jobs[3].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[2], 6, 0));</p> <p style="margin: 0px">            _jobs[4].Constrain(<span style="color: blue">new</span> UniformConstraint(_jobs[3], 0, 0));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">  
          var sched = <span style="color: blue">new</span> CyclicSchedule(_jobs);</p> <p style="margin: 0px">            sched.Process(3);</p> <p style="margin: 0px">            Assert.Equal(22m, sched.IterationCompletionTime(0));</p> <p style="margin: 0px">            Assert.Equal(24m, sched.IterationCompletionTime(1));</p> <p style="margin: 0px">            Assert.Equal(26m, sched.IterationCompletionTime(2));</p> <p style="margin: 0px">        }</p> </div> <p>Interesting that this line has a startup time of 20, but after that it pumps out parts every 2 ticks.  Pretty efficient!</p> <p>One other thing I’d like to note – it’s perfectly valid to have a <em>negative </em>value for a latency.  This would indicate that, for example, job 3 can start no later than 10 ticks <em>after</em> job 1.  This isn’t relevant to our processing model but leads to some interesting restrictions on the schedules.  We’ll look at that more next time.   This code is available on <a href="http://github.com/benfulton/Algorithmic-Alley">GitHub</a>.</p><img src="http://algorithmicalley.com/aggbug/9.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2010/09/09/schedule-constraints.aspx Fri, 10 Sep 2010 03:03:09 GMT http://algorithmicalley.com/archive/2010/09/09/schedule-constraints.aspx#feedback 1 http://algorithmicalley.com/comments/commentRss/9.aspx A Cyclic Scheduling Model http://algorithmicalley.com/archive/2010/08/22/a-cyclic-scheduling-model.aspx <p>Let’s consider a scheduling problem that you might run across in a factory.  Basically, you have a widget that has to be processed in several stations, in order.  So it goes from the first station, to the second station, and on out until it’s complete.  But of course, once the first station is done working on the first widget, it can get started on its second widget.  
Once the second station is done with the first widget, if the first station is done with the second widget, it can work on that one – and in the meantime the first station is free to start working on its <em>third</em> widget.</p> <p>We make the assumption that there are an infinite, or at least extremely large, number of widgets ready to come through the factory, so we can always think about when we can start the next widget, and not have to try to time it so the last widget is done efficiently.</p> <p>This isn’t just a factory floor problem, either – consider a CPU that can compute several instructions in parallel, with each instruction consisting of a 7-stage pipeline.  There’s certainly no shortage of instructions coming through the line, so schedulers have to try to determine how to minimize the average time through the pipeline, and certainly there are always more instructions to be processed.</p> <p>So how will we model this?  Let’s start by calling our stations, or stages, jobs, and for now we’ll just assume that the processing time is constant each time the job is run:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">class</span> <span style="color: #2b91af">Job</span></p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">public</span> <span style="color: blue">int</span> processingTime;</p> <p style="margin: 0px">        }</p> </div> <p>Our scheduler, fundamentally, is just a list of jobs.  
For each job, we’ll keep track of the starting time on each iteration.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">class</span> <span style="color: #2b91af">CyclicSchedule</span></p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: #2b91af">List</span>&lt;Job&gt; _jobs;</p> <p style="margin: 0px">            <span style="color: #2b91af">Dictionary</span>&lt;Job, <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;&gt; _startingTimes;</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">public</span> CyclicSchedule(<span style="color: #2b91af">List</span>&lt;Job&gt; jobs)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                _jobs = jobs;</p> <p style="margin: 0px">                _startingTimes = _jobs.ToDictionary(j =&gt; j, j =&gt; <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;());</p> <p style="margin: 0px">            }</p> <p style="margin: 0px">        }</p> </div> <p>We’ll use another simplifying assumption to get us started: we’ll assume that all the jobs can bang on our widget at the same time.  So for example, if we have three jobs that each take time 3 to process, we’ll see that each iteration completes in time 3.  
Simple, right?</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">           [<span style="color: #2b91af">Fact</span>]</p> <p style="margin: 0px">            <span style="color: blue">public</span> <span style="color: blue">void</span> Three_jobs_of_time_three_process_evenly()</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                _jobs = <span style="color: blue">new</span> List&lt;Job&gt; { </p> <p style="margin: 0px">                    <span style="color: blue">new</span> Job { processingTime = 3 },</p> <p style="margin: 0px">                    <span style="color: blue">new</span> Job { processingTime = 3 },</p> <p style="margin: 0px">                    <span style="color: blue">new</span> Job { processingTime = 3 }</p> <p style="margin: 0px">                };</p> <p style="margin: 0px">                <span style="color: blue">var</span> sched = <span style="color: blue">new</span> CyclicSchedule(_jobs);</p> <p style="margin: 0px">                sched.Process(3);</p> <p style="margin: 0px">                <span style="color: #2b91af">Assert</span>.Equal(3m, sched.IterationCompletionTime(0));</p> <p style="margin: 0px">                <span style="color: #2b91af">Assert</span>.Equal(3m, sched.IterationCompletionTime(1));</p> <p style="margin: 0px">            }</p> </div> <p> </p> <p>Again for now, each of our jobs simply runs flat out – when it finishes the first widget, it immediately starts on the next one – it’s 100% efficient.  
This makes the task of creating a simulator very simple: for each iteration, the start time for a job is just the previous start time, plus the time it takes to process:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">            <span style="color: blue">public</span> <span style="color: blue">void</span> Process(<span style="color: blue">int</span> numIterations)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: green">// each job starts at 0</span></p> <p style="margin: 0px">                _jobs.ForEach(job =&gt; _startingTimes[job].Add(0));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">for</span> (<span style="color: blue">int</span> i = 1; i &lt; numIterations; ++i)</p> <p style="margin: 0px">                {</p> <p style="margin: 0px">                    _jobs.ForEach(job =&gt; _startingTimes[job].Add(_startingTimes[job].Last() + job.processingTime));</p> <p style="margin: 0px">                }</p> <p style="margin: 0px">            }</p> </div> <p>Now we can measure some interesting things about our schedule.  For example, the IterationCompletionTime.  This does <em>not</em> refer to the time that the iteration completed, but the total time it took <em>to</em> complete.  As we see in the test above, both iteration 0 and iteration 1 took time of 3 to complete, as each of the three jobs takes time of 3.</p> <p>But note – if our jobs took different amounts of time – say 7, 9, and 2, the iteration completion time is not just the average time for those three jobs.  What we’re trying to simulate is the amount of time that our raw materials – say a hunk of steel – take to be processed.  The raw materials come in as soon as an iteration starts.</p> <p>So given this example, job 3 starts processing the second hunk of steel at time 2.  
But job 2 can’t start looking at the second hunk until time 9, and it won’t be done until time 18.  So the IterationCompletionTime for iteration 2 = 18 – 2 = 16.  And we can see that iteration 3 will start at 4 and end at 27, and pretty soon you’ve got big piles of half-processed steel sitting around the factory.  Formally, the limit of the IterationCompletionTime as k approaches infinity is infinity.  That’s known as an “unstable schedule”, not surprisingly.</p> <p>Here’s one way to calculate that:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">public</span> <span style="color: blue">decimal</span> IterationCompletionTime(<span style="color: blue">int</span> i)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">var</span> values = _startingTimes.Select(x =&gt; <span style="color: blue">new</span> { job = x.Key, starttime = x.Value[i], endtime = x.Value[i] + x.Key.processingTime });</p> <p style="margin: 0px">                <span style="color: blue">return</span> values.Max(val =&gt; val.endtime) - values.Min(val =&gt; val.starttime);</p> <p style="margin: 0px">            }</p> </div> <p> </p> <p>One more statistic that might be of interest: the Average Cycle Time of a job.  The Cycle Time of a job is the time difference between starting one iteration and starting the next, so the average cycle time is just the average of those differences across all the iterations.  Since, in our model so far, every job can start on the next iteration as soon as it’s done with the current one, the average cycle time is equal to the processing time.  
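The 7, 9, 2 arithmetic above is easy to check with a few lines of Python (a quick sketch of my own, not the post’s C#), replaying the model in which every job runs flat out:

```python
def iteration_completion_times(processing_times, num_iterations):
    """Job j starts iteration k at k * processing_times[j], so an iteration
    spans from the earliest start to the latest finish among the jobs."""
    times = []
    for k in range(num_iterations):
        starts = [k * p for p in processing_times]
        ends = [s + p for s, p in zip(starts, processing_times)]
        times.append(max(ends) - min(starts))
    return times

print(iteration_completion_times([7, 9, 2], 4))  # -> [9, 16, 23, 30]
```

Each iteration takes 7 time units longer than the one before – the gap between the slowest and fastest jobs – which is exactly the unbounded growth that makes the schedule unstable; and since each job’s successive starts differ by its own processing time, the average cycle time comes out equal to the processing time, as claimed.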
But here’s how you would calculate it anyway:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">            <span style="color: blue">public</span> <span style="color: blue">decimal</span> CycleTime(<span style="color: blue">int</span> jobNum)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">var</span> starts = _startingTimes[_jobs[jobNum]];</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">return</span> (<span style="color: blue">decimal</span>)<span style="color: #2b91af">Enumerable</span>.Range(1, starts.Count - 1)</p> <p style="margin: 0px">                    .Select(i =&gt; starts[i] - starts[i - 1])</p> <p style="margin: 0px">                    .Average();</p> <p style="margin: 0px">            }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">Next time, we’ll model some more complex scenarios.  Code is available on <a href="http://github.com/benfulton/Algorithmic-Alley">GitHub</a>.</p> </div><img src="http://algorithmicalley.com/aggbug/8.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2010/08/22/a-cyclic-scheduling-model.aspx Sun, 22 Aug 2010 21:35:17 GMT http://algorithmicalley.com/archive/2010/08/22/a-cyclic-scheduling-model.aspx#feedback http://algorithmicalley.com/comments/commentRss/8.aspx An approximate solution for the Subset Sum problem http://algorithmicalley.com/archive/2010/05/31/an-approximate-solution-for-the-subset-sum-problem.aspx <p>We described Greedy in <a href="http://algorithmicalley.com/archive/2010/05/02/the-subset-sum-problem.aspx">Part 1</a>.  It’s very fast and often quite accurate, but not always accurate enough.  So how can we improve on it?</p> <p>Remember, we can get perfect accuracy by enumerating all of the possible subsets of the numbers, but that number gets out of hand fast.  
But, look at this list:</p> <p>{ 1, 3, 4, 6, 9 }</p> <p>While this list has 32 distinct subsets, it does <em>not</em> have 32 distinct subset <em>sums</em>.  The subset { 1, 6 } sums to 7, and the subset { 3, 4 } also adds up to 7.  How can we take advantage of this?</p> <h3>Computing all subsets</h3> <p>The LINQ algorithm I posted to calculate subsets is nice, but we need a more iterative way to generate the subsets so we can poke at them in between iterations.  Here’s another solution:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">override</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; Find(<span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; list, <span style="color: blue">int</span> sum)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">var</span> subsets = <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: #2b91af">IEnumerable</span>&lt;<span style="color: blue">int</span>&gt;&gt; { <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;() };</p> <p style="margin: 0px">            <span style="color: blue">for</span> (<span style="color: blue">int</span> i = 0; i &lt; list.Count; i++)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">var</span> T = subsets.Select(y =&gt; y.Concat(<span style="color: blue">new</span>[] { list[i] }));</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                subsets.AddRange(T.ToList());</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">var</span> result = subsets.Find(s =&gt; s.Sum() == sum);</p> <p style="margin: 0px"> </p> 
<p style="margin: 0px">                <span style="color: blue">if</span> (result != <span style="color: blue">null</span>)</p> <p style="margin: 0px">                    <span style="color: blue">return</span> result.ToList();</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> <span style="color: blue">null</span>;</p> <p style="margin: 0px">        }</p> </div> <p>The idea is to start with the empty subset; then on each iteration, we take our current list of subsets and append the next number in the list to each of them, creating an additional batch of subsets.  So if you had the list { 1, 2, 3 }, after the first iteration you’d have the empty set and { 1 }.  On the next iteration, you’d add {2} and {1,2} alongside those.  On the third iteration, you’d add 3 to all of those sets, so the new sets would be {3}, {1,3}, {2,3}, and {1,2,3}.</p> <h3>Getting rid of duplicates</h3> <p>Let’s take the theory even further: instead of just eliminating subsets that sum to the same amount, we’ll get rid of them even if their sums are just <em>close</em> to each other.</p> <p>We’ll start by getting the subsets ordered by sum:</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">        <span style="color: blue">public</span> <span style="color: blue">override</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; Find(<span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt; list, <span style="color: blue">int</span> sum)</p> <p style="margin: 0px">        {</p> <p style="margin: 0px">            <span style="color: blue">var</span> subsets = <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: #2b91af">IEnumerable</span>&lt;<span style="color: blue">int</span>&gt;&gt; { <span style="color: blue">new</span> <span style="color: #2b91af">List</span>&lt;<span style="color: blue">int</span>&gt;() };</p> <p style="margin: 0px">            <span style="color: blue">for</span> (<span style="color: blue">int</span> i = 0; i &lt; list.Count; i++)</p> <p style="margin: 0px">            {</p> <p style="margin: 0px">                <span style="color: blue">var</span> T = subsets.Select(y =&gt; y.Concat(<span style="color: blue">new</span>[] { list[i] }));</p> <p style="margin: 0px">                <span style="color: blue">var</span> U = subsets.Union(T).OrderBy(k =&gt; k.Sum()).ToList();</p> </div> <p>Next, we’ll eliminate all subsets whose sums fall within a delta factor of each other.  We’ll also eliminate any subsets that go over the target sum.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">                subsets = RemoveCloseSums(U, _delta).Where(s =&gt; s.Sum() &lt;= sum).ToList();</p> </div> <p>Since we’re providing a delta anyway, we’ll go ahead and return any subset that gets us within a delta factor of our target.  
Note that this is what Greedy is doing as well, although in the case of Greedy it just stops looking when it’s out of ideas.</p> <div style="font-family: courier new; background: white; color: black; font-size: 10pt"> <p style="margin: 0px">                <span style="color: blue">var</span> result = subsets.Find(s =&gt; s.Sum() &lt;= sum &amp;&amp; s.Sum() &gt; (1 - _delta) * sum);</p> <p style="margin: 0px"> </p> <p style="margin: 0px">                <span style="color: blue">if</span> (result != <span style="color: blue">null</span>)</p> <p style="margin: 0px">                    <span style="color: blue">return</span> result.ToList();</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            }</p> <p style="margin: 0px"> </p> <p style="margin: 0px">            <span style="color: blue">return</span> <span style="color: blue">null</span>;</p> <p style="margin: 0px">        }</p> <p style="margin: 0px">    }</p> </div> <p>And that’s it.</p> <h3>Is it any good?</h3> <p>Well, yes and no.  Greedy does a pretty bang-up job in most cases that I’ve tried.  If you have a lot of little jobs that you can poke in near the end after the big ones are finished, Greedy will find those and put them in when appropriate.  But you can fool it.  One of the test lists I used was 100 jobs, with times ranging from 200 to 5000, and Greedy had a hard time being <em>too</em> precise with its numbers, while the approximating algorithm was able to come up with a set just as quickly.  But paring off sets that were too large helped a lot with that as well.  Greedy didn’t much care if my target value was 5000 or 50000, it chose its solution and returned.  Approximating took a lot longer to compute the 50000 set.</p> <p>Still, I didn’t put a lot of effort into optimizing the C# code.  Look at all those Sum() calls up there – I’m guessing the compiler isn’t managing to optimize those away, and if not, they’re really going to hurt performance.  
The next step in speeding up the algorithm would probably be to eliminate those, and then you might be able to compete with Greedy again.</p> <p>But that is left as an exercise for the reader.  Once again, the code is available on <a href="http://github.com/benfulton/Algorithmic-Alley">GitHub</a>.  Happy coding!</p><img src="http://algorithmicalley.com/aggbug/7.aspx" width="1" height="1" /> Ben Fulton http://algorithmicalley.com/archive/2010/05/31/an-approximate-solution-for-the-subset-sum-problem.aspx Mon, 31 May 2010 14:04:05 GMT http://algorithmicalley.com/archive/2010/05/31/an-approximate-solution-for-the-subset-sum-problem.aspx#feedback http://algorithmicalley.com/comments/commentRss/7.aspx
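<p>If you’d like to play with the trimming idea without the C# scaffolding, here’s a minimal Python sketch of the same approach: start from the empty subset, extend every kept subset with each new number, discard anything over the target, and prune sums that land too close to one another.  The function name, the <font face="Courier New">(1 + delta)</font> trimming rule, and the early exit are my own choices for illustration, not taken from the GitHub code:</p>

```python
def approximate_subset_sum(numbers, target, delta):
    """Approximate subset sum by trimming near-duplicate sums.

    Keeps (sum, members) pairs sorted by sum.  Each iteration extends
    every kept subset with the next number, discards sums over the
    target, and keeps a new sum only if it exceeds the last kept sum
    by more than a (1 + delta) factor.
    """
    subsets = [(0, [])]
    for x in numbers:
        extended = [(s + x, members + [x]) for s, members in subsets]
        merged = sorted(subsets + extended)

        trimmed, last = [], -1.0
        for s, members in merged:
            if s > target:
                continue  # overshoots the target; discard
            if s > last * (1 + delta):
                trimmed.append((s, members))
                last = s
        subsets = trimmed

        # Early exit once some subset is delta-close to the target,
        # just as the C# version returns as soon as it finds one.
        for s, members in subsets:
            if s >= (1 - delta) * target:
                return members
    return None
```

<p>On { 1, 3, 4, 6, 9 } with a target of 7 and a delta of 0.1, this finds a subset summing to 7 almost immediately; on larger inputs, the trimming step is what keeps the candidate list from doubling on every iteration.</p>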