This article describes a program that sorts an arbitrarily large number of records in less time than any algorithm based on comparison sorting can. For many commonly encountered files, time will be strictly proportional to the number of records. It is not a toy program. It can sort on an arbitrary group of fields with arbitrary collating sequence on each field faster than any other program available.
Forget for a moment everything you know about computer science.
If you were given a stack of playing cards and asked to sort them by suit, how would you go about it? Most people would deal the cards into four piles, one for each suit. After dealing the last card, the piles would be picked up one after the other. Sorting a deck of cards this way into four suits would take about a minute.
Suppose that you were asked to sort 100 decks of cards into suits. Would you use a different method? How long would it take you to do it? It should be pretty clear that most people would use the same method, and that it would take 100 times as long as sorting one deck of cards into four suits.
So here we have a sorting method whose time is strictly proportional to the number of items to be sorted. This method is known as distribution sorting.
Most computer professionals "know" that the time to sort a file of records must increase disproportionately with the number of records in the file. It's a good thing that the rest of us don't know it. We might never get our mail.
In fact, most will cite Donald E. Knuth's book The Art of Computer Programming, Vol. 3: Sorting and Searching. Knuth's book contains a chapter on minimum-comparison sorting which shows that a sorting method which compares records can do no fewer than N log2 N comparisons (note: that's log base 2).
However, this analysis specifically excludes methods that do not use key comparison. In fact, other sections of the book allude to methods whose time is proportional to the number of records sorted.
Then why does it seem that every sort utility uses quicksort? Practical sort utilities have to be able to handle a wide variety of keys and key types and be fast. The methods described in Knuth's book that do not use comparison have depended on the use of fixed-length keys. Also, although they might be faster for a sufficiently large number of records, they are slow on the numbers of records usually encountered. For example, to sort a group of records on a social security number using radix list sort would require 30 passes through the file!
There is a way to overcome these deficiencies. In fact, the post office uses it every day. While researching this article (after writing the program), I found that Knuth alluded to it and left it as an exercise for the reader.
Suppose we are given a large list of records to be ordered alphabetically on a particular field. Make one pass through the file. Each record read is added to one of 26 lists depending on the first letter in the field. The first list contains all the records with fields starting with the letter "A", while the last contains all the records with fields starting with the letter "Z". We have now divided the problem into 26 smaller subproblems. Next we address the subproblem of sorting all the records in the sublist corresponding to key fields starting with the letter "A". If there are no records in the "A" sublist we can proceed to deal with the "B" sublist. If the "A" sublist contains only one record, it can be written to the output and we are done with that sublist. If the "A" sublist contains more than one record, it must be sorted before it is output. What sorting algorithm should be used? Just like a real postman, we use the postman's sort; of course we just apply the method to the second letter of the field. This is done to greater and greater depths until eventually all the words starting with "A" are written to the output. Only when the "A" sublist has been disposed of can we move on to deal with sublists "B" through "Z" in the same manner. The records will thus be written to the output in alphabetical order.
The above algorithm is implemented in the accompanying program. The function sort() is passed a sublist. First it determines how many records the sublist has. If the sublist is empty we can just return. If it has one record, we write the record and are done. Otherwise we advance the pointer into the sort key and continue. plan() and allocate() reserve memory on the stack for sublist pointers. dist() distributes the input sublist among the newly created sublists. dsort() calls sort() for each of the newly created sublists.
A practical sort utility must be able to sort on any fields in any sequence with any collating sequence. The command line syntax permits specification of delimited fields, the start of the key within each field, and a collating sequence for each field. See the accompanying manual page for the sort program.
Sorting proceeds according to each key field. The first character within a field which corresponds to a zero collating sequence terminates the field. The rest of the field is not considered in the sort. Sorting continues on the remaining key fields. This means that "A<0>C" might be written to output before "A<0>B" in the final output. If this behavior is not desired, one should make sure that every character that could appear within a field is assigned a non-zero collating sequence value.
To be really useful the program must take into account that records can have varying numbers of fields, fields can have varying numbers of characters, and that the command line may specify fields and/or key characters that a particular record does not contain. In general, keys corresponding to non-existent fields are distributed with a collating sequence of zero. This turns out to be the most natural, convenient, and flexible of the possibilities considered. It does introduce some complications which require refinements of the algorithm. After distribution is completed by dist(), the newly created sublists are processed by dsort(). Sublists corresponding to a zero collating value are handled first. For these sublists, dsort() advances key pointers to the next key field before continuing to distribute them. The remaining sublists are handled as previously described.
Is this the best we can do? Suppose we wanted to sort 100 decks of shuffled playing cards by value and suit. We could first distribute by suit and then distribute each pile by value. But we probably wouldn't. What we would probably do is to arrange 4 rows and 13 columns. Each card would be dealt to the pile corresponding to its suit and value. All the cards could be sorted in one pass.
When we're sorting on the basis of a key field we can do the same thing by distributing among a larger number of sublists. In one pass we would have sublists " ?", "A ", "AA", "AB", ..., "AZ", "B ", "BA", ..., "ZY", "ZZ". Of course, information on 1 + 26 * (1 + 26) = 703 sublists must be maintained in memory, but the effect is to cut in half the number of times each key has to be addressed. The basic principle is to sort on the basis of as many bytes at a time as possible. If we want to include all 8-bit bytes in the collating sequence, we are going to need 64K sublists to sort on two bytes at a time. This is usually not going to be practical (at least on Intel type processors, where the maximum size of a data segment is 64K). Fortunately, in the real world we usually know something about the fields we want to sort on. For example, to create a telephone directory, we sort on people's names, last name first. These fields will only contain alphabetic characters, so for this purpose we can take two bytes at a time. Social security numbers only contain decimal digits, so we can take 3 or 4 at a time before things get out of hand.
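The mapping from the next one or two key characters to one of the 703 sublists might be computed as follows. This is a hypothetical helper of my own, assuming purely alphabetic uppercase keys; bucket 0 collects records whose key is exhausted, matching the " ?" sublist above.

```c
/* A sketch of the two-bytes-at-a-time bucket index for purely
   alphabetic keys: bucket 0 holds records whose key is exhausted,
   and each surviving record lands in one of 26 * 27 buckets for
   its next one or two letters -- 1 + 26 * (1 + 26) = 703 in all. */

#define NBUCKETS (1 + 26 * (1 + 26))    /* 703 sublists */

int bucket2(const char *key)
{
    int i1, i2;

    if (key[0] == '\0')
        return 0;                       /* key exhausted: sublist " ?"   */
    i1 = key[0] - 'A';                  /* first letter: 0..25           */
    i2 = (key[1] == '\0')
             ? 0                        /* no second letter: "A ", "B ". */
             : key[1] - 'A' + 1;        /* second letter: 1..26          */
    return 1 + i1 * 27 + i2;
}
```

Processing buckets in increasing index order then visits the sublists in exactly the sequence " ?", "A ", "AA", ..., "AZ", "B ", ..., "ZZ".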
The function plan(), called from sort(), determines the best number of bytes to take given the collating sequences defined for the key fields. It also sets up pointers to be able to find the key fields. The program is designed so that we can take bytes from more than one field on one pass. This assures us that the maximum possible number of bytes has been used to distribute each record each time it is addressed. To sort a file of records on a social security number in the third field in format 999-99-9999, one could use the command line:
psort -k '0'-'9' -f 2 -c 0 -c 4 -c 7
The default setup for psort will allocate sublist space to handle three decimal digits. The first time a record is addressed it will be distributed on the first three digits. The second time, '-' will be encountered at the next key position, so all the records with the same first three digits will be added to sublist 0. The third time, distribution will be on the middle two digits. Enough space will be allocated to distribute on three digits, but only a small portion will be used as each field will be terminated by the '-'. The fourth distribution will be on the next three digits, and the fifth will be on the last digit and the field terminator. The best command line to sort a file of records based on a social security number in the third field in format 999-99-9999 would be:
psort -k '0'-'9' -f 2 -c 0-2 -c 4-5 -c 7-10
In this case the sort would make only three passes through the records. In general, the more information we supply on the command line, the faster the sort is likely to execute.
As usual, a disproportionate part of the program is taken up with issues of memory management. Even so, I won't go into it here, as it is peripheral to the subject of this article.
Below I offer an informal analysis of the speed of the postman's sort for some different types of files. Comparative analysis of comparison sorting and the postman's sort is somewhat difficult since the methods are so different. I have chosen to count "primitive operations". By primitive operation I mean each time a record is addressed and/or manipulated in some way. If the primitive operation has some sort of inner loop, as a string comparison does, I count each cycle of the inner loop as a primitive operation.
For comparison sorts this will be comparing two bytes and perhaps moving records. For the postman's sort this will be manipulation of a byte from the key field. Of course, this is going to be a rough approximation but I think it will still be worthwhile.
First consider the case where a relatively short fixed-length key is used for the sort, e.g., sorting accounting transactions by date in format mmdd. The number of possible dates is small enough that we would expect to complete the job in one pass through the file, distributing records among 2*10*4*10 = 800 sublists. This is specified by using a command line of
-k '0'-'1' -c 0 -k '0'-'9' -c 1 -k '0'-'3' -c 2 -k '0'-'9' -c 3
In this case the number of primitive operations required by the postman's sort will be equal to the number of records to be sorted times four operations per record, as there are four bytes in the key. Note that this is an example of a sorting method which requires time proportional to the number of records in the file. And it's not just a toy problem.
Using any comparison sorting method, the number of comparisons will be at least N log2 N. Since the number of records far exceeds the number of possible keys, the number of primitive operations per comparison will be close to 4. Using 3 as an estimate, the total number of primitive operations will be about 3 N log2 N. The postman's sort will be faster than the fastest comparison sort by a factor of (3/4) log2 N.
For 100,000 records the fastest comparison sort will require 12 times more primitive operations than the postman's sort.
Next consider the case of a relatively long variable-length key. Suppose we use as an example keys made of random alphabetic characters. Using the postman's sort and taking two characters at a time, the first pass through the file will generate 703 subfiles. Each subfile will in turn be distributed among 703 subfiles, etc., until each subfile has only one record. Hence each record will be addressed approximately log703 N times, where N is the number of records in the file. With two key bytes handled each time a record is addressed, the total number of primitive operations will be about 2 N log703 N.
Using a comparison sort, most of the comparisons are between keys that are close together. If there are fewer than 26 records in the file, we would expect most comparisons to terminate on the first byte. If there are fewer than 26*26 records in the file, most comparisons will terminate on the second byte, and so on. This works out to about log26 N primitive operations for each comparison. The number of comparisons is at least N log2 N. So the total number of primitive operations would be about N log2 N * log26 N.
For this file the postman's sort will be faster than the comparison sort by a factor of (log2 N * log26 N) / (2 log703 N),
or about 16 times faster for a 100,000 record file.
My machine is a 12 MHz AT compatible with two 28 ms Seagate 251-1 disk drives. One drive was used for temporary files. The number of drives used greatly affected the time for a comparison sort of a large file. It had little effect on the time required for the postman's sort.
Well, it looks like the postman's sort isn't as fast as my theory predicts. But it's still twice as fast as the next best thing.
One of the most gratifying things about the postman's sort is that the first records are produced on the standard output before the file has even been completely read! In our test file, records which contained just a newline character are output as they are read in. When the last record is read in, the first non-null records appear almost immediately. If one is working with a multi-tasking system with pipes, much time will be saved, as the subsequent process can proceed in parallel with the sorting process. So even though this sort is only twice as fast, the jobs that use it may be improved much more.
Well, there you have it. A program which sorts a file of N records in less time than it takes to make N log2 N comparisons. For a large and common class of sorting keys, it will sort a file in time proportional to the number of records in the file.
Note: This article originally appeared in the August '92 issue of The C Users Journal.