Table of Contents
Previous Go to Robert Ramey Software Development Home Page next

3. Tips to Minimize Sorting Time

The time required to sort a file with the postman's sort can can be significantly reduced by careful specification of the sorting keys. This distinguishes the Postman's sort from other programs. To maximize the speed of the sort it is useful to have a basic idea of how it works. The following illustrates the operation of the sort by analogy with the way the post office sorts letters.

Postman's Sort 1. Checks key field specs 2. Setup distribution lists 3. Distribute records among lists according to first bytes of key 4. For each list of records 5. If there is only on record in the box write it to output 6. Distribute according to next bytes of key Post Office 1. Doesn't need to check because they're always the same 2. Distribution pigeon holes already setup according to first digits of zip code. 3. Distribute letters according to first digits in zip code. 4. Send the contents of each pigeon hole on to the correct post office 5. Distribute letters received according to last digits of zip code and send to local postoffice 6. Distribute letters among carrier routes. The key to speeding up the postman's sort is to distribute records among the largest number of lists on each pass. The maximum number of lists than can be used on each pass is determined by the memory segment size specified with the -l switch. The default value is sufficient to create approximately 1000 lists on each pass.

Example: Suppose we are to sort a years worth of accounting transactions by date where the date field is in format MMDD. We could simply use:

-k '0'-'9' -c 0-3 In this case PSORT would set up lists for all combinations of the first three bytes before exhausting its limit of 1000 lists per pass. Once the file is distributed among this 1000 lists, each list could be distributed among 10 more sublists to sort records on the last byte. Thus each record would be processed twice. Now suppose that the following sort specification had been used: -k '0'-'1' -c 0 # first digit of month -k '0'-'9' -c 1 # second digit of month -k '0'-'3' -c 2 # first digit of day -k '0'-'9' -c 3 # second digit of day Here only 600 lists are needed to accommodate all the combinations of the four bytes that might occur. Since this is less than 1000 the file will be sorted in one pass. The cpu time expended on the sorting process will be reduced by almost 50%.

Note that the amount of time required to sort the file will be strictly proportional to the file size.

Suppose that all the sorting keys have the same data in the first bytes. If the data does not fit in memory, the whole file could be distributed to one list and paged out to disk. This would cost a lot of time with no contribution made to the sorting task. If records have many equal bytes at the beginning of the keys, the file could be copied many times before the sorting process actually starts to take place. PSORT deals with this by recognizing when a list starts to contain "too many" records and subdivides the list before writing it out to disk. The number of times a list can be subdivided is determined by the segment size set by he -l switch. Large files with large numbers of long keys with equal bytes will be sorted much faster if the -l switch is used to increase the segment size. This can be increased to 63K for 16 bit versions and upto about 20 percent of available memory for 32 bit versions. Using the -k switch to specify the type of data expected in the fields will permit psort to more efficiently allocate memory and increase the probability that the sort will in complete in record time.

The lesson of all this is that giving PSORT more information about the sorting keys will almost always speed up the sort.

By permitting overlapping asynchronous i/o when writing to work files, a good Disk Caching program will speed up the sort when the whole file will not fit in memory.

Beware of virtual memory systems. PSORT detects true installed memory on all platforms tested except for Win32s. PSORT limits its allocation of memory to an amount less that the available physical memory. However there may be some situations where the operating system does not provide the true available memory and/or will grant a request for memory from virtual memory. This will result in excruciatingly slow sorting times. This might occur with some DOS memory management programs, some UNIX platforms, Win32s, OS/2 when MEMMAN is set to SWAP, as well as others. If sorting large files seems to take disproportionately long and seems to require an excessive amount of disk activity, try using the -m switch to specify the maximum of memory that PSORT should use.