
2. PSORT Command Reference

All facilities of PSORT can be requested from the command line with the following syntax.

psort {-h | [<global option>...] [<key field>...]}

Executing psort with only the -h switch displays a help message.

Global options are command line switches which apply to the whole sorting process. These indicate record format, delimiter characters, etc. If no global options are specified, it will be assumed that the file consists of variable length text records no more than 511 bytes long. All global options must precede the first key field.

Key fields specify how the records in the input file are to be finally sequenced. If no sorting key fields are specified, the whole record is taken as a sorting field. The first non-printable character terminates the sorting of the record.

The file to be sorted is read from the standard input. The sorted file is written to the standard output. These may be redirected from the DOS command line.

To sort a file named bob and send the results to a new file to be named anne the following command would suffice:

psort <bob >anne

Naturally, as the sorting to be done on the file becomes more complex, the command line will become more elaborate.

2.1. Global Options

The global options are specified with the following syntax:

[{-rt [<size>] | -rv [<range> [<size>]] | -rf <size>}]
[-in <input file>...] [-out <output file>] [-w [<dir>]]
[-u] [-t <range>...] [-i] [-b <range>...]
[-mcf] [-mcr] [-mfr] [-q] [-v [<#levels>]]
[-m <memory size>[{kmg}]] [-l <mem segment size>]
[-ibs <buffer attributes>] [-obs <buffer attributes>] [-bs <buffer attributes>]
[-wb <buffer attributes>] [-rb <buffer attributes>]

where buffer attributes are specified as follows:

<buffer size> [-sync | -async <buffer count>] [-flags [r|s][p|t][b|u]]

Global options may be specified in any order but must precede the specification of any key fields.

-rt [<maximum record size>] Indicates that the file is a text file consisting of a sequence of records delimited by a newline character. If a record larger than the indicated maximum size is encountered, the program will terminate with an error message. The default file type is a text file with records up to 511 characters in length. If the last record in the file is not terminated with a carriage return - line feed sequence, it will be discarded. For text files that include a Ctrl-Z (0x1a) as the last character in the file, this implies that this final Ctrl-Z will not appear in the output file.

-rf <fixed record size> Indicates that the file consists of fixed length records of the indicated size. If a short record is found at the end of the file, it will be discarded. If the file consists of fixed length records each terminated with a carriage return - line feed sequence, don't forget to include those 2 characters in the length of the record.

-rv [<range> [<maximum size>]] Indicates that the file consists of variable length records up to the indicated maximum size. Variable records have one or two bytes reserved for the length of the record. The position of these bytes is specified by the range. If the range runs from a higher to a lower number, it is assumed that the record length is little endian, that is, that the higher order byte follows the lower order byte. The range may be more than two bytes wide, but only the least significant two bytes are used since a record size cannot exceed approximately 64KB in any case. If the range is unspecified, it is assumed that the record length is to be found in the first two bytes. The record length field should contain the number of bytes in the record that follow the record length field. That is, 0 is a valid record length and corresponds to a record with no bytes following the record length field. For purposes of specifying the location of key fields, the whole record including the record length field should be considered. Note that this total differs from the value stored in the record length field.
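The -rv record layout can be illustrated with a short Python sketch. This is only a model of the description above, not PSORT's own code: the function name and its parameters are hypothetical, and dropping an incomplete trailing record is an assumption borrowed from the -rf behavior.

```python
def split_variable_records(data: bytes, first: int = 0, last: int = 1) -> list[bytes]:
    """Split data into records that each carry a length field at bytes first..last."""
    low, high = min(first, last), max(first, last)
    # A descending range such as 1-0 is read as little endian.
    byteorder = "little" if first > last else "big"
    records = []
    pos = 0
    while pos + high < len(data):
        header_end = pos + high + 1
        # Only the two least significant bytes of the length field are used.
        length = int.from_bytes(data[pos + low:header_end], byteorder) & 0xFFFF
        # The stored length counts only the bytes after the length field.
        record_end = header_end + length
        if record_end > len(data):
            break  # assumption: an incomplete trailing record is dropped
        records.append(data[pos:record_end])
        pos = record_end
    return records
```

A stored length of 0 yields a record that consists of the length field alone, and key positions count from the start of the whole record, length field included.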

If no record type is specified, it is assumed that the file consists of text records no more than 511 characters long.

-in <input file>... Normally the input to be sorted is taken from the standard input. When this is inconvenient or there is more than one input file the -in switch may be used to specify the names of the input files. A simple "-" in the file list will expand to the standard input file.

-out <output file> Normally the sorted file will be written to the standard output file. When this is inconvenient the -out switch may be used to specify the output file. Using this switch, the output file can be written to the same file name as the input file. This will decrease the disk space required for the sort. It will also delete the input file during the sort so it will be lost if for any reason the sorting program does not run to completion. The output file should be written over the input file only in those cases where disk space is at a premium and adequate backup exists for the input file.

-w [<dir>] Use the named directory for temporary work files. The default is taken from the TMP, TEMP, or TMPDIR environment variables. If the -w switch is used with no argument, any temporary work file will be created in the current directory. Note that if the environment variable points to a RAM disk, there may not be enough space available to sort a large file. In such cases the program will abort prior to completion, indicating that disk space was exhausted while writing to the work file. If this happens, either use the -w switch or reassign the environment variable to a hard disk directory with sufficient space available. The work file will usually be somewhat larger than the input file.

-u Output only records that are unique according to the sorting key fields.

-t <range>... Indicates which characters terminate fields. For example, -t '|'. If no -t specification is used, the whole record will be considered as one field.

-i Invert the sorting sequence. Normally records are sorted according to increasing collating value of characters in the key fields. Use this switch to sort according to decreasing collating value instead. This switch applies to every field subsequently specified. However, it can be overridden on a field by field basis when keys are specified.

-b <range>... Indicates which characters should be considered blanks to be skipped to find the start of each field.

If PSORT detects records which do not have expected fields, it will normally terminate with an error message. The following switches alter this default behavior.

-mcf If a record is encountered with a sorting field too short to contain the entire sorting key, PSORT will normally terminate with an error message. This action can be overridden by using this switch. If this is done, the pointer to the current character in the sorting field will cease to advance when the end of the field is encountered.

-mcr This switch specifies that when short sorting fields are encountered, the pointer to the current character will cease to advance when and only when the end of the record is encountered. That is, field delimiters will be passed over and the field will be continued towards the end of the record.

-mfr If a record is encountered with a missing sorting field, PSORT will normally terminate with an error message. This action can be overridden with this switch. If so, the pointer to the current character in the sorting field will be advanced to the null character at the end of the record.

The following switches normally need not be used. They are used in special situations, debugging, and fine tuning. Values for buffer and memory sizes have been initially set to values determined through experiment to give good results. In certain cases, modifying these values may decrease sorting time or amount of memory required.

-q suppresses display of program copyright notice when the program starts up.

-v specify visible mode. This displays statistics on each distribution pass in the file. It is used for debugging and fine tuning. If only the top levels of distribution are desired use -v <number of levels>.

-m <memory size> maximum memory in K to be allocated for sort. Normally PSORT will attempt to reserve enough memory from the system to hold the entire file. If the file is too large, all available memory will be reserved and a temporary work file will be created. This switch can be necessary in a multitasking environment to inhibit PSORT from requesting more than the specified amount of memory thereby leaving memory available for other tasks. Memory size can also be specified in megabytes or gigabytes by appending m or g respectively.

-l <allocation size> length of segment used by internal storage in K. It must exceed the longest record in the file. Default value is 39. The maximum permitted size is 63 on 16 bit versions. On 32 bit versions segment size can be as large as available memory. Psort will terminate if it is unable to allocate at least 4 segments when the program starts.

-ibs <buffer attributes> Specify the attributes of the buffering for the main file input. These attributes include size, buffer count, etc. and are described in more detail below.

-obs <buffer attributes> Same as -ibs but applies to the output file.

-wb <buffer attributes> Specify the attributes of the output buffer used to write data to the temporary file.

-rb <buffer attributes> Specify the attributes of the input buffer used to read data from the temporary file.

-bs <buffer attributes> This is a shortcut equivalent to specifying the attributes of both the input and output buffers.

<buffer size> Size of the buffer in KB. The size can be specified in MB by appending an "m" to the value. If not specified, an environment-dependent default is assigned.

-sync Use synchronous i/o. Psort operation is suspended while waiting for data to be read or written. When coupled with buffered i/o (see below), the operating system will buffer sequential reads and writes, so in practice sorting will overlap i/o and operation will be quite efficient.

-async <buffer count> Use asynchronous i/o. Psort handles the buffering of data itself with the specified number of buffers. The input work file uses async i/o by default. This is due to the fact that psort can schedule a seek on the next block of data while sorting the most recently acquired block, which can enhance performance in many cases. For other files, experience suggests that performance with either method is comparable.

-flags [r|s][p|t][b|u]

r/s Random/Sequential i/o - hint to the OS
p/t Permanent/Temporary file - hint to the OS
b/u Buffered/Unbuffered. Unbuffered i/o reads and writes data directly between the file and the sort address space. This can save on cpu time. However, it stymies OS buffering, so it should be coupled with async i/o (see above).

2.2. Key Fields

A key field consists of several optional parts in the following syntax.

[<key collating sequence>] [<field specifier>...]

where a <field specifier> looks like:

[-i] [-b <range>...] [-f <range>...] [-c <range>...]

A key field specification determines how the records in the output file will be ordered. Records will first be ordered according to the first key field. Within each group of records with the same value in the first key field, records will be ordered according to the second key field. Ordering by successive key fields continues until each group holds only one record or there are no more key fields.

In addition to the key collating sequence explained below, a key field consists of the following optional components.

-i Invert the sequence of the sort for the last key specified. Normally fields are sorted in the sequence set by the global -i option or a higher level key. Using this switch inverts that sequence for this field. That is, if either the global -i switch or this local -i switch is used, the field will be sorted in descending sequence. If both switches are specified, this local -i switch will re-invert the sense of the global -i switch, so that records are sorted on this field in ascending sequence.

-b <range>... Specify additional leading blank characters for this field. These leading blank characters are in addition to the ones specified with the global option -b.

-f <range>... Sort on one or more fields. Fields are groups of characters separated by one of the delimiter characters specified by the global -t switch. If no -t switch has been specified, the whole record is considered as one field and reference to fields in positions greater than 0 will terminate PSORT with an error message. After finding a delimiter, the characters specified by the global or local -b switch are skipped over to determine where the field actually starts. Fields are numbered starting at 0. That is, -f 0 refers to the first field of the record. A field specification may contain a range of fields, as in -f 2-4, to indicate that the sorting sequence is to be determined on the basis of the third, fourth and fifth fields in turn. A range must have a definite end. That is, -f 2- is not permitted. A field range need not be increasing. That is, -f 3-2 is permitted and will sort first by the fourth then by the third field.

-c <range>... Sort on one or more characters within the indicated fields. Start counting character positions from 0. For example, -f 1 -c 2-3 would sort on the third and fourth characters of the second field. Several character ranges may be specified for a given field. For example, -f 2 -c 5-6 -c 3-4 -c 1-2 would specify three sorting fields of 2 characters each within the third delimited field. When specifying a character range within a field, the second number need not be greater than the first. That is, -c 7-3 is permitted and will result in sorting being applied to characters in positions 7, 6, 5, 4, and 3 in that order. As we will see in the examples below, this will be useful in sorting certain types of binary number fields. An indefinite character range can be specified, as in -c 4- . This will indicate all characters starting with the fifth to the end of the field, wherever that might be. A -c -2 would indicate all characters starting at the last one in the field and moving to the left up to and including the third character in the field.
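The range forms accepted by -c can be modelled with a few lines of Python. This is a sketch of the rules as described above, not PSORT's parser: the function name is hypothetical, only decimal numbers are handled, and fields shorter than the range are not considered.

```python
def character_positions(spec: str, field_length: int) -> list[int]:
    """Expand a -c style range such as "2-3", "7-3", "4-" or "-2"."""
    last = field_length - 1
    if "-" not in spec:
        return [int(spec)]
    first_text, second_text = spec.split("-", 1)
    # An omitted number stands for the last character of the field.
    first = int(first_text) if first_text else last
    second = int(second_text) if second_text else last
    # A descending range visits the characters from right to left.
    step = 1 if second >= first else -1
    return list(range(first, second + step, step))
```

For an eight character field, "7-3" expands to positions 7, 6, 5, 4, 3 and "4-" expands to positions 4 through 7.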

2.3. Key Collating Sequences

Key collating sequences are used to specify how characters are to be weighted in determining which record, field, or character is "less than" or "greater than" another. There are four kinds of key collating sequences that can be used.

-k [[[-r] <range>]...] Specifies a collating sequence. The collating sequence is specified as one or more ranges of values. Characters are assigned collating values in order of their specification. For example, to sort a file containing only lower case alphabetic characters

-k 'a'-'z'

could be used. This would assign value 1 to 'a', 2 to 'b',..., 26 to 'z'. Any characters not specified will be assigned a collating value of 0. Characters in a field beyond a character with a 0 collating value will not be considered when sorting on that field. Hence, either of the records b<0>a and b<0>c may precede the other in the output file. Within a key specification, any number of ranges may be specified. For example, if the sorting field will contain any combination of lower case letters, digits and spaces, use:

-k ' ' '0'-'9' 'a'-'z'

Spaces will sort before digits, which will sort before lower case letters.

-r Repeats the previous collating range. For example, to fold upper case letters to lower case letters for purposes of determining sorting priority use -k 'a'-'z' -r 'A'-'Z'. This would assign the first character following the -r the same collating value as the first one assigned in the previous range. That is, 'A' through 'Z' would be assigned collating values 1 through 26, as would 'a' through 'z'. To give varying white space characters equal weight use

-k ' ' -r '\t' -r '_'
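The way -k and -r assign collating values can be sketched in Python as follows. The function and its input format (a list of (first, last) character pairs with the string "-r" in between) are hypothetical, and the value given to a fresh range that follows a repeated one is an assumption the text does not settle.

```python
def collating_table(spec):
    """Map characters to collating values from a list of ranges and "-r" markers."""
    table = {}
    next_value = 1      # value for the first character of the next fresh range
    previous_start = 1  # first value used by the most recent range
    repeat = False
    for item in spec:
        if item == "-r":
            repeat = True
            continue
        first, last = item
        step = 1 if ord(last) >= ord(first) else -1
        # A range after -r starts again at the previous range's first value.
        start = previous_start if repeat else next_value
        value = start
        for code in range(ord(first), ord(last) + step, step):
            table[chr(code)] = value
            value += 1
        previous_start = start
        next_value = max(next_value, value)
        repeat = False
    return table
```

Characters missing from the returned table correspond to collating value 0. For example, collating_table([("a", "z"), "-r", ("A", "Z")]) gives both 'a' and 'A' the value 1.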

-n [[-r] <range>]... [-d <decimals>] Character numeric sort on the key. This is an alternative to -k. Character numeric fields may contain a leading or trailing sign and/or a decimal point. Numeric fields should look like

[' ']...[+|-]['0'-'9']...[.['0'-'9']...][-]

Any non-numeric character will terminate the field. The field will be sorted by numeric value. The -n switch is equivalent to the following:

(1) Sort by sign, negatives first.
(2) Sort by number of non-zero digits to the left of the decimal point.
(3) Sort by sequence of digits to the left of the decimal point.
(4) Sort by sequence of digits to the right of the decimal point.

If one is sorting numbers in a base other than ten, the permitted characters should follow the -n switch. For example, -n '0'-'7' could be used to sort octal numbers. The resulting sort might be slightly faster than using the decimal default. To sort on hexadecimal numbers use

-n '0'-'9' 'a'-'f' -r 'A'-'F'

This assigns the characters a through f the six collating values that follow those of the digits, and assigns A through F the same values.
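The equivalence given for -n can be checked with a small Python comparison function. This sketch handles decimal fields only; the names are hypothetical, and treating a signed zero as non-negative as well as reversing the magnitude order for negative values are assumptions made so that the result is in numeric order.

```python
import re
from functools import cmp_to_key

# Leading blanks, optional sign, digits, optional fraction, optional trailing sign.
NUMBER = re.compile(r" *([+-]?)(\d*)(?:\.(\d*))?(-?)")

def parse(field):
    sign, whole, fraction, trailing = NUMBER.match(field).groups()
    whole = whole.lstrip("0")                 # leading zeros carry no weight
    fraction = (fraction or "").rstrip("0")   # nor do trailing fraction zeros
    negative = (sign == "-" or trailing == "-") and bool(whole or fraction)
    # Magnitude: digit count, then digits left and right of the point.
    return negative, (len(whole), whole, fraction)

def compare(a, b):
    neg_a, mag_a = parse(a)
    neg_b, mag_b = parse(b)
    if neg_a != neg_b:
        return -1 if neg_a else 1             # negatives first
    result = (mag_a > mag_b) - (mag_a < mag_b)
    return -result if neg_a else result       # larger magnitude is smaller when negative

fields = ["10", "-3.5", "9", "+2.25", "-12", "2.3", "007"]
ordered = sorted(fields, key=cmp_to_key(compare))
```

Comparing the digit count before the digits themselves is what lets plain character comparison stand in for numeric comparison.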

The -d switch is used to specify the maximum number of digits to the right of the decimal/radix point. It is usually not necessary to use the -d switch but could speed up the sort if the maximum number of digits to the right of the decimal/radix point is known.

Character numeric fields are inherently of variable length. If no field delimiters are used in the record, it is possible for PSORT to fail to properly determine where the decimal fraction ends and the next field begins. This can be remedied by using the -d switch to specify how many characters after the decimal/radix point the field ends.

-s [[[-r] <range>]...] This is used to specify collating values for bytes corresponding to signs. The first half of the values are assumed to correspond to negative numbers, and subsequent fields are ordered inversely to the sense specified by the global and local -i switches. The second half of the values are assumed to correspond to positive numbers, and subsequent fields are sorted normally. To illustrate this, consider the following sequence of records (the fourth begins with a space in place of a sign):

+0001
-4011
+9002
 8888
-2231
...

In order to sort these in numerical order, the records are first sorted by sign. The negative values are then sorted by absolute value in inverted order, while the positive ones are sorted by absolute value. The following would be used:

-s '-' '+' -r ' ' -c 0 -k '0'-'9' -c 1-4

Of course, this trivial example could have been done with the -n switch. The file includes examples such as keys for sorting packed and floating numbers which could not be done without the -s switch.
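The effect of the -s example can be reproduced with a Python sort key. This is a sketch under two assumptions: the unsigned record carries a space in its first column, and inverting the following field is modelled by replacing each digit d with 9 - d, which is a stand-in technique rather than PSORT's mechanism.

```python
def signed_key(record: str):
    """Key equivalent to -s '-' '+' -r ' ' -c 0 -k '0'-'9' -c 1-4 for 5 character records."""
    # '-' takes the first (negative) sign value; '+' and ' ' share the second.
    sign_value = {"-": 1, "+": 2, " ": 2}[record[0]]
    digits = [int(ch) for ch in record[1:5]]
    if sign_value == 1:
        # Negative sign: the digit field that follows is ordered inversely.
        digits = [9 - d for d in digits]
    return (sign_value, digits)

records = ["+0001", "-4011", "+9002", " 8888", "-2231"]
ordered = sorted(records, key=signed_key)
```

The negative records come out largest magnitude first, followed by the positive records in increasing order.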

Nested Keys The final type of key specification is the nested field. Its syntax is (<sort command>) . In this case, commands enclosed in parentheses are applied to each of the fields defined by the subsequent -b, -i, -f and -c parameters as if each of these were a record. This will be illustrated after macros and include files are described.

Default Key If no -k, -n, -s or nested field is specified, a default collating sequence of all printable characters is used. Space and tab (0x09) are considered printable characters. For files containing non-printable characters be sure to include a -k specification.

2.4. Notes on -b and -t switches

To clarify the interaction among the global -t and -b switches and the local -b switch, the following sequence of operations is presented.

For each record in the file, a pointer starts at the beginning of the record. Then, for each field:

(1) Blank characters specified by the global -b switch are skipped.
(2) Blank characters specified by the local -b switch are skipped. The pointer now points to the beginning of the field.
(3) Characters not specified by the global -t switch are skipped. The pointer now points to the end of the field.
(4) The next character (the delimiter) is skipped. The pointer now points to the first blank (if any) of the next field.
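The pointer walk described above can be sketched in Python. The function below is a hypothetical model, not PSORT's implementation: it takes the -t and -b characters as plain strings and returns the text of each field.

```python
def split_fields(record, terminators, global_blanks="", local_blanks=""):
    """Return the fields of record found by the blank-skipping pointer walk."""
    fields = []
    pos = 0
    while True:
        # Skip blanks named by the global -b switch, then by the local one.
        while pos < len(record) and record[pos] in global_blanks:
            pos += 1
        while pos < len(record) and record[pos] in local_blanks:
            pos += 1
        start = pos  # beginning of the field
        # Advance over everything that is not a -t terminator.
        while pos < len(record) and record[pos] not in terminators:
            pos += 1
        fields.append(record[start:pos])
        if pos >= len(record):
            break
        pos += 1  # skip the delimiter itself
    return fields
```

With tab listed as a blank, two adjacent tabs are absorbed into one separator; with only space listed as a blank, the same input yields an empty field between the tabs.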

If it is desired that fields be delimited by white space and that adjacent tabs/spaces not constitute null fields, the following is appropriate:

psort -b ' ' '\t' -t ' ' '\t' -k ...

Leading tab characters are skipped so that adjacent ones are absorbed.

If it is desired that fields be delimited by white space but that adjacent tabs constitute null fields, the following would be used.

psort -b ' ' -t ' ' '\t' -k ....

Leading tab characters are not skipped so that adjacent ones are not absorbed.

Some systems maintain records with fields in the form "abcd","efgjkjl","irowq",.... To sort this file in descending sequence according to the third field, then in ascending sequence according to the first field, any of the following would suffice.

-t ',' -b '"' -k 'a'-'z' -i -f 3 -f 1
-t ',' -k 'a'-'z' -i -b '"' -f 3 -c 0- -f 1 -c 1-

Getting a construction such as '"' through the DOS or other command line processor may pose a challenge. In most cases, ' do the trick. Also, remember that characters can always be specified as decimal or hexadecimal numbers. Thus, '"' could be replaced by 0x22 or 34.

If no -b switch is used, no characters are skipped after the delimiter.

2.5. Inverted Sorting Sequence

Remember that characters not specified within a collating sequence are taken as collating value zero and that the field is considered terminated when a character with a collating value of zero is recognized. This can result in unexpected behavior when fields are not the same length. Following is the result of sorting a small file with -k 'z'-'a'.

def
cad
basdf
a
aa

This was probably not the result intended. To get the desired result, use -k 'a'-'z' -i.

def
cad
basdf
aa
a
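The difference between the two results can be reproduced in Python. This sketch models a field as the tuple of its collating values, cut off at the first character with value 0; a shorter tuple then compares lower than a longer one with the same prefix, which mirrors the terminating zero. The helper names are hypothetical.

```python
def key_for(table):
    """Build a sort key that maps a field to its tuple of collating values."""
    def key(field):
        values = []
        for ch in field:
            value = table.get(ch, 0)
            if value == 0:
                break  # a zero-valued character ends the field
            values.append(value)
        return tuple(values)
    return key

words = ["a", "basdf", "aa", "def", "cad"]

# -k 'z'-'a': 'z' gets 1, 'y' gets 2, ..., 'a' gets 26
z_to_a = {chr(ord("z") - i): i + 1 for i in range(26)}
# -k 'a'-'z': 'a' gets 1, ..., 'z' gets 26
a_to_z = {chr(ord("a") + i): i + 1 for i in range(26)}

reversed_table_result = sorted(words, key=key_for(z_to_a))
inverted_result = sorted(words, key=key_for(a_to_z), reverse=True)
```

Reversing the table changes the character weights but leaves the end of a field as the lowest value, so "a" still precedes "aa". Inverting the whole comparison moves the shorter field after the longer one.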

2.6. Command Files and Macros

The DOS command line only permits a maximum of 128 characters. This is not enough to permit all the command parameters that some files might require. In order to accommodate this and other situations the following switches can be used.

-#include <filename> This is used to specify that the contents of the indicated file should be inserted into the string of sorting parameters. Include files can be nested to reasonable depth. Include files have exactly the same format as command line parameters, with the following exceptions:

(1) The # character is special. It is used for placing comments into the command parameter file. All characters encountered after the # character are discarded. In order to be recognized as a comment character it must be preceded by a space. This prevents constructions such as '#' from being erroneously treated as comment characters.

(2) Command parameters are recognized across record boundaries until the end of the file is detected. There is no need to specify a special character such as '\' to specify continuation on to the next record.

-#define <macro name> <commands>... -#end These are used to create a sequence of commands and assign a name to them. For example, suppose that for a given file of accounting transactions the account number is stored in positions 0-11 while the date is stored in positions 12-17. We could create a file named containing the following:

# account number key field
-#define -acct -k '0'-'9' -c 0-11 -#end
# date key field
-#define -date -k '0'-'9' -c 12-17 -#end

In order to prepare a journal with transactions sorted by date we would use the following:

psort -#include -date

which would be exactly equivalent to:

psort -k '0'-'9' -c 12-17

Meanwhile a monthly general ledger with transactions sorted by account number then by date would use:

psort -#include -acct -date

This is only a small example of the possibilities of defined macros. Below they will be combined with the concept of nested keys to make the system really powerful.

Macro definitions are not expanded until they are used. Macro definitions may use other macros that have not been defined yet. A macro may contain -#define statements so that new macros are created when it is used. Macros may include files with the -#include statement. Included files may define and reference defined macros. If a macro is defined and one with the same name already exists, future references to that name will expand to the most recently defined macro. Macro invocations, definitions and included files may be nested to reasonable depth. When nesting -#define statements, each -#end will terminate the last unterminated -#define. The number of -#end parameters should match the number of -#define statements.

Macros should not contain references to themselves or to other macros that refer to themselves. This will result in a termination of the sort when the macro is invoked. In other words, macros may not be recursive.
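The macro rules above (late expansion, redefinition, nested -#define blocks, no recursion) can be modelled with a short Python sketch. It works on a list of parameters already split on white space, leaves out -#include and comments, and uses hypothetical names throughout; it is not PSORT's parser.

```python
def expand(tokens, macros=None, active=()):
    """Expand -#define ... -#end blocks and macro references in a token list."""
    macros = {} if macros is None else macros
    output = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-#define":
            name = tokens[i + 1]
            # Find the matching -#end, allowing nested definitions.
            depth, j = 1, i + 2
            while depth:
                if j >= len(tokens):
                    raise ValueError("missing -#end")
                if tokens[j] == "-#define":
                    depth += 1
                elif tokens[j] == "-#end":
                    depth -= 1
                j += 1
            macros[name] = tokens[i + 2:j - 1]  # body is stored unexpanded
            i = j
        elif token in macros:
            if token in active:
                raise ValueError("recursive macro " + token)
            # Expansion happens only now, when the macro is used.
            output += expand(macros[token], macros, active + (token,))
            i += 1
        else:
            output.append(token)
            i += 1
    return output

source = ("-#define -acct -k '0'-'9' -c 0-11 -#end "
          "-#define -date -k '0'-'9' -c 12-17 -#end -acct -date")
expanded = expand(source.split())
```

Because bodies are stored unexpanded, a macro may refer to another macro that is defined later, and a redefinition takes effect for every later use.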

2.7. Field Nesting

Suppose we have a file of records containing work orders. Each record has a promise date in field 1 and an order date in field 4. The dates are stored in format DDMMYY. The following key specification will result in the fastest sort by promise date as well as account for roll over at the end of the century.

-k '9' '0' -f 1 -c 4 # sort by decade considering
-k '0'-'9' -f 1 -c 5 # century roll over
-k '0'-'1' -f 1 -c 2 # high digit of month (0 or 1)
-k '0'-'9' -f 1 -c 3 # low digit of month
-k '0'-'3' -f 1 -c 0 # high digit of day
-k '0'-'9' -f 1 -c 1 # low digit of day

If we need orders with the same promise date sorted by order date, we could repeat the above specifications with a different field. This is tedious and error prone. To avoid doing this we could use "field nesting":

(
-k '9' '0' -c 4 # sort by decade considering
-k '0'-'9' -c 5 # century roll over
-k '0'-'1' -c 2 # high digit of month (0 or 1)
-k '0'-'9' -c 3 # low digit of month
-k '0'-'3' -c 0 # high digit of day
-k '0'-'9' -c 1 # low digit of day
) -f 1 4

Field nesting applies key fields specified within parentheses to each of the fields defined by the subsequent parameters as if each of these were a record. Nested fields can contain -i, -b, and -c parameters. -f parameters are not permitted.

If our system used date fields in the form DDMMYY in several places it would be convenient to create a file named which contained:

-#define -date (
-k '9' '0' -c 4 # sort by decade considering
-k '0'-'9' -c 5 # century roll over
-k '0'-'1' -c 2 # high digit of month (0 or 1)
-k '0'-'9' -c 3 # low digit of month
-k '0'-'3' -c 0 # high digit of day
-k '0'-'9' -c 1 # low digit of day
) -#end

We could then use a command line:

psort -#include -date -f 1 4

Suppose our accounting system has another file where the date is in the same format but starts in the fifteenth column. We could then use the command line:

psort -#include -date -c 14-

The key specified by the -date macro would be applied to the field extending from the fifteenth column to the end of the record as if that field were a record. The net effect is that the first character position in the -date field would be translated to the fifteenth position in the original record. The second character of the -date field would be found in the sixteenth column of the record, and so forth. That is, anywhere a date field in the format DDMMYY is used, all we need do is -#include the file containing the corresponding key definition. The location of bytes within the nested field will be adjusted according to the location of the outer level field list.

The displacements specified in the nested field parameter are applied from the first byte in the outer level field and move towards the end of the field. Thus nested fields properly account for outer level fields that have been specified in reverse order. The same nested field specification defined as -date in our previous example will serve just as well for fields where bytes are ordered in reverse order of significance. For example, suppose we have a field where the bytes of the date field were in reversed order, i.e. YYMMDD. If the field occupied positions 19 to 24, the following command would be appropriate:

psort -#include -date 24-19
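The translation of nested character positions into record positions can be sketched in Python. The function names are hypothetical, and the list [4, 5, 2, 3, 0, 1] is simply the order in which the -date definition reads its characters; the sketch shows only the offset arithmetic described above.

```python
def outer_positions(first: int, last: int) -> list[int]:
    """Record positions covered by an outer -c range, in the order given."""
    step = 1 if last >= first else -1
    return list(range(first, last + step, step))

def translate(nested_offsets, outer):
    """Map offsets inside the nested field to positions in the record."""
    return [outer[offset] for offset in nested_offsets]

date_offsets = [4, 5, 2, 3, 0, 1]  # character order used by the -date definition

forward = translate(date_offsets, outer_positions(14, 19))
backward = translate(date_offsets, outer_positions(24, 19))
```

With the forward range, nested offset 0 lands on record position 14. With the range 24-19, offset 0 lands on position 24 and each later offset moves one position to the left.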

Fields can be nested to any reasonable depth. In the following example the date field is specified as a combination of previously defined day, month, and year fields.

-#define -year (
-k '9' '0' -c 0 # sort by decade considering
-k '0'-'9' -c 1 # century roll over
) -#end
-#define -month (
-k '0'-'1' -c 0 # high digit of month (0 or 1)
-k '0'-'9' -c 1 # low digit of month
) -#end
-#define -day (
-k '0'-'3' -c 0 # high digit of day
-k '0'-'9' -c 1 # low digit of day
) -#end
-#define -date (
-k -day -c 0-1
-k -month -c 2-3
-k -year -c 4-5
) -#end

Notice how the field positions of the -day, -month, and -year fields are all 0-1 . This is because the true relative positions of these fields are shifted by the fields in the next higher key. They will be shifted again when the -date field is applied to the location of the date field in the record to be sorted.

This technique would now permit us to define a special American date type field.

-#define -americandate (
-k -year -c 4-5
-k -month -c 0-1
-k -day -c 2-3
) -#end

Included in the Postman's Sort package is a file named which contains definitions for commonly encountered types of key fields. Among the types of fields considered are:

-ieee            IEEE floating point
-packed          COBOL comp-3 type
-alphabetic      Upper/lower case characters folded
-unsignedbinary  Simple bitwise sequence
-signedbinary    Two's complement binary integers
-packeddate      Date in packed 0ddmmyyC format

This file can be edited to suit your particular environment. If desired, your own field types can be added or an application or installation specific file can be created. This file is set up so that all field positions can be specified from low character position to high character position. This takes into account that binary and floating point values are stored least significant byte first on Intel machines. Hence, to sort on a binary number stored at the beginning of the record, use

-k -signedbinary -c 0-1

even though the bytes are stored in the Intel format of low order byte first. If this is inconvenient, the file can be edited to taste.

The complete syntax for nested key fields is

( <key field> ... )

where the key field can include its own -i and -c switches and parameters.

2.8. Syntax Summary

<psort command> :=
    psort {-h | [< <input file>] [> <output file>] <command line>}

<command line> :=
    | [<global option>...] [<key field>...]

<global option> :=
    | [{-rt [<size>] | -rv [<range> [<size>]] | -rf <size>}]
    | [-w [<dir>]] [-in <input file>...] [-out <output file>]
    | [-u] [-t <range>...] [-i] [-b <range>...]
    | [-mcf] [-mcr] [-mfr] [-q] [-v [<#levels>]]
    | [-m <memory size>[{kmg}]] [-l <mem segment size>]
    | [-bs <size>] [-ibs <size>] [-obs <size>]
    | [-wb <size>] [-rb <size>]

<key field> :=
    | [<key collating sequence>] [<field list>...]
    | ( <key field> ) [<field list>...]

<key collating sequence> :=
    | -k [<range> ...]
    | -n [<range> ...] [-d <decimal count>]
    | -s [<range> ...]

<field list> := [<local options>] [-f <range>...] [-c <range>...]

<local options> :=
    | -i [<local options>]
    | -b [<local options>]

<range> :=
    | <number>
    | <number>[-<number>]

<number> :=
    | <decimal number>
    | 0x<hexadecimal number>
    | '<character>'