DFILE Tools v0.7

Keith Crane

Overview

DFILE Tools are reusable software utilities and libraries for batch applications. Their primary function is to process files containing variable length text data. These tools generally apply batch business rules faster, and with fewer operational support issues, than SQL databases.

While DFILE Tools utilities can be used to perform common data processing tasks through use of command line arguments and control files, sometimes business rules require custom programming. A C language API is available to access software libraries for reading and writing data files.

Notable DFILE Tools features are described below.

A notable weakness of standard UNIX tools is their variable length record format. Using special characters to delimit fields and records is convenient for configuration information but inadequate for processing actual data. Using printable characters as delimiters introduces a risk that the delimiter character may also exist as data, while non-printable delimiter characters are inconvenient to use with some UNIX tools. In either case, delimited formats are inefficient to process. A better alternative is to store the length of each field in one byte immediately before the value. This limits field values to 255 characters but is significantly more efficient than processing the delimited format. For flexibility, both record formats are supported; a brief illustration follows.
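
As an illustration, a hypothetical three field record (not taken from the examples later in this document) could be stored either way; octal escapes denote the one byte field length values:

delimited form:        smith|memphis|38103
length-prefixed form:  \005smith\007memphis\00538103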

Since DFILE Tools need record format and layout information to process data files, meta-data is maintained in configuration files. Environment variable ${CFGPATH} contains the directories to search for configuration entries. There are two types of configuration files. The configuration file accessed first by DFILE Tools is dfile.cfg; it contains configuration records for dfiles. Each dfile configuration record names an additional configuration file containing the field names that comprise the record layout. Each dfile.cfg entry uses the colon character (:) as a field delimiter, and each configuration record contains the following information:

  1. dfile name - This is a high level handle used to reference one or more UNIX files. This file naming abstraction is the only available method in DFILE Tools to read and write data files.
  2. field delimiter character - Variable length records contain fields that may be delimited by special characters. If the delimiter character is not printable, a decimal number can be used.
  3. record separator character - Variable length records may be separated by special characters. If the separator character is not printable, a decimal number can be used. Representing new line character as ASCII 10 is an example.
  4. delimiter/separator escape character - When field delimiter and record separator characters are specified, this is an option to have their occurrences escaped in data. If no escape value is specified, delimiter/separator characters that occur in data are converted to spaces.
  5. file name for record layout - Configuration file that contains field names that comprise record layout.
  6. UNIX file system path - File path used when accessing data.

If field delimiter and record separator are not specified, the variable length record format will be used that stores one byte field lengths adjacent to field values.

UNIX file system paths may contain two different types of variables. They may have tags such as %g and %p that are replaced by values passed into utilities at run time, either as command line arguments or through a control file. Environment variables may be specified in braces, such as ${HOME}.

GZIP compression is used directly by DFILE Tools when UNIX file system paths have a .gz suffix. This is a standard practice for files stored with GZIP compression. Below are example entries:

employee:|:10::employee.cfg:${HOME}/data/emp/employee.dat.gz
address:|:10::address.cfg:${HOME}/data/emp/address.dat.gz
phone:|:10::phone.cfg:${HOME}/data/emp/phone.dat
zip_code:,:10::zip_code.cfg:${HOME}/data/emp/zip_code.dat

employee.110::::employee.cfg:${HOME}/data/emp/employee.110/%p.dat
address.110::::address.cfg:${HOME}/data/emp/address.110/%p.dat
phone.110::::phone.cfg:${HOME}/data/emp/phone.110/%p.dat

Record layout configuration files contain field names and their order within the record. Utility and API use of field names is case insensitive. The following is an example record layout configuration file:

EMP_NBR
EFF_DT
EXPR_DT
FIRST_NAME
SURNAME
SSN
BIRTHDAY
EMP_START_DT
EMP_END_DT

Since batch applications are processing intensive, it is important that they scale to the hardware architecture. DFILE Tools use a data parallelism technique of partitioning large numbers of data records into multiple UNIX files. Records are assigned to partition units based on data values in a key field, which makes the partition unit file location predictable per key field value. DFILE Tools contain a process manager that executes concurrent instances of utilities based on configured CPU and data partition information. Each execution instance processes only one data partition unit. Partition unit identification is passed from the process manager to a utility using either environment variables or command line tag assignments (-t %p=99). Sometimes it is advantageous to hierarchically partition records based on the first few key fields.

Record Filtering

Often processes require only a portion of records from a file based on business rules. All utilities have a scripting language that allows unwanted records to be filtered at run time. While it is functionally similar to SQL WHERE clauses, the language looks much like LISP S-expressions. The following are examples:

( where ( = $emp_nbr '00098765 ) )

( where
    ( or
        ( in $state_cd ( TN AR IL MO KS IA ) )
        ( in $zip_cd ( 35035 86440 71953 93444 06777 ) ) ) )

( where
    ( and
        ( or
            ( > $sold_qty 0.0 )
            ( > $layaway_qty 0.0 ) )
        ( or
            ( = $store_qty 0.0 )
            ( = $warehouse_qty 0.0 ) ) ) )

The previous statement is equivalent to the following SQL WHERE clause:

where ( sold_qty > 0.0 or layaway_qty > 0.0 )
        and ( store_qty = 0.0 or warehouse_qty = 0.0 )

Expressions may grow in complexity.

( where
    ( and
        ( = $product_ind C )
        ( not ( = $inventory_type 1 ) )
        ( in $storage_type ( 6 7 8 M ) )
        ( not ( = $product_feature_cd_1 VM ) )
        ( not ( = $product_feature_cd_2 VM ) )
        ( not ( = $product_feature_cd_3 VM ) )
        ( not ( = $product_feature_cd_4 VM ) )
        ( not ( = $mps_feature_cd_5 VM ) ) ) )

( WHERE
    ( AND
        ( = $product_ind C )
        ( NOT
            ( OR
                ( LIKE $product_feature_cd_1 [lL][cC][iI][bB] )
                ( LIKE $product_feature_cd_2 [lL][cC][iI][bB] )
                ( LIKE $product_feature_cd_3 [lL][cC][iI][bB] )
                ( LIKE $product_feature_cd_4 [lL][cC][iI][bB] )
                ( LIKE $product_feature_cd_5 [lL][cC][iI][bB] ) ) ) ) )

Sometimes constant semantics are ambiguous. When the interpreter notices a constant beginning with a digit, it attempts to use it as a double precision floating point number. To force a numeric constant to be treated as ASCII, prefix it with a tick mark ('). Constants listed with the IN operator are an exception; they are always placed in a hash table based on their ASCII value. The tick mark is also useful to represent a zero length value. Examples are as follows:

( where
    ( and
        ( >= $event_dt '20080601 )
        ( <= $event_dt '20080630 ) ) )

( where
    ( not
        ( = $orig_tid_cid ' ) ) )

When special characters [ \t\n()"] are needed in constants, they may be escaped using a backslash character (\). Alternatively, if the constant does not itself contain the quote character ("), the constant may be surrounded with quote characters.
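
For instance, using a hypothetical field and value (not taken from a real control file), a constant containing a space could be written either way:

( where ( = $city_nm "NEW YORK" ) )

( where ( = $city_nm NEW\ YORK ) )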

The following is a description of the predicate language grammar:

START → ( <where> <condition> )

condition → <compare>
        | ( <and> <conjunction> )
        | ( <or> <conjunction> )
        | ( <not> <condition> )

conjunction → <condition> <conjunction>
        | <condition>

compare → ( = <datum> <datum> )
        | ( > <datum> <datum> )
        | ( >= <datum> <datum> )
        | ( < <datum> <datum> )
        | ( <= <datum> <datum> )
        | ( <in> <variable> ( <literal_list> ) )
        | ( <like> <variable> <literal> )

literal_list → <literal> <literal_list>
        | <literal>

datum → <variable>
        | <number>
        | <literal>

where → [Ww][Hh][Ee][Rr][Ee]

and → [Aa][Nn][Dd]

or → [Oo][Rr]

not → [Nn][Oo][Tt]

in → [Ii][Nn]

like → [Ll][Ii][Kk][Ee]

literal → <ascii_string>
        | <default_literal>

ascii_string → ['].*

number → [-]?(([0-9]+)|([0-9]*[.][0-9]+))

variable → [$][-a-zA-Z0-9_.]+

default_literal → .+

Utilities

dcat

Since the recommended record format does not lend itself to use with standard UNIX tools, utility dcat is available for ad hoc viewing of data. Its default behavior is to write data to stdout with pipe (|) as the field delimiter and new line (\n) as the record separator. The required argument is expected to be a dfile name. Optional command line argument -h prints a header record containing field names. The following is an example:

$ dcat -h -t %g=current processed_revenue_cycle | head
cycle_dt|load_dt
20070702|2007-08-05 18:24:58
20070703|2007-08-05 18:24:58
20070704|2007-08-05 18:24:58
20070705|2007-08-05 18:24:58
20070706|2007-08-05 18:24:58
20070707|2007-08-05 18:24:58
20070708|2007-08-05 18:24:58
20070709|2007-08-05 18:24:58
20070710|2007-08-05 18:24:58

If data is known to contain pipe characters, an alternate field delimiter can be specified with the -F argument. No checking is performed to ensure the output field delimiter is not contained in data.

dfile_exec

Batch schedulers such as UNIX cron are designed to allow flexible job scheduling but lack features for managing job streams. Utility dfile_exec can supplement schedulers by adding job stream management features. Job streams are defined using control files. As steps of a job stream are executed, process checkpoint information is written to a log file. The following is an example:

$ cat jobstream.ctl
( job sum_dflt_group
    ( step 100.awk
        ( exec awk -F: "{ print $3, $1 }" )
        ( stdin /etc/group )
        ( stdout group.100.out )
        ( stderr group.100.err ) )

    ( step 110.sort
        ( exec sort -n )
        ( stdin group.100.out )
        ( stdout group.110.out )
        ( stderr group.110.err ) )

    ( step 120.awk
        ( exec awk -F: "
            {
                ++group[ $4 ]
            }
            END {
                for ( i in group ) {
                    print i, group[ i ]
                }
            }" )
        ( stdin /etc/passwd )
        ( stdout passwd.120.out )
        ( stderr passwd.120.err ) )

    ( step 130.sort
        ( exec sort -n )
        ( stdin passwd.120.out )
        ( stdout passwd.130.out )
        ( stderr passwd.130.err ) )

    ( step 140.join
        ( exec join group.110.out passwd.130.out )
        ( stdout group.140.out )
        ( stderr group.140.err ) ) )

$ dfile_exec -c jobstream.ctl -l jobstream.log >jobstream.out 2>jobstream.err

$ exec_log jobstream.log
#DATE,TIME,JOB NAME,STEP NAME,PARTITION,PID,ACTION,EXIT CODE,SIGNAL
2011/05/09,16:29:55,sum_dflt_group,100.awk,,4712,START,,
2011/05/09,16:29:55,sum_dflt_group,100.awk,,4712,END,0,
2011/05/09,16:29:55,sum_dflt_group,110.sort,,4200,START,,
2011/05/09,16:29:55,sum_dflt_group,110.sort,,4200,END,0,
2011/05/09,16:29:55,sum_dflt_group,120.awk,,3560,START,,
2011/05/09,16:29:55,sum_dflt_group,120.awk,,3560,END,0,
2011/05/09,16:29:55,sum_dflt_group,130.sort,,2892,START,,
2011/05/09,16:29:55,sum_dflt_group,130.sort,,2892,END,0,
2011/05/09,16:29:55,sum_dflt_group,140.join,,4620,START,,
2011/05/09,16:29:55,sum_dflt_group,140.join,,4620,END,0,

$ cat jobstream.out
05/09/11 16:29:55 real 0h00m00s, user 0h00m00s, system 0h00m00s, 100.awk
05/09/11 16:29:55 real 0h00m00s, user 0h00m00s, system 0h00m00s, 110.sort
05/09/11 16:29:55 real 0h00m00s, user 0h00m00s, system 0h00m00s, 120.awk
05/09/11 16:29:55 real 0h00m00s, user 0h00m00s, system 0h00m00s, 130.sort
05/09/11 16:29:55 real 0h00m00s, user 0h00m00s, system 0h00m00s, 140.join

In this example, job steps 110 through 140 are executed sequentially. If any of the steps return an unsuccessful exit code, dfile_exec halts processing and returns a non-zero exit code. Progress is recorded in the log file. If dfile_exec is restarted, it reads the log file to determine which steps previously completed successfully. Successful completions are skipped on subsequent restarts.

Sometimes it is possible to find run time improvements using the data parallelism technique of partitioning data. Concurrent processes work on data partitions as individual isolated units. The following is a simple example:

$ cat jobstream.ctl
( job dir_space_usage
    ( step 10.ls
        ( exec ls -l )
        ( stdout ls.out )
        ( stderr ls.err ) )

    ( step 20.awk
        ( exec awk "/^d/ { print $NF }" )
        ( stdin ls.out )
        ( stdout local_directories )
        ( stderr awk.err ) )

    ( step 30.du
        ( exec du -k -s %s )
        ( partition local_directories )
        ( max-processes 3 )
        ( stdout du.%s.out )
        ( stderr du.%s.err ) )

    ( step 40.sort
        ( exec bash -c "sort -n du.*.out" )
        ( stdout sort.out )
        ( stderr sort.err ) ) )

$ dfile_exec -c jobstream.ctl -l jobstream.log > jobstream.out 2> jobstream.err

$ cat local_directories
cfg
data
dfiletools
parse_xml
tmp

$ exec_log jobstream.log
#DATE,TIME,JOB NAME,STEP NAME,PARTITION,PID,ACTION,EXIT CODE,SIGNAL
2011/05/10,15:31:06,dir_space_usage,10.ls,,3156,START,,
2011/05/10,15:31:06,dir_space_usage,10.ls,,3156,END,0,
2011/05/10,15:31:06,dir_space_usage,20.awk,,3176,START,,
2011/05/10,15:31:07,dir_space_usage,20.awk,,3176,END,0,
2011/05/10,15:31:07,dir_space_usage,30.du,cfg,3868,START,,
2011/05/10,15:31:07,dir_space_usage,30.du,data,4004,START,,
2011/05/10,15:31:07,dir_space_usage,30.du,dfiletools,3488,START,,
2011/05/10,15:31:07,dir_space_usage,30.du,cfg,3868,END,0,
2011/05/10,15:31:07,dir_space_usage,30.du,parse_xml,2996,START,,
2011/05/10,15:31:07,dir_space_usage,30.du,data,4004,END,0,
2011/05/10,15:31:07,dir_space_usage,30.du,tmp,2748,START,,
2011/05/10,15:31:07,dir_space_usage,30.du,dfiletools,3488,END,0,
2011/05/10,15:31:07,dir_space_usage,30.du,parse_xml,2996,END,0,
2011/05/10,15:31:07,dir_space_usage,30.du,tmp,2748,END,0,
2011/05/10,15:31:07,dir_space_usage,40.sort,,444,START,,
2011/05/10,15:31:07,dir_space_usage,40.sort,,444,END,0,

In the above example, job step 20.awk creates file local_directories containing directory names found in the current working directory. For each line of text in this file, job step 30.du runs an instance of the du command. The %s token is replaced by a line of text from the partition file. Initially, the max-processes count is used to start a number of concurrent instances of the exec command. As processes complete, subsequent entries in the partition file are used to start additional instances of the exec command. Utility dfile_exec repeats this execution cycle until all entries in the partition file have been processed.

When partition values are not appropriate for use in file names, token %n may be used in place of %s. Token %n represents partition number.
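
For example, the 30.du step from the previous job stream could name its output files by partition number instead of partition value (a sketch; the step is otherwise unchanged):

    ( step 30.du
        ( exec du -k -s %s )
        ( partition local_directories )
        ( max-processes 3 )
        ( stdout du.%n.out )
        ( stderr du.%n.err ) )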

Sometimes partitions are defined using a whole number value instead of a text file. This is used when data is partitioned into units using a hash algorithm. Specifying whole number n instead of a file name with parameter partition causes dfile_exec to run n processes. Command line substitution for token %s includes values 0 to n-1.

$ cat jobstream.ctl
( job hash_partition_example
    ( step 10.echo
        ( exec echo partition unit %s )
        ( partition 5 )
        ( max-processes 1 ) ) )

$ dfile_exec -c jobstream.ctl -l jobstream.log
partition unit 0
partition unit 1
partition unit 2
partition unit 3
partition unit 4
05/17/11 17:32:52 real 0h00m00s, user 0h00m00s, system 0h00m00s, 10.echo

Occasionally dfile_exec processes may need to be nested while using the partition feature. In these cases, partition values may be passed from parent to child using environment variables.

$ cat a.ctl
( job a
    ( step run_js
        ( exec dfile_exec -c b.ctl )
        ( setenv ( table %s ) )
        ( partition tables.txt )
        ( stdout a.run_js.%s.out )
        ( stderr a.run_js.%s.err ) ) )
$ cat b.ctl
( job b
    ( setenv ( PARTITION_CNT 3 ) )
    ( step echo_values
        ( exec echo "table ${table}, partition %s" )
        ( partition ${PARTITION_CNT} )
        ( stdout b.echo_values.${table}.%s.out )
        ( stderr b.echo_values.${table}.%s.err ) ) )

Grammar for control file is as follows:

START → ( <job> <job_name> <job_setenv> <job_stream> )

job_setenv → null
        | ( <setenv> <env_list> )

job_stream → <job_step> <job_stream>
        | <job_step>

job_step → ( <step> <step_name> <exec_attribute> <step_attribute_list> )

exec_attribute → ( <exec> <command> <command_argument_list> )

command_argument_list → null
        | <command_argument> <command_argument_list>

step_attribute_list → <step_attribute> <step_attribute_list>

step_attribute → null
        | ( <stdin> <file_path> )
        | ( <stdout> <file_path> )
        | ( <stderr> <file_path> )
        | ( <partition> <partition_value> )
        | ( <max-processes> <whole_number> )
        | ( <successful_exit> <whole_number_list> )
        | ( <setenv> <env_list> )

env_list → ( <env_variable> <env_value> ) <env_list>

whole_number_list → <whole_number> <whole_number_list>

partition_value → <file_path>
        | <whole_number>

job → [Jj][Oo][Bb]

step → [Ss][Tt][Ee][Pp]

stdin → [Ss][Tt][Dd][Ii][Nn]

stdout → [Ss][Tt][Dd][Oo][Uu][Tt]

stderr → [Ss][Tt][Dd][Ee][Rr][Rr]

partition → [Pp][Aa][Rr][Tt][Ii][Tt][Ii][Oo][Nn]

max_processes → [Mm][Aa][Xx]-[Pp][Rr][Oo][Cc][Ee][Ss][Ss][Ee][Ss]

successful_exit → [Ss][Uu][Cc][Cc][Ee][Ss][Ss][Ff][Uu][Ll]-[Ee][Xx][Ii][Tt]

job_name → [-_.a-zA-Z0-9]+

step_name → [-_.a-zA-Z0-9]+

whole_number → [1-9][0-9]*

file_path → [-_./a-zA-Z0-9]+

command → [-_./a-zA-Z0-9]+

command_argument → [-_./a-zA-Z0-9]+

env_variable → [-_.a-zA-Z0-9]+

env_value → [-_.a-zA-Z0-9]+

dfile_partition

A large dfile can be split into many UNIX files using utility dfile_partition. Each UNIX file represents an individual partition unit. Partitions are defined according to rules associated with hash and range partitioning methods. The hash partitioning method applies an ASCII hashing algorithm to a key field contained in each record. Records are written to UNIX files based on the result of the algorithm and the number of defined partition units; specifically, the partition unit file number is the remainder from dividing the hash value by the number of defined partition units. The range partitioning method also requires a key field from each record. Key field values are used to perform a binary search of the partition definition values; the correct partition unit is the one with the maximum ASCII value that is less than or equal to the record key value. A sketch of the unit selection logic and command line examples follow.
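
The following is a minimal sketch, in C, of the two unit selection rules described above. The ascii_hash() function is only a stand-in for whatever ASCII hashing algorithm dfile_partition actually applies, and the range boundary values reuse the partition.cfg entries from the example further below.

#include <stdio.h>
#include <string.h>

/*
** Sketch only: ascii_hash() is an illustrative hash, not the algorithm
** used by dfile_partition.
*/
static unsigned long ascii_hash( const char *key )
{
    unsigned long hash = 0;

    while ( *key != '\0' ) {
        hash = hash * 31UL + (unsigned long)(unsigned char)*key++;
    }

    return hash;
}

/*
** Hash partitioning: unit number is the remainder from dividing the
** hash value by the number of defined partition units.
*/
static unsigned int hash_unit( const char *key, unsigned int unit_count )
{
    return (unsigned int)( ascii_hash( key ) % unit_count );
}

/*
** Range partitioning: binary search for the greatest partition
** definition value that is less than or equal to the record key value.
*/
static int range_unit( const char *key, const char **boundary, int boundary_count )
{
    int low = 0;
    int high = boundary_count - 1;
    int unit = -1;

    while ( low <= high ) {
        int mid = ( low + high ) / 2;

        if ( strcmp( boundary[ mid ], key ) <= 0 ) {
            unit = mid;        /* candidate: boundary value <= key value */
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }

    return unit;               /* -1 if the key sorts before every boundary */
}

int main( void )
{
    /* boundary values from the partition.cfg example below */
    const char *boundary[] = { "A", "G", "M", "S" };
    const char *key = "LESMAD556";
    int unit = range_unit( key, boundary, 4 );

    (void) printf( "hash unit for %s (7 units): %u\n", key, hash_unit( key, 7 ) );

    if ( unit >= 0 ) {
        (void) printf( "range unit for %s: %s.dat\n", key, boundary[ unit ] );
    }

    return 0;
}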

$ cat dfile.cfg
bsa_sid:|:10::bsa_sid.cfg:bsa_sid.dat

bsa_sid.hash_partition:|:10::bsa_sid.cfg:bsa_sid/hash_partition/%p.dat

bsa_sid.range_partition:|:10::bsa_sid.cfg:bsa_sid/range_partition/%p.dat


$ cat bsa_sid.cfg
BSA_ID
OWNER_CD
OWNER_TYPE_CD
SID


$ cat bsa_sid.dat
AUHLOG504|35|TPCS|70275
AVIRHN815|AU|AFG|70285
BPSNWT917|35|TPCS|07106
CJIVAL719|35|TPCS|07384
FKNALA104|35|TPCS|07654
HERNOR303|35|TPCS|70092
HUOPLA036|35|TPCS|07622
IMUPDC470|35|TPCS|70303
LESMAD556|K7|AFG|70343
LEXCAN661|35|TPCS|70109
MOAKEY605|35|TPCS|07151
NACNTL962|35|TPCS|07162
NDRSAN019|35|TPCS|07376
NNXSCR505|AR|AFG|70338
NYRFRM518|35|TPCS|70571
OJIMAR840|35|TPCS|07418
ONAMRT341|IL|AFG|73764
POTWEI804|35|TPCS|07171
SFAWTR805|BR|AFG|70413
SGRBRK310|35|TPCS|07183
SVLCOL718|35|TPCS|07190


$ dfile_partition -i bsa_sid -o bsa_sid.hash_partition -f bsa_id -h 7


$ ls -l bsa_sid/hash_partition
total 14
-rw-rw-rw-   1 kcrane   kcrane        96 Oct 24 14:53 0.dat
-rw-rw-rw-   1 kcrane   kcrane        94 Oct 24 14:53 1.dat
-rw-rw-rw-   1 kcrane   kcrane        48 Oct 24 14:53 2.dat
-rw-rw-rw-   1 kcrane   kcrane        71 Oct 24 14:53 3.dat
-rw-rw-rw-   1 kcrane   kcrane        95 Oct 24 14:53 4.dat
-rw-rw-rw-   1 kcrane   kcrane        47 Oct 24 14:53 5.dat
-rw-rw-rw-   1 kcrane   kcrane        48 Oct 24 14:53 6.dat


$ for file in bsa_sid/hash_partition/?.dat
> do
> echo "\n$file"
> cat $file
> done

bsa_sid/hash_partition/0.dat
AVIRHN815|AU|AFG|70285
ONAMRT341|IL|AFG|73764
POTWEI804|35|TPCS|07171
SGRBRK310|35|TPCS|07183

bsa_sid/hash_partition/1.dat
LEXCAN661|35|TPCS|70109
NNXSCR505|AR|AFG|70338
NYRFRM518|35|TPCS|70571

bsa_sid/hash_partition/2.dat
BPSNWT917|35|TPCS|07106
CJIVAL719|35|TPCS|07384
NACNTL962|35|TPCS|07162

bsa_sid/hash_partition/3.dat
LESMAD556|K7|AFG|70343
MOAKEY605|35|TPCS|07151
OJIMAR840|35|TPCS|07418
SFAWTR805|BR|AFG|70413
SVLCOL718|35|TPCS|07190

bsa_sid/hash_partition/4.dat

bsa_sid/hash_partition/5.dat
AUHLOG504|35|TPCS|70275
FKNALA104|35|TPCS|07654
HERNOR303|35|TPCS|70092
HUOPLA036|35|TPCS|07622
IMUPDC470|35|TPCS|70303

bsa_sid/hash_partition/6.dat
NDRSAN019|35|TPCS|07376


$ cat partition.cfg
A
G
M
S


$ dfile_partition -i bsa_sid -o bsa_sid.range_partition -f bsa_id -h partition.cfg


$ ls -l bsa_sid/range_partition
total 8
-rw-rw-rw-   1 kcrane   kcrane       119 Oct 24 15:56 A.dat
-rw-rw-rw-   1 kcrane   kcrane       119 Oct 24 15:56 G.dat
-rw-rw-rw-   1 kcrane   kcrane       190 Oct 24 15:56 M.dat
-rw-rw-rw-   1 kcrane   kcrane        71 Oct 24 15:56 S.dat


$ for file in bsa_sid/range_partition/?.dat
> do
> echo "\n$file"
> cat $file
> done

bsa_sid/range_partition/A.dat
AUHLOG504|35|TPCS|70275
AVIRHN815|AU|AFG|70285
BPSNWT917|35|TPCS|07106
CJIVAL719|35|TPCS|07384
FKNALA104|35|TPCS|07654

bsa_sid/range_partition/G.dat
HERNOR303|35|TPCS|70092
HUOPLA036|35|TPCS|07622
IMUPDC470|35|TPCS|70303
LESMAD556|K7|AFG|70343
LEXCAN661|35|TPCS|70109

bsa_sid/range_partition/M.dat
MOAKEY605|35|TPCS|07151
NACNTL962|35|TPCS|07162
NDRSAN019|35|TPCS|07376
NNXSCR505|AR|AFG|70338
NYRFRM518|35|TPCS|70571
OJIMAR840|35|TPCS|07418
ONAMRT341|IL|AFG|73764
POTWEI804|35|TPCS|07171

bsa_sid/range_partition/S.dat
SFAWTR805|BR|AFG|70413
SGRBRK310|35|TPCS|07183
SVLCOL718|35|TPCS|07190

dfile_sort

Data records can be ordered using utility dfile_sort. This includes sorting unordered records and merging already ordered records. Command line arguments can support sorting and merging of single dfiles, but control files are necessary to order records from multiple input dfiles into a single output dfile.

All sorting algorithms used in the utility are internal sorts; the utility makes no attempt to create temporary work files to conserve memory. Partitioning data prior to sorting allows processing to be properly scaled to hardware. Quicksort is the utility's default algorithm. The algorithm may be specified at run time using the -a command line argument followed by an algorithm flag. Below is a list of available algorithms.

ALGORITHM        FLAG
Insertion Sort   I
Shell Sort       S
Merge Sort       M
Quicksort        Q

A list of key sorting fields may be specified with the -k command line argument. Field names are separated by commas (,). By default, key field values are sorted in ASCII ascending order. This can be changed by appending optional flags to the affected field names. Optional flags are separated by periods (.). The first flag is expected to be (A)scending or (D)escending. The last flag specifies whether data values are compared as (A)SCII, (N)umeric, or (H)igh value null. High value null is an ASCII comparison but treats zero length values as special cases that sort as the highest possible value. This is useful for sorting expiration/termination dates. Examples are shown below.

$ dfile_sort -k acct_nbr,eff_dt -i extract.account -o sort.account

$ dfile_sort -k acct_nbr.d,eff_dt -i extract.account -m pending_account -o sort.account

$ dfile_sort -a m -k acct_nbr,expr_dt.a.h -i extract.account -o sort.account

Sometimes several input dfiles need to be sorted or merged into one output dfile. This is done using a control file. Control files are specified with the -c command line argument. Specifying a control file causes the utility to ignore most command line arguments. Below are some control file examples:

( ( order-by ( sbscr_nbr ) ( cust_sys_cd ) ( RLS_EQPT_ID ) )

( merge
    ( dfile dfile_join.subs_act_bsa
        ( where ( in $KW_INDCR ( C B ) ) ) )

    ( dfile dfile_join.sw2_migrate_in3
        ( where ( in $KW_INDCR ( C B ) ) ) )

    ( dfile dfile_join.sw2_migrate_out3
        ( where ( in $KW_INDCR ( C B ) ) ) ) )

( output ( dfile dfile_sort.subs_act_bsa ) ) )



( ( order-by ( os_cust_id ) ( seq_nbr ) )

( merge
    ( dfile dfile_sort_basic.subs_rev.sbscrp_bsa
        ( where ( not ( = $src_geo_id ' ) ) ) )

    ( dfile dfile_join.subs_rev.subs_rev_src_geo_id ) )

( output ( dfile dfile_sort.rls_src_invc_chg ) ) )



( ( order-by ( snpsht_dt ) )

( sort
    ( dfile cat_partitions.swy_swz_trigger )

    ( dfile max_swy_swz_trigger
        ( tag ( %g current ) ) ) )

( output ( dfile dfile_sort.swy_swz_trigger ) ) )



( ( order-by ( sw_id ) ( sys_update_date ) ( effective_date ) )

( sort
    ( dfile dfile_cache_join.bsa_sw_owner
        ( where
            ( and
                ( like $sw_id "^\([0-2]?[0-9]\{1,2\}[.]\)\{3,3\}[0-2]?[0-9]\{1,2\}$" )
                ( <= $sys_update_date '20081231000000 )
                ( <= $effective_date '20081231000000 )
                ( > $expiration_date '20081231000000 ) ) ) ) )

( output ( dfile sw_owner ) ) )



( ( order-by ( snpsht_dt ) )

( sort
    ( dfile cat_partitions.sw2_trigger
        ( where ( > $snpsht_dt '20081231000000 ) ) ) )

( merge
    ( dfile sw2_pend_trigger
        ( tag ( %g current ) )
        ( where ( > $snpsht_dt '20081231000000 ) ) ) )

( output ( dfile dfile_sort.sw2_pend_trigger ) ) )

Grammar for control file is as follows:

START → ( <order_by_section> <sort_section> <merge_section> <output_section> )

order_by_section → ( <order_by> <order_by_list> )

order_by_list → <order_by_field> <order_by_list>
        | <order_by_field>

order_by_field → ( <field_name> <field_attribute> )

field_attribute → null
        | ( <order_by_direction> )
        | ( <compare_method> )

order_by_direction → <ascending>
        | <descending>

compare_method → <ascii>
        | <numeric>
        | <high-value-null>

sort_section → null
        | ( <sort> <specify_algorithm> <dfile_list> )

merge_section → null
        | ( <merge> <dfile_list> )

output_section → ( <output> <output_dfile> )

specify_algorithm → null
        | ( <algorithm> <sort_algorithm> )

sort_algorithm → <insertion_sort>
        | <shell_sort>
        | <hash_sort>
        | <merge_sort>
        | <quick_sort>

dfile_list → <input_dfile> <dfile_list>
        | <input_dfile>

input_dfile → ( <dfile> <dfile_name> <input_attribute> )

output_dfile → ( <dfile> <dfile_name> <output_attribute> )

input_attribute → <dfile_tag> <record_filter>

output_attribute → <dfile_tag> <dfile_open_mode>

dfile_tag → ( <tag> <tag_assignment_list> )

tag_assignment_list → <tag_assignment> <tag_assignment_list>
        | <tag_assignment>

tag_assignment → ( <tag_variable> <tag_value> )

dfile_open_mode → ( <open_mode> <open_file_mode> )

open_file_mode → <append>
        | <truncate>

order_by → [Oo][Rr][Dd][Ee][Rr]-[Bb][Yy]

algorithm → [Aa][Ll][Gg][Oo][Rr][Ii][Tt][Hh][Mm]

insertion_sort → [Ii][Nn][Ss][Ee][Rr][Tt][Ii][Oo][Nn]-[Ss][Oo][Rr][Tt]

shell_sort → [Ss][Hh][Ee][Ll][Ll]-[Ss][Oo][Rr][Tt]

heap_sort → [Hh][Ee][Aa][Pp]-[Ss][Oo][Rr][Tt]

merge_sort → [Mm][Ee][Rr][Gg][Ee]-[Ss][Oo][Rr][Tt]

quick_sort → [Qq][Uu][Ii][Cc][Kk]-[Ss][Oo][Rr][Tt]

ascending → [Aa][Ss][Cc][Ee][Nn][Dd][Ii][Nn][Gg]

descending → [Dd][Ee][Ss][Cc][Ee][Nn][Dd][Ii][Nn][Gg]

ascii → [Aa][Ss][Cc][Ii][Ii]

numeric → [Nn][Uu][Mm][Ee][Rr][Ii][Cc]

high_value_null → [Hh][Ii][Gg][Hh]-[Vv][Aa][Ll][Uu][Ee]-[Nn][Uu][Ll][Ll]

sort → [Ss][Oo][Rr][Tt]

merge → [Mm][Ee][Rr][Gg][Ee]

output → [Oo][Uu][Tt][Pp][Uu][Tt]

dfile → [Dd][Ff][Ii][Ll][Ee]

dfile_name → [-_.a-zA-Z0-9]+

field_name → [-_.a-zA-Z0-9]+

tag → [Tt][Aa][Gg]

tag_variable → %[a-zA-Z]

tag_value → [-_/.a-zA-Z0-9]+

open_mode → [Oo][Pp][Ee][Nn]-[Mm][Oo][Dd][Ee]

append → [Aa][Pp][Pp][Ee][Nn][Dd]

truncate → [Tt][Rr][Uu][Nn][Cc][Aa][Tt][Ee]

record_filter → *** described in Overview section ***

dfile_cache_create

There are two methods available in utility dfile_join to join records between dfiles. One of them requires a sorted dfile to be pre-loaded into a UNIX shared memory segment. This action is performed using utility dfile_cache_create. Command line argument -a followed by a hexadecimal number specifies the IPC key associated with the shared memory segment. The -i command line argument allows the input dfile to be specified. Records can be filtered using the -y command line argument followed by a file name containing filter rules. After dfile_join is complete, shared memory segments are removed using UNIX command ipcrm. The following is an example:

$ dfile_cache_create -a 0x0331 -i tid_cid_owner

$ ipcs -m
IPC status from as of Thu Sep 11 17:32:45 CDT 2008
T         ID      KEY        MODE        OWNER    GROUP
Shared Memory:
m 1929379853   0x331      --r--r----- kcrane    kcrane

dfile_join

Joining records between dfiles is a common operation necessary for reporting. Utility dfile_join has two join methods available. One method is referred to as Sort-Merge Join. It requires input records in each dfile to be pre-sorted by corresponding key field values. Input dfiles are read sequentially and merged based on the key field values. The other join method requires all but one dfile to be pre-sorted. The sorted dfiles are also expected to be in UNIX shared memory segments. The join process sequentially reads the unsorted dfile and performs binary searches on records in shared memory. This method is generally for joining a large dfile to small dfiles.

Many command line arguments are available at run time. The -k argument is followed by one or more key field names. Field names are separated by commas (,). The -i argument is followed by the input dfile name. Fields from this dfile are mapped to output without specifying them individually. The -j argument is followed by a dfile name that contains joining records. Input dfile and dfile containing joining records are expected to have key field names in common. The -f argument is followed by a list of join dfile field names that specifies which field values to copy to output record. Field names are separated by commas (,). Command line argument -o followed by dfile name specifies an output dfile. The -m argument followed by I or O is an optional argument to specify inner or outer join operation. By default, inner joins are assumed. When outer joins are performed, the -s argument may be used followed by an output field name. This output field will contain values - or + based on whether the input record was successfully joined to a record from the join dfile. An optional -u argument may be specified followed by F or L. This limits input records to join with only the first or last record from the join dfile having matching key field values. Input, join, and output record filter files may be specified using arguments -x, -y, and -z respectively. If the join is with a UNIX shared memory segment, -a argument is followed by a hexadecimal number associated with the memory segment. Below are examples:

$ dfile_join -i sw2_non_bsa_swap -j swz_sbscr_hist -o subs_act_bsa \
    -k sbscr_nbr,cust_sys_cd -f bsa_id,acct_type_cd,acct_sub_type_cd \
    -m O -s subs_act_bsa_join_status -u L -y ${FILTER}/subs_act_bsa.cfg

$ dfile_join -i dp.trvlrev -j tid_cid_owner -o dj.trvlrev \
    -k orig_tid_cid -f orig_owner_cd,orig_owner_type_cd \
    -m O -s orig_join_status -u L -a 0x0331

Sometimes it is possible to combine multiple join operations into one execution of dfile_join. This requires join information to be passed to the utility in a control file at run time using command line argument -c. The following is an example:

$ dfile_join -c join.ctl -t %p=000

$ cat join.ctl
( ( input
    ( dfile subscription
        ( map-fields
            ( ( input geog_cd ) ( output bsa ) )
            ( ( input sbscrp_eff_dt ) ( output orig_eff_dt ) ) ) ) )

( join
    ( dfile acct_sbscrp
        ( key-fields ( sbscrp_id ) )
        ( copy-fields ( acct_nbr ) )
        ( outer-join )
        ( unique-join last-record ) )

    ( dfile sbscrp_phone_nbr
        ( key-fields ( sbscrp_id ) )
        ( copy-fields ( npa_nbr nxx_nbr line_nbr reason_cd ) )
        ( map-fields
            ( ( join reason_cd ) ( output sbscrp_phone_nbr_reason_cd ) ) )
        ( outer-join )
        ( unique-join last-record ) )

    ( dfile sbscrp_svc_plan_agrmt
        ( key-fields ( sbscrp_id ) )
        ( copy-fields ( svc_plan_cd ) )
        ( map-fields
            ( ( join svc_plan_cd ) ( output pkg_svc_name ) ) )
        ( where ( = $svc_plan_level_cd P ) )
        ( outer-join )
        ( unique-join last-record ) )

    ( dfile off_owner_cd
        ( key-fields ( bsa_id ) )
        ( copy-fields ( owner_cd ) )
        ( map-fields
            ( ( input geog_cd ) ( join bsa_id ) ) )
        ( outer-join ( status-field off_owner_cd_join_status ) )
        ( ipc-key 0x0333 ) ) )

( output
    ( dfile telephone_dim ) ) )

In the above example, all joins that do not specify an ipc-key use the Sort-Merge Join method. It is necessary that these joins use the same key field, sbscrp_id. As mentioned earlier, Sort-Merge Join operations require records to be consistently sorted; in this case, records must be ordered by sbscrp_id. Joins with ipc-key may have different key fields since their records are pre-loaded in a UNIX shared memory segment for searching. Since this example uses bsa_id as the key field for the shared memory segment, records in shared memory are expected to be sorted by bsa_id.

An additional feature available in control files is the ability to map fields between dfiles when field names do not match. In the example input entry, input fields geog_cd and sbscrp_eff_dt are mapped to output fields bsa and orig_eff_dt respectively. Other join entries demonstrate mapping fields between input and join dfiles as well as join and output dfiles.

Grammar for control file is as follows:

START → ( <input_section> <join_section> <output_section> )

input_section → ( <input> ( <dfile> <dfile_name> <input_dfile_options> ) )

join_section → ( <join> <dfile_join_list> )

output_section → ( <output> ( <dfile> <dfile_name> <output_dfile_options> ) )

input_dfile_options → <null>
        | <map_fields_option>
        | <record_filter>

dfile_join_list → <join_dfile> <dfile_join_list>
        | <join_dfile>

join_dfile → ( <dfile> <dfile_name> <define_key_fields> <join_dfile_options> )

join_dfile_options → <null>
        | <map_fields_option>
        | <record_filter>
        | <copy_fields_option>
        | <join_method_option>
        | <unique_join_option>
        | <ipc_key_option>

map_fields_option → ( <map_fields> <field_map_list> )

field_map_list → <field_map> <field_map_list>
        | <field_map>

field_map → ( <define_field_map> <define_field_map> )

define_field_map → ( <record_source> <field_name> )

record_source → <input>
        | <join>
        | <output>

define_key_fields → ( <key_fields> ( <field_list> ) )

copy_fields_option → ( <copy_fields> ( <field_list> ) )

field_list → <field_name> <field_list>
        | <field_name>

join_method_option → ( <inner_join> )
        | ( <outer_join> )

unique_join_option → ( <unique_join> <unique_join_selection> )

unique_join_selection → <first_record>
        | <last_record>

output_dfile_options → <null>
        | <record_filter>

ipc_key_option → ( <ipc_key> <hex_number> )

input → [Ii][Nn][Pp][Uu][Tt]

join → [Jj][Oo][Ii][Nn]

output → [Oo][Uu][Tt][Pp][Uu][Tt]

dfile → [Dd][Ff][Ii][Ll][Ee]

dfile_name → [-_.a-zA-Z0-9]+

field_name → [-_.a-zA-Z0-9]+

map_fields → [Mm][Aa][Pp][-][Ff][Ii][Ee][Ll][Dd][Ss]

copy_fields → [Cc][Oo][Pp][Yy][-][Ff][Ii][Ee][Ll][Dd][Ss]

key_fields → [Kk][Ee][Yy][-][Ff][Ii][Ee][Ll][Dd][Ss]

inner_join → [Ii][Nn][Nn][Ee][Rr][-][Jj][Oo][Ii][Nn]

outer_join → [Oo][Uu][Tt][Ee][Rr][-][Jj][Oo][Ii][Nn]

unique_join → [Uu][Nn][Ii][Qq][Uu][Ee][-][Jj][Oo][Ii][Nn]

first_record → [Ff][Ii][Rr][Ss][Tt][-][Rr][Ee][Cc][Oo][Rr][Dd]

last_record → [Ll][Aa][Ss][Tt][-][Rr][Ee][Cc][Oo][Rr][Dd]

ipc_key → [Ii][Pp][Cc][-][Kk][Ee][Yy]

hex_number → 0[Xx][0-9]+

record_filter → *** described in Overview section ***

dfile_agfunc

Typical data summarization operations can be performed using dfile_agfunc. This utility offers aggregate functions similar to those available in SQL GROUP BY queries. Command line argument -k followed by a list of field names specifies key fields containing the data values used to group records during summarization. The utility expects records to be pre-sorted by the fields listed with the -k argument. If the -k argument is not specified at run time, the utility summarizes all records in the data file as one group. Command line argument -i followed by a dfile name specifies the input dfile, and the -o argument followed by a dfile name specifies the output dfile. Input records may be filtered by providing a -x argument followed by a UNIX file name containing a record filter; output records can be filtered with a -z argument followed by a UNIX file name containing record filter rules. Below is a table describing each aggregate function.

FUNCTION          COMMAND LINE USAGE       REMARKS
average           -a sfield,rfield[,%g]    sfield is the input field name and rfield is the output field name. An optional printf format may be specified.
count             -c rfield                rfield is the output field name.
ASCII minimum     -m sfield,rfield         sfield is the input field name and rfield is the output field name.
ASCII maximum     -M sfield,rfield         sfield is the input field name and rfield is the output field name.
numeric minimum   -n sfield,rfield         sfield is the input field name and rfield is the output field name.
numeric maximum   -N sfield,rfield         sfield is the input field name and rfield is the output field name.
sum               -s sfield,rfield[,%g]    sfield is the input field name and rfield is the output field name. An optional printf format may be specified.


$ dfile_agfunc -i dfile_sort.swy_swz_trigger -o max_swy_swz_trigger -M snpsht_dt,snpsht_dt

$ cat bsa_swap_dual_record.ctl
( where ( = $record_count '2 ) )

$ dfile_agfunc -i dfile_sort_basic.sw2_rule_engn_outpt_acty \
    -o dfile_agfunc.sw2_dual_records \
    -k sbscr_nbr,cust_sys_cd -c record_count -z bsa_swap_dual_record.ctl

$ dfile_agfunc -i filter_nonoff_trvl_usage.postpaid_trvl_usage_3g \
    -o dfile_agfunc.postpaid_trvl_usage_3g \
    -k sw_id,orig_owner_cd,billing_owner_cd,bill_year,bill_month,sys_source_cd,cycle_cd \
    -s usage_qty,usage_qty,%.0f -s chg,chg,%.2f

The examples above demonstrate usage of the utility. In the first example, the greatest snpsht_dt value in input dfile dfile_sort.swy_swz_trigger is written to output dfile max_swy_swz_trigger. The second example identifies sbscr_nbr and cust_sys_cd values having a record count of two. The last example calculates sub-totals for usage_qty and chg.

dfile_unique

Sometimes it is necessary to purge duplicate records based on key field values. One common example involves applying record updates to a file: after delta records are sorted and merged with existing records, out of date records must be purged. Purging may be performed using utility dfile_unique. Records in this example are expected to be sorted by key field values plus the date on which the record was originally created or its field values last changed. The utility retains only the last record of each record group per composite key value. The following is an example:

$ dfile_unique -k ban,subscriber_no -i dfile_sort.subscriber_dim -o subscriber_dim -t %p=000

In the above example, input records are expected to be pre-sorted with ban and subscriber_no as the first two fields in the sort key. Only the last input record per unique ban and subscriber_no value combination is written to the output dfile.

dfile_transform

The need to make simple data value changes is common. Utility dfile_transform can make single step data changes. It requires transformations to be defined in a control file. Each transformation record defined in the control file uses the colon character (:) as a field delimiter. Control records contain the following information:

  1. input field name - dfile field name containing the data value to search and transform.
  2. transform type
    1. R)eplace - replaces entire value
    2. S)ubstitute all occurrences of value
    3. B)asic regular expression
    4. E)xtended regular expression
  3. search - Used to identify specific values for transformation.
  4. replace - New value to replace original value.
  5. output field name - Optional field name; when specified, the original data value is not overwritten.

These regular expressions have a feature also found in UNIX sed. If a regular expression contains \( \), the data matched inside the parentheses can be copied into the replacement value using \1. Additional parentheses may be used and may be referenced in the replacement value using the corresponding \2, \3, ... \n token.

The following is an example using dfile_transform:

$ cat emp_transform.ctl
emp_start_dt:B:^\(20[0-9][0-9]\)\([01][0-9]\)\([0-3][0-9]\):\2/\3/\1:fmt_emp_start_dt
emp_end_dt:B:^\(20[0-9][0-9]\)\([01][0-9]\)\([0-3][0-9]\)$:\2/\3/\1:fmt_emp_end_dt
ssn:B:^\([0-9]\{3\}\)\([0-9]\{2\}\)\([0-9]\{4\}\)$:\1-\2-\3:fmt_ssn
birthday:B:^\([12][0-9]\{3\}\)\([01][0-9]\)\([0-3][0-9]\)$:\2/\3/\1:fmt_birthday
phone_nbr:B:^\([0-9]\{3\}\)\([0-9]\{3\}\)\([0-9]\{4\}\)$:(\1) \2-\3:fmt_phone_nbr

$ dfile_transform -c emp_transform.ctl -i emp.100 -o emp.110

In the above example, control file emp_transform.ctl contains transformation definitions. Input data is read from dfile emp.100 and results written to emp.110.

dfile_diff

Utility dfile_diff compares two dfiles and identifies differences. It requires both dfiles to be pre-sorted by their record key values. One dfile is chosen to contain the original data values using command line argument -i. The dfile containing changed data values is specified with argument -j. Argument -k is used to list key fields. Command line arguments -a, -b, and -c allow output dfiles to be specified for added, deleted, and changed records. While the added and deleted record layouts are expected to contain all fields found in the input dfiles, the output dfile reporting changed records contains only key values and a sequence number field. Sequence number values may be used to join with an additional output change record dfile containing the name of the field whose value changed along with the original and subsequent values. This additional output change record dfile is specified with command line argument -l. Utility dfile_diff expects the sequence number field to be defined as change_seq_nbr. Other fields required in the change per field output dfile include field_name, original_value and subsequent_value. A sketch of an invocation follows.
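
The following is a sketch of a possible invocation; the dfile names are hypothetical, each would need an entry in dfile.cfg, and the diff.changed_account_field layout would define change_seq_nbr, field_name, original_value and subsequent_value:

$ dfile_diff -i prev.account -j curr.account -k acct_nbr \
    -a diff.added_account -b diff.deleted_account -c diff.changed_account \
    -l diff.changed_account_field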

fixed2dfile

Utility fixed2dfile converts fixed length record files to dfiles. This requires a control file to map location and length of values from input record to output field names. The following is an example:

$ cat a.ctl
##
## 1. dfile field name
## 2. start field position (0 to reclen-1)
## 3. field length (1 to reclen)
## 4. trim spaces function ([L]eft or [R]ight)
##
##
f1:0:4:R
f2:4:3:R
f3:7:2:R
f4:9:1:R

$ cat a.dat
aaaabbbccd
a   b  c d
   a  b cd

$ cat cfg/dfile.cfg
a:|:10::a.cfg:tmp/a.dat

$ cat cfg/a.cfg
f1
f2
f3
f4

$ fixed2dfile -i a.dat -c a.ctl -o a -l 11

$ cat tmp/a.dat
aaaa|bbb|cc|d
a|b|c|d
   a|  b| c|d

In this example the input data file is a.dat, control file is a.ctl, output dfile is a, and fixed record length of input file is 11 bytes.

nsplit

Utility nsplit efficiently splits large ASCII data files into smaller data files. Output files will be roughly the same size. Since this utility does not actually parse data records, it does not use the DFile library. The following is an example:

$ nsplit -i /etc/passwd -n 4 -o passwd.%d.dat

$ ls -l /etc/passwd passwd.?.dat
-rw-r--r-- 1 root root 1431 Apr 26 21:28 /etc/passwd
-rw-r--r-- 1 keith keith 386 Oct 23 11:44 passwd.0.dat
-rw-r--r-- 1 keith keith 375 Oct 23 11:44 passwd.1.dat
-rw-r--r-- 1 keith keith 417 Oct 23 11:44 passwd.2.dat
-rw-r--r-- 1 keith keith 253 Oct 23 11:44 passwd.3.dat

DFILE Library

Application Program Interface (API)

All DFILE Tools utilities use the DFILE Library as a common software library to read and write data files. The library contains a set of C language functions that serve as an API between application programs and the library.

#include "dfile.h"

int dfile_cfg( dfile_cfg_t *dfile_cfg, const char *dfile_name );

dfile_t *dfile_read_open( const dfile_cfg_t *cfg,
    dfile_bind_t *program_bind, unsigned short program_bind_count,
    const dfile_tag_t *file_name_tag, unsigned short file_name_tag_count,
    unsigned short blocks_per_buffer_count, unsigned short buffer_count );

int dfile_read( dfile_t *dfile );

int dfile_read_close( dfile_t *dfile );

dfile_t *dfile_write_open( const dfile_cfg_t *cfg,
    const dfile_bind_t *program_bind, unsigned short program_bind_count,
    const dfile_tag_t *file_name_tag, unsigned short file_name_tag_cnt,
    unsigned short blocks_per_buffer_count, unsigned short buffer_count,
    dfile_open_mode_t open_mode );

int dfile_write( dfile_t *dfile );

int dfile_write_close( dfile_t *dfile );

The dfile_cfg() function gets configuration file information associated with dfile_name entry. Return value is zero when successful and -1 when a failure occurs. Argument dfile_cfg is a pointer to structure dfile_cfg_t that contains the following:

typedef struct {
    char field_separator;
    char record_separator;
    char separator_escape;
} dfile_rec_t;

typedef struct {
    const char *field_name;
    char **field_buffer;
    size_t *field_length;
} dfile_bind_t;

typedef struct {
    const char *dfile_name;
    dfile_rec_t rec_attribute;
    const char *record_layout_path;
    const char **field;
    dfile_bind_t *bind;
    unsigned short bind_cnt;
    void *bind_hash_table;
    const char *data_file_path;
} dfile_cfg_t;

dfile_read_open() creates and returns a structure to be used when calling dfile_read() to read records. Function argument cfg is a pointer to a structure that will usually be populated by previously calling function dfile_cfg(). program_bind is an array of structures used to bind C program variables with parsed field data in DFile buffers. If program_bind is a null pointer, the bind structure populated during dfile_cfg() containing all fields in record will be used. Argument program_bind_cnt is the number of C program variables in the program_bind array. file_name_tag is an array of the following structure:

typedef struct {
    const char *tag;
    const char *tag_value;
} dfile_tag_t;

Argument file_name_tag_cnt is the number of entries in the file_name_tag array. Function argument blocks_per_buffer_count is the number of file system blocks accessed per I/O operation. Argument buffer_count defines the number of I/O buffers to be used during processing. A value greater than one causes I/O and compression operations to be threaded.

Function dfile_read() causes the C program variables to contain the values of the next sequentially parsed record. Its dfile argument corresponds to the value returned by dfile_read_open().

The dfile_read_close() function closes the open file and releases I/O buffer memory. Its dfile argument corresponds to the value returned by dfile_read_open().

dfile_write_open() creates and returns a structure to be used when calling dfile_write() to write records. The first seven arguments are consistent with the first seven arguments of dfile_read_open(). open_mode can be Dfile_append or Dfile_trunc depending on whether an existing file is to be appended or truncated.

Function dfile_write() causes values contained in C program variables to be formatted into a data record and written. Its dfile argument corresponds to the value returned by dfile_write_open().

The dfile_write_close() function flushes I/O buffers to disk, closes the output file and releases I/O buffer memory. Its dfile argument corresponds to the value returned by dfile_write_open().

Upon successful completion, most functions return 0. Failures are identified by return code -1. The exceptions, dfile_read_open() and dfile_write_open(), return valid address pointers when successful; otherwise null pointers are returned. The end of data condition in dfile_read() can be distinguished from an error by checking structure variable dfile->error to verify it contains the value Dfile_all_data_processed.

Binding Program Variables

C program variables are bound to record fields using structure dfile_bind_t. Variable field_name points to a string containing a field name. This field name is not case sensitive and should reference a field defined in the record layout configuration file. The address of the program variable used to reference field data is assigned to field_buffer. When field_length is optionally set with the address of a program variable, dfile_read() populates that variable with the length of the parsed field value. Setting the program variable assigned to field_length before calling dfile_write() eliminates the need to null terminate the value referenced by field_buffer. Field length variables also improve processing efficiency when the field length is already known. Variable field_offset contains a field's offset into a record layout; the first field in a record has an offset of zero. The following is an example of binding C program variables without field lengths:

    static char *sbscrp_id, *svc_plan_cd, *eff_dt, *expr_dt;
    static char *cntrct_start_dt, *cntrct_end_dt;

    static dfile_bind_t sali_field[] = {
        { "sbscrp_id", &sbscrp_id },
        { "svc_plan_cd",&svc_plan_cd },
        { "eff_dt", &eff_dt },
        { "expr_dt", &expr_dt },
        { "cntrct_start_dt", &cntrct_start_dt },
        { "cntrct_end_dt", &cntrct_end_dt, }
    };

    const unsigned short sali_field_cnt = sizeof( sali_field ) / sizeof( dfile_bind_t );

The following is an example of binding C program variables with field lengths:

    static char *sbscrp_id, *svc_plan_cd, *eff_dt, *expr_dt;
    static char *cntrct_start_dt, *cntrct_end_dt;

    static size_t sbscrp_id_len, svc_plan_cd_len, eff_dt_len, expr_dt_len;
    static size_t cntrct_start_dt_len, cntrct_end_dt_len;

    static dfile_bind_t sali_field[] = {
        { "sbscrp_id", &sbscrp_id, &sbscrp_id_len },
        { "svc_plan_cd",&svc_plan_cd, &svc_plan_cd_len },
        { "eff_dt", &eff_dt, &eff_dt_len },
        { "expr_dt", &expr_dt, &expr_dt_len },
        { "cntrct_start_dt", &cntrct_start_dt, &cntrct_start_dt_len },
        { "cntrct_end_dt", &cntrct_end_dt, &cntrct_end_dt_len }
    };

    const unsigned short sali_field_cnt = sizeof( sali_field ) / sizeof( dfile_bind_t );

Buffering

The DFile library's I/O buffering system can process ASCII or GZIP formatted data. Its determination to apply data compression is based on the opened file name at run time. If a file name is suffixed with '.gz', GZIP compression routines are applied. This results in compressed data files that are compatible with GNU's gzip utility.

If I/O or compress/uncompress operations are application bottlenecks, an additional execution thread dedicated to I/O and compression may be started by opening files with multiple buffers. Data is passed between threads using circular buffer queues. Enough buffers should be allocated to minimize contention. Single buffer processing is slightly more efficient since there is no thread processing overhead; typically the extra thread is not worthwhile unless compressed files are being written.

When opening a DFile, an application can control buffer size by specifying a block multiple. A block is based on the amount of data the UNIX file system containing the DFile prefers to communicate. The data in a record cannot exceed the buffer size, so applications that process large records should choose a block multiple that creates buffers larger than the largest expected record. Also, there is a slight processing overhead associated with buffer rotation; larger buffers incur fewer rotations.
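
As a sketch, reusing the cfg, field, and output_tag variables from the passwd.c example in the next section and assuming the opasswd entry's data file path ended in .gz so compression is active, the output dfile could be opened with larger and multiple buffers (the counts shown are illustrative, not recommendations):

    const unsigned short blocks_per_buffer_cnt = 8;   /* larger buffers mean fewer rotations */
    const unsigned short buffer_cnt = 4;              /* more than one buffer threads I/O and compression */

    output_dfile = dfile_write_open( &cfg, field, field_cnt,
        output_tag, output_tag_cnt,
        blocks_per_buffer_cnt, buffer_cnt, Dfile_trunc );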

API Examples

The following is an example of reading the UNIX password file.

$ grep passwd dfile.cfg
ipasswd:\::10::${PARM}/passwd.cfg:/etc/passwd
opasswd::::${PARM}/passwd.cfg:%d/passwd.dat

$ cat ${PARM}/passwd.cfg
user_name
password
user_id
group_id
comment
home
shell

$ cat passwd.c
#include <stdio.h>
#include "tbox.h"
#include "dfile.h"

int main( void )
{
    static char *user_name, *pass_word, *user_id, *group_id;
    static char *comment, *home_directory, *default_shell;

    static dfile_bind_t field[] = {
        { "user_name", &user_name },
        { "password", &pass_word },
        { "user_id", &user_id },
        { "group_id", &group_id },
        { "comment", &comment },
        { "home", &home_directory },
        { "shell", &default_shell }
    };
    const unsigned short field_cnt = sizeof( field ) / sizeof( dfile_bind_t );

    static dfile_tag_t output_tag[] = {
        { "%d", "/tmp" }
    };
    const unsigned short output_tag_cnt = sizeof( output_tag ) / sizeof( dfile_tag_t );

    dfile_cfg_t cfg;
    dfile_t *input_dfile, *output_dfile;
    const unsigned short blocks_per_buffer_cnt = 1;
    const unsigned short buffer_cnt = 1;

    if ( dfile_cfg( &cfg, "ipasswd" ) == -1 ) {
        return 5;
    }

    input_dfile = dfile_read_open( &cfg, field, field_cnt, (dfile_tag_t *)0, (unsigned short)0, blocks_per_buffer_cnt, buffer_cnt );

    if ( input_dfile == (dfile_t *)0 ) {
        return 10;
    }

    if ( dfile_cfg( &cfg, "opasswd" ) == -1 ) {
        return 15;
    }

    output_dfile = dfile_write_open( &cfg, field, field_cnt, output_tag, output_tag_cnt, blocks_per_buffer_cnt, buffer_cnt, Dfile_trunc );

    if ( output_dfile == (dfile_t *)0 ) {
        return 20;
    }

    (void) puts( "USER NAME  USER ID  GROUP ID         COMMENT           HOME DIRECTORY" );
    (void) puts( "---------  -------  ---------  --------------------- --------------------" );

    while ( dfile_read( input_dfile ) == 0 ) {
        (void) printf( "%-8.8s%10.10s%11.11s %-24.24s%-16.16s\n", user_name, user_id, group_id, comment, home_directory );

        if ( dfile_write( output_dfile ) == -1 ) {
            return 25;
        }
    }

    if ( input_dfile->error != Dfile_all_data_processed ) {
        (void) fputs( "Failed to read all records.\n", stderr );
        return 30;
    }

    (void) printf( "\nrecord count %lu\n", input_dfile->file_rec_cnt );

    (void) dfile_write_close( output_dfile );
    (void) dfile_read_close( input_dfile );

    return 0;
}

$ ./passwd
USER NAME  USER ID  GROUP ID         COMMENT           HOME DIRECTORY
---------  -------  ---------  --------------------- --------------------
root             0          0  Charlie &             /root
...

$ head -1 /etc/passwd
root:*:0:0:Charlie &:/root:/bin/csh

$ od -c -x /tmp/passwd.dat
0000000  004   r   o   o   t 001   * 001   0 001   0  \t   C   h   a   r
0000020    l   i   e       & 005   /   r   o   o   t  \b   /   b   i   n
0000040    /   c   s   h
...

Generic programs allow specific DFiles to be chosen at run time. This flexibility requires extra bind structure coding. The following is a simple example:

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include "tbox.h"
#include "dfile.h"
#include "dcat.h"

/*
** This program displays dfile data.
*/
int main( int argc, char **argv )
{
    const unsigned short blocks_per_buffer_cnt = 2;
    const unsigned short buffer_cnt = 1;

    const char *dfile_name;
    char field_separator, heading_flag;
    dfile_t *dfile;
    dfile_tag_t *tag_tbl;
    unsigned short tag_tbl_cnt;
    dfile_cfg_t cfg;
    dfile_bind_t *bind_tbl;
    unsigned short ndx;

    if ( get_args( argc, argv, &dfile_name, &tag_tbl, &tag_tbl_cnt, &heading_flag, &field_separator ) == -1 ) {
        return 10;
    }

    if ( dfile_cfg( &cfg, dfile_name ) == -1 ) {
        return 20;
    }

    if ( heading_flag == 'Y' ) {
        /*
        ** Print field name heading.
        */
        for ( ndx = 0; ndx < cfg.field_cnt - 1; ++ndx ) {
            (void) fputs( cfg.field[ ndx ], stdout );
            (void) fputc( field_separator, stdout );
        }

        (void) fputs( cfg.field[ ndx ], stdout );
        (void) fputc( '\n', stdout );
    }

    dfile = dfile_read_open( &cfg, (dfile_bind_t *)0, (unsigned short)0, tag_tbl, tag_tbl_cnt, blocks_per_buffer_cnt, buffer_cnt );

    if ( dfile == (dfile_t *)0 ) {
        return 30;
    }

    bind_tbl = dfile->bind;

    while ( dfile_read( dfile ) == 0 ) {
        /*
        ** Print field values.
        */
        for ( ndx = 0; ndx < cfg.field_cnt - 1; ++ndx ) {
            (void) fputs( *bind_tbl[ ndx ].field_buffer, stdout );
            (void) fputc( field_separator, stdout );
        }

        (void) fputs( *bind_tbl[ ndx ].field_buffer, stdout );
        (void) fputc( '\n', stdout );
    }

    if ( dfile->error != Dfile_all_data_processed ) {
        fput_src_code( __FILE__, __LINE__, stderr );
        (void) fputs( "Failed to read all data.\n", stderr );
        return 40;
    }

    (void) dfile_read_close( dfile );

    return 0;
}


#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include "tbox.h"
#include "dfile.h"
#include "dcat.h"

static void    print_usage( const char * );

/*
** This function processes the command line arguments.
*/
int get_args( int argc, char * const argv[], const char **dfile_name, dfile_tag_t **tag_tbl, unsigned short *tag_tbl_cnt, char *heading_flag, char *field_separator )
{
    int ch;
    extern char *optarg;

    assert( argv != (char * const *) 0 );
    assert( dfile_name != (const char **) 0 );
    assert( tag_tbl != (dfile_tag_t **) 0 );
    assert( tag_tbl_cnt != (unsigned short *) 0 );
    assert( heading_flag != (char *) 0 );
    assert( field_separator != (char *) 0 );

    *dfile_name = (const char *)0;
    *tag_tbl = (dfile_tag_t *)0;
    *tag_tbl_cnt = (unsigned short)0;
    *heading_flag = 'N';
    *field_separator = '|';

    while ( ( ch = getopt( argc, argv, "F:ht:" ) ) != EOF ) {
        switch ( ch ) {
        case 'h':
            *heading_flag = 'Y';
            break;
        case 'F':
            *field_separator = *optarg;
            break;
        case 't':
            if ( parse_tag( tag_tbl, tag_tbl_cnt, optarg ) == -1 ) {
                return -1;
            }
            break;
        default:
            print_usage( argv[ 0 ] );
            return -1;
        }
    }

    if ( optind >= argc ) {
        fput_src_code( __FILE__, __LINE__, stderr );
        (void) fputs( "Must specify input dfile name.\n", stderr );
        print_usage( argv[ 0 ] );
        return -1;
    }

    *dfile_name = argv[ optind ];

    return 0;
}

static void print_usage( const char *exec_name )
{
    (void) fputs( "usage: ", stderr );
    (void) fputs( exec_name, stderr );
    (void) fputs( " [-F]", stderr );
    (void) fputs( " [-h]", stderr );
    (void) fputs( " [-t %x=abc]", stderr );
    (void) fputc( '\n', stderr );
    (void) fputs( "\t-F -> field separator (default |)\n", stderr );
    (void) fputs( "\t-h -> field heading\n", stderr );
    (void) fputs( "\t-t -> DFile path tags\n", stderr );
}

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include "tbox.h"
#include "dfile.h"
#include "dcat.h"

int parse_tag( dfile_tag_t **tag_tbl, unsigned short *tag_tbl_cnt, char *tag_str )
{
    size_t alloc_size;
    dfile_tag_t *new;

    assert( tag_tbl != (dfile_tag_t **)0 );
    assert( tag_tbl_cnt != (unsigned short *)0 );
    assert( tag_str != (char *)0 );

    /*
    ** This function expects tag_str to contain a tag in the form of
    ** %x=value.
    */
    if ( tag_str[ 0 ] != '%' || tag_str[ 2 ] != '=' ) {
        fput_src_code( __FILE__, __LINE__, stderr );
        (void) fputs( "tag [", stderr );
        (void) fputs( tag_str, stderr );
        (void) fputs( "] is not in correct format.\n", stderr );
        return -1;
    }

    /*
    ** Replace '=' with null character.
    */
    tag_str[ 2 ] = (char)0;

    alloc_size = sizeof( dfile_tag_t ) * ( (size_t)*tag_tbl_cnt + (size_t)1 );
    new = (dfile_tag_t *)realloc( *tag_tbl, alloc_size );
    if ( new == (dfile_tag_t *)0 ) {
        unix_error( "realloc() failed", __FILE__, __LINE__ );
        return -1;
    }

    new[ *tag_tbl_cnt ].tag = tag_str;
    new[ *tag_tbl_cnt ].tag_value = &tag_str[ 3 ];

    *tag_tbl = new;
    ++*tag_tbl_cnt;

    return 0;
}

Compiling Programs

C source files that reference dfile routines and data structures should include the following header file:

#include "dfile.h"

The command for the link step to create an executable must include the following arguments:

-ldfile -ltbox -lz -lpthread
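
For example, the passwd.c program shown earlier could be built with a command like the following; the include and library directories are hypothetical, so substitute the paths used by your installation:

$ cc -o passwd passwd.c -I${DFILE_HOME}/include -L${DFILE_HOME}/lib \
    -ldfile -ltbox -lz -lpthread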

Software library Dfile is dependent on library Tbox. Short for tool box, library Tbox contains general purpose routines needed for common programming tasks such as data sorting and searching.

Software libraries dependent on library Dfile are Dfile_dynamic, Dfile_utility and Where. Library Dfile_dynamic contains helpful routines for processing DFiles without field names being hard coded in C programs. Library Dfile_utility contains common routines used in DFILE Tools utilities described earlier. Also, library Where is used in most DFILE Tools utilities. This library allows records to be filtered (discarded) during read and write operations. Library Where contains an interpreter to evaluate conditional expressions and is dependent on library Sexpr. Library Sexpr parses S-expressions and loads results into a tree structure.