pdr Reference
At the moment the following types of data sources are available:
- command line, e-mail (POP3 and IMAP) and text file: these three data sources work with expressions
- CSV file and XML file: these data sources work with specific data formats in files
Input per command line
The simplest (and least comfortable) way to get data into the system
is the pdr command line, that is, the invocation of pdr itself.
Nothing needs to be configured for this.
pdr has the command line option -e (--expression), which specifies an
expression. This option can be used multiple times. Moreover, all
characters after pdr that are neither a command line option nor the
argument of one are combined into one big expression and processed at once.
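The following is a minimal Python sketch of this argument handling, not pdr's actual implementation. The function name collect_expressions is hypothetical, and joining the leftover words with a single space is an assumption; the text above only says that they are combined into one expression.

```python
import argparse


def collect_expressions(argv):
    """Return the expressions found on a pdr-like command line.

    Hypothetical helper: every -e/--expression argument is kept as one
    expression, and all remaining words are joined (assumption: with a
    single space) into one more expression.
    """
    parser = argparse.ArgumentParser(prog="pdr")
    parser.add_argument("-e", "--expression", action="append", default=[])
    parser.add_argument("rest", nargs="*")
    # parse_intermixed_args lets options and free words appear in any order
    args = parser.parse_intermixed_args(argv)

    expressions = list(args.expression)
    if args.rest:
        expressions.append(" ".join(args.rest))
    return expressions


if __name__ == "__main__":
    # "5.3 8i" is the data line used in the e-mail example of this reference
    print(collect_expressions(["-e", "5.3 8i", "5.3", "8i"]))
```

The shell splits the unquoted words 5.3 and 8i into two arguments, which the sketch puts back together into a single expression.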
If an expression on the command line doesn't have a timestamp the
current
date and time will be used.
If processing fails because an expression is incorrect, pdr prints a
message. The expression is not transferred into the rejections.
Input per mail (POP3 and IMAP)
To use e-mail mailboxes we assume that the data (mails) have arrived in
the mailbox and that they are not processed by any other application.
These mails must have the following properties:
- a unique subject
- a usable timestamp (normally the SMTP server adds one when the mail is sent)
- plain, continuous ASCII text (no HTML, RTF ...)
- text consisting entirely of expressions
If an e-mail data source is configured, pdr queries the mail server at
its next invocation. pdr checks whether there are mails on the server,
checks their subject and processes matching e-mails one by one, line by
line; each line is an expression. If a line has its own timestamp, that
timestamp takes priority. Otherwise the timestamp of the e-mail applies
implicitly. This is very handy because you normally never have to enter
a timestamp manually in ordinary single-line e-mails.
Here's a complete e-mail source:
From: superhero <Mymail@gmx.net>
To: MyMail@gmx.net
Subject: Q
Date: Thu, 04 Feb 2010 17:56:11 +0100
Message-ID: <87pr4ley8k.fsf@castor.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

5.3 8i
Normally most of the values in the header lines are taken from
defaults. Date and Message-ID are added by the server, MIME-Version and
Content-Type come from the e-mail client application. The only parts
that really have to be entered are the subject (that's why it should be
short, here the single letter Q) and the content of the message, the
data line.
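How such a mail could be turned into expressions can be sketched in Python with the standard email package. This is only an illustration under assumptions: the function name expressions_from_mail is hypothetical, empty lines are skipped, and the sketch does not try to detect a timestamp inside a line (the expression syntax is not described here), so every line simply receives the Date header of the mail.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# The sample mail from this section; a blank line separates header and body.
RAW_MAIL = """\
From: superhero <Mymail@gmx.net>
To: MyMail@gmx.net
Subject: Q
Date: Thu, 04 Feb 2010 17:56:11 +0100
Message-ID: <87pr4ley8k.fsf@castor.ch>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

5.3 8i
"""


def expressions_from_mail(raw_mail, expected_subject):
    """Return (timestamp, expression) pairs for a matching mail.

    Hypothetical helper: mails with a different subject yield nothing,
    and each non-empty body line becomes one expression stamped with
    the Date header of the mail.
    """
    message = message_from_string(raw_mail)
    if message["Subject"] != expected_subject:
        return []
    mail_time = parsedate_to_datetime(message["Date"])
    body = message.get_payload()  # plain text, not multipart
    return [(mail_time, line.strip())
            for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    for timestamp, expression in expressions_from_mail(RAW_MAIL, "Q"):
        print(timestamp.isoformat(), expression)
```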
On POP3 servers processed e-mails are deleted from the server
regardless of whether processing succeeded, so they are never processed
a second time. This deletion can be suppressed in the configuration. On
IMAP servers the user can configure whether the mails are deleted or
marked as read. In the latter case the mails remain on the server and
can still be archived.
If processing fails because an expression is incorrect, pdr transfers
the expression into the rejections and prints a message.
Input per text file
If a text file is used for data input, every line counts as an
expression. This method is practical if you collect data during a period
in which you have no opportunity to transmit them online, so you have to
collect them in a file manually, expression by expression.
Lines starting with # are not processed.
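As a small illustration, the line handling described above might look like this in Python. The function name read_expressions is hypothetical, and skipping empty lines is an assumption that the text does not confirm.

```python
def read_expressions(lines):
    """Return the expressions contained in the lines of a text file.

    Hypothetical helper: lines starting with # are ignored; empty lines
    are skipped as well (an assumption of this sketch).
    """
    expressions = []
    for line in lines:
        if line.startswith("#"):
            continue  # not processed
        text = line.strip()
        if text:
            expressions.append(text)
    return expressions


if __name__ == "__main__":
    sample = ["# collected by hand\n", "5.3 8i\n", "\n"]
    print(read_expressions(sample))
```

An open file object can be passed directly, since iterating over it yields its lines.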
If processing fails because an expression is incorrect, pdr prints a
message. The expression is not transferred into the rejections.
Text files that are configured as data sources are deleted once they
have been processed successfully, so they are not processed a second
time. This deletion can be suppressed in the configuration.
Input per CSV file
The abbreviation CSV means "comma separated values". Instead of the
comma, pdr also accepts the semicolon and the tab character as separator
between the values.
There are two different ways to tell pdr which comma separated value
goes into which collection:
- a control line in the CSV file preceding the data lines
- a control line in the configuration file, valid for the entire CSV file
In the first case a pdr
CSV file would have the following structure:
control line
data line1
[...]
data lineN
control line
data line1
[...]
data lineN
[...]
This use of control lines is unusual but gives us the desired
flexibility and openness. Normally you can insert them easily by hand
or with a program like sed.
In the second case the CSV file would contain only data lines as
expected.
A control line has the following structure:
[# pdr] datetime [separator collection]+
Example:
# pdr datetime, *, n, l; h; q»p, #
(» stands for a tab character)
This is a control line for data lines with a timestamp and seven values
for the collections *, n, l, h, q, p and #.
Each control line in a CSV file is recognized by its prefix # pdr; a
control line in a configuration file doesn't need this prefix. The
following keyword datetime marks the position of the timestamp on the
data lines. It doesn't have to be at the beginning, but every line must
have one; there are no data values without a timestamp. The example
shows that several different separators can be used on one data line.
Data lines matching this control line would look like this:
2008-10-11 12:31:38, 5.2, 7, 8; 42.3; 12»96, first measuring
2008-10-12 12:48:08, 6.1, , 8; 53.1; 16»93,
2008-10-13 12:43:57, 5.8, 7, 7; 34.2; 15»94, third measuring
The second line has no values for the collections n and #. Where a
value is missing, simply no insert is made.
If you have CSV files containing more values than you want to import
into collections, you can declare omissions in the control line:
# pdr datetime, a, b, , , , c, d, e
Here we read a timestamp and two collections, then skip three values
on the data lines and read three more values.
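The interplay of control line and data lines can be sketched in Python as follows. This is an illustration under assumptions, not pdr's parser: the function names are hypothetical, values are returned as plain strings, and each data line is assumed to use exactly the separators declared in the control line, in the same order.

```python
import re


def parse_control_line(line):
    """Split a control line into field names and the separators between them."""
    text = line.rstrip("\r\n").lstrip()
    prefix = re.match(r"#\s*pdr\s+", text)
    if prefix:  # the prefix is optional (configuration file case)
        text = text[prefix.end():]
    parts = re.split(r"([,;\t])", text)
    fields = [part.strip() for part in parts[0::2]]  # "" marks an omission
    separators = parts[1::2]
    if fields.count("datetime") != 1:
        raise ValueError("a control line needs exactly one datetime field")
    return fields, separators


def parse_data_line(line, fields, separators):
    """Return (timestamp, [(collection, value), ...]) for one data line."""
    rest = line.rstrip("\r\n")
    values = []
    for separator in separators:
        head, found, rest = rest.partition(separator)
        if not found:
            raise ValueError("data line does not match the control line")
        values.append(head.strip())
    values.append(rest.strip())

    timestamp = None
    inserts = []
    for name, value in zip(fields, values):
        if name == "datetime":
            timestamp = value
        elif name and value:  # skip omitted columns and missing values
            inserts.append((name, value))
    if not timestamp:
        raise ValueError("every data line needs a timestamp")
    return timestamp, inserts


if __name__ == "__main__":
    # \t plays the role of the » character used in this reference
    fields, separators = parse_control_line("# pdr datetime, *, n, l; h; q\tp, #")
    print(parse_data_line("2008-10-12 12:48:08, 6.1, , 8; 53.1; 16\t93,",
                          fields, separators))
```

For the second sample data line the sketch yields no entries for n and #, matching the rule that missing values produce no inserts.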
Lines starting with # are
not processed.
While a CSV file is processed, the whole file is handled in a single
transaction. If a failure occurs, for instance because a data value on
a line doesn't match the type of the declared collection, the whole
file is discarded. Nothing is transferred into the rejections.
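The all-or-nothing behaviour can be illustrated with a small Python sketch that uses an in-memory SQLite database. Everything here is an assumption made for the illustration: the table items, its columns and the function names are hypothetical and say nothing about how pdr stores its data.

```python
import sqlite3


def make_connection():
    """Create an in-memory database with a hypothetical items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (collection TEXT, datetime TEXT, value REAL)")
    return conn


def import_numeric_values(conn, collection, rows):
    """Insert all (timestamp, value) rows or none of them."""
    try:
        with conn:  # commits on success, rolls back on an exception
            for timestamp, raw_value in rows:
                conn.execute(
                    "INSERT INTO items (collection, datetime, value) "
                    "VALUES (?, ?, ?)",
                    (collection, timestamp, float(raw_value)))
    except ValueError as error:
        print(f"file dismissed: {error}")
        return False
    return True


if __name__ == "__main__":
    conn = make_connection()
    rows = [("2008-10-11 12:31:38", "5.2"), ("2008-10-12 12:48:08", "abc")]
    print(import_numeric_values(conn, "*", rows))
    print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

The second row cannot be converted to a number, so the first row, although already inserted inside the transaction, is rolled back as well.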
CSV files that are configured as data sources are deleted once they
have been processed successfully, so they are not processed a second
time. This deletion can be suppressed in the configuration.
Input per XML file
pdr can read XML files for data input. These files are well-formed,
human-readable and editable, and are ideal for data exchange between
different software systems.
pdr defines its own, intentionally very simple format. The part of the
program responsible for it is designed to be extended to further XML formats.
The pdr XML format
The pdr XML format is completely documented in the file pdr.xsd:
<?xml version="1.0" encoding="iso-8859-1" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      pdr XML input file definition (C) T.M. Bremgarten 2010-01-31
    </xsd:documentation>
  </xsd:annotation>
  <xsd:element name="pdr">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="collection" type="collection" minOccurs="0" maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="collection">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:attribute name="datetime" type="xsd:string" />
          <xsd:attribute name="value" type="xsd:string" />
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
    <xsd:attribute name="name" type="xsd:string" use="required" />
    <!-- collection_type is referenced here but not defined in this listing -->
    <xsd:attribute name="type" type="collection_type" use="required" />
    <xsd:attribute name="purpose" type="xsd:string" />
  </xsd:complexType>
</xsd:schema>
This definition allows files that look like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<pdr>
  <collection name="#" type="text">
    <item datetime="2001-07-09 18:27:11" value="first measuring"/>
    <item datetime="2001-07-10 07:52:01" value="second measuring"/>
    <item datetime="2001-07-10 10:07:00" value="third measuring"/>
    [...]
  </collection>
  <collection name="*" type="numeric">
    <item datetime="2001-07-12 13:57:01" value="9.3"/>
    <item datetime="2001-07-12 14:46:45" value="5.6"/>
    <item datetime="2001-07-12 18:25:36" value="5.7"/>
    [...]
  </collection>
  <collection name="l" type="numeric">
    <item datetime="2001-07-03 21:41:58" value="7"/>
    <item datetime="2001-07-04 21:48:43" value="8"/>
    <item datetime="2001-07-05 21:50:49" value="7"/>
    [...]
  </collection>
</pdr>
This format is self-explanatory. The data of the collections are
specified directly and are easy to read.
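Reading a file in this format takes only a few lines. The Python sketch below uses the standard xml.etree.ElementTree module; the function name read_pdr_xml is hypothetical, and converting the values of numeric collections with float is an assumption about how such values might be interpreted.

```python
import xml.etree.ElementTree as ET

# A shortened version of the sample file shown above.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<pdr>
  <collection name="#" type="text">
    <item datetime="2001-07-09 18:27:11" value="first measuring"/>
  </collection>
  <collection name="*" type="numeric">
    <item datetime="2001-07-12 13:57:01" value="9.3"/>
    <item datetime="2001-07-12 14:46:45" value="5.6"/>
  </collection>
</pdr>
"""


def read_pdr_xml(data):
    """Return {collection name: [(datetime, value), ...]} for a pdr XML file."""
    root = ET.fromstring(data)
    if root.tag != "pdr":
        raise ValueError("root element must be <pdr>")
    collections = {}
    for collection in root.findall("collection"):
        is_numeric = collection.get("type") == "numeric"
        items = []
        for item in collection.findall("item"):
            value = item.get("value")
            if is_numeric:
                value = float(value)  # raises ValueError on a type mismatch
            items.append((item.get("datetime"), value))
        collections[collection.get("name")] = items
    return collections


if __name__ == "__main__":
    for name, items in read_pdr_xml(SAMPLE).items():
        print(name, items)
```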
While an XML file is processed, the whole file is handled in a single
transaction. If a failure occurs, for instance because a data value
doesn't match the type of a collection, the whole file is discarded.
Nothing is transferred into the rejections.
XML files that are configured as data sources are deleted once they
have been processed successfully, so they are not processed a second
time. This deletion can be suppressed in the configuration.
(more XML formats)
...