Key-Value-Hierarchy Format (KVH) description
1. Introduction.
The Key-Value-Hierarchy format is intended for the exchange of
structured information between applications, independently of the
programming languages they are written in and of the operating systems
hosting them.
There are already some formats that aim to cover this need. The most
famous are XML (and its dialects) and HDF (Hierarchical Data Format),
developed and supported by NCSA; there are some other formats as well.
So why this new format?
Mainly because of the outrageous complexity of the cited formats
compared to the modest needs of some trivial tasks. For example, to
parse and write an XML file correctly from your application, you need a
specialized XML library, which may be as heavy as 100-500KB.
As you will see later in this description, KVH is such a trivial format
that you will probably write your own tiny parser at once in your
favourite programming language.
Yet you should not underestimate the power and usefulness of KVH behind
its trivial parsing rules.
Typographical and terminology conventions.
Hereafter, we will use some obvious typographical conventions:
- a horizontal tabulation character (ASCII 0x09) is written here as
<tab>;
- a new line character (ASCII 0x0A) is written as <nl>;
- a backslash character (ASCII 0x5C) is written as <\> in this
document;
- these three characters are called reserved;
- optional elements are enclosed in brackets [...];
- we call a string any sequence of octets.
NB: The KVH format is very rigid. You cannot change, add or remove
any of the reserved characters and still call it KVH format.
Nor can you add any sort of signature, e.g. to define the charset
encoding.
If you do, the result will be something other than KVH format.
A KVH parser conforming to this document will interpret such signatures
as parts of a key or value. See the Internationalization section for
more about the use of international characters.
2. Format definition.
2.0. Escaping rule.
We use very classic rules here. Each of the two reserved characters
<nl> and <\> must be escaped if it is part of a key or value string, so
that its functional meaning is cancelled. The <tab> character must be
escaped only in a key. If it is in a value string, it may be escaped
but this is not mandatory. To escape a character, precede it by one and
only one backslash character <\>.
A KVH parser will add the character following the <\> to the key or
value string, depending on what is being parsed at that moment. Any
other character may be escaped too. If a non-reserved character, noted
here <char>, is escaped, the backslash preceding it is ignored, i.e.
the backslash is added to neither the key nor the value. Only <char>
itself will be in the key or value.
If an unmatched escape character <\> is the last character in the file,
it must be ignored. Note, however, that your editor is likely to make a
<nl> the last character of the file. In this case the last character
that you see (<\>) will play its escaping role and you will have a <nl>
character ending the last key or value string.
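The escaping rule can be sketched in a few lines of code. The Python
fragment below is only an illustration and not part of the format
definition; the function name kvh_unescape is an assumption made for
this example. It takes a key or value string exactly as it appears in
the file (already isolated from its row) and returns the octets a
parser would store.

```python
BACKSLASH = 0x5C

def kvh_unescape(raw: bytes) -> bytes:
    """Drop each escaping backslash and keep the octet that follows it."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == BACKSLASH:
            i += 1                  # skip the backslash itself
            if i == len(raw):       # unmatched backslash at end of file
                break               # ... is ignored
        out.append(raw[i])          # reserved or not, the octet is kept
        i += 1
    return bytes(out)

# An escaped <nl> becomes a plain <nl>; an escaped ordinary character
# simply loses its backslash.
print(kvh_unescape(b"two\\\nlines"))   # b'two\nlines'
print(kvh_unescape(b"\\a\\b"))         # b'ab'
```

Working on octets rather than on decoded text keeps the sketch
independent of any character encoding.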
2.1. Row definition.
A file in KVH format is formed of rows, each terminated by one and only
one unescaped <nl> or by the end of file.
A row is an octet sequence composed of:
- an indentation sequence of <tab>s, followed by
- a key string,
- a <tab> separating key and value, and finally
- a value string.
[<tab>...<tab>][key][<tab>[value]]<nl>
We will say more about indentation later. For now, we will assume that
the indentation is void. So, a traditional hello-world example of a KVH
file is something like this:
salutation<tab>Hello, world!<nl>
or in formatted form:
salutation Hello, world!
In this example "salutation" is a key and its corresponding value is
"Hello, world!". Please note the absence of quotes, double quotes,
back-quotes and so on in the file. If such a character is present in
the file, it will be part of the key or value string. You must not use
any type of quotes to delimit keys or values.
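As an illustration of the row definition, here is a minimal Python
sketch that cuts file content into rows and then cuts one row into its
key and value parts, assuming void indentation as in the example above.
The names split_rows and split_row are hypothetical, and the returned
strings are still in their escaped form.

```python
from typing import List, Optional, Tuple

TAB, NL, BACKSLASH = 0x09, 0x0A, 0x5C

def split_rows(data: bytes) -> List[bytes]:
    """Split file content into raw rows at every unescaped <nl>."""
    rows, start, i = [], 0, 0
    while i < len(data):
        if data[i] == BACKSLASH:
            i += 2                  # skip the backslash and the escaped octet
            continue
        if data[i] == NL:
            rows.append(data[start:i])
            start = i + 1
        i += 1
    if start < len(data):           # last row terminated by end of file
        rows.append(data[start:])
    return rows

def split_row(row: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (raw key, raw value); the value is None without a separator."""
    i = 0
    while i < len(row):
        if row[i] == BACKSLASH:
            i += 2
            continue
        if row[i] == TAB:           # first unescaped <tab> is the separator
            return row[:i], row[i + 1:]
        i += 1
    return row, None

rows = split_rows(b"salutation\tHello, world!\n")
print(split_row(rows[0]))           # (b'salutation', b'Hello, world!')
```

Returning None when the separator is missing lets the caller tell a key
followed directly by <nl> from a key followed by <tab><nl>.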
2.2. Key definition.
A key may be any sequence of octets, even a void one (i.e. of zero
length). Any of the three reserved characters (<tab>, <nl> and <\>)
found in a key must be escaped. Keys are not required to be unique, at
least not by the KVH format specification. It is up to your application
to decide what to do with repeated keys.
2.3. Value definition.
A value may be any sequence of octets, even a void one.
The main formatting difference from a key string is that only two
reserved characters, <nl> and <\>, must be escaped. Escaping of <tab>
and of any other character is optional. To avoid any ambiguity when a
human reads a KVH file, it is safer to escape <tab>s in value strings
too.
NB: The primary goal of KVH is not to be easily edited or read by a
human operator. Yet its indented nature may help in fixing or
understanding KVH-formatted content. The main purpose of this format is
easy exchange between applications.
If a key is immediately followed by a <nl>, i.e. the separating <tab>
is omitted, the value must be considered void. The same is obviously
true when a separating <tab> is immediately followed by a <nl>: the
value is void.
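On the writing side, the key and value definitions amount to two sets
of octets to escape. The Python sketch below shows one possible writer
for a single row; the names escape and format_row and the parameter
escape_value_tabs are assumptions made for this example, not something
the format prescribes.

```python
KEY_RESERVED = b"\t\n\\"        # <tab>, <nl> and <\> must be escaped in keys
VALUE_RESERVED = b"\n\\"        # only <nl> and <\> must be escaped in values

def escape(data: bytes, reserved: bytes) -> bytes:
    """Precede every octet of `data` found in `reserved` by one backslash."""
    out = bytearray()
    for octet in data:
        if octet in reserved:
            out.append(0x5C)
        out.append(octet)
    return bytes(out)

def format_row(key: bytes, value: bytes, level: int = 0,
               escape_value_tabs: bool = True) -> bytes:
    """Build one row: indentation, key, separating <tab>, value, <nl>."""
    value_reserved = KEY_RESERVED if escape_value_tabs else VALUE_RESERVED
    return (b"\t" * level
            + escape(key, KEY_RESERVED)
            + b"\t"
            + escape(value, value_reserved)
            + b"\n")

print(format_row(b"salutation", b"Hello, world!"))
# b'salutation\tHello, world!\n'
```

This sketch always writes the separating <tab>, even for a void value.
That is a design choice: as section 2.4 explains, a key followed
directly by <nl> may open a hierarchy level, so the explicit separator
keeps a plain entry unambiguous.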
2.4. Hierarchical structure.
When some programming object (whatever that means) has named attributes
which in turn may be objects, then we have to use some hierarchical
organization of our data. In KVH format the hierarchy is expressed by
<tab> indentation.
Let us consider and evolve our previous "Hello, world!" example. We
translate this salutation into French and use "en" and "fr" as keys
for the two salutations. Our file becomes:
salutation<nl><tab>en<tab>Hello, world!<nl><tab>fr<tab>Salut le monde !<nl>
or in formatted form:
salutation
	en	Hello, world!
	fr	Salut le monde !
If this file is parsed, for example, into a PHP array $KVH_exmpl, then
$KVH_exmpl["salutation"]["en"] will receive the string "Hello, world!"
and $KVH_exmpl["salutation"]["fr"] will become "Salut le monde !".
The simple fact that the key "salutation" is followed by a <nl> and not
by a <tab> indicates that this key may become a hierarchical object.
We will say that we may go deeper, from level 0 to level 1.
This opportunity to become a nested entry is taken or not by the
indentation <tab>s of the following row.
If, at the beginning of the next row, there are at least as many <tab>s
as the open level plus one, then the opportunity is taken and one and
only one supplementary level of hierarchy starts here. Otherwise, the
level is not incremented and the entry is simply considered to have a
void value.
When a hierarchy is created, the number of <tab>s at the beginning of a
row corresponds to its level in the hierarchy.
If a key is followed by a <tab> and then by a <nl>, the level is not
incremented even if there are more indentation <tab>s in the next row
than in the current one. The first <tab> after the current level
indentation will be considered a key-value separator, and an entry with
a void key will be created without level increment.
The number of hierarchical levels is not limited.
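To show how the rules of this section fit together, here is a small
parser sketched in Python. It is one possible reading of the rules, not
a reference implementation. The names parse_kvh, unescape and
find_unescaped are assumptions made for this example, as is the choice
to return a list of (key, value) pairs, which keeps repeated keys. A
value is either an octet string or a nested list of pairs.

```python
TAB, NL, BACKSLASH = 0x09, 0x0A, 0x5C

def unescape(raw: bytes) -> bytes:
    """Drop each escaping backslash and keep the octet that follows it."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == BACKSLASH:
            i += 1
            if i == len(raw):       # unmatched backslash at end of file
                break
        out.append(raw[i])
        i += 1
    return bytes(out)

def find_unescaped(raw: bytes, octet: int, start: int = 0) -> int:
    """Index of the first unescaped `octet` at or after `start`, or -1."""
    i = start
    while i < len(raw):
        if raw[i] == BACKSLASH:
            i += 2                  # skip the backslash and the escaped octet
        elif raw[i] == octet:
            return i
        else:
            i += 1
    return -1

def parse_kvh(data: bytes) -> list:
    """Parse KVH octets into a list of (key, value) pairs."""
    root: list = []
    stack = [root]                  # stack[n] collects the entries of level n
    may_open = False                # previous row was a key with no separator
    pos = 0
    while pos < len(data):
        end = find_unescaped(data, NL, pos)
        if end < 0:
            end = len(data)         # last row is terminated by end of file
        row = data[pos:end]
        pos = end + 1
        tabs = 0
        while tabs < len(row) and row[tabs] == TAB:
            tabs += 1
        # At most one supplementary level may start after an open key.
        level = min(tabs, len(stack) - 1 + (1 if may_open else 0))
        if level == len(stack):     # the previous key becomes a hierarchy
            key, _ = stack[-1][-1]
            children: list = []
            stack[-1][-1] = (key, children)
            stack.append(children)
        else:
            del stack[level + 1:]   # close the deeper levels, if any
        body = row[level:]          # a remaining leading <tab> is a separator
        sep = find_unescaped(body, TAB)
        if sep < 0:
            raw_key, raw_value, may_open = body, b"", True
        else:
            raw_key, raw_value, may_open = body[:sep], body[sep + 1:], False
        stack[level].append((unescape(raw_key), unescape(raw_value)))
    return root

example = b"salutation\n\ten\tHello, world!\n\tfr\tSalut le monde !\n"
print(parse_kvh(example))
# [(b'salutation', [(b'en', b'Hello, world!'), (b'fr', b'Salut le monde !')])]
```

An application that knows its keys are unique could convert each list
of pairs into a dictionary afterwards.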
2.5. Special cases.
In fact, the special cases listed here are not special at all if you
rigorously follow the above rules. Nevertheless, let us see what
happens if ...
2.5.1. Void row.
If one <nl> immediately follows another <nl>, then all open levels are
closed and a void key-value pair is created at level 0.
2.5.2. More indentation <tab>s than needed.
We speak here about unescaped <tab>s encountered at the beginning of a
row.
The first of the superfluous <tab>s will be considered the key-value
separator and the others (if any) will be part of the value string.
3. Internationalization.
It is straightforward if UTF-8 encoding is used.
None of <tab>, <nl> or <\> can occur inside a multi-octet UTF-8
sequence (thanks, smart guys).
So any KVH parser should correctly process any UTF-8 encoded file.
It is up to the application to decide how to treat key and value
strings: as Unicode strings or as arrays of integers in binary format.
In any case, UTF-8 is strongly recommended for any textual information.
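The claim about multi-octet sequences can be checked by brute force.
The Python sketch below encodes every code point above ASCII and looks
for the three reserved octets; the function name is made up for this
example.

```python
RESERVED = {0x09, 0x0A, 0x5C}       # <tab>, <nl> and <\>

def reserved_octets_in_multibyte_utf8() -> list:
    """Return the code points whose UTF-8 encoding contains a reserved octet."""
    hits = []
    for code_point in range(0x80, 0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue                # surrogates cannot be encoded in UTF-8
        encoded = chr(code_point).encode("utf-8")
        if RESERVED.intersection(encoded):
            hits.append(code_point)
    return hits

print(reserved_octets_in_multibyte_utf8())   # []
```

The list is empty because every octet of a multi-octet UTF-8 sequence
has its high bit set, while the three reserved characters are all below
0x80.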
4. Programming limitations.
No counters are used in the KVH format, so the format itself imposes no
limits. The only limits that apply are due to:
- the underlying file system (file size);
- the 16/32/64-bit operating system (string length, number of hierarchy
levels and so on).
5. Format extensions.
There are none and there will be none. Instead, people's creative
energy should be directed at developing key sets.
We can suppose that some dialects will be built on the KVH format.
For example, one may need references to other objects in a KVH file and
adopt a Unix-path-like notation to point to some object, e.g.
"/salutation/en" or "../fr". This is not the concern of KVH. In any
case, application developers have to use some additional conventions to
give keys and values a meaning appropriate to their programming domain.
Copyright 2004, Serguei Sokol
(ssokol-AT-chez.com)
You may copy and distribute this document under GNU Free Documentation
License.