Key-Value-Hierarchy Format (KVH) description

1. Introduction.

The Key-Value-Hierarchy (KVH) format is intended for the exchange of structured information between applications, independently of programming languages and of the operating systems hosting these applications.

Several formats already aim to cover this need. The best known are XML (and its dialects) and HDF (Hierarchical Data Format), developed and supported by NCSA, along with some others. So why a new format? Mainly because of the outrageous complexity of those formats compared to the modest needs of some trivial tasks. For example, to correctly parse and write an XML file from your application, you need a specialized XML library, which may weigh as much as 100-500 KB.

As you will see later in this description, KVH is such a trivial format that you will probably write your own tiny parser right away in your favourite programming language. Yet you should not underestimate the power and usefulness of KVH behind its trivial parsing rules.

Typographical and terminology conventions.
Hereafter, we will use some obvious typographical conventions:

a horizontal tabulation character (ASCII 0x09) is written here as <tab>;
a new line character (ASCII 0x0A) is written as <nl>;
a backslash character (ASCII 0x5C) is written as <\> in this document.
These three characters are called reserved.
Optional elements are enclosed in brackets [...].
We call a string any sequence of octets.

NB: The KVH format is very rigid. You cannot change, add or remove any of the reserved characters and still call it KVH format. Nor can you add any sort of signature, e.g. to declare a charset encoding. If you do, the result is something other than KVH format. A KVH parser conforming to this document will interpret such signatures as parts of keys or values. See the Internationalization section for more about the use of international characters.

2. Format definition.

2.0. Escaping rule.

We use very classic rules here. The two reserved characters <nl> and <\> must be escaped whenever they are part of a key or value string, so that their functional meaning is cancelled. The <tab> character must be escaped only in a key; in a value string it may be escaped, but this is not mandatory. To escape a character, precede it with one and only one backslash character <\>. A KVH parser adds the character following the <\> to the key or value string, depending on which of the two is being parsed at that moment. Any other character may be escaped too. If a non-reserved character, noted here <char>, is escaped, the backslash preceding it is ignored, i.e. the backslash is added neither to the key nor to the value; only <char> itself ends up in the key or value. If an unmatched escape character <\> is the last character in the file, it must be ignored. Beware that your editor is likely to make a <nl> the last character of the file. In that case the last character that you see (<\>) plays its escaping role, and a <nl> character ends the last key or value string.
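The escaping rule above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function names (escape_key, escape_value, unescape) are our own, and strings are handled as Python bytes objects, since KVH strings are plain octet sequences.

```python
BACKSLASH = 0x5C

def escape(data: bytes, reserved: bytes) -> bytes:
    """Precede every octet listed in `reserved` with exactly one backslash."""
    out = bytearray()
    for octet in data:
        if octet in reserved:
            out.append(BACKSLASH)
        out.append(octet)
    return bytes(out)

def escape_key(key: bytes) -> bytes:
    # In a key, all three reserved characters must be escaped.
    return escape(key, b"\t\n\\")

def escape_value(value: bytes) -> bytes:
    # In a value, only <nl> and <\> must be escaped; <tab> is optional
    # and is left untouched here.
    return escape(value, b"\n\\")

def unescape(data: bytes) -> bytes:
    """Drop each escaping backslash and keep the octet that follows it."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == BACKSLASH:
            i += 1
            if i == len(data):
                break  # unmatched backslash at the very end: ignored
        out.append(data[i])
        i += 1
    return bytes(out)
```

Note that unescape treats reserved and non-reserved characters alike: whatever follows a backslash is copied as is, which is exactly why escaping a non-reserved character is harmless.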

2.1. Row definition.

A file in KVH format consists of rows, each terminated by one and only one <nl> or by the end of file. A row is an octet sequence composed of:

indentation sequence of <tab>s followed by
a key-string,
<tab> separating key and value, and finally
value-string.

[<tab>...<tab>][key][<tab>[value]]<nl>

We will say more about indentation later. For now, we assume that the indentation is void. So a traditional hello-world example of a KVH file looks something like this:
salutation<tab>Hello, world!<nl>
or, in formatted form:

salutation	Hello, world!

In this example, "salutation" is the key and its corresponding value is "Hello, world!". Please note the absence of quotes, double quotes, back-quotes and so on in the file. If such a character is present in the file, it is part of the key or value string. You must not use any kind of quotes to delimit keys or values.
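To make the row layout concrete, here is a minimal Python sketch that splits one row without indentation into its key and value. The function name parse_flat_row is our own invention, and the row is assumed to be given without its terminating <nl>.

```python
TAB, BACKSLASH = 0x09, 0x5C

def parse_flat_row(row: bytes) -> tuple:
    """Split a non-indented row into (key, value), resolving escapes."""
    key = bytearray()
    value = bytearray()
    target = key              # we start by filling the key
    i = 0
    while i < len(row):
        octet = row[i]
        if octet == BACKSLASH:
            i += 1
            if i < len(row):  # an unmatched trailing backslash is ignored
                target.append(row[i])
        elif octet == TAB and target is key:
            target = value    # first unescaped <tab>: key-value separator
        else:
            target.append(octet)
        i += 1
    return bytes(key), bytes(value)

print(parse_flat_row(b"salutation\tHello, world!"))
# prints (b'salutation', b'Hello, world!')
```

Quotes get no special treatment here: a row such as "k"<tab>v yields a key that is three octets long, quotes included. A row without any separating <tab> yields a void value.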

2.2. Key definition.

A key may be any sequence of octets, even void (i.e. zero length). Any of the three reserved characters (<tab>, <nl> and <\>) found in a key must be escaped. Keys are not required to be unique, at least not by the KVH format specification. It is up to your application to decide what to do with repeated keys.

2.3. Value definition.

A value may be any sequence of octets, even void. The main formatting difference from a key string is that only the two reserved characters <nl> and <\> must be escaped. Escaping <tab> and any other character is optional. To avoid any ambiguity for a human reader of a KVH file, it is safer to escape <tab>s in value strings too.

NB: The primary goal of KVH is not to be easily edited or read by a human operator. Yet its indented nature may help in fixing or understanding KVH-formatted content. The main purpose of this format is easy exchange between applications.

If a key is immediately followed by a <nl>, i.e. the separating <tab> is omitted, the value must be considered void. The same is obviously true when the separating <tab> is immediately followed by a <nl>: the value is void.

2.4. Hierarchical structure.

When some programming object (whatever that means) has named attributes which may in turn be objects, we need some hierarchical organization of our data. In KVH format, the hierarchy is expressed by <tab> indentation.

Let us take our previous "Hello, world!" example further. We translate the salutation into French and use "en" and "fr" as keys for the two salutations. Our file becomes:
salutation<nl><tab>en<tab>Hello, world!<nl><tab>fr<tab>Salut le monde !<nl>
or, in formatted form:

salutation
	en	Hello, world!
	fr	Salut le monde !

If this file is parsed, for example, into a PHP array $KVH_exmpl, then $KVH_exmpl["salutation"]["en"] will hold the string "Hello, world!" and $KVH_exmpl["salutation"]["fr"] will hold "Salut le monde !". The simple fact that the key "salutation" is followed by a <nl> and not by a <tab> indicates that this key may become a hierarchical object. We say that we may go deeper, from level 0 to level 1. Whether this opportunity to become a nested entry is taken depends on the indentation <tab>s of the following row. If the next row begins with at least as many <tab>s as the open level plus one, the opportunity is taken and one and only one additional level of hierarchy starts there. Otherwise, the level is not incremented and the entry is simply considered to have a void value. Once a hierarchy is created, the number of <tab>s at the beginning of a row corresponds to its level in the hierarchy. If a key is followed by a <tab> and then by a <nl>, the level is not incremented even if the next row has more indentation <tab>s than the current one. The first <tab> after the current level's indentation is then treated as a key-value separator, and an entry with a void key is created without a level increment. The number of hierarchical levels is not limited.
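The level rules above fit in a small parser. The following Python sketch is one possible reading of them, not a reference implementation. The names parse_kvh and _read_string are hypothetical, and the result layout is our own assumption: since keys need not be unique, each level is returned as a list of (key, value) pairs, where value is either a bytes string or a nested list of pairs.

```python
TAB, NL, BACKSLASH = 0x09, 0x0A, 0x5C

def _read_string(data: bytes, pos: int, stops: bytes):
    """Read from pos up to an unescaped octet listed in stops, or to end of data.
    Returns the unescaped string and the position of the stopping octet."""
    out = bytearray()
    while pos < len(data) and data[pos] not in stops:
        if data[pos] == BACKSLASH:
            pos += 1
            if pos == len(data):
                break  # unmatched backslash at end of file: ignored
        out.append(data[pos])
        pos += 1
    return bytes(out), pos

def parse_kvh(data: bytes) -> list:
    root = []
    stack = [root]    # stack[n] collects the entries of level n
    may_nest = False  # True when the previous row had no key-value separator
    pos = 0
    while pos < len(data):
        # Indentation may reach the open level, plus one if nesting is allowed.
        max_level = len(stack) - 1 + (1 if may_nest else 0)
        level = 0
        while level < max_level and pos < len(data) and data[pos] == TAB:
            level += 1
            pos += 1
        if level == len(stack):
            # One deeper: the previous entry becomes a hierarchical object.
            children = []
            parent_key, _ = stack[-1][-1]
            stack[-1][-1] = (parent_key, children)
            stack.append(children)
        else:
            del stack[level + 1:]  # close the levels deeper than this row
        key, pos = _read_string(data, pos, b"\t\n")
        has_separator = pos < len(data) and data[pos] == TAB
        value = b""
        if has_separator:
            value, pos = _read_string(data, pos + 1, b"\n")
        pos += 1  # step over the terminating <nl>, if any
        stack[-1].append((key, value))
        may_nest = not has_separator
    return root

example = b"salutation\n\ten\tHello, world!\n\tfr\tSalut le monde !\n"
print(parse_kvh(example))
# prints [(b'salutation', [(b'en', b'Hello, world!'), (b'fr', b'Salut le monde !')])]
```

Because the indentation loop never consumes more <tab>s than the deepest level allowed, any extra <tab> is simply left for the key reader, which sees it as a separator after a void key. No special code is needed for that case.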

2.5. Special cases.

In fact, the special cases listed here are not special at all if you rigorously follow the rules above. Nevertheless, let us see what happens if ...

2.5.1. Void row.

If one <nl> immediately follows another <nl>, then all open levels are closed and a pair of void key and void value is created at level 0.

2.5.2. More indentation <tab>s than needed.

We speak here about unescaped <tab>s encountered at the beginning of a row. The first superfluous <tab> is treated as the key-value separator, and the others (if any) become part of the value string.
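These cases matter most when writing a file: a careless writer can produce rows whose leading <tab>s are read differently from what was intended. The Python sketch below shows one cautious way to write, under our own assumptions: the name dump_kvh is hypothetical, and the input is a list of (key, value) pairs in which value is either a bytes string or a nested list of pairs. It always writes the separating <tab> for a leaf, even when the value is void, so that a following row can never be mistaken for a child, and it escapes <tab>s in values as well.

```python
def escape(data: bytes, reserved: bytes) -> bytes:
    """Precede every octet listed in `reserved` with one backslash."""
    out = bytearray()
    for octet in data:
        if octet in reserved:
            out.append(0x5C)
        out.append(octet)
    return bytes(out)

def dump_kvh(entries: list, level: int = 0) -> bytes:
    """Serialize (key, value) pairs; a non-empty list value opens a new level."""
    out = bytearray()
    for key, value in entries:
        out += b"\t" * level + escape(key, b"\t\n\\")
        if isinstance(value, list) and value:
            # No separator after the key: the next row may go one level deeper.
            out += b"\n" + dump_kvh(value, level + 1)
        else:
            # Leaf: always write the separator, so the level cannot increment.
            leaf = b"" if isinstance(value, list) else value
            out += b"\t" + escape(leaf, b"\t\n\\") + b"\n"
    return bytes(out)

tree = [(b"salutation", [(b"en", b"Hello, world!"), (b"fr", b"Salut le monde !")])]
print(dump_kvh(tree))
# prints b'salutation\n\ten\tHello, world!\n\tfr\tSalut le monde !\n'
```

One limitation of this sketch is that an empty list is written as a void value, so the distinction between the two is lost in the file.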

3. Internationalization.

It is straightforward if UTF-8 encoding is used. None of <tab>, <nl> or <\> can occur inside a multi-octet UTF-8 sequence (thanks, smart guys). So any KVH parser should correctly process any UTF-8 encoded file. It is up to the application to decide how to treat key and value strings: as Unicode strings or as arrays of integers in binary format. In any case, UTF-8 is strongly recommended for any textual information.
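The UTF-8 property claimed above is easy to check empirically. The following Python sketch (the function name reserved_in_multibyte is our own) encodes a range of code points and looks for a reserved octet inside any multi-octet sequence. It is a quick check over the code points it is given, not a proof.

```python
RESERVED = (0x09, 0x0A, 0x5C)  # <tab>, <nl>, <\>

def reserved_in_multibyte(code_points) -> list:
    """Return the code points whose multi-octet UTF-8 encoding
    contains one of the reserved octets."""
    offenders = []
    for cp in code_points:
        encoded = chr(cp).encode("utf-8")
        if len(encoded) > 1 and any(octet in RESERVED for octet in encoded):
            offenders.append(cp)
    return offenders

# Every code point from U+0080 up to, but excluding, the surrogate range.
print(reserved_in_multibyte(range(0x80, 0xD800)))
# prints []
```

The list is empty because every octet of a multi-octet UTF-8 sequence has its high bit set, while the three reserved characters are all below 0x80.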

4. Programming limitations.

No counters are used in KVH format, so the format itself imposes no limits. The only limits that apply are due to:

the underlying file system (file size);
a 16/32/64-bit operating system (string length, hierarchy levels and so on).

5. Format extensions.

There are none and there will be none. Instead, people's creative energy should be directed at developing key sets.
We can suppose that some dialects will be built on KVH format. For example, one may need references to other objects in a KVH file and adopt a Unix-path-like notation to point to an object, e.g. "/salutation/en" or "../fr". This is not a KVH concern. In any case, application developers have to use additional conventions to give keys and values a meaning appropriate to their programming domain.

Copyright 2004, Serguei Sokol (ssokol-AT-chez.com)
You may copy and distribute this document under GNU Free Documentation License.