[XBUP] XBUP - Extensible Binary Universal Protocol

» Concepts » Progress

Concept: Data Stream Structure

This document is part of the eXtensible Binary Universal Protocol project documentation. Provides description of the data stream structure and as a sequence of blocks.

Data Stream Architecture

In the case of specification of the structure of the data stream the proposal was even less direct than in the number's encoding. So far, the final solution has not been established and alternative solutions are still reconsidered. But the basis is still the same block-tree structure.

Declared Conditions

The objectives set for this project mostly relate to the data stream structure. Especially there is focus on compatibility and scalability of the implementation.

Requirements:

Data Sequence

Certainly, it will be necessary to store sequence of values in the UBNumber encoding. It will be necessary to provide limiting for such sequences. There are, for example, for example, the following options:

The first approach would mean, for example, that the block would begin with one or two items that define the type and one item determining the length of subsequent data blob. This very simple concept is used for example in EBML. Disadvantage and at the same time advantage is that each value would have to be enclosed in a separate block. The drawbacks of this otherwise full solutions include:

Although there are currently not available strong arguments against this, it will postponed for a later revaluation.

The disadvantage is the second approach is the necessity to read all items, even if only want to skip them. Most probably the third of these options is the best, making it easier to skip the entire sequence, if we are not interested in it. Occupied space should be in best case referenced as count of used data clusters. Indentation is useless in this case because there are used all the codes and it would cause some limitation. Also I believe that infinity sequence of these values will be addressed by other solution.

To store data, we need a sequence of bits and generally of various length. This sequence can be identified vy one value, determining the length, or reading till indentation is reached. Finally, UBENatural type value for length was chosen to declare length of the sequence with constant for infinity marking the end by indentation block. This option is especially appropriate to be used when there is not yet identified and therefore unknown length of bit sequences. Another way could be to use the value of zero to identify data terminated by indentation block. Empty block will be used as indentation.

Block Structure

Sequences of values should be organized into blocks. In addition to the relationship “to be subblock” there can exist other types of relationships between the blocks. These relationships can be expressed in several ways:

Any data representation could also provide a fixed order of block sequence, but that would avoid many of the required characteristics, and worsen extensibility and this could also cause conflicts between the realization and the logical structure of data.

Tree Structure

Apparently there must be a block to allow a general store of any sequence of bits that can be interpreted as subblock or better sequence of subblocks. If there would not exist such a block it would have to be known size of all sublocks along, which prevents the document streaming generating.

Thus begins block with a sequence of UBNumber type values, known as block attributes, which will be followed by a data part block to be interpreted as a sequence of subblocks. So there is a value at the beginning:

  UBNatural - AttribPartLength

Which is the length of the UBNumber value sequence.

That the value is of the UBNatural type means that the data stream has to known not only the count but also space used by attributes onward. This can cause problems for large sequences of attributes. The aim, however, is that the number of attributes should be as minimal as possible and all the relevant values should be stored as separate blocks. Also it is necessary to determine what type of block somehow.

The structure of the document encoded using XBUP is structured as follows:

File Header
Attribute Part
Data Part
Subblock's Attribute Part
Subblock's Data Part
...
...
Extended Area

Terminal Block

AttribPartLength value = 0 means that there are no more UBNumber values in the block and that this is a terminal block. This allows to limit the infinite sequence subblocks in combination with node block.

Data Block

When data of the block are interpreted as additional blocks or as a simple binary data can be determined several ways:

Data block may not contain any other attributes which may be included in the parent blocks, and so far this option appears to be sufficient. Therefore there is only one attribute in the attribute part:

  UBENatural - DataPartLength

This attribute specifies the length of a sequence of bit clusters, which follows. This sequence is interpreted as a bit sequence without any further meaning. Also there is support for the infinity value which indicates unlimited size of the block, which allows to generate data stream without knowing the amount of data. This raises the problem of the end of this type of block:

Termination using zero cluster brings another the problem about how the bit sequences should express the sequence of zeros:

This issue has not have been decided properly yet. For the time being the variant using two zero bytes is chosen for the prototype implementation.

Node Block

If there is given more than one attribute, then block is considered to be a node block or leaf block. The first value is in both cases:

UBENatural - SubPartLength

Non-zero value determines the length of a sequence of bits clusters, which follows. This is then interpreted as a sequence of blocks.

Leaf Block

Special case of node block is a block with DataPartLength = 0, which means that the block has no subblocks. This block store the information only in the form of attributes.

Correctness Argumentation

Similar to number's encoding, also here should be considered arguments supporting the chosen realization of the tree structure.

Protocol's Grammar

Protocol has form of the language over the alphabet {0,1}, which is generally the type 0. However, it is possible to include guidance for the presence of various structures such as protocol grammar without the need to establish a direct relationship with the binary expression.

Simplified Grammar

For the expression and understanding of the basic block structure of the protocol following grammar could be useful. Words written in lowercase letters are terminals.

Document ::= filehead + Block
Block ::= Attributes + Blocks | datablock
Attributes ::= Attributes + attribute | epsilon
Blocks ::= Blocks + Block | epsilon

This context-free grammar express that the entire document consists of a single block, each block is either a data block, or have two separate parts, any count of attributes and subblocks. Attributes are listed as the first block and the blocks are defined recursively. The grammar can be extended up to provide various other characteristics.

Grammar with Terminal Blocks

This grammar adds description of the use of terminator.

Document ::= filehead + Block
Block ::= Stdblock | Stdtermblock | Datablock | Datatermblock
Stdblock ::= Attributes + Blocks
Stdtermblock ::= Attributes + Blocks + terminator
Datablock ::= datablob
Datatermblock ::= datablob + terminator
Blocks ::= Block + Blocks | epsilon
Attributes ::= attribute + Attributes | epsilon

Document Parsing Grammar

Especially when parsing the blocks occurring in the limited order, which can also reduce the grammar, which is then context-free.

Document ::= Block
Block ::= blockBegin + Attributes + Blocks + blockEnd | blockBegin + blockData + blockEnd
Blocks ::= Block + Blocks | epsilon
Attributes ::= blockAttribute + Attributes | epsilon

The following chart reflects the basic graph of the occurrence of events in the sequential document parsing.

Explanation:

a - block attribute (blockAttribute)
b - begin of the block (blockBegin)
d - data part of block (blockData)
e - end of block (blockEnd)

Event occurrence graph

Graph source file graph-1.graphml

Alternatively, you can use the following automate transitional system, a regular grammars, whose language is close superset to the previous language:

Event occurrence graph

Graph source file graph-2.graphml

Finer Grammar

In the case of large data blocks may be their processed problematic and memory demanding. For this reason, it is possible and even advisable to use finer division of the individual bytes. Between the two types of grammars direct conversion of the corresponding events exist.

Explanation:

a - block attribute (blockAttribute)
b - begin of the block (blockBegin)
d - single byte of block data part (blockData)
e - end of block (blockEnd)

Event occurrence graph

Graph source file {ben:doc:devel:progress:images:graph-3.graphml|graph-3.graphml}}

Since the data event here represents exactly one byte, it was necessary to establish a direct transition at the end of the data block with an empty data sections.

Divide the attributes to individual bytes does not seems to be meaningful, as well as a split of the data to bits.

Rougher Grammar

On the other hand, it is possible to merge some events, for example include attributes and data block as a part of block begin.

Explanation:

b - begin of the block with attributes (blockBegin)
d - data block with data blob (blockData)
e - end of block (blockEnd)

Event occurrence graph

Graph source file graph-4.graphml

Correct Well-formness

Stream is created correctly (well-formed) if the following conditions holds. Names of exceptional invalid states are also listed for the case given condition is not met.

Introduction of the Levels

During the creation of a draft of the structure it has proved to be more appropriate to introduce multiple levels of the data realization. The reason is the ability to process the document at various levels of complexity and, in particular, to carry out inspections of the various conditions, according to the availability of specifications.

Encoding a basic block structure is known as Level 0. It is not a fully fledged protocol variant, since it only specifies the method of using the organization without implementing the transfer of the information. It shows the individual blocks without the interpretation of their content, just as blocks of 4 basic types. Is it possible to validate compliance with the block structure (syntax check).

Next levels extends interpretation of blocks introducing their type.

There is further declared the method of the type specification and basic blocks. Further extension would provide the permitted range for attributes, allowed subblocks and order of their sequence, transition and manner of the definition of their meaning. At higher levels it will be possible to validate a document according to the required extension level.

Possible levels are described in separate parts of the documentation focused on the levels.