[XBUP] XBUP - Extensible Binary Universal Protocol

Concept: Parsing

Parsing concept is describing method for conversion of raw data to components, typically using formal grammar. Although parsing is technique used mostly for text processing, it is possible to use similar approach to binary data.

Stream Processing

For processing of XBUP-encoded data one option is to consider data to be some ordered sequence of items of the particular type.

Focus will be on the following aspects:

Token Type

Transmitted items (tokens) constitutes a unit of the stream. The basic unit of information is the bit, which may be used as a token, but it is preferable to use cluster of bits, for example, byte, which among other things, simplifies the addressing on the devices, which follows the data align. For the purposes of the protocol following two types of tokens exist:

Binary Data Classes

Individual tokens are described in the document block structure.

Tokens vary depending on the level of the protocol.

Block Processing Classes

Another option is to transfer entire blocks of the protocol as described in the rough grammar. There are three types of blocks:

This type of stream is defined as from the 1st level and blocks varies according to the level.

Data Passing Method

The way in which data are transmitted is largely a matter of implementation. Among possible solutions, it should be noted in particular:

In the case of transmission using the links, it is also appropriate to distinguish whether it is forwarded as a copy, or a reference to the used value, either as an pointer in memory, or to other type of media, or source of the stream through a proxy interface, and whether it is allowed to change this value. It is largely a matter of optimization.

Block Type Dependency

From Level 1 the block type is defined. This is passed as a separate type of token, which must be preceded by (also empty) sequence of attributes. In addition to the ability of block type transmitting as a pair of values, it is possible to transmit the type of block as reference in the catalog or other type of declaration. It is possible to store types, which are dependent on the stream head, as converted to a numeric value using table generated by head of the stream. For links to catalog of the types it may be necessary for the storage of such documents initially collect those types and add links to the catalog in the header of the document.

Range Field

Another element is the implementation and field ranges. In the case of the data transferred between applications, it is necessary to consider whether pointers does not dependent on a particular application thread, or a running instance of the operating system. This dependence should be eliminated during transmission between the computers using the proxy-stub or so-called marshaling (Microsoft COM terminology). Possible cases are:

Data Stream Control

If you look at the data from the perspective of of the information transfer, it is possible to distinguish the way how the transmission of data in the data stream are managed. Communication stream can be controlled by either the recipient or the sender. (Master / Slave), or uncontrolled (so-called non-guaranteed transfer). Various controlling methods may be used in transmission:

Possible Movement in Stream

In addition to the basic sequential processing there can be provided bunch various operations to change the location of the source file stream (the content changes management). For example, the following:

Protocol Grammar

Protocol data structure can be interpreted as a form of the language over the alphabet {0,1}, which is generally the type 0. However, it is also possible to define grammar for this basic components even without the need to define exact binary data.

Simplified Grammar

This simplified grammar can be used for expression and understanding of the basic block structure of the protocol. Words written in lowercase letters are terminals.

Document ::= documentHeader + Block + extendedArea
Block ::= Attributes + Blocks | dataBlock
Attributes ::= Attributes + attribute | epsilon
Blocks ::= Blocks + Block | epsilon

This context-free grammar express that the entire document consists of a single block, each block is either a data block, or have two separate parts, sequence of attributes and sequence of subblocks. Attributes are listed as the first block and the blocks are defined recursively. The grammar can be extended to provide other characteristics.

Grammar with Terminal Blocks

This grammar adds description of the use of terminator.

Document ::= documentHeader + Block
Block ::= Stdblock | Stdtermblock | Datablock | Datatermblock
Stdblock ::= Attributes + Blocks
Stdtermblock ::= Attributes + Blocks + terminator
Datablock ::= dataBlob
Datatermblock ::= datablob + terminator
Blocks ::= Block + Blocks | epsilon
Attributes ::= attribute + Attributes | epsilon

Document Parsing Grammar

Especially when parsing the blocks occurring in the limited order, which can also reduce the grammar, which is then context-free.

Document ::= Block
Block ::= blockBegin + Attributes + Blocks + blockEnd | blockBegin + blockData + blockEnd
Blocks ::= Block + Blocks | epsilon
Attributes ::= blockAttribute + Attributes | epsilon

Parsers

Parser is a program which allows browsing of structured data defined by a grammar. At the output interface offers functionality to get access to processed result, most often in the form of the tree.

Token Parsers

Simple parsers allows to access data using primitive data objects called tokens.

Token Types

XBUP level 0 protocol recognizes following 4 types of tokens:

For XBUP level 1 protocol there is additional TYPE token:

Parser States and Transition Graph

Parser of the level 0 when reading data stream may be in one of the following states - for the needs of fine recognition, it is possible to distinguish whether the parser is at the beginning or within the region:

On level 1 there is additional state for block type:

This states are used in state transition graph below. Edges in this graph are transitions reflecting one passing token of specific type.

Event flow / state transition graph

Graph source file eventflow.graphml

For level 1 there is extension of this graph for type event.

Level 1 event flow / state transition graph

Graph source file eventflow-level1.graphml

Event Interfaces

Base of the parsers the transformation of the stream to the stream of recognized events, which are then processed. The reverse procedure is also possible to generate the document. For each token type, there is separate method defined:

void begin(terminationMode)
void attrib(attribute)
void data(data)
void end()

On level 1 additional method is included:

void type(blockType)

Parser Types

Parsers vary mostly on how they store service data of the processed document and there are separated parsers for each protocol's level. This concept will describe 3 types of parsers:

Parsers have different memory consumption and may vary on the complexity of the implementation, speed and performance.

Object Model Parser

These parsing classes process the entire source and create an object copy of the file in the memory (DOM - Document Object Model). The implementation works so that it checks file header and calls recursive/iterative retrieval of the root node, which recursively fills its internal structures. Object model parser usually cannot handle infinite streams. The object model can be also build using token stream and vice-verse.

Typically objects are blocks with following structure:

Document {
  Block rootBlock
  Blob extendedArea
}
Block {
  Boolean terminationMode
  attribute[] attributes
  Block[] childBlocks
  Blob data
}

Token Parser

This kind of parsers converts the input stream to sequence of tokens. This includes both event and pull parsers, which differs on calling method. Pull parser provides single method which returns the following token from the stream (bitstream → tokens), while event parser process whole stream at once. This parsers are complementary and in opposite direction (tokens → bitstream) event parser can receive single token while pull parser pulls all tokens at once. This parsers are using simple interface for sending single token and are suitable for streaming or parallel processing and also can be used for processing infinite data stream.

Parser tools of this type are usually faster and easier to implement, however, most of the work remains on the application. That means that this parsers are good for processing simple documents which could be read sequentially.

Command Parser

Command parser works on the basis of isolated orders to which returns relevant answers. (similar to the command line - command shell). Set of these commands allows both reading and handling of the document parts.

Command parser is a combination of previous parsers. If it's necessary, it stores parts of document into the memory as a tree structure and saves them only if necessary. The advantage is, for example, that until the document is updated, it's not necessary to copy complete document into the memory, only relevant indexes up to a maximum allowed size are stored for the rapid transitions between documents. If the document is processed sequentially, behaves as a token parser, only in the case of jumping it creates auxiliary indexes, or in the case of modification object model is used. Parser construction allows to perform more sophisticated operations, such as modification of the file without reading the entire content (delayed write, copy-on-write).

Example methods for command parser:

Open(Stream)
Close()
Reset()
Integer CurrentLevel()
GoTo(TreePath)
Skip(Count)

Parsing - http://en.wikipedia.org/wiki/Parsing
XML Pull Parsing - http://www.xmlpull.org
XML Document Object Model - http://www.w3.org/XML/DOM