Concept: Data Structure
Protocol data structure concept describes how to store data in binary stream. Technically block-tree organization structure is used.
Declared Conditions
Requirements on the structure are derived from project's requirements with additional focus on compatibility and scalability of the implementation.
Requirements:
- Bit organization by clusters - As with the encoding of the numbers, it is required that the individual elements should be addressable by n-tuples of bits.
- Realization of any large data sequences - This requirement, similar to the requirement for encoding of indefinitely large numbers, is seeking to enforce to provide ability to code indefinitely large sequence of data blobs. These data can then have any meaning, but usually specified by metadata.
- Establishment of normal form - Mainly because of the possibility of comparing the two files it would be useful to provide ability to convert the stream to a standard plain form, which would be for example not compressed or encrypted.
- Processibility - If we want to process the data created by XBUP protocol, it should be possible to easily skip parts that are not relevant for our current needs (random access).
- Minimalism - Block should contain the basic form with as few information as possible. Other attributes should be optional, or should be expandable by adding new subblocks. This precludes, for example, the use of control checksums, such as the CRC, or MD5 Hash. It is also prohibited to use empty reservation areas for later use.
Block Structure
Target method should describe how to structure various data types to be stored as potentially infinite sequence of bits. Following concept only describes basic approach using block as a basic structure entity, which can include one or more blocks to estabilish actual tree structure.
Each block consist of two parts:
- Attribute Part - Sequence of UBNumber encoded values
- Data Part - Additional Sequence of bytes (blob)
First value in attribute part is:
UBNatural - AttribPartLength (APL)
This value limits the length of the remaining attribute part area and therefore size of UBNumber values sequence. AttribPartLength value = 0 means that there are no more UBNumber values in the block and that this is a termination block.
That the value is of the UBNatural type means that the data stream has to known not only the count but also space used by attributes onward. This can cause problems for large sequences of attributes. However, it is expected, that huge data values should be stored as data blocks or as a sequence or structure of smaller blocks.
Next value in the attribute part is:
UBENatural - DataPartLength (DPL)
This value specifies the length of data part as a sequence of bit clusters (size in bytes), which follows after attribute part and also includes infinity constant for unknown size.
Remaining values in attribute part are called attributes. If there is still remaining space for attributes, block is called node block and data part is then processed as a sequence of blocks. If there is no more space for attributes, block is called data block and data part is considered as blob of binary data.
Meaning of other attributes is not defined here and should not affect basic block-tree structure.
Node Block
Node block must have at least one attribute. Data part is processed as sequence of sub blocks, where this sequence must fit exactly to available space.
If node block have data part size set to infinity, sequence of sub blocks must be ended by termination block.
If node block have data part size set to 0, there are no subblocks and block is called leaf block.
Examples (hexadecimal codes used):
- Leaf block with fixed size and one attribute (3 bytes)
APL | DPL | Attribute |
---|---|---|
02 | 00 | 00 |
- Terminated leaf block with one attribute (4 bytes)
APL | DPL | Attribute | Termination Block |
---|---|---|---|
02 | 7F | 00 | 00 |
- Node block with one attribute and one child (6 bytes)
APL | DPL | Attribute | Child |
---|---|---|---|
02 | 03 | 00 | 02 00 00 |
Data Block
Data block is block which doesn't have any attributes.
If data block have data part size set to 0, it is called empty block.
If data block have data part size set to infinity, data part is processed using following algorithm:
Data are processed by bytes. When next byte has value 0, additional byte is read to be used for length of empty space.
If empty space length is 0, data block ends.
Otherwise these two bytes are passed as data as a sequence of zero bytes of given empty space length.
Examples (hexadecimal codes used):
- Empty block with fixed size (2 bytes)
APL | DPL |
---|---|
01 | 00 |
- Terminated empty block (4 bytes)
APL | DPL | Data |
---|---|---|
01 | 7F | 00 00 |
- Data block with fixed size and one byte of data (3 bytes)
APL | DPL | Data |
---|---|---|
01 | 01 | 00 |
File or Stream Structure
The structure of the document encoded using XBUP is structured as follows:
Document Header | ||||
| ||||
Extended Area |
Document Header
In addition to blocks there should be at the beginning of the stream a sequence of bits that would indicate the used method of coding. As mentioned in concept of numerical encoding, it is useful to at determine the size of the used cluster.
ClusterSize = FEh
This value is using unary encoding and allows to determine the ClusterSize value, but to match cluster, this value is interpreted as incremented by one, so that unary encoded value FEh = 7 means actually 8 bits long clustering (bytes). It must be introduced at the beginning of the file, since the encoding on of the following values depends on it.
For development purposes the header of the document includes several other data:
UBNatural - ProtocolVersion = 00h DWord(4xUBNatural) - ProtocolSignature = 58 42 00 XXh ("XB" + development version)
Protocol version 0 is reserved for the protocol development stage. The development version then specify particular structure of the document and any incompatible changes shall be reflected in a change to this value.
If the stream contains no other data, then it's called an empty document and is technically not valid. Otherwise, the data is processed as a single block, and data after that block are called extended area. This should allow the use of protocol to be usable as a header for generally infinite bitstream.
One reason to use header is that in some operating systems file name extension is not used to distinguish the type of file, but just the first bytes of the content are usually accessed. The header can be interpreted as a 32-bit identification number and 16-bit version number. It is assumed that the final version will have different document header.
Document Validity
Stream is created correctly (well-formed) if the following conditions holds. Names of exceptional invalid states are also listed for the case given condition is not met.
- In each block the end of last attribute corresponds to the end of the attribute part (Attribute Overflow)
- In each block the end of last subblock corresponds to the end of the data block part (Block Overflow)
- End of file is after the end of the root block (Unexpected End)
- The terminal block is present only in blocks where it belongs to (Unexpected Terminator)
Page Source