Internet-Draft XBUP September 2023
Hajda Expires 25 March 2024
Workgroup:
(NONE)
Internet-Draft:
draft-ietf-exbin-xbup-core-00
Published:
Intended Status:
Experimental
Expires:
Author:
M. Hajda
ExBin Project

Extensible Binary Universal Protocol (XBUP)

Abstract

The Extensible Binary Universal Protocol (XBUP) is a general-purpose binary data protocol and file format with a primary focus on data abstraction and data transformation.

This proposal describes the specification of the currently developed prototype version, an example set of basic data types, and the recommended API.

The protocol is part of the ExBin Project, which aims to provide a proof-of-concept implementation and support for a wider set of functionality.

NOTICE: This is not an official or finished document, and it is not yet enrolled in any official track to be registered as an IETF RFC.

Contributing

This document is being worked on by the ExBin Project and is published here in order to gather comments and to raise interest in the project.

To participate in the development of this project, visit https://xbup.exbin.org/?participate.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 25 March 2024.

1. Introduction

The Extensible Binary Universal Protocol (XBUP) is a prototype of a general-purpose multi-layer binary data protocol and file format with a primary focus on abstraction and data transformation.

Key features:

Secondary features include some capabilities inspired by markup languages like SGML/XML [XML], data representation languages like YAML [YAML] and JSON [RFC4627], and similar binary formats like ASN.1 [ASN.1], HDF5 [HDF5], Efficient XML [EfficientXML] or Protocol Buffers [ProtoBuf].

1.1. Goals

The primary goal of this project is to create a communication protocol / data format with the following characteristics, ordered by priority:

  • Universal - Capable of representing any type of data, suitable for a wide range of uses including streaming, long-term storage and parallel access
  • Independent - Not tightly linked to a particular spoken language, product, company, processing architecture or programming language
  • Declarative - Self-sufficient for data type definition and with the ability to build data types by combining existing ones
  • Normative - Providing a reference form for data representation
  • Flexible - Support for data transformations, compatibility and extensibility handling
  • Efficient - Effective data compacting / compression support for plain binary and structured data

1.2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

The term "byte" is used in its now-customary sense as a synonym for "octet", a sequence of 8 bits.

2. XBUP Specification

XBUP is a multi-layer protocol for representing data in a bit/byte stream provided by other protocols, file data, etc. Each layer is built on top of the previous layer and provides new capabilities, such as new constraints and/or features. Higher levels can also retrospectively declare entities used in lower levels.

Applications can choose to support the XBUP protocol only up to a specific layer when full support is not necessary.

Layers are indexed as levels by depth starting with level 0.

Layers of the protocol

Table 1: Layers
  Level  Layer
  -----  ---------------
  0      Tree Structure
  1      Type System
  2      Transformations
  3      Relations

2.1. Level 0: Tree Structure

The lowest level of the protocol defines a basic tree structure using two primitive types:

  • UBNumber-encoded value
  • Blob - Sequence of bits (bytes) with unspecified length or with length specified by some attribute

A sequence of these primitive types forms a block. A single block represents a node of the tree and can contain child blocks, which represent child nodes.

2.1.1. UBNumber Encoding

UBNumber is an encoding that combines unary and binary encoding with a varying length of units of bits (octets). It typically represents a natural (non-negative integer) number, or a value of any other type with a deterministic mapping to a well-ordered / countably infinite set. The encoding is applied recursively when the unary part fills all bits of the first byte.

This is similar to other variable-length encodings, for example the one used in UTF-8 [RFC3629].

To decode a value, the leading non-zero bits of the first byte are counted to obtain the length (up to 8 bits). The remaining bits of the first byte, followed by an additional sequence of n bytes where n equals the length, are used as the value. The value is also shifted so that there is only one code for each number. For the byte value 0xFF, which corresponds to length 8, an additional UBNatural value follows next; this new value contains an additional length value.

Examples of UBNatural codes, written as sequence of bits = represented value as a natural (non-negative integer) number:

  0 0000000                                 = 0
  0 0000001                                 = 1
  0 0000010                                 = 2
  0 0000011                                 = 3
  ...
  0 1111111                                 = 7Fh = 127
  10 000000 | 00000000                      = 80h = 128
  10 000000 | 00000001                      = 81h = 129
  ...
  10 111111 | 11111111                      = 407Fh = 16511
  110 00000 | 00000000 | 00000000           = 4080h = 16512
  ...
  11111110 11111111 .. 11111111             = 10204081020407Fh
           \_____ 7 times ____/
  ...
  11111111 00000000 00000000 .. 00000000    = 102040810204080h
           \+len 0/ \_____ 8 times ____/
  ...
  11111111 00000001 00000000 .. 00000000    = 10102040810204080h
           \+len 1/ \_____ 9 times ____/
Figure 1: UBNatural example codes and values
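
The decoding steps described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name decode_ubnatural is hypothetical, and the handling of the 0xFF prefix (8 plus the additional length in following bytes) is an assumption inferred from the last rows of Figure 1.

```python
def decode_ubnatural(data: bytes, pos: int = 0):
    """Decode one UBNatural code starting at data[pos].

    Returns a tuple (value, next_pos).
    """
    first = data[pos]
    pos += 1

    # Unary part: count the leading 1 bits of the first byte (0..8).
    length = 0
    while length < 8 and first & (0x80 >> length):
        length += 1

    if length < 8:
        # Binary part: the low (7 - length) bits of the first byte,
        # followed by `length` further bytes.
        raw = first & (0x7F >> length)
        count = length
        # Shift so that every number has exactly one code: skip all
        # values already covered by shorter codes.
        offset = sum(1 << (7 * k) for k in range(1, length + 1))
    else:
        # 0xFF prefix: a nested UBNatural carries the additional length.
        # ASSUMPTION: 8 + extra bytes follow, as the last rows of
        # Figure 1 suggest.
        extra, pos = decode_ubnatural(data, pos)
        raw = 0
        count = 8 + extra
        offset = sum(1 << (7 * k) for k in range(1, 9))
        offset += sum(1 << (8 * (8 + k)) for k in range(extra))

    if pos + count > len(data):
        raise ValueError("Unexpected end of data")
    for byte in data[pos:pos + count]:
        raw = (raw << 8) | byte
    return raw + offset, pos + count
```

For example, decode_ubnatural(bytes([0x80, 0x01])) returns the pair (129, 2), matching the 81h row of Figure 1.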

Mappings other than natural numbers can also be used with the UBNumber encoding to represent different values. For level 0, the following two mappings are used:

  • UBNatural - Uses the value from the basic UBNumber mapping directly, as listed above
  • UBENatural - The value 7Fh is reserved for the infinity constant and higher codes are shifted by one
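
The UBENatural mapping can be sketched on top of an already decoded UBNatural value. The function names and the use of math.inf as the infinity constant are illustrative assumptions; the sketch reads "shifted by one" as code 80h representing the value 7Fh.

```python
import math

INFINITY_CODE = 0x7F  # code reserved for the infinity constant


def ubenatural_from_code(code: int):
    """Map a basic UBNatural value to its UBENatural meaning."""
    if code == INFINITY_CODE:
        return math.inf
    # Codes above 7Fh are shifted down by one.
    return code - 1 if code > INFINITY_CODE else code


def code_from_ubenatural(value):
    """Inverse mapping: UBENatural value to basic UBNatural value."""
    if value == math.inf:
        return INFINITY_CODE
    # Values from 7Fh upward are shifted up by one.
    return value + 1 if value >= INFINITY_CODE else value
```

With this reading, every natural number and the infinity constant still have exactly one code.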

2.1.2. Document

A single document is typically represented as a single block, but data after this block are also considered part of the document.

To store a document in a file / file system, or in data streams where the protocol version was not negotiated beforehand, an additional "Document header" should be present.

The document header contains information about the protocol version. For the current version 0.2 of the protocol, it is a 6-byte data blob. The explanation of each value is informative only; the primary use of the leading bytes is padding to help systems that use the beginning of a file to identify the file type.

Document header with hexadecimal values:

Structure of file header

Table 2: Document Header Bytes
  Byte  Content
  ----  ------------------------------------
  FE    Unary encoded size of cluster (byte)
  00    Reserved for future versions
  58    ASCII constant 'X'
  42    ASCII constant 'B'
  00    UBNatural encoded major version
  02    UBNatural encoded minor version

The primary block, called the "Root Block", follows the header. Any data after the root block form an optional data blob called "Tail Data".
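
As an illustration, a reader might check the document header from Table 2 before parsing the root block. The sketch below is an assumption-laden example: the names XBUP_HEADER and split_header are hypothetical, and only single-byte version codes (values below 80h) are handled.

```python
# Header bytes for protocol version 0.2, as listed in Table 2.
XBUP_HEADER = bytes([0xFE, 0x00, 0x58, 0x42, 0x00, 0x02])


def split_header(stream: bytes):
    """Return ((major, minor), remaining_bytes) for a stored document."""
    if len(stream) < 6 or stream[:4] != XBUP_HEADER[:4]:
        raise ValueError("Corrupted or missing header")
    major, minor = stream[4], stream[5]
    # Simplification: multi-byte UBNatural version codes are not handled.
    if major >= 0x80 or minor >= 0x80:
        raise NotImplementedError("multi-byte version codes")
    return (major, minor), stream[6:]
```

The remaining bytes start with the root block, optionally followed by tail data.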

2.1.3. Block

A block specifies how a byte sequence is encoded into, and decoded from, a sequence of blobs or child blocks, and how the block's own size is determined.

Each block starts with the value:

  UBNatural attributePartSize

The attributePartSize value must not equal 0 for a block, as the value 0 is used for termination handling (see below).

The block continues with the attribute part, which is a blob of the length in bytes specified by attributePartSize.

First value in attribute part represents:

  UBENatural dataPartSize

The rest of the data in the attribute part (if any) is interpreted as a sequence of attribute values, each encoded in any UBNumber encoding. A binary blob called the data part follows the attribute part; it is optional and can be empty.

If the dataPartSize value fills exactly the whole space of the attribute part (there are no more attributes in the attribute part), the block is called a "Data Block"; otherwise the block is called a "Node Block".

After the data part, the block ends.

  +-------------------------------------+
  | == Block ========================== |
  |                                     |
  | UBNatural attributePartSize         |
  +-------------------------------------+
  | == Attribute part ================= |
  |                                     |
  | UBENatural dataPartSize             |
  | UBNumber attribute 1                |
  | ...                                 |
  | UBNumber attribute n                |
  +-------------------------------------+
  | == Data part (optional) =========== |
  |                                     |
  | Single data blob or child blocks    |
  +-------------------------------------+

Effectively, transferred data are represented as a sequence of attributes and child blocks or a data blob, while the attributePartSize and dataPartSize values are present for structural purposes.
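
The block layout above can be sketched as a small recursive parser. This is a simplified illustration under stated assumptions: the names Block, read_ubnatural, read_ubenatural and parse_block are hypothetical, attributes are read with the basic UBNatural mapping, the 0xFF length prefix is not handled, infinity is represented as None, and terminated data blocks are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def read_ubnatural(buf: bytes, pos: int) -> Tuple[int, int]:
    # Simplified UBNatural reader: prefixes of up to seven 1 bits only.
    first = buf[pos]
    length = 0
    while first & (0x80 >> length):
        length += 1
        if length == 8:
            raise NotImplementedError("0xFF prefix is not handled here")
    if pos + 1 + length > len(buf):
        raise ValueError("Unexpected end of data")
    raw = first & (0x7F >> length)
    for byte in buf[pos + 1:pos + 1 + length]:
        raw = (raw << 8) | byte
    offset = sum(1 << (7 * k) for k in range(1, length + 1))
    return raw + offset, pos + 1 + length


def read_ubenatural(buf: bytes, pos: int) -> Tuple[Optional[int], int]:
    # Code 7Fh means infinity (None here); higher codes shift down by one.
    code, pos = read_ubnatural(buf, pos)
    if code == 0x7F:
        return None, pos
    return (code - 1 if code > 0x7F else code), pos


@dataclass
class Block:
    attributes: List[int] = field(default_factory=list)
    data: bytes = b""
    children: List["Block"] = field(default_factory=list)


def parse_block(buf: bytes, pos: int = 0) -> Tuple[Block, int]:
    attr_size, pos = read_ubnatural(buf, pos)
    if attr_size == 0:
        raise ValueError("Unexpected terminator")
    attr_end = pos + attr_size
    data_size, pos = read_ubenatural(buf, pos)
    block = Block()
    while pos < attr_end:
        value, pos = read_ubnatural(buf, pos)
        block.attributes.append(value)
    if pos != attr_end:
        raise ValueError("Attribute overflow")
    if not block.attributes:  # data block: the data part is a blob
        if data_size is None:
            raise NotImplementedError("terminated data blocks are omitted")
        block.data = buf[pos:pos + data_size]
        return block, pos + data_size
    if data_size is None:  # node block ended by a zero byte terminator
        while buf[pos] != 0:
            child, pos = parse_block(buf, pos)
            block.children.append(child)
        return block, pos + 1
    data_end = pos + data_size
    while pos < data_end:  # node block with a fixed size data part
        child, pos = parse_block(buf, pos)
        block.children.append(child)
    if pos != data_end:
        raise ValueError("Block overflow")
    return block, pos
```

Calling parse_block(bytes.fromhex("020366020077")) yields a node block with attribute 0x66 and one child block with attribute 0x77, which corresponds to the last example in Section 3.2.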

See examples of blocks (Section 3.2).

2.1.4. Node Block

When there is at least one attribute value in the attribute part, the block is called a node block. Data in the data part are interpreted as a sequence of (child) blocks.

The data part has the length in bytes specified by the dataPartSize value. If dataPartSize equals infinity, the sequence of child blocks can be infinite or can be ended by a terminator (a single zero byte).

If there are no child blocks, the node block is also called a leaf block.

2.1.5. Data Block

When there is no attribute value in the attribute part, the block is called a data block. Data in the data part are interpreted as a binary blob.

The data part has the length in bytes specified by the dataPartSize value. If dataPartSize equals infinity, the data part is processed byte by byte and each zero value is used as an escape code, where the directly following byte means:

  • Value 0 denotes end of the data part
  • Value 1 to 255 denotes sequence of zero bytes of given count and processing continues

If there is no data in the data part, the data block is also called an empty block.
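
The zero-byte escape scheme for data parts of infinite size can be sketched as a pair of helper functions. The function names are hypothetical and the sketch covers only the data part itself, not the surrounding block.

```python
def encode_terminated_data(data: bytes) -> bytes:
    """Escape zero bytes and append the end marker (00 00)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0:
            out.append(data[i])
            i += 1
            continue
        # Collect a run of up to 255 zero bytes into one escape pair.
        run = 0
        while i < len(data) and data[i] == 0 and run < 255:
            run += 1
            i += 1
        out += bytes([0, run])
    out += b"\x00\x00"
    return bytes(out)


def decode_terminated_data(buf: bytes, pos: int = 0):
    """Return (data, next_pos) for a terminated data part at buf[pos]."""
    out = bytearray()
    while True:
        byte = buf[pos]
        pos += 1
        if byte != 0:
            out.append(byte)
            continue
        count = buf[pos]  # byte following the escape code
        pos += 1
        if count == 0:
            return bytes(out), pos
        out += bytes(count)  # `count` zero bytes
```

An empty data part is encoded as the two bytes 00 00, which matches the "Terminated empty data block" example in Section 3.2.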

2.1.6. Validity

A binary stream is correctly structured as an XBUP document (well-formed) if the following conditions are met. A description of the invalid state is included in parentheses for each condition.

  • Optional: The stream header must be present (Corrupted or missing header)
  • Optional: The header version must be in the supported range (Unsupported version)
  • In each block, the end of the last attribute corresponds to the end of the attribute part (Attribute overflow)
  • In each block, the end of the last subblock/child block corresponds to the end of the data part (Block overflow)
  • The terminal block is present only in blocks where it belongs (Unexpected terminator)
  • The end of the file / data stream is not reached before the end of the root block (Unexpected end of data)

2.1.7. Summary

To sum up, data in the protocol are structured as a tree of blocks.

  • A block is either a data blob or a finite sequence of attributes and child blocks
  • A block can have a specified size - this allows block processing to be skipped, but also requires the size of the block to be known in advance when encoding
  • For a block with unknown size, it is possible to use infinity size + termination, or the sequence of child blocks could never end
  • A data block has no attributes, so it either has to be wrapped or its meaning should be understandable from the content / context

In theory, this should provide sufficient capability to represent any data when an encoding to a blob is available. More complex types can be either constructed using a deeper tree structure or compacted into a binary blob, but it should be possible to derive the data type via transformation to basic data elements when needed.

2.2. Level 1: Block Types

Level 1 introduces block types, a way to specify the type of a block, and a catalog of types. The approach is somewhat similar to XML Namespaces [XMLNamespaces].

From this level on, if an attribute is defined but not present, its value is considered to be the zero code of the UBNumber encoding.

2.2.1. Block Type

First two attributes in node block are interpreted as follows:

  UBNatural - TypeGroup
  UBNatural - BlockType
Figure 2: Block type attributes

These two values determine block type. Block types are organized into groups where TypeGroup value specifies to which group block type belongs and BlockType value specifies particular block type in the corresponding group.
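
A small sketch of how a level 1 reader might obtain the block type from the attributes of a node block, treating missing attributes as zero as described in Section 2.2. The helper names are hypothetical.

```python
from typing import Sequence, Tuple


def attribute(attributes: Sequence[int], index: int) -> int:
    """Return the attribute at `index`, or 0 when it is not present."""
    return attributes[index] if index < len(attributes) else 0


def block_type(attributes: Sequence[int]) -> Tuple[int, int]:
    """Return the (TypeGroup, BlockType) pair of a node block."""
    return attribute(attributes, 0), attribute(attributes, 1)
```

For instance, a node block whose attribute part contains only the single attribute 5 would be read as TypeGroup 5, BlockType 0.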

TypeGroup with value 0 is always the basic built-in group (it cannot be overridden). Basic blocks provide the ability to specify the meaning of other groups via block type declarations, definitions, or links to the catalog or an external source.

2.2.2. Type Context

For each block, there is a type context which maps a particular block type (as defined above) to a particular declaration/definition (similar to the XML Namespaces context). The context is the same for a block and all its children, except for the "Document Declaration" block, which is used to change the context.

The range of groups and the range of blocks for each group are specified.

2.2.3. Block Type Definition

A block type is defined as a finite sequence of operations, where each operation defines one or more attributes and/or child blocks. An operation can refer to built-in or previously defined types, or to no type (for attribute and any). There are variants for a singular item and for a list of items, 8 operations in total:

  • Single block - Single child block of any type.
  • Single attribute - Single attribute of any type.
  • Consist of definition - Single child block of referred type (as a component/element).
  • Append definition - Appends all attributes and all child blocks of referred type.
  • List of blocks - One attribute of type UBENatural to define count of blocks of any type and child blocks of that count. When count equals infinity, list of blocks ends with empty block.
  • List of attributes - One attribute of type UBNatural to define count of attributes of any type and attributes of that count.
  • List of consist of definitions - One attribute of type UBENatural to define count of blocks of referred type and child blocks of that count. When count equals infinity, list of blocks ends with empty block.
  • List of appended definitions - Appends one attribute of type UBNatural to define count of blocks of defined type and appends all attributes and all child blocks of referred type of that count.

The following syntax is used in this document (the final syntax has not been decided yet):

  any - Single block
  attribute - Single attribute
  Block_type_name - Consist of definition
  +Block_type_name - Append definition

  []any - List of blocks
  []attribute - List of attributes
  []Block_type_name - List of consist of definition
  +[]Block_type_name - List of append definition
Figure 3: Block type attributes
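
To make the eight operations concrete, the provisional syntax of Figure 3 can be parsed into a small data structure. This is only a sketch: the Operation class, its field names and the kind labels are hypothetical and not part of the protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Operation:
    kind: str  # "block", "attribute", "consist" or "append"
    is_list: bool
    type_name: Optional[str] = None  # referred block type, if any


def parse_operation(text: str) -> Operation:
    """Parse one item written in the syntax of Figure 3."""
    text = text.strip()
    append = text.startswith("+")
    if append:
        text = text[1:]
    is_list = text.startswith("[]")
    if is_list:
        text = text[2:]
    if not text:
        raise ValueError("missing type name")
    if text in ("any", "attribute"):
        if append:
            raise ValueError("'+' requires a referred block type")
        return Operation("block" if text == "any" else "attribute", is_list)
    return Operation("append" if append else "consist", is_list, text)
```

A block type definition would then be an ordered list of such operations, one per line of the definition.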

From the abstract point of view (more about abstraction (Section 3.3)), a type definition is simply an ordered list of child singular types or sets of child types, including an infinite number of them.

At the same time, data definitions are similar to the table column definitions used in relational databases, except that an infinite number of items is also supported.

2.2.4. Basic Blocks Definition

The following blocks are defined as the built-in group 0, but are also defined in the catalog.

2.2.4.1. Unspecified (0)

This block is used for unspecified block values or data padding. Can be used to represent nil / null values.

2.2.4.2. Document Declaration (1)

The declaration block determines the allowed range of groups. This block should be located at the beginning of each file if the application does not provide any static/special meaning, but it can also be used anywhere inside the document.

  +Natural groupsCount - The number of allocated groups
  +Natural preserveGroups - The number of groups to keep from
    previous declarations
  FormatDeclaration formatDeclaration - Declaration of format
  Any documentRoot - Root node of document
Figure 4: Document Declaration

For subblocks of this block, the permitted group values lie in the interval preserveGroups + 1 .. preserveGroups + groupsCount + 1. If preserveGroups = 0, the interval starts at the highest not yet reserved group in the current or parent blocks + 1. If all values are zero, applying the rule of cutting trailing zeros makes the block coincide with a data block.

2.2.4.3. Format Declaration (2)

A format declaration allows the use of either a declaration from the catalog, a local format definition, or both.

  +CatalogFormatSpecPath catalogFormatSpecPath - Specification
    of format defined as path in catalog
  +Natural formatSpecRevision - Specification's revision number
  FormatDefinition formatDefinition
Figure 5: Format Declaration

2.2.4.4. Format Definition (3)

This block allows the basic structure of a format specification to be specified. It specifies the sequence of parameters using either the join or the consist operation.

  Any[] formatParameters - Join or Consist format parameters
  +RevisionDefinition[] revisions
Figure 6: Format Definition

2.2.4.5. Format Join Parameter (4)

Join parameter for format definition.

  +FormatDeclaration formatDeclaration
Figure 7: Format Join Parameter

2.2.4.6. Format Consist Parameter (5)

Consist parameter for format definition.

  +GroupDeclaration groupDeclaration
Figure 8: Format Consist Parameter

2.2.4.7. Group Declaration (6)

A group declaration allows the use of either a declaration from the catalog, a local group definition, or both.

  +CatalogGroupSpecPath catalogGroupSpecPath - Specification
    of group defined as path in catalog
  +Natural groupSpecRevision - Specification's revision number
  GroupDefinition groupDefinition
Figure 9: Group Declaration

2.2.4.8. Group Definition (7)

This block allows the basic structure of a group specification to be specified. It specifies the sequence of parameters using either the join or the consist operation.

  Any[] groupParameters - Join or Consist group parameters
  +RevisionDefinition[] revisions
Figure 10: Group Definition

2.2.4.9. Group Join Parameter (8)

Join parameter for group definition.

  +GroupDeclaration groupDeclaration
Figure 11: Group Join Parameter

2.2.4.10. Group Consist Parameter (9)

Consist parameter for group definition.

  +BlockDeclaration blockDeclaration
Figure 12: Group Consist Parameter

2.2.4.11. Block Declaration (10)

A block declaration allows the use of either a declaration from the catalog, a local block definition, or both.

  +CatalogBlockSpecPath catalogBlockSpecPath - Specification
    of block defined as path in catalog
  +Natural blockSpecRevision - Specification's revision number
  BlockDefinition blockDefinition
Figure 13: Block Declaration

2.2.4.12. Block Definition (11)

This block allows the basic structure of a block specification to be specified. It specifies the sequence of parameters using either the join, consist, list join or list consist operation.

  Any[] blockParameters - Join or Consist or List Join or List
    Consist block parameters
  +RevisionDefinition[] revisions
Figure 14: Block Definition

2.2.4.13. Block Join Parameter (12)

Join parameter for block definition.

  +BlockDeclaration blockDeclaration
Figure 15: Block Join Parameter

2.2.4.14. Block Consist Parameter (13)

Consist parameter for block definition.

  +BlockDeclaration blockDeclaration
Figure 16: Block Consist Parameter

2.2.4.15. Block List Join Parameter (14)

List join parameter for block definition.

  +BlockDeclaration blockDeclaration
Figure 17: Block List Join Parameter

2.2.4.16. Block List Consist Parameter (15)

List consist parameter for block definition.

  +BlockDeclaration blockDeclaration
Figure 18: Block List Consist Parameter

2.2.4.17. Revision Definition (16)

A revision defines the parameter count for a particular specification definition.

  +Natural parametersCount
Figure 19: Revision Definition

2.2.5. Main Catalog

To specify basic data types, a catalog of block type definitions is established.

The catalog is structured as a tree of definitions, where each block type has a unique identifier (a sequence of natural numbers). Tree nodes are denoted by ownership base and are supposed to follow a pattern similar to internet domain names.

In addition to block, group and format specifications, the catalog can contain basically any other data, which will be properly specified on further protocol levels, for example:

  • Name of the type in multiple languages
  • Documentation for given type
  • Icon
  • Author / ownership
  • Custom viewer/editor

For basic access, the catalog should be accessible as a single document stored in XBUP format.

2.2.6. Additional Catalogs

Additional catalogs can be addressed from external sources.

2.3. Level 2: Transformations

In general, a block transformation is a data flow from one block type to another block type (more about abstraction (Section 3.3)). Transformations can be used for multiple tasks and can cover various operations on data.

This level introduces the capability to define transformations in the catalog and to automatically perform conversions between blocks.

Protocol processing is based on the broad concept of the dataflow paradigm, which typically states that there are input data, an operation and output data.

An additional requirement here is that the operation must be deterministic (for the same input it returns the same output), but other than that it can be run in any manner - from a local function in memory up to a remote process in the cloud.

Transformations can be also used for:

  • Paging
  • Compression
  • Encryption
  • Specify operation between multiple blocks
  • TODO

Additional properties can be specified for the transformation, like for example:

  • Time complexity
  • Space complexity
  • ...

TODO

2.3.1. Automatic Conversion

Support for transformations is used for automatic conversion of data when applications access the data with tools supporting this level of the protocol.

Typically, an application requests data to be sent in a specific format which it can process, from a system service or a providing library, and the data are converted to the requested form.

Depending on the access method, transformations can be provided uni- or bidirectionally. The processing service can also handle additional requirements for combinations of various conversions.

The general policy is to allow any type of data to be included alongside the main required type, even when data are in a transformed state; therefore it is still possible to include data outside the current specialized form for universal storage.

2.3.2. Paging

Support for basic data paging is available in the basic catalog. Paging is solved using a single data blob which is split into pages of the same size. Either the full block structure can be stored in such a way that each block starts in a new page, or specific behavior can be defined via an algorithm.

2.4. Level 3: Ontologies

The following level can additionally specify more about the meaning of the data:

  • Restrict number of items in list
  • Restrict type of any type
  • Specify restricted document structure
  • Restrict allowed transformations
  • Specify relations between blocks

This level introduces entities and relations to the catalog.

2.5. Data Types

Following section defines various data types considered for specification in catalog.

Typically, there exists an automatic transformation between the types in each group, either full or with some exceptions.

2.5.1. Boolean

For a boolean logical value, the typical entities "True" and "False" are declared.

A boolean can also be stored as an attribute with value 0/1 or as 0/1 in a blob value.

TODO

2.5.1.1. UBBoolean

Basic variant using single attribute to store 0 or 1 for false/true.

  +Natural value
Figure 20: UBBoolean Definition

2.5.1.2. DataBoolean

Variant using a data blob to store a single byte 0 or 1 for false/true. When compacting, a single bit could actually be used.

  Blob value
Figure 21: DataBoolean Definition

2.5.2. Natural Number

Natural numbers represent non-negative integer values, also called unsigned integers.

The natural type is also used as the primary mapping for the UBNumber encoding.

The value can be stored as a single attribute or as a blob value.

Alternatively, the value can be limited to a specific maximum or blob length, typically specified in bits, for example a natural value in 16, 24, 32 or 64 bits, possibly even with swapped parts (endianness etc.).

TODO

2.5.3. Integer Number

Integer value extends range to all integer values including negative values.

An integer can be stored using the UBNumber encoding in two's complement form.

The value can be stored as a single attribute or as a blob value.

Alternatively, the value can be limited to a specific minimum and maximum or blob length, typically specified in bits, for example an integer value in 16, 24, 32 or 64 bits, possibly even with swapped parts (endianness etc.).

TODO

2.5.4. Real Number

Real numbers have a fractional part. They are also called float or double.

The basic supported form uses two integer attributes, one to represent the base and the other for the mantissa. This allows any real number of finite precision to be stored.

An alternative type uses [IEEE.754.1985] stored in a blob.

TODO

2.5.4.1. UBReal

Basic variant using two UBInteger attributes to represent any real number with finite binary fraction.

  +UBInteger base
  +UBInteger mantissa
Figure 22: UBReal Definition

To eliminate redundancy, a method of adding an invisible bit before the decimal point is used, with an extra decrement for the zero value.

  if (Base = 0 and Mantissa = 0) then Value := 0 else {
    Value := (Base * 2 + 1) * (2 ^ Mantissa)
    if (Base > 0 and Mantissa = 0) then Value := Value - 2
  }
Figure 23: UBReal algorithm

  ...
  (10)111111 11111111  (0)0000000          = -81h
  (0)1000000  (0)0000000                   = -7Fh
  (0)1000001  (0)0000000                   = -7Dh
  ...
  (0)1111110  (0)0000000                   = -3
  (0)1111111  (0)0000000                   = -1
  (0)0000000  (0)0000000                   = 0 (1)
  (0)0000001  (0)0000000                   = 1 (3)
  (0)0000010  (0)0000000                   = 3 (5)
  ...
  (0)0111111  (0)0000000                   = 7Dh (7Fh)
  (10)000000 00000000  (0)0000000          = 7Fh (81h)
  ...
Figure 24: UBReal example codes and values

Examples with non-zero mantissa:

  (0)1111111  (0)0000001                   = -2
  (0)0000000  (0)0000001                   = 2
  (0)0000001  (0)0000001                   = 6
  (0)0000010  (0)0000001                   = 10
  (0)0000000  (0)0000010                   = 4
  (0)0000000  (0)0000011                   = 8
  (0)0000000  (0)1111111                   = 0.5
  (0)0000001  (0)1111111                   = 1.5
Figure 25: UBReal example codes and values
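
The UBReal algorithm can be checked against the example values above with a short Python sketch. The function name ubreal_value is hypothetical; it takes the already decoded Base and Mantissa integers and uses exact fractions so that negative Mantissa values are handled without rounding.

```python
from fractions import Fraction


def ubreal_value(base: int, mantissa: int) -> Fraction:
    """Compute the represented value from the two UBInteger attributes."""
    if base == 0 and mantissa == 0:
        return Fraction(0)
    # (Base * 2 + 1) appends the invisible bit; 2 ^ Mantissa scales it.
    value = Fraction(base * 2 + 1) * Fraction(2) ** mantissa
    if base > 0 and mantissa == 0:
        # Extra decrement that makes room for the zero value.
        value -= 2
    return value
```

For example, ubreal_value(1, -1) gives 3/2, which corresponds to the last row of the examples with non-zero mantissa.
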
2.5.4.2. DataReal

Variant using data blob to store real numbers.

  Blob value
Figure 26: DataReal Definition

2.5.4.3. UBRatio

Variant of a real number with a fixed range, using a single UBNatural attribute to represent any real number with a finite binary fraction in the range [0, 1].

  +UBNatural value
Figure 27: UBRatio Definition

The following method of reverting the value is used to map a ratio to its code.

  Value := Input
  if not (Value=0 or Value=1) then (
    Value := Value + 1
    while (Value <> Trunc(Value)) do ( Value := Value * 2)
    Value := Trunc(Value/2) + 1
  )
Figure 28: UBRatio algorithm

  (0)0000000  0                          = 0     = 0
  (0)0000001  1                          = 1     = 1
  (0)0000010  0.1                        = 1/2   = 0.5
  (0)0000011  0.01                       = 1/4   = 0.25
  (0)0000100  0.11                       = 3/4   = 0.75
  (0)0000101  0.001                      = 1/8   = 0.125
  (0)0000110  0.011                      = 3/8   = 0.375
  (0)0000111  0.101                      = 5/8   = 0.625
  (0)0001000  0.111                      = 7/8   = 0.875
  (0)0001001  0.0001                     = 1/16  = 0.0625
  (0)0001010  0.0011                     = 3/16  = 0.1875
  (0)0001011  0.0101                     = 5/16  = 0.3125
  ...
Figure 29: UBRatio example codes and values
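
The mapping between ratios and codes listed above can be sketched in Python with exact fractions. The function names are hypothetical. The encoder keeps doubling until the value becomes an integer, which is the reading that reproduces the listed codes, and the decoder is an inverse derived for this illustration rather than taken from the protocol text.

```python
from fractions import Fraction


def ubratio_encode(value: Fraction) -> int:
    """Map a ratio in [0, 1] with a finite binary fraction to its code."""
    den = value.denominator
    if not 0 <= value <= 1 or den & (den - 1):
        raise ValueError("expected a binary fraction in the range [0, 1]")
    if value in (0, 1):
        return int(value)
    shifted = value + 1
    while shifted.denominator != 1:  # double until the value is an integer
        shifted *= 2
    return int(shifted) // 2 + 1


def ubratio_decode(code: int) -> Fraction:
    """Inverse of ubratio_encode (derived for this sketch)."""
    if code in (0, 1):
        return Fraction(code)
    odd = 2 * (code - 1) + 1  # restore the trailing 1 bit
    scale = odd.bit_length() - 1  # number of doublings done by the encoder
    return Fraction(odd, 1 << scale) - 1
```

For example, ubratio_encode(Fraction(3, 4)) returns 4 and ubratio_decode(11) returns 5/16, in agreement with the listed examples.
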
2.5.4.4. UBFixedPoint

Variant of a real number with fixed precision, simply stored as a UBInteger with a specific scaling. There can also be a non-negative variant using a UBNatural attribute.

  +UBInteger value
Figure 30: UBFixedPoint Definition

Values are simply multiplied by the scale, for example by the ratio 1/100.

  Value := Input * 0.01
Figure 31: UBFixedPoint algorithm

  (0)0000000                             = 0
  (0)0000001                             = 0.01
  (0)0000010                             = 0.02
  ...
Figure 32: UBFixedPoint example codes and values

2.5.5. String

Text string can be represented using various encodings.

Basic string type is using UTF-8 encoding by default.

An alternative type allows the encoding used to be specified using either the IANA MIME name or the encoding MIB index.

TODO

2.5.5.1. String

Basic UTF-8 encoded string stored as binary blob.

  Blob value
Figure 33: String Definition

2.5.5.2. Utf16String

Basic UTF-16 encoded string stored as binary blob.

  Blob value
Figure 34: UTF16String Definition

2.5.6. Time

Various types are defined to specify concrete date, time, timezone...

Types for time interval / range

TODO

2.5.7. URL - Uniform Resource Locator

Basic URL type is using string representation of the URL.

URL can be used to specify additional external catalogs.

TODO

2.5.8. Coordinates

Types to represent coordinates, like position on planet via latitude, longitude, altitude, elevation, rotation, GPS coordinates, distance.

2.6. Algorithms

Algorithms in the protocol are based on a data-flow concept similar to the one used for transformations. This allows algorithms to be defined in a wide range of paradigms, including functional, logical and imperative.

3. Appendixes

3.1. Appendix: Motivation

The project should provide a universal protocol as a more feature-rich alternative to currently used binary protocols. It should provide general methods for handling data of various forms and types including:

  • Multimedia files - Audio, video, animation, 3D
  • Serialization protocol - Provide ability to serialize non-structured data
  • Application API - Remote or local method call execution, supporting parameters and result passing and error handling
  • Filesystem structure - Allow to represent data in the form of filesystem or as a compressed archive
  • Huge data - Use dynamic numeric values to allow support for data in terabytes range or greater
  • Random access - Segmented, paged, fragmented data
  • Parallel processing - Atomicity, structural data for database representation
  • Indexes, error detection and data correction

From the user's point of view, the protocol should provide new capabilities or enable new development in various areas:

  • Browseable binary content - Provide the capability for viewing and editing data, including visual and graphical tools and textual tools with multiple available syntaxes and supported languages
  • Flexible modular applications - With the ability to provide both an independent API and a data interchange format, and with automatic transformation between the two, it should be possible to utilize the protocol to enhance the approach to modular application design
  • Comprehensive scientific protocol - With the multiple levels of expressiveness and the capability to define an unlimited number of additional properties, it should be possible to utilize the protocol for definition and storage of specialized scientific data
  • Strong building blocks - Provide well specified data representation and ability to construct even complex data structures from combining data type definitions from wide libraries
  • Long-term storage - Provide way to define data with external or integrated specification

3.2. Appendix: Examples of Blocks

Examples of blocks and how they are encoded using the XBUP protocol.

Fixed size node block with one attribute

Table 3: Stream Data
Byte Value
02 AttributePartSize
00 DataPartSize
77 Attribute 1 of value 0x77

Terminated node block with one attribute

Table 4: Stream Data
Byte Value
02 AttributePartSize
7F DataPartSize
05 Attribute 1
00 Terminator

Fixed size data block

Table 5: Stream Data
Byte Value
01 AttributePartSize
01 DataPartSize
BB One byte of data 0xBB

Terminated empty data block

Table 6: Stream Data
Byte Value
01 AttributePartSize
7F DataPartSize
00 Data block escape
00 Termination value

Fixed size block with one child

Table 7: Stream Data
Byte Value
02 AttributePartSize
03 DataPartSize
66 Attribute 1 of value 0x66
02 AttributePartSize
00 DataPartSize
77 Attribute 1 of value 0x77
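
The fixed size examples above can be decoded with a short recursive routine. The following Python code is only a minimal sketch and not part of the specification. It assumes, as inferred from Tables 3, 5 and 7, that AttributePartSize counts the DataPartSize field plus the attributes, that a block with no attributes is a data block, and that every value fits into a single byte below 0x7F. Terminated blocks (DataPartSize 7F) and multi-byte UBNumber values are deliberately rejected. The names Block and parse_fixed_block are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Decoded block: a node block (attributes, children) or a data block (data)."""
    attributes: List[int] = field(default_factory=list)
    data: bytes = b""
    children: List["Block"] = field(default_factory=list)


def parse_fixed_block(buf: bytes, pos: int = 0) -> Tuple[Block, int]:
    """Parse one fixed size block at pos; return the block and the next position.

    Assumption: all sizes and attributes are single-byte values below 0x7F.
    """
    if pos + 1 >= len(buf):
        raise ValueError("truncated block header")
    attr_part_size = buf[pos]
    data_part_size = buf[pos + 1]
    if attr_part_size == 0 or attr_part_size >= 0x7F or data_part_size >= 0x7F:
        raise ValueError("only fixed size blocks with one-byte sizes are supported")

    # The attribute part starts right after AttributePartSize and
    # includes the DataPartSize byte itself.
    attr_end = pos + 1 + attr_part_size
    data_end = attr_end + data_part_size
    if data_end > len(buf):
        raise ValueError("block exceeds available bytes")

    attributes = list(buf[pos + 2:attr_end])
    if any(value >= 0x80 for value in attributes):
        raise ValueError("multi-byte attribute values are not supported")

    block = Block(attributes=attributes)
    if attr_part_size == 1:
        # No attributes: assumed to be a data block carrying raw bytes.
        block.data = buf[attr_end:data_end]
    else:
        # Node block: the data part is a sequence of child blocks.
        child_pos = attr_end
        while child_pos < data_end:
            child, child_pos = parse_fixed_block(buf, child_pos)
            block.children.append(child)
        if child_pos != data_end:
            raise ValueError("child blocks overrun the data part")
    return block, data_end


if __name__ == "__main__":
    # Table 7: fixed size block with one child
    parent, _ = parse_fixed_block(bytes.fromhex("020366020077"))
    print(parent)
```

Keeping the end position in the return value lets the same routine walk sibling blocks without copying the buffer.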

3.3. Appendix: Abstraction

The primary focus on abstraction makes this protocol somewhat different compared to other similar binary formats, which focus on efficiency, serialization or binary representation of a specific mark-up language. See Formats comparison (Section 3.5) for more information.

This protocol technically overlaps in functionality with many widely used protocols and formats, including those defined by various RFCs. It also has a somewhat different nature compared to the typically text-based Internet protocols currently used on higher layers. Therefore, it should be evaluated whether the potential advantages this protocol could provide outweigh its complexity and other possible issues; see [RFC3117] for design considerations.

With the primary focus on abstraction, data in the protocol are considered more as abstract entities than a specific method for data representation.

The catalog is then viewed more as a set of general entities with unique identifiers; in set theory terminology, it is a well-ordered countable set of items.

On level 1 of the protocol, some of the items have a specific meaning for the definition of types, and some are used to identify ownership and type definitions.

Level 2 introduces a transformation method item to define data conversion between two specific types (input and output), and various related items which allow additional properties of types and transformations to be specified.

Higher levels then define additional meanings of categories of items for additional relations and also introduce dynamic processes to generate them.

TODO

3.4. Appendix: Parsing

As with textual formats, parsing capability can be provided for a binary protocol using several approaches:

  • Object Model Parsing
  • Pull Parsing
  • Event Parsing
  • Hybrid Approaches

3.4.1. Level 0 Parsing

To process the level 0 protocol, the following four types of tokens are used:

  • begin (terminationMode flag)
  • attribute (UBNumber value)
  • data (Binary data)
  • end

The following simplified grammar can be used for token processing.

  Document ::= header + Block + data
  Block ::= begin + Attributes + Blocks + end | begin + data + end
  Blocks ::= Block + Blocks | epsilon
  Attributes ::= attribute + Attributes | epsilon
Figure 35: Simplified grammar
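
The Block rule of the grammar above maps directly onto a small recursive-descent matcher. The following Python code is a minimal sketch under stated assumptions, not a reference implementation: tokens are modeled as (kind, payload) tuples with the kinds "begin", "attribute", "data" and "end", and the header and trailing data of the Document rule are left out. The function name match_block is hypothetical.

```python
from typing import List, Tuple

# Hypothetical token model: (kind, payload)
Token = Tuple[str, object]


def match_block(tokens: List[Token], pos: int = 0) -> int:
    """Match one Block starting at pos and return the position after its end token.

    Block ::= begin + Attributes + Blocks + end | begin + data + end
    """
    if pos >= len(tokens) or tokens[pos][0] != "begin":
        raise ValueError(f"expected begin token at position {pos}")
    pos += 1

    if pos < len(tokens) and tokens[pos][0] == "data":
        # Second alternative: a data block carries exactly one data token.
        pos += 1
    else:
        # Attributes ::= attribute + Attributes | epsilon
        while pos < len(tokens) and tokens[pos][0] == "attribute":
            pos += 1
        # Blocks ::= Block + Blocks | epsilon
        while pos < len(tokens) and tokens[pos][0] == "begin":
            pos = match_block(tokens, pos)

    if pos >= len(tokens) or tokens[pos][0] != "end":
        raise ValueError(f"expected end token at position {pos}")
    return pos + 1


if __name__ == "__main__":
    # Token stream for a node block with one attribute and one child block
    stream = [
        ("begin", False),
        ("attribute", 0x66),
        ("begin", False),
        ("attribute", 0x77),
        ("end", None),
        ("end", None),
    ]
    assert match_block(stream) == len(stream)
```

Because each alternative of Block is distinguished by the token that follows begin, one token of lookahead is sufficient and no backtracking is needed.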

3.4.2. Level 1 Parsing

To process the level 1 protocol, the following five types of tokens are used:

  • begin (terminationMode flag)
  • type (block type)
  • attribute (UBNumber value)
  • data (Binary data)
  • end

The newly added type token identifies the type of a block. There are several ways to represent a type, and it is possible to convert between them:

  • Two attributes for groupId and blockId
  • Pointer to block type in current type context
  • Pointer to block type in main catalog
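
The first representation listed above can be illustrated by deriving a level 1 token stream from a level 0 one. The following Python code is only a sketch: it assumes that the first two attributes of a block carry groupId and blockId, in that order, and it passes blocks with fewer than two leading attributes through unchanged, which is a simplification made here and not a rule taken from this document. Tokens are modeled as (kind, payload) tuples, and the function name add_type_tokens is hypothetical.

```python
from typing import List, Tuple

# Hypothetical token model: (kind, payload)
Token = Tuple[str, object]


def add_type_tokens(tokens: List[Token]) -> List[Token]:
    """Fold the first two attributes after each begin token into one type token.

    The type token payload is a (groupId, blockId) pair (assumed ordering).
    """
    result: List[Token] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        result.append(token)
        i += 1
        if token[0] != "begin":
            continue
        has_two_attributes = (
            i + 1 < len(tokens)
            and tokens[i][0] == "attribute"
            and tokens[i + 1][0] == "attribute"
        )
        if has_two_attributes:
            group_id = tokens[i][1]
            block_id = tokens[i + 1][1]
            result.append(("type", (group_id, block_id)))
            i += 2
    return result


if __name__ == "__main__":
    level0 = [
        ("begin", False),
        ("attribute", 1),
        ("attribute", 2),
        ("attribute", 9),
        ("end", None),
    ]
    print(add_type_tokens(level0))
```

Resolving the (groupId, blockId) pair to a pointer in the current type context or in the main catalog would require catalog access and is outside the scope of this sketch.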

3.4.3. Level 2 Parsing

With support for transformations, an additional interface for requesting a specific transformation is available.

Typical parsing on this level is performed in such a way that specific block type ranges are requested for specific blocks, and parsers automatically provide the transformed data.

TODO

3.5. Appendix: Comparison to Other Formats

While there are various binary formats and markup languages available, this project aims to take a somewhat different approach to data representation.

  • While SGML, XML [XML] and related technologies were a huge inspiration for this project, it does not seem feasible to use them as a base for the binary variant due to the attribute vs. child tag duality and the use of Unicode strings as a primitive data type, in contrast to the countable set used by this project
  • Using a binary format is basically a necessity to make the protocol reasonably usable for universal data such as audio or video, even though text formats (for example JSON [RFC4627], YAML [YAML]) provide ease-of-use and readability advantages
  • Compared to the wide range of existing binary formats with a fixed block structure (for example RIFF), this project aims to provide more unified access to all data structures and their definitions
  • Compared to formats based on serialization of data primitives (for example Protocol Buffers [ProtoBuf], CBOR [RFC7049]), this project aims to provide a capability for data definitions which would make transmitting primitive types unnecessary
  • The multi-level approach should simplify and improve use compared to other dynamic binary formats (for example HDF5 [HDF5], ASN.1 [ASN.1] and EBML [EBML])

3.6. Appendix: User Interface

With a unified tree structure, it should be possible to provide a tool which can process any generic document encoded using the XBUP protocol.

The following capabilities should be implemented:

  • Show document as visual tree
  • Show document as text using various syntaxes (including editing)
  • Support catalog including external definitions
  • Support for transformations including working with data in transformed form

The aim here is to provide a comprehensive tool to view and edit documents on different levels, similar to what text editors provide for binary files representing text in typical encodings.

Additionally, support for multiple syntaxes should allow the syntax to evolve over time while the underlying abstract concepts remain the same, or it should be possible to adjust them via automatic transformations without being constrained by syntax compatibility.

TODO

4. IANA Considerations

In the current early stage of the protocol's development, only a basic media type for general files is defined: application/x-xbup

TODO

5. Security Considerations

Security has not been considered at the current stage of development.

6. Acknowledgements

TBD

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3629]
Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, <https://www.rfc-editor.org/info/rfc3629>.

7.2. Informative References

[RFC3117]
Rose, M., "On the Design of Application Protocols", RFC 3117, DOI 10.17487/RFC3117, <https://www.rfc-editor.org/info/rfc3117>.
[RFC4627]
Crockford, D., "The application/json Media Type for JavaScript Object Notation (JSON)", RFC 4627, DOI 10.17487/RFC4627, <https://www.rfc-editor.org/info/rfc4627>.
[RFC7049]
Bormann, C. and P. Hoffman, "Concise Binary Object Representation (CBOR)", RFC 7049, DOI 10.17487/RFC7049, <https://www.rfc-editor.org/info/rfc7049>.
[ASN.1]
International Telecommunication Union, "Information Technology -- ASN.1 encoding rules: Specification of Basic Encoding Rules (BER), Canonical Encoding Rules (CER) and Distinguished Encoding Rules (DER)", ITU-T Recommendation X.690, <https://www.itu.int/rec/T-REC-X.690>.
[YAML]
Ben-Kiki, O., Evans, C., and I. Net, "YAML Ain't Markup Language (YAML[TM]) Version 1.2, 3rd Edition", <https://www.yaml.org/spec/1.2/spec.html>.
[XML]
Bray, T., Paoli, J., Sperberg-McQueen, C. M., Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", W3C Recommendation REC-xml-20081126, <https://www.w3.org/TR/2008/REC-xml-20081126/>.
[XMLNamespaces]
Bray, T., Hollander, D., Layman, A., Tobin, R., and H. S. Thompson, "Namespaces in XML 1.0 (Third Edition)", W3C Recommendation REC-xml-names-20091208, <https://www.w3.org/TR/2009/REC-xml-names-20091208/>.
[EfficientXML]
Schneider, J., Kamiya, T., Peintner, D., and R. Kyusakov, "Efficient XML Interchange (EXI) Format 1.0 (Second Edition)", <https://www.w3.org/TR/2014/REC-exi-20140211/>.
[HDF5]
The HDF Group, "HDF5 File Format Specification Version 3.0", <https://support.hdfgroup.org/HDF5/doc/H5.format.html>.
[EBML]
Lhomme, S., Rice, D., and M. Bunkus, "Extensible Binary Meta Language", Work in Progress, draft-ietf-cellar-ebml, <https://datatracker.ietf.org/doc/draft-ietf-cellar-ebml/>.
[IEEE.754.1985]
Institute of Electrical and Electronics Engineers, "Standard for Binary Floating-Point Arithmetic".
[ProtoBuf]
Google, "Protocol Buffers", <https://developers.google.com/protocol-buffers/>.

Author's Address

Miroslav Hajda
ExBin Project