Project: XBUP-XML
This document is part of the eXtensible Binary Universal Protocol (XBUP) project documentation. It describes a proposal for rules for storing XML documents in the XBUP format.
Introduction
The aim of XBUP-XML is to create a set of standard rules for converting XML documents into the binary XBUP protocol form while preserving the semantic meaning of the document.
Motivation
XML is a general-purpose text format that can represent arbitrary data, with data blocks described by words or abbreviations in a chosen language. The text representation of markup has some drawbacks, however, especially in terms of performance and size. There have therefore been attempts to create binary XML variants that bring in the advantages of a binary form while avoiding these drawbacks. Although the objectives of the XBUP protocol are somewhat different, it should be possible to use it to represent an XML document in a useful binary form.
Principles
The proposal describes how to represent the various items of a text XML document while preserving the necessary information:
- Tree structure of tags
- Order of attributes within tags
- Names of tags and attributes
- Attribute values and text nodes
- Comments, processing instructions and other ancillary items
- Namespaces, DOCTYPE, XMLSchema, RelaxNG, Schematron, etc.
XML Data Encoding
The following variant is only an indicative sketch of a possible solution. Whitespace outside of elements is not processed.
XML has the following types of items, each listed with its permitted child items:
- Document – Element (maximum of one), ProcessingInstruction, Comment, DocumentType (maximum of one)
- DocumentFragment – Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- DocumentType – no children
- EntityReference – Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- Element – Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
- Attr – Text, EntityReference
- ProcessingInstruction – no children
- Comment – no children
- Text – no children
- CDATASection – no children
- Entity – Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- Notation – no children
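The containment rules above can be captured as a lookup table. The following is a minimal Python sketch and not part of the XBUP-XML proposal; the names `ALLOWED_CHILDREN`, `AT_MOST_ONE_IN_DOCUMENT` and `can_contain` are hypothetical and chosen only for illustration.

```python
# Allowed child item types per parent item type, transcribed from the list above.
_GENERAL = frozenset({
    "Element", "ProcessingInstruction", "Comment",
    "Text", "CDATASection", "EntityReference",
})

ALLOWED_CHILDREN = {
    "Document": frozenset({"Element", "ProcessingInstruction", "Comment", "DocumentType"}),
    "DocumentFragment": _GENERAL,
    "DocumentType": frozenset(),
    "EntityReference": _GENERAL,
    "Element": _GENERAL,
    "Attr": frozenset({"Text", "EntityReference"}),
    "ProcessingInstruction": frozenset(),
    "Comment": frozenset(),
    "Text": frozenset(),
    "CDATASection": frozenset(),
    "Entity": _GENERAL,
    "Notation": frozenset(),
}

# Child types that may appear at most once directly under a Document item.
AT_MOST_ONE_IN_DOCUMENT = frozenset({"Element", "DocumentType"})


def can_contain(parent: str, child: str) -> bool:
    """Return True if an item of type `parent` may have a child of type `child`."""
    return child in ALLOWED_CHILDREN[parent]
```

A converter could consult such a table to reject malformed trees before encoding them, for example `can_contain("Attr", "Element")` evaluates to `False`.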
Document Header
The document starts with the specification block, followed by the XML document node. It has the following items:
XML/Document (0):
UBPointer PrologPointer
UBPointer ElementPointer
UBPointer MiscPointer
The first item, PrologPointer, links to the XML/Prolog block, which has the following items:
XML/Prolog (1):
UBPointer DeclarationPointer
UBPointer DocTypePointer
UBPointer MiscPointer
UBPointer MiscAfterDTPointer
All items are optional. MiscAfterDTPointer should be present only if the DocTypePointer is empty. DocTypePointer refers to a block of the ML/DocType type. DeclarationPointer refers to a block of the following type:
XML/Declaration (2):
XBVersion XMLVersion
UBPointer EncodingPointer
UBPointer StandalonePointer
EncodingPointer refers to a block of the “text/Encoding Type” type, and StandalonePointer to a block of the Boolean type.
Example: <?xml version="A.B" encoding="UTF-8"?>
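To make the relationship between the three header blocks easier to follow, here is a minimal Python sketch that models them as data classes. It is an illustration only: the class and field names are hypothetical, each UBPointer is modelled as an optional Python reference (with `None` standing for an empty pointer), the XMLVersion value is kept as plain text, and treating every item of XML/Document as optional is an assumption rather than something the proposal states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Declaration:                        # XML/Declaration (2)
    xml_version: str                      # XMLVersion, kept as text here (assumption)
    encoding: Optional[str] = None        # target of EncodingPointer
    standalone: Optional[bool] = None     # target of StandalonePointer


@dataclass
class Prolog:                             # XML/Prolog (1), all items optional
    declaration: Optional[Declaration] = None   # DeclarationPointer
    doctype: Optional[object] = None            # DocTypePointer
    misc: Optional[list] = None                 # MiscPointer
    misc_after_dt: Optional[list] = None        # MiscAfterDTPointer


@dataclass
class Document:                           # XML/Document (0)
    prolog: Optional[Prolog] = None       # PrologPointer
    element: Optional[object] = None      # ElementPointer
    misc: Optional[list] = None           # MiscPointer


# The declaration <?xml version="1.0" encoding="UTF-8"?> in this model:
example = Document(prolog=Prolog(declaration=Declaration("1.0", encoding="UTF-8")))
```

In this model the missing `standalone` attribute of the declaration simply stays `None`, mirroring an empty StandalonePointer.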
The XML/Misc (3) structure is of the List type. It may contain items of the XML/Comment (4) or XML/Processing Instruction (5) type.
An item of the XML/Comment type is a text string that must not contain two consecutive hyphens ("--"). A processing instruction has the following items:
XML/Processing Instruction (5):
UBPointer PITargetPointer
UBPointer PIStringPointer
PITargetPointer refers to an XML/PITarget (6) string, which must not be equal to "XML" regardless of letter case. PIStringPointer refers to an XML/PIString (7) string, which must not contain the character sequence "?>".
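The string constraints on comments and processing instructions are simple enough to check mechanically. The following Python sketch shows one way to do so; the function names are hypothetical and the checks cover only the restrictions named above (no two consecutive hyphens in a comment, no target equal to "xml" in any letter case, no "?>" inside the instruction string), not full XML name validation.

```python
def is_valid_comment(text: str) -> bool:
    """XML/Comment (4): the text must not contain two consecutive hyphens."""
    return "--" not in text


def is_valid_pi_target(target: str) -> bool:
    """XML/PITarget (6): must not equal "xml" in any letter case."""
    return target.lower() != "xml"


def is_valid_pi_string(data: str) -> bool:
    """XML/PIString (7): must not contain the terminator sequence "?>"."""
    return "?>" not in data
```

A converter would typically run these checks when translating from the binary form back to text, since a violating string would produce a document that no longer parses as XML.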
Document Tag
An XML element is represented by one of two basic block types. XML/Tag is an extension of the List type, with the following items:
XML/Tag (8):
UBPointer TagName
UBPointer AttributeListPointer
UBList Content
Items of the content list may be one of the following types:
- Text/StringList
- XML/Processing Instruction (5)
- XML/Comment (4)
- XML/CData (10)
XML/CData is a text string that must not contain the character sequence "]]>". Text data is converted during translation using XML references.
If an empty element (<a/>) needs to be distinguished from a non-empty element without content (<a></a>), the following block can be used:
XML/EmptyTag (9):
UBPointer AttributeName
UBPointer AttributesPointer
Tag's Attributes
Tag attributes can be expressed as an XML/Attribute List (11) containing individual XML/Attribute (12) blocks with the following items:
UBPointer AttributeName
UBPointer AttributeValue
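As an illustration of how a parsed element could be mapped onto the tag and attribute blocks, here is a minimal Python sketch built on the standard library `xml.etree.ElementTree` module. The classes `Tag` and `Attribute` and the function `to_tag` are hypothetical names. The sketch makes two simplifying assumptions: whitespace-only text is skipped, and comments and processing instructions are not covered because the default ElementTree parser discards them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Attribute:                                     # XML/Attribute (12)
    name: str                                        # AttributeName
    value: str                                       # AttributeValue


@dataclass
class Tag:                                           # XML/Tag (8)
    name: str                                        # TagName
    attributes: list = field(default_factory=list)   # XML/Attribute List (11)
    content: list = field(default_factory=list)      # Content: strings and child Tags


def to_tag(elem: ET.Element) -> Tag:
    """Convert an ElementTree element into the Tag model, keeping attribute order."""
    tag = Tag(elem.tag, [Attribute(k, v) for k, v in elem.attrib.items()])
    if elem.text and elem.text.strip():
        tag.content.append(elem.text)            # text before the first child
    for child in elem:
        tag.content.append(to_tag(child))        # nested element
        if child.tail and child.tail.strip():
            tag.content.append(child.tail)       # text following the child
    return tag
```

For example, `to_tag(ET.fromstring('<body class="main"><h1>Welcome!</h1></body>'))` yields a `Tag` named `body` with one attribute and one nested `h1` tag whose content is the string `Welcome!`.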
Related Formats
ML/DocType (1):
UBPointer NamePointer
UBPointer ExternalIDPointer
UBPointer InternalPointer
The Resulting Specification Format
Combining the groups and blocks described above gives a test specification for the format:
- 1: XML Document Group
  - 0: XML/Document
  - 1: XML/Prolog
  - 2: XML/Declaration
  - 3: XML/Misc
  - 4: XML/Comment
  - 5: XML/Processing Instruction
  - 6: XML/PITarget
  - 7: XML/PIString
  - 8: XML/Tag
  - 9: XML/EmptyTag
  - 10: XML/CData
  - 11: XML/Attribute List
  - 12: XML/Attribute
- 2: Text Blocks Group
- 3: SGML Group
An Example Document
Here is an example of converting a simple document into binary form.
Source XHTML document:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Web Page</title>
</head>
<body>
<h1>Welcome!</h1>
</body></html>
Total size of the source document: 317 bytes
Code | Description |
---|---|
FE 00 58 42 00 01 | File Header |
07 80 C2 00 00 03 01 02 | Specification Tag |
0A 00 00 08 05 63 00 00 00 00 01 | Link to format specification in catalog (example) |
06 80 B0 01 00 01 02 | XML/Document: Root tag of the XML document |
05 7A 01 01 01 02 | XML/Prolog |
06 05 01 02 01 00 01 | XML/Declaration |
04 00 02 01 00 | Text/Encoding: encoding value |
06 67 03 01 01 02 03 | SGML/DocType |
01 04[68 74 6D 6C] | Data: "html" |
01 26[2D 2F 2F 57 33 43 2F 2F 44 54 44 20 58 48 54 4D 4C 20 31 2E 30 20 54 72 61 6E 73 69 74 69 6F 6E 61 6C 2F 2F 45 4E] | Data: "-//W3C//DTD XHTML 1.0 Transitional//EN" |
01 37[68 74 74 70 3A 2F 2F 77 77 77 2E 77 33 2E 6F 72 67 2F 54 52 2F 78 68 74 6D 6C 31 2F 44 54 44 2F 78 68 74 6D 6C 31 2D 74 72 61 6E 73 69 74 69 6F 6E 61 6C 2E 64 74 64] | Data: "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" |
08 80 27 01 08 01 02 02 03 | XML/Tag: Root tag |
01 04[68 74 6D 6C] | Data: "html" |
05 4F 01 0B 03 01 | XML/Attribute List |
05 25 01 0C 01 02 | XML/Attribute |
01 05[78 6D 6C 6E 73] | Data: "xmlns" |
01 1C[68 74 74 70 3A 2F 2F 77 77 77 2E 77 33 2E 6F 72 67 2F 31 39 39 39 2F 78 68 74 6D 6C] | Data: "http://www.w3.org/1999/xhtml" |
05 0E 01 0C 01 02 | XML/Attribute |
01 08[78 6D 6C 3A 6C 61 6E 67] | Data: "xml:lang" |
01 02[65 6E] | Data: "en" |
05 0A 01 0C 01 02 | XML/Attribute |
01 04[6C 61 6E 67] | Data: "lang" |
01 02[65 6E] | Data: "en" |
07 17 01 08 01 00 01 02 | XML/Tag |
01 04[68 65 61 64] | Data: "head" |
07 09 01 08 01 00 01 02 | XML/Tag |
01 05[74 69 74 6C 65] | Data: "title" |
01 08[57 65 62 20 50 61 67 65] | Data: "Web Page" |
07 1C 01 08 01 00 01 02 | XML/Tag |
01 04[62 6F 64 79] | Data: "body" |
07 0E 01 08 01 00 01 02 | XML/Tag |
01 02[68 31] | Data: "h1" |
01 08[57 65 6C 63 6F 6D 65 21] | Data: "Welcome!" |
Total size of the binary form: 335 bytes
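The bracketed data rows in the table can be checked with a few lines of code. The Python sketch below is an illustration under stated assumptions: it reads the second byte of such a row as the payload length, which is consistent with every data row shown (for example 04 before the four bytes of "html"), leaves the first byte uninterpreted, and decodes the payload as UTF-8. The function name `decode_data_row` is hypothetical.

```python
import re

# Matches rows written as "01 04[68 74 6D 6C]" in the table above.
_ROW = re.compile(
    r"^\s*([0-9A-Fa-f]{2})\s+([0-9A-Fa-f]{2})\s*\[([0-9A-Fa-f\s]*)\]\s*$"
)


def decode_data_row(row: str) -> str:
    """Decode one bracketed data row and check its length byte."""
    match = _ROW.match(row)
    if match is None:
        raise ValueError(f"not a data row: {row!r}")
    declared = int(match.group(2), 16)        # second byte, read as payload length (assumption)
    payload = bytes.fromhex(match.group(3))   # the bytes inside the brackets
    if declared != len(payload):
        raise ValueError(f"length byte {declared} != payload size {len(payload)}")
    return payload.decode("utf-8")
```

Applied to the rows above, `decode_data_row("01 04[68 74 6D 6C]")` returns `"html"` and `decode_data_row("01 02[65 6E]")` returns `"en"`.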
Elements with Indexed Name
One possible optimization is to identify elements by numeric identifiers instead of text names.
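The proposal does not specify how such identifiers would be assigned. As one possible scheme, the following Python sketch allocates consecutive integers to element names in order of first appearance; the class `NameIndex` and its methods are hypothetical and serve only to illustrate the idea.

```python
class NameIndex:
    """Assigns small integer identifiers to element names on first use."""

    def __init__(self):
        self._ids = {}      # name -> identifier
        self._names = []    # identifier -> name

    def id_for(self, name: str) -> int:
        """Return the identifier for `name`, allocating a new one if needed."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, ident: int) -> str:
        """Return the element name registered under `ident`."""
        return self._names[ident]
```

With such an index, a repeated element name is stored as text only once and every further occurrence can refer to its number, which mainly pays off in documents with many elements of the same name.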