Concept: Number's Encoding
This document is part of the eXtensible Binary Universal Protocol project documentation. Provides description of used number's encoding types and their evolution.
Number's Encoding
Specific encoding of numbers is a fundamental basis of the entire protocol design, which makes it different from most of the current binary formats. It is a variant of gamma encoding, so far called a recursive unary-binary encoding. It is highly likely that this encoding has been previously described, and someone named it already.
Basic procedure is a combination of three techniques: Encoding on a fixed count of bits, the application of the indentation (recursion) and predefined value with its length.
To meet the basic requirement to allow any coding numbers, probably already the simplest unary encoding alpha would be sufficient. But this coding is totally unsatisfactory due to its wasting on space. On the other side as regards to the space-usage is the binary encoding beta which is coding individual bits as a value in exponent of two. This code, however, requires the definition of length and can not be directly used. The resulting code was therefore gradually constructed as a combinations of these two variants, and exclusion of inappropriate or impractical codes. From the number of suitable varieties there were selected those which seemed to be the most appropriate for the protocol needs.
Declared Conditions
The following requirements have been established which should comply with the proposed coding.
Declared Requirements:
- Coding of any large numbers - The basic requirement is to allow the expression of any large numbers. Can not be considered only big enough margin, since the format would lose its universality
- Coding into sequence of bits - The data are coded into a potentially endless sequence of bits. The sequence of two-state information mediators is selected because of its minimalism. Storing data into sequences of values with different ranges seems not to be proper data encoding and the choice of more than two-state data unit seems not having any logical justification
- Compatibility with unary encoding - Part of the Protocol should be originally a variant, providing a storage by using the sequence of bits 1, ending with 0 bit, therefore, equivalent of unary encoding alpha. I thought that the coding is so basic that it should be appropriate to allow the protocol to use it. Its negative characteristics, especially a linear relationship to the length of the value code is rather unpleasant. However, it seems appropriate to request on final coding, to include unary coding in some variant
- Efficiency - The aim is to get as close as possible to the binary representation of numbers, namely unary coding itself is very inefficient. Likewise, other encoding types, such as pure unary-binary encoding (gamma) or Fibonachi's coding, are still insufficiently effective. Recursive unary-binary encoding seems to be the best option at the moment
- Determinism (Efficiency) - Although a unifies expression of numbers is not a necessity, it is appropriate with regard to algorithms, to enforce the only oen possible variant of the representation of the value. It is also necessary with regard to the above-mentioned efficiency of the encoding. While this speaks against some of the benefits that have the redundant coding, but I decided to comply with this requirement
- Processibility (Efficiency) - This requirement seeks to force the simplest data processing and converts several issues. It should be, for example, necessary to use the least count of instructions for encoding like minimum of binary shift while reading and writing numbers, or the requirement for compactness of numbers - their own value is made up of a continuous sequence of bits
- Organization of data in bit clusters (Efficiency) - This requirement is introduced because of the the organization of bits as the bytes in the processing on current computers. The numbers should be encrypted preferably with a length of exponent with the base of 2 instead of the sequence of bits into bit groups. In the meantime, the organization using eight bits (octets) will be used, which is known bytes
Unary Encoding
Unary encoding seems to be as old as the human king itself (how many fingers, the value of many bars on the wall of the cave, so many killed Mammoths . The most known variant of the encoding in an infinite bit stream is the alpha encoding, which bears the same number of bits ending with the opposite bit. This encoding does not include natural numbers only without 0, which is needed, therefore similar encoding with the initial value of 0 is used. This sequence can be interpreted in other ways as well as, for example: The numbers belong to the language (10), providing an algebra of whole numbers with a constant (0) and successor operations (1), or as a binary code of single values stop with block of value 0. Bits are read from the input stream and when the bit 1 is received, the successor is used, otherwise the reading will stop and return the number of bits of value 1. The numbers in this format should be for their own processing data files transferred to the internal computer memory, implemented in the form, which can be processed more efficiently, for example, in binary encoding, which is described in the next paragraph.
Binary Encoding
The most well-known binary encoding the beta encoding which is intended for encoding data on the final sequence of bits only and not for encoding into a potentially infinite binary sequences, as it require additional value for the length. That can be realized, for example, as the third symbol, just like in Morse alphabet for the sequence of dots and dashes separated with a space. Each additional bit is another digit, and brings the increase in the value of the exponent based on 2 to the final result. For these reasons, binary encoding could not be used directly and had to be some way extended up to the encoding of length. So far, the the most appropriate solution seemed to be the combination of the unary and binary encoding.
Growing Binary Codes
One option is to use a gradually increasing binary codes stop by block indentation. Example with a linear increase in the length:
0 = 0 100 = 1 101 = 2 110 = 3 111000 = 4 111001 = 5 111010 = 6 111011 = 7 111100 = 8 111101 = 9 111110 = 10 1111110000 = 11 ...
The length of the code can grow even faster, for example, exponentially, and may be used for other basic length and, where appropriate, it does not have to happen to increase the length.
Interlaced Encoding
This variant of interlaced unary-binary encoding is an example of inadequate implementation. It works so that if the highest bit of the byte is 1, code is extending the value of the next byte. The basic form is a single number, while the highest value (first) bit is 0. If the highest bit is 1, code is extending the total length of the number of additional byte, while with every new byte the same rule re-applies. If the binary encoding expand up one bit, this encoding known as the gamma coding.
The sequence of values is as follows (binary form = range of possible values):
0xxxxxxx = 0..7Fh 1xxxxxxx 0xxxxxxx = 0..3FFFh 1xxxxxxx 1xxxxxxx 0xxxxxxx = 0..1FFFFFh 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx = 0..0FFFFFFFh ...
Binary encoded value can be either redundantly interpreted or non-redundantly and with zero width of binary code this corresponds to unary encoding. Although this method may seem appropriate at first glance, it has several disadvantages in fact:
- Skipping a numbers require reading the entire sequence (processibility)
- It is necessary to choose code endianity between Big and Little Endian (determinizm)
- There must be performed a lot of bit shifts for conversion, as is the value is not continuous (processibility)
- The length of the required space for allocation can be determined only by reading all the bytes (processibility)
Continuous Encoding
This encoding is divided into two parts compact, unary codified length and binary codified own value. Length must be introduced first, so the code is then prefix code and thus decodable in the same order as it is read. This encoding has the same basic single-byte form, where the highest bit is 0, and other bits specify the number. If the highest bit is 1, the number is expanding of additional bit of the unarycode and a group of binary coded bits. The number of these bits may be in a constant and every step and we can introduce constant for it named ClusterSize. Usually the ClusterSize = 7 will be used, as we have done in the previous case. Perhaps it will better to understand this principle from the example of a sequence.
Example (binary form = range of possible values):
0xxxxxxx = 0..7Fh 10xxxxxx xxxxxxxx = 0..3FFFh 110xxxxx xxxxxxxx xxxxxxxx = 0..1FFFFFh 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx = 0..0FFFFFFFh ... 11111110 xxxxxxxx .. xxxxxxxx = 0..1FFFFFFFFFFFFh 11111111 0xxxxxxx .. xxxxxxxx = 0..0FFFFFFFFFFFFFFh ...
This variant appeared to be much more acceptable, both due to numbers skip operation, and also to the recognizing the space needed for the allocation of the value. Also there is no need to carry out as many arithmetic shifts to the number value. Just to remove all the bits until the first zero the numbers is needed. This method of encoding is used, for example, in UTF-8, which is used to store characters in the Unicode Standard, or even in various projects introducing binary encoding of XML. The requirement on processibility/efficiency essentially enforces the order of bytes from the low post, the so-called Big Endian, because the amount of arithmetic shifts falls off.
That reserving bits for a record of the length, therefore this coding is of course, less effective than beta coding, but it is really binary. This loss can be expressed as, for example, the number of unused bits per byte. In both variants is the loss 1 bit per byte, that is, for better understanding, from each interval which could be implemented 1/(2^length in bytes/words) is lost. Specifically, it is in this case 1/2, which means that instead of the interval 0..0FFh which is possible to store into single byte, only half of it would be possible, therefore 0..7Fh.
Another possibility is to choose a higher increase ration than of the binary expansion of the number of constant length, such as exponential. However, this should lead to increased waste.
Redundant and Non-redundant Variants
In both previous variants, the binary encoding could be interpreted as a number directly as binary code without further adjustments, which, however, creates a certain ambiguity - redundancy. For example, we can store the value 1 as infinitely many forms of different lengths.
Forms of number 1:
00000001 = 1 10000000 00000001 = 1 11000000 00000000 00000001 = 1 ...
This redundancy has probably also some advantages. Especially with such way encoded numbers it works much easier, and you can use this redundancy in the file in advance to reserve space for a greater range of values, which may be subject of the restrictions on access to the file when enlarging or reducing the value. For example, when storing number 200 instead of other bigger, it does not necessarily result in a change in the size of the entire file. On the other hand, this redundancy increase the file size without increasing the total value of the information.
When the redundancy is allowed, it could be coded as other types of values, such as the integer numbers, real numbers and other, as follows:
Different types are interpreted as follows (type - description):
Natural - Is the natural number, or 0 (non-negative whole number) at a designated number of bits
Integer - The most left bit (highest) is a sign. 0 - positive, 1 - negative and because of the redundacy it is not important, whether they are negative numbers in normal, inverse, or in complementary binary code
Real - Real numbers are represent as two numbers of the Integer type. The first is the base and the other is mantissa, which indicates the index of the binary, or another shift
EReal - The real number with specific constants for +- ∞ infinity:
00111111 00000000 = +infinity 01000000 00000000 = -infinity
Since the value corresponding to these configurations may be stored in the different form with different length, these constants can be fixed without the problem.
MultiBit - Array of bits indexed from right from 0
The requirement for workability rather implies the use of non-redundant form of the encoding. Value of each type will then have a clear expression and if it gets off the value of restricted interval the entire file must be rebuild, which can lead to substantial overhead and requires additional techniques. Another disadvantage is more complex conversion of numbers.
Large loss rates of the number's capacity, which is due to the usege of the unary representation of the length of main values, can be partly to reduced using the recursion.
Recursive Encoding
The most recent proposed variant is a simple change to recursive non-interleaved unary binary encoding. The number is encoded pair of Length and Value, where the value of Length is a encoded almost same way as a whole number. The easiest way it to use directly the nonredundant deterministic variant, which is uniquely decodable encoding. There is again the constant ClusterSize which affects the encoding and determines the way how the Value capacity increase depending on the length.
First is the value Length, which is a natural number, including 0. It is coded as a single bit indicating recursion, and if the value of that bit is 0 then the value is 0 or in the opposite case that this bit is 1, other bits are interpreted recursively as additional value in the same recursive encoding and the resulting value is increased by the 1.
Value of Value is binary encoded on a bit sequences depending on the Length as follows:
If Length < ClusterSize + 1 then length of Value is (Length+1)*ClusterSize else Value has length Length*(ClusterSize+1)
As it is now possible to see if it is constant ClusterSize = -1, the entire encoding becomes the standard unary coding, which can be seen from the context of recursion and algebra with the successor referred earlier. In other cases it determines the size of the clusters as ClusterSize + 1 that extended the length of the value. This is useful for the alignment of the bits n-tuples, such as byte octets (eight bits), with a corresponding value ClusterSize = 7.
Best to show it on another example.
Example:
0xxxxxxx = 0..7Fh 10xxxxxx xxxxxxxx = 0..3FFFh 110xxxxx xxxxxxxx xxxxxxxx = 0..1FFFFFh 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx = 0..0FFFFFFFh ... 11111110 xxxxxxxx .. xxxxxxxx = 0..0FFFFFFFFFFFFFFh \_____ 7 times ____/ 11111111 00000000 xxxxxxxx .. xxxxxxxx = 0..0FFFFFFFFFFFFFFFFh \_more_/ \_____ 8 times ____/ 11111111 00000001 xxxxxxxx .. xxxxxxxx = 0..0FFFFFFFFFFFFFFFFFFh \_more_/ \_____ 9 times ____/ ...
Generally, the prefix is 0FFh:
11111111 N xxxxxxxx .. xxxxxxxx = 0..2^( 8*(N + 8)) \____ n+8 times ___/
where N is the number of additional recursive type.
It would seem that this variant is only unnecessary complication compare to unary binary encoding and benefits are only for very large numbers, however from other view, it makes sense, especially for small size of ClusterSize, or for storing large numbers for reasons of accuracy.
The following table shows the loss of information in one cluster compared to a pure binary encoding for storing of the same number (1 - (Divide of the max, the value of a binary recursive form to pure binary on given bits)/ the count of clusters), for example for ClusterSize = 7.
Losses list (number of bytes expressing main number = amount of lost information on the byte):
1..7 = 0,500000 8 = 0,670123 9 = 0,635129 10 = 0,603149 .. 14 = 0,500000 15 = 0,478801 .. 127+8 = 0,077761 127+9 = 0,112795 127+10 = 0,112037 ...
Loss gradually converge to zero, unlike to unary-binary encoding, where is constantly 0.5, the convergence of the binary code is therefore higher. The highest loss (Limes superior) reaches around 0.67.
For the use of recursion refers the efficiency only, although there are more faster converging encodings, this can in some sense be regarded as the basic recursion. To use this variant speaks requirement of effectiveness. Recursive call themselves generally bear the higher overhead of storing variables, but in this case, it is possible to use a simple iteration. Against speaks the higher complexity of implementation and nonlinearity of the increase of the length of the code.
This encoding was set for the XBUP protocol as a basic. The structure of the encoding requires determinizm and enforce the order of bit interpretation of binary encoding from the highest down to the unit. Similarly, as in previous variants, there is also enforced the endianity of the encoding by requirement for effectiveness. This encoding will remain in the document referred to as coding UBNumber.
A table of recursive codes is included for illustration for ClusterSize = 0, at a minimum “nonunary” option.
Code and the corresponding value
0 = 0 10 0 = 1 10 1 = 2 110 0 00 = 3 110 0 01 = 4 110 0 10 = 5 110 0 11 = 6 110 1 000 = 7 110 1 001 = 8 ... 110 1 111 = 14 1110 0 00 0000 = 15 ... 1110 0 00 1111 = 30 1110 0 01 00000 = 31 ...
Non-linear Recursive Encoding
One possible adjustment of the recursive encoding is possible change of the growth rate of the length of their own values. In the above scenario the value length is increasing linearly. One of the possible disadvantages is that in the case of the recursion transition while increasing, the total length of the value will be much larger than for numbers using base unit (ClusterSize + 1). Growth steps by unit could be achieved with limiting the growth, but on the other hand, it would lead to more complex calculating, but primarily to the failure of coding variant for ClusterSize = 0.
Another possible adjustment of the acceleration of growth for example on quadratic expansion, or exponentially. However, this could lead to more waste of space without any visible benefits.
Modified Recursive Encoding LRUB
The following modification (Linear-prolonging Recursive Unary-Binary encoding) is offered as a potential replacement for simple recursive encoding in the following versions. This is a slight modification at which in some specific cases, growth rate of binary numbers is reduced to achieve a purely linear increase in the length of the entire code. Encoding thus maintains both recursivity and determinizm while at the same time it solves some of the problems which may occur with the use of purely recursive code.
- Variant of the encoding with value ClusterSize = 0 is compatible with unary coding
- Reduces the step of increase of the length of the value - see. further text
Here is an example:
00 = 0 01 = 1 10 00 = 2 10 01 = 3 10 10 = 4 10 11 = 5 11 00 00 = 6 (limited growth) 11 00 01 = 7 11 00 10 = 8 11 00 11 = 9 11 01 00 00 = 10 ... 11 01 11 11 = 25 11 10 00 00 00 = 26 11 10 00 00 01 = 27 ...
Corrected Recursive Encoding SLRUB
SLRUB Encoding (Size-corrected Linear-prolonging Unary-Binary Encoding) is the adjustment of prior encoding, so that when we want to define the area through its length, we can define the size of any large area. In this case the interpretation of the value of the encoding is becoming redundant. This variant is more likely to be introduced as a type of LRUB number (for purely recursive coding is the implementation not appropriate).
What will be the interpretation of the same example:
00 = 0 01 = 1 10 00 = 1 10 01 = 2 10 10 = 3 10 11 = 4 11 00 00 = 4 11 00 01 = 5 11 00 10 = 6 11 00 11 = 7 11 01 00 00 = 7 ... 11 01 11 11 = 23 11 10 00 00 00 = 23 11 10 00 00 01 = 24 ...
UBNumber Encoding Forms
Unlike the definition of the number's encoding the encoding of other types of numbers is more or less automatic matter. Like in the previous case, also now we want to encode the whole, real, and other numbers. Additionally the condition that the ClusterSize>= 0 was set, sequence of bits 0 must always reflect the true value of 0. More precise specifications and examples of conversion algorithms can be found in the protocol specification. The following encodings for different types of numbers are derived from UBNumber encoding.
Note: Conversion algorithms for recursive encoding variant without linear limitation will be fixed later.
UBNatural Encoding
For encoding of natural numbers with zero, or the Omega Set, there is the coding, which is direct derivation of UBNumber. It allows to encode a natural number of any value.
Examples of the type UBNatural:
00000000 = 0 00000001 = 1 ... 01111111 = 7Fh 10000000 00000000 = 80h 10000000 00000001 = 81h ... 10111111 11111111 = 407Fh 11000000 00000000 00000000 = 4080h ...
Value encoding (value >> data stream)
Value := Hodnota Length := 0 while Length <= ByteSize+1 and Value>=2^(ByteSize*(Length+1)) do ( Length := Length + 1 Value := Value - 2^(ByteSize*(Length+1)) ) if Length = 8 the ( ExtLength := 0 while Value>=2^((ByteSize+1)*(ExtLength+ByteSize+1)) do ( ExtLength := ExtLength + 1 Value := Value - 2^((ByteSize+1)*(ExtLength+ByteSize+1)) ) write prefix write ExtLength ) else write unary Length write binary Value
Value decoding (data stream >> value)
read prefix or Length if prefix then ( Length := ByteSize + 1 read ExtLength ) read Value Result := Value if prefix then ( for each I from interval <1..ExtLength> Result := Result + (2^((I+ByteSize) * (ByteSize+1))) ) for each I from interval <1..Length> Result := Result + (2^(I * ByteSize))
UBENatural Encoding
This encoding is designed for natural numbers with a special constant for the infinity. It will be used primarily later in the block structure to represent the length and infinity there will express the unknown, potentially infinite length. For the realization is the value of infinity placed on the highest value in the basic form. Another option was to put infinity to 0, but that would conflict with the previously established rule of static zero.
Examples of the type UBENatural
... 0111110 = 7Eh 0111111 = ∞ 1000000 00000000 = 7Fh 1000000 00000001 = 80h ..
Conversion of the UBNatural value
if Value>(2^ByteSize) then Value := Value - 1 else if Value=(2^ByteSize)-1 then Value := ∞
UBInteger Encoding
This encoding is used to store integer numbers. Binary part is interpreted as a binary code in the two-complementary code with the highest bit as the sign. Probably another way of coding can not be used because of efficiency, moreover, the sign must be shown first, because it is more important than the value (affects processing).
Examples of the UBInteger type values
11011111 11111111 11111111 = -2041h 10100000 00000000 = -2040h ... 10111111 11111110 = -42h 10111111 11111111 = -41h 01000000 = -40h ... 01111110 = -2 01111111 = -1 00000000 = 0 00000001 = 1 ... 00111111 = 3Fh 10000000 00000000 = 40h 10000000 00000001 = 41h ... 10011111 11111111 = 203Fh 11000000 00000000 00000000 = 2040h ...
Value encoding
Value := Abs(Input) if Input<0 then Value := Value + 1 Length := 0 while Length <= ByteSize+1 and Value>=2^(ByteSize*(Length+1)) do ( Length := Length + 1 Value := Value - 2^(ByteSize*Length-1) ) if Length = 8 then ( ExtLength := 0 while Value>=2^((ByteSize+1)*(ExtLength+ByteSize+1)) do ( ExtLength := ExtLength + 1 Value := Value - 2^((ByteSize+1)*(ExtLength+ByteSize+1)) ) write prefix write ExtLength ) else write unary Length if Input<0 then Value := Neg(Value) write binary Value
Where the Neg operation is the inversion of the lowests ByteSize (Length +1) bits of the value Value (including sign) and Abs is a function of absolute value.
Value decoding
read prefix or Length
if prefix then (
Length := ByteSize + 1
read ExtLength
)
read Value
Result := Value
if prefix then BitLength := ByteSize*(ExtLength+8)+7 jinak BitLength := Length*ByteSize+(7-Length)
if (Value and 2^(BitLength))>0 then ( // Test on negative sign
Result := ( - Neg(Result) ) - 1
For each integer I from interval <1..Length> Result := Result - (2^(I*ByteSize-1))
if prefix then (
for each I from interval <1..ExtLength> Result := Result - (2^((I+ByteSize) * (ByteSize+1)))
)
) else (
For each I integer from interval <1..Length> Result := Result + (2^(I*ByteSize-1))
if prefix then (
for each I from interval <1..ExtLength> Result := Result + (2^((I+ByteSize) * (ByteSize+1)))
)
UBEInteger Encoding
Like in the case UBENatural this is an extension of the number of constants for infinity, this time it includes negative one. Placed the similar way.
Examples of the type UBEInteger
... 10111111 11111111 = -40h 01000000 = -∞ 01000001 = -3fh ... 00111110 = 3Eh 00111111 = ∞ 10000000 00000000 = 3Fh ...
Conversion of the UBInteger value
if Value>(2^(ByteSize-1)) then Value := Value - 1 else if Value<(2^(ByteSize-1))+1 then Value := Value + 1 else if Value=(2^(ByteSize-1))-1 then Value := ∞ else if Value=-(2^(ByteSize-1)) then Value := -∞
UBReal Encoding
The aim of this encoding is to enable the storage of real numbers. Because the real numbers are uncountable, it was necessary to limit on a subset. Chosen limitation was the final development of the binary form, because the endless numbers can not be fitted into finite memory. As a suitable solution seems to be a encoding as a pair of UBInteger type numbers. The first one defines the base and the other mantis. Base bears main value and mantissa determines how much the value is shifted. For example, for the Intel processor architecture are real numbers realized by the IEEE 754 standard, which is used to eliminate redundancy using the invisible one method. Also in this encoding that it is not possible to allow redundancy, in order not to violate the rule of determinism and it is necessary to find a way to solve such type of ambiguity. For example, one way is to try to use the similar method of the invisible one. It is possible to choose between placing the one before the highest, or at the lowest bit of the Value. There is also the possibility of multiple interpretations of the mantis. So first the technique of placing “invisible” one before the highest bit of the base should be described. When interpreting the number it is essential for any negation of the negative number, to add before valid bits also one bit with value 1. Already at this stage it is possible to recognize some problems with the point skipping. Another problem is the right interpretation of the decimal point shift. Usual decimal exponent will not be proper solution here so instead will be better to choose “half” (binary) point. Lets list three ways of placing point depending on the location of the invisible one before the highest bit:
- point interpretation to lowest bit
00000000 00000000 = 64 [(1)000000.]
point interpretation to highest bit
00000000 00000000 = 1 [(1).000000]
- point interpretation before invisible one
00000000 00000000 = 0.5 [.(1)000000]
Most preferable way seems to be to put the point to the highest bit, but there related problems with the mantis conversion, which is depending on the length of the base value. However, those cannot be avoid even with the placement of the point at the end of the code, which would force to somehow shift the mantis. The third method is basically similar unsuitable. Fortunately, we can still use less unusual option to add the invisible one at the end of the number. Opět si uvedeme způsoby interpretace tečky: Again, lets include different ways of point interpretation:
- point interpretation to highest bit
00000000 00000000 = 0.015625 [.000000(1)]
- point interpretation to lowest bit
00000000 00000000 = 0.5 [000000.(1)]
- point interpretation before invisible one
00000000 00000000 = 1 [000000(1).]
Finally, the third option was selected, since it appeared to be the best. Only the problem with the interpretation of the value 0 remains to resolve, with easiest way to shift it out. That approach brings together the following algorithm:
if (X=0 and Y=0) the Value := 0 else ( Value := (X*2+1)*(2^Y) if (X>0 and Y=0) then Value := Value - 2 )
Encoding numbers seem to be more technically difficult. It requires the division by the number two, until the remaining part is 1, or multiplication until number's fraction is less or equal equal to 1/2. The resulting whole number is then placed in the base and count of the performed division or negation of the count of performed multiplication is then placed as mantis. This operation should be optimized in the future as much as possible (could be added to as a processor integrated instruction, or two :-)).
The advantages of the selected form of encoding include, for example, the mantissa rate determines whether there is a whole number, or number, which has nonzero fraction. Likewise, it is possible to determine the parity of the number from the parity of the base. So, if instead of the both numbers only one part is included, it can still encode some useful value. If, for example, only base is included, the mantissa is zero and odd numbers are encoded plus code for zero, which is not so useful as a second case when there is only mantissa so binary exponents with the exception of 1 replaced by value 0 are encoded.
Some numbers with endless development will be possible to store using higher mathematical algorithms and blocks, such as using recursion. Binary form of numbers is simply the best for the bit stream data.
Examples of the type UBReal
- numbers with zero mantissa:
... 10111111 11111111 00000000 = -81h 01000000 00000000 = -7Fh 01000001 00000000 = -7Dh ... 01111110 00000000 = -3 01111111 00000000 = -1 00000000 00000000 = 0 (1) 00000001 00000000 = 1 (3) 00000010 00000000 = 3 (5) ... 00111111 00000000 = 7Dh (7Fh) 10000000 00000000 00000000 = 7Fh (81h) ...
- other numbers with different mantisses:
01111111 00000001 = -2 00000000 00000001 = 2 00000001 00000001 = 6 00000010 00000001 = 10 00000000 00000010 = 4 00000000 00000011 = 8 00000000 01111111 = 0.5 00000001 01111111 = 1.5
Value encoding
Base := 0 Mantissa := 0 if not Value=0 then ( if Frac(Value)=0 then ( while not Frac(Value)=0.5 do ( Mantissa := Mantissa + 1 Base := Base / 2 ) ) else ( while not Frac(Value)=0.5 do ( Mantissa := Mantissa - 1 Base := Base * 2 ) ) ) write UBInteger Base write UBInteger Mantissa
Value decoding
Base := read UBInteger Mantissa := read UBInteger if (Base=0 and Mantissa=0) then Result := 0 else ( Result := (Base*2+1)*(2^Mantissa) if (Base>0 and Mantissa=0) then Result := Result - 2 )
UBReal as a Single Number
Real number with the final decimal development is likely to be possible to encode into a single UBNumber value, as it is equally possible to effectively encode any final final seqeunce of integer numbers into a single natural number. For or against the use of encoding there has not be found the appropriate arguments.
One possible solution is the diagonal method:
(0,0) (0,1) (0,2) (0,3) / / / / / / (1,0) (1,1) (1,2) (1,3) / / / / / / (2,0) (2,1) (2,2) (2,3) / / / / / / (3,0) (3,1) (3,2) (3,3) ... (0,0) - (1,0) - (0,1) - (2,0) - (1,1) - (0,2) - ...
UBEReal Encoding
Again, another extension is this time to add constants for the +-infinity to real number. Basically, the mantis with a base value of 0 is interpreted as UBEInteger type. This corresponds to constants:
00111111 00000000 = +infinity 01000000 00000000 = -infinity
Examples of the UBEReal type
10111111 11111111 00000000 = -7Fh 01000000 00000000 = -∞ 01000001 00000000 = -7Dh ... 00111110 00000000 = 7Bh 00111111 00000000 = ∞ 10000000 00000000 00000000 = 7Dh ...
UBReal value conversion
if (Base=0 and Mantissa=0) then Value := 0 else ( if (Base=(2^ByteSize)-1 and Mantissa=0) then Value := ∞ else if (Base=-(2^ByteSize) and Mantissa=0) then Value := -∞ else ( if (Mantissa=0) then ( if (Base>0) then Value := Value - 2 if (Base<(2^ByteSize)-1) then Value := Value - 2 else if (Base>-(2^ByteSize)) then Value := Value - 2 ) ) )
UBRatio Encoding
So far the last approved type is the type of encoding real numbers in the interval <0,1> with the finite binary development. This set is already countable large, and it is possible to encode it as a UBNumber value. Value is essentially stored as a count of divided of half cut to a depth intervals. This type can be used to store real numbers for any constant intervals of equal spread. Likewise, for saving intensity and other physical quantities.
Encoding with Reversion
Základní úvaha vycházela z toho, že bude nutné kódovat posloupnost čísel s postupně prostoucím rozvojem. První hodnoty musí být samozřejmě 0 a 1, dále je předpokládáno pořadí postupně na poloviny dělených zlomků bez opakování, tedy 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16 atd. Nakonec však bylo požadováno pouze to, aby byly skupiny zlomků se stejným zalomením v pořadí za sebou a bez opakování a použito bylo přidání tečky za celé číslo a otočení pořadí bitů, takže místo do vyšších hodnot porostou do vyšší přesnosti. Tento algoritmus vyžaduje použití netriviální operace reverze pořadí bitů.
Příklad převodu
0010111 =>(-1) 0010110 =>(reverse order) 0110100 =>("0." add) 0.0110100 =>(calculate) 0.25 + 0.125 + 0,03125 = 0,40625
Uveďme si příklady několika prvních hodnot ve formátu UBRatio:
kód, binární tvar skutečné hodnoty, zlomkový tvar
00000000 0 = 0 00000001 1 = 1 00000010 0.1 = 1/2 00000011 0.01 = 1/4 00000100 0.11 = 3/4 00000101 0.001 = 1/8 00000110 0.101 = 5/8 00000111 0.011 = 3/8 00001000 0.111 = 7/8 00001001 0.0001 = 1/16 00001010 0.1001 = 9/16 00001011 0.0101 = 5/16 ...
Dekódování hodnoty
Přečti UBNatural Value Je-li (Value<2) potom ( Hodnota := Value ) jinak ( Hodnota := 0 Value := Value - 1 Pomocná := 1 Dokud platí (Value>0) proveď ( pokud platí (Value and 1) potom Hodnota := Hodnota + 1/Pomocná Pomocná := Pomocná shl 1 Value := Value shr 1 ) )
Kódování hodnoty
Je-li (Hodnota=0 nebo Hodnota=1) potom ( Value := Hodnota ) jinak ( Value := 0 Pomocná := 0 Dokud platí (Hodnota>0) proveď ( Hodnota := Hodnota * 2 Je-li (Hodnota > 1) potom ( Value := Value + Pomocná Hodnota := Hodnota - 1 ) Pomocná := Pomocná shl 1 Value := Value + 1 ) ) Zapiš UBNatural Value
Direct Encoding
There is, fortunately, available a variant of the encoding without the reversing the order of bits which uses a simple shift. It also uses more natural order of fractions.
Example conversion
0010111 =>(-1) 0010110 =>(*2+1) 00101101 =>(shift)001.01101 =>(-1) 0.01101 =>(calculate) 0.25 + 0.125 + 0,03125 = 0,40625 = 13/32
Example values of the UBRatio type
00000000 0 = 0 00000001 1 = 1 00000010 0.1 = 1/2 00000011 0.01 = 1/4 00000100 0.11 = 3/4 00000101 0.001 = 1/8 00000110 0.011 = 3/8 00000111 0.101 = 5/8 00001000 0.111 = 7/8 00001001 0.0001 = 1/16 00001010 0.0011 = 3/16 00001011 0.0101 = 5/16 ...
Value decoding
Value := Input if not (Value=0 or Value=1) then ( Value := Value + 1 while (Value = Trunc(Value)) do ( Value := Value * 2) Value := Trunc(Value/2) + 1 ) write UBNatural Value
Value encoding
Value := read UBNatural if (Value<2) then ( Result := Value ) else ( Result := Value*2 - 1 while (Result>2) do ( Result := Result / 2 ) Result := Result - 1 ) )
Other Considered Encodings
In addition to the above there were considered some additional encoding of some frequently used types.
UBNReal - nonnegative real number - the base is interpreted as UBENatural/UBNatural
UBCReal - integer number with development - mantissa is interpreted as UBNatural
UBComplex - compex numberk - two real numbers
UBKvat - quaternion - four real numbers, but no practical use
UBFrag - fractions
UBBits - array of logical values - the number is interpreted as a field of 0/1, and it might make sense to define it if the format will be redundant. Thus it will be realized as data block
Any additional encodings might be described in other parts of the documentation.
Argumentation of the Correctness
Lets try to summarize why it is the chosen encoding UBNumber the most appropriate (perhaps I will be possile to reformulate these arguments to some form of proof later).
- It allows to encode any big final integer or real number
- It's unambiguously decodable (deterministic, block-chunked, prefix)
- It can be align to bitové clusters of variable size
- Efficiency is asymptotically approaching binary encoding
- It is relatively well implementable (low conversion complexity)
Since it is possible to encode a sequence of integers as a single whole number, it will be useful to prove that such operations is inappropriate.
Open Questions
Especially the following questions have not been sufficiently argumented so far:
- Placement for constants for infinity - Still variant of the locations for constants for infinity in the extended types to zero should be considered. At the moment I am even more in favor of this option
- The use of recursive variation - The argument for the use of recursion based on efficiency does not appear to be too strong. Nonrecursive option on the other hand, has advantages in its simplicity. In addition, LRUB encoding is difficult on correction. Yet still recursion seems to be beneficial in more efficient use of small value of ClusterSize constant
- UBReal type encoding - It is still not clear whether it is preferable to use two or single-valued variant of this encoding. Single value advantage is the ability to use it as only a single attribute, but at a price of less analyzable code, and with lower efficiency of extraction
Links
The list of resources, literature and relevant links:
Inverse Form https://en.wikipedia.org/wiki/Ones%27_complement
Two's Complement Form https://en.wikipedia.org/wiki/Two%27s_complement
RFC - Request For Comment https://www.ietf.org/rfc.html
MIME - Multipurpose Internet Mail Extensions https://en.wikipedia.org/wiki/MIME
UNICODE - The character encoding system https://www.unicode.org
UCS - https://www.unicode.org
UTF - Unicode Transformation Format https://www.unicode.org
INTEL - Intel Corporation (C) https://www.intel.com
IEEE - Institute of Electrical and Electronics Engineers, Inc. https://www.ieee.com
IEEE - IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) https://en.wikipedia.org/wiki/IEEE754
ISO - International Standartization Organization https://www.iso.ch
ANSI - American National Standartization Organization https://www.ansi.org
Alpha Coding - Alpha Unary Number Coding https://en.wikipedia.org/wiki/Unary_coding
Beta Coding - Beta Binary Coding https://en.wikipedia.org/wiki/Binary_number
Gama Coding - Elias Gamma Number Coding https://en.wikipedia.org/wiki/Elias_Gamma_coding
Convergence - Mathematical Convergence https://en.wikipedia.org/wiki/Convergence
Limes Superior - Mathematical Limes Superior https://www.math.uni-sb.de/~ag-wittstock/lehre/WS00/analysis1/Vorlesung/node62.html
Binary XML - XML Binary Characterization https://www.w3.org/XML/Binary/
Page Source