Is there a base64 encoding for numbers that works like base10 or base2?

In base2 (binary), the characters to represent each digit are 01. 0 being the first character of the base2 alphabet, you can prefix any base2 number with as many 0 as you want without changing the meaning of the number.
All of these are equivalent:
11
011
0011
00011
In base10 (decimal), the characters to represent each digit are 0123456789. 0 being the first character of the base10 alphabet, you can prefix any base10 number with as many 0 as you want without changing the meaning of the number.
All of these are equivalent:
3
03
003
0003
In a hypothetical base64, let's assume the characters to represent each digit are ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. A being the first character of the base64 alphabet, you should be able to prefix any base64 number with as many A as you want without changing the meaning of the number.
All of these would be equivalent:
5+fn
A5+fn
AA5+fn
AAA5+fn
I understand that base64 does not work this way because it was not intended to encode numbers but any binary data.
Is there a formal RFC documenting this hypothetical base64 encoding? Are there any implementations in some programming languages?
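As far as I know, RFC 4648 only covers the byte-oriented base64 and does not describe this positional scheme, so the following is just a minimal sketch of the hypothetical encoding from the question. The names encode_number and decode_number are made up for illustration, and only non-negative integers are handled.

```python
# Alphabet taken from the question; "A" plays the role of the digit zero.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_number(n: int) -> str:
    """Write a non-negative integer in positional base 64 (hypothetical scheme)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 64)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_number(s: str) -> int:
    """Read a positional base-64 string; leading 'A' digits contribute nothing."""
    n = 0
    for ch in s:
        n = n * 64 + VALUE[ch]
    return n


# Leading "A" characters behave like leading zeros in base 10.
print(decode_number("5+fn"), decode_number("AAA5+fn"))
```

Both calls print the same integer, which is the property the question asks for; the standard base64 of binary data does not have it because the number of characters also encodes the number of bytes.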

Related

How do base64 encode and decode figure out that the last few zeros are mere padding?

https://learn.microsoft.com/en-us/dotnet/api/system.convert.tobase64string?view=net-5.0
It says
If an integral number of 3-byte groups does not exist, the remaining
bytes are effectively padded with zeros to form a complete group. In
this example, the value of the last byte is hexadecimal FF. The first
6 bits are equal to decimal 63, which corresponds to the base-64 digit
"/" at the end of the output, and the next 2 bits are padded with
zeros to yield decimal 48, which corresponds to the base-64 digit,
"w". The last two 6-bit values are padding and correspond to the
valueless padding character, "=".
Now, imagine that the byte array I send is
0
So, only one byte, namely 0.
That one byte will be padded on the right into 000, right?
So now, we will have something like 0=== as the encoding, because it takes 4 characters in base64 to encode 3 bytes.
Now we are going to decode that.
How do we know that the original byte isn't 00, or 000, but just 0?
I must be missing something here.
"So now, we will have something like 0=== as the encoding"
Three padding characters are illegal: that would leave a single character carrying only 6 bits, which is not enough for even one byte.
The 6-bit value 0 is A in Base64, so a single 0 byte becomes AA==.
The first A carries the first 6 bits of the 0 byte, the second A contributes the 2 remaining 0 bits of that byte, and then only 4 zero bits plus the padding are left, which is not enough for a second byte.
"How do we know that the original byte isn't 00, or 000, but just 0?"
AA== has only 12 bits (6 bits per character) so it can only encode 1 Byte => 0
AAA= has 18 bits, enough for 2 bytes => 00
AAAA has 24 bits = 3 bytes => 000
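This can be checked with Python's standard base64 module. A small sketch, where the variable names are only illustrative:

```python
import base64

# Encode one, two and three zero bytes and record the resulting text.
encoded = {n: base64.b64encode(bytes(n)).decode("ascii") for n in (1, 2, 3)}

for n, text in encoded.items():
    # The amount of "=" padding tells the decoder how many bytes to emit.
    decoded = base64.b64decode(text)
    print(n, text, len(decoded))
```

The three inputs produce three different strings, so the decoder never has to guess how many zero bytes were sent.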

HEX2OCT formula in MS Excel returns incorrect result

While converting the hexadecimal value "FFFFFFFF00" into octal value using Hex2Oct of MS Excel, it should return "Error string" as per the rules mentioned here:
If number is negative, HEX2OCT ignores places and returns a 10-character octal number.
If number is negative, it cannot be less than FFE0000000, and if number is positive, it cannot be greater than 1FFFFFFF.
If number is not a valid hexadecimal number, HEX2OCT returns the #NUM! error value.
If HEX2OCT requires more than places characters, it returns the #NUM! error value.
If places is not an integer, it is truncated.
If places is nonnumeric, HEX2OCT returns the #VALUE! error value.
If places is negative, HEX2OCT returns the #NUM! error value.
But it computes and returns as "7777777400" without considering the rules/remarks mentioned in the link.
For example:
As per the Excel rule, if number is positive, it cannot be greater than 1FFFFFFF (hex) <-> 3777777777 (oct) <-> 536870911 (decimal).
But HEX2OCT for FFFFFFFF00 (hex), which read as an unsigned value is 1099511627520 (decimal), returns 7777777400 (oct).
Here the hex value FFFFFFFF00 is greater than 1FFFFFFF, but MS Excel does not return the error string; instead, it returns the converted octal value.
Can anyone explain why?
FFFFFFFF00 is actually well within the range of hex2oct because it is a negative number.
According to that documentation, the most negative number it can handle is FFE0000000, which converted to decimal is -536870912. Converting your "big" hex value to decimal yields -256.
The reason the value of FFFFFFFF00 looks so big is because it's a negative number. The first bit is set to 1 (when converted to binary) which signifies that the number is negative. Negatives are computed in binary using two's complement which is found by flipping each bit and then adding 1 to the number.
Undoing the two's complement:
For your big number, the binary representation is:
1111111111111111111111111111111100000000
Subtracting 1:
1111111111111111111111111111111011111111
Flipping all the bits:
0000000000000000000000000000000100000000
Which is 256
So, basically, if the 10-digit hex value looks big but its first bit is 1, it is actually a small negative number and well within your range of allowable values.
Lastly, HEX2OCT does not give you a negative sign for these because the result is still not in decimal notation. The first bit of the octal result is still a 1 (when converted to binary), since it is still the same number, just represented in a different base.
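The two's-complement reading above can be sketched in Python. This assumes a 40-bit input (10 hex digits) and a 30-bit octal result (10 octal digits), as the documented limits imply; the function names are made up for illustration and this is not Excel's actual implementation.

```python
def hex40_to_signed(hex_digits: str) -> int:
    """Interpret up to 10 hex digits as a 40-bit two's-complement number."""
    if len(hex_digits) > 10:
        raise ValueError("at most 10 hex digits")
    value = int(hex_digits, 16)
    if value >= 1 << 39:        # sign bit (bit 39) is set
        value -= 1 << 40
    return value


def signed_to_oct30(value: int) -> str:
    """Render a value as octal, using 30-bit two's complement for negatives."""
    if not -(1 << 29) <= value < (1 << 29):
        raise ValueError("#NUM!")   # outside FFE0000000 .. 1FFFFFFF
    return format(value & ((1 << 30) - 1), "o")


print(hex40_to_signed("FFFFFFFF00"))                    # -256
print(signed_to_oct30(hex40_to_signed("FFFFFFFF00")))   # 7777777400
```

With this reading, FFE0000000 comes out as -536870912 and 1FFFFFFF as 536870911, which are exactly the two limits quoted from the documentation.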
The clue lies earlier in the documentation page you quote:
The HEX2OCT function syntax has the following arguments:
Number Required. The hexadecimal number you want to convert. Number cannot contain more than 10 characters. The most significant
bit of number is the sign bit. The remaining 39 bits are magnitude
bits. Negative numbers are represented using two's-complement notation.
The hex value FFFFFFFF00 corresponds to the binary value
1111 1111 1111 1111 1111 1111 1111 1111 0000 0000
and as the documentation says, "the most significant bit is the sign bit ... two's complement notation". So this value represents a negative number. By the rules of two's complement, it actually represents -256. And this is fine, because it is not "less than FFE0000000", as FFE0000000 is -536870912.
If you actually want to treat FFFFFFFF00 as an unsigned quantity, and get the octal representation of decimal 1099511627520, you'll need to use another method.
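One such "other method" outside Excel, as a small Python sketch (the variable names are only illustrative): parse the hex digits as a plain unsigned integer and format that integer in base 8.

```python
# Treat the hex digits as an unsigned quantity, with no sign bit.
unsigned = int("FFFFFFFF00", 16)

# Format the same integer in base 8.
octal_text = format(unsigned, "o")

print(unsigned)     # 1099511627520
print(octal_text)   # 17777777777400
```

Note that the unsigned octal form needs 14 digits, which is more than the 10 characters HEX2OCT can return.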

Explain the number of bits in a hash value that features both numbers and letters

I need some help understanding this concept:
If I have a 256-bit hash, the value is essentially a 64-character long string. This is because each character is 4-bits long (64*4 = 256), correct? However, along with numbers letters are also used in hash values, and letters are 8-bits long. Doesn't a 64-character long hash key that features letters along with numbers ultimately create a hash value that is greater than 256-bits?
Take this hash value for example: 7833dc6e82e9378117bcb03128ac8fdd95d9073161ebc963783b3010dd847ff3
It is 64-characters long, but the letter d is 8-bits long rather than 4. So how does this hash count as 256-bits?
Thank you for your help!
The letters aren't really letters. You've probably noticed that the only included alphabet characters are A-F. This is because the hash is using base 16 (hexadecimal) numbering.
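Unlike base 10, where the valid characters are 0-9, base 16 has sixteen valid characters: 0 1 2 3 4 5 6 7 8 9 A B C D E F. 16 = 2^4, so each character represents 4 bits, and 64 characters represent 64 * 4 = 256 bits. The 8 bits per letter only applies to how the text form of the hash is stored, not to the value it stands for.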
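A quick way to see the difference between the value and its text form, sketched in Python with the hash from the question (the variable names are only illustrative):

```python
digest_hex = "7833dc6e82e9378117bcb03128ac8fdd95d9073161ebc963783b3010dd847ff3"

# Two hex characters make one byte, so 64 characters become 32 bytes.
raw = bytes.fromhex(digest_hex)

bits_in_value = len(raw) * 8                                  # size of the hash itself
bits_as_ascii_text = len(digest_hex.encode("ascii")) * 8      # size of the printable form

print(len(digest_hex), len(raw), bits_in_value, bits_as_ascii_text)
```

The hash value is 256 bits; it is only the human-readable hex string that takes 512 bits when stored as ASCII text.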

Space-efficient way to encode numbers as sortable strings

Starting with a list of integers the task is to convert each integer into a string such that the resulting list of strings will be in numeric order when sorted lexicographically.
This is needed so that a particular system that is only capable of sorting strings will produce an output that is in numeric order.
Example:
Given the integers
1, 23, 3
we could convert them to strings like this:
"01", "23", "03"
so that when sorted they become:
"01", "03", "23"
which is correct. A wrong result would be:
"1", "23", "3"
because that list is sorted in "string order", not in numeric order.
I'm looking for something more efficient than the simple zero-padding scheme. In order to cover all possible 32 bit integers we'd need to pad to 10 digits which is inefficient.
For integers, prefix each number with the length. To make it more readable, use 'a' for length 1, and 'b' for length 2. Example:
non-encoded  encoded
1            "a1"
3            "a3"
23           "b23"
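A minimal Python sketch of this length-prefix idea, assuming non-negative integers with at most 26 decimal digits so that a single letter 'a' to 'z' is enough for the length. The function name is hypothetical.

```python
def length_prefixed(n: int) -> str:
    """Prefix the decimal digits with a letter encoding how many digits follow."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    digits = str(n)
    if len(digits) > 26:
        raise ValueError("more than 26 digits would need a longer prefix")
    return chr(ord("a") + len(digits) - 1) + digits


values = [1, 23, 3, 100, 4294967295]
keys = sorted(length_prefixed(v) for v in values)
print(keys)
```

Shorter numbers get a smaller prefix letter and therefore sort first; numbers of equal length already compare correctly digit by digit.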
This scheme is a bit simpler than prefixing each digit, but only works with numbers, not numbers mixed with text. It can be made to work for negative numbers as well, and even BigDecimal numbers, using some tricks. I wrote an implementation in Apache Jackrabbit 2.x, to make BigDecimal indexable (sortable) as text. For that, I used a format that only uses the characters '0' to '9' and consists of:
one character for: signum(value) + 2
one character for: signum(exponent) + 2
one character for: length(exponent) - 1
multiple characters for: exponent
multiple characters for: value (-1 if inverted)
Only the signum is encoded if the value is zero. The exponent is not encoded if zero. Negative values are "inverted" character by character (0 => 9, 1 => 8, and so on). The same applies to the exponent.
Examples:
non-encoded  encoded
0            "2"
2            "322" (signum 1; exponent 0; value 2)
120          "330212" (signum 1; exponent signum 1, length 1, value 2; value 12)
-1           "179" (signum -1, rest inverted; exponent 0; value 1 (-1, inverted))
Values between BigDecimal(BigInteger.ONE, Integer.MIN_VALUE) and BigDecimal(BigInteger.ONE, Integer.MAX_VALUE) are supported.
TL;DR
Encode digits according to their order of magnitude (OM) and other characters so they sort as desired, relative to numbers: jj-a123 would be encoded zjzjz-zaC1B2A3
Longer explanation
This would depend somewhat upon the sorting algorithm that will finally be used to sort and how one would want any given punctuation characters to be sorted in relation to letters and numbers, but if it's "ascii-betical" or similar, you could encode each digit of a number to represent its order of magnitude (OM) in the number, while encoding other characters such that they would sort according to your desired sort order.
For simplicity, I would suggest beginning with encoding every non-numeric character with a "high" value (e.g. lower case z or even ~ if final value is ASCII), so that it sorts after encoded digits. Then cache each digit encountered until another non-numeric is encountered, then encode each cached digit with a value representing its OM. If the number 12945 was encountered in between non-numerics, you would output an E to encode an OM of 5, then the digit that is that order of magnitude, 1, followed by the next OM of 4 (D) and its associated digit, 2. Continue until all numeric digits have been flushed, then continue with non-numerics.
Non-numerics would be treated individually and ranked relative to the OM of digits. If it is desired for them to sort "above" numbers (perhaps the space character or certain others deemed special) they would be encoded by prepending a low-value character (like the space character, if final value will be treated and sorted as ASCII). When/if another numeric is encountered, begin caching and encode according to OM once all consecutive numerics are cached.
Alternately, processing the string in reverse order would preclude the need to cache numbers except for a single "is it a digit?" test and "is the last character a digit?" test. If the first is not true, then use (one of?) the "non-digit" OM character(s). If the first test is true then use the lowest-OM "digit" character (A in my examples). If both tests are true, then increment your OM character (A -> B or E -> F) before use.
Certain levels of additional filtering - or even translation - could be applied. If one wanted to allow accurate sorting based upon Roman numerals, one could encode them as decimal (or even hexadecimal) numbers with an appropriate OM.
Treating decimal points (either periods or commas, depending) as actual decimal separators, and distinct from other punctuation would probably be beyond the true utility of this encoding scheme, as alphanumeric fields seldom use a period or comma as a decimal separator. If it is desired to use them that way, the algorithm would simply detect a decimal separator (either period or comma as appropriate, in between digits) and not encode the numeric portion after that separator as anything but normal text. Fractional portions are actually sorted correctly during a normal ASCII based sort, because more digits represents greater precision - not greater magnitude.
Examples
non-encoded  encoded
-----------  -------
12345        E1D2C3B4A5
a100         zaC1B0A0
a20          zaB2A0
a2000        zaD2C0B0A0
x100.5       zxC1B0A0z.A5
x100.23      zxC1B0A0z.B2A3
1, 23, 3     A1z,z B2A3z,z A3
1, 2, 3      A1z,z A2z,z A3
1,2,3        A1z,A2z,A3
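The basic scheme can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every non-digit gets the same "z" prefix, digit runs are limited to 26 digits (OM letters A to Z), and decimal separators and negative numbers get no special treatment. The name om_encode is made up.

```python
import re


def om_encode(text: str, other: str = "z") -> str:
    """Prefix each digit with its order-of-magnitude letter and each other char with `other`."""
    out = []
    # Split the input into runs of ASCII digits and single non-digit characters.
    for token in re.findall(r"[0-9]+|[^0-9]", text):
        if token[0] in "0123456789":
            run_length = len(token)
            if run_length > 26:
                raise ValueError("digit run too long for the letters A-Z")
            for i, digit in enumerate(token):
                # The leftmost digit gets the highest OM letter, the last digit gets "A".
                out.append(chr(ord("A") + run_length - 1 - i) + digit)
        else:
            out.append(other + token)
    return "".join(out)


names = ["a2000", "a100", "a20"]
print(sorted(names, key=om_encode))   # ['a20', 'a100', 'a2000']
print(om_encode("jj-a123"))           # zjzjz-zaC1B2A3
```

Because the OM letter comes before the digit, a longer number always compares greater than a shorter one before any digit values are looked at.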
Potential advantages
Going somewhat beyond simple numeric sorting, some advantages to this encoding method would be several aspects of flexibility with final effective sort order - you are essentially encoding a category for each character - digits get a category based upon their position within the greater string of digits known as a number, while other characters are simply told to sort in their normal way (e.g. ASCII), but after numbers. Any exceptions that should sort before numbers or in other orders would be in one or more additional categories. ASCII can effectively be re-encoded to sort in a non-ASCII way:
You could encode lower case letters to sort before or along with upper case letters. To switch the lower and upper cases, you encode lower case letters with a y and upper case letters with a z. For a pseudo-case-insensitive sort, categorizing both A and a with the same encoding character would sort both of them before B and b, though A would nonetheless always sort before a
If you want Extended ASCII characters (e.g. with diacritics) to sort along with their ASCII cousins, you encode À, Á, Â, Ã, Ä, Å, and Æ along with A by using an a as the OM character, encode B, C, and Ç with a b, and E, È, É, Ê, and Ë with a c, etc. The same intra-category sort order caveat still applies, and some decisions need to be made on characters like capital Eth, and to a certain extent others like Thorn, and Sharp S (Ð, Þ, and ß respectively) as to whether they will sort based on similarities in appearance or pronunciation, or instead more properly perhaps, alphabetical order.
Small advantage of being basically human-readable, with effort
Caveats
Though this allows many 'categories' of characters to be defined, be sure to remember that each order of magnitude for digits is its own category - you need to know that the data will not contain numbers that are greater in OM than approximately 250, depending upon how many other categories you wish to define (ASCII 0 is reserved for storing strings, and there needs to be at least one other character to indicate "not a digit" - at least for alphanumeric data - making the maximum perhaps 254 orders of magnitude), but that should be plenty for any situation I can imagine. I'm not sure what other issues quantum computing will bring about, but there's probably a quantum solution to it, whatever it is.
Finally, if the hyphen is encoded as a non-numeric character, and all non-numerics are encoded with a higher OM than digits, negative numbers would be encoded as greater than any positive number. The hyphen should be encoded as a lower-than-digit-OM (perhaps only when preceding a digit) if negative numbers need to be sorted correctly according to magnitude.
Since the ASCII code of A is greater than that of 9, hexadecimal digits sort correctly as text, so you could encode the integers as fixed-width hexadecimal strings.
The integers
1, 23, 3
can be encoded as
00000001, 00000017, 00000003
and 32-bit integers can always be encoded as 8-character strings (assuming they are unsigned).
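In Python this is a one-line format call. A small sketch, assuming unsigned 32-bit input and uppercase hex digits; the function name is hypothetical.

```python
def hex_key(n: int) -> str:
    """Fixed-width, zero-padded uppercase hex for an unsigned 32-bit integer."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit integer")
    return format(n, "08X")


values = [1, 23, 3, 4294967295, 255]
print(sorted(hex_key(v) for v in values))
```

This saves two characters compared with 10-digit decimal padding; a larger alphabet would shorten the keys further as long as its characters sort in digit order.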

Writing null-terminated string "R5" in hexadecimal, etc

This is supposed to be a low-level course, and it is only the third day of class. However, we are asked to "Write the null-terminated string 'R5' in hexadecimal, binary, and octal notations. Assume that ASCII code is used"
I have no idea where to go to learn how to do this. Any suggestions? Thanks.
NULL-terminated ASCII strings are stored with one byte per character, plus one byte for the NULL. You would therefore be printing three bytes - 'R', '5', and 0.
Look up 'R' and '5' on an ASCII chart to see what the numeric values are for those characters in ASCII. Then, write out your three bytes three different ways - one each for hexadecimal, binary and octal.
Hope that helps.
It seems like this just requires you to look up the appropriate entries from the ASCII table, which in most cases lists hex and octal and the characters themselves.
ASCII is a standard way of defining how characters are represented, and most tables will list characters against corresponding hex, decimal, and octal values. The first 128 are standard and the next 128 are the extended characters (those weird characters that don't map to an English keyboard).
If you google "ASCII table" you'll be inundated with different links. The top one I saw at www.asciitable.com appears to have everything you need - except binary.
Most of the time you're not going to see binary listed, but it's fairly straightforward to translate a hex value into binary - your Windows Calculator will happily do this for you.
To more directly translate your specific string you'll look up each character (including the NULL) separately and translate each individually.
Ultimately, to the computer, everything is a number. To represent characters such as letters or symbols, we can agree on an encoding, or a numbering of these characters. For example, we could invent a new encoding where 1 means 'A', 2 means 'B', and so on. ASCII is one commonly used text encoding which maps characters to numbers. In this case, we are concerned with a string of 3 characters: 'R', '5', and null (a null character marks the end of a string; it is represented by the value 0). If you look in an ASCII table, you'll find that the numeric values are 82, 53, and 0.
String: R, 5, <null>
Decimal numbers: 82, 53, 0
Our normal number system is base-10, or decimal. This means that each digit represents a value ten times larger than the next (1, to 10, to 100, to 1000, etc.). Alternate bases include 8 (octal), 16 (hexadecimal), and 2 (binary). There is a straightforward way to convert between bases, although you can also easily find calculators that will do the conversion for you. You may want to review the relevant section of your textbook, or check out the Wikipedia articles. For the example of decimal 82, the hexadecimal value is 52 (this means 5*16 + 2 = 8*10 + 2). Oftentimes you will see a prefix of "0x"; this is commonly used to make it clear that the following digits are in base 16 (otherwise, you might think "52" refers to the decimal value 52).
Interesting. So would it be correct to say that the null-terminated string "R5" is simply "52, 35, 30" or is there a more correct format to it? Thank you for your patience.
As I pointed out in another comment, the actual value 0 marks the end of a string, not the value 0x30, which represents a character '0' in the string. Note that the value of zero (0) is the same regardless of which base your numbers are in.
String:      R,  5,  <null>
Decimal:     82, 53, 0
Hexadecimal: 52, 35, 0
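If you want to check your hand conversion for all three notations, here is a short Python sketch (the variable names are only illustrative). It builds the three bytes and formats each one in hexadecimal, binary, and octal.

```python
# 'R', '5' and the terminating null byte, using ASCII.
data = "R5".encode("ascii") + b"\x00"

hex_form = [format(b, "02X") for b in data]   # two hex digits per byte
bin_form = [format(b, "08b") for b in data]   # eight bits per byte
oct_form = [format(b, "03o") for b in data]   # three octal digits per byte

print("Hexadecimal:", ", ".join(hex_form))    # 52, 35, 00
print("Binary:     ", ", ".join(bin_form))    # 01010010, 00110101, 00000000
print("Octal:      ", ", ".join(oct_form))    # 122, 065, 000
```

The leading zeros are only there to pad each byte to a fixed width; the null terminator is the value 0 in every base.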

Resources