How do base64 encoding and decoding figure out that the last few zeros are mere padding?

https://learn.microsoft.com/en-us/dotnet/api/system.convert.tobase64string?view=net-5.0
It says
If an integral number of 3-byte groups does not exist, the remaining
bytes are effectively padded with zeros to form a complete group. In
this example, the value of the last byte is hexadecimal FF. The first
6 bits are equal to decimal 63, which corresponds to the base-64 digit
"/" at the end of the output, and the next 2 bits are padded with
zeros to yield decimal 48, which corresponds to the base-64 digit,
"w". The last two 6-bit values are padding and correspond to the
valueless padding character, "=".
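(As a quick aside, the bit arithmetic in that quoted example can be reproduced with a single trailing 0xFF byte; Python's base64 module is used here purely as an illustration:)

import base64
# Trailing byte 0xFF: 111111 -> '/' (63), then 11 + 0000 -> 'w' (48), then '==' padding
print(base64.b64encode(b'\xff'))  # b'/w=='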
Now,
Imagine that the byte array I send is
0
So, only one byte, namely 0
That one byte will be padded out to 000, right?
So now we will have something like 0=== as the encoding, because it takes 4 characters in base64 to encode 3 bytes.
Now we are going to decode that.
How do we know that the original byte isn't 00, or 000, but just 0?
I must be missing something here.

So now, we will have something like 0=== as the encoding
Three padding characters are illegal; that would mean only 6 bits plus padding, which is not even a full byte.
A byte value of 0 is A in base64, so the encoding would be AA==.
The first A carries the first 6 bits of the zero byte, the second A contributes the remaining 2 bits of your byte, and then only 4 zero bits plus the padding are left, not enough for a second byte.
How do we know that the original byte isn't 00, or 000, but just 0?
AA== has only 12 bits (6 bits per character), so it can only encode 1 byte => 0
AAA= has 18 bits, enough for 2 bytes => 00
AAAA has 24 bits = 3 bytes => 000
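As a quick check, Python's base64 module (used here purely as an illustration) shows exactly this: one, two, and three zero bytes produce AA==, AAA=, and AAAA, and each decodes back to the original length.

import base64
for raw in (b'\x00', b'\x00\x00', b'\x00\x00\x00'):
    encoded = base64.b64encode(raw)
    decoded = base64.b64decode(encoded)
    print(encoded.decode(), '->', len(decoded), 'byte(s)')
# AA== -> 1 byte(s)
# AAA= -> 2 byte(s)
# AAAA -> 3 byte(s)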

Related

base64url buffer decoding

Can someone explain this behavior?
Buffer.from('5d9RAjZ2GCob-86_Ql', 'base64url').toString('base64url')
// 5d9RAjZ2GCob-86_Qg
Please take a close look at the last character: l vs. g.
Your string is 18 characters long. With 6 bits encoded in each character, the first 16 characters represent 96 bits (12 bytes), and the last two represent one byte plus 4 unused bits. Only the first two bits of the last character are significant here. g is 100000, l is 100101. As the last 4 bits are not used, g is just the first choice for the two bits 10.
So for any final character in the range between g and v, you would get a g when you convert it back to base64url.
See https://en.wikipedia.org/wiki/Base64#Base64_table_from_RFC_4648
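The same normalization can be reproduced with Python's base64 module (shown here only as an illustration of the bit math, not of Node's Buffer API; Python insists on the = padding being restored before decoding):

import base64
s = '5d9RAjZ2GCob-86_Ql'
raw = base64.urlsafe_b64decode(s + '=' * (-len(s) % 4))   # 13 bytes; the 4 extra bits are dropped
back = base64.urlsafe_b64encode(raw).rstrip(b'=').decode()
print(back)  # 5d9RAjZ2GCob-86_Qg -- the unused trailing bits come back as zeros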

Is there a base64 encoding for numbers that works like base10 or base2?

In base2 (binary), the characters to represent each digit are 01. 0 being the first character of the base2 alphabet, you can prefix any base2 number with as many 0s as you want without changing the meaning of the number.
All of these are equivalent:
11
011
0011
00011
In base10 (decimal), the characters to represent each digit are 0123456789. 0 being the first character of the base10 alphabet, you can prefix any base10 number with as many 0s as you want without changing the meaning of the number.
All of these are equivalent:
3
03
003
0003
In a hypothetical base64, let's assume the characters to represent each digit are ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/. A being the first character of the base64 alphabet, you should be able to prefix any base64 number with as many As as you want without changing the meaning of the number.
All of these would be equivalent:
5+fn
A5+fn
AA5+fn
AAA5+fn
I understand that base64 does not work this way because it was intended to encode arbitrary binary data, not numbers.
Is there a formal RFC documenting this hypothetical base64 encoding? Are there any implementations in some programming language?
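For what it's worth, the positional scheme described above takes only a few lines to implement yourself. The sketch below is hypothetical: it simply uses the alphabet from the question and is not a standard library or RFC-defined API.

ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def int_to_base64(n):
    # Encode a non-negative integer as positional base-64 digits.
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 64)
        digits.append(ALPHABET[r])
    return ''.join(reversed(digits))

def base64_to_int(s):
    # Leading 'A' digits contribute nothing, just like leading zeros in base 10.
    n = 0
    for c in s:
        n = n * 64 + ALPHABET.index(c)
    return n

assert base64_to_int('5+fn') == base64_to_int('AAA5+fn')  # leading As do not change the value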

How to count binary sequence in binary number in Python?

I would like to count '01' sequences in 5760 binary bits.
First, I would like to combine several binary numbers, then count the number of '01' occurrences.
For example, I have a 64-bit integer, say 6291456. I convert it into binary. The most significant 4 bits are not used, so I get the 60-bit binary string 000...000011000000000000000000000.
Then I need to combine (just put the bits together, since I only need to count '01') the first 60 bits + the second 60 bits + ..., so 96 chunks of 60 bits are stitched together.
Finally, I want to count how many '01' appears.
s = binToString(5760 binary bits)
cnt = s.count('01');
num = 6291226
binary = format(num, 'b')
print(binary)
print(binary.count('01'))
If I use the number you gave, i.e. 6291456, its binary representation is 11000000000000000000000, which gives 0 occurrences of '01'.
If you always want your number to be 60 bits in length, you can use
binary = format(num, '060b')
It will add leading zeros to pad the string to the given length.
Say that nums is your list of 96 numbers, each of which fits in 64 bits. Since you want to throw away the 4 most significant bits, you are really taking each number modulo 2**60. Thus, to count the number of '01' occurrences in the resulting string, using @ShrikantShete's idea of the format function, you can do it all in one line:
''.join(format(n%2**60,'060b') for n in nums).count('01')
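For example, with a hypothetical three-element list standing in for your 96 numbers:

nums = [6291226, 6291456, 1]   # stand-in sample; your real list has 96 values
bits = ''.join(format(n % 2**60, '060b') for n in nums)
print(len(bits))           # 180 (3 chunks of 60 bits)
print(bits.count('01'))    # number of '01' occurrences across the stitched string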

Explain the number of bits in a hash value that features both numbers and letters

I need some help understanding this concept:
If I have a 256-bit hash, the value is essentially a 64-character-long string. This is because each character is 4 bits long (64*4 = 256), correct? However, along with numbers, letters are also used in hash values, and letters are 8 bits long. Doesn't a 64-character-long hash key that features letters along with numbers ultimately create a hash value that is greater than 256 bits?
Take this hash value for example: 7833dc6e82e9378117bcb03128ac8fdd95d9073161ebc963783b3010dd847ff3
It is 64 characters long, but the letter d is 8 bits long rather than 4. So how does this hash count as 256 bits?
Thank you for your help!
The letters aren't really letters. You've probably noticed that the only included alphabet characters are A-F. This is because the hash is using base 16 (hexadecimal) numbering.
Unlike base 10 where the valid characters are 0-9, in base 16, there are sixteen valid characters: 0 1 2 3 4 5 6 7 8 9 A B C D E F. 16 = 2^4, so you need 4 bits for each character.
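As a concrete illustration (Python used here only for demonstration), a SHA-256 digest is 32 raw bytes, and its hexadecimal rendering is 64 characters of 4 bits each:

import hashlib
digest = hashlib.sha256(b'example').digest()   # 32 raw bytes
hex_str = digest.hex()                         # hexadecimal text form
print(len(digest) * 8)   # 256 -- bits of actual hash data
print(len(hex_str))      # 64  -- hex characters, each encoding 4 bits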

node.js: get byte length of the string "あいうえお"

I think I should be able to get the byte length of a string with:
Buffer.byteLength('äáöü') // returns 8 as I expect
Buffer.byteLength('あいうえお') // returns 15, expecting 10
However, when getting the byte length with a spreadsheet program (LibreOffice) using =LENB("あいうえお"), I get 10 (which is what I expect).
So, why do I get for 'あいうえお' a byte length of 15 rather than 10 using Buffer.byteLength?
PS.
Testing "あいうえお" on these two sites, I get two different results:
http://bytesizematters.com/ returns 10 bytes
https://mothereff.in/byte-counter returns 15 bytes
What is correct? What is going on?
node.js is correct. The UTF-8 representation of the string "あいうえお" is 15 bytes long:
E3 81 82 = U+3042 'あ'
E3 81 84 = U+3044 'い'
E3 81 86 = U+3046 'う'
E3 81 88 = U+3048 'え'
E3 81 8A = U+304A 'お'
The other string is 8 bytes long in UTF-8 because the Unicode characters it contains are below the U+0800 boundary and can each be represented with two bytes:
C3 A4 = U+E4 'ä'
C3 A1 = U+E1 'á'
C3 B6 = U+F6 'ö'
C3 BC = U+FC 'ü'
From what I can see in the documentation, LibreOffice's LENB() function is doing something different and confusing:
For strings which contain only ASCII characters, it returns the length of the string (which is also the number of bytes used to store it as ASCII).
For strings which contain non-ASCII characters, it returns the number of bytes required to store it in UTF-16, which uses two bytes for all characters under U+10000. (I'm not sure what it does with characters above that, or if it even supports them at all.)
It is not measuring the same thing as Buffer.byteLength, and should be ignored.
With regard to the other tools you're testing: Byte Size Matters is wrong. It's assuming that all Unicode characters up to U+FF can be represented using one byte, and all other characters can be represented using two bytes. This is not true of any character encoding; in fact, it's impossible. If you encode every character up to U+FF using one byte, you've used up all possible values for that byte, and you have no way to represent anything else.
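To see the difference concretely (Python shown here purely as an illustration; Buffer.byteLength defaults to UTF-8 in Node.js):

s_jp = 'あいうえお'
s_de = 'äáöü'
print(len(s_jp.encode('utf-8')))      # 15 -- three bytes per character in UTF-8
print(len(s_de.encode('utf-8')))      # 8  -- two bytes per character in UTF-8
print(len(s_jp.encode('utf-16-le')))  # 10 -- the UTF-16 size that LENB appears to report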
