node.js: get byte length of the string "あいうえお"

I think I should be able to get the byte length of a string by:
Buffer.byteLength('äáöü') // returns 8 as I expect
Buffer.byteLength('あいうえお') // returns 15, expecting 10
However, when getting the byte length with a spreadsheet program (LibreOffice) using =LENB("あいうえお"), I get 10 (which I expect).
So why do I get a byte length of 15 rather than 10 for 'あいうえお' with Buffer.byteLength?
PS.
Testing "あいうえお" on these two sites, I get two different results:
http://bytesizematters.com/ returns 10 bytes
https://mothereff.in/byte-counter returns 15 bytes
What is correct? What is going on?

node.js is correct. The UTF-8 representation of the string "あいうえお" is 15 bytes long:
E3 81 82 = U+3042 'あ'
E3 81 84 = U+3044 'い'
E3 81 86 = U+3046 'う'
E3 81 88 = U+3048 'え'
E3 81 8A = U+304A 'お'
The other string is 8 bytes long in UTF-8 because the Unicode characters it contains are below the U+0800 boundary and can each be represented with two bytes:
C3 A4 = U+E4 'ä'
C3 A1 = U+E1 'á'
C3 B6 = U+F6 'ö'
C3 BC = U+FC 'ü'
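You can verify the counts from Node itself by inspecting the raw bytes; a quick sketch:
const kana = 'あいうえお';
const umlauts = 'äáöü';
// Buffer.from() encodes as UTF-8 by default, so the buffer holds the UTF-8 bytes
console.log(Buffer.byteLength(kana), Buffer.from(kana).toString('hex'));
// 15 'e38182e38184e38186e38188e3818a'
console.log(Buffer.byteLength(umlauts), Buffer.from(umlauts).toString('hex'));
// 8 'c3a4c3a1c3b6c3bc'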
From what I can see in the documentation, LibreOffice's LENB() function is doing something different and confusing:
For strings which contain only ASCII characters, it returns the length of the string (which is also the number of bytes used to store it as ASCII).
For strings which contain non-ASCII characters, it returns the number of bytes required to store it in UTF-16, which uses two bytes for all characters under U+10000. (I'm not sure what it does with characters above that, or if it even supports them at all.)
It is not measuring the same thing as Buffer.byteLength, and should be ignored.
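If the UTF-16 byte count is what you actually need, note that Buffer.byteLength accepts an optional encoding argument; a minimal sketch:
// 'utf16le' stores every character below U+10000 in two bytes,
// which matches the value LENB() reports for non-ASCII strings
console.log(Buffer.byteLength('あいうえお', 'utf16le')); // 10
console.log(Buffer.byteLength('あいうえお', 'utf8'));    // 15 (the default)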
With regard to the other tools you're testing: Byte Size Matters is wrong. It assumes that all Unicode characters up to U+FF can be represented using one byte, and all other characters using two bytes. This is not true of any character encoding; in fact, it's impossible. If you encode every character up to U+FF using one byte, you've used up all 256 possible values of that byte, and you have no way left to represent anything else.
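Node's own single-byte 'latin1' encoding makes the point concrete: it covers exactly U+0000 through U+00FF, and the Node documentation notes that characters above that range are truncated when encoding. A small sketch:
// 'ä' (U+00E4) fits in the Latin-1 range and survives the round trip
console.log(Buffer.from('ä', 'latin1').toString('latin1')); // 'ä'
// 'あ' (U+3042) does not fit; only its low byte (0x42) is kept
console.log(Buffer.from('あ', 'latin1').toString('latin1')); // 'B'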

Related

Excel - convert cell from hex string to SHIFT-JIS characters

I need to convert space-delimited hex values (i.e. "B2 DD C0 B0 C8 AF C4 20 B9 DE B0 D1 81 48") in a column of cells to their Shift-JIS character equivalents in Excel. These strings may also include line breaks that need to be included in the translated cell value.
All of the functions and VBA code examples I've located so far appear to work only with Western ASCII values or Unicode, which displays the incorrect characters. Converting by importing from CSV is not a viable solution, since the values are extracted from another hex dump in another worksheet (hex string extracted using an offset table for length).
I tried using a sample VBA function to convert the hex characters, but it's not able to properly handle the Shift-JIS encoding.
=HexToString(SUBSTITUTE(I2,CHAR(32),"")) will result in "²ÝÀ°È¯Ä ¹Þ°ÑH" instead of "インターネット ゲーム?"
Public Function HexToString(InitialString As String) As String
    Dim i As Long
    ' Builds one character per byte via Chr(), which uses the system ANSI
    ' code page and does not understand double-byte Shift-JIS sequences
    For i = 1 To Len(InitialString) Step 2
        HexToString = HexToString & Chr("&H" & Mid(InitialString, i, 2))
    Next i
End Function
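No answer is shown here, but for comparison, outside of Excel the decoding itself is straightforward. A sketch in Node.js, assuming a build with full ICU (the default in official releases), whose TextDecoder accepts the 'shift_jis' label; the helper name is illustrative:
// Decode a space-delimited hex string as Shift-JIS text
function shiftJisHexToString(hex) {
  const bytes = Buffer.from(hex.replace(/\s+/g, ''), 'hex');
  return new TextDecoder('shift_jis').decode(bytes);
}

console.log(shiftJisHexToString('B2 DD C0 B0 C8 AF C4 20 B9 DE B0 D1 81 48'));
// half-width katakana 'ｲﾝﾀｰﾈｯﾄ ｹﾞｰﾑ' followed by a full-width question mark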

How does Base64 encode/decode figure out that the last few zeros are mere padding?

https://learn.microsoft.com/en-us/dotnet/api/system.convert.tobase64string?view=net-5.0
It says:
If an integral number of 3-byte groups does not exist, the remaining bytes are effectively padded with zeros to form a complete group. In this example, the value of the last byte is hexadecimal FF. The first 6 bits are equal to decimal 63, which corresponds to the base-64 digit "/" at the end of the output, and the next 2 bits are padded with zeros to yield decimal 48, which corresponds to the base-64 digit, "w". The last two 6-bit values are padding and correspond to the valueless padding character, "=".
Now, imagine that the byte array I send is
0
So, only one byte, namely 0.
That one byte will be padded out to three bytes, right?
So now, we will have something like 0=== as the encoding, because it takes 4 characters in Base64 to encode 3 bytes.
Now we're going to decode that.
How do we know that the original byte isn't 00, or 000, but just 0?
I must be missing something here.
So now, we will have something like 0=== as the encoding
Three padding characters is illegal; that would mean only 6 bits plus padding.
And 0 as a byte value is A in Base64, so it would be AA==.
So the first A holds the first 6 bits of the zero byte, the second A contributes the 2 remaining bits of your byte, and then there are just 4 zero bits plus the padding left, not enough for a second byte.
How do we know that the original byte isn't 00, or 000, but just 0?
AA== has only 12 bits (6 bits per character), so it can only encode 1 byte => 0
AAA= has 18 bits, enough for 2 bytes => 00
AAAA has 24 bits = 3 bytes => 000
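The question quotes the .NET docs, but the same bookkeeping is easy to check from Node; a quick sketch:
console.log(Buffer.from([0]).toString('base64'));       // 'AA=='
console.log(Buffer.from([0, 0]).toString('base64'));    // 'AAA='
console.log(Buffer.from([0, 0, 0]).toString('base64')); // 'AAAA'
// Decoding recovers the original byte counts, so nothing is ambiguous:
console.log(Buffer.from('AA==', 'base64').length); // 1
console.log(Buffer.from('AAA=', 'base64').length); // 2
console.log(Buffer.from('AAAA', 'base64').length); // 3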

Go: Converting an integer from a string

Recently while doing some algorithm practice on LeetCode I came across a solution. I understood everything except the part where the user converts an element of a string to an integer; look at the code below. Hopefully someone can explain this to me. Thanks for the replies in advance.
a := 234
b := strconv.Itoa(a)
c := int(b[0]-48) // why do we subtract 48?
48 is the code of the '0' character in the ASCII table.
Go stores strings as their UTF-8 byte sequences in memory, which maps characters of the ASCII table one-to-one to their code.
The digits in the ASCII table are listed contiguously, '0' being 48. So if you have a digit in a string, and you subtract 48 from the character's code, you get the digit as a numeric value.
Indexing a string indexes its bytes, and in your case b[0] is the first byte of the b string, which is '2'. And '2' - 48 is 2.
For example:
fmt.Println('0' - 48)
fmt.Println('1' - 48)
fmt.Println('2' - 48)
fmt.Println('3' - 48)
fmt.Println('4' - 48)
This outputs (try it on the Go Playground):
0
1
2
3
4
b is the string "234". Indexing a string yields its bytes (a Go string is not a slice of runes), so b[0] is a byte, in this case the value 50, which is the decimal value of "2" in ASCII. So c will be 50 - 48 = 2.
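The same character-code arithmetic works in any language whose strings expose ASCII codes; for instance, in JavaScript (the language of the main question above), a small sketch:
const b = String(234);          // '234'
const c = b.charCodeAt(0) - 48; // 50 - 48, since '2' has character code 50
console.log(c);                 // 2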

How to count binary sequence in binary number in Python?

I would like to count occurrences of the '01' sequence in 5760 binary bits.
First, I would like to combine several binary numbers, then count the number of '01' occurrences.
For example, I have a 64-bit integer. Say, 6291456. Then I convert it into binary. The most significant 4 bits are not used, so I'll get the 60-bit binary 000...000011000000000000000000000.
Then I need to combine (just put the bits together, since I only need to count '01') the first 60 bits + the second 60 bits + ... so 96 blocks of 60 bits are stitched together.
Finally, I want to count how many times '01' appears.
s = binToString(5760 binary bits)
cnt = s.count('01')
num = 6291226
binary = format(num, 'b')
print(binary)
print(binary.count('01'))
If I use the number given by you, i.e. 6291456, its binary representation is 11000000000000000000000, which gives 0 occurrences of '01'.
If you always want your number to be 60 bits in length, you can use
binary = format(num,'060b')
It will add leading 0s to pad the string to the given length.
Say that nums is your list of 96 numbers, each of which can be stored in 64 bits. Since you want to throw away the 4 most significant bits, you are really taking each number modulo 2**60. Thus, to count the number of '01' occurrences in the resulting string, using @ShrikantShete's idea of using the format function, you can do it all in one line:
''.join(format(n%2**60,'060b') for n in nums).count('01')
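For comparison, the same pipeline can be sketched in Node.js; BigInt is used because the inputs are 64-bit values, and the variable names are illustrative:
const nums = [6291456n, 6291226n];             // example 64-bit inputs
const bits = nums
  .map(n => n % (1n << 60n))                   // throw away the top 4 bits
  .map(n => n.toString(2).padStart(60, '0'))   // format as a 60-bit binary string
  .join('');
console.log((bits.match(/01/g) || []).length); // count '01' occurrences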

node: converting buffers to decimal values

I have a buffer that is filled with data and begins with <Buffer 52 49 ...>
Assuming this buffer is defined as buf, if I run buf.readInt16LE(0) the following is returned:
18770
Now, the binary representation of hex values 52 and 49 are:
01010010 01001001
If I were to convert the first 15 bits to decimal, omitting the 16th bit for two's complement, I would get the following:
21065
Why didn't my results give me the value of 18770?
18770 is 01001001 01010010, i.e. your two bytes reversed, which is exactly what the readInt*LE functions do: LE stands for little-endian, least significant byte first.
Use readInt16BE.
You could do this: parseInt("0x" + buf.toString("hex")). Probably a lot slower but would do in a pinch.
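A quick sketch showing both reads side by side:
const buf = Buffer.from([0x52, 0x49]);
console.log(buf.readInt16LE(0)); // 18770 (0x4952: low byte first, so bytes swapped)
console.log(buf.readInt16BE(0)); // 21065 (0x5249: bytes in the order written)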
