Node Buffer Alias - binary is latin1? - node.js

According to this page:
'binary' - Alias for 'latin1'.
However, binary data is not representable in Latin-1, as certain code points are missing. So a developer like me, who wants to use Node.js buffers for binary data (an extremely common use case), would expect to use 'binary' as the encoding. There doesn't appear to be any documentation that properly explains how to handle binary data! I'm trying to make sense of this.
So my question is: Why was latin1 chosen as an alias for binary?
People have mentioned that using null as the encoding will work for binary data. So a follow-up question: Why don't null and 'binary' do the same thing?

The Node documentation's definition of 'latin1', on the line above the definition of 'binary' cited in the question, is not ISO 8859-1. It is:
'latin1' - A way of encoding the Buffer into a one-byte encoded string
(as defined by the IANA in RFC1345, page 63, to be the Latin-1
supplement block and C0/C1 control codes).
The 'latin1' charset specified in RFC 1345 defines mappings for all 256 codepoints. It does not have the holes that exist at 0x00-0x1f and 0x7f-0x9f in the ISO 8859-1 mapping.
Why don't null and 'binary' do the same thing?
Node does not have a null encoding. If you call Buffer.from('foo', null) then you get the same result as if you had called Buffer.from('foo'). That is, the default encoding is applied. The default encoding is 'utf8', and clearly that can produce results that differ from the 'binary' encoding.
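A short sketch of both points (assuming a current Node.js runtime):

const buf = Buffer.from([0x00, 0x7f, 0x80, 0x9f, 0xff]);

// 'binary'/'latin1' maps every byte 0x00-0xFF to the code point of the same
// value, so a byte-for-byte round trip through a string is lossless.
const str = buf.toString('binary');                   // same result as 'latin1'
console.log(Buffer.from(str, 'binary').equals(buf));  // true

// null is not a separate encoding; the default 'utf8' is applied instead.
console.log(Buffer.from('é', null));                  // <Buffer c3 a9> (UTF-8 bytes)
console.log(Buffer.from('é', 'binary'));              // <Buffer e9> (one byte per character)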

Related

Is there any reason why okhttp3.Credentials use ISO_8859_1 instead UTF-8?

I am using okhttp3.Credentials to get a Base64 string in my current project. I spotted an issue with Cyrillic symbols that I passed to the server as a Base64 string, and eventually found out that the current implementation of okhttp3.Credentials uses ISO_8859_1.
Is there a deliberate reason to go with ISO_8859_1 here instead of the more universal UTF-8?
Update from answer:
From reference to spec
The original definition of this authentication scheme failed to
specify the character encoding scheme used to convert the user-pass
into an octet sequence. In practice, most implementations chose
either a locale-specific encoding such as ISO-8859-1 ([ISO-8859-1]),
or UTF-8 ([RFC3629]). For backwards compatibility reasons, this
specification continues to leave the default encoding undefined, as
long as it is compatible with US-ASCII (mapping any US-ASCII
character to a single octet matching the US-ASCII character code).
B.3. Why not simply switch the default encoding to UTF-8?
There are sites in use today that default to a local character
encoding scheme, such as ISO-8859-1 ([ISO-8859-1]), and expect user
agents to use that encoding. Authentication on these sites will stop
working if the user agent switches to a different encoding, such as
UTF-8.
Note that sites might even inspect the User-Agent header field
([RFC7231], Section 5.5.3) to decide which character encoding scheme
to expect from the client. Therefore, they might support UTF-8 for
some user agents, but default to something else for others. User
agents in the latter group will have to continue to do what they do
today until the majority of these servers have been upgraded to
always use UTF-8.
Discussion here: https://github.com/square/okhttp/pull/3134
It's legacy behaviour, and you can override it with the optional charset parameter.
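For example, a small sketch (assuming an OkHttp version that includes the Charset overload added in that pull request; the class name here is just for illustration):

import java.nio.charset.StandardCharsets;
import okhttp3.Credentials;

public class BasicAuthUtf8 {
    public static void main(String[] args) {
        // Build the Authorization header value with an explicit UTF-8 charset
        // instead of the legacy ISO-8859-1 default.
        String header = Credentials.basic("пользователь", "пароль", StandardCharsets.UTF_8);
        System.out.println(header);
    }
}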

Want to know the decoding method

I wanted to decode my PUBG name. I came across this site: http://ddecode.com/hexdecoder/
It decodes as I want, but now I want to know what technique they use, so I can use it in my project.
Input :
PSYCH%C3%98%E4%B9%82JOKER
Decoded String:
PSYCHØ乂JOKER
Here is the result URL: http://ddecode.com/hexdecoder/?results=48d3b517a922349a1838240623f6e7c3
You should take a look at percent-encoding; it is a way of encoding data so that it can be written safely in URLs. The pairs of hex digits after each % symbol are simply the UTF-8 bytes of the special characters Ø and 乂.
0xC3 0x98 corresponds to Ø and 0xE4 0xB9 0x82 to 乂 in UTF-8.
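If you want to do the decoding yourself, a minimal sketch in Python (assuming the input is percent-encoded UTF-8, as it is here):

from urllib.parse import unquote

# Percent-decode the escaped UTF-8 bytes back into readable text.
print(unquote("PSYCH%C3%98%E4%B9%82JOKER"))  # PSYCHØ乂JOKER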
By the way, since you added the encryption tag and used that word in your question: this is not a case of decryption. You might want to take a look at the difference between all that terminology (encoding versus encryption, for example).

Converting bits to string (data)

I have a file which contains some data (text copied and pasted from the "What You Will Learn" portion of this PDF). First, I converted the contents of the file to bits successfully. However, when I try to convert them back to the original format, some of the characters are not converted correctly, as shown below:
Cisco has
developed the Cisco Open Network Environment (ONE)
architecture as a multifaceted approach to network
programmability delivered across three pillars:
??)É¥ Í?н??ÁÁ±¥?Ñ¥½¸ÁɽÉ?µµ¥¹?¥¹Ñ?É???Ì?¡A%̤?)?áÁ½Í??¥É?ѱ佸Íݥѡ?Ì?¹É½ÕÑ?ÉÌѼ?Õµ?¹Ð?)?á¥ÍÑ¥¹?=Á?¹±½ÜÍÁ?¥?¥?Ñ¥½¹Ì* ¤&öGV7F?öâ×&VG?÷VäfÆ÷r6öçG&öÆÆW"æB÷VäfÆ÷r ¦vVçG0¨?HÝZ]HÙ??ÙXÝÈÈ[]?\??\X[Ý?\?^\Ë?\X[?Ù\?XÙ\Ë[??\ÛÝ\?ÙHÜ?Ú\Ý?][Û?Ø\X?[]Y\È[?H?]HÙ[
As you can see here some characters are converted successfully, others are not.
My code is below:
file = open("test.txt",'r')
myfile = ''.join(map(str,file))
l = []
for i in myfile:
    asc11 = ord(i)
    b = "{0:08b}".format(asc11)
    l.extend(int(y) for y in b)
string_bin = ''.join(map(str,l))
mydata = ''.join(chr(int(string_bin[i:i+8], 2)) for i in range(0,len(string_bin), 8))
print(mydata)
What's wrong with my code? What do I need to change to make it work properly?
What's Going On?
You are running into an encoding issue because some characters in the PDF are non-ASCII. For example, the bullet points are U+2022, which requires 3 bytes of storage in UTF-8.
When Python reads from your file, it doesn't know what encoding you used to write that data. Thus it reads bytes from the file and uses a character encoding to translate them into strs, which are stored using Python's own internal Unicode format. (This differs from Python 2, where open() returned raw bytes stored in a str which you could then manually decode to unicode.)
Thus, in Python 3, open() accepts a named encoding parameter. For example open("test.txt",'r', encoding='ascii'). Because you don't specify the encoding when you call open(), you end up using your system's default encoding. For instance, on my laptop, the default encoding is CP1252 (LATIN-1). Yours may differ.
Whatever encoding Python uses to interpret your file, it then stores your string internally in its own Unicode format. This means that your string may contain multi-byte characters even if the original encoding did not. For example, my laptop uses CP1252, so the three bytes 0xE2 0x80 0xA2 (the UTF-8 encoding of the bullet point U+2022) are read back as the three characters U+00E2, U+20AC and U+00A2 (â, €, ¢); the € is stored using a multi-byte character even though it was just one byte in the original file.
Let's assume your computer is sane and uses UTF-8 by default (this explanation is similar for many multi-byte characters). When you reach a bullet point, it is stored as U+2022. When you call ord('\u2022') the result is 8226. When you then call "{0:08b}".format(8226), this returns "10000000100010". That's a 14-character string. Your parsing code assumes all of the ordinals will generate 8-character strings. Because of this, the "binary" output becomes misaligned. This means that when you then parse the binary string in 8-character segments, it gets thrown off and starts interpreting things as control characters and all sorts of foreign-language characters.
If you call open(..., encoding='ascii'), Python will actually throw an exception, because the file contains bytes that are not valid ASCII.
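A quick check of that misalignment (a small sketch; the bullet point is the character mentioned above):

print(ord('\u2022'))           # 8226
print("{0:08b}".format(8226))  # 10000000100010  (14 characters, not 8)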
Possible Solutions
I'm not sure why exactly you are converting the input string into the representation that you are using. It's not binary, as your question title would suggest. Rather, you've converted the data into a textual representation of its binary encoding.
Technically speaking, when you store encoded text to a file, it's stored using a binary representation. Python, like any text editor, has to decode those bytes into its internal character representation before it can display them as text. Thus, calling open("test.txt", "r", encoding="utf-8") reads the binary data out of your text file and converts it into Python's internal Unicode format. Similarly, calling myfile.encode('utf-8') will return the UTF-8 encoded bytes, which can then be written to a file, network socket, etc.
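For example, a minimal round trip (assuming the file really is UTF-8):

# Decode the file's bytes into a Python string, then re-encode them.
with open("test.txt", "r", encoding="utf-8") as f:
    myfile = f.read()
raw = myfile.encode("utf-8")          # bytes, ready for a file or socket
assert raw.decode("utf-8") == myfile  # lossless round trip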
If, however, you do need to use a format similar to what you are currently using, first, I still recommend you specify an encoding when you call open() (I recommend UTF-8). Then you can consider these options:
Detect and omit non-ASCII characters. They will have an ordinal >= 128.
Mimic UTF-16 or UTF-32 and emit a fixed-width multi-byte representation for every character. For example, use "{0:032b}".format(asc11) and then parse the result in 32-character chunks (see the sketch after this list). It's memory- and storage-inefficient, but it will preserve multi-byte characters.
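A minimal sketch of that second option (assuming the file is UTF-8, and reusing the variable names from the question):

# Read the text with an explicit encoding.
with open("test.txt", "r", encoding="utf-8") as f:
    myfile = f.read()

# Pad every code point to a fixed 32 bits so the chunks stay aligned.
string_bin = ''.join("{0:032b}".format(ord(i)) for i in myfile)

# Rebuild the text from fixed-width 32-bit chunks.
mydata = ''.join(chr(int(string_bin[i:i+32], 2)) for i in range(0, len(string_bin), 32))
print(mydata == myfile)  # True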
Regardless, I highly recommend reading the Dive Into Python 3 chapter about strings.

How to read a UTF-8 encoded list of string tokens into a vector?

I have a UTF-8 encoded text file with one token per line. I would like to read it into a vector. This is R version 3.0.1 on MS Windows. I understand that the default encoding is UTF-8, right?
I am looking for a code snippet like the ones on
http://www.mayin.org/ajayshah/KB/R/html/r4.html
from 'R by example'
http://www.mayin.org/ajayshah/KB/R/index.html
However they do not have a UTF-8 example, only ASCII.
You can either read it in with read.table() and then extract the column as a vector, or with scan().
vect <- scan(file="path/to/file1.txt", what=character(0) )
You would not need to use UTF-8 as the encoding, since you know that it is the default, but there is the option of doing so:
vect <- scan(file="path/to/file1.txt", what=character(0), encoding="UTF-8" )
The NEWS file for R 3.0.0 said:
" o readLines() and scan() (and hence read.table()) in a UTF-8 locale now discard a UTF-8 byte-order-mark (BOM). Such BOMs are allowed but not recommended by the Unicode Standard: however Microsoft applications can produce them and so they are sometimes found on websites.
The encoding name "UTF-8-BOM" for a connection will ensure that a UTF-8 BOM is discarded. "
So perhaps the need for the encoding argument indicates either that you were in a non-UTF-8 locale and didn't tell us, or that you were using an outdated R version?
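If your file does start with a BOM (as files saved by Microsoft tools often do), a sketch along these lines should discard it (assuming R 3.0.0 or later):

# "UTF-8-BOM" tells R to strip a leading byte-order mark while reading.
vect <- scan(file="path/to/file1.txt", what=character(0), fileEncoding="UTF-8-BOM")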

How to convert a string to a byte array encoded with a given charset in Go?

In Java, we can use the String method byte[] getBytes(Charset charset).
This method Encodes a String into a sequence of bytes using the given charset, storing the result into a new byte array.
But how do I do this in Go?
Is there any similar way to do this in Go?
Please let me know.
The standard Go library only supports Unicode (UTF-8, UTF-16, UTF-32) and ASCII encoding. ASCII is a subset of UTF-8.
The go-charset package (see the link below) supports conversion to and from UTF-8, and it also links to the GNU iconv library.
See also field CharsetReader in encoding/xml.Decoder.
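For illustration, a small sketch using the golang.org/x/text/encoding packages (a maintained alternative to go-charset; the import path and charset choice are my own example, not part of the answers here):

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Roughly the equivalent of Java's getBytes(StandardCharsets.ISO_8859_1).
	b, err := charmap.ISO8859_1.NewEncoder().Bytes([]byte("héllo"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b) // 68 e9 6c 6c 6f
}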
I believe here is an answer: https://stackoverflow.com/a/6933412/1315563
There is no way to do it without writing the conversion yourself or
using a third-party package. You could try using this:
http://code.google.com/p/go-charset
