What encodings does Buffer.toString() support?

I'm writing an app in node.js, and see that I can do things like this:
var buf = Buffer.from("Hello World!")
console.log(buf.toString("hex"))
console.log(buf.toString("utf8"))
And I know of 'ascii' as an encoding type (it'll take an ASCII code, such as 112, and turn it into a 'p'), but what other types of encoding can I do?
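For example:
Buffer.from([112]).toString('ascii') // 'p'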

The official Node.js documentation for Buffer is the best place to check for something like this. At the time of writing, Buffer supports these encodings: 'ascii', 'utf8', 'utf16le'/'ucs2', 'base64', 'base64url', 'latin1'/'binary', and 'hex'.
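As a quick sketch of a few of those applied to the same buffer (expected output noted in the comments):
const buf = Buffer.from("Hello World!");
console.log(buf.toString("hex"));    // 48656c6c6f20576f726c6421
console.log(buf.toString("base64")); // SGVsbG8gV29ybGQh
console.log(buf.toString("latin1")); // Hello World! (one character per byte)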

As is always the way, I spent a while Googling but found nothing until after I posted the question:
http://www.w3resource.com/node.js/nodejs-buffer.php has the answer. You can use the following types in .toString() on a buffer:
ascii
utf8
utf16le
ucs2 (alias of utf16le)
base64
binary
hex

It supports ascii, utf-8, ucs2, base64, and binary.

Related

Node Buffer Alias - binary is latin1?

According to this page:
'binary' - Alias for 'latin1'.
However, arbitrary binary data is not representable in latin1, as certain codepoints are missing from it. So a developer like me who wants to use NodeJS buffers for binary data (an extremely common use-case) would expect to use 'binary' as the encoding. There doesn't appear to be any documentation that properly explains how to handle binary data! I'm trying to make sense of this.
So my question is: Why was latin1 chosen as an alias for binary?
People have mentioned that using null as the encoding will work for binary data. So a followup question: Why don't null and 'binary' do the same thing?
The Node documentation's definition of 'latin1', on the line above the definition of 'binary' cited in the question, is not ISO 8859-1. It is:
'latin1' - A way of encoding the Buffer into a one-byte encoded string
(as defined by the IANA in RFC1345, page 63, to be the Latin-1
supplement block and C0/C1 control codes).
The 'latin1' charset specified in RFC 1345 defines mappings for all 256 codepoints. It does not have the holes that exist at 0x00-0x1f and 0x7f-0x9f in the ISO 8859-1 mapping.
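A short sketch that confirms the full 256-codepoint mapping in Node:
const all = Buffer.from([...Array(256).keys()]); // bytes 0x00..0xff
const roundTripped = Buffer.from(all.toString('latin1'), 'latin1');
console.log(all.equals(roundTripped)); // true: no holes, every byte survives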
Why don't null and 'binary' do the same thing?
Node does not have a null encoding. If you call Buffer.from('foo', null) then you get the same result as if you had called Buffer.from('foo'). That is, the default encoding is applied. The default encoding is 'utf8', and clearly that can produce results that differ from the 'binary' encoding.
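To illustrate the difference just described (a sketch; the byte values are what UTF-8 and the one-byte encoding produce for 'é'):
console.log(Buffer.from('é', null));     // <Buffer c3 a9> (default 'utf8' applied)
console.log(Buffer.from('é'));           // <Buffer c3 a9> (same result)
console.log(Buffer.from('é', 'binary')); // <Buffer e9>    (low byte of the codepoint)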

In Python 3, how can I convert ascii to string, *without encoding/decoding*

Python 3.6
I converted a string from utf8 to this:
b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au'
I now want that chunk of ASCII back in string form, so there is no longer the little b for bytes at the beginning.
BUT I don't want it converted back to UTF-8; I want the same sequence of characters that you see above in my Python string.
How can I do so? All I can find are ways of converting bytes to string along with encoding or decoding.
The (wrong) answer is quite simple:
chr(asciiCode)
In your special case:
myString = ""
for char in b'\xe6\x88\x91\xe6\xb2\xa1\xe6\x9c\x89\xe7\x94\xb5#xn--ssdcsrs-2e1xt16k.com.au':
    myString += chr(char)
print(myString)
gives:
æ没æçµ#xn--ssdcsrs-2e1xt16k.com.au
Maybe you are also interested in the right answer? It will probably not please you, because it says you always have to deal with encoding/decoding ... because myString is now both UTF-8 and ASCII at the same time (exactly as it already was before you "converted" it to ASCII).
Notice that how myString shows up when you print it will depend on the implicit encoding/decoding used by print.
In other words ...
there is NO WAY to avoid encoding/decoding
but there is a way of doing it in a non-explicit way.
I suppose that reading my answer provided HERE: Converting UTF-8 (in literal) to Umlaute will help you a great deal in understanding the whole encoding/decoding thing.
What you have there is not ASCII, as it contains, for instance, the byte \xe6, which is higher than 127. It's still UTF-8.
The representation of the string (with the 'b' at the start, then a ', then a '\', ...) is ASCII. You get it with repr(yourstring). But the content of the string that you're printing is UTF-8.
But I don't think you need to turn that back into a UTF-8 string, though that may depend on the rest of your code.

How to check if a string is plaintext or base64 format in Node.js

I want to check whether the string given to my function is plain text or base64 format. I am able to encode plain text to base64 format and reverse it.
But I could not figure out any way to validate that the string is already in base64 format. Can anyone suggest the correct way of doing this in Node.js? Is there an API for this already available in Node.js?
Valid base64 strings are a subset of all plain-text strings. Assuming we have a character string, the question is whether it belongs to that subset. One way is what Basit Anwer suggests. Those libraries require installing libicu, though. A more portable way is to use the built-in Buffer:
Buffer.from(str, 'base64')
Unfortunately, this decoding function will not complain about invalid input; it simply ignores non-base64 characters. So it alone will not help. But you can try encoding the result back to base64 and comparing it with the original string:
Buffer.from(str, 'base64').toString('base64') === str
This check will tell you whether str is canonical base64. (Note that whitespace or non-standard padding will make the comparison fail, even though Buffer.from would still have decoded the string.)
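Wrapped up as a small sketch (the helper name looksLikeBase64 is purely illustrative):
function looksLikeBase64(str) {
  // Decode and re-encode; only canonical base64 survives the round trip.
  return Buffer.from(str, 'base64').toString('base64') === str;
}
console.log(looksLikeBase64('SGVsbG8=')); // true ('Hello')
console.log(looksLikeBase64('Hello!'));   // false
console.log(looksLikeBase64('SGVs bG8=')); // false: decodes, but is not canonical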
Encoding works at the byte level.
If you're dealing in strings, all you can do is guess, or keep metadata alongside your string to identify the encoding.
But you can check these libraries out:
https://www.npmjs.com/package/detect-encoding
https://github.com/mooz/node-icu-charset-detector

NodeJS Extended ASCII Support

I'm trying to read a file that contains extended ASCII characters like 'á' or 'è', but NodeJS doesn't seem to recognize them.
I tried reading into:
Buffer
String
Tried different encoding types:
ascii
base64
utf8
as referenced on http://nodejs.org/api/fs.html
Is there a way to make this work?
I use the binary encoding to read such files. For example:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like 'á' or 'è',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
When I save the above into foo.js, the following output is shown:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like '⟡ 漀爀 ✀',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
The weirdness above is because I ran it in an Emacs buffer.
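As an aside, since 'binary' is an alias for 'latin1' (see the question above), a string read this way can be converted back into raw bytes losslessly; a minimal sketch:
var fs = require('fs');
fs.readFile("foo.js", "binary", function (err, file) {
  if (err) throw err;
  var raw = Buffer.from(file, 'binary'); // byte-for-byte copy of the file
  console.log(raw.length);
});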
The file I was trying to read was in ANSI encoding. When I tried to read it using the 'fs' module's functions, they couldn't convert the extended ASCII characters.
I just figured out that Notepad++ is able to actually convert from some formats to UTF-8, instead of just flagging the file as UTF-8 encoded.
After converting it, I was able to read it just fine and apply all the operations I needed to the content.
Thank you for your answers!
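For what it's worth, if an "ANSI" file only contains characters from the Latin-1 range (as 'á', 0xE1, and 'è', 0xE8, are), a sketch like the following may read it directly, assuming a reasonably recent Node that accepts 'latin1' as an fs encoding (the file name is hypothetical):
var fs = require('fs');
// 'latin1' maps each byte 0x00-0xFF to the matching U+0000-U+00FF codepoint.
fs.readFile("accented.txt", "latin1", function (err, text) {
  if (err) throw err;
  console.log(text); // 'á' and 'è' come through correctly
});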
I realize this is an old post, but I found it in my personal search for a solution to this particular problem.
I have written a module that provides Extended ASCII decoding and encoding support to/from Node Buffers. You can see the source code here. It is a part of my implementation of Buffer in the browser for an in-browser filesystem I've created called BrowserFS, but it can be used 100% independently of any of that within NodeJS (or Browserify) as it has no dependencies.
Just add bfs-buffer to your dependencies and do the following:
var ExtendedASCII = require('bfs-buffer/js/extended_ascii').default;
// Decodes the input buffer as an extended ASCII string.
ExtendedASCII.byte2str(aBufferWithText);
// Encodes the input string as an extended ASCII string.
ExtendedASCII.str2byte("Some text");
Alternatively, just adapt the module into your project if you don't want to add an extra dependency. It's MIT licensed.
I hope this helps anyone in the future who finds this page in their searches like I did. :)

How to convert a string to a byte array which is compiled with a given charset in Go?

In Java, we can use the String method byte[] getBytes(Charset charset).
This method encodes a String into a sequence of bytes using the given charset, storing the result in a new byte array.
But how do you do this in Go?
Is there any similar way in Go to do this?
Please let me know.
The standard Go library only supports Unicode (UTF-8, UTF-16, UTF-32) and ASCII encoding. ASCII is a subset of UTF-8.
The go-charset package (found here) supports conversion to and from UTF-8, and it also links to the GNU iconv library.
See also the CharsetReader field in encoding/xml.Decoder.
I believe there is an answer here: https://stackoverflow.com/a/6933412/1315563
There is no way to do it without writing the conversion yourself or
using a third-party package. You could try using this:
http://code.google.com/p/go-charset
