Encode to Unicode UTF-8 not working - node.js

My Code -
var utf8 = require('utf8');
var y = utf8.encode('एस एम एस गपशप');
console.log(y);
Input -
एस एम एस गपशप
Expected output - \xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x8F\xE0\xA4\xAE\x20\xE0\xA4\x8F\xE0\xA4\xB8\x20\xE0\xA4\x97\xE0\xA4\xAA\xE0\xA4\xB6\xE0\xA4\xAA
Example Encoding using utf8.js
Output -
à¤à¤¸ à¤à¤® à¤à¤¸ à¤à¤ªà¤¶à¤ª
What am I doing wrong? Please help!

That code appears to be working. That output looks like UTF-8 bytes interpreted as some 8-bit character set, most likely ISO-8859-1, which is easily recognisable by the repeating patterns.
That example output is just how you would represent that string in source code.
Try this:
var utf8 = require('utf8');
var y = utf8.encode('एस');
console.log(y);
console.log('\xE0\xA4\x8F\xE0\xA4\xB8');
You will probably see the same output twice.
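To make the point above concrete, here is a minimal sketch that reproduces the effect without the utf8 package. It assumes Node's built-in Buffer and its 'latin1' (ISO-8859-1) encoding as a stand-in for what utf8.encode returns: one character per UTF-8 byte.

```javascript
const original = 'एस';

// Take the UTF-8 bytes of the string and read them back as ISO-8859-1,
// which maps every byte to the character with the same code point.
const byteString = Buffer.from(original, 'utf8').toString('latin1');

// This is the same string as the escaped literal from the question.
console.log(byteString === '\xE0\xA4\x8F\xE0\xA4\xB8');

// Reversing the mistake: write the characters back out as single bytes
// and decode those bytes as UTF-8.
const restored = Buffer.from(byteString, 'latin1').toString('utf8');
console.log(restored === original);
```

If both lines print true, the "garbled" output is just the correct UTF-8 bytes shown one byte per character.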
You can easily write some code to get those hexadecimal forms back using a lookup table and the charCodeAt function, but it is a rather unusual way to represent a string in JavaScript. JSON, for example, uses either the literal characters or '\uXXXX' escapes.
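As a rough sketch of that idea, the hypothetical helper below (toHexEscapes is an illustrative name, not part of any library) prints each UTF-8 byte as a \xNN escape. It swaps the lookup table and charCodeAt for Node's built-in Buffer, which exposes the bytes directly.

```javascript
// Hypothetical helper: render every UTF-8 byte of a string as a \xNN escape.
function toHexEscapes(text) {
  // Buffer.from(text, 'utf8') yields the raw UTF-8 bytes of the string.
  const bytes = Buffer.from(text, 'utf8');
  let out = '';
  for (const byte of bytes) {
    // Two uppercase hex digits per byte, zero-padded.
    out += '\\x' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

console.log(toHexEscapes('एस')); // \xE0\xA4\x8F\xE0\xA4\xB8
```

The same function applied to the full input from the question should give the escaped form the asker expected.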

Related

How to get packet.tcp.payload and packet.http.data as string?

The return values for these attributes are in hex format, separated by ':'.
E.g. 70:79:f6:2e: or something like this.
When I try to decode it to a plain (human-readable) string, it doesn't work. What encoding is being used? I tried various methods such as codecs.decode(), binascii.unhexlify() and bytes.fromhex(), as well as different encodings (ASCII and UTF-8). Nothing worked; any help is appreciated. I am using Python 3.6.
Thanks for your question! I believe you're wanting to read the payload in chunks of two hex places. The functions you tried are not able to parse the : delimiter out-of-the-box. Something like splitting the string by the : delimiter, converting their values to human-readable characters, and joining the "list" to a string should do the trick.
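hex_string = '70:79:f6:2e'
# Split on the ':' delimiter to get one two-digit hex value per byte.
hex_split = hex_string.split(':')
# Convert each hex value to the character with that code point.
hex_as_chars = map(lambda pair: chr(int(pair, 16)), hex_split)
human_readable = ''.join(hex_as_chars)
print(human_readable)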
Is this what you have in mind?

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which gets several parameters with POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
I can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n are processed correctly.
Is there any way to tell RequestParser which string encoding to use?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure that RequestParser() tries to be too smart and can't understand the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to @lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter comes as UTF-8. So when it sees a non-Unicode input field (among other Unicode fields!), it tries to load it as Unicode and fails. As a result, all characters are U+FFFD (replacement character).
So, to access that non-Unicode field, I did the following trick.
First, I load raw data using get_data(), decode it using cp1251 and parse with a simple regexp.
import re
from flask import request

# Read the raw request body and decode it as cp1251 instead of UTF-8.
raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name=\"text\"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.

Is it possible to load ansi encoded string using nodejs

I have a large quantity of HTML files (around 2k).
These HTML files are the result of a conversion from Word documents.
The files have some Hebrew text inside HTML tags. I can see the text perfectly using the VS Code or Notepad++ editors.
My goal is to loop through the folder and insert the contents of the files into a DB.
Since I have a little knowledge of Node.js, I decided to build the "looping" using Node.
Here is where I have got to so far:
fs.readdir('./myFolder', function (err, files) {
  total = files.length;
  let fileArr = []
  for(var x=0, l = files.length; x<l; x++) {
    const content = fs.readFileSync(`./myFolder/${files[x]}`, 'utf8');
    let title = content.match(/<title>(.*?)<\/title>/g).pop()
    fileArr.push({id:files[x] , title})
  }
});
The problem is that although the text is displayed correctly inside editors, when debugging I can see that the "title" variable gets strings that consist of question marks.
I guess the problem is with the file encoding. Am I right?
If so, is there a way to decode the string?
P.S. My OS is Windows 10.
Thanks
There are a couple of possibilities here. Your input files may be in a multibyte encoding (such as UTF-8 or UTF-16), with your debugger simply failing to show the correct characters because of font restrictions.
I would try writing the title variable to some test file like so:
fs.writeFileSync(`title-test-${x}.txt`, title, "utf8");
And see if the title looks correct in your text editor.
It is also possible that the files are encoded in an extended ASCII encoding such as Windows-1255 or ISO 8859-8. If so, fs.readFileSync will not decode them correctly, since it does not support these encodings (see node.js encoding list).
If the files are encoded using a single-byte extended ascii encoding, it should be possible to convert to a more portable encoding (such as utf8).
I'd recommend the iconv-lite module for this purpose, you can do a lot with it!
For example, to convert from a Windows 1255 file to utf8 you could try:
const iconv = require("iconv-lite");
const fs = require("fs");
// Convert from an encoded buffer to JavaScript string.
const fileData = iconv.decode(fs.readFileSync("./hebrew-win1255.txt"), "win1255");
// Convert from JavaScript string to a buffer.
const outputBuffer = iconv.encode(fileData, "utf8");
// Write output file..
fs.writeFileSync("./hebrew-utf8-output.txt", outputBuffer);

Node Buffers, from utf8 to binary

I'm receiving data as utf8 from a source, and this data was originally in binary form (it was a Buffer). I have to convert this data back to a Buffer, and I'm having a hard time figuring out how to do this.
Here's a small sample that shows my problem:
var hexString = 'e61b08020304e61c09020304e61d0a020304e61e65';
var buffer1 = Buffer.from(hexString, 'hex');
var str = buffer1.toString('utf8');
var buffer2 = Buffer.from(str, 'utf8');
console.log('original content:', hexString);
console.log('buffer1 contains:', buffer1.toString('hex'));
console.log('buffer2 contains:', buffer2.toString('hex'));
prints
original content: e61b08020304e61c09020304e61d0a020304e61e65
buffer1 contains: e61b08020304e61c09020304e61d0a020304e61e65
buffer2 contains: efbfbd1b08020304efbfbd1c09020304efbfbd1d0a020304efbfbd1e65
Here, I would like buffer2 to be the exact same thing as buffer1.
How can I convert a utf8 string back to its original binary Buffer?
You cannot expect binary data converted to utf8 and back again to be the same as the original binary data because of the way utf8 works (especially when invalid utf8 characters are replaced with \ufffd).
You have to use another format that preserves the data exactly. This could be 'hex', 'base64', 'binary', or some other binary-safe format provided by a third-party module. Better still, keep it as a Buffer if you can.
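A minimal sketch of that advice, using the bytes from the question: the hypothetical roundTrips helper converts a Buffer to a string in a given encoding and back, then checks whether the bytes survived. It uses 'latin1', which in Node is the current name for the legacy 'binary' encoding.

```javascript
// Hypothetical helper: does buffer -> string -> buffer preserve the bytes?
function roundTrips(buffer, encoding) {
  const asText = buffer.toString(encoding);
  const restored = Buffer.from(asText, encoding);
  return restored.equals(buffer);
}

const original = Buffer.from('e61b08020304e61c09020304e61d0a020304e61e65', 'hex');

// Binary-safe encodings keep every byte; utf8 does not for this input.
for (const encoding of ['hex', 'base64', 'latin1', 'utf8']) {
  console.log(encoding, roundTrips(original, encoding));
}
```

For this input, only the 'utf8' line should report false, which matches the mismatch between buffer1 and buffer2 in the question.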
The accepted answer is misleading. Your main problem is that you're dealing with invalid UTF-8. If the data were valid, the conversion would not cause issues.
Specifically, take the first two bytes: e61b.
In binary, that's: 11100110, 00011011. This is invalid. Take a look at the byte-layout diagram on the UTF-8 Wikipedia page.
It says that if a byte starts with 1110, it must be followed by two bytes that each start with 10. This is not the case here.
Whenever JavaScript hits an invalid sequence, it replaces it with �, the Unicode replacement character. The code point for that is U+FFFD, and the UTF-8 encoding of that code point is efbfbd. Notice that this shows up in your output a few times.
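The replacement can be observed directly. This short sketch decodes the first two bytes from the question (e61b) and, for contrast, a valid three-byte sequence (e0a48f, chosen here purely as an illustration), using only Node's built-in Buffer.

```javascript
// e6 announces a three-byte sequence, but 1b is not a continuation byte.
const invalid = Buffer.from('e61b', 'hex');
const decoded = invalid.toString('utf8');

// The lone e6 became U+FFFD; the 1b byte survived as-is.
console.log(decoded.codePointAt(0).toString(16));          // fffd
console.log(Buffer.from(decoded, 'utf8').toString('hex')); // efbfbd1b

// A valid sequence: 1110xxxx followed by two 10xxxxxx bytes.
const valid = Buffer.from('e0a48f', 'hex');
const validText = valid.toString('utf8');
console.log(Buffer.from(validText, 'utf8').equals(valid)); // true
```

The efbfbd1b result is the start of the buffer2 output shown in the question.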

Convert Binary data from file to readable string

I have binary data stored in a file. I am doing this:
byte[] fileBytes = File.ReadAllBytes(@"c:\carlist.dat");
string ascii = Encoding.ASCII.GetString(fileBytes);
This gives me the following result, with a lot of invalid characters. What am I doing wrong?
?D{F ?x#??4????? NBR-OF-CARSNUMBER-OF-CARS!"#??? NBR-OF-CARS$%??1y0#123?G??#$ NBR-OF-CARS%45??1y#  NUMBER-OF-CARSd?
Hmm... it seems the file was saved from a byte buffer in which some numeric data was written after NBR-OF-CARS. If you have access to the code that saves the file, could you check whether numbers are written there and, if so, whether the code converts them to strings before writing the values into the binary stream?

Resources