Iconv encoding conversion in Node - node.js

I'm using Iconv in Node.js to convert scraped HTML (fetched via request with binary encoding) from SHIFT_JIS to UTF-8:
request({url: url, encoding: 'binary'}, function (error, res, html) {
  var iconv = new Iconv('SHIFT_JIS', 'UTF-8//TRANSLIT//IGNORE');
  var converted = iconv.convert(Buffer.from(html, 'binary')).toString('utf8');
});
The conversion I'm getting back looks like:
é«SnÌ\r\núêXj[J[ÍAVvÉÈ調ȫ³É\r\nå«ÈCpNgð^
While the pre-conversion looks like: ���[�J�b�g����X�j�[�J�[
I tried using encoding:null in the request, but that didn't work either.

The conversion actually works as posted above. The issue was in how the final response was handled outside the request callback.
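A minimal sketch of why the 'binary' round trip in the question is safe, using only Node's built-in Buffer. The four sample bytes are an assumption chosen for illustration (they should correspond to the first two characters of "スニーカー" in Shift_JIS); nothing here comes from the original page.

```javascript
// Sample Shift_JIS bytes (illustrative assumption, not taken from the scraped page).
const sjisBytes = Buffer.from([0x83, 0x58, 0x83, 0x6a]);

// Decoding as 'binary' (latin1) maps each byte to one character, so nothing is lost.
const binaryString = sjisBytes.toString('binary');
const recovered = Buffer.from(binaryString, 'binary');
console.log(recovered.equals(sjisBytes)); // true

// Decoding as UTF-8 replaces the invalid lead bytes with U+FFFD, so the
// original bytes cannot be recovered afterwards.
const utf8String = sjisBytes.toString('utf8');
const damaged = Buffer.from(utf8String, 'utf8');
console.log(damaged.equals(sjisBytes)); // false
```

This is also why the "pre-conversion" text shows replacement characters: printing undecoded Shift_JIS bytes as if they were UTF-8 is expected to look broken, even when the bytes themselves are intact.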

Related

NodeJS and Iconv - "ISO-8859-1" to "UTF-8"

I created a Node.js application that fetches data from an external API server. That server provides its data only as 'Content-Type: text/plain;charset=ISO-8859-1'; I got that information from the response headers.
The problem is that special characters like 'ä', 'ö' or 'ü' are shown as �.
I tried to convert them to UTF-8 with Iconv, but then I got these things '�'...
My question is: what am I doing wrong?
For testing I use Postman. These are the steps I follow to test everything:
1. Use Postman to trigger my Node.js application
2. The app requests data from the API server
3. The API server sends data to the Node.js app
4. My app prints out the raw response data of the API, which already has those strange characters �
5. The app then tries to convert them to UTF-8 with Iconv, where it now shows these '�' characters
Another strange thing:
When I connect Postman directly to the API server, the special characters are shown as they should be, without problems. Therefore I guess my application causes the problem, but I cannot see where or why...
// Javascript Code:
try {
  const response = await axios.get(
    URL,
    {
      params: params,
      headers: headers
    }
  );
  var iconv = new Iconv('ISO-8859-1', 'UTF-8');
  var converted = await iconv.convert(response.data);
  return converted.toString('UTF-8');
} catch (error) {
  throw new Error(error);
}
After some deeper research I came up with the solution to my problem.
The cause of all the trouble seems to lie in axios's post-processing or something similar: the step right after the data is received and converted to text, shortly before the response is handed to my Node.js application.
What I did was set the "responseType" of the axios GET request to "arraybuffer". The axios call had to be adjusted like so:
var resArBuffer = await axios.get(
  URL,
  {
    responseType: 'arraybuffer',
    params: params,
    headers: headers
  }
);
In Node.js, axios delivers the data as a Buffer when responseType is 'arraybuffer', and Buffer has a toString() method that decodes the bytes using a given encoding:
var response = resArBuffer.data.toString("latin1");
Another thing worth mentioning is that I used "latin1" instead of "ISO-8859-1". Node's Buffer API does not accept "ISO-8859-1" as an encoding name; "latin1" is its name for that encoding. Some sources even recommended "cp1252" instead, but "latin1" worked for me here.
Unfortunately that was not enough yet, since I needed the text in UTF-8. Calling toString('utf-8') directly on the response data was the wrong way too, since it would still print the "�" symbols. The workaround was simple: I used Buffer.from(...) to encode the "latin1"-decoded text as UTF-8 and read it back:
var text = Buffer.from(response, 'utf-8').toString();
Now I get the UTF-8 text I needed. I hope this thread helps someone else out there, since this information was spread across many different threads for me.
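The decoding step can be tried in isolation with a small sketch. The `data` buffer below is a hypothetical stand-in for `resArBuffer.data` and holds the ISO-8859-1 bytes of "äöü"; no network request is made. As far as I can tell, the string returned by toString('latin1') is already an ordinary JavaScript string, and UTF-8 only comes into play when that string is turned back into bytes.

```javascript
// Hypothetical stand-in for `resArBuffer.data`: "äöü" encoded as ISO-8859-1.
const data = Buffer.from([0xe4, 0xf6, 0xfc]);

// Decoding the bytes as UTF-8 fails: none of them form valid UTF-8 sequences,
// so the result contains U+FFFD replacement characters.
const wrong = data.toString('utf8');
console.log(wrong.includes('\ufffd')); // true

// Decoding with the encoding the server actually used gives the real text.
const right = data.toString('latin1');
console.log(right); // äöü

// Encoding that text as UTF-8 produces two bytes per umlaut.
const utf8Bytes = Buffer.from(right, 'utf8');
console.log(utf8Bytes.toString('hex')); // c3a4c3b6c3bc
```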

nodeJS: convert response.body in utf-8 (from windows-1251 encoding)

I'm trying to convert an HTML body encoded in windows-1251 into UTF-8, but I still get garbled characters in the HTML.
The text is Russian (Cyrillic), but I can't get it to display properly. I get ??????? ?? ???
const GOT = require('got') // https://www.npmjs.com/package/got
const WIN1251 = require('windows-1251') // https://www.npmjs.com/package/windows-1251
async function query() {
var body = Buffer.from(await GOT('https://example.net/', {resolveBodyOnly: true}), 'binary')
var html = WIN1251.decode(body.toString('utf8'))
console.log(html)
}
query()
You’re doing a lot of silly encoding back-and-forth here. And the ‘backs’ don’t even match the ‘forths’.
First, you use the got library to download a webpage; by default, got will dutifully decode response texts as UTF-8. You stuff the returned Unicode string into a Buffer with the binary encoding, which throws away the higher octet of each UTF-16 code unit of the Unicode string. Then you use .toString('utf8'), which interprets this mutilated string as UTF-8 (in actuality, it is most likely not valid UTF-8 at all). Then you pass the ‘UTF-8’ string to the windows-1251 package, to decode it as a ‘code page 1251’ string. Nothing good can possibly come from all this confusion.
The windows-1251 package you want to use takes so-called ‘binary’ (pseudo-Latin-1) strings as input. What you should do instead is take the binary response, interpret it as a Latin-1/‘binary’ string and then pass it to the windows-1251 library for decoding.
In other words, use this:
const GOT = require('got');
const WIN1251 = require('windows-1251');
async function query() {
  const body = await GOT('https://example.net/', {
    resolveBodyOnly: true,
    responseType: 'buffer'
  });
  const html = WIN1251.decode(body.toString('binary'));
  console.log(html);
}
query();
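The claim about the 'binary' encoding discarding the higher octet can be checked with Node's built-in Buffer alone. This is only an illustrative sketch: the sample character and the three sample bytes (which should spell "При" in windows-1251) are assumptions, and neither got nor the windows-1251 package is involved.

```javascript
// "П" is U+041F. Encoding it as 'binary' keeps only the low byte, 0x1F,
// so the character is destroyed.
const decodedText = '\u041f';
const lossy = Buffer.from(decodedText, 'binary');
console.log(lossy.toString('hex')); // 1f

// Going the other way is lossless: raw bytes -> 'binary' string -> same bytes.
// This is the form of input the answer passes on for decoding.
const rawBytes = Buffer.from([0xcf, 0xf0, 0xe8]); // sample windows-1251 bytes
const binaryString = rawBytes.toString('binary');
const roundTrip = Buffer.from(binaryString, 'binary');
console.log(roundTrip.equals(rawBytes)); // true
```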

iconv-lite not decoding everything properly, even though I'm using proper decoding

I'm using this piece of code to download a webpage (using request library) and decode everything (using iconv-lite library). The loader function is for finding some elements from the body of the website, then returning them as a JavaScript object.
request.get({url: url, encoding: null}, function(error, response, body) {
  // if webpage exists, process it, otherwise throw 'not found' error
  if (response.statusCode === 200) {
    body = iconv.decode(body, "iso-8859-1");
    const $ = cheerio.load(body);
    async function show() {
      var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
      res.send(JSON.stringify(data));
    }
    show();
  } else {
    res.status(404);
    res.send(JSON.stringify({"error":"No content for this date."}));
  }
});
The pages are declared as ISO-8859-1, and the content looks normal in the browser, with no bad characters. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now that I'm using the library as in the code above, most of the characters look good, but some, e.g. š, show up as an empty box, even though they're displayed without any problems on the website.
I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still present there. Maybe it's a problem with Express? Is there a way to fix that?
EDIT:
I copied the empty box character into Google, and it changed to š. Maybe that's important.
Also, I tried to change the output of Express using res.charset, but that didn't help.
I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check whether the page I'm scraping really is ISO-8859-1 encoded. It turned out that it is Windows-1252 encoded. I changed the encoding in my API (var encoding = 'windows-1252') and it works well now.
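The empty box is consistent with how the two encodings treat bytes 0x80 to 0x9F: ISO-8859-1 assigns them to invisible control characters, while Windows-1252 assigns most of them to printable characters such as š. A minimal sketch using only Node's Buffer follows; `CP1252_EXCERPT` is a hypothetical three-entry excerpt written out by hand for illustration, not a complete table and not part of any library.

```javascript
// Byte 0x9A is one of the positions where the two encodings disagree.
const byte = Buffer.from([0x9a]);

// ISO-8859-1 maps every byte to the code point with the same value,
// so 0x9A becomes the control character U+009A (often drawn as an empty box).
const asLatin1 = byte.toString('latin1');
console.log(asLatin1.codePointAt(0).toString(16)); // 9a

// Hand-written excerpt of the Windows-1252 table (illustration only).
const CP1252_EXCERPT = {
  0x80: '\u20ac', // euro sign
  0x8a: '\u0160', // Š
  0x9a: '\u0161'  // š
};
const asCp1252 = CP1252_EXCERPT[byte[0]];
console.log(asCp1252); // š
```

In real code the full table comes from the decoder, for example iconv.decode(body, 'windows-1252') as in the fix described above.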

Node.js encoding UTF-8 issue

I have been facing an encoding/decoding issue with the Node.js Express framework.
Brief background: I store PDF files in a MySQL database in a LONGBLOB column with the latin1 charset. On the server side, I need to send the binary data UTF-8 encoded, as my client only knows how to decode UTF-8.
I have tried all the solutions I could find on Google.
For example:
Buffer.from(mySqlData).toString('utf8');
I also tried the "UTF8" module and its utf8.encode(mySqlData) function, but it is not working.
I also tried base64 encoding, with base64 decoding on the client. That works just fine, but I need UTF-8 encoding, and base64 also increases the size.
Please help.
OK, your problem is the conversion from Latin-1 to UTF-8. If you just call buffer.toString('utf-8'), the Latin-1 encoded characters are decoded incorrectly.
To convert another charset to UTF-8, the simple way is to use iconv and icu-charset-detector. With them, you can convert to UTF-8 from almost any charset (with a few exceptions).
This is an example of conversion using streams. The resulting stream is encoded as UTF-8:
var charsetDetector = require("node-icu-charset-detector"),
    Iconv = require('iconv').Iconv,
    Stream = require('stream');
function convertToUtf8(source, callback) {
  var iconv,
      charsetTestStream = new Stream.PassThrough(),
      newResStream = new Stream.PassThrough();
  // Feed the same source into both the detector and the stream to be converted
  source.pipe(charsetTestStream);
  source.pipe(newResStream);
  charsetDetector.detectCharsetStream(charsetTestStream, function (charset) {
    if (!iconv && charset && !/utf-*8/i.test(charset.toString())) {
      try {
        iconv = new Iconv(charset, 'utf-8');
        console.log('Converting from charset %s to utf-8', charset);
        iconv.on('error', function (err) {
          callback(err);
        });
        var convertStream = newResStream.pipe(iconv);
        callback(null, convertStream);
      } catch(err) {
        callback(err);
      }
      return;
    }
    // Already UTF-8 (or charset unknown): pass the data through unchanged
    callback(null, newResStream);
  });
}
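When the source charset is already known to be Latin-1, as in the question, the same streaming idea can be sketched with Node's built-in modules only. This is an alternative under that assumption, not the method from the answer above; `latin1ChunkToUtf8` and `createLatin1ToUtf8Stream` are hypothetical names. Because Latin-1 uses exactly one byte per character, chunks can be converted independently without worrying about characters split across chunk boundaries.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: decode one chunk as Latin-1, then re-encode it as UTF-8.
function latin1ChunkToUtf8(chunk) {
  return Buffer.from(chunk.toString('latin1'), 'utf8');
}

// Hypothetical factory: a Transform stream that applies the helper to each chunk.
function createLatin1ToUtf8Stream() {
  return new Transform({
    transform(chunk, _encoding, callback) {
      callback(null, latin1ChunkToUtf8(chunk));
    }
  });
}

// "ü" is 0xFC in Latin-1 and the two bytes C3 BC in UTF-8.
console.log(latin1ChunkToUtf8(Buffer.from([0xfc])).toString('hex')); // c3bc
```

A source stream could then be piped through it with source.pipe(createLatin1ToUtf8Stream()). Note that this treats the data as text; it is not meant for arbitrary binary content.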

Serving binary/buffer/base64 data from Nodejs

I'm having trouble serving binary data from Node. I worked on a Node module called node-speak which does TTS (text-to-speech) and returns a base64-encoded audio file.
So far I'm doing this to convert from base64 to Buffer/binary and then serve it:
// var src = Base64 data
var binAudio = Buffer.from(src.replace("data:audio/x-wav;",""), 'base64');
Now I'm trying to serve this audio from node with the headers like so:
res.writeHead(200, {
'Content-Type': 'audio/x-wav',
'Content-Length': binAudio.length
});
And serving it like so:
res.end(binAudio, "binary");
But it's not working at all. Is there something I haven't quite understood, or am I doing something wrong? This is not serving a valid audio/x-wav file.
Note: The Base64 data is valid; I can serve it like so and it works fine:
// assume proper headers sent and "src" = base64 data
res.end("<!DOCTYPE html><html><body><audio src=\"" + src + "\"/></body></html>");
So why can I not serve the binary file? What am I doing wrong?
Two things are wrong:
1. The header name is Content-Length, not Conetnt-Length.
2. res.end(binAudio, "binary"); is wrong. Use res.end(binAudio); instead. With "binary", it expects a string; binary is a deprecated string encoding in Node, so pass no encoding if you already have a Buffer.
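Put together, the fix might look like the sketch below. It assumes `src` is a data URI of the form data:audio/x-wav;base64,..., which the question implies but does not show in full. `dataUriToBuffer` and `sendAudio` are hypothetical helper names, and `res` is any object with the writeHead and end methods of an http.ServerResponse.

```javascript
// Hypothetical helper: drop everything up to and including the first comma
// (the "data:...;base64," prefix), then decode the rest as base64.
function dataUriToBuffer(src) {
  const comma = src.indexOf(',');
  const base64 = comma === -1 ? src : src.slice(comma + 1);
  return Buffer.from(base64, 'base64');
}

// Hypothetical handler body: send the decoded bytes as-is.
function sendAudio(res, src) {
  const binAudio = dataUriToBuffer(src);
  res.writeHead(200, {
    'Content-Type': 'audio/x-wav',
    'Content-Length': binAudio.length
  });
  // No encoding argument: the Buffer is written unchanged.
  res.end(binAudio);
}
```

Inside an http.createServer callback this would be called as sendAudio(res, src).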
