NodeJS and Iconv - "ISO-8859-1" to "UTF-8" - node.js

I created a NodeJS application which should get some data from an external API server. That server provides its data only as 'Content-Type: text/plain;charset=ISO-8859-1'; I got that information from the server's response headers.
Now the problem for me is that special characters like 'ä', 'ö' or 'ü' are shown as �.
I tried to convert them with Iconv to UTF-8, but then I got these things '�'...
My question is, what am I doing wrong?
For testing I use Postman. These are the steps I do to test everything:
Use Postman to trigger my NodeJS application
The App requests data from the API-Server
API-Server sends Data to NodeJS App
My App prints out the raw response-data of the API, which already has those strange characters �
The App then tries to convert them with Iconv to UTF-8, at which point it shows me these '�' characters
Another strange thing:
When I connect Postman directly to the API server, the special characters are displayed correctly without problems. Therefore I guess my application causes the problem, but I cannot see where or why...
// Javascript Code:
const Iconv = require('iconv').Iconv; // from the 'iconv' package

try {
    const response = await axios.get(
        URL,
        {
            params: params,
            headers: headers
        }
    );
    // attempt to convert the response body from ISO-8859-1 to UTF-8
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');
    var converted = iconv.convert(response.data); // convert() is synchronous
    return converted.toString('UTF-8');
} catch (error) {
    throw new Error(error);
}

So after some deeper research I came up with the solution to my problem.
The cause of all the trouble seems to lie in the post-processing axios does (or something similar): the step right after the data is received and converted to text, shortly before the response is handed to my NodeJS application.
What I did was define the "responseType" of the axios GET request as "arraybuffer". That required an adjustment to the axios call like so:
var resArBuffer = await axios.get(
    URL,
    {
        responseType: 'arraybuffer',
        params: params,
        headers: headers
    }
);
With that setting, response.data arrives as a Node Buffer, which provides its own toString() method to convert the raw bytes to a string with a chosen encoding:
var response = resArBuffer.data.toString("latin1");
Another thing worth mentioning is the fact that I used "latin1" instead of "ISO-8859-1". Some sources even recommended using "cp1252" instead, but "latin1" worked for me here (Node's Buffer API accepts "latin1" as its name for ISO-8859-1, but not "ISO-8859-1" itself).
Unfortunately that was not enough, since I needed the text in UTF-8 format. Using "toString('utf-8')" directly was the wrong way too, since it would still print the "�" symbols. The workaround was simple: I used "Buffer.from(...)" to convert the "latin1"-decoded text into a "utf-8" text:
var text = Buffer.from(response, 'utf-8').toString();
Now I get the desired UTF-8 converted text I needed. I hope this thread helps anyone else out there, since this information was spread across many different threads for me.
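Putting those pieces together, a minimal sketch of the full flow (with URL, params and headers defined as in the original question) might look like this:

try {
    // request the raw bytes instead of letting axios decode the body as text
    const resArBuffer = await axios.get(URL, {
        responseType: 'arraybuffer',
        params: params,
        headers: headers
    });
    // resArBuffer.data is a Node Buffer; decode it as latin1 (ISO-8859-1)
    const response = resArBuffer.data.toString('latin1');
    // re-encode as UTF-8, as described above
    const text = Buffer.from(response, 'utf-8').toString();
    return text;
} catch (error) {
    throw new Error(error);
}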

Related

Node-Fetch API and Blob to convert Windows-1252 characters to utf-8 [duplicate]

I'm trying to use Fetch to bring some data into the screen, however some of the characters are showing a weird � sign, which I believe has something to do with converting special chars.
When debugging on the server side or if I call the servlet on my browser, the problem doesn't happen, so I believe the issue is with my JavaScript. See the code below:
var myHeaders = new Headers();
myHeaders.append('Content-Type', 'text/plain; charset=UTF-8');
fetch('getrastreiojadlog?cod=10082551688295', myHeaders)
    .then(function (response) {
        return response.text();
    })
    .then(function (resp) {
        console.log(resp);
    });
I think it is probably some small detail, but I haven't managed to find out what is happening, so any tips are welcome.
Thx
The response's text() function always decodes the payload as utf-8.
If you want the text in other charset you may use TextDecoder to convert the response buffer (NOT the text) into a decoded text with chosen charset.
Using your example it should be:
var myHeaders = new Headers();
myHeaders.append('Content-Type', 'text/plain; charset=UTF-8');
fetch('getrastreiojadlog?cod=10082551688295', myHeaders)
    .then(function (response) {
        return response.arrayBuffer();
    })
    .then(function (buffer) {
        const decoder = new TextDecoder('iso-8859-1');
        const text = decoder.decode(buffer);
        console.log(text);
    });
Notice that I'm using iso-8859-1 as decoder.
Credits: Schneide Blog
Maybe your server isn't returning a UTF-8 encoded response; try to find out which charset is used and then set it in the call headers.
Maybe ISO-8859-1 :
myHeaders.append('Content-Type','text/plain; charset=ISO-8859-1');
As it turns out, the problem was in how the servlet was serving the data, without explicitly declaring the encoding on the response.
By adding the following line in the Java servlet:
response.setContentType("text/html;charset=UTF-8");
it was possible to get the characters in the right format.

AWS Lambda base64 encoded form data with image in Node

So, I'm trying to pass an image to a Node Lambda through API Gateway, and this is automatically base64 encoded. This is fine, and my form data all comes out correctly, except that somehow my image is being corrupted, and I'm not sure how to decode it properly to avoid this. Here is the relevant part of my code:
const multipart = require('aws-lambda-multipart-parser');

exports.handler = async (event) => {
    console.log({ event });
    const buff = Buffer.from(event.body, 'base64');
    // using utf-8 appears to lose some of the data
    const decodedEventBody = buff.toString('ascii');
    const decodedEvent = { ...event, body: decodedEventBody };
    const jsonEvent = multipart.parse(decodedEvent, false);
    const asset = Buffer.from(jsonEvent.file.content, 'ascii');
};
First off, it would be good to know if aws-sdk has a way of parsing the multipart form data rather than using this unsupported third-party code. Next, the value of asset ends up as a buffer that's exactly the same size as the original file, but some of the byte values are off. My assumption is that the way this is being encoded vs. decoded is slightly different and maybe some of the characters are being interpreted differently.
Just an update in case anybody else runs into a similar problem - I changed 'ascii' to 'latin1' in both places and then it started working fine.
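For reference, a minimal sketch of the handler with that fix applied (same multipart parser as in the question; 'latin1' preserves every byte value, whereas 'ascii' and 'utf-8' mangle bytes above 0x7F):

const multipart = require('aws-lambda-multipart-parser');

exports.handler = async (event) => {
    const buff = Buffer.from(event.body, 'base64');
    // decode as latin1 so no byte values are lost
    const decodedEventBody = buff.toString('latin1');
    const decodedEvent = { ...event, body: decodedEventBody };
    const jsonEvent = multipart.parse(decodedEvent, false);
    // re-encode the file content with the same encoding to recover the original bytes
    const asset = Buffer.from(jsonEvent.file.content, 'latin1');
    // ... use asset (e.g. upload it somewhere)
};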

How can I get the value in utf-8 from an axios get receiving iso-8859-1 in node.js

I have the following code:
const notifications = await axios.get(url)
const ctype = notifications.headers["content-type"];
The ctype receives "text/json; charset=iso-8859-1"
And my string is like this: "'Ol� Matheus, est� pendente.',"
How can I decode from iso-8859-1 to utf-8 without getting those erros?
Thanks
text/json; charset=iso-8859-1 is not a valid standard content-type. text/json is wrong and JSON must be UTF-8.
So the best way to get around this, at least on the server, is to first get a buffer (does axios support returning buffers?), convert it to a UTF-8 string (the only legal JavaScript string) and only then run JSON.parse on it.
Pseudo-code:
// be warned that I don't know axios, I assume this is possible but it's
// not the right syntax, i just made it up.
const notificationsBuffer = await axios.get(url, {return: 'buffer'});
// Once you have the buffer, this line _should_ be correct.
const notifications = JSON.parse(notificationBuffer.toString('ISO-8859-1'));
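For what it's worth, axios does support this via responseType: 'arraybuffer', which in Node yields a Buffer in response.data. A concrete version of the pseudo-code above could then look like this ('latin1' is the Buffer name for ISO-8859-1):

const axios = require('axios');

async function getNotifications(url) {
    // keep the raw bytes instead of letting axios decode them as UTF-8
    const res = await axios.get(url, { responseType: 'arraybuffer' });
    // decode the ISO-8859-1 bytes into a proper JavaScript string
    const text = res.data.toString('latin1');
    // parse only after decoding
    return JSON.parse(text);
}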

iconv-lite not decoding everything properly, even though I'm using proper decoding

I'm using this piece of code to download a webpage (using request library) and decode everything (using iconv-lite library). The loader function is for finding some elements from the body of the website, then returning them as a JavaScript object.
request.get({url: url, encoding: null}, function(error, response, body) {
    // if webpage exists, process it, otherwise throw 'not found' error
    if (response.statusCode === 200) {
        body = iconv.decode(body, "iso-8859-1");
        const $ = cheerio.load(body);
        async function show() {
            var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
            res.send(JSON.stringify(data));
        }
        show();
    } else {
        res.status(404);
        res.send(JSON.stringify({"error": "No content for this date."}));
    }
});
The pages are encoded in ISO-8859-1 format, and the content looks normal; there are no bad chars. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now, when I'm using the library as in the code provided above, most of the chars look good, but some, e.g. š, show as an empty box, even though they're displayed without any problems on the website.
I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still present there. Maybe it's a problem with Express? Is there a way to fix that?
EDIT:
I copied the empty box character into Google, and it changed to š; maybe that's important
Also, I tried to change output of Express using res.charset but that didn't help.
I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check if the page I'm scraping really has ISO-8859-1 encoding; it turned out that it has Windows-1252 encoding. I changed the encoding in my API (var encoding = 'windows-1252') and it works well now.
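A minimal sketch of that fix in the original request callback (same request/iconv-lite/cheerio setup as in the question):

request.get({url: url, encoding: null}, function (error, response, body) {
    if (response.statusCode === 200) {
        // body is a raw Buffer because encoding is null;
        // decode it as Windows-1252 instead of ISO-8859-1
        body = iconv.decode(body, "windows-1252");
        const $ = cheerio.load(body);
        // ... continue processing as before
    }
});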

Response encoding with node.js "request" module

I am trying to get data from the Bing search API, and since the existing libraries seem to be based on old discontinued APIs, I thought I'd try it myself using the request library, which appears to be the most common one.
My code looks like this:
var SKEY = "myKey....",
    ServiceRootURL = 'https://api.datamarket.azure.com/Bing/Search/v1/Composite';

function getBingData(query, top, skip, cb) {
    var params = {
            Sources: "'web'",
            Query: "'" + query + "'",
            '$format': "JSON",
            '$top': top, '$skip': skip
        },
        req = request.get(ServiceRootURL).auth(SKEY, SKEY, false).qs(params);
    request(req, cb);
}
getBingData("bookline.hu", 50, 0, someCallbackWhichParsesTheBody)
Bing returns some JSON and I can work with it sometimes, but if the response body contains a large number of non-ASCII characters, JSON.parse complains that the string is malformed. I tried switching to an ATOM content type, but there was no difference; the XML was invalid. Inspecting the response body as available in the request() callback actually shows corrupted characters.
So I tried the same request with some python code, and that appears to work fine all the time. For reference:
r = requests.get(
    'https://api.datamarket.azure.com/Bing/Search/v1/Composite?Sources=%27web%27&Query=%27sexy%20cosplay%20girls%27&$format=json',
    auth=HTTPBasicAuth(SKEY, SKEY))
stuffWithResponse(r.json())
I am unable to reproduce the problem with smaller responses (e.g. limiting the number of results) and unable to identify a single result which causes the issue (by stepping up the offset).
My impression is that the response gets read in chunks, transcoded somehow and reassembled back in a bad way, which means the json/atom data becomes invalid if some multibyte character gets split, which happens on larger responses but not small ones.
Being new to node, I am not sure if there is something I should be doing (setting the encoding somewhere? Bing returns UTF-8, so this doesn't seem needed).
Anyone has any idea of what is going on?
FWIW, I'm on OSX 10.8, node is v0.8.20 installed via macports, request is v2.14.0 installed via npm.
I'm not sure about the request library, but the default Node.js one works well for me. It also seems a lot easier to read than your library and does indeed come back in chunks.
http://nodejs.org/api/http.html#http_http_request_options_callback
or for https (like your req) http://nodejs.org/api/https.html#https_https_request_options_callback (the same really though)
For the options, a little tip: use url.parse:
var url = require('url');
var params = '{}';
var dataURL = url.parse(ServiceRootURL);
var post_options = {
    hostname: dataURL.hostname,
    port: dataURL.port || 443, // default to 443 since the URL is https
    path: dataURL.path,
    method: 'GET',
    headers: {
        'Content-Type': 'application/json; charset=utf-8',
        'Content-Length': params.length
    }
};
obviously params needs to be the data you want to send
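To complete the picture, a minimal sketch of issuing the request with the built-in https module and those options, collecting the chunked response before parsing (the query parameters would still need to be appended to dataURL.path yourself):

var https = require('https');

var req = https.request(post_options, function (res) {
    res.setEncoding('utf8'); // decode the incoming chunks as UTF-8 text
    var chunks = [];
    res.on('data', function (chunk) {
        chunks.push(chunk);
    });
    res.on('end', function () {
        // only parse once the whole body has arrived,
        // so multibyte characters are never split
        var body = chunks.join('');
        console.log(JSON.parse(body));
    });
});
req.on('error', function (err) {
    console.error(err);
});
req.end();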
I think your request authentication is incorrect. Authentication has to be provided before request.get.
See the documentation for request HTTP authentication. qs is an object that has to be passed to request options just like url and auth.
Also, you are using the same req for a second request. You should know that request.get returns a stream for the GET of the given URL, so your next request using req will go wrong.
If you only need HTTPBasicAuth, this should also work
//remove req = request.get and subsequent request
request.get('http://some.server.com/', {
    'auth': {
        'user': 'username',
        'pass': 'password',
        'sendImmediately': false
    }
}, function (error, response, body) {
});
The callback argument gets 3 arguments. The first is an error when applicable (usually from the http.Client option not the http.ClientRequest object). The second is an http.ClientResponse object. The third is the response body String or Buffer.
The second object is the response stream. To use it you must use events 'data', 'end', 'error' and 'close'.
Be sure to use the arguments correctly.
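For clarity, a minimal usage of that callback signature (hypothetical URL):

request('https://example.com/api', function (error, response, body) {
    if (error) return console.error(error); // error, when applicable
    console.log(response.statusCode);       // the response object
    console.log(body);                      // response body as a String or Buffer
});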
You have to pass the option {json:true} to enable json parsing of the response
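For example (option names as in the request documentation; SKEY and params as defined in the question):

request.get(ServiceRootURL, {
    json: true, // parse the response body as JSON automatically
    qs: params,
    auth: { user: SKEY, pass: SKEY, sendImmediately: false }
}, function (error, response, body) {
    // body is already a parsed JavaScript object here
});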
