Vanilla node.js response encoding - weird behaviour (Heroku)

I'm developing my understanding of servers by writing a webdev framework using vanilla node.js. For the first time a situation arose where a French character was included in a JSON response from the server, and this character showed up as an unrecognized symbol (a question mark within a diamond in Chrome).
The problem was the encoding, which was being specified here:
/*
At this stage we have access to "response", which is an object of the
following format:

response = {
    data: 'A string which contains the data to send',
    encoding: 'The encoding type of the data, e.g. "text/html", "text/json"'
}
*/
var encoding = response.encoding;
var length = response.data.length;
var data = response.data;

res.writeHead(200, {
    'Content-Type': encoding,
    'Content-Length': length
});

res.end(data, 'binary'); // Everything is encoded as binary
The problem was that everything sent by the server is encoded as binary, which ruins the ability to display certain characters. The fix seemed simple: include a boolean binary value in response, and adjust the second parameter of res.end accordingly:
/*
At this stage we have access to "response", which is an object of the
following format:

response = {
    data: 'A string which contains the data to send',
    encoding: 'The encoding type of the data, e.g. "text/html", "text/json"',
    binary: 'a boolean value determining the transfer encoding'
}
*/

// ...

var binary = response.binary;

// ...

res.end(data, binary ? 'binary' : 'utf8'); // Encode responses appropriately
Here is where I have produced some very, very strange behavior. This modification causes French characters to appear correctly, but occasionally causes the last character of a response to be omitted on the client side!
This bug only happens once I host my application on Heroku. Locally, the last character is never missing.
I noticed this bug because certain responses (not all of them!) now break the JSON.parse call on the client side, even though they are only missing the final } character.
I have a horrible band-aid solution right now, which works:
var length = response.data.length + 1;
var data = response.data + ' ';
I am simply appending a space to every single response sent by the server. This actually causes all text/html, text/css, text/json, and application/javascript responses to work because they can tolerate the unnecessary whitespace, but I hate this solution and it will break other Content-Types!
My question is: can anyone give me some insight into this problem?

If you're going to explicitly set a Content-Length, you should always use Buffer.byteLength() on the body to calculate the length, since that method returns the actual number of bytes in the string, not the number of characters that the string's .length property returns.
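Applied to the snippet from the question, a minimal sketch of that fix could look like this (letting res.end fall back to its default utf8 encoding for strings instead of forcing 'binary'):

var data = response.data;

res.writeHead(200, {
    'Content-Type': response.encoding,
    'Content-Length': Buffer.byteLength(data) // byte count, not character count
});

res.end(data); // for string bodies the default encoding is utf8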

Related

NodeJS and Iconv - "ISO-8859-1" to "UTF-8"

I created a NodeJS application which should get some data from an external API server. That server provides its data only as 'Content-Type: text/plain;charset=ISO-8859-1'; I got that information from the server's response headers.
Now the problem for me is that special characters like 'ä', 'ö' or 'ü' are shown as �.
I tried to convert them with Iconv to UTF-8, but then I got these things '�'...
My question is, what am I doing wrong?
For testing I use Postman. These are the steps I do to test everything:
Use Postman to trigger my NodeJS application
The App requests data from the API-Server
API-Server sends Data to NodeJS App
My app prints out the raw response data of the API, which already contains those strange characters (�)
The app then tries to convert them with Iconv to UTF-8, at which point it shows me these '�' characters
Another strange thing:
When I connect Postman directly to the API server, the special characters are shown as they should be, without problems. Therefore I guess my application causes the problem, but I cannot see where or why...
// JavaScript code:
try {
    const response = await axios.get(
        URL,
        {
            params: params,
            headers: headers
        }
    );
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');
    var converted = await iconv.convert(response.data);
    return converted.toString('UTF-8');
} catch (error) {
    throw new Error(error);
}
After some deeper research I came up with the solution to my problem.
The cause of all the trouble seems to lie in the post-processing that axios (or something similar) applies: the step right after the data is received and converted to text, and shortly before the response is handed to my Node.js application.
What I did was define the "responseType" of the axios GET call as "arraybuffer". That required an adjustment to the axios call, like so:
var resArBuffer = await axios.get(
    URL,
    {
        responseType: 'arraybuffer',
        params: params,
        headers: headers
    }
);
Since JavaScript is awesome, the returned data (a Buffer in Node) provides its own toString() method to convert the raw bytes into a string using an encoding of your choice:
var response = resArBuffer.data.toString("latin1");
Another thing worth mentioning is that I used "latin1" instead of "ISO-8859-1". Don't ask me why; some sources even recommended using "cp1252" instead, but "latin1" worked for me here.
Unfortunately that was not enough, since I needed the text in UTF-8 format. Using "toString('utf-8')" directly was the wrong way too, since it would still print the "�" symbols. The workaround was simple: I used "Buffer.from(...)" to convert the "latin1"-decoded text into a "utf-8" text:
var text = Buffer.from(response, 'utf-8').toString();
Now I get the UTF-8 text I needed. I hope this thread helps someone else out there, since this information was spread across many different threads for me.
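Putting the steps above together, a minimal consolidated sketch (assuming the same URL, params and headers variables as in the question; the function name is just a placeholder) might look roughly like this:

const axios = require('axios');

async function fetchLatin1Text() {
    // Ask axios for the raw bytes so it does not apply its own text decoding
    const resArBuffer = await axios.get(URL, {
        responseType: 'arraybuffer',
        params: params,
        headers: headers
    });
    // In Node the returned data is a Buffer; decode its bytes as latin1 / ISO-8859-1
    const text = resArBuffer.data.toString('latin1');
    return text; // a regular JavaScript string that can be passed on as UTF-8
}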

nodeJS: convert response.body to utf-8 (from windows-1251 encoding)

I'm trying to convert an HTML body encoded in windows-1251 to utf-8, but I still get messed-up characters in the HTML.
They are basically Russian letters, but I can't get them to display properly. I get ??????? ?? ???
const GOT = require('got') // https://www.npmjs.com/package/got
const WIN1251 = require('windows-1251') // https://www.npmjs.com/package/windows-1251

async function query() {
    var body = Buffer.from(await GOT('https://example.net/', {resolveBodyOnly: true}), 'binary')
    var html = WIN1251.decode(body.toString('utf8'))
    console.log(html)
}

query()
You’re doing a lot of silly encoding back-and-forth here. And the ‘backs’ don’t even match the ‘forths’.
First, you use the got library to download a webpage; by default, got will dutifully decode response texts as UTF-8. You stuff the returned Unicode string into a Buffer with the binary encoding, which throws away the higher octet of each UTF-16 code unit of the Unicode string. Then you use .toString('utf-8'), which interprets this mutilated string as UTF-8 (in actuality, it is most likely not valid UTF-8 at all). Then you pass the ‘UTF-8’ string to the windows-1251 library, to decode it as a ‘code page 1251’ string. Nothing good can possibly come from all this confusion.
The windows-1251 package you want to use takes so-called ‘binary’ (pseudo-Latin-1) strings as input. What you should do instead is take the binary response, interpret it as a Latin-1/‘binary’ string, and then pass it to the windows-1251 library for decoding.
In other words, use this:
const GOT = require('got');
const WIN1251 = require('windows-1251');

async function query() {
    const body = await GOT('https://example.net/', {
        resolveBodyOnly: true,
        responseType: 'buffer'
    });
    const html = WIN1251.decode(body.toString('binary'))
    console.log(html)
}

query()

Return JSON response in ISO-8859-1 with NodeJS/Express

I have built an API with Node and Express which returns some JSON. The JSON data is to be read by a web application. Sadly this application only accepts ISO-8859-1 encoded JSON which has proven to be a bit difficult.
I can't manage to return the JSON with the right encoding even though I've tried the methods in the Express documentation and also all tips from googling the issue.
The Express documentation says to use "res.set()" or "res.type()" but none of these is working for me. The commented lines are all the variants that I've tried (using Mongoose):
MyModel.find()
    .sort([['name', 'ascending']])
    .exec((err, result) => {
        if (err) { return next(err) }
        // res.set('Content-Type', 'application/json; charset=iso-8859-1')
        // res.set('Content-Type', 'application/json; charset=ansi')
        // res.set('Content-Type', 'application/json; charset=windows-1252')
        // res.type('application/json; charset=iso-8859-1')
        // res.type('application/json; charset=ansi')
        // res.type('application/json; charset=windows-1252')
        // res.send(result)
        res.json(result)
    })
None of these have any effect on the response; it always ends up as "Content-Type: application/json; charset=utf-8".
Since JSON should(?) be encoded in UTF-8, is it even possible to use any other encoding with Express?
If you look at the lib/response.js file in the Express source code (in your node_modules folder or at https://github.com/expressjs/express/blob/master/lib/response.js) you'll see that res.json takes your result, generates the corresponding JSON representation in a JavaScript String, and then passes that string to res.send.
The cause of your problem is that when res.send (in that same source file) is given a String argument, it encodes the string as UTF8 and also forces the charset of the response to utf-8.
You can get around this by not using res.json. Instead build the encoded response yourself. First use your existing code to set up the Content-Type header:
res.set('Content-Type', 'application/json; charset=iso-8859-1')
After that, manually generate the JSON string:
jsonString = JSON.stringify(result);
then encode that string as ISO-8859-1 into a Buffer:
jsonBuffer = Buffer.from(jsonString, 'latin1');
Finally, pass that buffer to res.send:
res.send(jsonBuffer)
Because res.send is no longer being called with a String argument, it should skip the step where it forces charset=utf-8 and should send the response with the charset value that you specified.
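Put together with the Mongoose query from the question, the whole handler might look something like this sketch (the route path here is just a placeholder):

app.get('/items', (req, res, next) => {
    MyModel.find()
        .sort([['name', 'ascending']])
        .exec((err, result) => {
            if (err) { return next(err) }
            res.set('Content-Type', 'application/json; charset=iso-8859-1')
            const jsonString = JSON.stringify(result)
            const jsonBuffer = Buffer.from(jsonString, 'latin1') // encode the JSON as ISO-8859-1 bytes
            res.send(jsonBuffer) // Buffer argument, so Express keeps the charset you set
        })
})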

iconv-lite not decoding everything properly, even though I'm using proper decoding

I'm using this piece of code to download a webpage (using the request library) and decode everything (using the iconv-lite library). The loader function finds some elements in the body of the website and returns them as a JavaScript object.
request.get({url: url, encoding: null}, function(error, response, body) {
    // if webpage exists, process it, otherwise throw 'not found' error
    if (response.statusCode === 200) {
        body = iconv.decode(body, "iso-8859-1");
        const $ = cheerio.load(body);
        async function show() {
            var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
            res.send(JSON.stringify(data));
        }
        show();
    } else {
        res.status(404);
        res.send(JSON.stringify({"error": "No content for this date."}));
    }
});
The pages are encoded in ISO-8859-1 format and the content looks normal; there are no bad characters. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now that I'm using the library as in the code above, most of the characters look fine, but some, e.g. š, appear as an empty box, even though they are displayed without any problems on the website.
I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still present there. Maybe it's a problem with Express? Is there a way to fix that?
EDIT:
I copied the empty box character to Google, and it has changed to š, maybe that's important
Also, I tried to change the output encoding in Express using res.charset, but that didn't help.
I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check whether the page I'm scraping really has ISO-8859-1 encoding; it turned out that it actually uses Windows-1252. I changed the encoding in my API (var encoding = 'windows-1252') and it works well now.
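In terms of the code from the question, the fix amounts to something like this sketch (the rest of the handler stays as it was):

var encoding = 'windows-1252'; // the page is actually Windows-1252, not ISO-8859-1
body = iconv.decode(body, encoding);
const $ = cheerio.load(body);
// ... continue as before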

Response encoding with node.js "request" module

I am trying to get data from the Bing search API, and since the existing libraries seem to be based on old, discontinued APIs I thought I'd try it myself using the request library, which appears to be the most common library for this.
My code looks like this:
var SKEY = "myKey....",
    ServiceRootURL = 'https://api.datamarket.azure.com/Bing/Search/v1/Composite';

function getBingData(query, top, skip, cb) {
    var params = {
            Sources: "'web'",
            Query: "'" + query + "'",
            '$format': "JSON",
            '$top': top, '$skip': skip
        },
        req = request.get(ServiceRootURL).auth(SKEY, SKEY, false).qs(params);
    request(req, cb)
}
getBingData("bookline.hu", 50, 0, someCallbackWhichParsesTheBody)
Bing returns some JSON and I can work with it sometimes, but if the response body contains a large number of non-ASCII characters, JSON.parse complains that the string is malformed. I tried switching to an ATOM content type, but there was no difference; the XML was invalid. Inspecting the response body as available in the request() callback actually shows mangled characters.
So I tried the same request with some Python code, and that appears to work fine all the time. For reference:
r = requests.get(
    'https://api.datamarket.azure.com/Bing/Search/v1/Composite?Sources=%27web%27&Query=%27sexy%20cosplay%20girls%27&$format=json',
    auth=HTTPBasicAuth(SKEY, SKEY))
stuffWithResponse(r.json())
I am unable to reproduce the problem with smaller responses (e.g. limiting the number of results) and unable to identify a single result which causes the issue (by stepping up the offset).
My impression is that the response gets read in chunks, transcoded somehow and reassembled back in a bad way, which means the json/atom data becomes invalid if some multibyte character gets split, which happens on larger responses but not small ones.
Being new to node, I am not sure if there is something I should be doing (setting the encoding somewhere? Bing returns UTF-8, so this doesn't seem needed).
Anyone has any idea of what is going on?
FWIW, I'm on OSX 10.8, node is v0.8.20 installed via macports, request is v2.14.0 installed via npm.
I'm not sure about the request library, but the default Node.js one works well for me. It also seems a lot easier to read than your library, and the response does indeed come back in chunks.
http://nodejs.org/api/http.html#http_http_request_options_callback
or for https (like your req) http://nodejs.org/api/https.html#https_https_request_options_callback (the same really though)
A little tip for the options: use url.parse:
var url = require('url');

var params = '{}';
var dataURL = url.parse(ServiceRootURL);
var post_options = {
    hostname: dataURL.hostname,
    port: dataURL.port || 443, // the service URL is https, so default to 443 rather than 80
    path: dataURL.path,
    method: 'GET',
    headers: {
        'Content-Type': 'application/json; charset=utf-8',
        'Content-Length': params.length
    }
};
Obviously, params needs to be the data you want to send.
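A minimal sketch of actually firing the request with those options and reassembling the chunked response (assuming the post_options object from above) could be:

var https = require('https');

var req = https.request(post_options, function (res) {
    res.setEncoding('utf8'); // decode chunks as UTF-8 so multibyte characters are never split
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
        var parsed = JSON.parse(body); // parse once the whole body has arrived
        console.log(parsed);
    });
});
req.on('error', function (err) { console.error(err); });
req.end();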
I think your request authentication is incorrect. Authentication has to be provided before request.get.
See the documentation for request HTTP authentication. qs is an object that has to be passed to request options just like url and auth.
Also, you are using the same req for a second request. You should know that request.get returns a stream for the GET of the given url. Your next request using req will go wrong.
If you only need HTTPBasicAuth, this should also work
// remove req = request.get and the subsequent request
request.get('http://some.server.com/', {
    'auth': {
        'user': 'username',
        'pass': 'password',
        'sendImmediately': false
    }
}, function (error, response, body) {
});
The callback argument gets 3 arguments. The first is an error when applicable (usually from the http.Client option not the http.ClientRequest object). The second is an http.ClientResponse object. The third is the response body String or Buffer.
The second object is the response stream. To use it you must use events 'data', 'end', 'error' and 'close'.
Be sure to use the arguments correctly.
You have to pass the option {json: true} to enable JSON parsing of the response:
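For example, combined with the auth and query-string options above, a sketch of that could look like this (reusing ServiceRootURL, params and SKEY from the question):

request.get({
    url: ServiceRootURL,
    qs: params,
    json: true, // ask request to parse the response body as JSON
    auth: { user: SKEY, pass: SKEY, sendImmediately: false }
}, function (error, response, body) {
    if (error) { return console.error(error); }
    console.log(body); // body is already a parsed object here
});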
