How to convert Response content type from 'ISO-8859-1' to 'UTF-8' - node.js

We are using Node.js for our application. One of the external services returns a response with the content type 'text/plain;charset=ISO-8859-1'. The response looks something like this:
{"name": "I contain special character", other fields...}
The problem is that for some requests the name contains special characters, which breaks our system. We are looking for a way to convert the name to UTF-8 when it contains special characters.
I already tried setting these headers:
dataType: "json",
contentType: "application/json;charset=utf-8",
But it is still not working. Can someone please help with a workaround or a solution?

Related

Decoding base64 while using GitHub API to Download a File

I am using the GitHub API to download a file from GitHub. I have been able to successfully authenticate as well as get a response from github, and see a base64 encoded string representing the file contents.
Unfortunately, I get an unusual error (string length is not a multiple of 4) when decoding the base64 string.
The HTTP request is illustrated below:
GET /repos/:owner/:repo/contents/:path
The (partial) response is illustrated below:
{
"name":....,
"download_url":...",
"type":"file",
"content":"ewogICAgInN3YWdnZXIiOiAiM...
}
The issue I am encountering is that the length of the string is 15263 bytes, and I get an error in decoding the string (string length is not a multiple of 4). I am using node.js and the 'base64-js' npm module to decode the string. Code to execute the decoding is illustrated below:
var base64 = require('base64-js');
var contents = base64.toByteArray(fileContent);
The decoding causes an exception:
Error: Invalid string. Length must be a multiple of 4
at placeHoldersCount (.../node_modules/base64-js/index.js:23:11)
at Object.toByteArray (...node_modules/base64-js/index.js:42:18)
:
:
I would think that the GitHub API is sending me the correct data, so I figure that is not the issue.
Am I performing the decoding improperly or is there another problem I am overlooking?
Any help is appreciated.
I experimented a bit and found a solution by using a different base64 decoding library as follows:
var base64 = require('js-base64').Base64;
var contents = base64.decode(res.content);
I am not sure if it is mandatory to have an encoded string length divisible by 4 (clearly my 15263 character length string is not divisible by 4) but the alternate library decoded the string properly.
A second solution which I also found to work is specific to how to use the GitHub API. By adding the following to the GitHub API call header, I was also able to get the decoded file contents:
'accept': 'application/vnd.github.VERSION.raw'
After much experimenting, I think I nailed down the difference between the working and broken base64 decoding.
It appears GitHub Base-64 encodes with:
UTF-8 charset
Base 64 MIME encoder (RFC2045)
As opposed to a "basic" (RFC4648) Base64 encoder. Several languages seem to default to the basic encoder (including Java, which I was using). When I switched to a MIME encoder, I got the full contents of the file un-garbled. This would explain why switching libraries in some cases fixed the issue.
I will note the contents field contained newline characters - decoders are supposed to ignore them, but not all do, so if you still get errors, you may need to try removing them.
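A minimal Python sketch of the newline issue described above. The payload is a made-up stand-in for the "content" field, not real API output. It shows that MIME-style Base64 carries embedded newlines and that removing all whitespace yields a string a strict decoder accepts.

```python
import base64

# Hypothetical payload standing in for the file behind the "content" field.
original = b'{\n    "swagger": "2.0",\n    "info": {"title": "example"}\n}\n' * 3

# encodebytes() emits MIME-style Base64 (RFC 2045): a newline after every 76 characters.
mime_encoded = base64.encodebytes(original).decode("ascii")

# Remove all whitespace so the string length is a multiple of 4 again.
cleaned = "".join(mime_encoded.split())

# validate=True makes the decoder strict: it rejects any non-alphabet character.
decoded = base64.b64decode(cleaned, validate=True)
```

Python's `base64.b64decode` discards non-alphabet characters by default, which is presumably why it copes with the raw field; the explicit cleanup matters for stricter decoders.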
The media-type header will do the job better, however in my case I am trying to use the API via a GitHub App - at time of writing, GitHub requires a specific media type be used when doing that, and it returns the JSON response.
For some reason the GitHub API's base64-encoded content doesn't decode properly in any of the online base64 decoders I've tried from the front page of Google.
Python works however:
import base64
base64.b64decode("ewogICAgInN3YWdnZXIiOiAiM...")

Why do we need metadata information specifying the encoding?

I feel there is a bit of a chicken-and-egg problem here: if I write an HTML meta tag specifying the charset as, say, UTF-16, how do we decode the entire HTTP request in the first place if we don't already know it is UTF-16 data? I believe the request header needs to handle this, so that by the time we read metadata such as the HTML tag charset="utf-16" we already know it is UTF-16.
Besides, thinking one level higher: is header information such as request headers passed in ASCII as a standard?
I mean, at some level we need to agree on an encoding; you can't put the information needed to decode the data into metadata inside that data. Can anyone clarify this?
I am a bit confused by the idea of specifying the information needed to interpret the whole data as metadata inside the original data.
In general, how can any form of encoding work if we don't have a standard, agreed-upon language or encoding to convey the data about the data itself?
For example, I am told that Apache uses ISO-8859-1 as its default. So would all clients need to use that for HTTP headers and then interpret the actual content as UTF-8 if the Content-Type specifies UTF-8?
A closely related question is "What character encoding should I use for a HTTP header?"
UTF-16 and some other encodings use a BOM (Byte Order Mark) that is read at the start of the file and signals which encoding is being used. The encoded content of the file begins only after it.
For example, for UTF-16, you'll have the bytes FE FF if big-endian and FF FE if little-endian words are being used.
You also often see UTF-8 BOMs, although they don't need to be used (and may confuse some XML parsers).
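As a rough illustration of the BOM check described above, here is a small Python sketch. The function names `sniff_bom` and `decode_with_bom` are made up for this example, and only the UTF-8 and UTF-16 BOMs are handled.

```python
import codecs

# Known BOMs and the encoding each one signals.
# Limitation: FF FE is also the start of the UTF-32 little-endian BOM,
# which this sketch does not try to distinguish.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM if present, otherwise fall back to `default`."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

When no BOM is present the encoding still has to come from somewhere else, such as a header or an agreed default, which is the role of the `default` parameter here.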

Tornado decoding POST argument fails with UnicodeDecodeError

Someone has been using my Tornado application and making POST requests that contain this character: ¡
Tornado was unable to decode the value and ended up with this error: HTTP 400: Bad Request (Invalid unicode in PARAMNAME: b'DATAHERE')
So I investigated and learned that in the request body I was receiving %A1 for this character, which Python's decode method had no difficulty decoding as UTF-8.
But after URL-decoding this value, Tornado ended up with \xa1 for the character, tried to decode it as UTF-8, and failed, because the byte was actually ISO-8859-1 encoded.
So, what is the appropriate way to fix this? Because the user is sending valid input, I don't want to lose this data.
The best answer is to make sure the client always sends utf8 instead of iso8859-1 (this used to require weird tricks like the rails snowman; I'm not sure about the current state of the art). If you cannot do that, override RequestHandler.decode_argument (http://www.tornadoweb.org/en/stable/web.html#tornado.web.RequestHandler.decode_argument), which can see the raw bytes and decide how to decode them (or pass them through unchanged if you don't want to decode at this point).
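One possible shape for the decode_argument override, as a sketch only. The helper `decode_with_fallback` and the handler name `MyHandler` are hypothetical. The fallback to ISO-8859-1 is an assumption about what the client sends, not something Tornado does for you.

```python
def decode_with_fallback(value: bytes) -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, assume ISO-8859-1."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return value.decode("iso-8859-1")

# In a Tornado handler the override might look like this (left as a comment
# so that the sketch runs without Tornado installed):
#
#     class MyHandler(tornado.web.RequestHandler):
#         def decode_argument(self, value, name=None):
#             return decode_with_fallback(value)

print(decode_with_fallback(b"\xa1"))  # prints ¡
```

Note that some ISO-8859-1 byte sequences are also valid UTF-8, so a fallback like this can misread them; getting the client to send UTF-8 remains the more reliable fix.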

Decode file names from MIME messages with LotusScript

I am parsing a MIME mail with LotusScript to get all attachments. But I get issues when it comes to encoded file names in the header. I got one file with the name
"HE336 =?Windows-1251?Q?=CF=E0=EA=E5=F2_=E4=EE=EA=F3=EC=E5=ED=F2=EE=E2.pdf?="
Is there any way to decode it with LotusScript?
The string I get uses RFC 2047 header encoding. I found that Notes supports it in MIME headers. The issue I had is that MIMEHeader.GetParamVal always returns the encoded value. However, MIMEHeader.GetHeaderVal and GetHeaderValAndParams have an extra parameter:
boolean decoded
true decodes any RFC-2047 encodings
false (default) retains any encodings; false is enforced if folded is true
When this is set to true, I get a decoded value.
It's been a while, but I once used Julian Robichaux's Base64 classes with Java and/or LS. You should be able to achieve what you are looking for with these.
Base64Encoding
Hope that helps.
Best wishes - Michael

How to get Gmail content when the content has been encoded

When I try to get the subject and content of an email, I always get encoded content. For example:
subject: Thư cảm ơn
The subject I get is encoded: Th=C6=B0_c=E1=BA=A3m_=C6=A1n?=
So how do I decode this content?
That string looks like it's supposed to be encoded in the format mandated by RFC 2047, except that it is corrupt: RFC 2047 encoded strings always begin with =? and end with ?=.
Assuming you can get a non-corrupt encoded string, you should expect to see not only the Subject but also the display names in To, From, and other headers to be encoded in this way. You might want to look into using a library that does complete MIME parsing for you. Appropriate available libraries will depend on what language you're using.
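If the language happens to be Python, the standard library's email.header module can decode such strings. The sketch below assumes the missing prefix of the subject was =?UTF-8?Q? (that prefix is a guess, since the original string is truncated), and `decode_rfc2047` is a made-up helper name.

```python
from email.header import decode_header

def decode_rfc2047(raw: str) -> str:
    """Decode a header value that may contain RFC 2047 encoded words."""
    parts = []
    for chunk, charset in decode_header(raw):
        # decode_header yields bytes for encoded words, and str when the
        # header contains no encoded words at all.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)

# Assumed well-formed version of the subject from the question.
subject = decode_rfc2047("=?UTF-8?Q?Th=C6=B0_c=E1=BA=A3m_=C6=A1n?=")
```

For whole messages, a full MIME parser such as Python's email package handles the headers and the body encodings together, which is usually less error-prone than decoding fields one by one.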

Resources