I'm trying to convert an HTML body encoded in windows-1251 to UTF-8, but I still get garbled characters in the output.
They are Russian letters, but I can't get them to display properly. I get ??????? ?? ??? instead.
const GOT = require('got') // https://www.npmjs.com/package/got
const WIN1251 = require('windows-1251') // https://www.npmjs.com/package/windows-1251
async function query() {
var body = Buffer.from(await GOT('https://example.net/', {resolveBodyOnly: true}), 'binary')
var html = WIN1251.decode(body.toString('utf8'))
console.log(html)
}
query()
You’re doing a lot of silly encoding back-and-forth here. And the ‘backs’ don’t even match the ‘forths’.
First, you use the got library to download a webpage; by default, got will dutifully decode response texts as UTF-8. You stuff the returned Unicode string into a Buffer with the binary encoding, which throws away the higher octet of each UTF-16 code unit of the Unicode string. Then you use .toString('utf8'), which interprets the mutilated bytes as UTF-8 (in actuality, they are most likely not valid UTF-8 at all). Then you pass the ‘UTF-8’ string to the windows-1251 package to decode it as a ‘code page 1251’ string. Nothing good can possibly come from all this confusion.
The windows-1251 package you want to use takes so-called ‘binary’ (pseudo-Latin-1) strings as input. What you should do instead is take the binary response, interpret it as Latin-1/‘binary’ string and then pass it to the windows-1251 library for decoding.
In other words, use this:
const GOT = require('got');
const WIN1251 = require('windows-1251');
async function query() {
const body = await GOT('https://example.net/', {
resolveBodyOnly: true,
responseType: 'buffer'
});
const html = WIN1251.decode(body.toString('binary'))
console.log(html)
}
query()
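As a side note, Node.js builds that ship with full ICU data (which, as far as I know, is the default for official releases) can also decode windows-1251 with the built-in TextDecoder. Here is a minimal sketch under that assumption; the bytes are a hand-made sample, not the real page:

```javascript
// "Привет" encoded as windows-1251: one byte per Cyrillic letter.
const cp1251Bytes = Uint8Array.from([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);

// Decoding with the right charset recovers the text.
const decoded = new TextDecoder('windows-1251').decode(cp1251Bytes);

// Decoding the same bytes as UTF-8 (what got does by default) is lossy:
// invalid sequences are replaced with U+FFFD.
const mangled = new TextDecoder('utf-8').decode(cp1251Bytes);

console.log(decoded);
console.log(mangled.includes('\uFFFD'));
```

With got, the input would be the Buffer you get back when responseType is 'buffer', as in the code above.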
I'm trying to use Fetch to bring some data into the screen, however some of the characters are showing a weird � sign which I believe has something to do with converting special chars.
When debugging on the server side or if I call the servlet on my browser, the problem doesn't happen, so I believe the issue is with my JavaScript. See the code below:
var myHeaders = new Headers();
myHeaders.append('Content-Type','text/plain; charset=UTF-8');
fetch('getrastreiojadlog?cod=10082551688295', myHeaders)
.then(function (response) {
return response.text();
})
.then(function (resp) {
console.log(resp);
});
I think it is probably some detail, but I haven't managed to find out what is happening. So any tips are welcome
Thx
The response's text() function always decodes the payload as utf-8.
If the payload is in another charset, you can use TextDecoder to decode the response buffer (NOT the text) with the charset of your choice.
Using your example it should be:
// A request Content-Type header does not change how the response is decoded, so none is set here.
fetch('getrastreiojadlog?cod=10082551688295')
.then(function (response) {
return response.arrayBuffer();
})
.then(function (buffer) {
const decoder = new TextDecoder('iso-8859-1');
const text = decoder.decode(buffer);
console.log(text);
});
Notice that I'm using iso-8859-1 as decoder.
Credits: Schneide Blog
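If you would rather not hard-code the charset, one option is to read it from the response's Content-Type header. This is only a sketch: charsetFromContentType and decodeBody are made-up helper names, and it assumes the server actually sends a correct charset parameter.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(contentType, fallback = 'utf-8') {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode raw bytes with the charset named in the header.
// TextDecoder throws a RangeError if the label is not a known encoding.
function decodeBody(buffer, contentType) {
  return new TextDecoder(charsetFromContentType(contentType)).decode(buffer);
}

const latin1Bytes = new Uint8Array([0x4f, 0x6c, 0xe1]); // "Olá" in ISO-8859-1
const decoded = decodeBody(latin1Bytes, 'text/plain; charset=ISO-8859-1');
console.log(decoded);
```

In the fetch chain, the header value would come from response.headers.get('Content-Type') and the bytes from response.arrayBuffer().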
Maybe your server isn't returning a UTF-8 encoded response. Try to find out which charset is used, then change it in the call headers.
Maybe ISO-8859-1:
myHeaders.append('Content-Type','text/plain; charset=ISO-8859-1');
As it turns out, the problem was that the servlet was serving the data without explicitly declaring the encoding on the response.
By adding the following line in the Java servlet:
response.setContentType("text/html;charset=UTF-8");
it was possible to get the characters in the right format.
Let's say I am creating a REST API with Node/Express and data is exchanged between the client and server via JSON.
A user is filling out a registration form and one of the fields is an image input to upload a profile image. Images cannot be sent through JSON and therefore must be converted to a base64 string.
How do I validate this is indeed a base64 string of an image on the serverside? Or is it best practice not to send the profile image as a base64?
You could start by checking if the string is a base64 image, with a proper mime type.
I found this library on npm registry doing exactly that (not tested).
const isBase64 = require('is-base64');
let base64str_img = '...ljA5GC68sN8AoXT/AF7fw7//2Q==';
console.log(isBase64(base64str_img, { mime: true })); // true
Then you can verify if the mime type is allowed within your app, or make other verifications like trying to display the image file and catch possible error.
Anyway, if you want to be really sure about user input, you have to validate it yourself in the first place. That is the best practice you should care about.
The Base64 value is a valid image only if its decoded data has the correct MIME type and its width and height are greater than zero. A handy way to check all of this is to install the jimp package and use it as follows:
var b64 = 'R0lGODdhAQADAPABAP////8AACwAAAAAAQADAAACAgxQADs=',
buf = Buffer.from(b64, 'base64');
require('jimp').read(buf).then(function (img) {
if (img.bitmap.width > 0 && img.bitmap.height > 0) {
console.log('Valid image');
} else {
console.log('Invalid image');
}
}).catch (function (err) {
console.log(err);
});
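If pulling in an image library feels heavy, a cheaper first filter is to look at the first few decoded bytes, since PNG, JPEG and GIF files each start with a fixed signature. This is a rough sketch: sniffImageType is a made-up name, it only knows these three formats, and it does not prove the rest of the file is a valid image.

```javascript
// Hypothetical helper: returns a MIME type for a few well-known image
// signatures, or null when the decoded bytes match none of them.
function sniffImageType(base64) {
  const buf = Buffer.from(base64, 'base64');
  const startsWith = (bytes) =>
    buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b);

  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  if (startsWith([0xff, 0xd8, 0xff])) return 'image/jpeg';
  const head = buf.toString('latin1', 0, 6);
  if (head === 'GIF87a' || head === 'GIF89a') return 'image/gif';
  return null;
}

const gifType = sniffImageType('R0lGODdhAQADAPABAP////8AACwAAAAAAQADAAACAgxQADs=');
const textType = sniffImageType('aGVsbG8='); // "hello", not an image
console.log(gifType, textType);
```

You could run this first to reject obvious junk, and only hand the buffer to jimp when it returns a type you allow.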
I wanted to do something similar, and ended up Googling it and found nothing, so I made my own base64 validator:
function isBase64(text) {
let utf8 = Buffer.from(text, "base64").toString("utf8");
return !(/[^\x00-\x7f]/.test(utf8));
}
This isn't great, because I wrote it for a different purpose, but you may be able to build on it. Here is a variant using atob, which rejects invalid base64 chars (Buffer.from ignores them):
function isBase64(text) {
try {
let utf8 = atob(text);
return !(/[^\x00-\x7f]/.test(utf8));
} catch (_) {
return false;
}
}
Now, about how it works:
Buffer.from(text, "base64") removes all invalid base64 chars from the string, then converts the string to a buffer; toString("utf8") converts the buffer to a string. atob does something similar, but instead of removing the invalid chars, it will throw an error when it encounters one (hence the try...catch).
!(/[^\x00-\x7f]/.test(utf8)) will return true if all the chars from the decoded string belong in the ASCII charset, otherwise it will return false. This can be altered to use a smaller charset, for example, [^\x30-\x39\x41-\x5a\x61-\x7a] will only return true if all the characters are alphanumeric.
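Another way to catch the characters that Buffer.from silently drops is to decode and re-encode, then compare with the input. A small sketch (isStrictBase64 is just a name I picked); note that it only accepts the standard alphabet with padding, so unpadded or URL-safe input is rejected too:

```javascript
// Hypothetical helper: true only when the text survives a
// decode/encode round trip unchanged (canonical, padded base64).
function isStrictBase64(text) {
  if (typeof text !== 'string' || text.length === 0) return false;
  return Buffer.from(text, 'base64').toString('base64') === text;
}

console.log(isStrictBase64('aGVsbG8='));  // valid: encodes "hello"
console.log(isStrictBase64('aGVs*bG8=')); // contains a char outside the alphabet
console.log(isStrictBase64('aGVsbG8'));   // padding is missing
```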
I have the following code:
const notifications = await axios.get(url)
const ctype = notifications.headers["content-type"];
The ctype receives "text/json; charset=iso-8859-1"
And my string is like this: "'Ol� Matheus, est� pendente.',"
How can I decode from iso-8859-1 to utf-8 without getting those errors?
Thanks
text/json; charset=iso-8859-1 is not a valid standard content-type. text/json is wrong and JSON must be UTF-8.
So the best way to get around this, at least on the server, is to first get a buffer (axios returns one in Node.js when responseType is 'arraybuffer'), decode it into a JavaScript string using the charset the server actually sent, and only then run JSON.parse on it.
Example:
// Ask axios for the raw bytes instead of an already-decoded string.
const response = await axios.get(url, { responseType: 'arraybuffer' });
// Node.js calls ISO-8859-1 'latin1'; 'ISO-8859-1' is not a valid Buffer encoding name.
const notifications = JSON.parse(Buffer.from(response.data).toString('latin1'));
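To see why the decoding step matters, here is a small self-contained sketch that needs no network. The payload is a made-up stand-in for the server's response, built by hand as ISO-8859-1 bytes:

```javascript
// Simulate a response body that the server encoded as ISO-8859-1.
const payload = Buffer.from('{"msg":"Olá Matheus, está pendente."}', 'latin1');

// Reading those bytes as UTF-8 produces U+FFFD where the accents were.
const wrong = payload.toString('utf8');

// Reading them as latin1 restores the text, and JSON.parse then works as usual.
const right = JSON.parse(payload.toString('latin1'));

console.log(wrong.includes('\uFFFD'));
console.log(right.msg);
```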
I used the code below to encode a file to base64.
var bitmap = fs.readFileSync(file);
return new Buffer(bitmap).toString('base64');
I figured that in the file we have issues with “” and ‘’ characters, but it’s fine with "
When we have It’s, node encodes the characters, but when I decode, I see it as
It’s
Here's the javascript I'm using to decode:
fs.writeFile(reportPath, body.buffer, {encoding: 'base64'}
So, once the file is encoded and decoded, it becomes unusable with these funky characters - It’s
Can anyone shed some light on this?
This should work.
Sample script:
const fs = require('fs')
const filepath = './testfile'
//write "it's" into the file
fs.writeFileSync(filepath,"it's")
//read the file
const file_buffer = fs.readFileSync(filepath);
//encode contents into base64
const contents_in_base64 = file_buffer.toString('base64');
//write into a new file, specifying base64 as the encoding (decodes)
fs.writeFileSync('./fileB64',contents_in_base64,{encoding:'base64'})
//file fileB64 should now contain "it's"
I suspect your original file does not have utf-8 encoding, looking at your decoding code:
fs.writeFile(reportPath, body.buffer, {encoding: 'base64'})
I am guessing your content comes from an HTTP request of some sort, so it is possible that the content is not UTF-8 encoded. Take a look at this:
https://www.w3.org/International/articles/http-charset/index
If no charset is specified, a Content-Type of text/* defaults to ISO-8859-1.
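To back up the point that base64 itself is not the culprit, here is a quick sketch. Base64 round-trips the bytes exactly; the text only breaks when those bytes are later read with a different charset than the one they were written in:

```javascript
// ’ is U+2019, which takes three bytes in UTF-8.
const original = Buffer.from('It’s', 'utf8');
const encoded = original.toString('base64');
const restored = Buffer.from(encoded, 'base64');

// The bytes are identical after the round trip.
console.log(restored.equals(original));

// Read back as UTF-8, the text is intact; read back as latin1, it is not.
const asUtf8 = restored.toString('utf8');
const asLatin1 = restored.toString('latin1');
console.log(asUtf8 === 'It’s', asLatin1 === 'It’s');
```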
Here is the code that helped.
var bitmap = fs.readFileSync(file);
// Remove the non-standard characters
var tmp = bitmap.toString().replace(/[“”‘’]/g,'');
// Create a buffer from the string and return the results
return Buffer.from(tmp).toString('base64');
You can provide base64 encoding to the readFileSync function itself.
const fileDataBase64 = fs.readFileSync(filePath, 'base64')
I have been facing an issue on node.js express framework encoding/decoding style.
Brief background: I store PDF files in a MySQL database in a longblob column with the latin1 charset. On the server side, I need to send the binary data UTF-8 encoded, as my client only knows how to decode UTF-8.
I tried all the possible solutions available on google.
For ex:
new Buffer(mySqlData).toString('utf8');
I already tried the "UTF8" module's utf8.encode(mySqlData), but it is not working.
I also tried base64 encoding and decoding the data on the client. That works fine, but I need UTF-8 encoding, and base64 increases the size.
Please help guys.
OK, your problem is the conversion from latin1 to UTF-8. If you just call buffer.toString('utf-8'), the latin1-encoded characters come out wrong.
To convert another charset to UTF-8, the simple way is to use iconv and icu-charset-detector. With those, you can convert to UTF-8 from almost any charset (with a few exceptions).
This is an example of conversion using streams. The resulting stream is encoded as UTF-8:
var charsetDetector = require("node-icu-charset-detector"),
Iconv = require('iconv').Iconv,
Stream = require('stream');
function convertToUtf8(source, callback) {
var iconv,
charsetTestStream = new Stream.PassThrough(),
newResStream = new Stream.PassThrough();
source.pipe(charsetTestStream);
source.pipe(newResStream);
charsetDetector.detectCharsetStream(charsetTestStream, function (charset) {
if (!iconv && charset && !/utf-*8/i.test(charset.toString())) {
try {
iconv = new Iconv(charset, 'utf-8');
console.log('Converting from charset %s to utf-8', charset);
iconv.on('error', function (err) {
callback(err);
});
var convertStream = newResStream.pipe(iconv);
callback(null, convertStream);
} catch(err) {
callback(err);
}
return;
}
callback(null, newResStream);
});
}
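If you already know the source is latin1, as in the question, you may not need charset detection at all. A minimal sketch using only Buffer (latin1ToUtf8 is a made-up name, and this assumes the bytes really are latin1 text rather than arbitrary binary):

```javascript
// Hypothetical helper: decode ISO-8859-1 bytes to a string,
// then re-encode that string as UTF-8 bytes.
function latin1ToUtf8(buf) {
  return Buffer.from(buf.toString('latin1'), 'utf8');
}

const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in ISO-8859-1
const utf8Bytes = latin1ToUtf8(latin1Bytes);

console.log(utf8Bytes.toString('hex')); // 636166c3a9
console.log(utf8Bytes.toString('utf8'));
```

The single 0xe9 byte becomes the two-byte sequence c3 a9, which is why a re-encoded payload is slightly larger than the original.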