Nodejs convert string into UTF-8

Nodejs convert string into UTF-8 - node.js

From my DB im getting the following string:
Johan Ã–bert
What it should say is:
Johan Öbert
I've tried to convert it into utf-8 like so:
nameString.toString("utf8");
But still same problem.
Any ideas?

I'd recommend using the Buffer object:
var someEncodedString = Buffer.from('someString', 'utf-8').toString();
This avoids any unnecessary dependencies that other answers require, since Buffer is included with node.js, and is already defined in the global scope.

Use the utf8 module from npm to encode/decode the string.
Installation:
npm install utf8
In a browser:
<script src="utf8.js"></script>
In Node.js:
const utf8 = require('utf8');
API:
Encode:
utf8.encode(string)
Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)
// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'
Decode:
utf8.decode(byteString)
Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)
utf8.decode('\xC2\xA9');
// → '\xA9'
utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
Resources

I had the same problem, when i loaded a text file via fs.readFile(), I tried to set the encodeing to UTF8, it keeped the same. my solution now is this:
myString = JSON.parse( JSON.stringify( myString ) )
after this an Ö is realy interpreted as an Ö.

When you want to change the encoding you always go from one into another. So you might go from Mac Roman to UTF-8 or from ASCII to UTF-8.
It's as important to know the desired output encoding as the current source encoding. For example if you have Mac Roman and you decode it from UTF-16 to UTF-8 you'll just make it garbled.
If you want to know more about encoding this article goes into a lot of details:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
The npm pacakge encoding which uses node-iconv or iconv-lite should allow you to easily specify which source and output encoding you want:
var resultBuffer = encoding.convert(nameString, 'ASCII', 'UTF-8');

You should be setting the database connection's charset, instead of fighting it inside nodejs:
SET NAMES 'utf8';
(works at least in MySQL and PostgreSQL)
Keep in mind you need to run that for every connection. If you're using a connection pool, do it with an event handler, eg.:
mysqlPool.on('connection', function (connection) {
connection.query("SET NAMES 'utf8'")
});
https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration
https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7
https://www.npmjs.com/package/mysql#connection

Just add this <?xml version="1.0" encoding="UTF-8"?>, will encode. For instance, an RSS would be made with any char after adding this
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
>....
Also add to your parent layout or main app.html <meta charset="utf-8" />
<!DOCTYPE html>
<html lang="en" class="overflowhere">
<head>
<meta charset="utf-8" />
</head>
</html>

Related

fw/1 - Can not display foreign characters in views

I have got the following main Controller default action in fw/1 (framework one 4.2), where i define some rc scope variables to be displayed in views/main/default.cfm.
main.cfc
function default( struct rc ) {
rc.name="ΡΨΩΓΔ";
rc.dupl_name="üößä";
}
views/main/default.cfm
<cfoutput>
<p> #rc.name# </p>
<p> #rc.dupl_name# </p>
</cfoutput>
and finally in layouts/default.cfm
<cfoutput><!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
A Simple FW/1 Layout Template
</title>
</head>
<body>
#body#
</body>
</html>
</cfoutput>
Unfortunately i receive the following output
Î¡Î¨Î©Î“Î”
Ã¼Ã¶ÃŸÃ¤
Any idea that could help me?
Regards

Because I already gave you a solution within my comments, I’m posting it as an answer here, so that others with similar issues may find the root of their charset issues.
The issue described above is very often the result of conflicting charset encodings. For example, if you are reading an ISO-8859-1 encoded file and outputting it with UTF8 without having them converted (de-/encoded) properly.
For the purpose of sending characters through a webapp as you are doing, the most common charset encoding as of today is UTF-8.
All you need to do is to harmonize characterset encodings or at least ensure that the encodings are set in such a manner that the processing engine is able to encode and decode the charsets correctly. What you can do in your case is:
Verify the encoding of the template .cfm file and make sure it’s saved as UTF-8
Define UTF-8 as your web charset in Lucee Administrator » Settings » Charset » Web charset
Define UTF-8 as your ressource charset in Lucee Administrator » Settings » Charset » Resource charset.
Make sure there are no other places where charset encodings are incorrectly set, such as a http server response header of “content-type: text/html;charset...” or a html meta-tag with a charset directive.
If you have other type of ressources, such as a database, then you may need to check those also (connection and database/table settings).
Note: That charset doesn’t always have to be UTF-8 (UTF-8 might use multiple bytes for certain characters). There may be use cases you would achieve the same result with single byte encodings, such as ISO-8559-1, that would also need less computing ressources and less payload.
I hope this may help others with similar issues.

Haskell and the web (UTF-8)

I have a Haskell-Script what putStr some stuff ... and this output I want to display on a webpage. I compiled the Haskellscript and let it be executed like a CGI Script, works well.
But when I have special characters in the putStr, it will fail:
Terminal-Output:
<p class="text">what happrns at ä ?</p>
Web output:
what happrns at
And nothing behind it is displayed ... wha happened?

Some questions:
What Content-type header is your CGI script sending?
Are you setting the encoding on the stdout handle?
You should always send a Content-type header, and in that header you can stipulate the character encoding. See this page for more details.
Also, the encoding for Haskell file handles is determined by your OS's current default. It might be UTF-8, but it also might be something else. See this SO answer for more info.
So to make sure that the characters you send appear correctly on a web page, I would:
Send a Content-type header with charset=utf8
Make sure the encoding for stdout is also set to utf8:
import System.IO
...
hSetEncoding stdout utf8
If the Content-type header and your output encoding agree, things should work out.

Assuming that the Haskell program is properly outputting utf-8, you will still need to let the browser know what encoding you are using. In HTML 4, you can do this as follows
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
(add this inside the <head> tag).
In HTML 5 you can use
<meta charset="UTF-8">
You can verify that the Haskell program is outputting the correct format using xxd on the command line
./myProgram | xxd
This will show you the actual output in bytes, so that you can verify on a character by character basis.

Iterating over a string containing special characters with Dart

What I'm trying to do is to retrieve a corresponding sprite in a game for each character of a string; but when creating a String like so:
var s = "²";
The resulting String in the debugger or when printed is "Â²" and its length is 2.
I've also looked at the string runes and there is 2 of them.
So I don't get how I'm supposed to iterate over a string containing special characters.

The problem is that Dart Editor saves files as UTF8 without BOM by default and it's causing weird characters happening on my Windows machine only if meta meta charset="utf-8" is not in head.
Converting files to UTF8 with BOM using Notepad++ or adding meta charset="utf-8", solved the problem and everything is compiling nicely to JS.

This should work
s.codeUnits.forEach((e) => print(new String.fromCharCode(e)));

How to escape special characters from a markdown string?

I have a markdown file (utf8) that I am turning into a html file. My current setup is pretty straight forward (pseudo code):
var file = read(site.postLocation + '/in.md', 'utf8');
var escaped = marked( file );
write('out.html', escaped);
This works great, however I've now run into the issue where there are special characters in the markdown file (such as é) that will get messed up when viewed in a browser (Ã©).
I've found a couple of npm modules that can convert html entities, however they all convert just about all convertable characters. Including those required by the markdown syntax (for example '#' becomes '&num;' and '.' becomes '&period;' and the markdown parser will fail.
I've tried the libs entities and node-iconv.
I imagine this being a pretty standerd problem. How can I only replace all strange letter characters without all the markdown required symbols?

As pointed out by hilarudeens I forgot to include meta charset html tag.
<meta charset="UTF-8" />
If you come across similar issues, I would suggest you check that first.

NodeJS Extended ASCII Support

I'm trying to read a file that contains extended ascii characters like 'á' or 'è', but NodeJS doesn't seem to recognize them.
I tried reading into:
Buffer
String
Tried differente encoding types:
ascii
base64
utf8
as referenced on http://nodejs.org/api/fs.html
Is there a way to make this work?

I use the binary type to read such files. For example
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like 'á' or 'è',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
When I do save the above into foo.js, then the following is shown on the output:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like '⟡ 漀爀 ✀',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
The wierdness above is because I have run it in an emacs buffer.

The file I was trying to read was in ANSI encoding. When I tried to read it using the 'fs' module's functions, it couldn't perform the conversion of the extended ASCII characters.
I just figured out that nodepad++ is able to actually convert from some formats to UTF-8, instead of just flagging the file with UTF-8 encoding.
After converting it, I was able to read it just fine and apply all the operations I needed to the content.
Thank you for your answers!

I realize this is an old post, but I found it in my personal search for a solution to this particular problem.
I have written a module that provides Extended ASCII decoding and encoding support to/from Node Buffers. You can see the source code here. It is a part of my implementation of Buffer in the browser for an in-browser filesystem I've created called BrowserFS, but it can be used 100% independently of any of that within NodeJS (or Browserify) as it has no dependencies.
Just add bfs-buffer to your dependencies and do the following:
var ExtendedASCII = require('bfs-buffer/js/extended_ascii').default;
// Decodes the input buffer as an extended ASCII string.
ExtendedASCII.byte2str(aBufferWithText);
// Encodes the input string as an extended ASCII string.
ExtendedASCII.str2byte("Some text");
Alternatively, just adapt the module into your project if you don't want to add an extra dependency to your project. It's MIT Licensed.
I hope this helps anyone in the future who finds this page in their searches like I did. :)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string