How to escape special characters from a markdown string? - node.js

I have a markdown file (UTF-8) that I am turning into an HTML file. My current setup is pretty straightforward (pseudo-code):
var file = read(site.postLocation + '/in.md', 'utf8');
var escaped = marked( file );
write('out.html', escaped);
This works great; however, I've now run into an issue where special characters in the markdown file (such as é) get messed up when viewed in a browser (displayed as the mojibake Ã©).
I've found a couple of npm modules that can convert HTML entities, but they all convert just about every convertible character, including those required by the markdown syntax (for example '#' becomes '&#35;' and '.' becomes '&#46;'), and the markdown parser will fail.
I've tried the libs entities and node-iconv.
I imagine this is a pretty standard problem. How can I replace only the unusual letter characters without touching the symbols markdown needs?

As pointed out by hilarudeens, I forgot to include the meta charset HTML tag:
<meta charset="UTF-8" />
If you come across similar issues, I would suggest you check that first.
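For anyone wanting the whole picture, here is a minimal sketch of the fixed pipeline; the HTML wrapper and file paths are illustrative, and marked's classic callable API is assumed:
var fs = require('fs');
var marked = require('marked');

// Read the markdown as UTF-8 and convert it to an HTML fragment.
var md = fs.readFileSync(site.postLocation + '/in.md', 'utf8');
var body = marked(md);

// Serve the fragment from a document that declares its encoding, so the
// browser decodes characters like é correctly and no entity escaping is needed.
var html = '<!DOCTYPE html>\n<html><head><meta charset="UTF-8" /></head><body>\n' +
  body + '\n</body></html>';
fs.writeFileSync('out.html', html, 'utf8');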

Related

readFileSync() doesn't return special characters regardless of encoding

My requirements are very simple… open any old ANSI-ASCII-UTF8-Unicode TXT file and replace some of the special "word processing" characters, like the fancy single quote (\u2019) and double quotes (\u201C and \u201D), with the plain vanilla ASCII ones, and then do some other (irrelevant to the problem) parsing.
However, regardless of the encoding I try (ascii, utf8, binary), I just can't get Node.js to return all the characters correctly so as to replace them with their ASCII equivalents; instead I get useless little rectangles!
Here's the relevant part of the function…
function LoadTxtFile(Name){
  var fs = require('fs');
  if (fs.existsSync(Name)){
    var Source = fs.readFileSync(Name, 'binary').toString();
    /* Replace miscellaneous characters, which works fine… */
    Source = Source.replace(/©/g, '&copy;');
    Source = Source.replace(/…/g, '...');
    Source = Source.replace(/\t/g, ' ');
    Source = Source.replace(/'/g, '&#39;');
    /* Replace the dreaded single/double quotes but they are never located! */
    Source = Source.replace(/\u2019/g, '&#39;');
    Source = Source.replace(/\u201C/g, '&quot;');
    Source = Source.replace(/\u201D/g, '&quot;');
    /* And we're stuck! */
  }
}
Thank you very much.
Try the node-iconv library and see if it helps.
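A minimal sketch of that approach using iconv-lite (a sister library mentioned further down), assuming the file really is Windows-1252 "ANSI"; the file name is a placeholder:
var fs = require('fs');
var iconv = require('iconv-lite');

// Read raw bytes (no encoding argument), then decode them explicitly.
var buf = fs.readFileSync('legacy.txt');
var text = iconv.decode(buf, 'win1252');

// The fancy quotes are now real \u2019/\u201C/\u201D characters,
// so the replaces from the question finally match.
text = text.replace(/\u2019/g, "'").replace(/[\u201C\u201D]/g, '"');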

Iterating over a string containing special characters with Dart

What I'm trying to do is to retrieve a corresponding sprite in a game for each character of a string; but when creating a String like so:
var s = "²";
The resulting String in the debugger or when printed is "Â²" and its length is 2.
I've also looked at the string runes and there are 2 of them.
So I don't get how I'm supposed to iterate over a string containing special characters.
The problem was that Dart Editor saves files as UTF-8 without BOM by default, and this caused weird characters, on my Windows machine only, whenever <meta charset="utf-8"> was not in the head.
Converting the files to UTF-8 with BOM using Notepad++, or adding <meta charset="utf-8">, solved the problem and everything compiles nicely to JS.
This should work (codeUnits iterates over the string's UTF-16 code units):
s.codeUnits.forEach((e) => print(new String.fromCharCode(e)));

Nodejs convert string into UTF-8

From my DB I'm getting the following string:
Johan Ã–bert
What it should say is:
Johan Öbert
I've tried to convert it into utf-8 like so:
nameString.toString("utf8");
But still same problem.
Any ideas?
I'd recommend using the Buffer object:
var someEncodedString = Buffer.from('someString', 'utf-8').toString();
This avoids any unnecessary dependencies that other answers require, since Buffer is included with node.js, and is already defined in the global scope.
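If the real problem is UTF-8 bytes that were decoded as Latin-1 somewhere along the way (a common source of the Ã– artifacts above), a round trip through Buffer recovers the text; a sketch, assuming Latin-1 really was the wrong decoding applied:
// 'Johan Ã\u0096bert' is what 'Johan Öbert' becomes when its UTF-8
// bytes (0xC3 0x96 for Ö) are misread as Latin-1.
var garbled = 'Johan Ã\u0096bert';

// Turn the characters back into the original bytes, then decode as UTF-8.
var fixed = Buffer.from(garbled, 'latin1').toString('utf8');
console.log(fixed); // 'Johan Öbert'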
Use the utf8 module from npm to encode/decode the string.
Installation:
npm install utf8
In a browser:
<script src="utf8.js"></script>
In Node.js:
const utf8 = require('utf8');
API:
Encode:
utf8.encode(string)
Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)
// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'
Decode:
utf8.decode(byteString)
Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)
utf8.decode('\xC2\xA9');
// → '\xA9'
utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
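A quick round trip with the two calls documented above; note that the intermediate value is a "byte string" holding one character per UTF-8 byte:
var utf8 = require('utf8');

var bytes = utf8.encode('Johan Öbert'); // 12 chars: Ö becomes two bytes
console.log(utf8.decode(bytes));        // 'Johan Öbert'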
I had the same problem when I loaded a text file via fs.readFile(); I tried to set the encoding to UTF8, but it stayed the same. My solution now is this:
myString = JSON.parse( JSON.stringify( myString ) )
After this, an Ö is really interpreted as an Ö.
When you want to change the encoding, you always go from one encoding into another. So you might go from Mac Roman to UTF-8, or from ASCII to UTF-8.
Knowing the desired output encoding is just as important as knowing the current source encoding. For example, if you have Mac Roman text and you decode it from UTF-16 to UTF-8, you'll just make it garbled.
If you want to know more about encoding this article goes into a lot of details:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
The npm package encoding, which uses node-iconv or iconv-lite, should allow you to easily specify which source and output encoding you want:
var encoding = require('encoding');
// signature: encoding.convert(text, toCharset, fromCharset)
var resultBuffer = encoding.convert(nameString, 'ASCII', 'UTF-8');
You should be setting the database connection's charset, instead of fighting it inside nodejs:
SET NAMES 'utf8';
(works at least in MySQL and PostgreSQL)
Keep in mind you need to run that for every connection. If you're using a connection pool, do it with an event handler, e.g.:
mysqlPool.on('connection', function (connection) {
connection.query("SET NAMES 'utf8'")
});
https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration
https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7
https://www.npmjs.com/package/mysql#connection
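With the mysql driver you can also set the charset once in the connection options instead of issuing SET NAMES yourself (a sketch; the credentials are placeholders):
var mysql = require('mysql');
var pool = mysql.createPool({
  host: 'localhost',
  user: 'me',
  password: 'secret',
  database: 'mydb',
  charset: 'utf8mb4' // connection charset applied to every pooled connection
});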
Just add the XML declaration <?xml version="1.0" encoding="UTF-8"?> at the top and the document will be read as UTF-8. For instance, an RSS feed can then contain any character after adding it:
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
....
Also add <meta charset="utf-8" /> to your parent layout or main app.html:
<!DOCTYPE html>
<html lang="en" class="overflowhere">
<head>
<meta charset="utf-8" />
</head>
</html>

NodeJS Extended ASCII Support

I'm trying to read a file that contains extended ascii characters like 'á' or 'è', but NodeJS doesn't seem to recognize them.
I tried reading into:
Buffer
String
I also tried different encoding types:
ascii
base64
utf8
as referenced on http://nodejs.org/api/fs.html
Is there a way to make this work?
I use the binary type to read such files. For example:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like 'á' or 'è',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
When I save the above into foo.js and run it, the following output is shown:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like '⟡ 漀爀 ✀',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
The weirdness above is because I ran it in an Emacs buffer.
The file I was trying to read was in ANSI encoding, so when I tried to read it using the fs module's functions, Node couldn't convert the extended ASCII characters.
I just figured out that Notepad++ is able to actually convert from some formats to UTF-8, instead of just flagging the file with UTF-8 encoding.
After converting the file, I was able to read it just fine and apply all the operations I needed to the content.
Thank you for your answers!
I realize this is an old post, but I found it in my personal search for a solution to this particular problem.
I have written a module that provides Extended ASCII decoding and encoding support to/from Node Buffers. You can see the source code here. It is a part of my implementation of Buffer in the browser for an in-browser filesystem I've created called BrowserFS, but it can be used 100% independently of any of that within NodeJS (or Browserify) as it has no dependencies.
Just add bfs-buffer to your dependencies and do the following:
var ExtendedASCII = require('bfs-buffer/js/extended_ascii').default;
// Decodes the input buffer as an extended ASCII string.
ExtendedASCII.byte2str(aBufferWithText);
// Encodes the input string as an extended ASCII string.
ExtendedASCII.str2byte("Some text");
Alternatively, just adapt the module into your project if you don't want to add an extra dependency to your project. It's MIT Licensed.
I hope this helps anyone in the future who finds this page in their searches like I did. :)

escaped Ampersand in JSF i18n Resource Bundle

I have something like
<s:link view="/member/index.xhtml" value="My&#160;News" propagation="none"/>
<s:link view="/member/index.xhtml" value="#{msg.myText}" propagation="none"/>
where the value of myText in the messages.properties is
myText=My&#160;News
The first line of the example works fine and renders the text as "My News" (with a non-breaking space), but the second one, which takes its value from the resource bundle, escapes the ampersand too, producing the literal "My&#160;News".
I also tried using Unicode escape sequences for the ampersand and/or the hash, with My\u0026\u0023160;News, My\u0026#160;News and My\u0026nbsp;News in the properties file, without success.
(I used CSS no-wrap instead of the previously used XML encoding, but I would be interested in an answer anyway.)
EDIT - Answer to clarified question
The first value is obviously inline, so the interpreter knows that it is safe.
The second one comes from an external source (you are using Expression Language) and as such is not safe and needs to be escaped. The result of escaping is what you observed: it shows you the literal text of the HTML entity.
This is related to security (XSS, for example) and not necessarily to i18n.
Previous attempt
I don't quite know what you are asking for, but I believe it is "how do I display it?".
Most of the standard JSF controls have an escape attribute that, if set to false, won't escape the text. Unfortunately, it seems that you are using something like SeamTools, which does not have this attribute.
Well, in this case there is not much to be done. Unless you can use a standard control, maybe you should try to actually save your properties file as Unicode (UTF-16 Big-Endian, in fact) and simply put a literal Unicode non-breaking space character in it. Theoretically that should work; Unicode-encoded properties files are supported in recent versions of Java (although I cannot recall whether that arrived in Java SE 5 or Java SE 6)...
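To illustrate that last suggestion: .properties files accept \uXXXX escapes even in their default encoding, so you can embed the non-breaking space itself rather than an HTML entity, leaving nothing for JSF to escape (key name taken from the question):
# messages.properties
# \u00A0 is a literal NO-BREAK SPACE character, not markup,
# so the JSF escaper passes it through untouched.
myText=My\u00A0News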
